Internal sorted dictionary¶
Traces uses an internal SortedDict class (in traces.sorted_dict)
to store the time-value pairs in a TimeSeries. This is a lightweight
implementation built on Python’s dict and bisect module, with no
external dependencies.
Previously, traces used sortedcontainers.SortedDict for this
purpose. The sortedcontainers
library is an excellent, well-tested, pure-Python library for sorted
data structures.
Design¶
The built-in SortedDict maintains two data structures:
A Python
dictfor O(1) key-value lookupsA sorted
listof keys for ordered iteration and binary search via thebisectmodule
This design has different performance characteristics than
sortedcontainers.SortedDict, which uses a B-tree-like structure
(a list of sublists) that provides more balanced performance across
operations.
Performance comparison¶
Benchmarks were run using the full TimeSeries API (not just raw
data structure operations), with data sizes from N=100 to N=100,000.
The ratio column shows sortedcontainers time divided by built-in
time, so values greater than 1x indicate the built-in is faster.
Where the built-in SortedDict is faster:
Random lookups: 1.6–3.3x faster across all sizes. The built-in uses O(1) dict lookup when the key exists and falls back to
bisectotherwise. This advantage grows with N.Iterating periods and computing distributions: 1.4–3.7x faster across all sizes. These operations iterate over sorted keys and look up values, benefiting from the same dict + bisect combination.
Set interval: 3.3–4.5x faster across all sizes. The built-in uses a single list-slice deletion to remove a range of keys, while the
sortedcontainerspath deletes keys individually.Building incrementally at small N (under ~1,000): 1.3–1.5x faster due to lower per-operation overhead.
Where sortedcontainers.SortedDict is faster:
Building incrementally at large N: At N=10,000,
sortedcontainersis 1.6x faster; at N=100,000 it is 10x faster. The built-in usesbisect.insort, which has O(n) cost per insertion due to list element shifting.sortedcontainersuses a B-tree-like structure with O(log n) amortized insertions. This only applies to individual insertions (ts[time] = valuein a loop); usingset_many()avoids this penalty entirely.
Where they are comparable:
Building from data and building with set_many: Within a few percent of each other at all sizes. Both paths use bulk insertion (
dict.update+sorted) rather than per-element insertion.
Bulk insertion with set_many¶
TimeSeries.set_many(data) accepts an iterable of (time, value)
pairs or a dictionary, and inserts all points in a single operation.
This avoids the per-element bisect.insort cost that makes
individual insertions slow at large N:
# Instead of this:
for time, value in data:
ts[time] = value
# Use this:
ts.set_many(data)
set_many accepts generators, so data does not need to be
materialized into a list first. from_csv and from_json use
set_many internally.
Choosing an implementation¶
The built-in SortedDict is a good default for most use cases.
Typical time series workloads involve loading data in bulk (from files,
databases, or lists of measurements) and then performing read
operations like lookups, distribution(), or iterperiods().
The built-in is faster for all of these read operations.
sortedcontainers.SortedDict is a better choice if your workload
involves building very large time series incrementally through many
individual, random-order insertions (e.g., more than 10,000 points
added one at a time) and you cannot use set_many() to batch them.
Reproducing the benchmarks¶
You can reproduce these benchmarks by running:
python experiments/bench_sorted_dict.py
