Merge implementation strategies¶
Merging multiple TimeSeries is one of the most important operations in traces. Given K TimeSeries with N total measurement points across all of them, the merge produces a single TimeSeries where each value is a list of all the input values at that time (or a function of all the values if operation is given).
As a concrete example, consider two TimeSeries tracking whether a light is on (1) or off (0):
>>> a = TimeSeries(default=0)
>>> a[1] = 1
>>> a[3] = 0
>>> b = TimeSeries(default=0)
>>> b[2] = 1
>>> b[4] = 0
Without an operation, merge produces a TimeSeries where each value is a list of the states of all inputs at that time:
>>> merged = TimeSeries.merge([a, b])
>>> for t, v in merged:
... print(t, v)
1 [1, 0]
2 [1, 1]
3 [0, 1]
4 [0, 0]
With operation=sum, the list is reduced to a single value — here,
the count of lights that are on:
>>> count = TimeSeries.merge([a, b], operation=sum)
>>> for t, v in count:
... print(t, v)
1 1
2 2
3 1
4 0
The core challenge is iterating through all transitions in time order while tracking the current state of each input series. There are several ways to do this, each with different performance characteristics. This page describes three approaches and their tradeoffs.
This is useful context for reimplementing merge in another language — the right choice depends on the expected workload.
The three approaches¶
1. Naive (collect-keys)¶
The simplest approach: collect all unique measurement times across all K series, then at each time, query every series for its value.
def naive_iter_merge(timeseries_list):
# Collect all unique measurement times
all_times = set()
for ts in timeseries_list:
for t, _v in ts:
all_times.add(t)
# At each time, query every series
for t in sorted(all_times):
yield t, [ts[t] for ts in timeseries_list]
How it works: Build a set of all times (O(N)), sort them (O(N log N)), then for each of the T unique times, look up the value in each of the K series via binary search (O(log N) per lookup).
Complexity: O(N log N + T × K × log N) time, O(T × K) memory for the output state lists.
Pros:
Dead simple to implement and understand.
No auxiliary data structures beyond a set and a sorted list.
Easy to port to any language — uses only sets, sorted arrays, and binary search.
Cons:
Performs K binary searches at every unique time, even when only one series changed. This is the dominant cost: for K=1000 with 2 transitions each, there are ~2000 unique times and each requires 1000 lookups — 2 million binary searches total.
Scales poorly with K. At K=1000, this approach is 40× slower than the alternatives (see benchmarks below).
When to use it: For small K (under ~20 series) where simplicity matters more than performance, or as a reference implementation for testing.
This was the first approach used by the traces JavaScript implementation:
function merge(traceList, operation = (d) => d) {
let keys = new Set(traceList.map((d) => d.list).flat());
let result = new Trace([], operation(traceList.map((d) => d.defaultValue)));
for (let k of keys) {
result.set(k, operation(traceList.map((d) => d.get(k))));
}
return result;
}
2. Priority queue (heap merge)¶
Use a min-heap to perform a K-way merge of the pre-sorted input series. Each series contributes an iterator; the heap always contains the next upcoming transition from each active series.
import heapq
def heapq_iter_merge_transitions(timeseries_list):
heap = []
for index, ts in enumerate(timeseries_list):
iterator = iter(ts)
try:
t, value = next(iterator)
except StopIteration:
pass
else:
heapq.heappush(heap, (t, index, value, iterator))
state = [ts.default for ts in timeseries_list]
while heap:
t, index, next_value, iterator = heapq.heappop(heap)
previous_value = state[index]
state[index] = next_value
yield t, index, previous_value, next_value
try:
t, value = next(iterator)
except StopIteration:
pass
else:
heapq.heappush(heap, (t, index, value, iterator))
How it works: The heap holds at most K entries (one per series).
Each heappop yields the next transition in time order. After
processing a transition, the next value from that series is pushed
back onto the heap.
Complexity: O(N log K) time (each of N transitions requires one heap push and one heap pop, each O(log K)). O(K) memory for the heap plus K iterator objects.
Pros:
Only processes actual transitions, never queries a series that didn’t change.
Streaming: produces transitions lazily without materializing all data up front.
Cons:
Each heap operation has Python function call overhead (
heapq.heappushandheapq.heappopare C functions, but each call crosses the Python/C boundary).Maintains K live iterator objects in memory for the duration of the merge.
For small K, the log K advantage over flat sort’s log N is negligible, and the per-element overhead is higher.
When to use it: When N per series is very large (tens of thousands or more) and K is small — the heap avoids materializing all N triples and uses significantly less memory (see benchmarks). Also the right choice when the input series are streams (e.g. reading from files or network sources) rather than in-memory data, since it produces transitions lazily without buffering.
Note on thread-safe wrappers: Python’s queue.PriorityQueue wraps
heapq with a threading lock, adding acquire/release overhead on every
push and pop. For single-threaded merge, use heapq directly.
3. Flat sort (extract and sort)¶
Extract all (time, index, value) triples from every series into a single list, sort it once, then iterate linearly.
def flat_sort_iter_merge_transitions(timeseries_list):
triples = []
for index, ts in enumerate(timeseries_list):
for t, v in ts:
triples.append((t, index, v))
triples.sort()
state = [ts.default for ts in timeseries_list]
for t, index, next_value in triples:
previous_value = state[index]
state[index] = next_value
yield t, index, previous_value, next_value
How it works: Flattening all transitions into a single list and sorting
turns the K-way merge into a standard sort problem. The subsequent
iteration is a simple linear scan with no function call overhead per
element (no heappush/heappop).
Complexity: O(N log N) time for the sort, O(N) memory for the triples list. The log N factor is theoretically worse than the priority queue’s log K, but in practice timsort’s lower constant factor wins for most workloads.
Pros:
Fastest in practice for in-memory data (see benchmarks).
No iterator objects or heap — just a flat list of tuples.
Python’s timsort exploits existing order within each series’ contribution, performing well on the partially-sorted input.
Cons:
Materializes all transitions at once. Not suitable for streaming inputs that don’t fit in memory.
O(N) memory for the triples list (though this is typically small compared to the TimeSeries objects themselves).
When to use it: For in-memory data, which is the common case. This is the current default in traces.
This is what traces uses as of version 0.7.
Performance characteristics are discussed below and benchmarks can be
reproduced via python experiments/bench_merge_strategies.py.
Benchmarks¶
All benchmarks on a 2021 Apple M1 MacBook Pro, Python 3.12, minimum of
3 runs. Series values are integers; operation is sum.
Many series, few transitions per series¶
This is the “light switch” or “ticket open/close” pattern: many boolean or categorical series with only a few state changes each. Each series has 2 transitions, so N (total transitions) = K × 2.
These benchmarks call merge(operation=sum), which goes through
iter_merge and copies the full K-element state list at each yield.
K |
N total |
Naive |
Priority queue |
Flat sort |
|---|---|---|---|---|
10 |
20 |
0.1 ms / 4 KB |
0.1 ms / 5 KB |
0.1 ms / 4 KB |
50 |
100 |
1.8 ms / 14 KB |
0.3 ms / 18 KB |
0.3 ms / 13 KB |
100 |
200 |
6.6 ms / 19 KB |
0.7 ms / 29 KB |
0.6 ms / 20 KB |
500 |
1,000 |
150 ms / 81 KB |
6.0 ms / 172 KB |
5.1 ms / 119 KB |
1,000 |
2,000 |
588 ms / 234 KB |
17.7 ms / 300 KB |
15.3 ms / 213 KB |
At K=1000, the naive approach is 38× slower than flat sort. The naive approach’s cost is dominated by performing K binary searches at each of the ~2000 unique times (2 million lookups). The priority queue and flat sort appear close here, with flat sort ~15% faster — but this understates the difference (see “Consuming transitions directly” below for why).
Few series, many transitions per series¶
These benchmarks also call merge(operation=sum). N/series is
the number of transitions per series; N total = K × N/series.
K |
N/series |
N total |
Naive |
Priority queue |
Flat sort |
|---|---|---|---|---|---|
2 |
100 |
200 |
0.5 ms / 25 KB |
0.4 ms / 18 KB |
0.4 ms / 18 KB |
2 |
1,000 |
2,000 |
8.6 ms / 263 KB |
6.2 ms / 138 KB |
5.9 ms / 136 KB |
5 |
100 |
500 |
2.2 ms / 67 KB |
1.3 ms / 36 KB |
1.3 ms / 35 KB |
5 |
1,000 |
5,000 |
35 ms / 790 KB |
16.9 ms / 448 KB |
16.4 ms / 468 KB |
With small K, all three approaches are closer in speed. The naive approach is still 1.5–2× slower due to the redundant binary searches. The priority queue and flat sort are nearly identical here — with K=5, the heap holds only 5 entries and log K is tiny.
iter_merge memory (full state lists)¶
iter_merge yields (time, list_of_all_values) at each unique time.
Regardless of strategy, this requires copying a K-element list at each
yield. For large K, this copy dominates memory. Each series has 2
transitions (N total = K × 2).
K |
N total |
Naive |
Priority queue |
Flat sort |
|---|---|---|---|---|
100 |
200 |
6.4 ms / 192 KB |
0.5 ms / 174 KB |
0.3 ms / 174 KB |
500 |
1,000 |
150 ms / 4.1 MB |
3.2 ms / 3.9 MB |
3.1 ms / 4.0 MB |
1,000 |
2,000 |
588 ms / 17.0 MB |
9.2 ms / 15.4 MB |
10.9 ms / 15.4 MB |
The priority queue and flat sort look nearly identical here. This is because the K-element list copies at each yield dominate both time and memory, swamping the sort-vs-heap difference.
Consuming transitions directly (many series)¶
The flat sort advantage becomes clear when transitions are consumed
directly — without the O(K) list copy overhead. This is the pattern
used by count_by_value and any consumer that maintains its own
incremental state. Each series has 2 transitions (N total = K × 2).
K |
N total |
Priority queue |
Flat sort |
|---|---|---|---|
100 |
200 |
0.3 ms / 12 KB |
0.1 ms / 4 KB |
500 |
1,000 |
1.7 ms / 82 KB |
0.8 ms / 24 KB |
1,000 |
2,000 |
3.5 ms / 175 KB |
1.7 ms / 53 KB |
5,000 |
10,000 |
25 ms / 1.8 MB |
13 ms / 0.9 MB |
Flat sort is 2× faster and uses 3× less memory than the
priority queue when consuming transitions directly with many series.
The speed difference comes from timsort’s cache-friendliness and lower
per-element overhead (no heappush/heappop calls per transition).
The memory difference comes from eliminating K iterator objects and
the 4-tuple heap entries (which include iterator references) in favor
of a flat list of 3-tuples.
This matters for count_by_value: at K=10,000 with 2 transitions per
series, count_by_value runs in 0.16s with flat sort vs 0.22s with
a priority queue — a 27% improvement that comes entirely from the
underlying sort strategy.
Consuming transitions directly (few series, very many transitions)¶
With few series and very many transitions per series (small K, large N), the priority queue’s O(N log K) advantage over flat sort’s O(N log N) should theoretically win. Does it?
K |
N/series |
N total |
Priority queue |
Flat sort |
|---|---|---|---|---|
2 |
10,000 |
20,000 |
14 ms / 1.3 MB |
14 ms / 1.9 MB |
2 |
50,000 |
100,000 |
73 ms / 6.1 MB |
81 ms / 9.8 MB |
2 |
100,000 |
200,000 |
149 ms / 12.1 MB |
161 ms / 19.7 MB |
2 |
500,000 |
1,000,000 |
765 ms / 61.2 MB |
833 ms / 99.6 MB |
3 |
50,000 |
150,000 |
86 ms / 9.2 MB |
121 ms / 13.3 MB |
At K=2 with 1M total transitions, the priority queue is ~10% faster in time. But the real win is memory: the heap uses 61 MB vs flat sort’s 100 MB. This is because flat sort materializes all N triples at once, while the heap only holds K=2 entries at any time. The memory ratio approaches N/K as N grows.
At K=3 with 150k transitions, the heap is ~30% faster — the log K = 1.6 vs log N = 17.2 ratio starts to matter when N is large enough.
The takeaway: the priority queue wins on memory when N is large relative to K, because it doesn’t materialize the full triples list. The speed advantage is modest (10–30%) and only appears at very high N. For the common traces use case — many entities with few state changes each (large K, small N per series) — flat sort is faster and uses less memory.
The lesson: the choice between priority queue and flat sort matters
most when the consumer processes transitions directly. When the
consumer goes through iter_merge (which copies K-element lists),
the copy overhead dominates and the sort strategy barely matters.
Choosing an approach¶
Use flat sort (the traces default) when the data is in memory. It’s the fastest approach for the common case and has no complex data structures to implement — just a list, a sort, and a linear scan.
Use a priority queue when N per series is very large (tens of thousands+) and K is small — the heap avoids materializing all N triples in memory (61 MB vs 100 MB at K=2, N=1M). Also the right choice when inputs are streams (data arriving over time, reading from files) since it produces transitions lazily without buffering.
Use the naive approach for simplicity when K is small (under ~20) and correctness matters more than performance. It’s the easiest to implement, test, and reason about. It’s also a good reference implementation for validating the other approaches.
Avoiding O(K) copies: transitions vs. full state¶
All three approaches above can yield either full state lists (what
iter_merge does) or individual transitions (what
iter_merge_transitions does). The choice affects downstream
consumers:
Full state lists are convenient when the consumer needs the complete picture at each time (e.g., applying an arbitrary operation like
sumover all K values). The cost is O(K) per yield for the list copy.Transitions yield O(1) per event and are better when the consumer can maintain its own state incrementally.
count_by_valueuses this: it keeps a{value: count}dict and adjusts two entries per transition (decrement the old value’s count, increment the new value’s count).
For K=1000 with 2 transitions each, iter_merge copies 2000 lists of
1000 elements (2M element copies, ~15 MB). iter_merge_transitions
yields 2000 individual tuples (~0.2 MB for the triples list). The
algorithmic improvement is independent of which sort strategy is used.
Reproducing the benchmarks¶
You can reproduce these benchmarks by running:
python experiments/bench_merge_strategies.py
