OrderedAppend: An optimization for range partitioning

With this feature, we’ve seen up to 100x performance improvements for certain queries.

In our previous post on implementing constraint exclusion, we discussed how TimescaleDB leverages PostgreSQL’s foundation and expands on its capabilities to improve performance. Continuing with the same theme, in this post we will discuss how we’ve added support for ordered appends, which optimize a wide range of queries, particularly those that are ordered by time.

We’ve seen performance improvements up to 100x for certain queries after applying this feature, so we encourage you to keep reading!

Optimizing Appends for large queries

PostgreSQL represents how plans should be executed using “nodes”. There are a variety of different nodes that may appear in an EXPLAIN output, but we want to focus specifically on Append nodes, which essentially combine the results from multiple sources into a single result.

PostgreSQL has two standard Append node types that you will commonly find in an EXPLAIN output:

  • Append: appends results of child nodes to return a unioned result
  • MergeAppend: merge output of child nodes by sort key; all child nodes must be sorted by that same sort key; accesses every chunk when used in TimescaleDB

When MergeAppend nodes are used with TimescaleDB, every chunk has to be accessed to determine whether it contains keys that need to be merged. This is inherently less efficient, since it requires touching every chunk.

To address this issue, with the release of TimescaleDB 1.2 we introduced OrderedAppend as an optimization for range partitioning. The purpose of this feature is to optimize a wide range of queries, particularly those that are ordered by time and contain a LIMIT clause. This optimization takes advantage of the fact that we know the range of time held in each chunk, and can stop accessing chunks once we’ve found enough rows to satisfy the LIMIT clause. As mentioned above, with this optimization we see performance improvements of up to 100x depending on the query.

With the release of TimescaleDB 1.4, we wanted to extend the cases in which OrderedAppend can be used. This meant making OrderedAppend space-partition aware, as well as removing the LIMIT clause restriction. With these additions, more users can take advantage of the performance gains OrderedAppend delivers.

(Additionally, the updates to OrderedAppend for space partitions will be leveraged even more heavily with the release of TimescaleDB clustering which is currently in private beta. Stay tuned for more information!)
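For readers less familiar with space partitioning in TimescaleDB, here is a minimal sketch of how a space dimension is added to a hypertable (the table and column names are illustrative, not taken from the examples below):

-- Illustrative setup: a hypertable partitioned by time,
-- with an additional space partition on device_id.
CREATE TABLE metrics_sp (
    time      timestamptz NOT NULL,
    device_id int         NOT NULL,
    value     double precision
);

SELECT create_hypertable('metrics_sp', 'time');

-- Add a space dimension so chunks are also split across device_id.
SELECT add_dimension('metrics_sp', 'device_id', number_partitions => 4);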

Developing query plans with the optimization

As an optimization for range partitioning, OrderedAppend eliminates sort steps because it is aware of the way data is partitioned.

Since each chunk covers a known time range, no global sort step is needed to produce sorted output: only local sort steps have to be completed, and their results appended in the correct order. If index scans are used, which already return their output sorted, sorting can be avoided entirely.
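The plans below were generated against a simple hypertable partitioned by time. As a point of reference, a minimal setup might look roughly like this (a sketch with illustrative column types and chunk interval; the exact schema behind these plans is not shown here):

-- Assumed setup for the example plans (illustrative).
CREATE TABLE metrics (
    time  timestamptz NOT NULL,
    value double precision
);

-- Partition by time; create_hypertable also creates the default
-- time index (metrics_time_idx) that the plans below scan.
SELECT create_hypertable('metrics', 'time', chunk_time_interval => interval '1 day');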

For a query ordering by the time dimension with a LIMIT clause, you would normally get a plan like this:

dev=# EXPLAIN (ANALYZE,COSTS OFF,BUFFERS,TIMING OFF,SUMMARY OFF)
dev-# SELECT * FROM metrics ORDER BY time LIMIT 1;
                                                 QUERY PLAN
------------------------------------------------------------------------------------------------------------
 Limit (actual rows=1 loops=1)
   Buffers: shared hit=16
   ->  Merge Append (actual rows=1 loops=1)
         Sort Key: metrics."time"
         Buffers: shared hit=16
         ->  Index Scan using metrics_time_idx on metrics (actual rows=0 loops=1)
               Buffers: shared hit=1
         ->  Index Scan using _hyper_1_1_chunk_metrics_time_idx on _hyper_1_1_chunk (actual rows=1 loops=1)
               Buffers: shared hit=3
         ->  Index Scan using _hyper_1_2_chunk_metrics_time_idx on _hyper_1_2_chunk (actual rows=1 loops=1)
               Buffers: shared hit=3
         ->  Index Scan using _hyper_1_3_chunk_metrics_time_idx on _hyper_1_3_chunk (actual rows=1 loops=1)
               Buffers: shared hit=3
         ->  Index Scan using _hyper_1_4_chunk_metrics_time_idx on _hyper_1_4_chunk (actual rows=1 loops=1)
               Buffers: shared hit=3
         ->  Index Scan using _hyper_1_5_chunk_metrics_time_idx on _hyper_1_5_chunk (actual rows=1 loops=1)
               Buffers: shared hit=3

You can see that 3 pages are read from every chunk, plus an additional page from the parent table, which contains no actual rows.

With this optimization enabled, you would instead get a plan like this:

dev=# EXPLAIN (ANALYZE,COSTS OFF,BUFFERS,TIMING OFF,SUMMARY OFF)
dev-# SELECT * FROM metrics ORDER BY time LIMIT 1;
                                                 QUERY PLAN
------------------------------------------------------------------------------------------------------------
 Limit (actual rows=1 loops=1)
   Buffers: shared hit=3
   ->  Custom Scan (ChunkAppend) on metrics (actual rows=1 loops=1)
         Order: metrics."time"
         Buffers: shared hit=3
         ->  Index Scan using _hyper_1_1_chunk_metrics_time_idx on _hyper_1_1_chunk (actual rows=1 loops=1)
               Buffers: shared hit=3
         ->  Index Scan using _hyper_1_2_chunk_metrics_time_idx on _hyper_1_2_chunk (never executed)
         ->  Index Scan using _hyper_1_3_chunk_metrics_time_idx on _hyper_1_3_chunk (never executed)
         ->  Index Scan using _hyper_1_4_chunk_metrics_time_idx on _hyper_1_4_chunk (never executed)
         ->  Index Scan using _hyper_1_5_chunk_metrics_time_idx on _hyper_1_5_chunk (never executed)

After the first chunk produces a row, the remaining chunks are never executed, and only 3 pages have to be read to complete the query. TimescaleDB also removes the parent table from plans like this, because we know it does not contain any data.

MergeAppend vs. ChunkAppend

The main difference between these two examples is the type of Append node we used. In the first case, a MergeAppend node is used. In the second case, we used a ChunkAppend node (also introduced in 1.4) which is a TimescaleDB custom node that works similarly to the PostgreSQL Append node, but contains additional optimizations.

The MergeAppend node implements the global sort and requires locally sorted input from every child, all sorted by the same sort key. To produce one tuple, the MergeAppend node has to read one tuple from every chunk to decide which one to return.

For the very simple example query above, you will see 16 pages read (with MergeAppend) vs. 3 pages (with ChunkAppend), roughly a 5x improvement over the unoptimized case (ignoring the single page from the parent table), which corresponds to the number of chunks present in that hypertable. So for a hypertable with 100 chunks, roughly 100 times fewer pages would have to be read to produce the result for the query.
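If you want to reproduce this comparison on your own hypertables, the relevant optimizations can be toggled per session. A hedged sketch (we believe these are the setting names in recent versions, but double-check the docs for your release):

-- Turn the optimizations off to see the MergeAppend plan...
SET timescaledb.enable_chunk_append = off;
SET timescaledb.enable_ordered_append = off;

EXPLAIN (ANALYZE, COSTS OFF, BUFFERS, TIMING OFF, SUMMARY OFF)
SELECT * FROM metrics ORDER BY time LIMIT 1;

-- ...and turn them back on to get the ChunkAppend plan.
SET timescaledb.enable_chunk_append = on;
SET timescaledb.enable_ordered_append = on;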

As you can see, you gain the most benefit from OrderedAppend with a LIMIT clause, since older chunks don’t have to be touched if the required results can be satisfied from more recent chunks. This type of query is very common in time-series workloads (e.g., getting the last reading from a sensor; see the sketch below). However, even for queries without a LIMIT clause, this feature is beneficial because it eliminates the sorting of data.
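As an illustration, a “last reading from a sensor” query might look like this (using the illustrative metrics_sp table from earlier and a hypothetical device_id value):

-- Hypothetical latest-reading query: with OrderedAppend, chunks are
-- walked from newest to oldest and scanning stops once the LIMIT is met.
SELECT time, value
FROM metrics_sp
WHERE device_id = 17
ORDER BY time DESC
LIMIT 1;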

Next steps

If you are interested in using OrderedAppend, make sure you have TimescaleDB 1.2 or higher installed (installation guide). However, we always recommend upgrading to the most recent version of the software (at the time of publishing this post, it’s TimescaleDB 1.4).
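To check which extension version is currently installed, and to update it in place after installing a newer release, something along these lines should work (a sketch; see the upgrade documentation for version-specific steps):

-- Show the installed TimescaleDB version.
SELECT extversion FROM pg_extension WHERE extname = 'timescaledb';

-- Update the extension after installing the newer packages
-- (run this in a fresh session).
ALTER EXTENSION timescaledb UPDATE;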

If you are brand new to TimescaleDB, get started here. Have questions? Join our Slack channel or leave them in the comments section below.