How to proactively manage long-term data storage with downsampling

Eliminate the need to store billions of rows with TimescaleDB’s upgraded continuous aggregates

If you live in the world of monitoring and observability, you know that understanding system and application health and performance requires collecting time-series data at massive scale and in as close to real time as possible. Collecting data at near real-time granularity generates very large data sets that are difficult and costly to manage, and that can create significant database performance problems.

Many monitoring tools address these issues by trading long-term analytical value for reduced storage cost and improved performance. By applying aggressive default data retention policies and simply purging “historical” data from the system, these tools reduce storage cost and protect performance, but they also eliminate the ability to extract long-term analytical value from the data.

With TimescaleDB, you don't have to sacrifice long-term analytics of your monitoring data to reduce storage costs and improve performance. Instead, TimescaleDB allows you to reduce the granularity of your data by aggregating it into coarser periods of time and dropping the finer-grained real-time data, while maintaining data accuracy. This reduces costs by shrinking your storage footprint while preserving the ability to run analytical queries over your monitoring data across longer time horizons.

In this post, we'll cover what downsampling is and show you how to combine data retention policies and continuous aggregates to save money, while maintaining the analytical power of your data.

We'll use a hypothetical example to demonstrate how to reduce storage needs from 260,000 rows/day to nine, keeping only the summaries of time-series data, instead of the full-fidelity dataset.

What is downsampling?

Historical data provides baseline performance context for your applications: by measuring “normal” performance, you can identify and predict anomalous behavior. To get the most out of your historical data, you can leverage downsampling to avoid trading historical data value against storage cost and management overhead.

Downsampling is the act of applying a mathematical aggregation function (e.g., AVG()) to roll a very granular time-series data set (e.g., 3-second intervals) up into a coarser-grained one (1-hour, 5-hour, or daily averages, for example). As a result, your data can take on a new role: analytics.
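In SQL terms, downsampling boils down to grouping rows into time buckets and aggregating within each bucket. Here is a minimal sketch of the idea, assuming a hypothetical metrics table with a time column and a value column (the names are illustrative only):

-- Roll 3-second samples up to hourly averages.
-- "metrics", "time", and "value" are hypothetical names for this sketch.
SELECT time_bucket('1 hour', time) AS bucket,
       AVG(value) AS avg_value
  FROM metrics
 GROUP BY bucket
 ORDER BY bucket;

A plain query like this recomputes the aggregate every time it runs; the continuous aggregates we use below maintain the rolled-up results for you.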

Let's make this concrete: assume we are monitoring a single machine instance, and for the purposes of this exercise we are only interested in CPU usage metrics. We are monitoring an 8-core instance, measuring usage in the system and user spaces and collecting data every 3 seconds.
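For reference, a table like the one below would back this scenario. This is only a sketch, with column names chosen to mirror the queries later in this post; your collector may create a slightly different schema:

-- Sketch of the raw metrics table; column names are assumptions
-- that mirror the queries used later in this post.
CREATE TABLE cpu (
  time          TIMESTAMPTZ      NOT NULL,
  cpu           TEXT             NOT NULL,  -- 'cpu0' .. 'cpu7', plus a total
  usage_system  DOUBLE PRECISION,
  usage_user    DOUBLE PRECISION
);

-- Turn the table into a TimescaleDB hypertable, partitioned by time.
SELECT create_hypertable('cpu', 'time');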

Here is what something like this will look like on a monitoring dashboard:

In order to understand what is happening in real time, we need to collect this data at high velocity and high frequency (hint: exactly what TimescaleDB is built for).

However, as this data ages, its purpose changes. When data is new, individual data points are important for debugging and real-time feedback. As data ages, individual data points often matter less than statistical analysis over large amounts of data.

If we downsample historical data, it will still help us spot trends, set baselines for what we consider “normal”, and allow us to be more accurate in our predictions around future behavior. At the same time, downsampling will reduce storage volumes. Let's take a look at how we can make this happen.

The downsampling process: a brief tutorial

First, we need to decide what data to downsample. Let's assume the data in its original format is collected every three seconds, and that we want coarser time series for analysis purposes (for example, 1-hour, 5-hour, and daily averages).

We can do this with continuous aggregates and TimescaleDB's time_bucket and AVG() functions, rolling the very granular 3-second intervals up into views that offer the data at 1-hour, 5-hour, and daily intervals.

To help manage storage costs, we are also going to use a TimescaleDB data retention policy to remove the raw data after a five-day window. Let's walk through these steps.

#1 Create a continuous aggregate

In this case, we will create a continuous aggregate with a daily average of CPU usage in both the user and system spaces, rolling the 3-second data up to a daily average for these metrics. This is what it will look like:

-- Continuous aggregate: daily average CPU usage per core,
-- rolled up from the raw 3-second samples in the cpu hypertable.
CREATE VIEW cpu_daily_rollups
WITH (timescaledb.continuous,
    timescaledb.ignore_invalidation_older_than = '5d', -- ignore changes to data older than 5 days
    timescaledb.refresh_lag = '-30m',                  -- negative lag materializes more recent buckets
    timescaledb.refresh_interval = '1d')               -- refresh the view once per day
AS
  SELECT time_bucket('1d', time), cpu,
         AVG(usage_system) AS system_usage,
         AVG(usage_user) AS user_usage
    FROM cpu
    GROUP BY time_bucket('1d', time), cpu;

As you can see, we are building a continuous aggregate that produces our daily averages using the time_bucket function (for more information on building continuous aggregates, click here), and we are refreshing this view once per day.
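If you are testing this out and don't want to wait for the scheduled refresh, TimescaleDB 1.x also lets you materialize a continuous aggregate on demand (optional, and not required for the setup above):

-- Optional: force an immediate materialization instead of waiting
-- for the next scheduled refresh.
REFRESH MATERIALIZED VIEW cpu_daily_rollups;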

The continuous aggregate job we created above will take the average of the utilization across the entire 24-hour period. Rather than needing to keep 259,200 rows per day (which I needed when I was monitoring this in real time), I can simply keep 9 entries to represent daily CPU usage (one per core, plus a total), which will look like this:

SELECT * FROM cpu_daily_rollups;

Now I can simply repeat this process to create continuous aggregates for 1-hour and 5-hour windows, and in this use case I will have everything I need for long-term analysis.
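For example, the 1-hour rollup is nearly identical to the daily one; only the bucket width, the name, and (if you like) the refresh settings change, and the 5-hour view follows the same pattern. The refresh settings below are illustrative, so tune them to your needs:

-- Same pattern as the daily rollup, but with a 1-hour bucket.
CREATE VIEW cpu_hourly_rollups
WITH (timescaledb.continuous,
    timescaledb.ignore_invalidation_older_than = '5d',
    timescaledb.refresh_lag = '-30m',
    timescaledb.refresh_interval = '1h')
AS
  SELECT time_bucket('1h', time) AS bucket, cpu,
         AVG(usage_system) AS system_usage,
         AVG(usage_user) AS user_usage
    FROM cpu
    GROUP BY time_bucket('1h', time), cpu;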

#2 Add a data retention policy

The second part of this exercise is to reclaim the space taken up by the underlying 3-second data points.

As I mentioned earlier, we are storing a little more than 259K rows per day per monitored machine in this case. That data has served its purpose for real-time monitoring and has been converted into a less granular form that is more appropriate for long-term analysis (see above). The next step is to set up a policy that will delete the finer-granularity data we originally collected.

In this case, we will use a TimescaleDB data retention policy:

-- Drop raw chunks older than 5 days; cascade_to_materializations => FALSE
-- keeps the data in the continuous aggregates intact.
SELECT add_drop_chunks_policy('cpu', INTERVAL '5 days', cascade_to_materializations => FALSE);

Here we are dropping the underlying data after 5 days. While the granular 3-second records are removed, we maintain our continuous aggregate views of that data.
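If you want a quick sanity check, TimescaleDB 1.x also lets you inspect what is scheduled and preview which chunks are old enough to be dropped. The calls below are a sketch based on the 1.x informational views and the show_chunks function:

-- List the drop_chunks policies that are currently scheduled.
SELECT * FROM timescaledb_information.drop_chunks_policies;

-- Preview the chunks of the cpu hypertable older than 5 days,
-- i.e., the ones the policy will remove on its next run.
SELECT show_chunks('cpu', older_than => INTERVAL '5 days');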

#3 Perform analytics

Now that we have created the needed view and downsampled our data, we can start the process of running analytics on that data.

To illustrate, I've connected an Excel sheet to my TimescaleDB instance and set up a basic pivot table that plots the CPU usage view we set up in Step 1 (i.e., our continuous aggregate that rolls up our data from 3-second intervals to hourly averages).
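You can, of course, run the same analysis with plain SQL against the views directly. As a simple illustration, here is the kind of query that pivot table sits on top of, assuming the hourly view sketched earlier (cpu_hourly_rollups with a bucket column):

-- Hourly average CPU usage for the last 30 days, one row per core per hour.
SELECT bucket, cpu, system_usage, user_usage
  FROM cpu_hourly_rollups
 WHERE bucket > NOW() - INTERVAL '30 days'
 ORDER BY bucket, cpu;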

Recap & next steps

In this post, we've covered an overview of downsampling and how (and why) to leverage it for IT monitoring use cases. Of course, you can apply continuous aggregates and data retention policies to a variety of other scenarios. If you are interested in learning more about how continuous aggregates work and want to see whether they are a fit for you, read this blog: “Continuous aggregates: faster queries with automatically maintained materialized views”.

If you are ready to start downsampling, we encourage you to check out our documentation.

Note: the ability to enable true downsampling described in this post is included in the recent release of TimescaleDB 1.6. If you are interested in staying up-to-date with all of our releases, sign up for our Release Notes.