Introducing #TSDBTipTuesday!

The theme of this month is using SQL for time-series data.

We are excited to announce #TSDBTipTuesday, a new weekly series where we’ll share useful tips for working with time-series data! Every Tuesday, we will post a short tip on our social media channels (Twitter, LinkedIn, Facebook) and update this blog post with more information/technical guidance. This month’s theme is “how and why to use SQL for time-series data.”

SQL is a widely known, popular, and expressive querying language. It’s also easy for organizations to adopt and integrate with other tools. For these reasons (and many more), we believe SQL is the best language for working with (and getting the most value from) your time-series data.

Tip #1: JOIN time-series data with relational data

If you already work with time-series data, you may frequently run complex queries that go beyond standard SELECT statements and WHERE clauses. For example, you might want to combine data or rows from two or more tables based on a common field -- and this is where JOIN comes in.

Let’s say you have 1,000 devices out in the wild collecting environmental data at various intervals. Your incoming data may look something like this:

timestamp            device_id  cpu_1m_avg  free_mem  temperature  location_id  dev_type
2020-01-01 01:02:00  abc123     80          500MB     75           42           field
2020-01-01 01:02:23  def456     90          400MB     64           42           roof
2020-01-01 01:02:30  ghi789     120         0MB       56           88           roof
2020-01-01 01:03:12  abc123     80          500MB     75           42           field
2020-01-01 01:03:12  def456     90          400MB     64           42           roof
2020-01-01 01:03:42  ghi789     120         0MB       56           88           roof

Now, let’s say you have additional metadata for each location that you can use to map the location_id to the metadata for that location. That table looks like this:

location_id  name                    latitude    longitude   zip_code  region
42           Grand Central Terminal  40.7527° N  73.9772° W  10017     NYC
88           Lobby 8                 42.3593° N  71.0935° W  02139     Massachusetts

By joining the two tables, you could ask questions like “what is the average free_mem of our devices in zip_code 10017?” at query time.
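As a sketch of what that query could look like (assuming the readings live in a hypothetical conditions table and the metadata in a locations table, with the column names shown above):

```sql
-- Average free memory for devices in zip code 10017,
-- joining measurements to their location metadata.
-- Table names (conditions, locations) are assumptions for illustration.
SELECT l.zip_code, AVG(c.free_mem) AS avg_free_mem
FROM conditions c
JOIN locations l ON c.location_id = l.location_id
WHERE l.zip_code = '10017'
GROUP BY l.zip_code;
```

Because the metadata lives in its own table, the join resolves the mapping at query time instead of duplicating it on every row.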

Additionally, JOINs allow you to store metadata independently and update mappings more easily. For example, you could update your "region" for location_id 88 (e.g., from "Massachusetts" to "Boston") without going back and overwriting historical data.
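With a hypothetical locations table like the one above, that update is a single statement:

```sql
-- Rename the region for location 88; historical measurement rows
-- are untouched because they only store location_id.
UPDATE locations
SET region = 'Boston'
WHERE location_id = 88;
```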

To answer these types of questions without joins, you would need to denormalize your data and store all metadata with each measurement row. This creates data bloat, and makes time-series data management more difficult.

This is just one example of how to use JOINs to query your time-series data alongside your metadata. There are many similar and more complex JOIN operations you can run to answer all types of queries (e.g., marrying relational data with geospatial data with time-series data for unified data management).

Tip #2: Use SQL schemas for time-series data modeling

There are several reasons why you may want to use a database schema, but the one we are going to focus on here is that schemas allow you to organize database objects into logical groups, making them more manageable.

If you are managing time-series workloads, SQL schemas help you collect cleaner data (e.g., by properly handling data that arrives out of order) and give you flexibility when it comes to data management.

The first step to developing a schema design is answering the question: what type of queries will I be making against this data? What are the most common ones? We find that queries often follow the 80-20 rule: a small subset of queries make up the vast majority of the work the database has to do. Focus on optimizing those and making the other ones possible, if not necessarily as performant.

From there, you can set up the appropriate indexes and table schema for your workload. If you do this correctly, you can see significant performance improvements.

Using the sensor data from above, let’s say we want to answer questions around the temperature of each device during a certain time. Given that information, a sample schema design might look something like this:

CREATE TABLE conditions (
	time		TIMESTAMPTZ NOT NULL,
	device_id	TEXT REFERENCES devices(id),
	location_id	INTEGER REFERENCES locations(id),
	cpu_1m_avg	FLOAT NOT NULL,
	free_mem	BIGINT NOT NULL,
	temperature	FLOAT NOT NULL,
	PRIMARY KEY (device_id, time)
);

For this schema, we added a few constraints so we only collect known device IDs and non-null readings. The primary key on device_id and time guarantees unique time values for each device. The schema above also references our locations metadata table via a foreign key constraint. We’ve normalized dev_type into a separate devices table (its dev_type values appear in the first chart above) since the device type does not change over time, and would therefore be redundant in our time-series table. If we filtered or grouped by that value very frequently, it might be better to denormalize it and include it in the time-series table, but we decided against it here.
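If most of our queries filter by location over a time range, we might also add an index tailored to that pattern (a sketch; the right index always depends on your actual workload):

```sql
-- Composite index to speed up "all readings for a location
-- over a time range" queries against the conditions table above.
CREATE INDEX idx_conditions_location_time
  ON conditions (location_id, time DESC);
```

The descending time order matches the common access pattern of reading the most recent data first.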

This example will allow us to answer questions around a device's temperature at any given time. If we are looking to answer different questions of the data, we would create an additional schema. When working with time-series data, it’s not uncommon to use multiple schema designs for your workloads.

Tip #3: SQL tools for working with time-series data

SQL was first standardized in 1986, and given its age and large community of users, there’s no shortage of tools, extensions, and resources to help developers (including TimescaleDB). So, let’s talk about a few open source SQL resources our engineers and community members frequently use - and recommend - for working with time-series data.

Extension: PostGIS

PostGIS is a popular Postgres extension that adds support for geographical objects. More specifically, PostGIS adds spatial functions such as distance, area, union, intersection, and specialty geometry data types to PostgreSQL.

Why is this useful? Spatial functions allow you to answer questions about geographical locations, like where along a specific route a delivery truck was on day X, day Y, and day Z, to help identify environmental factors that may impact delivery times (traffic patterns, etc.). This is a common need in many use cases and industries (most notably IoT, oil & gas, manufacturing, and logistics).

If we have PostGIS installed, using the environmental sensor dataset example from above, we can ask something like:

SELECT time_bucket('30 minutes', time) AS thirty_min, COUNT(*) AS devices_near_station
  FROM measurements
  WHERE ST_Distance(location_geom, ST_Transform(ST_SetSRID(ST_MakePoint(-73.9772,40.7527),4326),2163)) < 500
    AND time >= '2020-02-18' AND time < '2020-02-19'
  GROUP BY thirty_min
  ORDER BY thirty_min;

(Note that we have to format our WHERE statement using the geometry format required by PostGIS).

The query above shows environmental measurements from February 18, 2020 for devices located within 500 m of Grand Central Terminal in New York City, displayed in 30-minute intervals. This shows the power of using geospatial and time-series data together: you can slice data by both time and location!
To learn more, pop over to this tutorial that demos the power of PostGIS for geospatial, time-series, and relational data analysis.

Administration/GUI: DBeaver

DBeaver is a free database administration and development tool that you can use in place of a command line. While DBeaver has ER diagrams, database structure comparison, and a lot of other features you’d expect from a GUI-based database management and development tool, it offers additional benefits, like the ability to generate mock data.

As a database developer working with time-series data, you will come to appreciate the graphical tools that you can use to build out and test various schema designs. This includes the ability to document the work you did, and to better understand the decisions that other developers made while designing and implementing your time-series database.

Additionally, the ability to work with mock data (i.e., backfill your database design with mock data) lets you start testing your schema early, make sure that what you are building will perform well, and identify trouble spots as you work through the time-series database design process.

If you are a PostgreSQL user, Postico is another good GUI for Mac. PSequel is also a minimalistic GUI for your side projects.

Management tool: pgAdmin

pgAdmin is a management tool for PostgreSQL and derivative relational databases. It’s arguably the most comprehensive option for working with PostgreSQL -- see its full list of features.

Here are a few we’d like to highlight:

  • Create, view, and edit all common PostgreSQL objects.
  • Graphical query planning tool with syntax highlighting.
  • A dashboard that lets you monitor server activities such as database locks, connected sessions, and prepared transactions.
  • Since pgAdmin is a web application, you can deploy it on any server and access it remotely.
  • A UI made up of detachable panels that you can arrange to your liking.
  • A procedural language debugger to help you debug your code.
  • A portable version that lets you move your data between machines easily.

All of these features make pgAdmin an easy way to work with time-series data, especially when exploring it through the UI.

Tip #4: Apply SQL aggregate functions to speed up your data reporting and analysis

If you already work with time-series data, you likely create lots of reports that allow you to answer critical business questions about your data, such as “how much CPU should we allocate, based on prior usage?” or “how many users engaged with our A/B test feature over the last week?”. However, in order to build these reports, you need to quickly and easily summarize (or aggregate) your data.

Fortunately with SQL, you can make use of aggregate functions! These include:

  • COUNT - counts rows in a specific table or view
  • SUM - calculates the sum of values
  • AVG - calculates the average of a set of values
  • MIN - returns the minimum value in a set of values
  • MAX - returns the maximum value in a set of values
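You can also combine several of these in a single query. For example, against the hypothetical conditions table from Tip #2:

```sql
-- One pass over the data returns several summary statistics at once.
SELECT COUNT(*)         AS readings,
       MIN(temperature) AS coldest,
       MAX(temperature) AS warmest,
       AVG(temperature) AS average
FROM conditions
WHERE location_id = 42;
```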

Now, let’s see some of these in action. Using our conditions table from the example above, say we wanted to know the warmest temperature each device reached in Grand Central Terminal (location_id = 42) after a certain time. We could run a query such as:

SELECT date_trunc('minute', time) AS time,
	device_id, MAX(temperature) AS temperature
	FROM conditions
	WHERE time > '2020-01-01 01:02:30' AND location_id = 42
	GROUP BY 1, device_id
	ORDER BY 1;

And get a table that looks like:

time  device_id  temperature
1:02  abc123     75
1:03  def456     64

This tells us that device abc123 reached a higher temperature than device def456 after 01:02:30.

Now, say we want to determine the average temperature for all the devices in Grand Central Terminal (location_id 42) in one minute time intervals. We run the following query:

SELECT date_trunc('minute', time) AS time,
	location_id, AVG(temperature)
	FROM conditions
	WHERE location_id = 42
	GROUP BY 1, location_id
	ORDER BY 1;

And get this response:

time  location_id  AVG(temperature)
1:02  42           68
1:03  42           68

While the average temperature didn’t change over the course of a minute in this case, you could choose different time intervals, or different times of day, that would cause changes.
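Widening the interval is just a change to the date_trunc argument (again assuming the hypothetical conditions table):

```sql
-- Hourly (rather than per-minute) average temperatures
SELECT date_trunc('hour', time) AS hour,
       AVG(temperature)
FROM conditions
WHERE location_id = 42
GROUP BY hour
ORDER BY hour;
```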

There you have it! Using a few SQL functions, you can aggregate your data to simplify your reporting process, display your data in different ways - based on what you and your organization care about - and generate useful, actionable reports to share across the organization.

If you are looking to perform more advanced aggregate analysis, check out TimescaleDB’s continuous aggregates feature, and “How to quickly build dashboards with time-series data” for a step-by-step example (using Grafana and Python).

Where to go from here?

If you have a tip that you would like to be featured in a future #TSDBTipTuesday, leave a comment below!

Finally, if you are interested in staying up with other SQL, Postgres, and time-series data news, sign up for our bi-weekly newsletter.