How to explore TimescaleDB using simulated IoT sensor data

Introduction

The Internet of Things (IoT) describes a trend where computing is becoming ubiquitous and is embedded in more and more physical things. For many of these things, the purpose of IoT is to collect sensor data about the environment in which it exists: e.g., oil wells, factories, power plants, farms, moving vehicles, office buildings, homes.

In other words, IoT is all about the data. And the datasets generated by these things are generally time-series in nature, with relational metadata to describe those things.

In this tutorial we explore some of the features and capabilities of TimescaleDB using an IoT sensor dataset that is meant to simulate a real-world IoT deployment.

Before you start

Create a new TimescaleDB instance via Timescale Cloud.

Step 1: Set up your tables

First, connect to your Timescale Cloud instance via psql.

Second, create the “sensors” and “sensor_data” tables:

CREATE TABLE sensors(
  id SERIAL PRIMARY KEY,
  type VARCHAR(50),
  location VARCHAR(50)
);
CREATE TABLE
CREATE TABLE sensor_data (
  time TIMESTAMPTZ NOT NULL,
  sensor_id INTEGER,
  temperature DOUBLE PRECISION,
  cpu DOUBLE PRECISION,
  FOREIGN KEY (sensor_id) REFERENCES sensors (id)
);
CREATE TABLE

Third, convert the sensor_data table into a hypertable:

SELECT create_hypertable('sensor_data', 'time');
    create_hypertable     
--------------------------
 (1,public,sensor_data,t)
(1 row)

Fourth, populate the sensors table with 4 sensors:

INSERT INTO sensors (type, location) VALUES
('a','floor'),
('a', 'ceiling'),
('b','floor'),
('b', 'ceiling');
INSERT 0 4

Fifth, verify that the sensors were created correctly:

SELECT * FROM sensors;
 id | type | location 
----+------+----------
  1 | a    | floor
  2 | a    | ceiling
  3 | b    | floor
  4 | b    | ceiling
(4 rows)

Step 2: Create simulated IoT sensor data

Note: for the following sections we’ll share the results of our queries as an example, but since the tutorials generates random data every time it is run, your results will look different (but will be structured the same way).

First, generate a dataset of simulated data for one sensor, recording data every 5 minutes for the past 24 hours:

SELECT 
  generate_series(now() - interval '24 hour', now(), interval '5 minute') AS time,
  random() AS cpu,
  random()*100 AS temperature;

Note: Your data will look different but should have the same structure.

             time              |         cpu          |    temperature    
-------------------------------+----------------------+-------------------
 2019-07-31 15:55:31.744218+00 |   0.0306301130913198 |  75.7227655500174
 2019-07-31 16:00:31.744218+00 |    0.626729523297399 |  20.1422684360296
 2019-07-31 16:05:31.744218+00 |    0.526347786653787 |  87.2947076335549
 2019-07-31 16:10:31.744218+00 |    0.169700589030981 |  21.1996510624886
 2019-07-31 16:15:31.744218+00 |     0.71232553711161 |  91.3718110416085

Second, generate a similar dataset for each of our four sensors and insert into our sensor_data table:

WITH simulated_data 
AS 
(SELECT 
  1 as sensor_id, generate_series(now() - interval '24 hour', now(), interval '5 minute') AS time,
  random() AS cpu,
  random()*100 AS temperature
)
INSERT INTO sensor_data (time, sensor_id, cpu, temperature)
  SELECT time, sensor_id, cpu, temperature FROM simulated_data;
INSERT 0 289
WITH simulated_data 
AS 
(SELECT 
  2 as sensor_id, generate_series(now() - interval '24 hour', now(), interval '5 minute') AS time,
  random() AS cpu,
  random()*100 AS temperature
)
INSERT INTO sensor_data (time, sensor_id, cpu, temperature)
  SELECT time, sensor_id, cpu, temperature FROM simulated_data;
INSERT 0 289
WITH simulated_data 
AS 
(SELECT 
  3 as sensor_id, generate_series(now() - interval '24 hour', now(), interval '5 minute') AS time,
  random() AS cpu,
  random()*100 AS temperature
)
INSERT INTO sensor_data (time, sensor_id, cpu, temperature)
  SELECT time, sensor_id, cpu, temperature FROM simulated_data;
INSERT 0 289
WITH simulated_data 
AS 
(SELECT 
  4 as sensor_id, generate_series(now() - interval '24 hour', now(), interval '5 minute') AS time,
  random() AS cpu,
  random()*100 AS temperature
)
INSERT INTO sensor_data (time, sensor_id, cpu, temperature)
  SELECT time, sensor_id, cpu, temperature FROM simulated_data;
INSERT 0 289

Third, verify that the simulated sensor_data was written correctly:

SELECT * FROM sensor_data ORDER BY time;
             time              | sensor_id |    temperature     |         cpu         
-------------------------------+-----------+--------------------+---------------------
 2019-07-31 15:56:25.843575+00 |         1 |   6.86688972637057 |   0.682070567272604
 2019-07-31 15:56:40.244287+00 |         2 |    26.589260622859 |   0.229583469685167
 2019-07-31 15:56:45.653115+00 |         3 |   79.9925176426768 |   0.457779890391976
 2019-07-31 15:56:53.560205+00 |         4 |   24.3201029952615 |   0.641885648947209
 2019-07-31 16:01:25.843575+00 |         1 |   33.3203678019345 |  0.0159163917414844
 2019-07-31 16:01:40.244287+00 |         2 |   31.2673618085682 |   0.701185956597328
 2019-07-31 16:01:45.653115+00 |         3 |   85.2960689924657 |   0.693413889966905
 2019-07-31 16:01:53.560205+00 |         4 |   79.4769988860935 |   0.360561791341752
…

Step 3: Run basic queries

Let’s start by calculating the average temperature and cpu by 30 minute window:

SELECT 
  time_bucket('30 minutes', time) AS period, 
  AVG(temperature) AS avg_temp, 
  AVG(cpu) AS avg_cpu 
FROM sensor_data 
GROUP BY period;
         period         |     avg_temp     |      avg_cpu      
------------------------+------------------+-------------------
 2019-07-31 19:00:00+00 | 49.6615830013373 | 0.477344429974134
 2019-07-31 22:00:00+00 | 58.8521540844037 | 0.503637770501276
 2019-07-31 16:00:00+00 | 50.4250325243144 | 0.511075591299838
 2019-07-31 17:30:00+00 | 49.0742547437549 | 0.527267253802468
 2019-08-01 14:30:00+00 | 49.3416377226822 | 0.438027751864865

But what if we don’t just want the average temperature for each period, but also the last temperature? (For example if we wanted to understand the final temperature value at the end of the interval.)

SELECT 
  time_bucket('30 minutes', time) AS period, 
  AVG(temperature) AS avg_temp, 
  last(temperature, time) AS last_temp, 
  AVG(cpu) AS avg_cpu 
FROM sensor_data 
GROUP BY period;
         period         |     avg_temp     |    last_temp     |      avg_cpu      
------------------------+------------------+------------------+-------------------
 2019-07-31 19:00:00+00 | 49.6615830013373 | 84.3963081017137 | 0.477344429974134
 2019-07-31 22:00:00+00 | 58.8521540844037 | 76.5528806950897 | 0.503637770501276
 2019-07-31 16:00:00+00 | 50.4250325243144 | 43.5192013625056 | 0.511075591299838
 2019-07-31 17:30:00+00 | 49.0742547437549 |  22.740753274411 | 0.527267253802468
 2019-08-01 14:30:00+00 | 49.3416377226822 | 59.1331578791142 | 0.438027751864865
…

Now let’s take advantage of some of the metadata we have stored in the sensors table:

SELECT 
  sensors.location,
  time_bucket('30 minutes', time) AS period, 
  AVG(temperature) AS avg_temp, 
  last(temperature, time) AS last_temp, 
  AVG(cpu) AS avg_cpu 
FROM sensor_data JOIN sensors on sensor_data.sensor_id = sensors.id
GROUP BY period, sensors.location;
 location |         period         |     avg_temp     |     last_temp     |      avg_cpu      
----------+------------------------+------------------+-------------------+-------------------
 ceiling  | 2019-07-31 15:30:00+00 | 25.4546818090603 |  24.3201029952615 | 0.435734559316188
 floor    | 2019-07-31 15:30:00+00 | 43.4297036845237 |  79.9925176426768 |  0.56992522883229
 ceiling  | 2019-07-31 16:00:00+00 | 53.8454438598516 |  43.5192013625056 | 0.490728285357666
 floor    | 2019-07-31 16:00:00+00 | 47.0046211887772 |  23.0230117216706 |  0.53142289724201
 ceiling  | 2019-07-31 16:30:00+00 | 58.7817596504465 |  63.6621567420661 | 0.488188337767497
 floor    | 2019-07-31 16:30:00+00 |  44.611586847653 |  2.21919436007738 | 0.434762630766879
 ceiling  | 2019-07-31 17:00:00+00 | 35.7026890735142 |  42.9420990403742 | 0.550129583687522
 floor    | 2019-07-31 17:00:00+00 | 62.2794370166957 |  52.6636955793947 | 0.454323202022351
…

Step 4: Set up a continuous aggregate view

Our queries have gotten a little unwieldy. If we find ourselves running them often, we can save ourselves time (user time and query time) by saving them as a continuous aggregate view:

CREATE VIEW sensor_data_30min
WITH (timescaledb.continuous)
AS
SELECT 
  time_bucket('30 minutes', time) AS period, 
  AVG(temperature) AS avg_temp, 
  last(temperature, time) AS last_temp, 
  AVG(cpu) AS avg_cpu 
FROM sensor_data 
GROUP BY period;
CREATE VIEW

We can see the results by automatically querying the continuous aggregate view:

SELECT * FROM sensor_data_30min;
         period         |     avg_temp     |    last_temp     |      avg_cpu      
------------------------+------------------+------------------+-------------------
 2019-07-31 15:30:00+00 |  34.442192746792 | 24.3201029952615 | 0.502829894074239
 2019-07-31 16:00:00+00 | 50.4250325243144 | 43.5192013625056 | 0.511075591299838
 2019-07-31 16:30:00+00 | 51.6966732490497 | 63.6621567420661 | 0.461475484267188
 2019-07-31 17:00:00+00 |  48.991063045105 | 42.9420990403742 | 0.502226392854936
 2019-07-31 17:30:00+00 | 49.0742547437549 |  22.740753274411 | 0.527267253802468
...

What a continuous aggregate view does is recompute the query automatically at regular time intervals (which the user can specify) and materialize the results into a table. When we query the view, TimescaleDB reads and processes the much smaller materialized table instead of the raw data. This speeds up the query significantly (and is also much easier for us to type!).

To peek further behind the scenes:

SELECT * FROM timescaledb_information.continuous_aggregate_stats;
     view_name     |  completed_threshold   | invalidation_threshold | job_id |     last_run_started_at      | job_status | last_run_duration |      next_scheduled_run       
-------------------+------------------------+------------------------+--------+------------------------------+------------+-------------------+-------------------------------
 sensor_data_30min | 2019-08-01 14:30:00+00 | 2019-08-01 14:30:00+00 |   1000 | 2019-08-01 15:59:32.46657+00 | scheduled  | 00:00:00.006064   | 2019-08-01 16:59:32.472634+00

Conclusion

Congratulations! You are now on your way to exploring TimescaleDB.

To learn more about the TimescaleDB functions we just used, please visit these pages in our developer documentation: