<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:media="http://search.yahoo.com/mrss/"><channel><title><![CDATA[Timescale Blog]]></title><description><![CDATA[The latest thoughts, tutorials, and technical posts on TimescaleDB, SQL, time-series data, and PostgreSQL. With IoT, network monitoring, and DevOps use cases.]]></description><link>http://3.88.198.173:80/</link><image><url>http://3.88.198.173:80/favicon.png</url><title>Timescale Blog</title><link>http://3.88.198.173:80/</link></image><generator>Ghost 3.32</generator><lastBuildDate>Thu, 09 Oct 2025 16:47:16 GMT</lastBuildDate><atom:link href="http://3.88.198.173:80/rss/" rel="self" type="application/rss+xml"/><ttl>60</ttl><item><title><![CDATA[Timescale Newsletter Roundup: November Edition]]></title><description><![CDATA[Get a breakdown of what's new from Team Timescale, from the much-anticipated TimescaleDB 2.0 release to tips and tools for wrangling time-series data, speeding up database performance, and beyond 🚀.    ]]></description><link>http://3.88.198.173:80/timescale-newsletter-roundup-november-edition/</link><guid isPermaLink="false">5fdb982d8946b00cbd80137e</guid><category><![CDATA[Newsletter]]></category><dc:creator><![CDATA[Lacey Butler]]></dc:creator><pubDate>Mon, 07 Dec 2020 16:30:31 GMT</pubDate><media:content url="http://3.88.198.173/content/images/2020/11/vaun0815--nliO2ugnGk-unsplash-1.jpg" medium="image"/><content:encoded><![CDATA[<img src="http://3.88.198.173/content/images/2020/11/vaun0815--nliO2ugnGk-unsplash-1.jpg" alt="Timescale Newsletter Roundup: November Edition"><p>Get a breakdown of what's new from Team Timescale, from the much-anticipated TimescaleDB 2.0 release to our tips and tools for wrangling time-series data, speeding up database performance, and beyond 🚀.    
</p><p>We’re always releasing new features, creating new documentation and tutorials, and hosting virtual sessions to help developers do amazing things with their data. And, to make it easy for our community members to discover and get the resources they need to power their projects, teams, or businesses with analytics, we round up our favorite new pieces in our biweekly newsletter.</p><p>We’re on a mission to teach the world about time-series data, supporting and growing communities around the globe.</p><p>And, sharing educational resources as broadly as possible is one way to do just that :).</p><p><strong>Here’s a snapshot of the content we shared with our readers this month (<a href="https://www.timescale.com/signup/newsletter/">subscribe</a> to get updates straight to your inbox).</strong></p><h2 id="product-updates-announcements"><strong>Product updates &amp; announcements</strong></h2><p>⭐️ <strong>[Bonus Announcement]: <a href="http://3.88.198.173/blog/timescaledb-vs-amazon-timestream-6000x-higher-inserts-175x-faster-queries-220x-cheaper">TimescaleDB vs. Amazon Timestream: 6000x higher inserts, 5-175x faster queries, 150x-220x cheaper</a> &gt;&gt;</strong></p><p>While this didn't quite make it into our last November newsletter, the <em>work</em> behind it took place in November, so we're giving it special mention here. </p><p>We ran Amazon Timestream through the open-source Time Series Benchmark Suite, and as the title suggests, the results were pretty shocking: even after attempting 10+ different configurations, TimescaleDB dramatically outperformed Amazon Timestream in every area. 
Read the full post for detailed benchmark results, get configuration details to run your own analysis, and learn how our approach to licensing and software development gives us an advantage.</p><ul><li>👉 <a href="https://twitter.com/ryanbooz/status/1334234417552891904?s=20">See Ryan’s Twitter thread</a> for an at-a-glance summary, complete with 💯 graphs.</li><li>🔖 <a href="https://news.ycombinator.com/item?id=25287793">Check out the Hacker News discussion</a> (100+ comments!)</li><li>🏅 <a href="https://twitter.com/acoustik/status/1334519441208426496">Read Ajay’s Twitter thread</a> for more on Cloud Protection Licenses and open-source business sustainability.</li><li>💻 <a href="https://github.com/timescale/tsbs">Get the Time Series Benchmark Suite code</a> (GitHub)</li></ul><figure class="kg-card kg-image-card kg-card-hascaption"><img src="http://3.88.198.173/content/images/2020/12/08-Comparisson@2x-4.png" class="kg-image" alt="Timescale Newsletter Roundup: November Edition" srcset="http://3.88.198.173/content/images/size/w600/2020/12/08-Comparisson@2x-4.png 600w, http://3.88.198.173/content/images/size/w1000/2020/12/08-Comparisson@2x-4.png 1000w, http://3.88.198.173/content/images/size/w1600/2020/12/08-Comparisson@2x-4.png 1600w, http://3.88.198.173/content/images/size/w2400/2020/12/08-Comparisson@2x-4.png 2400w" sizes="(min-width: 720px) 720px"><figcaption>🥊 <a href="http://3.88.198.173/blog/timescaledb-vs-amazon-timestream-6000x-higher-inserts-175x-faster-queries-220x-cheaper">See how TimescaleDB and Amazon Timestream stack up</a></figcaption></figure><p><strong>[Product Announcement #1]: <a href="http://3.88.198.173/blog/timescaledb-2-0-a-multi-node-petabyte-scale-completely-free-relational-database-for-time-series/">TimescaleDB 2.0 RC - multi-node, petabyte-scale, 100% free relational database for time-series - has arrived</a> &gt;&gt;</strong></p><p>This release is a huge milestone for us, the TimescaleDB community, and the industry 
as a whole: TimescaleDB is now a multi-node, petabyte-scale relational database for time-series – and completely free. In addition to multi-node, we’ve added new functionality and enhanced core features to give users more control and flexibility.</p><ul><li><strong>🚀  </strong><a href="http://3.88.198.173/blog/timescaledb-2-0-a-multi-node-petabyte-scale-completely-free-relational-database-for-time-series/">Read our announcement blog post</a> to learn what's new, our journey to 2.0, and why we believe relational databases are the past and future of software development.</li><li>🎓 <a href="https://tsdb.co/2-0-playlist">Watch our All Things TimescaleDB 2.0 Youtube playlist</a> (5 videos) to get an overview of all new features, then dive into feature-specific videos, demos, and tips.</li><li>🐤 <a href="https://twitter.com/michaelfreedman/status/1321858831689895936">See this Twitter thread</a> from Mike, Timescale CTO, for a quick - and emoji-packed - breakdown.</li><li>🙏 Biggest thank you to the Timescale Engineering team and to our countless beta testers for your feedback and support.</li></ul><p><strong>[Product Update #2]: Introducing dynamic scaling on Timescale Forge &gt;&gt;</strong></p><p>We just shipped dynamic scaling capabilities on Timescale Forge, allowing you to resize your database compute and storage on demand. 
Result: more flexibility, better cost control, same great cloud-native hosted TimescaleDB instance.</p><ul><li>👉 <a href="https://console.forge.timescale.com/signup">Explore Timescale Forge</a> - 100% free for 30 days</li><li>💬 <a href="https://app.slack.com/client/T4GT3N2JK/C0174FLJBEZ">Share feedback on Slack (#timescale-forge channel)</a></li></ul><figure class="kg-card kg-image-card kg-card-hascaption"><img src="http://3.88.198.173/content/images/2020/11/dynamic-resizing-1.gif" class="kg-image" alt="Timescale Newsletter Roundup: November Edition"><figcaption><a href="https://console.forge.timescale.com/signup">Timescale Forge</a> dynamic scaling in action</figcaption></figure><h2 id="new-technical-content-videos-tutorials"><strong>New technical content, videos &amp; tutorials</strong></h2><p><strong>[PostgreSQL Pro Tips]: <a href="https://youtu.be/mjGE_kEg4Oc">5 essential PostgreSQL functions for monitoring &amp; analytics</a> &gt;&gt;</strong></p><p>We love PostgreSQL, but it’s not always easy to write efficient, useful queries for DevOps scenarios. In this 45 min. session, <a href="https://twitter.com/avthars">@avthars</a> demos his favorite queries for real-time monitoring and historical reporting, including TimescaleDB-specific functions for complex time-series analysis.</p><ul><li>🔥 <a href="https://github.com/timescale/examples/tree/master/air-quality">Get his air quality demo code</a> on GitHub.</li><li>📈 <a href="https://docs.timescale.com/latest/using-timescaledb/reading-data#advanced-analytics">Visit our docs</a> for 10+ advanced analytics queries.</li></ul><p><strong>[Grafana Guide]: <a href="https://youtu.be/bnxSRyF2fnc">Tackle Grafana nuances with advanced tips and workarounds</a> &gt;&gt;</strong></p><p>Learn <a href="https://twitter.com/avthars">@avthars</a>’ go-to Grafana workarounds for common scenarios, complete with demos, resources, and beyond. 
You’ll see how - and why - to enable timeshifting, autoswitch aggregations in a single graph, and alert on templated queries.</p><h2 id="new-remote-friendly-events-community"><br><strong>New #remote-friendly events &amp; community</strong></h2><p><strong>[Office Hours with Mike]: </strong><a href="https://www.timescale.com/office-hours"><strong>Join our Community Q &amp; A sessions </strong></a><strong>&gt;&gt;</strong></p><p>If you haven’t joined our monthly sessions yet, 2021's your chance! Office Hours are always different - with topics ranging from best ways to integrate with 3rd party tools to musings on open-source technology - and always chock-full of expert advice, community projects, and fun. </p><ul><li>✅ <strong><a href="https://www.timescale.com/office-hours">RSVP for Tues, January 5th</a></strong> - we love to see returning and fresh faces.</li></ul><p><strong>[Community Spotlight]:  </strong><a href="https://www.dnsfilter.com/blog/timescaledb-performance/?utm_campaign=Newsletters%202020&amp;utm_source=hs_email&amp;utm_medium=email&amp;_hsenc=p2ANqtz-_4lz7MHBXBCXKxW7HTeNJylVBDg2AaCzd5XCv2phzSgIDCoCInnvUacF5Z0c7IQabWan_z" rel="noopener"><strong>We predicted our TimescaleDB performance would get us to 3B Queries: here’s what really happened</strong></a> <strong>&gt;&gt;</strong></p><p>Shoutout to our friends and long-time community members <a href="https://twitter.com/DNSFilter?utm_campaign=Newsletters%202020&amp;utm_source=hs_email&amp;utm_medium=email&amp;_hsenc=p2ANqtz-_4lz7MHBXBCXKxW7HTeNJylVBDg2AaCzd5XCv2phzSgIDCoCInnvUacF5Z0c7IQabWan_z" rel="noopener noreferrer">@DNSFilter</a> for sharing how they've scaled their infrastructure over the last 24 months, why they moved to bare metal to support massive increase in users (and 6B+ requests per day!), and more.</p><p><strong>[Session Replay]: </strong><a href="https://www.youtube.com/watch?v=nMObJ8EEtXE&amp;feature=emb_logo"><strong>Purpose-built Observability Solutions w/ Open Source Software: Lessons 
from the Field</strong></a><strong> &gt;&gt;</strong></p><p>Watch <a href="https://twitter.com/avthars">@avthars</a>’ Open Source Summit EU session to start building your own flexible, 100% open source, 100% free observability stack. You’ll cover the pros and cons of various approaches, get tips from production deployments, and see how to deploy your own stack in &lt;5 mins.</p><ul><li>💡 <a href="https://avthar.com/blog/oss-eu-2020">Visit Avthar's blog</a> for background on the session and bonus resources.</li><li>🐤 <a href="https://twitter.com/avthars/status/1320809418372616194">See the Twitter thread</a> for talk highlights and key takeaways.</li></ul><p><strong>[Community Article #1]: <a href="https://quan.hoabinh.vn/blog/2020/11/87-mat-may-sau-khi-chuyen-tu-influxdb-sang-timescaledb">Switching from InfluxDB to TimescaleDB</a> &gt;&gt;</strong></p><p>In this 💯 writeup, our friends at AgriConnect detail how they use TimescaleDB to power their IoT platform for agriculture, as well as their experiences with time-series databases and their evaluation criteria.</p><ul><li>🙏 to <a href="https://twitter.com/ng_hongquan?lang=en">@ng_hongquan</a></li><li>🥬 <a href="https://agriconnect.vn/p/">Visit AgriConnect</a> to learn more about the technology, mission, and team.</li></ul><p><strong>[Community Article #2]: <a href="https://medium.com/smart-digital-garden/timescaledb-and-django-5f2640b28ef1">TimescaleDB and Django</a> &gt;&gt;</strong></p><p>Our friends from Protohaus Makerspace use TimescaleDB and Django to handle massive streams of data from hydroponic gardens. 
In this how-to, they share how to integrate TimescaleDB with a Django app in a few simple steps, complete with code samples 🎉.</p><ul><li>🌷 <a href="https://www.protohaus.org/projekte/smart-digital-garden/">Visit Protohaus</a> to learn more about the Smart Digital Garden project</li><li>💻 <a href="https://github.com/protohaus/django-timescale">Get the code on GitHub</a></li></ul><h2 id="timescaledb-tips-reading-list-etc-"><strong>TimescaleDB tips, reading list &amp; more</strong></h2><p><strong>[TimescaleDB Tip #1]: <a href="https://youtu.be/Fdlo7HC8DTE">Use real-time aggregates for faster queries on raw (aka the latest) data</a> &gt;&gt;</strong></p><p>In this &lt;7 minute how-to video, <a href="https://twitter.com/avthars">@avthars</a> breaks down how real-time aggregates give you the best of both worlds: the speed of continuous aggregates <em>and</em> the ability to query your not-yet-materialized raw data. You’ll get demos, benchmarks, and resources to get started quicksmart ✅. </p><ul><li>📑 <a href="http://3.88.198.173/blog/achieving-the-best-of-both-worlds-ensuring-up-to-date-results-with-real-time-aggregation/">Read our engineering blog post</a> for more details and step-by-step examples.</li></ul><p><strong>[TimescaleDB Tip #2]: <a href="https://github.com/timescale/examples">Explore sample apps, integrations, and more</a> &gt;&gt;</strong></p><p>From analyzing cryptocurrency trends and real-time bus locations to measuring environmental changes with Raspberry Pi, this repo has a variety of apps, clients, and integrations to inspire your time-series analysis. </p><ul><li>🧭 Want more examples? 
<a href="https://docs.timescale.com/latest/tutorials">Choose your own adventure with 20+ tutorials</a></li></ul><p><strong>[TimescaleDB Tip #3]: </strong><a href="https://docs.timescale.com/latest/using-timescaledb/tooling#ts-copy"><strong>Speed up batch inserts with parallel-copy</strong></a><strong> &gt;&gt;</strong></p><p>Use this handy tool to speed up inserts and data migrations for large time-series workloads (100K+ row CSVs).  Our goal: you spend more time analyzing and querying your data, not executing single COPY commands 🙌.</p><p><strong>[Reading List]: </strong><a href="https://docs.timescale.com/latest/faq/"><strong>Get answers to 25+ frequently asked questions</strong></a><strong> &gt;&gt;</strong></p><p>We’ve rounded up common questions from community members into one handy FAQ to help you find answers quickly. Topics range from how TimescaleDB scales to how we compare to other databases and details about how core features work.</p><p><strong>[Team Timescale Fun]: </strong>From competing in asynchronous challenges to showing off our kitchen prowess and culinary skills, we're finding fun ways to stay connected, especially as we onboard new teammates across the world. 
</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="http://3.88.198.173/content/images/2020/11/Screen-Shot-2020-11-22-at-10.39.27-AM.png" class="kg-image" alt="Timescale Newsletter Roundup: November Edition" srcset="http://3.88.198.173/content/images/size/w600/2020/11/Screen-Shot-2020-11-22-at-10.39.27-AM.png 600w, http://3.88.198.173/content/images/size/w1000/2020/11/Screen-Shot-2020-11-22-at-10.39.27-AM.png 1000w, http://3.88.198.173/content/images/2020/11/Screen-Shot-2020-11-22-at-10.39.27-AM.png 1322w" sizes="(min-width: 720px) 720px"><figcaption>🏆 Mel - our async challenge and team-bonding pro - strikes again with an A+ challenge 👏&nbsp;</figcaption></figure><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://lh5.googleusercontent.com/yVI7QR7X5gHUpcxk1yuCbJfiVp4cHMBHfy8Xkv0PL9a0-hQtjLIQ3jnNWF4ErEOmMlUWgKhiXYUfjDtdOplFPoVTayC-W-wX0LUdw3X8Q8BFzLw1kx-A5wMskWuuQzzn8tGx23lH" class="kg-image" alt="Timescale Newsletter Roundup: November Edition"><figcaption>Our #social-cooking channel has been 🔥 recently (shoutout to David’s bakery-worthy <a href="https://cooking.nytimes.com/recipes/1018045-chocolate-babka">chocolate babka</a>)</figcaption></figure><h2 id="wrapping-up"><strong>Wrapping Up</strong></h2><p>And, that concludes this month’s newsletter roundup. 
We’ll continue to release new content, events, and more - posting monthly updates for everyone.</p><p>If you’d like to get updates as soon as they’re available, <strong><a href="https://www.timescale.com/signup/newsletter/">subscribe to our newsletter</a></strong> (2x monthly emails, prepared with 💛 and no fluff or jargon, promise).</p><p>Happy building!<br></p>]]></content:encoded></item><item><title><![CDATA[TimescaleDB vs. Amazon Timestream: 6000x higher inserts, 5-175x faster queries, 150x-220x cheaper]]></title><description><![CDATA[Our benchmark results surprised even us - but even after testing with several configurations, we found Amazon Timestream slow, expensive, and missing key database capabilities like backups, restores, updates, and deletes. ]]></description><link>http://3.88.198.173:80/timescaledb-vs-amazon-timestream-6000x-higher-inserts-175x-faster-queries-220x-cheaper/</link><guid isPermaLink="false">5fdb982d8946b00cbd80137f</guid><category><![CDATA[Announcements & Releases]]></category><dc:creator><![CDATA[Ryan Booz]]></dc:creator><pubDate>Wed, 02 Dec 2020 19:41:47 GMT</pubDate><media:content url="http://3.88.198.173/content/images/2020/12/08-Comparisson@2x-3.png" medium="image"/><content:encoded><![CDATA[<img src="http://3.88.198.173/content/images/2020/12/08-Comparisson@2x-3.png" alt="TimescaleDB vs. Amazon Timestream: 6000x higher inserts, 5-175x faster queries, 150x-220x cheaper"><p><em>Our results surprised even us - but even after testing with several different configurations, we found Amazon Timestream slow, expensive, and missing key database capabilities like backups, restores, updates, and deletes. 
Our testing suite is open-source, so please feel free to check these results for yourself.</em></p><p>In this post, we compare TimescaleDB and Amazon Timestream across quantitative and qualitative dimensions. </p><p>Yes, we are the developers of TimescaleDB, so you might quickly disregard our comparison as biased. But if you let the analysis speak for itself, you’ll find that we stay as objective as possible and aim to be fair to Amazon Timestream in our testing and results reporting. Also, if you want to check our work or run your own analysis, we provide all of our testing via the <a href="https://github.com/timescale/tsbs">Time-Series Benchmark Suite</a>, an open-source project that anyone can use and contribute to.</p><h2 id="about-the-two-database-systems-tested">About the two database systems tested</h2><p><strong>TimescaleDB, </strong>first launched in <a href="http://3.88.198.173/blog/when-boring-is-awesome-building-a-scalable-time-series-database-on-postgresql-2900ea453ee2/">April 2017</a>, is today the industry-leading relational database for time-series, open-source, engineered on top of PostgreSQL, and offered via download or as a fully-managed service on AWS, Azure, and GCP. 
With <a href="http://3.88.198.173/blog/timescaledb-2-0-a-multi-node-petabyte-scale-completely-free-relational-database-for-time-series/">TimescaleDB 2.0</a> (released just last month), which introduced multi-node and made all enterprise features free, TimescaleDB is now petabyte-scale and completely free to use.</p><p>The TimescaleDB community has become the largest developer community for time-series data: tens of millions of downloads; over 500,000 active databases; organizations like AppDynamics, Bosch, Cisco, Comcast, Credit Suisse, DigitalOcean, Dow Chemical, Electronic Arts, Fujitsu, IBM, Microsoft, Rackspace, Schneider Electric, Samsung, Siemens, Uber, Walmart, Warner Music, WebEx, and thousands of others (all in addition to the PostgreSQL community and ecosystem).</p><p><strong>Amazon Timestream</strong>, first announced at AWS re:Invent <a href="https://aws.amazon.com/blogs/aws/aws-previews-and-pre-announcements-at-reinvent-2018-andy-jassy-keynote/">November 2018</a>, but with a launch that was delayed until <a href="https://aws.amazon.com/blogs/aws/store-and-access-time-series-data-at-any-scale-with-amazon-timestream-now-generally-available/">September 2020</a>, is Amazon’s time-series database-as-a-service. Amazon Timestream not only shares a similar name to TimescaleDB, but also embraces SQL as its query language. Amazon Timestream customers include Autodesk, PubNub, and Trimble.</p><p>We compare TimescaleDB and Amazon Timestream across several dimensions:</p><ul><li>Insert and query performance</li><li>Cost for equivalent workloads</li><li>Backups, reliability, and tooling</li><li>Query language, ecosystem, ease-of-use</li><li>Clouds and regions supported</li></ul><p>Below is a summary of our results. For those who are interested, we go into much more detail later in this post.</p><h3 id="insert-performance-query-performance">Insert performance, query performance</h3><p>Our results are striking. 
TimescaleDB outperformed Amazon Timestream 6000x on inserts and 5-175x on queries, depending on the query type. In particular, there were workloads and query types easily supported by TimescaleDB that Amazon Timestream was unable to handle.</p><figure class="kg-card kg-image-card"><img src="http://3.88.198.173/content/images/2020/12/image-16.png" class="kg-image" alt="TimescaleDB vs. Amazon Timestream: 6000x higher inserts, 5-175x faster queries, 150x-220x cheaper" srcset="http://3.88.198.173/content/images/size/w600/2020/12/image-16.png 600w, http://3.88.198.173/content/images/size/w1000/2020/12/image-16.png 1000w, http://3.88.198.173/content/images/size/w1600/2020/12/image-16.png 1600w, http://3.88.198.173/content/images/2020/12/image-16.png 1806w" sizes="(min-width: 720px) 720px"></figure><figure class="kg-card kg-image-card kg-card-hascaption"><img src="http://3.88.198.173/content/images/2020/12/image-1.png" class="kg-image" alt="TimescaleDB vs. Amazon Timestream: 6000x higher inserts, 5-175x faster queries, 150x-220x cheaper" srcset="http://3.88.198.173/content/images/size/w600/2020/12/image-1.png 600w, http://3.88.198.173/content/images/size/w1000/2020/12/image-1.png 1000w, http://3.88.198.173/content/images/2020/12/image-1.png 1041w" sizes="(min-width: 720px) 720px"><figcaption><em>Note: Several queries’ ratios (high-cpu-all, lastpoint, groupby-orderby-limit) are “undefined” because Amazon Timestream did not finish executing them within the default 60 second timeout period that Timestream imposes, while TimescaleDB completed them in less than a single second.</em></figcaption></figure><figure class="kg-card kg-image-card kg-card-hascaption"><img src="http://3.88.198.173/content/images/2020/12/image-2.png" class="kg-image" alt="TimescaleDB vs. 
Amazon Timestream: 6000x higher inserts, 5-175x faster queries, 150x-220x cheaper" srcset="http://3.88.198.173/content/images/size/w600/2020/12/image-2.png 600w, http://3.88.198.173/content/images/size/w1000/2020/12/image-2.png 1000w, http://3.88.198.173/content/images/2020/12/image-2.png 1140w" sizes="(min-width: 720px) 720px"><figcaption><em>Results of benchmarking query performance between TimescaleDB and Amazon Timestream</em></figcaption></figure><p>These results are so dramatic that we did not believe them at first, and we tried a variety of workloads and settings to make sure we weren’t missing anything. </p><p><a href="https://www.reddit.com/r/aws/comments/jsgn9x/realworld_aws_timestream_ingest_performance/">We even posted on Reddit</a> to see if others had been able to get better performance with Amazon Timestream. Although feedback was hard to find, we weren’t the only ones seeing these performance results, as evidenced in a <a href="https://crate.io/a/amazon-timestream-first-impressions/">similar benchmark by Crate.io</a>.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://lh6.googleusercontent.com/_76QiKFxHTLz6tBQ5n8tNVTz16DqGElHY5jbqmXph0jsOTrilpCDmAw6H-aSN_yv0r-5XTEHZcN0oByxUwtoqm3lz2KjLD3TSC72GKVlQT9J92HLd9fHtPd6Ij1qCl157dXiFY5g" class="kg-image" alt="TimescaleDB vs. Amazon Timestream: 6000x higher inserts, 5-175x faster queries, 150x-220x cheaper"><figcaption><em>We REALLY tried to get Amazon Timestream to perform better. Just look at all of the databases we created through the process!</em></figcaption></figure><p>After all of our attempts to achieve better Amazon Timestream performance, we were even more confused when we read a <a href="https://aws.amazon.com/blogs/database/deriving-real-time-insights-over-petabytes-of-time-series-data-with-amazon-timestream/">recent post on the AWS Database Blog</a> that discusses achieving ingest speeds of 3 billion metrics/hour. 
Although the details of how they ingested this scale of data aren’t completely clear, it appears that each “monitored host” sent individual metrics at various intervals directly to Amazon Timestream.</p><p>To achieve 3 billion metrics/hour in their test, 4 million hosts sent 26 metrics every two minutes, an average of 33,000 hosts reporting 866,667 metrics every second. It’s certainly impressive to support 33,000 connections per second without issue, and this demonstrates one of the key advantages that Amazon presents with a serverless architecture like Timestream. If you have an edge-based IoT system that pre-computes metrics on thousands of edge nodes before sending them, Amazon Timestream could simplify your data collection architecture. </p><p>However, as you’ll see, if you have a more traditional client-server data-collection architecture, or <a href="http://3.88.198.173/blog/create-a-data-pipeline-with-timescaledb-and-kafka/">one using a more common streaming pipeline with database consumers, like Apache Kafka</a>, TimescaleDB can import more than 3 million metrics per second from one client – and doesn’t need 33,000 clients.</p><p>Because performance benchmarking is complex, we share the details of our setup, configurations, and workload patterns later in this post, as well as instructions on how to reproduce them.</p><h3 id="cost-for-equivalent-workloads">Cost for equivalent workloads</h3><p>The stark difference in performance translates into a large cost differential as well. 
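As a quick aside before the cost breakdown, the ingest arithmetic from the AWS blog post quoted above (4 million hosts sending 26 metrics every two minutes) can be sanity-checked in a few lines; every figure below is taken from that post as quoted, not a new measurement:

```python
# Figures quoted from the AWS blog post's ingest test (as cited above)
hosts = 4_000_000        # monitored hosts
metrics_per_report = 26  # metrics each host sends per report
interval_s = 120         # each host reports every two minutes

# Spread evenly over the interval, how many hosts report in any given second?
hosts_per_second = hosts / interval_s                       # ≈ 33,333

# Total metrics ingested per second and per hour
metrics_per_second = hosts_per_second * metrics_per_report  # ≈ 866,667
metrics_per_hour = metrics_per_second * 3600                # ≈ 3.12 billion

print(f"{hosts_per_second:,.0f} hosts/s, {metrics_per_second:,.0f} metrics/s, "
      f"{metrics_per_hour / 1e9:.2f}B metrics/hour")
```

This reproduces the "average of 33,000 hosts reporting 866,667 metrics every second" cited above, and works out to roughly 3.12 billion metrics per hour, matching the 3 billion metrics/hour headline figure.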
</p><p>To compare costs, we calculated the cost for our above insert and query workloads, which store 1 billion metrics in TimescaleDB and ~410 million metrics in Amazon Timestream (because we were unable to load the full 1 billion - more later in this post), and ran our suite of queries on top.</p><p>For the same workloads, we found that <a href="https://www.timescale.com/products">fully-managed TimescaleDB</a> is 154x cheaper than Amazon Timestream (224x cheaper if you’re self-managing TimescaleDB on a VM), and inserted twice as many metrics.</p><figure class="kg-card kg-image-card"><img src="http://3.88.198.173/content/images/2020/12/image-17.png" class="kg-image" alt="TimescaleDB vs. Amazon Timestream: 6000x higher inserts, 5-175x faster queries, 150x-220x cheaper" srcset="http://3.88.198.173/content/images/size/w600/2020/12/image-17.png 600w, http://3.88.198.173/content/images/size/w1000/2020/12/image-17.png 1000w, http://3.88.198.173/content/images/size/w1600/2020/12/image-17.png 1600w, http://3.88.198.173/content/images/2020/12/image-17.png 1842w" sizes="(min-width: 720px) 720px"></figure><p>We go into further details about the cost comparison later in this post.</p><h3 id="backups-reliability-and-tooling">Backups, reliability, and tooling</h3><p>For reliability, the differences are also striking. In particular, backups, reliability, and tooling feel like an afterthought with Amazon Timestream.</p><p>In the <a href="https://docs.aws.amazon.com/timestream/latest/developerguide/timestream.pdf">240 page development guide</a> for Amazon Timestream, the words “recovery” and “restore” don’t appear at all, and the word “backup” appears only once... to tell the developer that there is no backup mechanism. Instead, users can “...write your own application using the Timestream SDK to query data and save it to the destination of your choice” (page 100). 
There is no mechanism or support for DELETE or UPDATE of existing data.<strong> The only way to remove data is to drop the entire table.</strong> Furthermore, a dropped table cannot be recovered: dropping is an atomic action, and no Amazon API or Console operation can restore the data.</p><p>In contrast, TimescaleDB is built on PostgreSQL, which means it inherits the 25+ years of hard, careful engineering work that the entire PostgreSQL community has done to build a rock-solid database that supports millions of mission-critical applications worldwide. When operating TimescaleDB, one inherits all of the battle-tested tools that exist in the PostgreSQL ecosystem: <a href="https://www.postgresql.org/docs/9.6/static/app-pgdump.html">pg_dump</a>/<a href="https://www.postgresql.org/docs/9.6/static/app-pgrestore.html">pg_restore</a> and <a href="http://www.postgresql.cn/docs/9.6/app-pgbasebackup.html">pg_basebackup</a> for backup/restore, high-availability/failover tools like <a href="https://github.com/zalando/patroni">Patroni</a>, load balancing tools for clustering reads like <a href="http://www.pgpool.net/mediawiki/index.php/Main_Page">Pgpool</a>/<a href="https://www.pgbouncer.org/">pgbouncer</a>, etc. Since TimescaleDB looks and feels like PostgreSQL, there are minimal operational learning curves. TimescaleDB “just works,” as one would expect from PostgreSQL.</p><h3 id="query-language-ecosystem-and-ease-of-use">Query language, ecosystem, and ease-of-use</h3><p>We applaud Amazon Timestream’s decision to adopt SQL as their query language. Even if Amazon Timestream functions like a NoSQL database in many ways, opting for SQL as the query interface lowers developers’ barrier to entry – especially when compared to other databases like MongoDB and InfluxDB.</p><p>That said, because Amazon Timestream is <strong>not </strong>a relational database, it doesn’t support normalized datasets and JOINs across tables. 
Also, because Amazon Timestream enforces a specific narrow table model on your data, deriving value from your data relies heavily on CASE statements and Common Table Expressions (CTEs) when requesting multiple measurement values (defined by “measurement_name”), leading to some clunky queries (see example later in this post).</p><p>TimescaleDB, on the other hand, has fully embraced all parts of the SQL language from Day 1 – and <a href="https://docs.timescale.com/v2.0/api#analytics">extended SQL with functions custom-built to simplify time-series analysis</a>. TimescaleDB is also a relational database, allowing developers to store their metadata alongside their time-series data and JOIN across tables as necessary. As a consequence, with TimescaleDB, new users have a minimal learning curve and are in full control when querying their data. </p><p>Full SQL means that TimescaleDB supports everything that SQL has to offer, including normalized datasets, cross-table JOINs, subqueries, stored procedures, and user-defined functions. Supporting SQL also enables TimescaleDB to support everything in the SQL ecosystem, including Tableau, Looker, PowerBI, Apache Kafka, Apache Spark, Jupyter Notebooks, R, native libraries for every major programming language, and much more.<strong> </strong>For example, if you already use <a href="https://docs.timescale.com/latest/tutorials/visualizing-time-series-data-in-tableau">Tableau</a> to visualize data, or Apache Spark for data processing, TimescaleDB can plug right into the existing infrastructure due to its compatible connectors. And, given its roots, TimescaleDB supports everything in the PostgreSQL ecosystem, including tools like EXPLAIN that help pinpoint why queries are slow and identify ways to improve performance.</p><p>By contrast, even though Amazon Timestream speaks a variant of SQL, it is “SQL-like,” not full SQL. 
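To illustrate the narrow-table pivoting described earlier, here is a minimal sketch using SQLite as a stand-in for Timestream’s SQL dialect (the table and column names are invented for illustration and are not Timestream’s actual schema):

```python
import sqlite3

# Narrow table model: each row holds ONE measurement, so reading several
# metrics side by side requires a CASE-based pivot.
db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE metrics (ts TEXT, host TEXT, measure_name TEXT, measure_value REAL);
INSERT INTO metrics VALUES
  ('2020-12-01T00:00:00', 'host-1', 'cpu_user',   42.0),
  ('2020-12-01T00:00:00', 'host-1', 'cpu_system',  7.5),
  ('2020-12-01T00:00:00', 'host-1', 'mem_used',   63.2);
""")

# One metric per row means a CASE expression per desired column:
row = db.execute("""
  SELECT ts, host,
         MAX(CASE WHEN measure_name = 'cpu_user'   THEN measure_value END) AS cpu_user,
         MAX(CASE WHEN measure_name = 'cpu_system' THEN measure_value END) AS cpu_system,
         MAX(CASE WHEN measure_name = 'mem_used'   THEN measure_value END) AS mem_used
  FROM metrics
  GROUP BY ts, host
""").fetchone()
print(row)  # ('2020-12-01T00:00:00', 'host-1', 42.0, 7.5, 63.2)
```

In TimescaleDB, the same data would typically live as one wide row per timestamp and host, so the equivalent query is a plain SELECT with no pivot.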
Thus, tooling that normally works with SQL databases - e.g., the Tableau and Apache Spark examples cited above - is unable to use Amazon Timestream data unless it incorporates Amazon Timestream’s specific drivers and SQL-like dialect. This means, for example, that the tooling you might normally use to help you improve query performance doesn’t currently support Amazon Timestream. And, unfortunately, the current Amazon Timestream UI doesn’t give us any clues about why queries might be performing poorly or ways to improve performance (e.g., via settings or query hints).</p><p>In short, if you use Postgres and any tools or extensions with your applications, they will “just work” when connected to TimescaleDB. The same isn’t true for Amazon Timestream. </p><p>So, while the decision to adopt a SQL-like query language is a great start for Amazon Timestream, we found a lot to be desired for a true “easy” developer experience.</p><h3 id="cloud-offering">Cloud offering</h3><p>Amazon Timestream is only offered as a serverless cloud offering on Amazon Web Services. As of writing, it is available in 3 U.S. regions and 1 E.U. region. </p><p>Conversely, TimescaleDB can be run in your own infrastructure or fully managed <a href="https://www.timescale.com/products">through our cloud offerings</a>, which make TimescaleDB available on AWS, GCP, and Azure, in over 75 regions and 2000 different possible region/storage/compute configurations.</p><p>With TimescaleDB, our goal is to give our customers their choice of cloud and the ability to choose the region closest to their customers and co-locate with their other workloads.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="http://3.88.198.173/content/images/2020/12/image-4.png" class="kg-image" alt="TimescaleDB vs. 
Amazon Timestream: 6000x higher inserts, 5-175x faster queries, 150x-220x cheaper" srcset="http://3.88.198.173/content/images/size/w600/2020/12/image-4.png 600w, http://3.88.198.173/content/images/2020/12/image-4.png 971w" sizes="(min-width: 720px) 720px"><figcaption><em>Cloud coverage comparison between Amazon Timestream and TimescaleDB (as of November 2020)</em></figcaption></figure><h3 id="what-we-like-about-amazon-timestream">What we like about Amazon Timestream</h3><p>Despite the results of this comparison, there were several things we like about Amazon Timestream. </p><p>Highest on the list is the simplicity of setting up a serverless offering. There aren’t any sizing decisions or knobs to tweak. Simply create a database and a table, then start sending data. <em>(Note: this is something also coming soon to TimescaleDB.)</em></p><p>This brings up a second advantage of a serverless architecture: even if throughput isn’t ideal from a single client, the service appears to handle thousands of connections without issue. <a href="https://docs.aws.amazon.com/timestream/latest/developerguide/timestream.pdf">According to their documentation</a>, Amazon Timestream will eventually add more resources to keep up with additional ingest or query threads. This means that an application shouldn’t be limited by the resources of one particular server for reads or writes.</p><p>As with other NoSQL databases, some may find the schemaless nature of Amazon Timestream appealing, especially when a project is just getting off the ground. Although schemas become more necessary for performance and data validation reasons as workloads grow, one of the reasons databases like MongoDB have grown in popularity is that they don’t require the same upfront planning as more traditional SQL databases.</p><p>Lastly, SQL. It shouldn’t come as a surprise that we like SQL as an efficient interface to the data we need to examine. 
And, although Amazon Timestream lacks support for parts of standard SQL, most users will find it pretty straightforward to start querying data (after they understand Amazon Timestream’s narrow table model).</p><h3 id="but-why-is-amazon-timestream-so-expensive-slow-and-underwhelming">But why is Amazon Timestream so expensive, slow, and underwhelming?</h3><p>The reality is that Amazon Timestream, despite taking 2 years post-announcement to launch, still seems half-baked.</p><p>Why is Amazon Timestream so expensive, slow, and seemingly underdeveloped? We assume the reason lies in its underlying architecture. Unlike other systems that we and others have benchmarked via the <a href="https://github.com/timescale/tsbs">Time-Series Benchmark Suite</a> (e.g., <a href="http://3.88.198.173/blog/timescaledb-vs-influxdb-for-time-series-data-timescale-influx-sql-nosql-36489299877/">InfluxDB</a>, <a href="http://3.88.198.173/blog/how-to-store-time-series-data-mongodb-vs-timescaledb-postgresql-a73939734016/">MongoDB</a>, and <a href="http://3.88.198.173/blog/time-series-data-cassandra-vs-timescaledb-postgresql-7c2cc50a89ce/">Cassandra</a>), Amazon Timestream is completely closed-source. Based on our usage and experience with other Amazon services, we suspect that under the hood Amazon Timestream is backed by a combination of other Amazon services similar to Amazon ElastiCache, Athena, and S3. But, because we cannot inspect the source code (and because Amazon does not make this sort of information public), this is just a guess.</p><p>By comparison, all of the <a href="https://github.com/timescale/timescaledb">source code for TimescaleDB</a> is available for anyone to inspect. We built TimescaleDB on top of PostgreSQL, giving it a rock-solid foundation and large ecosystem, and then spent years adding advanced capabilities to increase performance, lower costs, and improve the developer experience. 
These capabilities include <a href="https://docs.timescale.com/latest/using-timescaledb/hypertables">auto-partitioning via hypertables and chunks</a>, faster queries via <a href="https://docs.timescale.com/latest/using-timescaledb/continuous-aggregates">continuous aggregates</a>, lower costs via 94%+ <a href="https://docs.timescale.com/latest/using-timescaledb/compression">native compression</a>, high-performance (10+ million inserts a second) and petabyte-scale via a <a href="https://docs.timescale.com/v2.0/using-timescaledb/distributed-hypertables">multi-node architecture</a>, and more. </p><p>We believe the real reason behind the difference between the two products is in the companies building these products and how each approaches software development, community, and licensing. (And kind reader, this is where our bias may sneak in a little bit.)</p><h3 id="amazon-vs-timescale">Amazon vs. Timescale</h3><p>The viability of our company, Timescale, is 100% dependent on the quality of TimescaleDB. If we build a sub-par product, we cease to exist. Amazon Timestream is just another of the <a href="https://en.wikipedia.org/wiki/List_of_Amazon_products_and_services">200+ services that Amazon is developing</a>. Regardless of the quality of Amazon Timestream, that team will still be supported by the rest of Amazon’s business – and if the product gets shut down, that team will find homes elsewhere within the larger company.</p><p>One can see this difference in how the two companies approach the developer community. Without a doubt, <a href="https://www.zdnet.com/article/the-top-cloud-providers-of-2020-aws-microsoft-azure-google-cloud-hybrid-saas/">Amazon Web Services is a leader in all things cloud computing</a>. However, with their enormous catalog of cloud services, many of which are originally derived from external open-source projects, Amazon’s attention is spread over hundreds of products. 
</p><p>Case in point, when Amazon Timestream was announced in 2018, there was strong interest in when it would be released and how it would perform. However, after a two year delay, with no information from Amazon, <a href="https://www.reddit.com/r/aws/comments/i5gomj/aws_timestream_still_happening_in_2020/">many gave up on waiting for the product</a>. When the product was finally released on September 30, 2020, there was <a href="https://news.ycombinator.com/item?id=24645416">very little fanfare from the community</a>.</p><p>In contrast, Timescale develops its <a href="https://github.com/timescale/timescaledb">source code</a> out in the open, and developers can reach us for help anytime directly via our <a href="https://slack.timescale.com/">Slack channel</a> (which is staffed by our engineers), whether they are a paying customer or not. We’ve continued to invest in our community by making all of our software available for free, while also serving our customers with our <a href="https://www.timescale.com/products/">hosted and fully-managed cloud services</a>. </p><p>Building a high-performance, cost-effective, reliable, and easy-to-use time-series database is a hard and increasingly business-critical problem. For us, building TimescaleDB into a best-in-class time-series developer experience is an existential requirement. Without it, we cease to exist. 
For Amazon, Amazon Timestream is just a checkbox; another service to list on their website.</p><h3 id="when-amazon-is-forced-to-compete-on-product-quality-all-open-source-companies-have-a-shot-at-building-great-businesses-">When Amazon is forced to compete on product quality, all open-source companies have a shot at building great businesses.</h3><p>Amazon has a history of offering services that take advantage of the R&amp;D efforts of others: for example, <a href="https://aws.amazon.com/elasticsearch-service/">Amazon Elasticsearch Service</a>, <a href="https://aws.amazon.com/msk/">Amazon Managed Streaming for Apache Kafka</a>, <a href="https://aws.amazon.com/elasticache/redis/">Amazon ElastiCache for Redis</a>, and many others. </p><p>If Amazon wanted to launch a time-series database service that supported SQL, why did they build one from scratch, and not just offer managed TimescaleDB?</p><p>Answer: our innovative licensing. The core of TimescaleDB is open-source, licensed under Apache 2. But advanced capabilities, such as compression, continuous aggregates, and multi-node, are licensed under the Timescale License, a source-available license that is open-source in spirit and makes all software available for free – but contains a critical restriction: preventing companies from offering that software via a hosted database-as-a-service. </p><p>The Timescale License is an example of a <em>“Cloud Protection License”</em>: a license that recognizes that the cloud has increasingly become the dominant form of open-source commercialization, and that protects the right of the project’s main creator/maintainer (who often contributes 99% of the R&amp;D effort) to offer the software in the cloud. 
(Read more about <a href="http://3.88.198.173/blog/building-open-source-business-in-cloud-era-v2/">how we're building a self-sustaining open-source business in the cloud era</a>.)</p><p>This “cloud protection” is what prevents Amazon from just distributing our R&amp;D, and instead forces them to develop their own offering and compete on product quality, not just distribution. And as we can see from Amazon Timestream, building best-in-class database technologies is not easy, even for a company like Amazon.</p><p><strong>The truth is that, when Amazon is forced to compete on product quality, all open-source companies have a shot at building great businesses. </strong></p><p>We welcome Amazon’s new entry to the time-series database market, and appreciate that developers now have even more choice for storing and analyzing their time-series data. Competition is good for developers, and helps drive further innovation. </p><p>For those who want to dig deeper into our benchmarking and comparison, we include detailed notes and methodology below.</p><p>For those who want to try TimescaleDB, <a href="https://www.timescale.com/timescale-signup">create a free account </a>to get started with a fully-managed TimescaleDB instance (100% free for 30 days). </p><p>Want to host TimescaleDB yourself? 
<a href="https://github.com/timescale/timescaledb">Visit our GitHub</a> to learn more about options, get installation instructions, and more (and, if you like what you see, ⭐️  are always appreciated!).</p><p>Join our <a href="http://slack.timescale.com/">Slack community</a> to ask questions, get advice, and connect with other developers (I, as well as our co-founders, engineers, and passionate community members are active on all channels).</p><h2 id="performance-comparison-details">Performance comparison details</h2><p>Here is a quantitative comparison of the two databases across insert and query workloads.</p><p><em>Note: We've released all the code and data used for the below benchmarks as part of the open-source Time Series Benchmark Suite (TSBS) (</em><a href="https://github.com/timescale/tsbs"><em>GitHub</em></a><em>, </em><a href="http://3.88.198.173/blog/time-series-database-benchmarks-timescaledb-influxdb-cassandra-mongodb-bc702b72927e/?utm_source=timescale-influx-benchmark&amp;utm_medium=blog&amp;utm_campaign=july-2020-advocacy&amp;utm_content=tsbs-announcemement-blog"><em>announcement</em></a><em>), so you can reproduce our results or run your own analysis. </em></p><p><em>Typically, when we conduct performance benchmarks (for example, in our previous benchmarks versus </em><a href="http://3.88.198.173/blog/timescaledb-vs-influxdb-for-time-series-data-timescale-influx-sql-nosql-36489299877/"><em>InfluxDB</em></a><em>, </em><a href="http://3.88.198.173/blog/how-to-store-time-series-data-mongodb-vs-timescaledb-postgresql-a73939734016/"><em>MongoDB</em></a><em>, and </em><a href="http://3.88.198.173/blog/time-series-data-cassandra-vs-timescaledb-postgresql-7c2cc50a89ce/"><em>Cassandra</em></a><em>) we use 5 different dataset configurations. These configurations increase metric loads and cardinalities, to simulate a breadth of time-series workloads for inserts and queries. 
</em></p><p><em>However, as you’ll see below, because of performance issues with Amazon Timestream, we were unable to look at Amazon Timestream’s performance under higher cardinalities, and were limited to testing just our lowest-cardinality dataset.</em><br></p><h3 id="machine-configuration">Machine Configuration</h3><p><strong>Amazon Timestream</strong><br>Amazon Timestream is a serverless offering, which means that a user cannot provision a specific service tier. The only meaningful configuration options that a user can modify are the “memory store retention” period and the “magnetic store retention” period. In Amazon Timestream, data can only be inserted into a table if the timestamp falls within the memory store retention period. Therefore, the only setting that we modified to insert data for our first test was the memory store retention period, which we set to 865 hours (~36 days) to provide padding to account for a slower insert rate.<br></p><p><strong>It did not take long for us to realize that Amazon Timestream’s insert performance was dramatically slower than other time-series databases we’ve benchmarked</strong>. Therefore, we took extra time to test insert performance using three different Amazon EC2 instance configurations, each launched in the same region as our Amazon Timestream database:</p><ul><li>t3.medium running Ubuntu 18 LTS, 2 vCPUs, 4GB mem, up to 5 Gb network</li><li>c5n.2xlarge running Ubuntu 20 LTS, 8 vCPUs, 29GB mem, up to 25 Gb network</li><li>m5n.12xlarge running Ubuntu 18 LTS, 48 vCPUs, 192GB mem, 50 Gb network</li></ul><p>After numerous attempts to insert data with each of these instance types, we determined that the size of the client did not noticeably impact insert performance at all. Instead, we needed to run multiple client instances to ingest more data. </p><p>In the end we chose to write data from 1 and 10 t3.medium clients, each running 20 threads of TSBS. 
In the case of 10 clients, each covered a portion of the 30-days to avoid writing duplicate data (Amazon Timestream does not support writing duplicate data).</p><p><strong>TimescaleDB</strong><br>To test the same insert and read latency performance on TimescaleDB, we used the following setup:</p><ul><li>Version: TimescaleDB <a href="https://github.com/timescale/timescaledb/releases/tag/1.7.4">version 1.7.4</a>, with PostgreSQL 12.</li><li>1 remote client machine, 1 database server, both in the same cloud datacenter</li><li>Instance size: Both client and database server ran on DigitalOcean virtual machines (droplets) with 32 vCPU and 192GB Memory each.</li><li>OS: Both server and client machines ran Ubuntu 18.04.3</li><li>Disk Size: 4.8TB of disk in a raid0 configuration (EXT4 filesystem)</li><li>Deployment method: TimescaleDB was deployed using <a href="https://hub.docker.com/r/timescale/timescaledb">Docker images from the official Docker hub</a></li></ul><figure class="kg-card kg-image-card"><img src="http://3.88.198.173/content/images/2020/12/image-15.png" class="kg-image" alt="TimescaleDB vs. Amazon Timestream: 6000x higher inserts, 5-175x faster queries, 150x-220x cheaper" srcset="http://3.88.198.173/content/images/size/w600/2020/12/image-15.png 600w, http://3.88.198.173/content/images/size/w1000/2020/12/image-15.png 1000w, http://3.88.198.173/content/images/size/w1600/2020/12/image-15.png 1600w, http://3.88.198.173/content/images/2020/12/image-15.png 1806w" sizes="(min-width: 720px) 720px"></figure><p><strong>In our tests, TimescaleDB outperformed Amazon Timestream by a shocking 6000x on inserts.</strong> </p><p>The lackluster insert performance of Amazon Timestream took us by surprise, especially since we were using the Amazon Timestream SDK and modeling our TSBS code from examples in their documentation. 
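For the 10-client runs, we split the time range so that no two clients wrote overlapping data. That splitting step can be sketched as follows (a simplified illustration; the `split_range` helper is ours, not part of TSBS):

```python
from datetime import datetime, timedelta

def split_range(start, end, n_clients):
    """Divide [start, end) into n_clients contiguous, non-overlapping windows."""
    step = (end - start) / n_clients
    return [(start + i * step, start + (i + 1) * step) for i in range(n_clients)]

# 30 days of data split across 10 clients: each writes only its own
# 3-day slice, so no duplicate rows are ever sent.
windows = split_range(datetime(2020, 11, 1), datetime(2020, 12, 1), 10)
assert windows[0][1] == windows[1][0]           # windows meet exactly
assert windows[-1][1] == datetime(2020, 12, 1)  # full range is covered
```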
</p><p>In the interest of being thorough and fair to Amazon Timestream, we tried increasing the number of clients writing data, made some code modifications to increase concurrency (in ways that weren’t necessary for TimescaleDB), and worked to eliminate any possible thread contention, then ran the same benchmark with 10 clients on Amazon Timestream. </p><p>After this effort, we were able to increase Amazon Timestream performance to 5,250 metrics/second (across 10 clients) – but even then, TimescaleDB (with only one client, and without any extra code modifications) outperformed Amazon Timestream by 600x.</p><p><em>(Hypothetically, we could have started a lot more clients to increase insert performance on Amazon Timestream (assuming no bottlenecks), but with an average ingest rate of ~523 metrics/second per client, we would have had to start ~61,000 EC2 instances at the same time to finish inserting metrics as fast as one client writing to TimescaleDB.)</em></p><p>In particular, with this low performance, we were only able to test our lowest-cardinality workload, not our usual 5 – even though we worked at it for more than a week. This scenario attempts to insert data from 100 simulated devices, each generating 10 CPU metrics every 10 seconds for ~100M reading intervals (for a total of 1 billion metrics). We never actually made it to the full 1 billion metrics with Amazon Timestream. After nearly 40 hours of inserting data from 10 EC2 clients, we were only able to insert slightly over 410 million metrics. 
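The headline ratios follow directly from the measured rates (a rough check we added; the TimescaleDB rate here is the floor implied by 1 billion metrics in just under 5 minutes):

```python
# Rough check of the headline ratios from the measured rates.
timestream_1_client = 523      # metrics/s, single client
timestream_10_clients = 5_250  # metrics/s, aggregate across 10 clients
timescaledb_1_client = 1_000_000_000 / (5 * 60)  # ~3.33M metrics/s

print(round(timescaledb_1_client / timestream_1_client))    # ~6,373 -> the "6000x" figure
print(round(timescaledb_1_client / timestream_10_clients))  # ~635  -> the "600x" figure
```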
<em>(</em><a href="https://github.com/timescale/tsbs/blob/master/docs/timestream.md"><em>The dataset was created using Time-Series Benchmarking Suite, using the cpu-only use case.</em></a><em>)</em></p><p>Let us put it another way:</p><ul><li>We first tested Amazon Timestream and TimescaleDB with one client writing data.</li><li>Then, in an attempt to be fair to Amazon Timestream, we tested it with 10 separate EC2 instances, over a 2 day period, inserting batches of 1,000 readings (100 hosts, 10 measurements per host) as fast as possible.</li><li>It’s also worth noting that most clients started to receive a fatal connection error from Amazon Timestream between the 28 and 32 hour mark and didn’t recover. Only one client made uninterrupted inserts for more than 40 hours before we manually stopped it. It’s possible that with some additional error checking with the Amazon Timestream SDK response, TSBS could have recovered on its own and continued to send metrics from all 10 clients.</li></ul><p>In total, this means that we inserted data into Amazon Timestream for 332.5 hours and achieved slightly more than 410 million metrics.</p><p><strong>TimescaleDB inserted 1 billion metrics from one client in just under 5 minutes.</strong></p><p>Amazon claims that Amazon Timestream will learn from your insert and query patterns and automatically adjust resources to increase performance. Their documentation <a href="https://docs.aws.amazon.com/timestream/latest/developerguide/writes.html">specifically warns that writes may become throttled</a>, with the only remedy to keep inserting at the same (or higher) rate until Amazon Timestream adjusts. 
However, in our experience, 332.5 hours of inserting data at a very consistent rate was not enough time for it to make this adjustment.</p><p><em><strong>The issue of cardinality:</strong></em><br>One other side-effect of Amazon Timestream taking so long to ingest data: we couldn’t compare how it performs with higher cardinalities, which are common in time-series scenarios, where we need to ingest a <em>relentless</em> stream of metrics from devices, apps, customers, and beyond. (<a href="http://3.88.198.173/blog/what-is-high-cardinality-how-do-time-series-databases-influxdb-timescaledb-compare/">Read more about the role of cardinality in time-series and how TimescaleDB solves for it</a>.)<br><br>We’ve shown in <a href="http://3.88.198.173/blog/what-is-high-cardinality-how-do-time-series-databases-influxdb-timescaledb-compare/#:~:text=In%20the%20world%20of%20databases,%E2%80%9D)%20that%20describes%20that%20data.">previous benchmarks</a> that TimescaleDB actually sees better performance relative to other time-series databases as cardinality increases, with moderate drop-off in terms of absolute insert rate. TimescaleDB surpasses many other popular time-series databases, like InfluxDB, in terms of insert performance for the configurations of 4,000, 100,000, 1 million and 10 million devices.</p><p>But again, we were unable to test this given Amazon Timestream’s (lack of) performance.</p><p><strong>Insert performance summary:</strong></p><ul><li>TimescaleDB outperformed Amazon Timestream by raw margins that we found hard to believe. Despite our best efforts to optimize Amazon Timestream, TimescaleDB still outperformed it by 6,000x (600x if using 10 clients on Amazon Timestream to TimescaleDB’s 1).</li><li>In the time it took us to make a pot of coffee, TimescaleDB inserted 1 billion metrics for a 31-day period. 
With Amazon Timestream, we got two nights’ sleep and inserted less than half the metrics.</li><li>These are tests using TimescaleDB single-node. With TimescaleDB multi-node, insert rates well over 10 million metrics per second are supported.</li><li>That said, if your insert requirements are far below these benchmarks (e.g., a few thousand metrics/second), then insert performance will not be your bottleneck.</li></ul><p><strong>Full results:</strong></p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="http://3.88.198.173/content/images/2020/12/image-14.png" class="kg-image" alt="TimescaleDB vs. Amazon Timestream: 6000x higher inserts, 5-175x faster queries, 150x-220x cheaper" srcset="http://3.88.198.173/content/images/size/w600/2020/12/image-14.png 600w, http://3.88.198.173/content/images/size/w1000/2020/12/image-14.png 1000w, http://3.88.198.173/content/images/size/w1600/2020/12/image-14.png 1600w, http://3.88.198.173/content/images/2020/12/image-14.png 2280w" sizes="(min-width: 720px) 720px"><figcaption><em>TimescaleDB vs. Amazon Timestream Insert Rate comparison ratios</em></figcaption></figure><p><strong>More information on database configuration for this test:</strong></p><p><em>Batch size</em><br>From our research and community members’ feedback, we’ve found that larger batch sizes <em>generally</em> provide better insert performance. (It’s one of the reasons we created tools like <a href="https://github.com/timescale/timescaledb-parallel-copy">Parallel COPY</a> to help our users insert data in large batches concurrently.) </p><p>In our benchmarking tests for TimescaleDB, the batch size was set to 10,000, something we’ve found works well for this kind of high throughput. The batch size, however, is completely configurable and often worth customizing based on your application requirements.<br><br>Amazon Timestream, on the other hand, has a fixed batch size limit of 100 values. 
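To see why the 100-value cap hurts, compare the number of write requests needed to load our 1-billion-metric dataset at each batch size (a simplified model we added, assuming one write request per batch):

```python
import math

total_metrics = 1_000_000_000

# Requests needed at each batch size. One write request per batch is a
# simplifying assumption -- real clients also pay serialization and
# network costs on every request.
timescaledb_requests = math.ceil(total_metrics / 10_000)  # our TSBS batch size
timestream_requests = math.ceil(total_metrics / 100)      # Timestream's cap

print(timescaledb_requests)  # 100000 requests
print(timestream_requests)   # 10000000 requests -- 100x more round trips
```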
This small batch limit seems to add significant overhead, and insert latency increases dramatically as the number of metrics we try to insert at one time grows. This is one of the main reasons we believe insert performance was so much slower with Amazon Timestream.<br><br><em>Additional database configurations</em> <br>For TimescaleDB, we set the chunk time depending on the data volume, aiming for 7-16 chunks in total for each configuration (<a href="https://docs.timescale.com/latest/using-timescaledb/hypertables">see our documentation for more on hypertables - "chunks"</a>). <br><br>With Amazon Timestream, there aren’t additional settings you can tweak to try to improve insert performance - at least none that we found given the tools provided by Amazon. As mentioned in the machine configuration section above, we had to set the memory store retention period equal to ~36 days to ensure we would be able to get all of our data inserted before the magnetic store retention period kicked in.</p><h2 id="query-performance-comparison">Query performance comparison</h2><figure class="kg-card kg-image-card kg-card-hascaption"><img src="http://3.88.198.173/content/images/2020/12/image-7.png" class="kg-image" alt="TimescaleDB vs. Amazon Timestream: 6000x higher inserts, 5-175x faster queries, 150x-220x cheaper" srcset="http://3.88.198.173/content/images/size/w600/2020/12/image-7.png 600w, http://3.88.198.173/content/images/size/w1000/2020/12/image-7.png 1000w, http://3.88.198.173/content/images/2020/12/image-7.png 1041w" sizes="(min-width: 720px) 720px"><figcaption><em>Note: Several queries’ ratios (high-cpu-all, lastpoint, groupby-orderby-limit) are “undefined” because Amazon Timestream did not finish executing them within the default 60 second timeout period that Timestream imposes, while TimescaleDB completed them in less than a single second.</em></figcaption></figure><p>Measuring query latency is complex. 
Unlike inserts, which vary primarily with cardinality, the universe of possible queries is essentially infinite, especially with a language as powerful as SQL. Often, the best way to benchmark read latency is to do it with the actual queries you plan to execute. For this comparison, we use a broad set of queries that mimic the most common time-series query patterns.</p><p>For benchmarking query performance, we decided to use a c5n.2xlarge EC2 instance to perform the queries with Amazon Timestream. Our hope was that having more memory and network throughput available to the query application would give Amazon Timestream a better chance. The client for TimescaleDB was unchanged.</p><p>Recall that we ran these queries on Amazon Timestream with a dataset that was 40% the size of the one we ran on TimescaleDB (410 million vs. 1 billion metrics), owing to the insert problems we had above. Also, because we had to set the memory store retention period to ~36 days, all of the data we queried was in the fastest storage available. These two advantages <em>should</em> have given Amazon Timestream a considerable edge.</p><p><strong>That said, TimescaleDB still outperformed Amazon Timestream by 5x to 175x, depending on the query, with Amazon Timestream unable to finish several of the queries. </strong></p><p>The results shown below are the average of 1,000 queries for each query type. Latencies in this chart are all shown as milliseconds, with an additional column showing the relative performance of TimescaleDB compared to Amazon Timestream.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="http://3.88.198.173/content/images/2020/12/image-8.png" class="kg-image" alt="TimescaleDB vs. 
Amazon Timestream: 6000x higher inserts, 5-175x faster queries, 150x-220x cheaper" srcset="http://3.88.198.173/content/images/size/w600/2020/12/image-8.png 600w, http://3.88.198.173/content/images/size/w1000/2020/12/image-8.png 1000w, http://3.88.198.173/content/images/2020/12/image-8.png 1140w" sizes="(min-width: 720px) 720px"><figcaption><em>Results of benchmarking query performance between TimescaleDB and Amazon Timestream</em></figcaption></figure><p><strong>Results by query type:</strong></p><p><em>SIMPLE ROLLUPS</em><br>For simple rollups (i.e., groupbys), when aggregating one metric across a single host for 1 or 12 hours, or multiple metrics across one or multiple hosts (either for 1 hour or 12 hours), TimescaleDB significantly outperforms Amazon Timestream by 11x to 28x.<br></p><p><em>AGGREGATES</em><br>When calculating a simple aggregate for 1 device, TimescaleDB again outperforms Amazon Timestream by a considerable margin, returning results for each of 1,000 queries more than 19x faster.<br></p><p><em>DOUBLE ROLLUPS</em><br>For double rollups aggregating metrics by time and another dimension (e.g., GROUPBY time, deviceId), TimescaleDB again achieves significantly better performance, 5x to 12x.<br></p><p><em>THRESHOLDS</em><br>When selecting rows based on a threshold (CPU &gt; 90%), we see Amazon Timestream really begin to fall apart. Finding the last reading for one host greater than 90% performs 170x better with TimescaleDB compared to Amazon Timestream. And the second variation of this query, trying to find the last reading greater than 90% for all 100 hosts (in the last 31 days), never finished in Amazon Timestream.</p><p><em>Again, to be fair and ensure our query was returning the data we expected, we did manually run one of these queries in the Amazon Timestream Query interface of the AWS Console. It would routinely finish in 30-40 seconds (which would still be 36x slower than TimescaleDB). 
In addition, running 100 of these queries at a time with the benchmark suite appears to be too much for the query engine, and results for the first set of 100 queries didn’t complete after more than 10 minutes of waiting.</em></p><p><em>COMPLEX QUERIES</em><br>Likewise, for complex queries that go beyond rollups or thresholds, there is no comparison. TimescaleDB vastly outperforms Amazon Timestream, in most cases because Amazon Timestream never returned results for the first set of 100 queries. Just like the complex aggregate above that failed to return any results when queried in batches of 100, these complex queries never returned results with the benchmark client.</p><p>In each case, we attempted to run the queries multiple times, ensuring that no other clients or processes were inserting or accessing data. We also ran at least one of the queries manually in the AWS Console to verify that it worked and that we got the expected results. However, when running these kinds of queries in parallel, there seems to be a major issue with Amazon Timestream being able to satisfy the requests.</p><p>For these more complex queries that return results from Amazon Timestream, TimescaleDB provides real-time responses (e.g., 10-100s of milliseconds), while Amazon Timestream sees significant human-observable delays (seconds). 
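</p><p>To make these query shapes concrete, here is roughly what the two patterns discussed above look like in TimescaleDB, which is just standard PostgreSQL. (A sketch only: the table and column names <code>cpu</code>, <code>hostname</code>, and <code>usage_user</code> are illustrative stand-ins modeled on a typical devops dataset, not the exact benchmark schema.)</p><pre><code>-- Threshold: the last reading above 90% CPU for a single host
SELECT *
FROM cpu
WHERE hostname = 'host_001'
  AND usage_user &gt; 90.0
ORDER BY time DESC
LIMIT 1;

-- "lastpoint": the most recent reading for every host
SELECT DISTINCT ON (hostname) *
FROM cpu
ORDER BY hostname, time DESC;</code></pre><p>Both are ordinary SQL; TimescaleDB’s time-based chunk exclusion is what keeps queries like these fast as the hypertable grows.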
</p><p>And remember, this dataset only had a cardinality of 100 hosts, the lowest cardinality we typically test with the Time-Series Benchmarking Suite (and we were unable to test higher cardinality datasets because of Amazon Timestream issues).</p><p>Notice that Timescale exhibits 48x-175x the performance of Amazon Timestream on these complex queries, many of which are common to historical time-series analysis and monitoring.</p><p><strong>Read latency performance summary</strong></p><ul><li>For simple queries, TimescaleDB outperforms Amazon Timestream in every category.</li><li>When selecting rows based on a threshold, TimescaleDB outperforms Amazon Timestream by a significant margin, being over 175x faster.</li><li>For complex queries with even low cardinality, Amazon Timestream was unable to return results for sets of 100 queries within the 60-second default query timeout.</li><li>Concurrent query load over the same time range seems to impact Amazon Timestream dramatically.</li></ul><h2 id="cost-comparison-details">Cost comparison details</h2><figure class="kg-card kg-image-card"><img src="http://3.88.198.173/content/images/2020/12/image-18.png" class="kg-image" alt="TimescaleDB vs. Amazon Timestream: 6000x higher inserts, 5-175x faster queries, 150x-220x cheaper" srcset="http://3.88.198.173/content/images/size/w600/2020/12/image-18.png 600w, http://3.88.198.173/content/images/size/w1000/2020/12/image-18.png 1000w, http://3.88.198.173/content/images/size/w1600/2020/12/image-18.png 1600w, http://3.88.198.173/content/images/2020/12/image-18.png 1842w" sizes="(min-width: 720px) 720px"></figure><p>These performance differences between TimescaleDB and Amazon Timestream lead to massive differences in costs for the same workloads. 
</p><p>To compare costs, we calculated the cost for our above insert and query workloads, which store 1 billion metrics in TimescaleDB and ~410 million metrics in Amazon Timestream (because we could not load the full 1 billion), and run our suite of queries on top.</p><p>Pricing for Amazon Timestream is rather complex. In all, our bill for testing Amazon Timestream over the <strong>course of 7 days</strong> cost us $336.39, <strong>which does not include any Amazon EC2 charges </strong>(which we needed for the extra clients). During that time, our bill shows that we:</p><ul><li>Inserted 100GB of data (~500 million metrics total across <em><strong>all</strong></em> of our attempts to ingest data)</li><li>Stored a lot of data in memory (and we continue to be charged per hour for that data)</li><li>Queried 21 TB of data when running 25,000 real-world queries</li></ul><p>For comparison, our tests for TimescaleDB (inserts and queries) completed in far less than an hour, and our DigitalOcean droplet costs $1.50/hour. We also ran this test on Timescale Forge, our fully managed TimescaleDB service, and it also completed in far less than an hour, and the instance (8 vCPU, 32GB, 1TB) cost $2.18/hour.</p><figure class="kg-card kg-image-card"><img src="http://3.88.198.173/content/images/2020/12/image-12.png" class="kg-image" alt="TimescaleDB vs. Amazon Timestream: 6000x higher inserts, 5-175x faster queries, 150x-220x cheaper" srcset="http://3.88.198.173/content/images/size/w600/2020/12/image-12.png 600w, http://3.88.198.173/content/images/size/w1000/2020/12/image-12.png 1000w, http://3.88.198.173/content/images/size/w1600/2020/12/image-12.png 1600w, http://3.88.198.173/content/images/2020/12/image-12.png 2280w" sizes="(min-width: 720px) 720px"></figure><p><strong>$336.39 for Amazon Timestream vs. 
$2.18 for fully-managed TimescaleDB ($1.50 if you would rather self-manage the instance), which means that TimescaleDB is 154x cheaper than Amazon Timestream (224x cheaper if self-managed) – and it loaded </strong><em><strong>and</strong></em><strong> queried over twice the number of metrics.</strong></p><h3 id="now-let-s-dig-a-little-deeper-into-our-amazon-timestream-bill-">Now, let’s dig a little deeper into our Amazon Timestream bill.</h3><p>When using Amazon Timestream, users are charged for usage in four main categories: data ingested, data stored in memory, data stored in magnetic storage, and the amount of data scanned to satisfy your queries.</p><p>Data that resides in the memory store costs $0.036 GB/hour, while data that is eventually moved to magnetic storage costs only $0.03 GB/month. For our 1-month memory store setting (which was required to insert 30 days of historical data), <strong>that’s more than a 720x difference in cost for the same data</strong>. What’s more, since Amazon Timestream doesn’t expose any information about how data is stored in the Console or otherwise, we have no idea how well compressed this data is or if there is anything more we could do to reduce storage.</p><p>The real surprise, however, came with querying data because the charges don’t scale with performance. Instead, you will be charged for the amount of data scanned to produce a query result, no matter how fast that result comes back. In almost all other database-as-a-service offerings, you can modify the storage, compute, or cluster size (at a known cost) to get better performance.</p><p>After waiting nearly two days to insert 410 million metrics, we created the traditional set of query batches (as outlined above) and began to run our queries. <br><br>In total, we had 15 query files, each with 1,000 queries, for a total of 15,000 queries to run against both Amazon Timestream and TimescaleDB. 
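</p><p>The arithmetic behind the headline ratios in this section is simple enough to check directly (a quick sketch; the 720-hour figure assumes a 30-day month):</p><pre><code>-- Amazon Timestream bill vs. TimescaleDB hourly cost
SELECT round(336.39 / 2.18) AS vs_managed_timescaledb,   -- 154x
       round(336.39 / 1.50) AS vs_self_managed;          -- 224x

-- Memory store ($0.036 per GB-hour) vs. magnetic store ($0.03 per GB-month)
SELECT round(0.036 * 720 / 0.03) AS memory_to_magnetic;  -- 864x, i.e. "more than 720x"</code></pre><p>Those per-GB storage numbers are why keeping ~36 days of data pinned in the memory store weighed so heavily on our storage charges.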
</p><p>While some of the queries are certainly complex, others just ask for the most recent point for each host (and remember, this dataset only had 100 hosts).  Also, recall from our query performance comparison that a few of the most complex queries were unable to return results for just 100 queries, let alone the full 1,000 query test. </p><p><strong>With Amazon Timestream, you are still charged for the data that was read, even if the query was ultimately canceled or never returned a result.</strong></p><p>To validate results, we ran each query file twice. In the case where 3 of the query files failed to return results, we attempted to execute them 5 times, hoping for some result. Doing the math, this means that we ran around 25,000 queries. In doing so, Amazon says that we scanned 21,598.02 GB of data, which cost $215.98. There were certainly a few other ad hoc queries performed through the AWS Console UI, but before we started running the benchmarking queries, the total cost for scanning data was about $15.00.</p><p>Furthermore, as we’ve mentioned a few times, there is no built-in support to help you identify which queries are scanning too much data and how you might improve them. 
For comparison, both Amazon Redshift and Amazon RDS provide this kind of feedback in their AWS Console interface.</p><p>When we consider some of the recent user applications that we have highlighted elsewhere on our blog, like <a href="http://3.88.198.173/blog/how-flightaware-fuels-flight-prediction-models-with-timescaledb-and-grafana/">FlightAware</a> or <a href="http://3.88.198.173/blog/how-clevabit-builds-data-pipeline-for-agricultural-iot/">clevabit</a>, 25,000 queries of various shapes and sizes would easily be run in a few hours or less.</p><p>While the bytes scanned might improve over time as partitioning improves, unless you need to scale storage beyond a few petabytes of data, it’s hard to see how this would be less costly than a fixed compute and storage cost.</p><h2 id="reliability-comparison-details">Reliability comparison details</h2><p>Another cardinal rule for a database: it cannot lose or corrupt your data. In this respect, the serverless nature of Amazon Timestream requires that you trust that Amazon will not lose your data and that all of it will be stored without corruption. This is probably a pretty safe bet. In fact, many companies rely on services like Amazon S3 or Amazon Glacier to store their data as a reliable backup solution. </p><p>The problem is that we don’t know where our time-series data is stored in Amazon Timestream — because Amazon does not tell us.</p><p>This presents a specific challenge that Amazon hasn’t addressed natively: validating or backing up your data.</p><p>In their <a href="https://docs.aws.amazon.com/timestream/latest/developerguide/timestream.pdf">240-page development guide</a>, the words “recovery” and “restore” don’t appear at all, and the word “backup” appears only once … to tell the developer that there isn’t a backup mechanism. 
Instead, users can “...write your own application using the Timestream SDK to query data and save it to the destination of your choice” (page 100).</p><p>This is not to say that Amazon Timestream will lose or corrupt your data. As we mentioned, Amazon S3, for instance, is a widely known and used service for data storage. The issue here is that we’re unable to learn or easily verify where our data resides and how it’s protected during a service interruption.</p><p>We also found it worrisome that with Amazon Timestream, there isn’t a mechanism or support to DELETE or UPDATE existing data. <strong>The only way to remove data is to drop the entire table.</strong> Furthermore, there is no way to recover a deleted table, since dropping it is an atomic action that cannot be undone through any Amazon API or Console. </p><p>Even if one were to write their own backup and restore utility, there is no method for importing more than the most recent year of data because of the memory store retention period limitation. </p><p>As an Amazon Timestream user, all these limitations put us in a precarious position. There’s no easy way to back up our data, or restore it once we’ve accumulated more than a year's worth. Even if Amazon never loses our data, deleting an essential table of data <a href="https://www.geekwire.com/2015/starbucks-back-in-business-internal-report-blames-deleted-database-table-indicates-outage-was-global/">through human error is not uncommon</a>.</p><p>TimescaleDB uses a dramatically different design principle: build on PostgreSQL. As noted previously, this allows TimescaleDB to inherit over 25 years of dedicated engineering effort from the entire PostgreSQL community, which has built a rock-solid database that supports millions of applications worldwide. 
(In fact, this principle was at the core of our initial <a href="http://3.88.198.173/blog/when-boring-is-awesome-building-a-scalable-time-series-database-on-postgresql-2900ea453ee2/?utm_source=timescale-influx-benchmark&amp;utm_medium=blog&amp;utm_campaign=july-2020-advocacy&amp;utm_content=timescale-launch-blog">TimescaleDB launch announcement</a>.)</p><h2 id="query-language-ecosystem-and-ease-of-use-comparison-details">Query language, ecosystem, and ease-of-use comparison details</h2><p>We applaud Amazon Timestream’s decision to adopt SQL as their query language. We have always been big fans and vocal advocates of SQL, which has become the query language of choice for data infrastructure, is well-documented, and currently ranks as the <a href="https://insights.stackoverflow.com/survey/2020#most-popular-technologies">third-most commonly used programming language among developers</a> (<a href="http://3.88.198.173/blog/why-sql-beating-nosql-what-this-means-for-future-of-data-time-series-database-348b777b847a/">see our SQL vs. NoSQL comparison</a> for more details). </p><p>Even if Amazon Timestream functions like a NoSQL database in many ways, opting for SQL as the query interface lowers developers’ barrier to entry – especially when compared to other databases like MongoDB and InfluxDB with their proprietary query languages.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="http://3.88.198.173/content/images/2020/12/image-11.png" class="kg-image" alt="TimescaleDB vs. Amazon Timestream: 6000x higher inserts, 5-175x faster queries, 150x-220x cheaper" srcset="http://3.88.198.173/content/images/size/w600/2020/12/image-11.png 600w, http://3.88.198.173/content/images/2020/12/image-11.png 860w" sizes="(min-width: 720px) 720px"><figcaption><em>Most popular Programming, Scripting, and Markup Languages. 
Source: </em><a href="https://insights.stackoverflow.com/survey/2020#most-popular-technologies"><em>2020 Stack Overflow Developer Survey</em></a></figcaption></figure><p>As we discussed earlier in this article, Amazon Timestream is not a relational database, despite feeling like one because of its SQL query interface. It doesn’t support normalized datasets, JOINs across tables, or even some common “tricks of the trade” like a simple LATERAL JOIN and correlated subqueries. </p><p>Between these SQL limitations and the narrow table model that Amazon Timestream enforces on your data, writing efficient (and easily readable) queries can be a challenge. </p><p><em>Example</em><br>To see a brief example of how Amazon Timestream’s “narrow” table model impacts the SQL that you write, let’s look at an example given in the Timestream documentation, <a href="https://docs.aws.amazon.com/timestream/latest/developerguide/sample-queries.iot-scenarios.html">Queries with aggregate functions.</a> </p><p>Specifically, we’ll look at the example to “<em>find the average load and max speed for each truck for the past week</em>”:</p><p><strong>Amazon Timestream SQL (CASE statement needed)</strong></p><pre><code>SELECT
    bin(time, 1d) as binned_time,
    fleet,
    truck_id,
    make,
    model,
    AVG(
        CASE WHEN measure_name = 'load' THEN measure_value::double ELSE NULL END
    ) AS avg_load_tons,
    MAX(
        CASE WHEN measure_name = 'speed' THEN measure_value::double ELSE NULL END
    ) AS max_speed_mph
FROM "sampleDB".IoT
WHERE time &gt;= ago(7d)
AND measure_name IN ('load', 'speed')
GROUP BY fleet, truck_id, make, model, bin(time, 1d)
ORDER BY truck_id</code></pre><p><strong>TimescaleDB SQL</strong></p><pre><code>SELECT
    time_bucket('1 day', time) as binned_time,
    fleet,
    truck_id,
    make,
    model,
    AVG(load) AS avg_load_tons,
    MAX(speed) AS max_speed_mph
FROM "public".IoT
WHERE time &gt;= now() - INTERVAL '7 days'
GROUP BY fleet, truck_id, make, model, binned_time
ORDER BY truck_id</code></pre><p>If the above is any indication, even the simplest aggregate queries in Amazon Timestream require multiple levels of CASE statements and column renaming. We were unable to find a simpler way to write this query or to pivot the results.</p><p>Conversely, as evidenced in the above example, with TimescaleDB, we use standard SQL syntax. Additionally, any query that already works with your PostgreSQL-supported applications will “just work”. The same isn’t true for Amazon Timestream.</p><p>So while the decision to adopt a SQL-like query language is a great start for Amazon Timestream, there is still a lot to be desired for a truly frictionless, developer-first experience.</p><h2 id="summary"><strong>Summary</strong></h2><p>No one wants to invest in a technology only to have it limit their growth or scale in the future, let alone invest in something that's the wrong fit today.</p><p>Before making a decision, we recommend taking a step back and analyzing your stack, your team's skills, and your needs (now and in the future). It could be the difference between infrastructure that evolves and grows with you and one that forces you to start all over.</p><p>In this post, we performed a detailed comparison of TimescaleDB and Amazon Timestream. We don’t claim to be Amazon Timestream experts, so we’re open to any suggestions on how to improve this comparison – and invite you to perform your own and share your results.</p><p>In general, we aim to be as transparent as possible about our data models, methodologies, and analysis, and we welcome feedback. We also encourage readers to raise any concerns about the information we’ve presented in order to help us with benchmarking in the future.</p><p>We recognize that <a href="http://www.timescale.com/?utm_source=timescale-influx-benchmark&amp;utm_medium=blog&amp;utm_campaign=july-2020-advocacy&amp;utm_content=homepage">TimescaleDB</a> isn’t the only time-series solution on the market. 
There are situations where it might not be the best time-series database choice, and we strive to be upfront in admitting where an alternate solution may be preferable.</p><p>We’re always interested in holistically evaluating our solution against others, and we’ll continue to share our insights with the greater community.</p><p><strong>Want to learn more about TimescaleDB?</strong></p><p><a href="https://www.timescale.com/timescale-signup">Create a free account </a>to get started with a fully-managed TimescaleDB instance (100% free for 30 days). </p><p>Want to host TimescaleDB yourself? <a href="https://github.com/timescale/timescaledb">Visit our GitHub</a> to learn more about options, get installation instructions, and more (and, as always, ⭐️  are  appreciated!)</p><p>Join our <a href="http://slack.timescale.com/">Slack community</a> to ask questions, get advice, and connect with other developers (I, as well as our co-founders, engineers, and passionate community members are active on all channels).</p>]]></content:encoded></item><item><title><![CDATA[What the heck is time-series data (and why do I need a time-series database)?]]></title><description><![CDATA[This article is a primer on time-series data and why you may not want to use a “normal” database to store it. 
✅ Get the basics, key database considerations, and more.]]></description><link>http://3.88.198.173:80/what-the-heck-is-time-series-data-and-why-do-i-need-a-time-series-database-dcf3b1b18563/</link><guid isPermaLink="false">5fdb982d8946b00cbd801305</guid><category><![CDATA[General]]></category><dc:creator><![CDATA[Ajay Kulkarni]]></dc:creator><pubDate>Tue, 01 Dec 2020 16:00:00 GMT</pubDate><media:content url="http://3.88.198.173/content/images/2018/12/timeseries.png" medium="image"/><content:encoded><![CDATA[<img src="http://3.88.198.173/content/images/2018/12/timeseries.png" alt="What the heck is time-series data (and why do I need a time-series database)?"><p><em>A primer on time-series data, what it is, where to store it, and how to analyze it to gain powerful insights. </em></p><p><em>(Note: this post was originally published in November 2018, and republished in December 2020 with updated graphs, new trends, and relevant technical information.)</em></p><p>Here’s a riddle: what do self-driving Teslas, autonomous Wall Street trading algorithms, smart homes, transportation networks that fulfill lightning-fast same-day deliveries, and tracking the daily COVID-19 statistics and air quality in your community have in common?</p><p>For one, they are signs that our world is changing at warp speed, thanks to our ability to capture and analyze more and more data faster than ever before.</p><p>However, if you look closely, you’ll notice that each of these applications requires a special kind of data:</p><ul><li>Self-driving cars continuously collect data about how their environment is changing, adjusting based on weather conditions, potholes, and countless other variables.</li><li>Autonomous trading algorithms continuously collect data on how the markets are changing to optimize returns, both in the short and long-term. 
(Read how financial applications like <a href="https://www.timescale.com/case-studies/transferwise">TransferWise</a> and <a href="http://3.88.198.173/blog/how-i-power-a-successful-crypto-trading-bot-with-timescaledb/">an automated crypto trading bot </a>collect, store, and analyze data.)</li><li>Our smart homes monitor what’s going on inside of them to regulate temperature, identify intruders, and respond to our every beck-and-call (“Alexa, play some relaxing music”).</li><li>Our retail industry monitors how their assets move with such precision and efficiency that cheap same-day delivery is a luxury that many of us take for granted.</li></ul><p>But, far from stock trends, self-driving cars, and knowing the exact minute your next online purchase will arrive, 2020 has provided the most personal example of how time-series data collection and analysis affects our daily lives.</p><p>For the first time in history, worldwide interest in time-series data has peaked in the most unexpected way. COVID-19 and the global pandemic have made billions of people across the globe relentless consumers of time-series data, demanding accurate and timely information about the <a href="http://3.88.198.173/blog/charting-the-spread-of-covid-19-using-timescale/">daily trend of various COVID-19 statistics</a>.</p><p>Having access to detailed, feature rich time-series data has become one of the most valuable commodities in our information-hungry world. Businesses, governments, schools, and communities, large and small, are finding invaluable ways to mine value from analyzing time-series data. (You can read how some real-world teams, like those tracking real-time flight data or building platforms for sustainable farming mine their time-series metrics in our <a href="http://3.88.198.173/tag/dev-q-a/">Developer Q&amp;A series</a>)<br><br>Software developer usage patterns already reflect the same trend. 
In fact, over the past two years, time-series databases (TSDBs) have steadily remained the fastest growing category of databases:</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://lh5.googleusercontent.com/7mEyVzlormlJIYEXPO3ZWiYfawcogFqeEY-jeSaBRq2WF9-_Ai6fg37AEf0afYWAGYI1w4W5OEiqztbrmZgWJ1Q2Mui9UkqQXeTdaoH9WddHpCkeS2LqKMyOrYXM7da5o9M2D54p" class="kg-image" alt="What the heck is time-series data (and why do I need a time-series database)?"><figcaption>Source: <a href="https://db-engines.com/en/ranking_categories ">DB-Engines, November 2020</a></figcaption></figure><p>As the developers of an <a href="https://www.timescale.com/">open-source time-series database</a>, my team and I are often asked about this trend and how it should factor into your decisions about which database to select. Specifically, does it really matter if you start with a database specialized for time-series data – or can you easily transition to one later?</p><p>To answer those questions, let me start with a more in-depth description of what time-series data <em>is</em> and how you might benefit from using a time-series database, and leave you with a few ways to start exploring time-series data and performing your own analysis.</p><h2 id="what-is-time-series-data">What is time-series data?</h2><p>Some think of “time-series data” as a sequence of data points, measuring the same thing over time, stored in time order. That’s true, but it just scratches the surface.</p><p>Others may think of a series of numeric values, each paired with a timestamp, defined by a name and a set of labeled dimensions (or “tags”). 
This is perhaps one way to model time-series data, but not a definition of the data itself.<br><br>At a more granular level, time-series data tracks changes over time, millisecond to millisecond, minute to minute, and day to day, giving us the ability to analyze those changes in ways that were previously impossible.<br><br>In the past, our view of time-series data was more static; the daily highs and lows in temperature, the opening and closing value of the stock market, or even the daily or cumulative hospitalizations due to COVID-19.<br><br>Unfortunately, these totals missed the nuances of how the underlying <em>changes over time </em>contributed to these static values.<br><br>Let’s consider a few examples.<br><br>If I send you $10, a traditional bank database would debit my account and credit your account. Then, if you send me $10, the same process happens in reverse. At the end of this process, our bank balances would look the same, so the bank might think, “Oh, nothing changed this month.” But, with a time-series database, the bank would see, “Hey, these two people keep sending each other $10, there’s likely a deeper relationship here.” Tracking this nuance, our month-ending account balance takes on greater meaning.<br><br>Next, think about an environmental value like mean daily temperature (MDT), the average of the high and low temperature, for consecutive days at a location. Over the last few decades, MDT has been used as a primary variable to calculate buildings’ energy efficiency.  In any given week, MDT might only vary slightly from day-to-day in a location, but the contributing environmental factors could be changing drastically over that same period. 
Instead, knowing how the temperature changed each hour throughout the day, coupled with precipitation, cloud cover, and wind speed during that time, could dramatically improve your ability to model and optimize energy efficiency for your properties.<br><br>Likewise, while knowing the total number of COVID-19 hospitalizations per day in your community is valuable, that number <em>alone</em> isn’t very descriptive. For instance, the hospital might disclose daily numbers that show 20 hospitalizations on Monday and increase slightly throughout the week to total  23 hospitalizations on Friday. At first glance, it looks like a 15% increase in hospitalizations this week – but if we add detail to each of those records (and increase the frequency at which we collect them), we might see that it was a <em>net</em> increase of 3 patients, but in reality there were 10 people discharged and 13 admitted, an increase of 65% for new admissions over the last 5 days. <br><br>Tracking each aspect of patient data over time (e.g.,  patient age, admitted or discharged, days to recovery, etc.) helps us understand how we arrive at the daily counts, allowing us to better analyze trends, accurately report totals, and take action. In the case of total COVID-19 hospitalizations, the details behind this analysis impact public policy in the cities and towns where we live.<br><br>These examples illustrate how modern time-series data is different from what we’ve known in the past.  Time-series data analysis goes far deeper than a pie chart or Excel workbook with columns of summarized totals. 
<br><br>This detailed data doesn’t just include time as a metric, but as a primary component that helps us analyze our data and derive meaningful insights.<br><br>And, there are many other kinds of time-series data, but regardless of the scenario or use case, all time-series datasets have 3 things in common:</p><ol><li>The data that arrives is almost always recorded as a new entry</li><li>The data typically arrives in time order</li><li>Time is a primary axis (time-intervals can be either regular or irregular)</li></ol><p>In other words, time-series data workloads are generally “append-only.” While they may need to correct erroneous data after the fact, or handle delayed or out-of-order data, these are exceptions, not the norm.</p><h2 id="but-i-already-track-a-timestamp">But, I already track a timestamp</h2><p>You may ask: How is this different from just having a time field in a dataset? Well, it depends: how does your dataset track changes? By updating the current entry, or by inserting a new one?</p><p>When you collect a new reading for sensor_x, do you overwrite your previous reading, or do you create a brand new reading in a separate row? While both methods will provide the current state of the system, <strong>you can only analyze the changes in state over time if you insert a new reading each time</strong>.</p><p>Simply put: time-series datasets track changes to the overall system as INSERTs, not UPDATEs.</p><p>This practice of recording each and every change to the system as a new, different row is what makes time-series data so powerful. It allows us to measure and analyze change: what has changed in the past, what is changing in the present, and what changes we can forecast for the future.</p><blockquote>In short, here’s how I like to define time-series data: a collection of values that represents how a system/process/behavior changes over time.</blockquote><p>This is more than just an academic distinction. 
By centering our definition around “change,” we can identify time-series datasets that we aren’t collecting today and spot opportunities to start collecting that data now, so that we can harness its value later. All too often, people have time-series data but don’t realize it.</p><h2 id="time-series-data-hiding-in-plain-sight">Time-series data hiding in plain sight?</h2><p>Can you think of some common examples of time-series data in your day-to-day work? Are there reports or analyses you’ve been asked to help create, but lacked the data fidelity to do so?</p><p>Imagine you maintain a web application. Every time a user logs in, you may just update a “last_login” timestamp for that user in a single row in your “users” table. But, what if you treated each login as a separate event, and collected them over time? With that kind of time-series data you could analyze historical login activity, see how usage is increasing or decreasing over time, bucket users by how often they access the app, and more.</p><p>Another example has become vital to every IT group around the world: operational metrics for servers, networks, applications, environments, and more. This kind of time-series metric data is crucial to keeping the services we rely on running without interruption. By tracking the changes in each metric, IT departments can quickly identify problems, plan for capacity increases during upcoming events, and diagnose if an application update resulted in changed user behavior, for better or worse. (<a href="https://www.timescale.com/case-studies/laika">See how LAIKA uses Timescale to track resource consumption and plan for future needs</a>.)</p><p>These examples illustrate a key point: preserving the inherent time-series nature of our data allows us to preserve valuable information about how that data changes over time. 
You may also notice that both of these examples describe a common type of time-series data known as <em>event data</em>.</p><p>Of course, storing data at this resolution comes with an obvious problem: you end up with a lot of data, rather fast. So that’s the catch: being able to analyze increased amounts of time-series data is more valuable than ever, but it piles up very quickly.</p><p>Having a lot of data creates a different set of problems, both when recording it and when trying to query it in a performant way, which is why people are turning to time-series databases in greater numbers than ever before. The world is demanding that we make better data-driven decisions, faster. The static snapshots found in traditional data won’t cut it. To satisfy the demand, you need to be collecting data at the highest fidelity possible – and that’s what time-series data provides: the dynamic movie of what’s happening across your system (whether it’s your software, your physical power plant, your game, or customers inside your application).</p><h2 id="why-do-i-need-a-time-series-database">Why do I need a time-series database?</h2><p>You might ask: Why can’t I just use a “normal” (i.e., non-time-series) database?</p><p>The truth is that you can, and some people do. But, there are at least two reasons why TSDBs are the fastest-growing category of databases today: <strong>scale</strong> and <strong>usability</strong>.</p><p><strong>Scale: </strong>Time-series data accumulates very quickly, and normal databases are not designed to handle that scale (at least not in an automated way). 
Traditionally, relational databases fare poorly with very large datasets, while NoSQL databases are better at scale (although a relational database fine-tuned for time-series data can actually perform better, as we’ve shown in benchmarks <a href="http://3.88.198.173/blog/timescaledb-vs-influxdb-for-time-series-data-timescale-influx-sql-nosql-36489299877/">versus InfluxDB</a>, <a href="http://3.88.198.173/blog/time-series-data-cassandra-vs-timescaledb-postgresql-7c2cc50a89ce/">versus Cassandra</a>, and <a href="http://3.88.198.173/blog/how-to-store-time-series-data-mongodb-vs-timescaledb-postgresql-a73939734016/">versus MongoDB</a>). In contrast, <em>time-series </em>databases - whether they’re relational or NoSQL-based - introduce efficiencies that are only possible when you treat time as a first-class citizen. These efficiencies allow them to offer massive scale, from performance improvements - higher ingest rates and faster queries at scale (although some support more query types than others) - to better data compression.</p><p><strong>Usability: </strong>TSDBs also typically include built-in functions and operations common to time-series data analysis, such as data retention policies, continuous queries, flexible time aggregations, etc. Even if you’re just starting to collect this type of data and scale is not a concern at the moment, these features can still provide a better user experience and make data analysis tasks easier. 
Having built-in functions and features to analyze trends readily available at the data layer often leads you to discover opportunities you didn’t know existed, no matter how big or small your dataset.</p><p>This is why developers are increasingly adopting time-series databases and using them for a variety of use cases:</p><ul><li>Monitoring software systems: Virtual machines, containers, services, applications</li><li>Monitoring physical systems: Equipment, machinery, connected devices, the environment, our homes, our bodies</li><li>Asset tracking applications: Vehicles, trucks, physical containers, pallets</li><li>Financial trading systems: Classic securities, newer cryptocurrencies</li><li>Eventing applications: Tracking user/customer interaction data</li><li>Business intelligence tools: Tracking key metrics and the overall health of the business</li><li>(and more)</li></ul><p>Once you begin to see more of the information your applications store as time-series data, you still have to pick a time-series database that best fits your data model, write/read pattern, and developer skill sets. Although NoSQL time-series database options have prevailed for the past decade as the storage medium of choice, more and more developers are seeing the downside to storing time-series data separately from business data (most time-series databases don’t provide good support for relational data). In fact, this poor developer experience was one of the driving factors in why <a href="http://3.88.198.173/blog/when-boring-is-awesome-building-a-scalable-time-series-database-on-postgresql-2900ea453ee2/">we started Timescale</a>. Keeping all of your data in one system can drastically reduce application development time – and increase the speed at which you can make key decisions.<br><br>Nowhere is this more evident than with the rise of numerous self-service business intelligence tools like Tableau, Power BI, and yes, even Excel. 
When precious time-series data is kept separate from business data, users struggle to make timely, business-critical observations. Instead, users find that they need to rely on these third-party tools to mash up data into something meaningful. There are many valid and good reasons to use these powerful tools, but being able to quickly query your time-series data alongside meaningful metadata shouldn’t be one of them. SQL has been built and honed over decades to provide efficient ways of generating these valuable aggregations and analyses.<br><br>The bottom line: knowing where your time-series data is – and where you store it – can have a dramatic impact on your future success.</p><h2 id="is-all-data-time-series-data">Is all data time-series data?</h2><p>For the past decade or so, we have lived in the era of “Big Data,” to the point where it’s almost reached buzzword status; organizations of all sizes and types collect massive amounts of information about our world and apply computational resources to make sense of it.</p><p>Even though this era started with modest computing technology, our ability to capture, store, and analyze data has improved at an exponential pace, thanks to major macro-trends: <a href="https://en.wikipedia.org/wiki/Moore%27s_law">Moore’s law</a>, <a href="https://en.wikipedia.org/wiki/Mark_Kryder">Kryder’s law</a>, cloud computing, and an entire industry of “big data” technologies.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://lh3.googleusercontent.com/dFwTykEjtuvTmGGhD3lvKVfpX5Ol3Hg2V1JF7t6i5q7hR9I4QVp61fCVrtPOkZRsGympaTy3zGiowPNfmvVDYYXix_ZTmLCNmXMehCqnehD3eKrE1pA0J42i0ssZ0AlQ60a7NHNa" class="kg-image" alt="What the heck is time-series data (and why do I need a time-series database)?"><figcaption><strong>Under Moore’s Law, computational power (transistor density) doubles every 18 months, while Kryder’s Law postulates that storage capacity doubles every 12 months.</strong></figcaption></figure><p>We are no 
longer content to just observe the state of the world. Now, we need to measure how our world changes over time, down to sub-second intervals. Our “Big Data” datasets are now being dwarfed by another type of data, one that relies heavily on time to preserve information about the change that is happening.</p><p>Does all data start off as time-series data? Recall the earlier web application example: we had time-series data but didn’t realize it – user activity tracking that would help us analyze engagement. Or think of any “normal” dataset. Say, the current accounts and balances at a major retail bank. Or the source code for a software project. Or the text for this article.</p><p>Typically we choose to store the latest state of the system, but instead, what if we stored <em>every</em> change and computed the latest state at query time? Isn’t a “normal” dataset just a view on top of an inherently time-series dataset (cached for performance reasons)? Don’t banks have transaction ledgers? (And aren’t blockchains just distributed, immutable time-series logs?) Doesn’t a software project have version control (e.g., git commits)? Doesn’t this article have revision history? (Undo. Redo.)</p><p>Put differently: Don’t all databases have logs?</p><p>We recognize that many applications may never require time-series data (and would be better served by a “current-state view”). But as we continue along the exponential curve of technological progress, it would seem that these “current-state views” become less necessary. Instead, we’re finding that storing more and more data in its time-series form often helps us to understand it better.</p><p>So is all data time-series data? I’ve yet to find a good counterexample. If you’ve got one, I’m open to hearing it. Regardless, one thing is clear: time-series data already surrounds us. 
It’s time we put it to use.</p><h2 id="mining-for-treasure-with-time-series-analysis">Mining for treasure with time-series analysis</h2><p>Hopefully by now your wheels are turning and you’ve started to identify applications or areas in your business that have time-series data just waiting for you to do something with it. So, now what?</p><p>This is when the fun (and real work) begins. It’s also when you’ll really see why time-series databases are essential tools.</p><p>Let’s look at an example based on the fictional web application we’ve referenced throughout this post. As we discussed, until now we’ve only tracked the last time a user logged in as a field in the “users” table, always updating the previously stored value with the new login information. While this allows us to query how many people have logged in over a week or a month, we’re unable to analyze how <em>often </em>they log in, for how long, or drill into any other aspects that might tell us more about our users’ experience or their usage patterns.</p><p>We can quickly improve upon this by tracking information about <em>every</em> login, not just the most recent one. To do this, we’ll start logging the timestamp of each login and the type of device used to access our application (e.g., phone, tablet, desktop). This small change - tracking just one more property about the user login experience - provides immediate value, allowing us to answer questions like, “what kind of devices are most frequently used (by individual users and across all users)?” and “what time of day are users the most active?”. From there, we can better inform the features we prioritize - such as mobile-specific capabilities - the times we display certain promotional messages, and beyond.</p><p>To track this new data, we add a new table called “user_logins” that references our “users” table. Here’s an example of what the data might look like:</p><p><strong>Users Table</strong></p><!--kg-card-begin: markdown--><table>
<thead>
<tr>
<th>user_id</th>
<th>first_name</th>
<th>last_name</th>
</tr>
</thead>
<tbody>
<tr>
<td>1</td>
<td>Mary</td>
<td>Smith</td>
</tr>
<tr>
<td>2</td>
<td>Eon</td>
<td>Tiger</td>
</tr>
<tr>
<td>3</td>
<td>Ajay</td>
<td>Kulkarni</td>
</tr>
</tbody>
</table>
<!--kg-card-end: markdown--><p><strong>User_logins Table</strong></p><!--kg-card-begin: markdown--><table>
<thead>
<tr>
<th>user_id</th>
<th>login_timestamp</th>
<th>device_type</th>
</tr>
</thead>
<tbody>
<tr>
<td>1</td>
<td>2020-11-08 11:15:00</td>
<td>mobile</td>
</tr>
<tr>
<td>2</td>
<td>2020-11-10 15:34:00</td>
<td>mobile</td>
</tr>
<tr>
<td>1</td>
<td>2020-11-11 12:13:00</td>
<td>desktop</td>
</tr>
<tr>
<td>3</td>
<td>2020-11-15 02:47:00</td>
<td>tablet</td>
</tr>
<tr>
<td>...</td>
<td>...</td>
<td>...</td>
</tr>
</tbody>
</table>
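<p>In SQL, the schema behind these two tables might look like the following sketch (the column types are assumptions, and the create_hypertable() call is the TimescaleDB-specific step that partitions the time-series table by time):</p><pre><code class="language-sql">CREATE TABLE users (
  user_id    BIGINT PRIMARY KEY,
  first_name TEXT,
  last_name  TEXT
);

CREATE TABLE user_logins (
  user_id         BIGINT REFERENCES users (user_id),
  login_timestamp TIMESTAMPTZ NOT NULL,
  device_type     TEXT
);

-- TimescaleDB only: convert user_logins into a hypertable
SELECT create_hypertable('user_logins', 'login_timestamp');</code></pre>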
<!--kg-card-end: markdown--><p>With the updated data model and these new user details logged, we can start to query the data for insights. As mentioned earlier, time-series databases like TimescaleDB help in two crucial ways:</p><p>First, as your application scales and data volume grows, your database is built to handle and ingest the relentless stream of data inherent to time-series workloads, mitigating any negative performance impacts or lags.</p><p>Second, they provide specialized functions that make it easier - and faster - to query aspects of your data in meaningful ways where time is a primary component.</p><p>To demonstrate some of those specialized time-series analysis capabilities, let’s look at a few example functions that TimescaleDB adds to the SQL language – and how we can use them to better analyze our users’ behavioral patterns. (For more examples, <a href="https://docs.timescale.com/latest/using-timescaledb/reading-data#advanced-analytics">see our advanced analytical functions documentation</a>.)</p><p>In each example, we’re still relying on standard SQL patterns, <a href="https://insights.stackoverflow.com/survey/2020#technology-programming-scripting-and-markup-languages-all-respondents">a language that many developers are familiar with</a>, and augmenting it for time-series use cases. WHERE clauses still work, and we can still aggregate data easily with GROUP BY clauses. But now, rather than having to parse out specific parts of the dates in order to group the data appropriately (for instance), we can use a function like <a href="https://docs.timescale.com/v2.0/api#time_bucket">time_bucket()</a> to easily aggregate data across almost any interval.</p><p>And, as a bonus, it also makes the query easier to read!</p><p><strong>Query #1: How many logins per day for the last month?</strong></p><pre><code class="language-sql">SELECT time_bucket('1 day', login_timestamp) as one_day,
  COUNT(*) AS total_logins
FROM user_logins
WHERE login_timestamp &gt; now() - INTERVAL '1 month'
GROUP BY one_day
ORDER BY one_day;</code></pre><p>This first example is the “Hello, World!” of time-series queries, using the time_bucket() function to automatically group and aggregate our time-series data, giving us a quick view of total daily logins (the '1 day' argument) for the last month (the WHERE login_timestamp &gt; now() - INTERVAL '1 month' clause). Notice that time-series queries allow you to <em>specifically query intervals of time</em>, rather than breaking dates down into each component (month, day, year, hour, etc.) as you would have to do to achieve a similar aggregation without these specialized functions.</p><p><strong>Query #2: What was the last login time of each user and what type of device did they use?</strong></p><pre><code class="language-sql">SELECT u.user_id, first_name || ' ' || last_name AS full_name, 
  last(login_timestamp, login_timestamp) AS last_login,
  last(device_type, login_timestamp) AS last_device_type
FROM user_logins ul
  INNER JOIN users u on ul.user_id = u.user_id
WHERE login_timestamp &gt; now() - INTERVAL '1 month'
GROUP BY u.user_id, full_name
ORDER BY u.user_id;
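
-- For comparison (a sketch, not from the original post): without last(), a
-- plain-PostgreSQL equivalent needs something like a LATERAL join to pick
-- each user's latest row:
SELECT u.user_id, u.first_name || ' ' || u.last_name AS full_name,
  l.login_timestamp AS last_login,
  l.device_type AS last_device_type
FROM users u,
  LATERAL (SELECT login_timestamp, device_type
           FROM user_logins
           WHERE user_id = u.user_id
             AND login_timestamp &gt; now() - INTERVAL '1 month'
           ORDER BY login_timestamp DESC
           LIMIT 1) l;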
</code></pre><p>In this more complex example, we use another specialized function, <a href="https://docs.timescale.com/v2.0/api#last">last()</a>, to query useful information about our users – specifically, the most recent value in an ordered set of data. Without a specialized function like last(), we would need to write a query with something like a LATERAL JOIN or a correlated subquery. But, with our handy built-in specialized function, we’re able to get this type of valuable information in a straightforward (and often very quick) way.</p><p><strong>Query #3: For the last week, which 6-hour periods saw the most log-ins from users on tablet devices?</strong></p><pre><code class="language-sql">SELECT time_bucket('6 hours', login_timestamp, timestamptz '2020-01-01 08:00:00') as device_bucket,
  device_type,
  count(*) AS logins_by_device
FROM user_logins
WHERE login_timestamp &gt; now() - INTERVAL '1 week'
  AND device_type = 'tablet'
GROUP BY device_bucket, device_type
ORDER BY logins_by_device desc;
</code></pre><p>In this final example query, we demonstrate how functions like time_bucket() aren’t bound to common intervals (‘1 hour’, ‘1 day’, ‘1 week’, etc.), but can group data across any INTERVAL you specify. And, more notably, we can combine these functions with parameters that allow us to refine our results to a specific subset. In this case, we asked TimescaleDB to return results in 6-hour buckets, aligning the first bucket to 8 AM UTC, and to only return log-ins from tablet-based sessions.</p><p>These examples just scratch the surface; you have infinite flexibility in how your data can be queried and modeled.</p><p>In summary, logging just two additional details about user logins - device type and timestamps for every log-in, not just the latest - quickly transforms our ability to understand how our web application is used – and how time-series databases like TimescaleDB help us analyze and make sense of data, so we can make decisions faster.</p><h2 id="now-it-s-your-turn-resources-to-get-started">Now, it’s your turn: resources to get started</h2><p>If you’re convinced you need a time-series database, or just want to try it out for yourself, <a href="http://www.timescale.com/timescale-signup">spin up a fully-managed TimescaleDB instance</a> - free for 30 days. <br></p><p>From there, <a href="https://docs.timescale.com/tutorials/tutorial-hello-timescale">follow our intro tutorial</a> to configure your database and execute your first query, then tackle more advanced time-series analysis with <a href="https://docs.timescale.com/latest/tutorials/analyze-cryptocurrency-data">our cryptocurrency</a> and <a href="https://docs.timescale.com/latest/tutorials/tutorial-forecasting">time-series forecasting</a> tutorials.</p><p><br>Have questions or want to learn more? 
Join our <a href="http://slack.timescale.com/">Slack community</a>, where you’ll find me, Timescale engineers, and community members active in all channels.<br></p>]]></content:encoded></item><item><title><![CDATA[Timescale Newsletter Roundup: October Edition]]></title><description><![CDATA[Get a cornucopia of resources to help you do more with your time-series data - with a special focus on all things open source ✨.]]></description><link>http://3.88.198.173:80/timescale-newsletter-roundup-october-edition/</link><guid isPermaLink="false">5fdb982d8946b00cbd80137c</guid><category><![CDATA[Newsletter]]></category><dc:creator><![CDATA[Lacey Butler]]></dc:creator><pubDate>Wed, 04 Nov 2020 18:26:24 GMT</pubDate><media:content url="http://3.88.198.173/content/images/2020/11/ekaterina-shevchenko-ZLTlHeKbh04-unsplash.jpg" medium="image"/><content:encoded><![CDATA[<img src="http://3.88.198.173/content/images/2020/11/ekaterina-shevchenko-ZLTlHeKbh04-unsplash.jpg" alt="Timescale Newsletter Roundup: October Edition"><p>Get a cornucopia of resources to help you do more with your data - with a special focus on all things open source ✨ - and some of our favorite new content from TimescaleDB community members.</p><p>We’re always releasing new features, creating new documentation and tutorials, and hosting virtual sessions to help developers do amazing things with their data. 
And, to make it easy for our community members to discover and get the resources they need to power their projects, teams, or business with analytics, we round up our favorite new pieces in our biweekly newsletter.</p><p>We’re on a mission to teach the world about time-series data, supporting and growing communities around the world.</p><p>And, sharing educational resources as broadly as possible is one way to do just that :).</p><p><strong>Here’s a snapshot of the content we shared with our readers this month (<a href="https://www.timescale.com/signup/newsletter/">subscribe</a> to get updates straight to your inbox).</strong></p><h2 id="product-updates-announcements">Product updates &amp; announcements</h2><p><strong>[⭐️ BONUS Product Update]: <a href="http://3.88.198.173/blog/timescaledb-2-0-a-multi-node-petabyte-scale-completely-free-relational-database-for-time-series/">TimescaleDB 2.0 RC - multi-node, petabyte-scale, 100% free relational database for time-series - has arrived</a> &gt;&gt; </strong></p><p>This news didn't quite make it into our biweekly newsletter cadence, but it's definitely a noteworthy October happening; 2.0 is a huge milestone for us, the TimescaleDB community, and the industry as a whole: TimescaleDB is now a multi-node, petabyte-scale relational database for time-series 
– and it's free. </p><ul><li>🚀 <a href="http://3.88.198.173/blog/timescaledb-2-0-a-multi-node-petabyte-scale-completely-free-relational-database-for-time-series/" rel="noopener noreferrer"><strong>Read our announcement blog post</strong></a> to learn what's new, our journey to 2.0, and why we believe relational databases are the past <strong>and </strong>future of software development.</li><li>🎓 <strong><a href="https://tsdb.co/2-0-playlist" rel="noopener noreferrer">Watch our All Things TimescaleDB 2.0 YouTube playlist</a> (5 videos) </strong>to get an overview of all new features, then dive into feature-specific videos, demos, and tips.</li><li>🐤 <a href="https://twitter.com/michaelfreedman/status/1321858831689895936"><strong>See this Twitter thread</strong></a> from Mike, Timescale CTO, for a quick - and emoji-packed - breakdown of our announcement.</li></ul><figure class="kg-card kg-embed-card kg-card-hascaption"><iframe width="612" height="344" src="https://www.youtube.com/embed/5pbZYHTu-eY?feature=oembed" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe><figcaption><a href="https://tsdb.co/2-0-playlist">Check out our explainer video series</a> for a whirlwind tour of what's new in TimescaleDB 2.0</figcaption></figure><p><strong>[Product Update]: <a href="http://3.88.198.173/blog/promscale-analytical-platform-long-term-store-for-prometheus-combined-sql-promql-postgresql/">Introducing Promscale: an open-source analytical platform for Prometheus metrics</a></strong> <strong>&gt;&gt;</strong></p><p>We just announced Promscale, a new open-source platform built to scale and augment Prometheus for analytics, combining the power of PromQL and SQL with a rock-solid long-term data store. 
Ask any question, analyze recent &amp; historical data, assess issues in real-time, forecast future trends, and more 🚀.</p><ul><li>🔥 <a href="http://3.88.198.173/blog/promscale-analytical-platform-long-term-store-for-prometheus-combined-sql-promql-postgresql/">See our blog post</a> to learn more about Promscale, how it originated (3.5+ years of community feedback!), how it works, and ways to get started.</li><li>⚙ Go to our <a href="https://github.com/timescale/promscale">GitHub README</a> for various installation options - we recommend tobs, our CLI tool.</li><li>🙋 Have feedback or questions? <a href="https://slack.timescale.com/">Let us know on Slack</a> (#Prometheus channel).</li></ul><figure class="kg-card kg-embed-card kg-card-hascaption"><iframe width="612" height="344" src="https://www.youtube.com/embed/FWZju1De5lc?feature=oembed" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe><figcaption>Watch this 15 min. demo video to learn how Promscale works, plus ways to query your data and answer questions about your systems (using PromQL, SQL, and Grafana).</figcaption></figure><h2 id="new-technical-content-videos-tutorials">New technical content, videos &amp; tutorials</h2><p><strong>[PostgreSQL Pro Tips]:</strong> <strong><a href="https://postgrescheatsheet.com/">Save time with PostgreSQL Cheatsheet</a> &gt;&gt;</strong></p><p>We’ve rounded up essential psql commands in one easy-to-navigate place, so you spend more time querying your data, not trying to remember that command that always escapes you. 
Click, copy, done ✅.</p><p><strong>[PostgreSQL Pro Tips]:</strong> <strong><a href="https://docs.timescale.com/latest/using-timescaledb/reading-data/#advanced-analytics">Get 10+ PostgreSQL functions for advanced analytics</a> &gt;&gt;</strong></p><p>Use this to-the-point reference documentation to run complex queries on your time-series data, from calculating deltas with window functions and finding anomalies in your monitoring metrics to generating histograms.</p><h2 id="new-remote-friendly-events-community"><strong>New #remote-friendly events &amp; community</strong></h2><p><strong>[Office Hours with Mike]: <a href="https://www.timescale.com/office-hours/">Join our next monthly Q &amp; A and time-series watercooler session </a>&gt;&gt;</strong></p><p>Fun fact: Mike - our CTO - is also a computer science professor at Princeton, so it’s only fitting that he hosts <em>our</em> Office Hours. Each month’s session is different, with topics ranging from TimescaleDB-specific to all things database optimization, favorite tools, and distributed computing.</p><ul><li>✅  <a href="https://www.timescale.com/office-hours">RSVP for Nov. 10</a> - everyone’s welcome, whether you have a question or just want to talk time-series, PostgreSQL, and open source.</li></ul><p><strong>[Virtual Session]: <a href="https://www.youtube.com/watch?v=nMObJ8EEtXE&amp;feature=emb_logo">Observability Solutions w/ Open-Source Software: Lessons from the Field</a> (demos and recommendations) &gt;&gt;</strong></p><p>Catch <a href="https://twitter.com/avthars">@avthars</a> Open Source Summit EU recording to learn how to build a flexible observability stack with 100% open-source (aka free!) tools. 
You’ll get a breakdown of available open-source components, hear considerations and best practices from real engineers, and see how to get started in &lt;5 mins.</p><ul><li>⭐ <a href="https://avthar.com/blog/oss-eu-2020">Visit Avthar's blog</a> to learn more about his inspiration for the session and get additional resources.</li><li>🧵 <a href="https://twitter.com/avthars/status/1320809418372616194">See Twitter thread</a> for talk highlights and key takeaways. </li></ul><figure class="kg-card kg-image-card kg-card-hascaption"><img src="http://3.88.198.173/content/images/2020/11/Screen-Shot-2020-11-02-at-12.49.57-PM.png" class="kg-image" alt="Timescale Newsletter Roundup: October Edition" srcset="http://3.88.198.173/content/images/size/w600/2020/11/Screen-Shot-2020-11-02-at-12.49.57-PM.png 600w, http://3.88.198.173/content/images/2020/11/Screen-Shot-2020-11-02-at-12.49.57-PM.png 898w" sizes="(min-width: 720px) 720px"><figcaption>Check out <a href="https://twitter.com/avthars/status/1320809418372616194">@avthars Twitter thread</a> to learn more – and <a href="https://github.com/timescale/tobs">visit <code>tobs</code> GitHub repo</a> to get started</figcaption></figure><p><strong>[Community Spotlight #1]: <a href="http://3.88.198.173/blog/how-flightaware-fuels-flight-prediction-models-with-timescaledb-and-grafana/">How FlightAware fuels flight prediction models for global travelers with TimescaleDB and Grafana</a> &gt;&gt;</strong></p><p>Our friends <a href="https://twitter.com/flightawareX">@flightaware</a> - the world’s largest flight tracking platform - share how they built a monitoring system that allows them to predict flight arrival and departure times for 75K+ flights<em> a day</em>. The team breaks down how FlightAware works - and ways to get involved - and shares example Grafana dashboards + SQL queries, pro tips, and more.</p><ul><li>🙏 to <a href="https://www.linkedin.com/in/caroline-rodewig-3ba81095/">Caroline</a>, FlightAware Sr. 
Engineer &amp; Predict Team Lead, for sharing your story!</li></ul><p><strong>[Community Spotlight #2]: <a href="http://3.88.198.173/blog/wsprdaemon-combines-timescaledb-grafana-analyze-radio/">How WsprDaemon combines TimescaleDB and Grafana to measure and analyze radio transmissions</a> &gt;&gt;</strong></p><p>Learn how the WsprDaemon team uses SQL, TimescaleDB, and Grafana to bring radio transmission data and analysis to developers everywhere. Rob &amp; Gwyn share example queries, Grafana dashboards, and why they switched from InfluxDB (hint: high-cardinality).</p><ul><li>📻 <a href="http://wsprdaemon.org/">Visit WsprDaemon</a> to learn more about the project, see quickstarts, and more.</li><li>💻 <a href="https://youtu.be/E_NTvubFmew?t=3656">Watch Gwyn &amp; Rob’s Digital Communications Conference 2020 presentation</a> to see the project in action.</li><li>📑 <a href="https://www.researchgate.net/publication/344782062_Griffiths_and_Robinett_WSPR_with_TimescaleDB_and_Grafana_ARRL-TAPR_2020">Read the accompanying conference paper</a> to get a deeper look at their work.</li></ul><p><strong>[Community Article]: <a href="https://itnext.io/iot-data-analytics-at-the-edge-d116b6681d7b">GTM Stack: IoT Data Analytics at the Edge</a> &gt;&gt;</strong></p><p>We love this piece from <a href="https://twitter.com/GaryStafford">Gary Stafford</a>, which takes you through building an open-source IoT analytics stack with Grafana, TimescaleDB, and Mosquitto (plus why this is the ideal setup).</p><ul><li>🔧 <a href="https://github.com/garystafford/iot-analytics-at-the-edge">Get the source code</a> to follow along with Gary’s example or spin up your own GTM stack.</li></ul><p><strong>[Meetup Replay]: <a href="https://youtu.be/G-yMGMZ6hJk?t=350">Paris Time-Series Meetup: Intro to TimescaleDB</a> (demos and best practices) &gt;&gt;</strong></p><p>Timescale Developer Advocate Avthar shares TimescaleDB fundamentals, including how hypertables and chunking work, then dives into 5 pros and 
cons – and ways to work around “cons,” like using native compression to reduce storage overhead.</p><ul><li>🙏 to <a href="https://twitter.com/nsteinmetz">Nicolas Steinmetz </a>&amp; <a href="https://twitter.com/ParisTimeSeries">@ParisTimeSeries</a> for inviting us.</li><li>🗓  <a href="https://www.ptsm.io/">See Paris Time Series Meetup</a> website for more upcoming events.</li></ul><h2 id="timescaledb-tips-reading-list-etc-"><strong>TimescaleDB tips, reading list &amp; etc.</strong></h2><p><strong>[TimescaleDB Tip #1]: <a href="http://3.88.198.173/blog/speed-up-grafana-autoswitching-postgresql/">Speed up your Grafana dashboards with <code>UNION ALL</code></a> &gt;&gt;</strong></p><p>If Grafana is slow to load your dashboards with fine-grained, non-aggregated data, you’re not alone. Use this short guide to see how to apply PostgreSQL <code>UNION ALL</code> to speed up your visualizations - saving you time <strong>and</strong> CPU resources 🔥.</p><figure class="kg-card kg-code-card"><pre><code class="language-SQL">-- Use Daily aggregate for intervals greater than 14 days
SELECT day as time, ride_count, 'daily' AS metric
FROM rides_daily
WHERE  $__timeTo()::timestamp - $__timeFrom()::timestamp &gt; '14 days'::interval AND  $__timeFilter(day)
UNION ALL

-- Use hourly aggregate for intervals between 3 and 14 days
SELECT hour, ride_count, 'hourly' AS metric
FROM rides_hourly
WHERE $__timeTo()::timestamp - $__timeFrom()::timestamp BETWEEN '3 days'::interval AND '14 days'::interval AND  $__timeFilter(hour)
UNION ALL

-- Use 10-minute aggregate for intervals less than 3 days
SELECT bucket, ride_count, '10min' AS metric
FROM rides_10mins
WHERE $__timeTo()::timestamp - $__timeFrom()::timestamp &lt; '3 days'::interval AND  $__timeFilter(bucket)
ORDER BY 1; 
</code></pre><figcaption>See our <a href="http://3.88.198.173/blog/speed-up-grafana-autoswitching-postgresql/">how-to blog post</a> for more sample queries and pro tips</figcaption></figure><p><strong>[TimescaleDB Tip #2]: <a href="https://docs.timescale.com/latest/tutorials/replication/">Set up streaming replication to protect against failures, outages, and unforeseen issues</a> &gt;&gt;</strong></p><p>Check out this step-by-step tutorial to configure PostgreSQL streaming replication on your TimescaleDB instances <em>before </em>anything goes wrong. You’ll get guidance for synchronous and asynchronous replication and viewing diagnostics, as well as a few example scenarios.</p><p><strong>[Reading List]: <a href="http://3.88.198.173/blog/when-boring-is-awesome-building-a-scalable-time-series-database-on-postgresql-2900ea453ee2/">When Boring is Awesome: Building a scalable time-series database on PostgreSQL</a> &gt;&gt;</strong></p><p>Go back to the beginning with our first-ever blog, where we introduced the TimescaleDB beta to the world. We've come a long way in 3 years, but "boring" is even more awesome than ever — especially when it’s your database.</p><p><strong>[Reading List]: <a href="http://3.88.198.173/blog/time-series-data-why-and-how-to-use-a-relational-database-instead-of-nosql-d0cd6975e87c/">Time-series data: Why (and how) to use a relational database instead of NoSQL</a> &gt;&gt;</strong></p><p>In this old-but-great post, we detail how traditional relational and NoSQL databases handle time-series data, why neither option is <em>quite </em>right, and how TimescaleDB takes a different approach. 
The result: a relational database with minimized memory usage, robust index support, <strong>and </strong>scale (including 15x+ INSERT rate improvements 🎉).</p><p><strong>[Reading List]: <a href="http://3.88.198.173/blog/use-relational-database-instead-of-nosql-for-iot-application/">5 reasons why relational databases &gt; NoSQL for IoT scenarios</a> &gt;&gt;</strong></p><p>The above blog applies to all scenarios, and, in this one, our product team focuses on why relational databases reign supreme for IoT scenarios specifically, from eliminating data silos - all of your data in one place! - to reliability and flexibility.</p><ul><li>🔧 <a href="https://docs.timescale.com/latest/tutorials/tutorial-howto-simulate-iot-sensor-data">Use our IoT Simulation tutorial</a> to test and explore how TimescaleDB handles device data.</li></ul><p><strong>[Time-series Fun]: <a href="https://docs.timescale.com/latest/tutorials/other-sample-datasets/">Explore time-series analysis with IoT &amp; DevOps sample datasets</a> &gt;&gt;</strong></p><p>We've created some sample datasets and example queries to get you up and running. Each scenario includes various database sizes, time intervals, and partition field values 🎉.</p><p><strong>[Team Timescale Fun]: </strong>Last, but certainly not least, Timescale People Manager Mel continues to bring her A+ game to all things remote team bonding. </p><ul><li>🐯 Want to join our tiger team? 
<a href="https://www.timescale.com/careers/">We're hiring across various departments</a> (all roles are 100% remote and 100% awesome).</li></ul><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://lh4.googleusercontent.com/bxwrutKgI6Ny1GRK7U6PW8vqT4iMpJJxusTulZA5FkyJE4ECHGX9rT7IRlNBUYV5f8C4dVKia9OE5lx_QW7TTQKGiKACmxlPY9mHyzvyd8MwOCXDekH-B5EzT6uPNEg-7EcYftR7" class="kg-image" alt="Timescale Newsletter Roundup: October Edition"><figcaption>🎁 The prize for our most recent async Slack challenge: an original composition featuring the winners, performed via ukulele at our weekly All Hands 🎼&nbsp;</figcaption></figure><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://lh5.googleusercontent.com/gvIj56S9eBFnBtbKnIELTyS8di6ZLRIMUG__fMwPquNcRnRq_ywQCh46tTVRlLqiW6yZBUTgmx3G0r_USk0Nr_dz4FCinktStjE1MQJKmjAgKwMCXZcT8L5W10wudHhJ8a_UQ0FD" class="kg-image" alt="Timescale Newsletter Roundup: October Edition"><figcaption>TimescaleDB made <a href="https://twitter.com/Embroker">@Embroker</a>’s Top 200 Startup List (#49) 🥳 We’re honored to be featured along so many amazing companies - <a href="https://www.embroker.com/blog/top-200-startups/">check out the full list</a>.</figcaption></figure><h2 id="wrapping-up"><strong>Wrapping Up</strong></h2><p>And, that concludes this month’s newsletter roundup. 
We’ll continue to release new content, events, and more - posting monthly updates for everyone.</p><p>If you’d like to get updates as soon as they’re available, <strong><a href="https://www.timescale.com/signup/newsletter/">subscribe to our newsletter</a></strong> (2x monthly emails, prepared with 💛 and no fluff or jargon, promise).</p><p>Happy building!</p>]]></content:encoded></item><item><title><![CDATA[TimescaleDB 2.0: A multi-node, petabyte-scale, completely free relational database for time-series]]></title><description><![CDATA[After two years of dedicated engineering and user feedback, TimescaleDB 2.0 is finally here, setting a new bar for time-series databases – and it’s completely free.]]></description><link>http://3.88.198.173:80/timescaledb-2-0-a-multi-node-petabyte-scale-completely-free-relational-database-for-time-series/</link><guid isPermaLink="false">5fdb982d8946b00cbd80137b</guid><category><![CDATA[Announcements & Releases]]></category><category><![CDATA[Engineering]]></category><dc:creator><![CDATA[Ajay Kulkarni]]></dc:creator><pubDate>Thu, 29 Oct 2020 14:00:02
GMT</pubDate><media:content url="http://3.88.198.173/content/images/2020/10/spacex-PIOgkhaF3WA-unsplash-1.jpg" medium="image"/><content:encoded><![CDATA[<img src="http://3.88.198.173/content/images/2020/10/spacex-PIOgkhaF3WA-unsplash-1.jpg" alt="TimescaleDB 2.0: A multi-node, petabyte-scale, completely free relational database for time-series"><p><em>After two years of dedicated engineering and user feedback, TimescaleDB 2.0 is finally here, setting a new bar for time-series databases – and it’s completely free.</em></p><p>Time-series data is everywhere. Whether you are monitoring your software stack, users, manufacturing line, home, vehicle, stock and cryptocurrency portfolio, air quality in your house, or just your health in the middle of a pandemic, you are collecting time-series data. As software continues to relentlessly penetrate our lives and businesses, time-series data is becoming even more ubiquitous and mission-critical. </p><p>At the same time, relational databases, that old stalwart, are making a comeback as the database of choice for software applications. Despite years of NoSQL hype, the <a href="https://db-engines.com/en/ranking">top 4 databases in use today</a> are all relational databases. In addition, PostgreSQL is the <a href="https://db-engines.com/en/ranking_trend">fastest growing database over the last year</a> (yes, growing faster than even MongoDB). </p><p>What developers need is a new kind of database, purpose-built for time-series workloads while fully embracing the relational model. After all, your time-series data doesn’t exist in a vacuum. 
Being able to correlate it with technical metadata, business data, and outcomes is critical to understanding how your software, systems, operations, and business changes over time.</p><p>Building that database has always been our mission: to help developers store and analyze time-series data in a fast, reliable, and cost-effective way, so that they can focus on their core application and delight their users. </p><p>Since launching 3.5 years ago, TimescaleDB has proven itself as the leading relational database for time-series data, engineered on top of PostgreSQL, and offered <a href="https://www.timescale.com/products">via free software or as a fully-managed service</a> on AWS, Azure, and GCP. </p><p>In that time, the TimescaleDB community has become the largest developer community for time-series data: tens of millions of downloads; over 500,000 active databases; organizations like AppDynamics, Bosch, Cisco, Comcast, Credit Suisse, DigitalOcean, Dow Chemical, Electronic Arts, Fujitsu, IBM, Microsoft, Rackspace, Schneider Electric, Samsung, Siemens, Uber, Walmart, Warner Music, WebEx, and thousands of others; all in addition to the PostgreSQL community and ecosystem.</p><p><strong>Today, with TimescaleDB 2.0, we are marking a major milestone in our journey.  </strong></p><p>With this 2.0 release, TimescaleDB is now a distributed, multi-node, petabyte-scale relational database for time-series. <em>And, we are making everything in this release completely free</em>. This is the culmination of two years of dedicated engineering effort, as well as significant user feedback on several previous betas.</p><p>In fact, users have already been running multi-node TimescaleDB in continuous daily use for many months, including a 22-node cluster by a Fortune 100 tech company ingesting more than a billion rows per day:</p><ul><li><em>"We continuously ingest telemetry events into TimescaleDB 2.0 to monitor and analyze huge numbers of sessions. 
<strong>We've been running TimescaleDB multi-node across 22 servers for almost the past year, ingesting more than a billion rows of data per day</strong>. TimescaleDB's performance, scale, relational and SQL capabilities, and ability to handle complex data have been a real winner." </em>– Rahul, Technical Leader at Fortune 100 tech company</li><li><em>“Netskope prides itself on speed and scalability, and we rely heavily on time-series data to plan, monitor, and troubleshoot our global network of thousands of servers. With TimescaleDB, we tap into the ubiquitous PostgreSQL ecosystem and use TimescaleDB's continuous aggregates and other built-in time-series functions for real-time analytics and advanced historical analysis. Now <strong>with multi-node TimescaleDB, we get the horizontal scalability and rapid ingest throughput we need to monitor and manage our systems at scale</strong>, now and in the future.” </em>– Mark S. Reibert, Ph.D., Systems Architect at <a href="https://www.netskope.com/">Netskope, Inc</a>.</li></ul><p>TimescaleDB 2.0 also includes:</p><ul><li><strong>Updated, more permissive licensing:</strong> making all of our enterprise features free and granting more rights to users.</li><li><strong><strong><strong>Substantial improvements to Continuous Aggregates: </strong></strong></strong>improving APIs and giving users greater control over the process.</li><li><strong><strong><strong>User-Defined Actions (new feature!):</strong> </strong></strong>users can now define custom behaviors inside the database and schedule them using our job scheduling system.</li><li><strong>New and improved informational views:</strong> including over hypertables, chunks, policies, and job scheduling.</li></ul><p>The TimescaleDB 2.0 Release Candidate is available immediately for self-managed software installations, with General Availability expected in late 2020. TimescaleDB 2.0 will be available on our hosted time-series services at that time. 
If you’re already using TimescaleDB, we’ve created detailed documentation to simplify and speed up your migration. </p><ul><li><a href="https://docs.timescale.com/v2.0/getting-started/installation">Download TimescaleDB 2.0</a> to get started right away.</li><li>Read the <a href="https://docs.timescale.com/v2.0/release-notes/changes-in-timescaledb-2">release overview guide</a> (including changes in this release)</li><li>Read the <a href="https://docs.timescale.com/v2.0/update-timescaledb/update-tsdb-2">upgrade documentation</a> (for existing software users migrating from TimescaleDB 1.x).</li></ul><p>We also encourage you to <a href="https://slack.timescale.com/">join our 5,000+ member Slack community</a> for any questions, to learn more, and to meet like-minded developers – we’re active in all channels and here to help.</p><p><em>(While Ajay and Mike are listed as authors of this post, full credit and a big round of applause goes to members of the Timescale database team for their hours, weeks, and months of dedication and commitment to shipping high quality code: </em><a href="https://github.com/erimatnor"><em>Erik Nordström</em></a><em>, </em><a href="https://github.com/gayyappan"><em>Gayathri Ayyappan</em></a><em>, </em><a href="https://github.com/k-rus"><em>Ruslan Fomkin</em></a><em>, </em><a href="https://github.com/mkindahl"><em>Mats Kindahl,</em></a><em> </em><a href="https://github.com/svenklemm"><em>Sven Klemm</em></a><em>, </em><a href="https://github.com/WireBaron"><em>Brian Rowe</em></a><em>, and </em><a href="https://github.com/pmwkaa"><em>Dmitry Simonenko</em></a><em>.) </em></p><p>We’d like to give a massive thank you to all of our beta testers; from reporting issues to sharing feedback and suggesting features, you all played a big role in making TimescaleDB 2.0 the best possible experience for developers. 
</p><p>To learn more about TimescaleDB 2.0, time-series data, and why we believe relational databases are the past and future of software development, please read on.</p><h2 id="relational-databases-are-dead-long-live-relational-databases-">Relational databases are dead. Long live relational databases.</h2><p>For about 30 years, from the mid-1970s to the mid-2000s, if you were developing software, you used a relational database. From System R (<a href="https://en.wikipedia.org/wiki/IBM_System_R">1974</a>) to Oracle (1979), SQL Server (1989), and later open-source options like MySQL (1995) and PostgreSQL (1996), relational databases were the standard for any new application. </p><p>About 15 years ago, this all changed. Non-relational databases, sometimes also called “NoSQL” databases, became fashionable. A lot of this usage was legitimately necessary. New Internet giants built new systems to handle data volumes that were previously unfathomable, e.g., Google with MapReduce (<a href="https://static.googleusercontent.com/media/research.google.com/en//archive/mapreduce-osdi04.pdf">2004</a>) and Bigtable (<a href="https://static.googleusercontent.com/media/research.google.com/en//archive/bigtable-osdi06.pdf">2006</a>); Amazon with Dynamo (<a href="http://www.allthingsdistributed.com/files/amazon-dynamo-sosp2007.pdf">2007</a>). But a lot of NoSQL adoption was a knee-jerk reaction, along the lines of, “<a href="https://www.youtube.com/watch?v=b2F-DItXtZs">relational databases don’t scale, so I need a NoSQL database</a>.”</p><p>Yet most companies are not Google or Amazon. And it turns out the ability to store data in a way that preserves the relationships in your dataset is valuable. After decades of usage in production, most relational databases are battle-hardened and typically more reliable than their NoSQL cousins. 
SQL has also re-emerged as the <a href="http://3.88.198.173/blog/why-sql-beating-nosql-what-this-means-for-future-of-data-time-series-database-348b777b847a/">universal language for data analysis</a>, and is the <a href="https://insights.stackoverflow.com/survey/2020#technology-programming-scripting-and-markup-languages">third most widely used language today</a> (after JavaScript and HTML/CSS).</p><p>Today, the <a href="https://db-engines.com/en/ranking">top 4 databases in use</a> are still all relational databases. In particular, PostgreSQL is the <a href="https://db-engines.com/en/ranking_trend">fastest growing database over the last year</a> (yes, growing faster than even MongoDB). Some of this is from developers switching back; some from developers who never left relational databases. So don’t call it a comeback - relational databases have been here for years (h/t <a href="https://www.youtube.com/watch?v=vimZj8HW0Kg">James Todd Smith</a>). </p><p>Most importantly, relational databases can, in fact, scale. We see this in the more recent wave of “NewSQL” databases. Google again led the way almost a decade ago, with a geo-replicated relational database announced in their first Spanner paper <a href="https://static.googleusercontent.com/media/research.google.com/en//archive/spanner-osdi2012.pdf">(2012</a>) (whose authors include the original MapReduce authors), followed by other pioneers like CockroachDB (<a href="https://cockroachlabs.com/">2014</a>) and Yugabyte (<a href="https://www.yugabyte.com/">2016</a>). And with TimescaleDB (<a href="http://timescale.com/">2017</a>), we have built a relational database that scales for time-series data.</p><h2 id="what-is-time-series-data">What is time-series data?</h2><p>Simply put, time-series is the measurement of something across time. But, to dig a little deeper, <em>time-series data is the measurement of how something changes. 
</em></p><p><strong>Here is a simple example:</strong></p><p>If I send you $10, then a traditional bank database would atomically debit my account and credit your account. Then, if you send me $10, the same process happens in reverse. </p><p>At the end of this process, our bank balances would look the same, so the bank might think, “Oh, nothing happened.” And that’s what a traditional database would show you. </p><p>But, with a time-series database, the bank could see, “Hey, these two people keep sending each other $10 - maybe they’re friends, maybe they’re roommates, maybe there’s something else going on.” That level of granularity, <em>the measurement of how something changes</em>, is what time-series enables. </p><p>In other words, time-series datasets track changes to the overall system as INSERTs, not UPDATEs, to capture more information of what is happening.</p><p>Time-series used to be niche, isolated to industries like finance, process manufacturing (e.g., oil and gas, chemicals, plastics), or power and utilities. But in the last few years, time-series workloads have exploded (the <a href="https://db-engines.com/en/ranking_categories">fastest growing category in the past 24 months</a>). This is partly due to the growth in IT monitoring and IoT, but there are also many other new sources of time-series data: cryptocurrencies, gaming, machine learning, and more.</p><p>What is happening is that everyone wants to make better data-driven decisions faster, which means collecting data at the highest fidelity possible. <strong>Time-series is the highest fidelity of data you can capture, because it tells you exactly how things are changing over time</strong>. While traditional datasets give you static snapshots, time-series data provides the dynamic movie of what’s happening across your system: e.g., your software, your physical power plant, your game, your customers inside your application.</p><p>Time-series is no longer some niche workload. It’s everywhere. 
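</p><p>The bank example above can be sketched in SQL (a hypothetical schema, not from an actual banking system): record each transfer as a new row, and the history stays intact even when the balances net out to zero.</p><figure class="kg-card kg-code-card"><pre><code class="language-SQL">-- Time-series approach: every transfer is an INSERT, never an UPDATE
CREATE TABLE transfers (
  time         TIMESTAMPTZ NOT NULL,
  from_account TEXT NOT NULL,
  to_account   TEXT NOT NULL,
  amount       NUMERIC NOT NULL
);

INSERT INTO transfers VALUES
  ('2020-11-01 09:00', 'alice', 'bob', 10),
  ('2020-11-01 17:00', 'bob', 'alice', 10);

-- Balances look unchanged, but the activity is fully preserved:
SELECT from_account, count(*) AS transfers_sent
FROM transfers
GROUP BY from_account;</code></pre></figure><p>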
In fact, all data is time-series data - if you are able to store it at that fidelity. Of course, that’s the problem with collecting time-series data: it’s relentless. By performing all these inserts, as opposed to updates, you end up with a lot more data, at higher volumes and velocities than ever before. You quickly get to tables in the billions of rows. For a traditional database, this creates challenges around performance and scalability. </p><p>That’s where TimescaleDB comes in.</p><h2 id="what-is-timescaledb">What is TimescaleDB?</h2><p>TimescaleDB is the leading relational database for time-series data. Engineered on top of PostgreSQL, Timescale is available <a href="https://www.timescale.com/products">via free software or as a fully-managed service</a> on AWS, Azure, and GCP.</p><p>TimescaleDB is purpose-built for time-series workloads, so that you can get orders of magnitude better performance at a fraction of the cost, along with a much better developer experience. This means massive scale (100s billions of rows and millions of inserts per second on a single server), <a href="http://3.88.198.173/blog/building-columnar-compression-in-a-row-oriented-database/">94%+ native compression</a>, 10-100x faster queries than <a href="https://docs.timescale.com/latest/introduction/timescaledb-vs-postgres">PostgreSQL</a>, <a href="http://3.88.198.173/blog/timescaledb-vs-influxdb-for-time-series-data-timescale-influx-sql-nosql-36489299877/">InfluxDB</a>, <a href="http://3.88.198.173/blog/time-series-data-cassandra-vs-timescaledb-postgresql-7c2cc50a89ce/">Cassandra</a>, and <a href="http://3.88.198.173/blog/how-to-store-time-series-data-mongodb-vs-timescaledb-postgresql-a73939734016/">MongoDB</a> – all while maintaining the reliability, ease-of-use, SQL interface, and overall goodness of PostgreSQL.</p><p>Today, there are several options for storing time-series data. 
However, most are non-relational systems that are essentially glorified metric stores, focused on storing numerical data and not the broad spectrum of <em>data types</em> (nor the rich representation of relationships between datasets) that time-series workloads need.</p><p>In April 2017, we launched TimescaleDB into this world full of non-relational metric stores as the<strong> </strong>first time-series database that supported full SQL. Since then, many others have copied our SQL approach to time-series (including some that are very suspiciously named, *cough* <a href="https://aws.amazon.com/timestream/">Amazon Timestream</a> *cough*), but no one has been able to replicate the true relational foundation and community of TimescaleDB.</p><p>As a result, in just 3.5 years, TimescaleDB has come a long way, now with tens of millions of downloads and over 500,000 active databases. The TimescaleDB developer community includes organizations like AppDynamics, Bosch, Cisco, Comcast, DigitalOcean, Dow Chemical, Electronic Arts, Fujitsu, IBM, Microsoft, Rackspace, Schneider Electric, Samsung, Siemens, Uber, Walmart, Warner Music, WebEx, and thousands of others. </p><p>In addition to this dedicated community, we also benefit from the vast PostgreSQL community and ecosystem. <strong>Altogether, the TimescaleDB community is the largest developer community for time-series data.</strong></p><h2 id="-but-i-thought-insert-skepticism-here">🤔 But I thought &lt;insert skepticism here&gt;</h2><p>Ever since we launched TimescaleDB, we’ve met skepticism. After all, building a time-series database on PostgreSQL is a non-obvious, somewhat heretical decision. Yet with each release, we continue to disprove our haters and delight our users. Because, as it turns out, building a scalable relational database for time-series isn’t impossible – it’s just hard. But, with our talented team and passionate users, we’re doing it. 
</p><p><strong>Myth 1: A relational database can’t scale as well as a non-relational database</strong></p><p>Fact: We outperform non-relational (and other relational) databases for time-series data. <a href="http://3.88.198.173/blog/time-series-data-cassandra-vs-timescaledb-postgresql-7c2cc50a89ce/">Versus Cassandra</a>, 10x higher inserts, 1000x faster queries. <a href="http://3.88.198.173/blog/how-to-store-time-series-data-mongodb-vs-timescaledb-postgresql-a73939734016/">Versus Mongo</a>, 20% higher inserts, 1400x faster queries. <a href="http://3.88.198.173/blog/timescaledb-vs-influxdb-for-time-series-data-timescale-influx-sql-nosql-36489299877/">Versus InfluxDB</a>, higher inserts, faster queries, and better reliability. (Unlike all of these options, we also support full SQL, which allows users to run complex analysis on their data using a programming language and the tools they already know.)</p><p><strong>Myth 2: Relational databases take up too much disk space (or, row-oriented databases can’t compress as well as columnar databases)</strong></p><p>Fact: It is possible to build columnar compression in a row-oriented database, <a href="http://3.88.198.173/blog/building-columnar-compression-in-a-row-oriented-database/">which is what we have done</a>. TimescaleDB employs several <a href="http://3.88.198.173/blog/time-series-compression-algorithms-explained/">best-in-class compression algorithms</a>, including delta-delta, Gorilla, and Simple-8b RLE, allowing us to achieve 94%+ native compression. </p><p><strong>Myth 3: Non-relational databases don’t require schemas, which makes development easier</strong></p><p>Fact: Every database, relational or non-relational, uses a schema to store data. The only difference is whether you have the ability to modify that schema and optimize it for your use. However, having a schema automatically generated for you is useful. 
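</p><p>And a relational schema doesn't mean giving up flexibility: because TimescaleDB is PostgreSQL, you can mix typed columns with a schemaless <code>JSONB</code> column when some fields vary (a hypothetical sketch, not from the original post):</p><figure class="kg-card kg-code-card"><pre><code class="language-SQL">-- Typed columns for the fields you always query,
-- JSONB for attributes that vary by device
CREATE TABLE readings (
  time      TIMESTAMPTZ NOT NULL,
  device_id TEXT NOT NULL,
  value     DOUBLE PRECISION,
  attrs     JSONB
);

-- Filter on a "schemaless" attribute with the -&gt;&gt; operator
SELECT time, value
FROM readings
WHERE attrs-&gt;&gt;'firmware' = '2.1.0';</code></pre></figure><p>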
We are already exploring automatic schemas: e.g., see <a href="http://3.88.198.173/blog/promscale-analytical-platform-long-term-store-for-prometheus-combined-sql-promql-postgresql/">Promscale</a>, our new analytical platform for Prometheus built on TimescaleDB, which stores data in a dynamically auto-generated schema highly optimized for metrics. More to come.</p><p><strong>Myth 4: A relational database can’t scale-out across multiple machines</strong></p><p>Fact: NewSQL databases (mentioned above) are disproving this myth for transactional workloads. And today, we are disproving this myth for time-series workloads, with TimescaleDB 2.0.</p><h2 id="timescaledb-2-0-multi-node-petabyte-scale-and-completely-free">TimescaleDB 2.0: Multi-node, petabyte-scale, and completely free</h2><p>As mentioned above, customers have already been running multi-node TimescaleDB in continuous daily use for many months, including a 22-server cluster by a Fortune 100 tech company ingesting more than a billion rows per day.</p><h3 id="introducing-distributed-hypertables">Introducing distributed hypertables</h3><p>To achieve multi-node, TimescaleDB 2.0 introduces the concept of a <em>distributed hypertable</em>. </p><p>A regular <a href="https://docs.timescale.com/v2.0/using-timescaledb/hypertables"><strong>hypertable</strong></a>, one of our original innovations, is a virtual table in TimescaleDB that automatically partitions data into many sub-tables (“chunks”) on a single machine, continuously creating new ones as necessary, yet provides the illusion of a single continuous table across all time. 
</p><p>A <a href="https://docs.timescale.com/v2.0/using-timescaledb/distributed-hypertables"><strong>distributed hypertable</strong></a> is a hypertable that automatically partitions data into chunks <em>across multiple machines</em>, while still maintaining the illusion (and user-experience) of a single continuous table across all time.</p><p><a href="http://3.88.198.173/blog/building-a-distributed-time-series-database-on-postgresql/">The architecture</a> consists of an access node (AN), which stores metadata for the distributed hypertable and performs query planning across the cluster, and a set of data nodes (DN), which store subsets of the distributed hypertable dataset and execute queries locally.  TimescaleDB remains a single piece of software for operational simplicity; these roles as described are established by executing database commands within TimescaleDB (e.g., on a server that should act as an access node, you <a href="https://docs.timescale.com/v2.0/api#add_data_node"><code>add_data_node</code></a> pointing to the hostnames of the data nodes, and then <a href="https://docs.timescale.com/v2.0/api#create_distributed_hypertable"><code>create_distributed_hypertable</code></a>.)</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="http://3.88.198.173/content/images/2020/10/pasted-image-0.png" class="kg-image" alt="TimescaleDB 2.0: A multi-node, petabyte-scale, completely free relational database for time-series"><figcaption>A multi-dimensional distributed hypertable covering one access node (AN) and three data nodes (DN1-DN3). The "space"dimension (hostname) determines the data node to place a chunk on.</figcaption></figure><p>Currently, you can add any number of data nodes for horizontal scalability, as well as leverage existing Postgres physical replication on data nodes for fault tolerance (we are also working on more native replication for future releases; see below).  
</p><p>The access node can also be physically replicated for high availability, and future releases will focus on further scaling out the read and write paths for TimescaleDB multi-node.</p><h3 id="insert-and-query-benchmarks">Insert and query benchmarks</h3><p>As a result, while a traditional hypertable scales to 1-2 million metrics per second and 100 terabytes of data, a distributed hypertable scales to ingest 10+ million metrics per second and store petabytes of data:<br></p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://lh5.googleusercontent.com/gJUt4XRheONdyHFCgHmUJsnK4_Sbt-y0UNgSuy36GE0EbXVU2jdtwZRddfs5nNGHYfYsqRUnc6T5KmNLRt-Yg-mIQFJHvtOfONAxLC-_0qW8pE2wmemf1oqMqfUzsAR5_-3YpYm2" class="kg-image" alt="TimescaleDB 2.0: A multi-node, petabyte-scale, completely free relational database for time-series"><figcaption>TimescaleDB running the open-source Time Series Benchmarking Suite, deployed on AWS running m5.2xlarge data nodes and a m5.12xlarge access node, both with standard EBS gp2 storage<em>. </em><a href="http://3.88.198.173/blog/building-a-distributed-time-series-database-on-postgresql/"><em>More here.</em></a></figcaption></figure><p>Distributed hypertables also take advantage of query parallelization, employing full/partial aggregates and push-downs, to achieve much faster queries:</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://lh5.googleusercontent.com/C4bXe4cwXR7Hz7M2ij0VWD51he1hVGVObB_RtyuLKTnWHJIm0rcbYM4umSRNU1mS-tRqkWsAWmZqWoWnWnQNVqqvxOKrKAGhEvpa_ulcwntnQQfZT8tLuMbQruj3AuSEof30xdfJ" class="kg-image" alt="TimescaleDB 2.0: A multi-node, petabyte-scale, completely free relational database for time-series"><figcaption>We include 1 data node as a point of reference, to show the slight overhead of distributed hypertables due to the extra network communication and processing. 
Even with that slight overhead, we see that queries complete about 7 times faster on a distributed hypertable with 8 data nodes than on a regular (1 node) hypertable. Here we are using the Time Series Benchmark Suite (TSBS) with the IT monitoring use case (DevOps), on m5.2xlarge AWS instances. <a href="http://3.88.198.173/blog/achieving-optimal-query-performance-with-a-distributed-time-series-database-on-postgresql/">More here</a>.</figcaption></figure><h3 id="what-s-next-for-distributed-hypertables">What’s next for distributed hypertables?</h3><p>We are already hard at work improving upon this initial release of <strong>distributed hypertables</strong> with the next series of features:</p><ul><li><strong>Replication:</strong> Currently, every node needs its own replication (using primary/backup physical replication). Cluster-wide replication across data nodes, built natively for TimescaleDB, is in development.</li><li><strong>Rebalancing:</strong> Currently, when new data nodes are elastically added to an existing distributed hypertable, new chunks are created across the available nodes, and queries are <a href="http://3.88.198.173/blog/achieving-optimal-query-performance-with-a-distributed-time-series-database-on-postgresql/#how-repartitioning-affects-push-down">routed accordingly to be repartitioning-aware</a>. However, existing chunks are not yet rebalanced across nodes; like native replication, this is in development.</li><li><strong>Backup:</strong> Each node can be backed up and restored, but there is currently no consistent restore-point snapshot across the whole cluster. Cluster-wide backup is also in development.</li><li><strong>Compression:</strong> Compression currently must be performed on a per-chunk basis. 
In the future, compression policies on the access node, which then propagate to each data node, will be possible.</li></ul><p>Some features, such as continuous aggregates and <code>time_bucket_gapfill</code>, do not currently work on distributed hypertables. Those are also in development.</p><p>Check out the explainer video below for a breakdown of how distributed hypertables work, when and why you'd use them, best practices, and more.</p><figure class="kg-card kg-embed-card kg-card-hascaption"><iframe width="612" height="344" src="https://www.youtube.com/embed/5pbZYHTu-eY?feature=oembed" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe><figcaption>Watch this overview and how-to video to get up and running with distributed hypertables</figcaption></figure><h2 id="what-else-is-new-in-timescaledb-2-0">What else is new in TimescaleDB 2.0?</h2><p>While distributed hypertables are the biggest component of this release, TimescaleDB 2.0 also includes:</p><ul><li><strong>Updated, more permissive licensing:</strong> making all of our enterprise features free and granting more rights to users.</li><li><strong>Substantial improvements to Continuous Aggregates:</strong> improving APIs and giving users greater control over the process.</li><li><strong>User-Defined Actions (new feature!):</strong> users can now define custom behaviors inside the database and schedule them using our job scheduling system.</li><li><strong>New and improved informational views:</strong> covering hypertables, chunks, policies, and job scheduling.</li></ul><h3 id="updated-licensing-everything-is-now-free-">Updated licensing (everything is now free!)</h3><p>TimescaleDB 2.0 introduces an update to the Timescale License, our source-available license that governs most of our advanced capabilities, including 
native compression, multi-node, continuous aggregations, and more. </p><p>This update makes all of our enterprise features free, and provides expanded rights to users, reinforcing our commitment to our community. Notably, this update adds the “right-to-repair”, the “right-to-improve”, and eliminates the paid enterprise tier and usage limits altogether (thus establishing that all of our software will be available for free). (<a href="http://3.88.198.173/blog/building-open-source-business-in-cloud-era-v2/">More in this announcement.</a>)</p><h3 id="continuous-aggregates-2-0">Continuous Aggregates 2.0</h3><p>Continuous Aggregates are an existing capability (<a href="http://3.88.198.173/blog/continuous-aggregates-faster-queries-with-automatically-maintained-materialized-views/">introduced 1.5 years ago</a> with TimescaleDB 1.3) that automatically calculates the results of a query in the background and materializes the results, leading to vastly faster query times. They are somewhat similar to PostgreSQL materialized views, but unlike a materialized view, Continuous Aggregates do not need to be refreshed manually; views are automatically refreshed in the background as new data is added, or old data is modified. (<a href="https://docs.timescale.com/v2.0/using-timescaledb/continuous-aggregates">See our Continuous Aggregates documentation for more details</a>.)</p><p>TimescaleDB 2.0 includes substantial improvements (and, as a result, some breaking API changes) to Continuous Aggregates: </p><ul><li>Updated APIs that separate function and policies, giving users greater control of the Continuous Aggregation process. For example, a Continuous Aggregate can now be manually refreshed over a given range. One common user request has been to materialize recent data but leave historical data to manual refreshes. Now that is possible.</li><li>The separation of function and policies also makes this feature more amenable to distributed operation in the future (e.g., multinode). 
For instance, a policy on an Access Node can trigger refreshes on Data Nodes.</li><li>There are also improvements that resolve other user issues (e.g., bugs, unexpected behavior).</li><li>As a result, there are some breaking API changes to Continuous Aggregates (<a href="https://docs.timescale.com/v2.0/release-notes/changes-in-timescaledb-2">highlighted here in the documentation</a>).</li></ul><figure class="kg-card kg-embed-card kg-card-hascaption"><iframe width="612" height="344" src="https://www.youtube.com/embed/rTbyQMbH5jo?feature=oembed" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe><figcaption>Watch this how-to and demo video to get up and running with Continuous Aggregates 2.0</figcaption></figure><h3 id="user-defined-actions-new-feature-">User-Defined Actions (New feature!)</h3><p>Previously, TimescaleDB offered standard policies that let users define a schedule to run predefined actions, e.g., for data retention, compression, and continuous aggregates. </p><p>TimescaleDB 2.0 introduces the idea of a <strong>User-Defined Action (UDA)</strong>. Users can now run functions and procedures implemented in a language of their choice (e.g., SQL, PL/pgSQL, C, PL/Python, or even PL/Perl) on a schedule within TimescaleDB. This makes it possible to automate periodic tasks that are not covered by existing policies, and even to enhance existing policies with additional functionality. Users can now also schedule predefined actions themselves, in case they need greater flexibility than what the standard policies provide. (<a href="https://docs.timescale.com/v2.0/using-timescaledb/actions">See our User-Defined Actions documentation for more details</a>.)</p><p>For example, you can create more generic data retention policies, data tiering policies, joint downsampling and compression policies, and more, all set to run on a schedule you define ahead of time within TimescaleDB. 
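</p><p>As a sketch of what that looks like in practice (the procedure and table names here are illustrative), a user-defined action is simply a procedure with a <code>(job_id, config)</code> signature that you hand to the job scheduler:</p><pre><code class="language-sql">-- An illustrative retention action written in PL/pgSQL
CREATE OR REPLACE PROCEDURE generic_retention(job_id INT, config JSONB)
LANGUAGE PLPGSQL AS $$
BEGIN
    -- Example: drop data older than six months from a hypothetical table
    DELETE FROM conditions WHERE time &lt; now() - INTERVAL '6 months';
END
$$;

-- Schedule it to run once a day via the job scheduling system
SELECT add_job('generic_retention', '1 day');</code></pre><p>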
</p><figure class="kg-card kg-embed-card kg-card-hascaption"><iframe width="612" height="344" src="https://www.youtube.com/embed/4AwmesqGjcw?feature=oembed" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe><figcaption>Watch this how-to and demo video to get up and running with User-Defined Actions</figcaption></figure><h2 id="informational-views-new-and-improved-views-">Informational views (NEW and improved views)</h2><p>TimescaleDB 2.0 also introduces new and updated informational views, including over hypertables, chunks, policies, and job scheduling.</p><figure class="kg-card kg-embed-card kg-card-hascaption"><iframe width="612" height="344" src="https://www.youtube.com/embed/pXVPIdQjO7g?feature=oembed" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe><figcaption>Watch this how-to and demo video to get up and running with informational views</figcaption></figure><h2 id="how-to-get-started-with-timescaledb-2-0">How to get started with TimescaleDB 2.0</h2><p>The TimescaleDB 2.0 Release Candidate is available immediately for self-managed software installations, with General Availability expected in late 2020. TimescaleDB 2.0 will be available on our hosted time-series services at that time. If you’re already using TimescaleDB, we’ve created detailed documentation to simplify and speed up your migration. 
</p><ul><li><a href="https://docs.timescale.com/v2.0/getting-started/installation">Download TimescaleDB 2.0</a> to get started right away.</li><li>Read the <a href="https://docs.timescale.com/v2.0/release-notes/changes-in-timescaledb-2">release overview guide</a> (including changes in this release).</li><li>Read the <a href="https://docs.timescale.com/v2.0/update-timescaledb/update-tsdb-2">upgrade documentation</a> (for existing software users migrating from TimescaleDB 1.x).</li></ul><p><a href="http://www.timescale.com/office-hours">Join us for Office Hours on Tuesday, November 10th</a> to ask your questions live, directly to the Timescale team.</p><p>From there, we also encourage you to <a href="https://slack.timescale.com/">join our 5,000+ member Slack community</a> for any questions, to learn more, and to meet like-minded developers – we’re active in all channels and here to help.</p>]]></content:encoded></item><item><title><![CDATA[Promscale: An analytical platform and long-term store for Prometheus, with the combined power of SQL and PromQL]]></title><description><![CDATA[We just announced Promscale: a new open-source analytical platform & long-term store for Prometheus, with the combined power of SQL and PromQL. 
Learn why we built it, how it works, and how to get started right away.]]></description><link>http://3.88.198.173:80/promscale-analytical-platform-long-term-store-for-prometheus-combined-sql-promql-postgresql/</link><guid isPermaLink="false">5fdb982d8946b00cbd801377</guid><category><![CDATA[Announcements & Releases]]></category><category><![CDATA[Prometheus]]></category><dc:creator><![CDATA[Matvey Arye]]></dc:creator><pubDate>Tue, 06 Oct 2020 16:27:06 GMT</pubDate><media:content url="http://3.88.198.173/content/images/2020/10/melanie-magdalena-KpBAYMNf9Tw-unsplash.jpg" medium="image"/><content:encoded><![CDATA[<img src="http://3.88.198.173/content/images/2020/10/melanie-magdalena-KpBAYMNf9Tw-unsplash.jpg" alt="Promscale: An analytical platform and long-term store for Prometheus, with the combined power of SQL and PromQL"><p><em>In this post we introduce </em><a href="https://github.com/timescale/promscale"><em><strong>Promscale</strong></em></a><em>, a new open-source long-term store for Prometheus data designed for analytics. </em></p><p>Promscale is a horizontally scalable and operationally mature platform for Prometheus data that offers the combined power of PromQL and SQL, enabling developers to ask any question, create any dashboard, and achieve greater visibility into their systems. Promscale is built on top of <a href="https://www.timescale.com/">TimescaleDB</a>, the leading relational database for time-series.</p><p><a href="https://github.com/timescale/promscale">Promscale</a> is the result of a year of dedicated development effort by one of Timescale's engineering teams. It incorporates feedback from users and the general Prometheus community, and builds on 3.5 years of feedback from users of our previous <a href="https://github.com/timescale/prometheus-postgresql-adapter">Prometheus read-write adapter</a> (for more, please see this related <a href="https://tsdb.co/prom-design-doc">design doc</a>). 
As a result, despite being a young project, Promscale already sports an active user community, including organizations like Electronic Arts, Dow Chemical, and many others. This latest release marks the graduation of Promscale out of beta.</p><p><em>(The name “Promscale” itself was picked by our users and the Prometheus community via this </em><a href="https://github.com/timescale/timescale-prometheus/issues/243"><em>GitHub poll</em></a><em>. Although some of us were secretly rooting for “</em><a href="https://twitter.com/pg_xocolatl/status/1304165674500214784"><em>Promy McPromFace</em></a><em>” </em>😂<em>.)</em></p><p><strong>To get started right away, <a href="https://github.com/timescale/promscale">visit our GitHub repo </a>to install Promscale via Helm Charts, Docker, and others. </strong>And, if you like what we're building, please give us a ⭐️ on GitHub 🤗.</p><p>If you have a Kubernetes cluster with Helm installed, we suggest using <a href="https://github.com/timescale/tobs"><strong>tobs</strong></a> to install a full metric collection and visualization solution including Prometheus, Grafana, Promscale, and a preview version of PromLens in under 5 minutes (demo <a href="https://www.youtube.com/watch?v=MSvBsXOI1ks&amp;feature=youtu.be">video</a>).</p><p><em>Note: Although Mat, Josh, and Harkishen are listed as the authors of this post, full credit goes to the entire Promscale team: </em><a href="https://github.com/antekresic"><em>Ante Krešić</em></a><em>, </em><a href="https://github.com/blagojts"><em>Blagoj Atanasovski</em></a><em>, </em><a href="https://github.com/davidkohn88"><em>David Kohn</em></a><em>, </em><a href="https://github.com/Harkishen-Singh"><em>Harkishen Singh</em></a><em>, </em><a href="https://github.com/JLockerman"><em>Josh Lockerman</em></a><em>, and </em><a href="https://github.com/cevian"><em>Mat Arye</em></a><em>.</em></p><p>But why did we build Promscale? 
Please read on for more.</p><h2 id="we-are-witnessing-a-shift-in-the-role-of-software-and-in-the-ways-organizations-manage-and-monitor-their-software">We are witnessing a shift in the role of software, and in the ways organizations manage and monitor their software</h2><p>Today, every industry is moving its computing to the cloud. The complexity and scale of these modern, cloud-based applications necessitate sophisticated systems to monitor software application health and manage software infrastructure. Unlike in the past, when systems were all built using proprietary software, this new wave of modern infrastructure is being built using free, open components, like Kubernetes and Prometheus. The top two reasons for this shift are flexibility and cost. Unlike proprietary SaaS solutions, open tools put the users' needs first, enabling them to customize their stack to meet their needs, and cost pennies on the dollar. In this world, developers – not sales contracts, RFPs, or enterprise sales teams – decide which tools are used.</p><h2 id="prometheus-has-emerged-as-the-de-facto-monitoring-solution-for-modern-software-systems">Prometheus has emerged as the de facto monitoring solution for modern software systems</h2><p><a href="http://prometheus.io/">Prometheus</a> is an open-source systems monitoring and alerting toolkit that can be used to easily and cost-effectively monitor infrastructure and applications. Over the past few years, Prometheus has emerged as the monitoring solution for modern software systems. 
The key to Prometheus’ success is its pull-based architecture in combination with service discovery, which is able to seamlessly monitor modern, dynamic systems in which (micro-)services start up and shut down frequently.</p><h2 id="problem-prometheus-is-not-designed-for-analytics">Problem: Prometheus is not designed for analytics</h2><p>As organizations use Prometheus to collect data from more and more of their infrastructure, the benefits from mining this data also increase. Analytics becomes critical for auditing, reporting, capacity planning, prediction, root-cause analysis, and more. Prometheus's architectural philosophy is one of simplicity and extensibility. Accordingly, it does not itself provide durable, highly-available long-term storage or advanced analytics, but relies on other projects to implement this functionality. </p><p>There are existing ways to durably store Prometheus data, but while these options are useful for long-term storage, they only support the Prometheus data model and query model (limited to the PromQL query language). While these work extremely well for the simple, fast analyses found in dashboarding, alerting, and monitoring, they fall short for more sophisticated analysis, or for the ability to enrich the dataset with other sources needed for insight-generating, cross-cutting analysis.</p><h2 id="solution-promscale-scales-and-augments-prometheus-for-long-term-storage-and-analytics">Solution: Promscale scales and augments Prometheus for long-term storage and analytics</h2><p><strong>Enter Promscale. 
</strong>We built Promscale to conquer a challenge that we, and other developers, know all too well: how do we easily find answers to complex questions in our monitoring data?</p><p>Built on top of <a href="https://www.timescale.com/">TimescaleDB</a> and PostgreSQL, Promscale supports both PromQL and SQL, offers horizontal scalability to over 10 million metrics per second and petabytes of storage, supports <a href="http://3.88.198.173/blog/building-columnar-compression-in-a-row-oriented-database/">native compression</a>, <a href="http://3.88.198.173/blog/what-is-high-cardinality-how-do-time-series-databases-influxdb-timescaledb-compare/">handles high-cardinality</a>, provides rock-solid reliability, and more. It also offers other native time-series capabilities, such as data retention policies, continuous aggregate views, downsampling, data gap-filling, and interpolation. It is already natively supported by Grafana via the <a href="https://grafana.com/docs/grafana/latest/features/datasources/prometheus/">Prometheus</a> and <a href="https://grafana.com/docs/grafana/latest/features/datasources/postgres/">PostgreSQL/TimescaleDB</a> data sources.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="http://3.88.198.173/content/images/2020/10/promscale-arch.png" class="kg-image" alt="Promscale: An analytical platform and long-term store for Prometheus, with the combined power of SQL and PromQL" srcset="http://3.88.198.173/content/images/size/w600/2020/10/promscale-arch.png 600w, http://3.88.198.173/content/images/2020/10/promscale-arch.png 700w"><figcaption><em><em>Promscale architecture and how it fits into the observability stack</em></em></figcaption></figure><p>Prometheus writes data to the Promscale connector using the <code>remote_write</code> API, storing the data in TimescaleDB. The Promscale connector understands PromQL queries natively and fetches data from TimescaleDB to execute them, while SQL queries go to TimescaleDB directly. 
</p><p>Promscale is open-source, licensed under Apache 2.0. TimescaleDB is licensed under the completely free, source-available <a href="http://3.88.198.173/blog/building-open-source-business-in-cloud-era-v2/">Timescale License</a>.</p><p>Promscale stores data in a dynamically auto-generated schema highly optimized for Prometheus metrics, the result of thorough benchmarking and community discussion (as can be seen in this <a href="https://tsdb.co/prom-design-doc">design doc</a>). In particular, this schema decouples individual metrics, allowing for the collection of metrics with vastly different cardinalities and retention periods. At the same time, Promscale exposes simple, user-friendly views so that developers do not have to understand this optimized schema. </p><p>Thanks to its relational foundation, Promscale also supports a variety of data types (numerics, text, arrays, JSON, booleans), JOINs, and ACID semantics, in addition to simple metric data. Because Promscale is built on top of PostgreSQL, it is operationally mature and includes capabilities such as high-availability, streaming backups, upgrades over time, roles and permissions, and security. </p><p>Promscale also benefits from the TimescaleDB user community: tens of millions of downloads, over half a million active databases, and a 5,000+ member Slack channel.</p><h2 id="user-testimonials">User testimonials</h2><p>Although a relatively new project, Promscale is already in use by developers across the globe:</p><blockquote><em>"We have game metrics available in different data sources like Graphite, Datadog, and Cloudwatch. We are storing all of these metrics in Prometheus, with Promscale for long-term storage. Promscale lets us collate metrics from these different sources and generate a single report in a unified view so that we can have better visibility into what is happening inside our games." 
</em><br><em>— Saket K., Software Engineer, Electronic Arts</em></blockquote><blockquote><em>"Our goal is to have all of our sites from around the world monitored using Prometheus and view the resulting data in a user-friendly way. We chose Promscale to store our data because it scales, offers flexibility – for example, dividing read and write activities among different nodes – and has the operational maturity and rock-solid reliability of PostgreSQL, including streaming backups and high-availability." </em><br>—<em> Adam B., Service Specialist, Dow Chemical</em></blockquote><figure class="kg-card kg-embed-card"><blockquote class="twitter-tweet" data-width="550"><p lang="en" dir="ltr">I was super skeptical at first (especially never having used Grafana w/ Postgres) but being able to still use <a href="https://twitter.com/hashtag/PromQL?src=hash&amp;ref_src=twsrc%5Etfw">#PromQL</a> against <a href="https://twitter.com/TimescaleDB?ref_src=twsrc%5Etfw">@TimescaleDB</a> is baller af</p>&mdash; Matt (@halfmatthalfcat) <a href="https://twitter.com/halfmatthalfcat/status/1293710857545961478?ref_src=twsrc%5Etfw">August 13, 2020</a></blockquote>
<script async src="https://platform.twitter.com/widgets.js" charset="utf-8"></script>
</figure><p><strong>Install Promscale today via Helm Charts, Docker, and others. </strong><a href="https://github.com/timescale/promscale"><strong>More information on GitHub.</strong></a> (And, if you like what we are building, please give us a ⭐️ on GitHub 🤗.)</p><p>If you have a Kubernetes cluster with Helm installed, we suggest using <a href="https://github.com/timescale/tobs"><strong>tobs</strong></a> to install a full metric collection and visualization solution including Prometheus, Grafana, Promscale, and a preview version of PromLens within 5 minutes (<a href="https://www.youtube.com/watch?v=MSvBsXOI1ks&amp;feature=youtu.be">video</a>).</p><p>How to get involved with the Promscale community:</p><ul><li>For help with any technical questions, please join <a href="https://slack.timescale.com/">Timescale Slack</a> (#prometheus) and/or the <a href="https://groups.google.com/g/promscale-users">promscale-users Google Group</a>.</li><li>To participate in roadmap and product discussions and to meet the engineering team, please join the monthly <a href="http://tsdb.co/promscale-agenda">User &amp; Community Meeting</a>.</li><li>For infrequent product updates, subscribe to our <a href="https://www.timescale.com/signup/promscale">Promscale Product Updates mailing list</a>.</li></ul><p>To learn more about the origin, status, and roadmap for this project, please read on.</p><h2 id="prometheus-has-emerged-as-the-monitoring-solution-for-modern-software-systems">Prometheus has emerged as the monitoring solution for modern software systems</h2><p>Over the past few years, <a href="http://prometheus.io/">Prometheus</a>, an open-source systems monitoring and alerting toolkit that can be used to easily and cost-effectively monitor infrastructure has emerged as the monitoring solution for modern software systems.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img 
src="https://lh4.googleusercontent.com/GkJz1IGALA7GYDCuu4zLhnmcLyazfyt9dqSfjeMNiVPdbnY87Q1GUbusdWdvv8F57U2WzJTEJgd7t0HnwYRJVIt3oYvnAyLerkLz7XB1K1lBLrZTwnzGuBj2GmVEzxmgIQUqEk_u" class="kg-image" alt="Promscale: An analytical platform and long-term store for Prometheus, with the combined power of SQL and PromQL"><figcaption><a href="https://prometheus.io/docs/introduction/overview/"><em>Source: Prometheus docs</em></a></figcaption></figure><p>The key to Prometheus’ success is that it is built for modern, dynamic systems in which services start up and shut down frequently. The simple way that Prometheus collects data works extremely well with the ephemeral, churning nature of modern software architectures, and microservices in particular, because the services themselves don’t need to know anything about the monitoring system. Any service that wants to be monitored simply exposes its metrics over an HTTP endpoint. Prometheus scrapes these endpoints periodically and records the values it sees into a local time-series database. </p><p>Prometheus’ decoupled architecture makes the system as a whole much more resilient. Services don’t need the monitoring stack to be up to get work done, and the monitoring software only needs to know about individual services <em>while it’s actually scraping them</em>. This makes it easy for the monitoring system to adjust seamlessly as services fail and new ones are brought up.</p><p>This architecture also responds gracefully to overloading. While push-based architectures often drown in traffic under high load, Prometheus simply slows down its scrape loop. Thus, while your metric resolution may suffer, your monitoring system will remain up and functional.</p><p>Keeping with the theme of resilience and simplicity, Prometheus doesn’t try to store data for the long term, but rather exposes an interface allowing a dedicated database to do so instead. 
Prometheus continually pushes data to this <code>remote_write</code> interface, ensuring that metric data is durably stored. That is where external long-term storage systems come in.</p><h2 id="analytical-options-for-prometheus-data-are-lacking">Analytical options for Prometheus data are lacking</h2><p>As developers use Prometheus to collect data from more and more of their infrastructure, the benefits from mining this data also increase. Analytics becomes critical for things like auditing, reporting, capacity planning, prediction, root-cause analysis, and more. </p><p>Prometheus itself was developed with a clear sense of what it is, and is not, designed to do. <strong>Prometheus is designed to be a monitoring and alerting system; it is not a durable, highly-available long-term store of data, nor a store for other datasets, nor a sophisticated analytics engine</strong>. However, though these capabilities are not provided by Prometheus itself, they are critical for the longer-duration and more intensive usages of metric data, including auditing, reporting, capacity planning, predictive analytics, root-cause analysis, and many others. As such, Prometheus provides hooks to forward its data to an external data store more suited for these tasks.</p><p>Existing options for storing Prometheus data externally, while useful, all focus on long-term storage and, in some cases, limited forms of aggregation. 
Such systems can only store floats and perform PromQL queries, making them too limited, both in the data stored and in the query model, to perform sophisticated analytics.</p><p>In addition, as great as the Prometheus architecture is for recording data in highly dynamic environments, its method of collecting data at unaligned intervals creates challenges when analyzing data, since timestamps from multiple “simultaneous” scrapes on different endpoints can differ by a significant amount.</p><p>Prometheus devised a language called PromQL that addresses these difficulties by regularizing data at query time: aligning the data at user-specified intervals and discarding excess data points. While this method of analysis works extremely well for the simple, fast analyses found in dashboarding, alerting, and monitoring, it can be lacking for more sophisticated analysis.</p><p>For example, PromQL can’t aggregate across both series and time, making it quite difficult (if not impossible) to get accurate statistics over time for a particular label key, which is necessary for things such as determining when a memory leak was introduced by looking at 90th percentile memory usage grouped by app version across a long time span. This kind of drill-down and reaggregation is important for many kinds of analytics, because even when the data contains the information needed for the problem at hand, it often wasn’t gathered with that kind of analysis in mind. Other PromQL features, such as joins, filters, and statistics, are similarly restricted, limiting its usage in discovering trends and developing insights.</p><p>Others have also written about these issues: the CNCF <a href="https://github.com/cncf/sig-observability">SIG-Observability working group</a> has put together a <a href="https://docs.google.com/document/d/1yfbBG8MBllLRCXVNMp2fgRJm5TiCooO9NSD2CfOQ4ck/edit#heading=h.s8fdydu8wxcr">list of use-cases</a> in the observability space that need better tools for metrics analytics. 
Dan Luu, a popular tech blogger, also had a widely distributed <a href="https://danluu.com/metrics-analytics/">blog post</a> about getting more value out of your metric data.</p><p>This is where Promscale comes in.</p><h2 id="why-we-built-promscale">Why we built Promscale</h2><p>We say the market lacks a system for deep analytics of Prometheus data because we’ve felt that need while monitoring our own infrastructure. We built Promscale to conquer a challenge that we, and other developers, know all too well: how do we easily find answers to complex questions in our monitoring data?</p><p>We are big fans of Prometheus as software developers and operators – in particular, we became involved in the Prometheus ecosystem 3.5 years ago when we initially published our previous <a href="https://github.com/timescale/prometheus-postgresql-adapter">Prometheus adapter</a>, one of the first read-write adapters.</p><p>But after multiple years of use and study, we realized we needed capabilities beyond what Prometheus – and its associated tools – currently offer. </p><p><strong>In our stack, this includes things like:</strong></p><ul><li><strong>Auxiliary data about the system being monitored</strong> to augment metrics with additional information that helps us understand what they mean, such as node hardware properties, user/owner information, geographic location, or what workload it is running.</li><li><strong>Joins</strong> combining metrics with this additional auxiliary data and metadata to create a complete view of the system.</li><li><strong>Efficient long-term storage</strong> for historical analysis, such as reporting of past incidents, capacity planning, auditing, and more. 
</li><li><strong>Flexible data management</strong> to handle the large volume of data that monitoring generates, with support for multi-tenancy, automated data retention, and downsampling.</li><li><strong>Isolation between the various metrics.</strong> Since different metrics can be sent by completely different systems, we want both the performance and data management of different metrics to be independent (e.g., so that downsampling one metric won’t affect others).</li><li><strong>Logs and traces alongside metrics</strong>, to provide a better all-around view of the system. If all three modalities are in the same database, then JOINs between this data can lead to interesting insight. (To be clear, Promscale does not support logs and traces today, but this is an area of future work.)</li><li><strong>SQL</strong> as a versatile query language for those general analytics that PromQL isn’t suited for, as well as the <em>lingua franca</em> spoken by a variety of data analysis and machine learning tools.</li></ul><p>What our infrastructure team really wanted was an analytical platform on top of Prometheus to achieve more insightful and cost-effective observability into our own infrastructure. </p><p>That is what we built with Promscale.</p><h2 id="how-promscale-works">How Promscale works</h2><p><strong><em>Architecture</em></strong><br>Promscale uses the standard <code>remote_write</code> / <code>remote_read</code> Prometheus API, cleanly slotting into that space in the Prometheus stack. </p><p>Prometheus writes data to the Promscale connector using the <code>remote_write</code> API, storing the data in TimescaleDB. Promscale understands PromQL queries natively and fetches data from TimescaleDB to execute them, while SQL queries go to TimescaleDB directly. 
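</p><p>In a typical deployment, pointing Prometheus at Promscale is a small addition to <code>prometheus.yml</code>. A sketch, assuming the connector is reachable at hostname <code>promscale</code> on its default port (adjust both for your environment):</p><pre><code class="language-yaml">remote_write:
  - url: "http://promscale:9201/write"
remote_read:
  - url: "http://promscale:9201/read"</code></pre><p>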
</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="http://3.88.198.173/content/images/2020/10/promscale-arch.png" class="kg-image" alt="Promscale: An analytical platform and long-term store for Prometheus, with the combined power of SQL and PromQL" srcset="http://3.88.198.173/content/images/size/w600/2020/10/promscale-arch.png 600w, http://3.88.198.173/content/images/2020/10/promscale-arch.png 700w"><figcaption><em>Promscale architecture and how it fits into the observability stack</em></figcaption></figure><p>Promscale can be deployed in any environment running Prometheus, alongside any Prometheus instance. We provide Helm charts for easier deployments to Kubernetes environments.</p><p><strong><em>SQL interface</em></strong><br>The data stored in Promscale can be queried in both PromQL and SQL. Though the data layout we use is internally quite sophisticated (more details in this <a href="https://tsdb.co/prom-design-doc">design doc</a>), you don’t need to understand any of it to analyze metrics through our easy-to-use SQL views.</p><p>Each metric is exposed through a view named after the metric, so a measurement called <code>cpu_usage</code> is queried like:</p><pre><code class="language-sql">SELECT 
	time, 
	value, 
	jsonb(labels) as labels 
FROM "cpu_usage";</code></pre><pre><code>time                    value   labels  
2020-01-01 02:03:04	0.90   	{"namespace": "prod", "pod": "xyz"}
2020-01-01 02:03:05	0.98   	{"namespace": "dev",  "pod": "abc"}
2020-01-01 02:03:06	0.70   	{"namespace": "prod", "pod": "xyz"}</code></pre><p>The most important fields are <code>time</code>, <code>value</code>, and <code>labels</code>. </p><p><code>labels</code> is the full set of labels associated with the measurement, stored as an array of identifiers. In the query above, we view the labels in their JSON representation using the <code>jsonb()</code> function.</p><p>Each row has a <code>series_id</code> uniquely identifying the measurement’s label set. This enables efficient aggregation by series. You can easily retrieve the labels array from a <code>series_id</code> using the <code>labels(series_id)</code> function, as in this query that shows how many data points we have in each series:</p><pre><code class="language-sql">SELECT
	jsonb(labels(series_id)) as labels,
	count(*)
FROM "cpu_usage" 
GROUP BY series_id;</code></pre><pre><code>labels               				count
{"namespace": "prod", "pod": "xyz"}		1
{"namespace": "dev",  "pod": "abc"}		7
{"namespace": "prod", "pod": "xyz"}		3</code></pre><p>Each label key (in our example, <code>namespace</code> and <code>pod</code>) is expanded out into its own column storing a foreign-key identifier for its value, which allows us to JOIN, aggregate, and filter by label keys and values. You get back the text represented by a label id using the <code>val(id)</code> function. This opens up nifty possibilities such as aggregating across all series with a particular label key. For example, to determine the median CPU usage reported since the start of 2019, grouped by namespace, you could run:</p><pre><code class="language-sql">SELECT 
	val(namespace_id) as namespace, 
	percentile_cont(0.5) within group (order by value) 
AS median
FROM "cpu_usage" 
WHERE time &gt; '2019-01-01'
GROUP BY namespace_id;</code></pre><pre><code>namespace       median
prod            0.8
dev             0.7</code></pre><p>The complete view looks something like this:</p><pre><code class="language-sql">SELECT * FROM "cpu_usage";</code></pre><pre><code>time			value	labels  series_id 	namespace_id	pod_id
2020-01-01 02:03:04	0.90    {1,2} 	1		1		2
2020-01-01 02:03:05	0.98    {4,5}	2		4		5
2020-01-01 02:03:06	0.70    {1,2}	1		1		2</code></pre><p>To simplify filtering by labels, we created operators corresponding to the selectors in PromQL. Those operators are used in a <code>WHERE</code> clause of the form <code>labels ? (&lt;label_key&gt; &lt;operator&gt; &lt;pattern&gt;)</code>. The four operators are: </p><ul><li><code>==</code> matches tag values that are equal to the pattern</li><li><code>!==</code> matches tag values that are not equal to the pattern</li><li><code>==~</code> matches tag values that match the pattern regex</li><li><code>!=~</code> matches tag values that do not match the pattern regex</li></ul><p>These four matchers correspond to each of the four selectors in PromQL, though they have slightly different spellings to avoid clashing with other PostgreSQL operators. They can be combined using boolean logic with arbitrary WHERE clauses.</p><p>For example, if you want only those metrics from the production namespace or those whose pod starts with the letters "ab", you simply OR the corresponding label matchers together:</p><pre><code class="language-sql">SELECT avg(value) 
FROM "cpu_usage" 
WHERE labels ? ('namespace' == 'production') 
       OR labels ? ('pod' ==~ 'ab.*')
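-- The negated matchers compose the same way; a hypothetical variant that
-- instead excludes the dev namespace and any pod matching the regex 'tmp.*':
-- SELECT avg(value)
-- FROM "cpu_usage"
-- WHERE labels ? ('namespace' !== 'dev')
--   AND labels ? ('pod' !=~ 'tmp.*')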
</code></pre><p></p><p>Combined, these features open up all kinds of possibilities for analytics. For example, you could easily get the 99th percentile of memory usage per container in the default namespace with:</p><pre><code class="language-sql">SELECT 
  val(used.container_id) container, 
  percentile_cont(0.99) within group(order by used.value) percent_used_p99  
FROM container_memory_working_set_bytes used
WHERE labels ? ('namespace' == 'default')  
GROUP BY container 
ORDER BY percent_used_p99 ASC 
LIMIT 100;</code></pre><pre><code>container             		       percent_used_p99
promscale-drop-chunk                            1433600
prometheus-server-configmap-reload              6631424
kube-state-metrics                             11501568</code></pre><p>Or, to take a more complex example from <a href="https://danluu.com/metrics-analytics/">Dan Luu’s post</a>, you can discover Kubernetes containers that are over-provisioned by finding those containers whose 99th percentile memory utilization is low:</p><pre><code class="language-sql">WITH memory_allowed as (
  SELECT 
    labels(series_id) as labels, 
    value, 
    min(time) start_time, 
    max(time) as end_time 
  FROM container_spec_memory_limit_bytes total
  WHERE value != 0 and value != 'NaN'
  GROUP BY series_id, value
)
SELECT 
  val(memory_used.container_id) container, 
  percentile_cont(0.99) 
    within group(order by memory_used.value/memory_allowed.value) 
    AS percent_used_p99, 
  max(memory_allowed.value) max_memory_allowed
FROM container_memory_working_set_bytes AS memory_used 
INNER JOIN memory_allowed
      ON (memory_used.time &gt;= memory_allowed.start_time AND 
          memory_used.time &lt;= memory_allowed.end_time AND
          eq(memory_used.labels,memory_allowed.labels)) 
WHERE memory_used.value != 'NaN'   
GROUP BY container 
ORDER BY percent_used_p99 ASC 
LIMIT 100;</code></pre><pre><code>container			       percent_used_p99        max_memory_allowed
cluster-overprovisioner-system    6.961822509765625e-05   4294967296
sealed-secrets-controller           0.00790748596191406   1073741824
dumpster                             0.0135690307617187    268435456</code></pre><h2 id="demo-">Demo!</h2><p>In this 15-minute demo video, Avthar shows you how Promscale handles SQL and PromQL queries, via the terminal and Grafana.</p><figure class="kg-card kg-embed-card"><iframe width="612" height="344" src="https://www.youtube.com/embed/FWZju1De5lc?feature=oembed" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe></figure><h2 id="getting-started">Getting Started</h2><p><strong>Install Promscale today via Helm charts, Docker, and other methods. </strong><a href="https://github.com/timescale/promscale"><strong>More information on GitHub.</strong></a> (And if you like what we are building, please give us a ⭐️ on GitHub 🤗.)</p><p>If you have a Kubernetes cluster with Helm installed, we suggest using <a href="https://github.com/timescale/tobs"><strong>tobs</strong></a> to install a full metric collection and visualization solution including Prometheus, Grafana, Promscale, and a preview version of PromLens in under 5 minutes:</p><figure class="kg-card kg-embed-card"><iframe width="612" height="344" src="https://www.youtube.com/embed/MSvBsXOI1ks?feature=oembed" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe></figure><p>If you already have Prometheus installed and/or aren’t using Kubernetes, see our <a href="https://github.com/timescale/timescale-prometheus">README</a> for various installation options. 
</p><p><strong>How to get involved in the Promscale community:</strong></p><ul><li>For help with any technical questions, please join <a href="https://slack.timescale.com/">Timescale Slack</a> (#prometheus) and/or the <a href="https://groups.google.com/g/promscale-users">promscale-users Google Group</a>.</li><li>To participate in roadmap and product discussions and to meet the engineering team, please join the monthly <a href="http://tsdb.co/promscale-agenda">User &amp; Community Meeting</a>.</li><li>For infrequent product updates, subscribe to our <a href="https://tsdb.co/design-prom-newsletter">Promscale Product Updates mailing list</a>.</li></ul>]]></content:encoded></item><item><title><![CDATA[How FlightAware fuels flight prediction models for global travelers with TimescaleDB and Grafana]]></title><description><![CDATA[Learn how FlightAware architected a monitoring system - combining  TimescaleDB, Grafana, and Docker  - that allows them to power real-time flight predictions, analyze prediction performance, and continuously improve their models.]]></description><link>http://3.88.198.173:80/how-flightaware-fuels-flight-prediction-models-with-timescaledb-and-grafana/</link><guid isPermaLink="false">5fdb982d8946b00cbd801376</guid><category><![CDATA[Dev Q&A]]></category><dc:creator><![CDATA[Caroline Rodewig]]></dc:creator><pubDate>Mon, 05 Oct 2020 17:38:11 GMT</pubDate><media:content url="http://3.88.198.173/content/images/2020/09/ricardo-cruz-TCx-2yADFJ4-unsplash.jpg" medium="image"/><content:encoded><![CDATA[<img src="http://3.88.198.173/content/images/2020/09/ricardo-cruz-TCx-2yADFJ4-unsplash.jpg" alt="How FlightAware fuels flight prediction models for global travelers with TimescaleDB and Grafana"><p><em>This is an installment of our “Community Member Spotlight” series, where we invite our customers to share their work, shining a light on their success and inspiring others with new ways to use technology to solve problems.</em></p><p><br><em>In this edition, 
</em><a href="https://www.linkedin.com/in/caroline-rodewig-3ba81095/">Caroline Rodewig</a><em>, Senior Software Engineer and Predict Crew Lead at FlightAware,  joins us to share how they’ve architected a monitoring system that allows them to power real-time flight predictions, analyze prediction performance, and continuously improve their models.</em></p><p><a href="https://flightaware.com/">FlightAware</a> is the world's largest <a href="https://thepointsguy.com/news/how-flight-tracking-site-flightaware-works-for-consumers-and-airlines/">flight tracking and data platform</a>; we fuse hundreds of global data sources to produce an accurate, consistent view of flights around the world. We make this data available to users through web and mobile applications, as well as different APIs.</p><p>Our customers cover a number of different segments, including:</p><ul><li><strong>Travelers / aviation enthusiasts</strong> who use our website and mobile apps to track flights (e.g., using our “where’s my flight?” program).</li><li><strong>Business aviation providers </strong>(such as <a href="https://en.wikipedia.org/wiki/Fixed-base_operator">Fixed Base Operators</a> or <a href="https://en.wikipedia.org/wiki/Air_operator%27s_certificate">aircraft operators</a>) who use flight-tracking data and custom reporting to support their businesses.</li><li><strong>Airlines </strong>that use flight-tracking data or our predictive applications to operate more efficiently.</li></ul><p><em>Editor’s Note: for more information about FlightAware’s products (and ways to harness its data infrastructure), check out </em><a href="https://flightaware.com/commercial/"><em>this overview</em></a><em>. Want to build your own flight tracking receiver and ground station? 
See </em><a href="https://flightaware.com/adsb/piaware/build"><em>FlightAware’s PiAware tutorial.</em></a><em> </em><br></p><h2 id="about-the-team">About the team</h2><p>The Predictive Technologies <a href="https://flightaware.engineering/crews-wings-and-alliances-part-1-the-principles-of-how-we-work/">crew</a> is responsible for FlightAware's predictive applications, which as a whole are called "FlightAware Foresight." At the moment, our small-but-mighty team is made up of only three people: our project manager <a href="https://www.linkedin.com/in/jamesparkman/">James Parkman</a>, software engineer Andrew Brooks, and myself. We each wear many different hats; a day's work can cover anything from Tier 2 customer support to R&amp;D, and everything in between. </p><p>A former crew member, <a href="https://www.linkedin.com/in/diorgetavares/">Diorge Tavares</a>, wrote a <a href="https://flightaware.engineering/flying-through-the-clouds-flightawares-journey-into-machine-learning-at-scale/">cool article</a> about his experience as a site reliability engineer embedded in the Predict crew. He helped us design infrastructure and led our foray into cloud computing; now that our team is more established, he’s moved back to the FlightAware Systems team full-time.</p><h2 id="about-the-project">About the project</h2><p>Our team's chief project is predicting flight arrival times, or ETAs; we predict both landing (EON) and gate arrival (EIN) times. And, ultimately, we need to monitor, visualize, and alarm on the <em>quality </em>of those predictions. This is where TimescaleDB fits in.</p><p>Not only should we track how our prediction error changes over the course of each flight, we also need to track how our error changes over months - or years! - to ensure we're continually improving our predictions. 
Our predictive models can have short bursts of inaccuracy - like failing to anticipate the impact of a huge storm - but they can also drift slowly over time as real-life behaviors change.</p><p>As an example of the type of data we extract, the below is our "Worst Flights" dashboard, which we use for QA. (Looking through outliers is an easy way to spot bugs.) The rightmost column compares our error to third-parties', so we can see how we're doing relative to the rest of the industry.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://lh6.googleusercontent.com/cBiDUSW7hgiMBH__1J-_EaWHtBImE6qQZe_LUOlo2iZzSh0qBxiUWPvENdUYb2u6K0DKyJriWe6bqBLRsxpTCE-uHBKoGmA0mT1SF_eGsMUBJrl29cprhirovN84koXShkAs_HQd" class="kg-image" alt="How FlightAware fuels flight prediction models for global travelers with TimescaleDB and Grafana"><figcaption><em>Our Grafana dashboard for tracking "Worst Flights" and our prediction quality vs. other data sources</em></figcaption></figure><p>But, we also go deep into specific flights, like the below "Single Flight" dashboard view.<strong> </strong>This is useful for debugging, as it gives a detailed picture of how our predictions changed over the course of a <em>single</em> flight.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://lh5.googleusercontent.com/zUaoFRavv50hPWX4jThvXsvjj5MRTPKylXv4uW7K_FU1NpVSH31RS5onYFvGgtYL2Bo-KmWGoQF5YXBOgI-H0BayIN7FWUSub6i9Zc-ORC1lA7JFLrvViCQzAvfRl1XtF1EFLl79" class="kg-image" alt="How FlightAware fuels flight prediction models for global travelers with TimescaleDB and Grafana"><figcaption><em>Our Grafana dashboard for debugging and assessing our prediction quality at the individual flight level</em></figcaption></figure><h2 id="choosing-and-using-timescaledb">Choosing (and using) TimescaleDB</h2><p>We tested out several different monitoring setups before settling on TimescaleDB and Grafana. 
We recently published a <a href="https://flightaware.engineering/systems-monitoring-with-prometheus-grafana/">blog post</a> detailing our quest for a monitoring system, which I’ve summarized below.</p><p>First, we considered using <a href="https://www.zabbix.com">Zabbix</a>; it's widely used at FlightAware, where most software reports into Zabbix in one way or another. However, we quickly realized that Zabbix was not the tool for the job – our Systems crew had serious doubts that Zabbix would be able to handle the load of all the metrics we wanted to track: </p><p><strong>We make predictions for around 75,000 flights per day; if we only stored two error values per flight (much fewer than we wanted), it would require making 100 inserts per minute.</strong></p><p>After ruling out Zabbix, I started looking at <a href="https://grafana.com/">Grafana</a> as a visualization and alerting tool, and it seemed to have all the capabilities we needed. For my database backend, I first picked <a href="https://prometheus.io">Prometheus</a>, because it was near the top of Grafana's "supported databases" list and its built-in visualization capabilities seemed promising for rapid development. </p><p>I didn't know much about time-series databases, and, while Prometheus is a good fit for some data, it really didn't fit mine well: </p><ul><li><strong>No JOINs</strong>. My only prior database experience was with PostgreSQL, and it didn't occur to me that some databases just wouldn't support JOINs. While we <em>could </em>have worked around this issue by inserting specific, already-joined error metrics, this would have limited the flexibility and "query-a-bility" of the data.</li><li><strong>Number of labels to store</strong>. At the bare minimum, we wanted to store EON and EIN predictions for 600 airports, at least 10 times throughout each flight. 
This works out to 12,000 different label combinations, each stored as a time series – which <a href="https://prometheus.io/docs/practices/instrumentation/#do-not-overuse-labels">Prometheus is not currently designed to handle</a>.</li></ul><p>And, that’s when I found TimescaleDB. A number of factors went into our decision to use TimescaleDB, but here are the top four:</p><ul><li><strong>Excellent performance. </strong><a href="http://3.88.198.173/blog/timescaledb-vs-6a696248104e/">This article comparing TimescaleDB vs. PostgreSQL performance</a> really impressed me. Getting consistent performance, despite the number of rows in the table, was critical to our goal of storing performance data over several years.</li><li><strong>Institutional knowledge. </strong>FlightAware uses PostgreSQL in a vast number of applications, so there was already a lot of institutional knowledge and comfort with SQL. </li><li><strong>Impressive documentation.</strong> I have yet to have an issue or question that wasn't discussed and answered in the docs. Plus, it was trivial to test out – I love one-line docker start-up commands (see <a href="https://docs.timescale.com/latest/getting-started/installation/docker/installation-docker">TimescaleDB Docker Installation instructions</a>).</li><li><strong>Grafana support. 
</strong>I was pretty confident that I wanted to use Grafana to visualize our data and power our dashboards, so this was a potential dealbreaker.</li></ul><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://lh5.googleusercontent.com/9Zn7avVqTZPgvQVjnAWWaeeg-Ahl-P5wUxJj3jmR22sx5GsQVTrQPVQHtQRGSOp2soS3NtN7c4hO1IuMGiAIQln-5vlHuD2IwIYd3R6TbbdVJRJgVq1FDJKaKkrNCOGEPFWXSR-P" class="kg-image" alt="How FlightAware fuels flight prediction models for global travelers with TimescaleDB and Grafana"><figcaption><em>We use several Grafana dashboards, like this one, to view detailed performance over time (average error trends over one or more airports)</em></figcaption></figure><p><em>Editor’s Note: To learn more about TimescaleDB and Grafana, </em><a href="https://docs.timescale.com/latest/tutorials/tutorial-grafana"><em>see our Grafana tutorials</em></a><em> (5 step-by-step guides for building visualizations, using variables, setting up alerts, and more) and </em><a href="http://3.88.198.173/tag/grafana/"><em>Grafana how-to blog posts</em></a><em>. </em></p><p><em>To see how to use TimescaleDB to perform time-series forecasting and analysis, </em><a href="https://docs.timescale.com/latest/tutorials/tutorial-forecasting"><em>check out our time-series forecasting tutorial</em></a><em> (includes two forecasting methods, best practices, and sample queries). </em><br></p><h2 id="current-deployment-use-cases">Current deployment &amp; use cases</h2><p>Our architecture is pretty simple (see diagram below). We run a copy of this setup in several environments: production, production hot-standby, staging, and test. Each environment has its own predictions database, which allows us to compare our predictions in staging to those in production and validate changes <em>before</em> they get released.  
</p><p>⭐ <strong>Pro tip:</strong> we periodically sync Grafana configurations from production to each of the other environments, which reduces the manual work involved in updating dashboards across instances.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://lh3.googleusercontent.com/qYwA6PcHPIb6jlbxPMT2XLUhpw1QhyEzN8-WCB5Q3VWQErlC8a8UNJ2CxPp2bUOAwa6PTZvtEcVbhGA-ohB4_9Rln1R9F9xDplYz9AiA0TtKhjMJZbMuwvqdaREPpu3xNK6hg3Xd" class="kg-image" alt="How FlightAware fuels flight prediction models for global travelers with TimescaleDB and Grafana"><figcaption>FlightAware Predict team's system architecture, which uses custom Python programs, Docker, Grafana, and TimescaleDB</figcaption></figure><p>After some trial and error, we’ve set up our TimescaleDB schema as follows:</p><p><strong>(1) Short-term (1 week) tables for arrivals, our own predictions, and third-party predictions.</strong> The predict-assessor program reads our flight data feed, extracts ETA predictions and arrival times, and inserts them into the database. For scale, the arrivals table typically contains 500k rows, and the predictions tables each contain 5M rows.</p><ul><li>Each table is chunked: arrivals by arrival time and predictions by the time the prediction was made.</li><li>We use archiving functions to copy some data into long-term storage, and <a href="https://docs.timescale.com/latest/api#drop_chunks">a <code>drop_chunks</code> policy</a> to ensure that rows older than one week are dropped to prevent unlimited table growth.</li></ul><p><strong>(2) Long-term (permanent) table for prediction and prediction-error data. </strong>Archiving functions move data to the long-term table by joining the short-term tables together. 
They also "threshold" the data to reduce verbosity, by only storing predictions at predetermined intervals; i.e., predictions that were present 1 and 2 hours before arrival are migrated to long-term tables, but intermediate predictions (i.e., at 1.5 hours before arrival) are not kept.</p><ul><li>Between the join and the threshold, the archiving process reduces the average number of rows per flight from <strong>25 </strong>(across 3 short-term tables) to <strong>6</strong>!</li><li>We haven’t enabled a <code>drop_chunks</code> policy on this table as of now; after ~9 months of running this setup, our database file is pretty manageable at 54GB. If we start having space issues, we'd opt to store fewer predictions per flight rather than lose any year-over-year historical data.</li></ul><h2 id="biggest-aha-moment"><br>Biggest "Aha!" moment</h2><p></p><blockquote>Continuous aggregates are what well and truly sold me on TimescaleDB. We went from 6.4 seconds to execute a query to 30<em>ms.</em> Yes, milliseconds.</blockquote><p>I was embarrassingly late to the party when it comes to continuous aggregates. When I first set up our database, every query was fast because the database was small. However, as we added data over time, some queries slowed down significantly. </p><p>The biggest offender was a query on our KPIs dashboard, visualized in Grafana below. This graph gives us a bird's-eye view of error over time. The blue line represents the average error for all airports at a certain time before arrival; the red line shows the number of flights per day. 
(You can see the huge traffic drop when airlines stopped flights in March, due to the COVID-19 pandemic.)</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://lh6.googleusercontent.com/_kkVyjnluXw4x3VxNfp_sNJVnyw6PGsPvAXCOWrv9efpIT7WOFfyRxbHbmw1ZQb6exubbWJ37aSZpKbdhcglOE-HQOxTWPR8kgX46zUs8M-bSLdksG0xZKUAyN5Q1oOBC4BfVhBU" class="kg-image" alt="How FlightAware fuels flight prediction models for global travelers with TimescaleDB and Grafana"><figcaption><em>Our KPI dashboard includes various metrics, including our average error rate and total flights per day across all airports</em></figcaption></figure><p><strong><em>Before</em> learning about continuous aggregates, the query to extract this data looked like this:</strong></p><pre><code class="language-SQL">SELECT
  time_bucket('1 day', arr_time) AS "time",
  AVG(get_error(prediction_fa, arr_time)) AS on_error,
  count(*) AS on_count
FROM prediction_history
WHERE 
  time_out = '02:00:00' AND 
  arr_time BETWEEN '2020-03-01' AND '2020-09-05'
GROUP BY 1
ORDER BY 1</code></pre><p></p><p>It took 6.4 seconds and aggregated 1.6M rows, from a table of 147M rows.</p><p>For what the query was doing, this runtime wasn't too bad – the table was chunked by <code>arr_time</code>, which the query planner could take advantage of. </p><p>I considered adding indexes to make the query faster, but wasn't convinced they would help much and was concerned about the resulting performance penalties for inserts.</p><p>I also considered creating a materialized view to aggregate the data and writing a cron job to regularly refresh it...but that seemed like a hassle, and after all, I could wait 10 seconds for something to load 🤷‍♀️.</p><p>Then, I discovered TimescaleDB's continuous aggregates! For the unfamiliar, they basically implement that regularly-refreshing materialized view idea, but in a far smarter way and with a bunch of cool extra features. </p><p><strong>Here's the view for the continuous aggregate:</strong></p><pre><code class="language-SQL">CREATE VIEW error_by_time_out
WITH (timescaledb.continuous) AS
  SELECT
    time_out,
    time_bucket(INTERVAL '1 hour', arr_time) AS bucket,
    AVG(get_error(prediction_fa, arr_time)) AS avg_error,
    COUNT(*) AS count
  FROM prediction_history
  GROUP BY time_out, bucket;</code></pre><p></p><p>The new data extraction query is a little bit harder to parse, because the error needs to be aggregated across continuous aggregate buckets:</p><pre><code class="language-SQL">SELECT
  time_bucket('1 day', bucket) AS "time",
  SUM(avg_error * count) / SUM(count) AS error,
  SUM(count) AS count
FROM error_by_time_out
WHERE 
  time_out = '02:00:00' AND 
  bucket BETWEEN '2020-03-01' AND '2020-09-05'
GROUP BY 1
ORDER BY 1</code></pre><p></p><p>...and I'll let you guess how long it takes....</p><p>30ms. <strong>Yes, milliseconds. We went from 6.4 seconds to execute the query to 30</strong><em><strong>ms</strong>.</em><strong> </strong></p><p>On top of that, unlike in a classic materialized view, the whole view doesn't have to be recalculated every time it needs to be updated - <em>just the parts that have changed.</em> This means refreshes are lightning fast too.</p><p><strong>Continuous aggregates are what well and truly sold me on TimescaleDB.</strong></p><p>The amazing developers at Timescale recently made continuous aggregates even better through "real-time" aggregates. These will automatically fill in data between the last view refresh and real-time when they're queried, so you always get the most up-to-date data possible. Unfortunately, our database is a few versions behind so we're not using real-time aggregates yet, but I can't wait to upgrade and start using them.</p><p><em>Editor’s Note: To learn more about real-time aggregates and how they work, see our </em><a href="http://3.88.198.173/blog/achieving-the-best-of-both-worlds-ensuring-up-to-date-results-with-real-time-aggregation/"><em>“Ensuring up-to-date results with Real-Time Aggregations” blog and mini-tutorial</em></a><em> (includes benchmarks, example scenarios, and resources to get started). 
</em></p><h2 id="getting-started-advice-resources">Getting started advice &amp; resources</h2><p>In addition to the documentation I’ve linked throughout this post, I'd recommend doing what I did: reading <a href="https://docs.timescale.com/">the TimescaleDB docs</a>, spinning up a test database, and going to town.</p><p>And, after a few months of use, make sure to go back and read the docs again – you'll discover all sorts of new things to try to make your database even faster (looking at you, <a href="https://github.com/timescale/timescaledb-tune">timescaledb-tune</a>)!</p><p><em>Editor’s Note: If you’d like to follow Caroline’s advice and start testing TimescaleDB for yourself, </em><a href="https://www.timescale.com/forge"><em>Timescale Forge</em></a><em> is the fastest way to get up and running - 100% free for 30 days, no credit card required. You can see self-managed and other hosted options </em><a href="https://www.timescale.com/products"><em>here</em></a><em>.</em></p><p><em>To learn more about timescaledb-tune, </em><a href="https://docs.timescale.com/latest/getting-started/configuring"><em>see our Configuring TimescaleDB documentation</em></a><em>.</em><br></p><p></p><p><em>We’d like to thank Caroline and the FlightAware team for sharing their story, as well as for their work to make accurate, reliable flight data available to travelers, aviation enthusiasts, and operators everywhere. We’re big fans of FlightAware at Team Timescale, and we’re honored to have them as members of our community!</em></p><p><em>We’re always keen to feature new community projects and stories on our blog. 
If you have a story or project you’d like to share, reach out on Slack (</em><a href="https://timescaledb.slack.com/archives/DPFTYT9E0"><em>@lacey butler</em></a><em>), and we’ll go from there.</em></p><p><em>Additionally, if you’re looking for more ways to get involved and show your expertise, check out the </em><a href="https://www.timescale.com/timescale-heroes/"><em>Timescale Heroes</em></a><em> program.</em><br></p>]]></content:encoded></item><item><title><![CDATA[Timescale Newsletter Roundup: September 2020 Edition]]></title><description><![CDATA[See what's new from Team Timescale, including big news about our Timescale License - updated to give more rights to our users, enterprise features 100% free, and more! - TimescaleDB 2.0 beta releases,  new technical content, and PostgreSQL pro tips 🚀.]]></description><link>http://3.88.198.173:80/timescale-newsletter-roundup-september-edition/</link><guid isPermaLink="false">5fdb982d8946b00cbd801375</guid><category><![CDATA[Newsletter]]></category><dc:creator><![CDATA[Lacey Butler]]></dc:creator><pubDate>Tue, 29 Sep 2020 16:04:52 GMT</pubDate><media:content url="http://3.88.198.173/content/images/2020/09/annie-spratt-mJSj_VY9Rfw-unsplash.jpg" medium="image"/><content:encoded><![CDATA[<img src="http://3.88.198.173/content/images/2020/09/annie-spratt-mJSj_VY9Rfw-unsplash.jpg" alt="Timescale Newsletter Roundup: September 2020 Edition"><p>See what's new from Team Timescale, including big news about the Timescale License - updated to give more rights to our users, enterprise features 100% free, and more! - TimescaleDB 2.0 beta releases, new technical content, and PostgreSQL pro tips 🚀.</p><p>We’re always releasing new features, creating documentation and tutorials, and hosting virtual sessions to help developers do amazing things with their data. 
And, to make it easy for our community members to discover and get the resources they need to power their projects, teams, or business with analytics, we round up our favorite pieces in our biweekly newsletter.</p><p>We’re on a mission to teach the world about time-series data, supporting and growing communities around the world. And, sharing educational resources as broadly as possible is one way to do just that :).</p><p><strong>Here’s a snapshot of the content we shared with our readers this month (<a href="https://www.timescale.com/signup/newsletter">subscribe</a> to get updates straight to your inbox).</strong></p><h2 id="product-updates-announcements"><strong>Product updates &amp; announcements</strong></h2><p><strong>[NEW]:</strong><a href="http://3.88.198.173/blog/building-open-source-business-in-cloud-era-v2/"><strong> How we’re building a self-sustaining open-source business in the cloud era (v2)</strong></a><strong> &gt;&gt;</strong></p><p>We just announced updates to our Timescale License - which governs some of our most advanced features, like compression, continuous aggregates, and multi-node - to provide expanded rights to all of our users (YOU!). 
Ajay and Mike detail how we’ve taken your feedback to heart and liberalized our software license – and why we think all open-source businesses should adopt a similar model.</p><ul><li>📰  <a href="https://news.ycombinator.com/item?id=24579905">See the resulting Hacker News discussion </a>(200+ comments).</li><li>🙋 Have feedback or questions? <a href="https://slack.timescale.com/">Let us know on Slack</a>.</li></ul><figure class="kg-card kg-image-card kg-card-hascaption"><img src="http://3.88.198.173/content/images/2020/09/image-2.png" class="kg-image" alt="Timescale Newsletter Roundup: September 2020 Edition" srcset="http://3.88.198.173/content/images/size/w600/2020/09/image-2.png 600w, http://3.88.198.173/content/images/size/w1000/2020/09/image-2.png 1000w, http://3.88.198.173/content/images/2020/09/image-2.png 1500w" sizes="(min-width: 720px) 720px"><figcaption><a href="http://3.88.198.173/blog/building-open-source-business-in-cloud-era-v2/">Get all the Timescale License update details in our founders' blog post</a></figcaption></figure><p><strong>[Product Update #1]: </strong><a href="https://docs.google.com/document/d/1zuhDCmhmr48oJJAj4-0W_zRWZoDaV7DQGsyiJOr5ZGQ/edit#"><strong>TimescaleDB 2.0 Beta-6 </strong></a><strong>&gt;&gt;</strong></p><p>We’re working toward a TimescaleDB 2.0 Release Candidate, and our latest beta version focuses on TimescaleDB capabilities related to<em> single</em>-node operation, including updated APIs for continuous aggregates (greater control!), user-defined jobs, and better informational views.</p><ul><li>🚀  <a href="https://docs.google.com/document/d/1zuhDCmhmr48oJJAj4-0W_zRWZoDaV7DQGsyiJOr5ZGQ/edit#">Get installation instructions and details about what’s new</a>.</li><li>💬  <a href="https://timescaledb.slack.com/archives/C68KG4F6G/p1600268694047500">See our Public Slack announcement</a> for a summary of changes and resources.</li></ul><p><strong>[Product Update #2]:</strong> <a 
href="https://github.com/timescale/timescaledb/releases"><strong>TimescaleDB 1.7.4 </strong></a><strong>&gt;&gt;</strong></p><p>We released two maintenance releases in short order: 1.7.3 fixes issues in a few core capabilities - <a href="https://docs.timescale.com/latest/using-timescaledb/data-retention/?utm_source=timescale-release-notes&amp;utm_medium=email&amp;utm_campaign=1-7-3-release&amp;utm_content=data-retention-docs">drop_chunks</a> (data retention policies), <a href="https://docs.timescale.com/latest/using-timescaledb/compression/?utm_source=timescale-release-notes&amp;utm_medium=email&amp;utm_campaign=1-7-3-release&amp;utm_content=compression-docs">compression</a>, and the background worker scheduler - and 1.7.4 fixes an issue for users who’ve deployed TimescaleDB replicas.</p><h2 id="new-technical-content-videos-tutorials"><strong>New technical content, videos &amp; tutorials</strong></h2><p><strong>[Session Replay]: Postgres Pro Tips Part II: </strong><a href="https://youtu.be/mjGE_kEg4Oc"><strong>5 Powerful PostgreSQL Functions for Monitoring &amp; Analytics </strong></a><strong>&gt;&gt;</strong></p><p>In yet another demo-packed session, <a href="https://twitter.com/avthars">@avthars</a> shows you how to build queries for common DevOps scenarios, including things like calculating averages and deltas, addressing missing data, and more.</p><ul><li>🔥  <a href="https://github.com/timescale/examples/tree/master/air-quality">Get the sample app and demo dataset</a> (Python script) </li><li>🏆 <a href="https://docs.timescale.com/latest/using-timescaledb/reading-data#advanced-analytics">Check out 10+ advanced analytics functions</a> (TimescaleDB docs)</li></ul><figure class="kg-card kg-image-card kg-card-hascaption"><img src="http://3.88.198.173/content/images/2020/09/image-3.png" class="kg-image" alt="Timescale Newsletter Roundup: September 2020 
Edition" srcset="http://3.88.198.173/content/images/size/w600/2020/09/image-3.png 600w, http://3.88.198.173/content/images/2020/09/image-3.png 900w" sizes="(min-width: 720px) 720px"><figcaption>Sneak peek of what you'll learn in Avthar's <a href="https://youtu.be/mjGE_kEg4Oc">"5 PostgreSQL Functions for Monitoring &amp; Analytics" session replay</a>.&nbsp;</figcaption></figure><p><strong>[Postgres Pro Tips]: </strong><a href="http://3.88.198.173/blog/top-5-postgresql-extensions/"><strong>Top 5 PostgreSQL Extensions</strong></a><strong> &gt;&gt;</strong></p><p>We love PostgreSQL for many reasons, but a big one is its extensibility and vast ecosystem of 20K+ extensions. We round up a few of our favorites, explaining why you’d use each one and how to install it, plus a few sample queries and pro tips to get you started.</p><ul><li>🔎 <a href="https://twitter.com/avthars/status/1306986955272597506">See @avthars’ Twitter thread</a> for an at-a-glance breakdown (and chime in!).</li><li>🙏 Thank you to all of the TimescaleDB community members who recommended their favorites.</li></ul><h2 id="new-remote-friendly-events-community"><strong>New #remote-friendly events &amp; community</strong></h2><p><strong>[IoT + Time-Series]: </strong><a href="https://youtu.be/r4RXNDdeJhw"><strong>Combining the Power of IoT &amp; Time-Series Data Session Replay</strong></a><strong> &gt;&gt;</strong></p><p>Learn how Team Grillo builds earthquake early warning systems - used all around the world to warn communities - and how to get involved in their new open source initiative: OpenEEW. 
From there, Mario breaks down what IIoT is and why it's unique, various data models, and when and why to use each type.</p><ul><li>🙏 to our guest speakers: <a href="https://www.linkedin.com/in/andresmeira/">Andy Meira</a> (Grillo Founder) and <a href="https://twitter.com/marioishikawa">Mario Ishikawa </a>(PackIOT CTO &amp; Timescale Hero)</li></ul><p><strong>[Community Spotlight] </strong><a href="https://www.timescale.com/case-studies/k6/"><strong>How k6 delivers high-performance load testing </strong></a><strong>&gt;&gt;</strong></p><p>Kudos to our friends at k6 for their work to build resilient, reliable load and performance testing for developers everywhere. See how they’ve designed their data stack to support their ever-growing customer base and massive amounts of time-series data, now and in the future.</p><ul><li>📣 Have a story to share? <a href="https://slack.timescale.com/">Reach out on Slack</a> or <a href="https://twitter.com/TimescaleDB">@TimescaleDB</a> and we’ll make it happen.</li></ul><p><strong>[Community Q &amp; A]: <a href="https://www.timescale.com/office-hours">Join us for Office Hours on Tues, October 6th</a> &gt;&gt;</strong></p><p>Our monthly Office Hours series continues! 
Anyone and everyone is welcome, whether you’re new to TimescaleDB, an experienced database pro, or somewhere in the middle – our technical team is happy to answer any and all questions.</p><ul><li>👉  <a href="https://www.timescale.com/office-hours">Reserve your spot on Tuesday, Oct 6th</a> (space is limited).</li><li>💬  If you can’t join, but have a question, <a href="https://slack.timescale.com/">reach out to our engineering team on Slack</a>.</li></ul><h2 id="timescaledb-tips-reading-list-etc-">TimescaleDB tips, reading list &amp; etc.</h2><figure class="kg-card kg-image-card kg-card-hascaption"><img src="http://3.88.198.173/content/images/2020/09/image-4.png" class="kg-image" alt="Timescale Newsletter Roundup: September 2020 Edition" srcset="http://3.88.198.173/content/images/size/w600/2020/09/image-4.png 600w, http://3.88.198.173/content/images/size/w1000/2020/09/image-4.png 1000w, http://3.88.198.173/content/images/2020/09/image-4.png 1379w" sizes="(min-width: 720px) 720px"><figcaption>Example of the visualizations you'll learn how to build in <a href="https://docs.timescale.com/latest/tutorials/visualizing-time-series-data-in-tableau/">our step-by-step Tableau tutorial</a>.</figcaption></figure><p><strong>[TimescaleDB Tip #1]: </strong><a href="https://docs.timescale.com/latest/tutorials/visualizing-time-series-data-in-tableau/"><strong>Connect your Tableau data in just a few clicks</strong></a><strong> &gt;&gt;</strong></p><p>Check out this tutorial to get up and running in 3 steps. 
You’ll connect to your TimescaleDB database, then get sample SQL queries and advice for examining time-series data.</p><p><strong>[TimescaleDB Tip #2]: </strong><a href="https://docs.timescale.com/latest/using-timescaledb/schema-management/"><strong>Get up and running with schema design best practices</strong></a><strong> &gt;&gt;</strong></p><p>The right PostgreSQL table schema can be the difference between significant performance improvements and significant performance degradation. Use this guide to get detailed best practices and examples to create the best indexes, triggers, and constraints for your projects.</p><p><strong>[Reading List]: </strong><a href="http://3.88.198.173/blog/better-database-performance-using-timescaledb-tune-fbd7ae7016fa/"><strong>A tuned database means better read and write performance</strong></a><strong>* &gt;&gt;</strong></p><p>We built timescaledb-tune to help you get the best configuration for your unique setup, and, in this classic post, we share what it is, how it works, and how to put it to use.</p><ul><li>🔧 <a href="https://github.com/timescale/timescaledb-tune">Get timescaledb-tune on GitHub</a>.</li><li>🏁 <a href="https://docs.timescale.com/latest/getting-started/configuring#ts-tune">See our configuration docs</a> for additional tips.</li><li>*~1M metrics/second and 3x faster queries in our benchmark testing.</li></ul><p><strong>[Reading List]: </strong><a href="http://3.88.198.173/blog/sql-functions-for-time-series-analysis/"><strong>Using SQL functions for time-series analysis</strong></a><strong> &gt;&gt;</strong></p><p>TimescaleDB is designed to handle advanced time-series workloads, including special SQL functions optimized for time-series analytics. 
Our product team shares how we built <code>time_bucket</code> and <code>time_bucket_gapfill</code> to solve the all-too-common problem of missing - or messy - data, how each one works, and where to get started.</p><p><strong>[Time-Series Fun]: </strong><a href="http://3.88.198.173/blog/how-to-benchmark-iot-time-series-workloads-in-a-production-environment/"><strong>Compare database performance with the Time-Series Benchmarking Suite</strong></a><strong> (TSBS) &gt;&gt;</strong></p><p>Take the guesswork out of benchmarking: learn how to use TSBS to generate realistic real-world datasets that mimic production workloads and compare read + write performance across various time-series databases. This post gives step-by-step instructions for the IoT dataset and a few sample queries to inspire your own analysis.</p><ul><li><a href="https://github.com/timescale/tsbs">Get TSBS on GitHub</a> - available for IoT and DevOps scenarios.</li></ul><p><strong>[Team Timescale Fun #1]: </strong><a href="https://shop.timescale.com/"><strong>NEW Timescale Shop Stickers and Limited-Edition T-shirts</strong></a><strong> &gt;&gt;</strong></p><p>Timescale Shop is chock-full of classic Timescale swag and limited edition items, and we just released some fresh new designs featuring Eon, our adorable Tiger mascot. </p><ul><li>From sunglass-adorned Eon to showing support for #BLM and Pride, there’s something for everyone</li><li> Something missing? Let us know and we’ll add it to our backlog. 
</li></ul><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://lh6.googleusercontent.com/57BGnaOmw8jPznKXY1hi5fCOnMdCT3MQg0k6W1515vTmPlOZ10zGGub4bNO7hFS7vCE0kacPuRfT-1gT7lQVDw1sqt8mRdWqjC8MHO9DfqYH0bDFnCj7zkrfo1dwvzXqqGqdGs-5" class="kg-image" alt="Timescale Newsletter Roundup: September 2020 Edition"><figcaption><a href="https://shop.timescale.com/">See all designs</a> – and we'll continue to send Eon on new adventures!</figcaption></figure><p><strong>[Team Timescale Fun #2]: Timescale People Manager Mel continues to dial up her async challenge game.</strong></p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://lh6.googleusercontent.com/XL_2sn_SYFQwSg3T-wPYCB0FbYcZP6n_nHLFshE6bOWlLigW23lsUv8JuCPEeZ7E38nOrvYPdw_QZsK2zKEexGEhZ11bognaM7DMWxCB_GHx-PScRMNqwyOJXHrYFDbp3sHmU63U" class="kg-image" alt="Timescale Newsletter Roundup: September 2020 Edition"><figcaption>🍦Exhibit A: the great ice cream flavor debate (the jury's out on which one wins).</figcaption></figure><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://lh6.googleusercontent.com/4c_UfgGZE9fhTgtqbJzq_Hl7NDuPxtyUTI7GqT1eiT2dnMwe5kuzHhcWjCnEPnqoetXWczuo2cWisTiTVu3jiry_Ky2lj1ceRZy0SDK1lNYZQgrF-jHzdPOI89dHuIXsfoBZqdwJ" class="kg-image" alt="Timescale Newsletter Roundup: September 2020 Edition"><figcaption>Exhibit B: 🌅 Mel prompts us all to take a mini-vacation.</figcaption></figure><h2 id="wrapping-up"><strong>Wrapping Up</strong></h2><p>And, that concludes this month’s newsletter roundup. 
We’ll continue to release new content, events, and more - posting monthly updates for everyone.</p><p>If you’d like to get updates as soon as they’re available, <strong><a href="https://www.timescale.com/signup/newsletter/">subscribe to our newsletter</a></strong> (2x monthly emails, prepared with 💛 and no fluff or jargon, promise).</p><p>Happy building!</p>]]></content:encoded></item><item><title><![CDATA[How we are building a self-sustaining open-source business in the cloud era (version 2)]]></title><description><![CDATA[We're excited to announce an update to the Timescale License that provides expanded rights to our users. 
Learn how (and why) we're making all of our features 100% free and investing in our community – while still building a sustainable open-source business.]]></description><link>http://3.88.198.173:80/building-open-source-business-in-cloud-era-v2/</link><guid isPermaLink="false">5fdb982d8946b00cbd801374</guid><category><![CDATA[Announcements & Releases]]></category><dc:creator><![CDATA[Ajay Kulkarni]]></dc:creator><pubDate>Thu, 24 Sep 2020 14:16:48 GMT</pubDate><media:content url="http://3.88.198.173/content/images/2020/09/20200922_Timescale_Blog_LicensingUpdate_3.jpg" medium="image"/><content:encoded><![CDATA[<img src="http://3.88.198.173/content/images/2020/09/20200922_Timescale_Blog_LicensingUpdate_3.jpg" alt="How we are building a self-sustaining open-source business in the cloud era (version 2)"><p>Today, we are announcing an <a href="https://www.timescale.com/legal/licenses">update to the Timescale License</a>, which governs many of the advanced features of TimescaleDB, including native compression, multi-node, continuous aggregations, and more. This update loosens restrictions and provides expanded rights to users, reinforcing our commitment to our community. </p><p><em>Notably, this update adds the “right-to-repair”, the “right-to-improve”, and eliminates the paid enterprise tier and usage limits altogether (thus establishing that all of our software will be available for free). 
These changes will apply with TimescaleDB 2.0, which supports distributed hypertables for multi-node scale-out, slated for release next month.</em></p><p>Two years ago, we first announced <a href="http://3.88.198.173/blog/how-we-are-building-an-open-source-business-a7701516a480/">how we were building a self-sustaining open-source business in the cloud era</a>, and that we had started developing features under a new, source-available license called the <strong>Timescale License (TSL)</strong>.</p><p>At the time, the TSL was a radical idea: a source-available license that was open-source in spirit, but that contained a main restriction: preventing companies from offering software licensed under the TSL via a hosted database-as-a-service. We added this restriction, which only applies to &lt;0.0001% of all possible TimescaleDB users, to enable us to build a self-sustaining business in a world that was rapidly moving to the cloud. (And importantly, we did not and have never re-licensed any of our open-source software, which continues to be licensed under the <a href="https://github.com/timescale/timescaledb/blob/master/LICENSE-APACHE">Apache 2 license</a>.)</p><p>The TSL, like the <a href="https://www.elastic.co/blog/doubling-down-on-open">Elastic License before it</a>, and like the <a href="https://www.confluent.io/blog/license-changes-confluent-platform/">Confluent Community License (coincidentally launched around the same time)</a>, are examples of what we call “Cloud Protection Licenses.” These licenses attempt to maintain an open-source spirit, but recognize that the cloud has increasingly become the dominant form of open-source commercialization. So these licenses protect the right of offering the software in the cloud for the main creator/maintainer of the project (who often contributes 99% of the R&amp;D effort). 
<strong>This “cloud protection” is what enables open-source businesses like ours to become self-sustaining in the cloud era.</strong> </p><p>However, since these licenses are not officially sanctioned by the <a href="https://opensource.org/">Open Source Initiative</a> (OSI), which many view as the arbiter of what is and isn’t officially “Open-source”, these licenses are generally not considered “Open-source” (capital O) (although this sentiment may be shifting). At the same time, many developers still call these licenses “open-source” (lower-case o) because they embody the same open, transparent, collaborative spirit. </p><p>Two years later, this experiment has proved successful, exceeding our expectations. The Timescale community has continued to grow to over 500,000 active databases today. The TSL governs many of our new advanced features, including <a href="http://3.88.198.173/blog/building-columnar-compression-in-a-row-oriented-database/">native compression</a>, <a href="http://3.88.198.173/blog/building-a-distributed-time-series-database-on-postgresql/">multi-node</a>, and <a href="https://docs.timescale.com/latest/using-timescaledb/continuous-aggregates">continuous aggregations</a>, and adoption of these features has continued without friction. The public cloud providers have been deterred from offering these TSL features for free (and are now reaching out to discuss revenue sharing agreements). And other OSS entrepreneurs have reached out for advice on how to create similar licenses.</p><p>In fact, this experiment has gone so well that today, we are announcing an update to the TSL that reinforces our commitment to our community. 
<strong>This update loosens restrictions and provides expanded rights to users, including: </strong></p><ul><li>New: Right-to-repair</li><li>New: Right-to-improve</li><li>All enterprise features are now free</li><li>No more usage limits</li><li>Simplified legalese</li></ul><p>These changes will go into place with TimescaleDB 2.0, slated for release next month. </p><p>In the rest of this post, we explain why we are making these changes, what this means for users, and why we think this is necessary for the open-source industry as a whole. </p><h2 id="what-is-timescaledb">What is TimescaleDB?</h2><p>TimescaleDB is the leading relational database for time-series data, engineered on top of PostgreSQL, offered via free software or as a fully-managed service on AWS, Azure, and GCP.</p><p>TimescaleDB offers massive scale (100s of billions of rows and millions of inserts per second on a single server), <a href="http://3.88.198.173/blog/building-columnar-compression-in-a-row-oriented-database/">94%+ native compression</a>, <a href="https://docs.timescale.com/latest/introduction/timescaledb-vs-postgres">10-100x faster queries than PostgreSQL</a>, <a href="http://3.88.198.173/blog/timescaledb-vs-influxdb-for-time-series-data-timescale-influx-sql-nosql-36489299877/">InfluxDB</a>, <a href="http://3.88.198.173/blog/time-series-data-cassandra-vs-timescaledb-postgresql-7c2cc50a89ce/">Cassandra</a>, and <a href="http://3.88.198.173/blog/how-to-store-time-series-data-mongodb-vs-timescaledb-postgresql-a73939734016/">MongoDB</a> – all while maintaining all of the reliability, ease-of-use, SQL interface, and overall goodness provided by PostgreSQL. </p><p>Initially launched in April 2017, TimescaleDB has come a long way in just 3.5 years, with tens of millions of downloads and over 500,000 active databases today. 
The TimescaleDB developer community today includes organizations like AppDynamics, Bosch, Cisco, Comcast, DigitalOcean, Fujitsu, IBM, Rackspace, Schneider Electric, Samsung, Siemens, Uber, Walmart, Warner Music, and thousands of others.</p><p>The growth of the Timescale Community is a clear sign that software developers need a new database to rely on for their time-series data, and that increasingly more and more are turning to TimescaleDB. But even with all this adoption, how does one build a sustainable business, particularly given the predatory actions of some public cloud providers?</p><p>Enter “Cloud Protection Licenses,” like the Timescale License.</p><h2 id="what-are-cloud-protection-licenses-and-why-are-they-necessary">What are “Cloud Protection Licenses”, and why are they necessary?</h2><p>Not that long ago, the open-source business model was straightforward: companies run your software, and when they need help or advanced capabilities, they pay you for commercial support or enterprise features. This is the world in which today’s Open-source licenses were written.</p><p>But the world has changed. Today companies would rather pay someone to run your software, thus obviating the need for paid support (and making selling enterprise capabilities much harder). The standard Open-source licenses allow anyone to completely commercialize your software without contributing any software development. </p><p><strong>In other words, the rise of the Cloud has cut off the primary business model for open-source software.</strong></p><p>Enter new licenses like the Timescale License, Elastic License, Confluent Community License, and others. These licenses, which we call “Cloud Protection Licenses”, attempt to maintain an open-source spirit, but protect the right of offering the software in the cloud for the main creator and maintainer of the project. 
</p><p><strong>This “cloud protection” is what enables open-source businesses like ours to become self-sustaining in the cloud era. </strong></p><p>Some may ask, “Why create a new license - why not just compete with public clouds by just providing the best product experience on a level playing field?” </p><p>The problem is that the playing field is far from level. Today, the public cloud vendors (Amazon, Microsoft, Google) are trillion dollar corporations – the largest companies in the world – and have a myriad of advantages that arise from that size, including market position, pricing power, deep balance sheets, and (what many have even called) unfair business practices (source: Wall Street Journal articles from <a href="https://www.wsj.com/articles/amazon-scooped-up-data-from-its-own-sellers-to-launch-competing-products-11587650015">April 2020</a>, <a href="https://www.wsj.com/articles/amazon-restricts-advertising-competitor-device-makers-roku-arlo-11600786638">September 2020</a>). They lock large customers into prepaid, discounted, multi-year enterprise-wide agreements, and give startups $100,000s of free credits.</p><p>Yet even with hundreds of thousands of employees and tens of billions of dollars in cash, the public clouds did not develop TimescaleDB, the Elastic Stack, the Confluent Platform, and countless other open-source projects. These were built by independent teams dedicated to advancing state-of-the-art technology and serving developers worldwide. </p><p>This is David vs. Goliath. The Rebel Alliance vs. the Empire. Entrepreneurial teams taking on the largest corporations in the world with new, innovative technology. 
Cloud Protection Licenses foster more innovation, and enable the open-source underdogs to compete against the public cloud giants.</p><figure class="kg-card kg-image-card"><img src="http://3.88.198.173/content/images/2020/09/luke-2-2.gif" class="kg-image" alt="How we are building a self-sustaining open-source business in the cloud era (version 2)"></figure><h2 id="what-did-developers-like-and-not-like-about-the-initial-timescale-license">What did developers like and not like about the initial Timescale License?</h2><p>Ever since we launched the TSL, the response from the community has been overwhelmingly positive. But over the years, the community has also provided really helpful feedback - via GitHub, Slack, Twitter, Hacker News, Reddit, email, etc. - which we have incorporated into this latest update.</p><p>Overall, an overwhelming majority has been supportive of the general direction of the TSL, understanding that offering a managed service is increasingly the primary commercialization approach for database and other infrastructure software providers:</p><p><em>“I think we're too hung up on OSI open source licenses. The additional restriction in the timescaledb license that you can't run a paid database as a service offering affects hardly anyone negatively (AWS). It affects us all positively by providing a sustainable business model to support additional development and support of an open-source product we use. Win-win if ever there was one. I'd like to see more open-source and closed-source companies consider this model.” (</em><a href="https://news.ycombinator.com/item?id=23273191"><em>Source</em></a><em>)</em></p><p>But we also heard some requests for liberalizing some of the terms of the Timescale License:</p><p><em>“One of the important reasons I personally use and support open-source is the freedom to not only inspect (which the TSL provides) but to also not have to ask someone else and wait on them to make any changes I need to the software I use. 
Any chance the TSL can be modified to include this freedom too?” (</em><a href="https://news.ycombinator.com/item?id=23274509"><em>Source</em></a><em>)</em></p><p><em>“I don't care if I can see the source code if I can't actually _do_ anything with it. If I can't run my modifications in production, it doesn't guard me against vendor lock-in and it doesn't give me the right-to-repair.” (</em><a href="https://news.ycombinator.com/item?id=23275143"><em>Source</em></a><em>)</em></p><p>We’ve listened to that feedback, and looked at where we are going as a company and how our direction lines up with our licensing. And so we are pleased to announce changes that remove some restrictions of the TSL (and simplify it in the process).</p><h2 id="giving-more-rights-to-users-right-to-repair-and-right-to-improve">Giving more rights to users: right-to-repair and right-to-improve</h2><p>Listening to our users and the general developer community, we are pleased to announce some changes to the Timescale License that loosen restrictions and provide expanded rights to users, reinforcing our commitment to our community.</p><p><strong>To be perfectly clear: These changes solely give users </strong><em><strong>additional</strong></em><strong> rights in how they can use and distribute TimescaleDB in </strong><em><strong>more</strong></em><strong> scenarios; these changes do not further restrict any rights.</strong></p><p><strong>The two biggest rights we are adding are the </strong><em><strong>“right-to-repair” and the “right-to-improve.”</strong></em></p><p>First, users now have what some call the <em>“right-to-repair”</em> with TimescaleDB. 
If they encounter any issue or bug that they want fixed immediately, they can find, fix, and deploy a fix locally before it might be released upstream.</p><p>Second, users can now add additional features to TimescaleDB that might fit their own needs, and use their modified version for internal use, to build a SaaS service, or even when shipping code to users. Some call this the <em>“right-to-improve.”</em> Previously, they would need to upstream this change back to TimescaleDB before deploying into production. That also meant (<a href="https://news.ycombinator.com/item?id=23274989">as Hacker News readers pointed out</a>) that proposed enhancements couldn’t be run and hardened in production before being submitted upstream.</p><p>Previously, we had included these restrictions with good intentions: we wanted to incentivize developers to contribute bug fixes and enhancements upstream, so that the software would be improved for everyone. </p><p>We also were concerned about the issues, uncertainty, and support burden that can arise from users running modified versions; we spend a significant amount of time helping answer questions for free in our large and <a href="https://slack.timescale.com/">active Slack community,</a> which now numbers almost 5,000 members. </p><p>However, after hearing from the community, we’ve come to appreciate that the benefits of these rights outweigh their downsides.</p><h2 id="-goodbye-to-the-enterprise-tier-and-going-all-in-on-cloud-">👋 Goodbye to the enterprise tier (and going all-in on cloud)</h2><p>There’s another big change we are making in the TSL: eliminating the enterprise tier altogether. <strong>This means that we are now making all of our software, everything licensed under Apache-2 and the TSL, available for free. </strong></p><p>In the past, open-source businesses generally relied on commercial support and an enterprise tier (known as “open-core”) for commercialization. 
Timescale was no different.</p><p>But this year, we have increasingly focused on our <a href="http://3.88.198.173/blog/fully-managed-time-series-data-service-now-available-in-aws-azure-gcp-75-regions-compare-vs-influxdb-timestream/">managed cloud service</a> as our primary commercialization strategy, and on selling an enterprise edition of TimescaleDB for on-premise deployments (either on customers' own physical hardware or on their own cloud VMs) as our secondary commercialization strategy.</p><p>That fully-managed cloud service is now the <a href="http://3.88.198.173/blog/fully-managed-time-series-data-service-now-available-in-aws-azure-gcp-75-regions-compare-vs-influxdb-timestream/">industry-leading service for time-series data</a>, running on all three major clouds and available in more than 75 regions. The growth of our cloud business has enabled us to make it the centerpiece of our business.</p><p><strong>This has simplified a key question that every open-source company has historically wrestled with: which of our features should we “hold back” from our free version and keep in the paid enterprise tier? </strong></p><p>Before, we would have difficult internal debates about new features: Release something to support our community and drive adoption? Or limit it to the enterprise tier to drive revenue?</p><p><strong>Today we are going all-in on cloud, and removing any notion of paid enterprise capabilities from the Timescale License.</strong></p><p>By going “all-in” on cloud, our choice becomes simpler: make all features available for free, so that we can invest in our community. 
Users can then either self-manage for free (including using our <a href="https://github.com/timescale/timescaledb-kubernetes">open-source k8s helm charts</a>), or use our managed cloud.</p><p>But this easy choice – and our ability to “support our community” while preserving Timescale’s long-term viability – exists precisely because we have the Timescale License, which restricts cloud vendors’ ability to offer TSL software unless they first establish a business relationship with us.</p><p>With this thinking, earlier this summer, we <a href="http://3.88.198.173/blog/timescaledb-1-7-fast-continuous-aggregates-with-real-time-views-postgresql-12-support-and-more-community-features/">moved most of our existing enterprise features into TimescaleDB’s free community tier</a>. And with our upcoming TimescaleDB 2.0 release, we are moving the last enterprise features to the free community tier.</p><p>Our original Timescale License also allowed us to set potential “usage limits” on community features. The thinking was that, hypothetically, we might at some future time want to allow users to use multi-node TimescaleDB up to, say, 4 servers for free, but thereafter need an enterprise license. </p><p>This is similar to how many SaaS services “tier” consumption under different levels of plans. But these usage limits were always hypothetical: we never released a TimescaleDB feature with usage limits. 
And internally, we never really liked the idea that users’ internal consumption could “expand” to a level where they would no longer be able to use TimescaleDB for free (even though size-based pricing is fairly common for enterprise databases).</p><p><strong>So today, we are also removing any notion of community “usage limits” from the Timescale License.</strong></p><h2 id="the-main-restriction-we-have-preserved-no-timescaledb-as-a-service">The main restriction we have preserved: no TimescaleDB-as-a-service</h2><p><strong>What we have preserved, however, is the main restriction preventing other companies from offering TimescaleDB-as-a-Service in the cloud.</strong></p><p>In a similar vein, we also don’t allow parties to “fork and modify” the database and redistribute this forked version to others, which could serve as a way to try to circumvent licensing restrictions.</p><p>This concern isn’t hypothetical: Amazon, for example, has attempted to fork both the code and community of Elastic by releasing its own questionably-named “Open Distro for Elasticsearch” that re-implements some of Elastic’s key community features and licenses them instead as Apache-2 (while heavily monetizing these features as part of its managed Amazon Elasticsearch Service).</p><p>As we shared earlier, this restriction is the heart of Cloud Protection Licenses like the TSL, and is what enables further innovation.</p><h2 id="summary-what-s-changing-and-isn-t-changing">Summary: What’s changing and isn’t changing</h2><p>Let’s review the rights previously granted under the Timescale License, the rights newly granted or expanded, and those that are still disallowed:</p><p><em>Note: To understand these changes, it’s important to understand the concept of “Value Added Service or Product”, which is a key part of the Timescale License (and can similarly be found in the </em><a href="https://github.com/elastic/elasticsearch/blob/master/licenses/ELASTIC-LICENSE.txt"><em>Elastic License</em></a><em> and 
</em><a href="https://www.confluent.io/confluent-community-license"><em>Confluent Community License</em></a><em>).  </em></p><p><em>The notion of a Value Added Service or Product is that you are building something of value on top of TimescaleDB and not just “reselling” it as part of a “database” or “database-as-a-service”.  (The formal legal definition can be found <a href="https://www.timescale.com/legal/licenses#section-3-10-value-added-products-or-services">here</a>.) </em></p><p><em>This Value Added Product or Service can certainly be commercial or proprietary; the TSL is in no way a “non-commercial” license. Many companies provide SaaS services using TimescaleDB as part of their service offering, or distribute commercial products embedding TimescaleDB. They just can’t purely offer “TimescaleDB-as-a-Service,” which is why cloud vendors like Amazon or Microsoft can’t and don’t offer the TSL parts of TimescaleDB as part of AWS RDS or Azure Postgres.</em></p><p><strong>Definitions:</strong></p><ul><li>“utilize”: Offering code/product via a software-as-a-service</li><li>“distribute”: Shipping code/product via software</li><li>“Value Added Service or Product”: Something of value on top of TimescaleDB and not just “reselling” TimescaleDB as part of a “database” or “database-as-a-service”</li></ul><p><strong>Rights previously granted (and still allowed) under Timescale License</strong></p><ul><li>Right to run <em>unmodified</em> TimescaleDB for internal use</li><li>Right to utilize <em>unmodified</em> TimescaleDB to offer a Value Added Service</li><li>Right to distribute <em>unmodified</em> Source and Binaries as part of Value Added Product</li><li>Right to modify TimescaleDB for internal development and testing, and subsequently upstream modifications to Timescale </li></ul><p><strong>Rights newly granted (formerly disallowed) under Timescale License</strong></p><ul><li>Right to run <em>modified</em> TimescaleDB for internal use</li><li>Right to utilize 
<em>modified</em> TimescaleDB to offer a Value Added Service</li><li>Right to distribute <em>modified</em> Binaries as part of a Value Added Product</li><li>Right to distribute <em>unmodified</em> Source and Binaries, even if not as part of Value Added Product</li></ul><p><strong>Rights still disallowed under Timescale License</strong></p><ul><li>No right to utilize TimescaleDB for external use, <em>unless </em>as part of a Value Added Service</li><li>No right to distribute modified Source</li><li>No right to distribute modified Binaries, <em>unless </em>as part of a Value Added Product</li></ul><h2 id="what-does-this-mean-for-me">What does this mean for me?</h2><p>If you are a current or future user of TimescaleDB, these changes mean that you have more rights. But if you are looking to provide TimescaleDB-as-a-service, you are still restricted to only offering the Apache-2 edition.</p><p>In general, by refreshing the Timescale License and focusing on our cloud service, we can continue to invest in our community by releasing our best features to be completely free to use.</p><p>And now these features include distributed hypertables in TimescaleDB 2.0 for even greater scale. Beta users have already been running multi-node TimescaleDB in continuous daily use for many months, including a 22-server cluster by a Fortune 50 company ingesting more than a billion rows per day. We’ll be writing more about TimescaleDB 2.0 when it launches (soon!).</p><p><strong>Today, you can deploy TimescaleDB on-premise or in your own cloud account</strong>, running the software on bare VMs or using our <a href="https://github.com/timescale/timescaledb-kubernetes">open-source k8s helm charts</a> which automate high-availability/failover and continuous PITR backups. 
Totally free to use, and now free to even modify for your own use or for services or products you build on TimescaleDB.</p><p><strong>Or, if you prefer, you can let us run TimescaleDB for you, fully managed on AWS, Azure, or GCP </strong>in 75+ regions and with access to a <a href="https://www.timescale.com/support">top-rated support team</a>.</p><ul><li>To get started, <a href="https://www.timescale.com/cloud-signup">you can sign up for a free trial right away</a>.</li></ul><p>And, <strong><a href="https://slack.timescale.com/">join our 5,000+ member Slack community</a> for any questions, to learn more, and to meet like-minded developers</strong> – we’re active in all channels and here to help.</p><figure class="kg-card kg-image-card"><img src="http://3.88.198.173/content/images/2020/09/bb8-1-1.gif" class="kg-image" alt="How we are building a self-sustaining open-source business in the cloud era (version 2)"></figure>]]></content:encoded></item><item><title><![CDATA[Timescale Newsletter Roundup: August Edition]]></title><description><![CDATA[Get a digest of what's new from Team Timescale, including 3 (!) 
exciting product updates, community member shoutouts, and tons of how-to content to help you on your journey to PostgreSQL and time-series mastery 🎉]]></description><link>http://3.88.198.173:80/timescale-newsletter-roundup-august-edition/</link><guid isPermaLink="false">5fdb982d8946b00cbd801372</guid><category><![CDATA[Newsletter]]></category><dc:creator><![CDATA[Lacey Butler]]></dc:creator><pubDate>Mon, 31 Aug 2020 22:04:49 GMT</pubDate><media:content url="http://3.88.198.173/content/images/2020/08/gemma-evans-IAKIkREkRzY-unsplash.jpg" medium="image"/><content:encoded><![CDATA[<img src="http://3.88.198.173/content/images/2020/08/gemma-evans-IAKIkREkRzY-unsplash.jpg" alt="Timescale Newsletter Roundup: August Edition"><p><strong>3 </strong>exciting product releases (including big news for Timescale Cloud),<strong> 2</strong> amazing community member shoutouts, <strong>1</strong> "Postgres Pro Tips" technical series, and oh so many tutorials, events, and how-tos to help you continue your journey to data mastery 🎉.</p><p>We’re always releasing new features, creating new documentation and tutorials, and hosting virtual sessions to help developers do amazing things with their data. 
And, to make it easy for our community members to discover and get the resources they need to power their projects, teams, or business with analytics, we round up our favorite new pieces in our biweekly newsletter.</p><p>We’re on a mission to teach the world about time-series data, supporting and growing communities around the world.</p><p>And, sharing educational resources as broadly as possible is one way to do just that :).</p><p><strong>Here’s a snapshot of the content we shared with our readers this month (<a href="https://www.timescale.com/signup/newsletter/?utm_source=timescale-aug-newsletter-roundup&amp;utm_medium=blog&amp;utm_campaign=sept-2020-advocacy&amp;utm_content=subscribe-newsletter">subscribe</a> to get updates straight to your inbox).</strong></p><h2 id="new-technical-content-videos-tutorials"><strong>New technical content, videos &amp; tutorials</strong></h2><p><strong>[Session Replay]: <a href="https://youtu.be/4SLpMTH1RcU">Postgres Pro Tips Part I: 5 Ways to Improve Your PostgreSQL INSERT Performance</a> &gt;&gt;</strong></p><p>Our resident webinar pro <a href="https://twitter.com/avthars">@avthars</a> breaks down factors that impact PostgreSQL ingest rate and 5 immediately actionable ways to improve your database speed. 
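</p><p>To give a flavor of the batching tip from that session: sending many rows per <code>INSERT</code> statement, rather than one statement per row, cuts client-server round-trips dramatically. A minimal sketch (the table and values here are hypothetical, for illustration only):</p><pre><code>-- Hypothetical metrics table
CREATE TABLE conditions (
  time        TIMESTAMPTZ NOT NULL,
  device_id   INTEGER,
  temperature DOUBLE PRECISION
);

-- One statement, many rows: far fewer round-trips than row-at-a-time inserts
INSERT INTO conditions (time, device_id, temperature) VALUES
  ('2020-08-01 00:00:00+00', 1, 21.3),
  ('2020-08-01 00:00:10+00', 1, 21.4),
  ('2020-08-01 00:00:20+00', 2, 19.8);</code></pre><p>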
You’ll learn why ingest rate is so critical to time-series data and get step-by-step demos, tips, and resources to apply each optimization to your apps.</p><ul><li>👉  <a href="http://3.88.198.173/blog/13-tips-to-improve-postgresql-insert-performance/?utm_source=timescale-aug-newsletter-roundup&amp;utm_medium=blog&amp;utm_campaign=sept-2020-advocacy&amp;utm_content=13-insert-tips-blog">Get <strong>13 </strong>tips in this blog post </a>from Timescale CTO Mike Freedman.</li><li>🔨  <a href="https://github.com/timescale/tsbs">Explore the Time-Series Benchmarking Suite </a>to replicate Avthar’s demo (or benchmark the performance of any time-series database).</li></ul><p><strong>[Postgres Pro Tips Part II]: <a href="https://www.timescale.com/webinar/5-powerful-postgresql-functions-for-monitoring-and-analytics/?utm_source=timescale-aug-newsletter-roundup&amp;utm_medium=blog&amp;utm_campaign=sept-2020-advocacy&amp;utm_content=5-monitoring-functions-webinar-rsvp">5 Powerful PostgreSQL Functions for Monitoring &amp; Analytics</a> &gt;&gt;</strong></p><p>PostgreSQL is ideal for real-time monitoring and historical analysis, but writing efficient queries can be tricky. Join <a href="https://twitter.com/avthars">@avthars</a> to learn how to build essential PostgreSQL functions for common real-time scenarios, plus TimescaleDB-specific functions to simplify time-series historical analysis. 
</p><ul><li>🗓 <a href="https://www.timescale.com/webinar/5-powerful-postgresql-functions-for-monitoring-and-analytics/?utm_source=timescale-aug-newsletter-roundup&amp;utm_medium=blog&amp;utm_campaign=sept-2020-advocacy&amp;utm_content=5-monitoring-functions-webinar-rsvp">RSVP for Wed., Sept 16th</a></li><li>We send the recording + resources to all registrants within 48 hours, so sign up even if you can't attend live.</li></ul><p><strong>[NEW How-to]:</strong> <strong><a href="http://3.88.198.173/blog/speed-up-grafana-autoswitching-postgresql/?utm_source=timescale-aug-newsletter-roundup&amp;utm_medium=blog&amp;utm_campaign=sept-2020-advocacy&amp;utm_content=grafana-autoswitching-blog">Using PostgreSQL to speed up Grafana: auto-switching between different aggregations</a> &gt;&gt;</strong></p><p>Learn how to use <code>UNION ALL</code> to build graphs that allow you to “auto-switch” aggregated views of your data (e.g., daily, hourly, weekly) in the same Grafana visualization. The result: faster dashboards that allow you to drill into your metrics as quickly as possible <strong>and </strong>save time and CPU resources.</p><p><strong>[NEW How-to]:</strong> <strong><a href="http://3.88.198.173/blog/grafana-postgres-timeshift/?utm_source=timescale-aug-newsletter-roundup&amp;utm_medium=blog&amp;utm_campaign=sept-2020-advocacy&amp;utm_content=grafana-timeshift-blog">How to visualize timeshifts in Grafana using PostgreSQL</a>&gt;&gt;</strong></p><p>Combine PostgreSQL and Grafana to easily compare your metrics across time periods in <strong>one </strong>graph. 
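</p><p>The underlying trick is to shift an earlier period forward in time so it overlays the current one on the same axis. A minimal sketch of one such "timeshifted" series (table and column names are hypothetical):</p><pre><code>-- Last week's hourly averages, shifted forward 7 days so they overlay
-- this week's data in the same Grafana panel
SELECT time_bucket('1 hour', time) + INTERVAL '7 days' AS time,
       avg(temperature) AS last_week
FROM conditions
WHERE time BETWEEN now() - INTERVAL '14 days' AND now() - INTERVAL '7 days'
GROUP BY 1;</code></pre><p>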
We break down what timeshifting is &amp; how it works, plus step-by-step instructions and sample queries that you can modify for your own projects.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://lh6.googleusercontent.com/vpBbIBzb17xk5Ihcmjwya9Xn2WKRiSGw0zOtFw0XPV55GDzVnXxeDEdCQdIa_AjIOJZKou0erDxVqJQJKWR3I6s9LxUf8iwxE5j3BmazmZifN_vL5wIW9LJLzeiQ7rMLDVt765fw" class="kg-image" alt="Timescale Newsletter Roundup: August Edition"><figcaption>Grafana visualization showing current and last week's data (aka "timeshifted"), making it easier to visually distinguish between periods, compare data, and spot trends across time intervals</figcaption></figure><h2 id="new-remote-friendly-events-community">New #remote-friendly events &amp; community</h2><p><strong>[Community Q &amp; A]: <a href="https://www.timescale.com/office-hours/?utm_source=timescale-aug-newsletter-roundup&amp;utm_medium=blog&amp;utm_campaign=sept-2020-advocacy&amp;utm_content=office-hours-rsvp">Join us for Office Hours on Tues, Oct 6th</a> &gt;&gt;</strong></p><p>If you have questions about TimescaleDB, from schema design best practices to upcoming releases, or want to learn more about general distributed computing and networking topics, Office Hours - hosted by Timescale technical experts - is for you.</p><ul><li>👉 <a href="https://www.timescale.com/office-hours/?utm_source=timescale-aug-newsletter-roundup&amp;utm_medium=blog&amp;utm_campaign=sept-2020-advocacy&amp;utm_content=office-hours-rsvp">Reserve your spot on Tues, Oct 6th</a> (space is limited)</li><li>💬 If you can’t join, but have a question, <a href="https://slack.timescale.com/">reach out to our engineering team on Slack</a>.</li></ul><p><strong>[Community Spotlight #1] <a href="http://3.88.198.173/blog/how-i-power-a-successful-crypto-trading-bot-with-timescaledb/?utm_source=timescale-aug-newsletter-roundup&amp;utm_medium=blog&amp;utm_campaign=sept-2020-advocacy&amp;utm_content=cryptobot-qa-blog">How I power a 
(successful) cryptobot with TimescaleDB</a> &gt;&gt;</strong></p><p>Our friend <a href="https://www.linkedin.com/in/fqueis">Felipe Queis</a> shares how he combines ML, TimescaleDB, and Node.js to power a crypto trading bot that’s netted several successful trades (+487%!). He details how his bot works, why he selected TimescaleDB, sample queries, favorite resources, and beyond.</p><ul><li>💰 <a href="https://www.reddit.com/r/algotrading/comments/hpimu5/after_25_year_a_lot_of_coding_refactoring_testing/">See Felipe’s Reddit AMA</a> for even more tips and technical details.</li><li>📈 <a href="https://hackernoon.com/how-i-power-my-trading-bot-with-timescaledb-xcr3ufw">Read his Hacker Noon post</a> to see how his results have only continued to improve.</li></ul><p><strong>[Community Spotlight #2] </strong><a href="https://flightaware.engineering/systems-monitoring-with-prometheus-grafana/"><strong>How FlightAware Monitors its Systems with Prometheus, Grafana, and TimescaleDB </strong></a><strong>&gt;&gt;</strong></p><p>Three cheers to <a href="https://twitter.com/flightaware">@flightaware</a> for their amazing breakdown of how they’ve architected their monitoring stack (and for their work to keep travelers informed, safe, and on-time). You’ll learn why they selected each component and the role it plays in fueling their real-time monitoring, ETA predictions, historical analysis, and more.</p><ul><li> 📊 Inspired by FlightAware’s Grafana dashboards? <a href="https://docs.timescale.com/latest/tutorials/tutorial-grafana/?utm_source=timescale-aug-newsletter-roundup&amp;utm_medium=blog&amp;utm_campaign=sept-2020-advocacy&amp;utm_content=grafana-tutorial-hp">Explore 5+ tutorials to build awesome Grafana visualizations with <em>your </em>data</a>.</li><li>📣 Have a story to share? 
<a href="https://slack.timescale.com/">Reach out on Slack</a> or <a href="https://twitter.com/TimescaleDB">@TimescaleDB</a> and we’ll make it happen.</li></ul><h2 id="product-updates-reading-list-etc-"><strong>Product Updates, Reading List &amp; Etc.</strong></h2><p><strong>[Product Update #1]:</strong> <strong><a href="http://3.88.198.173/blog/fully-managed-time-series-data-service-now-available-in-aws-azure-gcp-75-regions-compare-vs-influxdb-timestream/?utm_source=timescale-aug-newsletter-roundup&amp;utm_medium=blog&amp;utm_campaign=sept-2020-advocacy&amp;utm_content=cloud-momentum-blog">Timescale Cloud gets even better, now available in 75+ regions &amp; 2K configurations </a>&gt;&gt;</strong></p><p>We launched Timescale Cloud to allow developers to get the power of TimescaleDB, with worry-free operations and the ability to grow, shrink, and migrate workloads with ease. We’re excited to now offer Timescale Cloud in 75+ regions across AWS, GCP, and Azure and with fine-grained CPU/storage config options - giving you ultimate flexibility and control.</p><ul><li>📰  <a href="https://news.ycombinator.com/item?id=24132602&amp;utm_campaign=Newsletters%202020&amp;utm_source=hs_email&amp;utm_medium=email&amp;_hsenc=p2ANqtz--SZpPJo55JS8gieC1OHl9de0VXifcn5nKe_oVtKs6j_-3G0UizjrRYKQIE_2GhivlfrF-I">See Hacker News discussion</a> (60+ comments!)</li><li>🗞 <a href="https://www.datanami.com/2020/08/12/timescale-database-now-available-in-76-cloud-regions/">Check out what Timescale's CEO says about the news </a>and what it means for developers.</li><li>🙌  Thanks to Blue Sky Analytics, Everactive &amp; When I Work for sharing your experiences for this post.</li><li>🙏  Thank you to <strong>all </strong>of our Cloud customers &amp; community members.</li></ul><figure class="kg-card kg-image-card"><img src="http://3.88.198.173/content/images/2020/08/image.png" class="kg-image" alt="Timescale Newsletter Roundup: August Edition" srcset="http://3.88.198.173/content/images/size/w600/2020/08/image.png 
600w, http://3.88.198.173/content/images/size/w1000/2020/08/image.png 1000w, http://3.88.198.173/content/images/2020/08/image.png 1119w" sizes="(min-width: 720px) 720px"></figure><p><strong>[Product Update #2] <a href="https://www.timescale.com/forge/?utm_source=timescale-aug-newsletter-roundup&amp;utm_medium=blog&amp;utm_campaign=sept-2020-advocacy&amp;utm_content=forge-product-page">Timescale Forge now includes CPU, memory &amp; storage metrics reporting </a>&gt;&gt;</strong></p><p>Timescale Forge - our second fully managed time-series database product, currently in public preview - now includes a comprehensive metrics dashboard! Easily analyze spikes in usage, zoom in for more granular understanding, observe correlations between metrics, and identify issues on-the-fly.</p><ul><li>🔥 <a href="https://www.timescale.com/forge/?utm_source=timescale-aug-newsletter-roundup&amp;utm_medium=blog&amp;utm_campaign=sept-2020-advocacy&amp;utm_content=forge-product-page">Explore Timescale Forge</a> (100% free for 30 days).</li><li>☁️ <a href="https://www.timescale.com/cloud-choose/?utm_source=timescale-aug-newsletter-roundup&amp;utm_medium=blog&amp;utm_campaign=sept-2020-advocacy&amp;utm_content=cloud-choose-page">Learn more and choose the best cloud option for you</a>.</li></ul><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://hs-3409477.f.hubspotemail.net/hub/3409477/hubfs/metrics.gif?upscale=true&amp;width=1120&amp;upscale=true&amp;name=metrics.gif" class="kg-image" alt="Timescale Newsletter Roundup: August Edition"><figcaption><a href="https://console.forge.timescale.com/signup/?utm_source=timescale-aug-newsletter-roundup&amp;utm_medium=blog&amp;utm_campaign=sept-2020-advocacy&amp;utm_content=forge-signup-page">Create your Timescale Forge account</a> to see the new dashboard in action</figcaption></figure><p><strong>[Product Update #3]: </strong><a href="https://github.com/timescale/timescale-prometheus"><strong>Timescale Prometheus 
Adapter now in beta </strong></a><strong>&gt;&gt;</strong></p><p>We just shipped our first Prometheus Adapter beta candidate, which signifies we believe our API is now feature complete (although not yet recommended for production workloads)! This release includes tons of updates to improve performance and usability, including support for querying PromQL directly from the connector.</p><ul><li>⭐ <a href="https://github.com/timescale/timescale-prometheus">Get the source code &amp; release notes</a> (GitHub).</li><li>✅ <a href="https://promlabs.com/promql-compliance-test-results-timescaledb/">We’re also proud to share that our PromQL support passes 100% compatibility tests</a>.</li><li><a href="https://www.timescale.com/signup/prometheus/?utm_source=timescale-aug-newsletter-roundup&amp;utm_medium=blog&amp;utm_campaign=sept-2020-advocacy&amp;utm_content=prom-newsletter-subscribe">Subscribe to Prometheus Product Updates</a> to get new releases and announcements straight to your inbox.</li></ul><p><strong>[TimescaleDB Tip #1]: <a href="https://docs.timescale.com/latest/tutorials/tutorial-forecasting/?utm_source=timescale-aug-newsletter-roundup&amp;utm_medium=blog&amp;utm_campaign=sept-2020-advocacy&amp;utm_content=forecasting-tutorial">Build time-series forecasting models with R, Apache MADlib &amp; Python</a> &gt;&gt;</strong></p><p>Follow this detailed tutorial to learn how to use TimescaleDB to analyze your data and make predictions. 
We’ve included guidance for not one, but <strong>two </strong>approaches: ARIMA and Holt-Winters.</p><p><strong>[TimescaleDB Tip #2]: <a href="https://docs.timescale.com/latest/getting-started/install-psql-tutorial/?utm_source=timescale-aug-newsletter-roundup&amp;utm_medium=blog&amp;utm_campaign=sept-2020-advocacy&amp;utm_content=psql-tutorial">How to install psql on Mac, Ubuntu, Debian, Windows</a> &gt;&gt;</strong></p><p>Use this guide to get up and running with #psql on your platform of choice, see common commands at-a-glance, get tips for saving query results, and more.</p><p><strong>[TimescaleDB Tip #3]: <a href="https://docs.timescale.com/latest/tutorials/analyze-cryptocurrency-data/?utm_source=timescale-aug-newsletter-roundup&amp;utm_medium=blog&amp;utm_campaign=sept-2020-advocacy&amp;utm_content=crypto-tutorial">Explore the price of Bitcoin and Ethereum over time</a> &gt;&gt;</strong></p><p>Get step-by-step instructions for connecting to TimescaleDB, designing your schema, and creating your dataset (plus sample queries to kick off your analysis).</p><p><strong>[ICYMI]:</strong> <strong><a href="http://3.88.198.173/blog/timescaledb-vs-influxdb-for-time-series-data-timescale-influx-sql-nosql-36489299877/?utm_source=timescale-aug-newsletter-roundup&amp;utm_medium=blog&amp;utm_campaign=sept-2020-advocacy&amp;utm_content=influx-benchmarks-blog">TimescaleDB vs. InfluxDB: Purpose built differently for time-series data</a> &gt;&gt;</strong></p><p>We just refreshed our TimescaleDB v. InfluxDB benchmarks to see how the latest versions stack up across 5+ areas critical to time-series data workloads. 
</p><ul><li>⏰ <a href="https://twitter.com/avthars/status/1291399818473017344">Check out @avthars' Twitter thread</a> to see key points, in 240 characters or less.</li></ul><p><strong>[Reading List]: <a href="http://3.88.198.173/blog/what-is-high-cardinality-how-do-time-series-databases-influxdb-timescaledb-compare/?utm_source=timescale-aug-newsletter-roundup&amp;utm_medium=blog&amp;utm_campaign=sept-2020-advocacy&amp;utm_content=high-cardinality-blog">What is high cardinality, and how do time-series databases like InfluxDB and TimescaleDB compare?</a> &gt;&gt;</strong></p><p>Get a look into what causes high cardinality, why it's so common in time-series scenarios, and how TimescaleDB solves for it. This old (but good!) post breaks down the basics &amp; details how different databases approach the issue.</p><ul><li>⚖  Want more info about how TimescaleDB &amp; InfluxDB compare? <a href="http://3.88.198.173/blog/timescaledb-vs-influxdb-for-time-series-data-timescale-influx-sql-nosql-36489299877/?utm_source=timescale-aug-newsletter-roundup&amp;utm_medium=blog&amp;utm_campaign=sept-2020-advocacy&amp;utm_content=influx-benchmarks-blog">Check out our 2020 Benchmark Report</a>.</li></ul><p><strong>[Reading List]: <a href="http://3.88.198.173/blog/why-sql-beating-nosql-what-this-means-for-future-of-data-time-series-database-348b777b847a/?utm_source=timescale-aug-newsletter-roundup&amp;utm_medium=blog&amp;utm_campaign=sept-2020-advocacy&amp;utm_content=sql-v-nosql-blog">Why SQL is beating NoSQL, and what this means for the future of data </a>&gt;&gt;</strong></p><p>Check out this Star Wars-infused history of databases, from a New Hope to NoSQL Strikes Back and Return of the SQL. 
We published this a while ago, but it’s even more true now (NoSQL &lt; SQL).</p><p><strong>[Watchlist]: </strong><a href="https://youtu.be/F0y4L9S8mQU?t=962"><strong>Lessons learned optimizing relational schema for Prometheus data </strong></a><strong>&gt;&gt;</strong></p><p>Catch Timescale Engineer Mat’s PromCon 2020 lightning talk to learn how the team’s building a long-term data store for Prometheus metrics, what they’ve learned along the way, and what’s next.</p><p><strong>[Timescale Team Fun]:</strong> Last, but certainly not least, we continue to find little ways to stay connected with things like movie-themed Slack challenges and quirky "tell us about you" prompts. Asynchronous communication and team bonding at its finest.</p><p>Check out the examples below, which you're welcome (and encouraged!) to reuse to inspire your own remote team activities 💭.</p><figure class="kg-card kg-image-card kg-width-wide kg-card-hascaption"><img src="http://3.88.198.173/content/images/2020/08/Screen-Shot-2020-08-26-at-3.50.31-PM.png" class="kg-image" alt="Timescale Newsletter Roundup: August Edition" srcset="http://3.88.198.173/content/images/size/w600/2020/08/Screen-Shot-2020-08-26-at-3.50.31-PM.png 600w, http://3.88.198.173/content/images/size/w1000/2020/08/Screen-Shot-2020-08-26-at-3.50.31-PM.png 1000w, http://3.88.198.173/content/images/2020/08/Screen-Shot-2020-08-26-at-3.50.31-PM.png 1122w"><figcaption>Original Nintendo, Power Rangers, and Tamagotchi were clear winners for Team Timescale</figcaption></figure><figure class="kg-card kg-image-card kg-width-wide"><img src="https://lh5.googleusercontent.com/74RA-CUWp-TlywWsBpkHfhTE9-fbJRaZUbVLof2k1ekfP8bC7gaULp6eY3B2v1vV9Dh2S4fBrTksdNpcqcLb8no-khJN3LPI8jcAUChBk5rOxwz_l59sBmfwZJWObvP7HWKsqTLb" class="kg-image" alt="Timescale Newsletter Roundup: August Edition"></figure><figure class="kg-card 
kg-image-card kg-width-wide kg-card-hascaption"><img src="https://lh6.googleusercontent.com/ZaxOE3yFe2mx99Jm75LKKsqFETfARFM38q9SsDHu8Un-shRRcjYa64syELVgP3FPWASl73r6zmMWbWUgoDyhT3blNSMfvjcgZcbJxnngrBCjHuavp-9ZlQxqKZjgVlneV2Pntydo" class="kg-image" alt="Timescale Newsletter Roundup: August Edition"><figcaption><a href="https://medium.com/nightingale/the-great-emoji-movie-challenge-63162bda9014">Check out this Medium post</a> for more "Emoji Movie challenge" ideas</figcaption></figure><h2 id="wrapping-up"><strong>Wrapping Up</strong></h2><p>And, that concludes this month’s newsletter roundup. We’ll continue to release new content, events, and more - posting monthly updates for everyone.</p><p>If you’d like to get updates as soon as they’re available, <strong><a href="https://www.timescale.com/signup/newsletter/?utm_source=timescale-aug-newsletter-roundup&amp;utm_medium=blog&amp;utm_campaign=sept-2020-advocacy&amp;utm_content=newsletter-subscribe">subscribe to our newsletter</a></strong> (2x monthly emails, prepared with 💛 and no fluff or jargon, promise).</p><p>Happy building!</p>]]></content:encoded></item><item><title><![CDATA[A multi-cloud, fully-managed service for time-series data, now available in AWS, Azure, and GCP with 75+ regions and 2,000 configurations]]></title><description><![CDATA[Timescale Cloud, the leading cloud service for time-series data, is now even more powerful, offering the widest cloud provider and region support and the best 
performance, scale, and developer experience of any managed service for time-series data.]]></description><link>http://3.88.198.173:80/fully-managed-time-series-data-service-now-available-in-aws-azure-gcp-75-regions-compare-vs-influxdb-timestream/</link><guid isPermaLink="false">5fdb982d8946b00cbd80136b</guid><category><![CDATA[Announcements & Releases]]></category><dc:creator><![CDATA[Ajay Kulkarni]]></dc:creator><pubDate>Wed, 12 Aug 2020 12:54:00 GMT</pubDate><media:content url="http://3.88.198.173/content/images/2020/07/kenrick-mills-MF9Wy1NA55I-unsplash.jpg" medium="image"/><content:encoded><![CDATA[<img src="http://3.88.198.173/content/images/2020/07/kenrick-mills-MF9Wy1NA55I-unsplash.jpg" alt="A multi-cloud, fully-managed service for time-series data, now available in AWS, Azure, and GCP with 75+ regions and 2,000 configurations"><p>We’re excited to announce that Timescale Cloud, the leading cloud service for time-series data, is now even more powerful, combining the widest cloud provider and region support with the best performance, scale, and developer experience of any managed service for time-series data.</p><p><a href="http://3.88.198.173/blog/timescale-cloud-first-fully-managed-time-series-database-service-runs-on-aws-gcp-azure/?utm_source=timescale-cloud-momentum-announcement&amp;utm_medium=blog&amp;utm_campaign=pr-2020&amp;utm_content=cloud-launch-blog">One year ago, we launched Timescale Cloud</a> as the first multi-cloud fully-managed service for time-series data. 
We initially launched Timescale Cloud to offer developers all of the power of <a href="https://www.timescale.com/products/?utm_source=timescale-cloud-momentum-announcement&amp;utm_medium=blog&amp;utm_campaign=pr-2020&amp;utm_content=products-page">TimescaleDB</a>, the leading relational database for time-series, without needing to worry about database operations.</p><p>We’ve seen fantastic growth over the past 12 months, and we’ve continued to make significant investments in the platform to ensure it meets the needs of developers and organizations all around the world. </p><p>Today, we are excited to announce some of these major improvements. To support our global customers, Timescale Cloud is now available in 76 regions, across Amazon Web Services, Microsoft Azure, and Google Cloud Platform, with 2,000 different CPU and storage configuration options to provide developers even more flexibility. </p><p>You can now incrementally scale up or down in the region of your choice, with CPU options ranging from 2 to 72 CPUs and storage ranging from 20GB to 10TB (equivalent to 150TB+, <a href="http://3.88.198.173/blog/building-columnar-compression-in-a-row-oriented-database/?utm_source=timescale-cloud-momentum-announcement&amp;utm_medium=blog&amp;utm_campaign=pr-2020&amp;utm_content=1-5-release-blog">thanks to TimescaleDB’s best-in-class compression</a>).</p><p>With time-series data, each data point is inserted as a new value, instead of overwriting the prior (i.e., earlier) value. As a result, time-series workloads scale <strong>much </strong>faster than other types of data. </p><p>Timescale Cloud solves this problem, with added benefits: you get a database service that not only seamlessly scales to larger workloads, but also offers full SQL and rock-solid performance and reliability no matter your size – all without incurring the astronomical costs you’d pay with other managed services. 
<strong>Performance, scale, SQL, and low cost - Timescale Cloud has it all.</strong></p><p>To get started, you can <a href="https://www.timescale.com/cloud-signup/?utm_source=timescale-cloud-momentum-announcement&amp;utm_medium=blog&amp;utm_campaign=pr-2020&amp;utm_content=cloud-signup">sign up for a free trial</a> right away, or <a href="https://www.timescale.com/cloud-demo/?utm_source=timescale-cloud-momentum-announcement&amp;utm_medium=blog&amp;utm_campaign=pr-2020&amp;utm_content=request-demo">book a 30 min technical demo with the team</a> to customize a plan for your scenario.</p><p>But if you’re still not convinced, check out the below to learn why customers select and trust Timescale Cloud – and how it frees them to focus on what matters to their business.</p><h2 id="what-timescale-cloud-customers-have-to-say">What Timescale Cloud customers have to say</h2><h3 id="everactive-focuses-on-optimizing-customer-experience-for-their-revolutionary-batteryless-sensors-not-scaling-their-database">Everactive focuses on optimizing customer experience for their revolutionary batteryless sensors, not scaling their database</h3><p><em>“We have over half a billion of rows of data in development and production, across dozens of rows and several tables, and we needed a database that could handle this volume, while also allowing us to use our internal teams and resources. We evaluated several options, and Timescale Cloud was the clear winner: we get to use SQL and leverage tools already available in PostgreSQL, but don't have to worry about database administration or scalability. <strong>We can set it and forget, freeing up our engineers to work on optimizing our customers' experience.</strong>"</em> </p><p>– Clayton Yochum, Data Science Staff Engineer at <a href="https://everactive.com/">Everactive</a>, a technology company revolutionizing Industrial IoT with batteryless sensors. 
</p><h3 id="when-i-work-reduces-their-monitoring-footprint-and-costs-by-combining-prometheus-with-timescale-cloud">When I Work reduces their monitoring footprint and costs by combining Prometheus with Timescale Cloud</h3><p><em>“Timescale Cloud has dramatically reduced our monitoring footprint and costs from our previous tools, with simple pricing that allows our developers to collect more custom metrics than ever before. We rely on Prometheus to collect key metrics from each environment, while Timescale aggregates and alerts on them in real-time - something most Prometheus aggregators do not guarantee - and its basis on Postgres allows us to rely on the existing, exhaustive documentation and depth of tooling for that platform. <strong>Timescale Cloud provides the uptime we need for constant monitoring, and the Timescale team has continued to be responsive to problems and questions, while simultaneously delivering new features at an incredible pace.</strong>”</em> </p><p> – Sean Sube, DevOps Engineer at <a href="http://wheniwork.com">When I Work</a>, a popular SaaS platform for employee scheduling, time and attendance, hiring, and more.</p><h3 id="blue-sky-tackles-global-climate-change-with-advanced-geospatial-and-time-series-analytics">Blue Sky tackles global climate change with advanced geospatial and time-series analytics</h3><p><em>“Timescale Cloud is a great place to start working with time-series data, and I strongly recommend it to developers. <strong>With Timescale Cloud, you can adjust the system to meet your needs in terms of scale and flexibility…the possibilities are endless! 
</strong>It is truly a cutting-edge technology that has expedited our mission of commanding the space of environmental data.” </em></p><p>– Kshitij Purwar, CTO at <a href="https://blueskyhq.in/">Blue Sky Analytics</a>, a startup on a mission to use geospatial data to fight climate change (<a href="https://www.timescale.com/case-studies/blue-sky-analytics/?utm_source=timescale-cloud-momentum-announcement&amp;utm_medium=blog&amp;utm_campaign=pr-2020&amp;utm_content=blue-sky-story">read more about how Blue Sky Analytics uses data to power their platform</a>).</p><p></p><p><strong>Check out below to see what’s new and how to get started with a <a href="https://www.timescale.com/cloud-signup/?utm_source=timescale-cloud-momentum-announcement&amp;utm_medium=blog&amp;utm_campaign=pr-2020&amp;utm_content=cloud-signup">30-day free trial</a></strong>, no credit card required. We know every customer and scenario is unique, and we’re here to help with any questions along the way.</p><h2 id="what-s-new-2-000-configurations-in-75-regions-on-aws-azure-and-gcp">What’s New: 2,000 configurations in 75+ regions on AWS, Azure, and GCP</h2><p>Our latest update adds more pricing tiers and CPU/storage combos, making it easy to incrementally scale disk consumption from 20 GB to 50, 100, 250, 512, up to 10TB, and CPUs from 2 to 4, 8, 16, up to 72.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://lh3.googleusercontent.com/IUU0kUvusk0k9v_ta1MHdJut7DkvDcCbr0KVmqNPfVtyHQ_5lnjH3SEsE75uyUkRw53PO0uY9ghYxZXFITZyIZ4O9_MXZwHYwhBPk2lt0sgj-HEQh4hHpUgPWmzv5KVh7b4wLsJE" class="kg-image" alt="A multi-cloud, fully-managed service for time-series data, now available in AWS, Azure, and GCP with 75+ regions and 2,000 configurations"><figcaption>Timescale Cloud region availability at-a-glance</figcaption></figure><p><strong>The most clouds and regions of any managed service for time-series data, now available in:</strong></p><!--kg-card-begin: markdown--><table>
<thead>
<tr>
<th>AWS</th>
<th>GCP</th>
<th>Azure</th>
</tr>
</thead>
<tbody>
<tr>
<td>aws-af-south-1</td>
<td>google-asia-east1</td>
<td>azure-eastus2</td>
</tr>
<tr>
<td>aws-ap-east-1</td>
<td>google-asia-east2</td>
<td>azure-eastus</td>
</tr>
<tr>
<td>aws-ap-northeast-1</td>
<td>google-asia-northeast1</td>
<td>azure-southeastasia</td>
</tr>
<tr>
<td>aws-ap-northeast-2</td>
<td>google-asia-northeast2</td>
<td>azure-westeurope</td>
</tr>
<tr>
<td>aws-ap-south-1</td>
<td>google-asia-northeast3</td>
<td>azure-germany-westcentral</td>
</tr>
<tr>
<td>aws-ap-southeast-1</td>
<td>google-asia-south1</td>
<td>azure-australiasoutheast</td>
</tr>
<tr>
<td>aws-ap-southeast-2</td>
<td>google-asia-southeast1</td>
<td>azure-northcentralus</td>
</tr>
<tr>
<td>aws-ca-central-1</td>
<td>google-australia-southeast1</td>
<td>azure-uae-north</td>
</tr>
<tr>
<td>aws-eu-central-1</td>
<td>google-europe-north1</td>
<td>azure-india-central</td>
</tr>
<tr>
<td>aws-eu-north-1</td>
<td>google-europe-west1</td>
<td>azure-india-west</td>
</tr>
<tr>
<td>aws-eu-south-1</td>
<td>google-europe-west2</td>
<td>azure-korea-south</td>
</tr>
<tr>
<td>aws-eu-west-1</td>
<td>google-europe-west3</td>
<td>azure-south-africa-north</td>
</tr>
<tr>
<td>aws-eu-west-2</td>
<td>google-europe-west4</td>
<td>azure-korea-central</td>
</tr>
<tr>
<td>aws-eu-west-3</td>
<td>google-europe-west6</td>
<td>azure-brazilsouth</td>
</tr>
<tr>
<td>aws-me-south-1</td>
<td>google-northamerica-northeast1</td>
<td>azure-westus2</td>
</tr>
<tr>
<td>aws-sa-east-1</td>
<td>google-southamerica-east1</td>
<td>azure-germany-central</td>
</tr>
<tr>
<td>aws-us-east-1</td>
<td>google-us-central1</td>
<td>azure-germany-northeast</td>
</tr>
<tr>
<td>aws-us-east-2</td>
<td>google-us-east1</td>
<td>azure-southcentralus</td>
</tr>
<tr>
<td>aws-us-west-1</td>
<td>google-us-east4</td>
<td>azure-westcentralus</td>
</tr>
<tr>
<td>aws-us-west-2</td>
<td>google-us-west1</td>
<td>azure-canadacentral</td>
</tr>
<tr>
<td></td>
<td>google-us-west2</td>
<td>azure-canadaeast</td>
</tr>
<tr>
<td></td>
<td>google-us-west3</td>
<td>azure-japanwest</td>
</tr>
<tr>
<td></td>
<td>google-us-west4</td>
<td>azure-westus</td>
</tr>
<tr>
<td></td>
<td></td>
<td>azure-switzerland-north</td>
</tr>
<tr>
<td></td>
<td></td>
<td>azure-japaneast</td>
</tr>
<tr>
<td></td>
<td></td>
<td>azure-australiaeast</td>
</tr>
<tr>
<td></td>
<td></td>
<td>azure-india-south</td>
</tr>
<tr>
<td></td>
<td></td>
<td>azure-eastasia</td>
</tr>
<tr>
<td></td>
<td></td>
<td>azure-france-central</td>
</tr>
<tr>
<td></td>
<td></td>
<td>azure-uksouth</td>
</tr>
<tr>
<td></td>
<td></td>
<td>azure-ukwest</td>
</tr>
<tr>
<td></td>
<td></td>
<td>azure-centralus</td>
</tr>
<tr>
<td></td>
<td></td>
<td>azure-northeurope</td>
</tr>
</tbody>
</table>
<!--kg-card-end: markdown--><p>You can also see all plans and calculate your estimated costs for various combinations with <a href="https://www.timescale.com/cloud-pricing/?utm_source=timescale-cloud-momentum-announcement&amp;utm_medium=blog&amp;utm_campaign=pr-2020&amp;utm_content=cloud-pricing-calculator-page">our Cloud pricing calculator</a>.</p><h2 id="the-leading-cloud-service-for-time-series-data">The leading cloud service for time-series data</h2><p>With these improvements, Timescale Cloud continues to increase its lead as the top cloud service for time-series data, with better performance, a better developer experience, and a lower cost than AWS RDS, MongoDB, InfluxDB, AWS Timestream, and others.</p><ul><li><strong>Supercharged Postgres. </strong>The same PostgreSQL you know and love, but better: full SQL, rock-solid reliability, and the largest ecosystem of development and management tools. Be productive instantly.</li><li><strong>Accelerated Performance. </strong>10-100x faster queries than <a href="https://docs.timescale.com/latest/introduction/timescaledb-vs-postgres/?utm_source=timescale-cloud-momentum-announcement&amp;utm_medium=blog&amp;utm_campaign=pr-2020&amp;utm_content=postgres-benchmark-docs">PostgreSQL</a>, <a href="http://3.88.198.173/blog/timescaledb-vs-influxdb-for-time-series-data-timescale-influx-sql-nosql-36489299877/?utm_source=timescale-cloud-momentum-announcement&amp;utm_medium=blog&amp;utm_campaign=pr-2020&amp;utm_content=influx-benchmark-blog">InfluxDB</a>, and <a href="http://3.88.198.173/blog/how-to-store-time-series-data-mongodb-vs-timescaledb-postgresql-a73939734016/?utm_source=timescale-cloud-momentum-announcement&amp;utm_medium=blog&amp;utm_campaign=pr-2020&amp;utm_content=mongodb-benchmark-blog">MongoDB</a>. Easily handle high-cardinality data and build faster applications, without worrying about your infrastructure.</li><li><strong>Massive scale. </strong>Write millions of data points per second. Scale up to 72 CPUs and 10TB of storage (equivalent to 150TB+ of uncompressed data). Timescale Cloud grows with you.</li><li><strong>Relational &amp; time-series, together. </strong>Simplify your stack and store your relational data alongside time-series data. Ask more complex queries, build more powerful applications.</li><li><strong>More cost effective. </strong>Spend less with 94% compression savings from best-in-class algorithms, including delta-delta encoding, Gorilla, and more, and a memory-efficient architecture. For example, Timescale Cloud can be anywhere from 9X to 72X cheaper than AWS Timestream, depending on the type of workload (<a href="https://docs.google.com/spreadsheets/d/1Nb9wTLqlWB_uch_VKuImSgSqmsccQMBcHPWlHFVmh28/edit#gid=0">see pricing comparison worksheet</a>).</li><li><strong>Deterministic pricing. </strong><a href="https://www.timescale.com/cloud-pricing/?utm_source=timescale-cloud-momentum-announcement&amp;utm_medium=blog&amp;utm_campaign=pr-2020&amp;utm_content=cloud-pricing-calculator-page">Crystal-clear pricing calculators</a> so that you know ahead of time how much you’re going to pay at the end of the month. No end-of-month surprises.</li><li><strong>More clouds and regions. </strong>Available on AWS, Azure, and GCP, in over 75 regions around the world. Timescale Cloud meets you where you already are. By comparison, Influx Cloud is only available in 4 regions. AWS Timestream, whenever it officially launches, will only be available on AWS (and seems to be limited to one region right now during private preview).</li><li><strong>We try harder.</strong> Our world-class support and customer success team is always here to help you, whether via email or Slack. 
We do whatever it takes to ensure your success, from advising on database design to giving you specific advice for query optimization and everything in between.</li></ul><h2 id="-with-the-largest-time-series-developer-community">...with the largest time-series developer community</h2><p>The TimescaleDB developer community has come a long way in just 3 years, with tens of millions of downloads and over 500,000 active databases today.</p><p>This community includes organizations like AppDynamics, Bosch, Cisco, Comcast, Fujitsu, IBM, Schneider Electric, Samsung, Siemens, Uber, Warner Music, and thousands of others, including: </p><ul><li><a href="https://www.timescale.com/case-studies/transferwise/?utm_source=timescale-cloud-momentum-announcement&amp;utm_medium=blog&amp;utm_campaign=pr-2020&amp;utm_content=transferwise-story">TransferWise</a>, for providing instant global monetary transfers with accurate conversion estimates</li><li><a href="https://www.zabbix.com/whats_new_5_0">Zabbix</a>, the open-source IT monitoring platform, for storing metrics from servers, virtual machines, and network devices</li><li><a href="https://www.timescale.com/case-studies/sakura-internet/?utm_source=timescale-cloud-momentum-announcement&amp;utm_medium=blog&amp;utm_campaign=pr-2020&amp;utm_content=SAKURA-story">SAKURA Internet</a>, a leading Internet infrastructure service provider for businesses and individuals in Japan, for monitoring network traffic</li><li><a href="https://www.timescale.com/case-studies/laika/?utm_source=timescale-cloud-momentum-announcement&amp;utm_medium=blog&amp;utm_campaign=pr-2020&amp;utm_content=Laika-story">LAIKA</a>, the acclaimed animation studio (Coraline, ParaNorman, The Boxtrolls, Kubo and the Two Strings, Missing Link), for IT monitoring consolidation</li><li><a 
href="http://3.88.198.173/blog/european-space-agency-postgresql-geospatial-time-series-9cced899c41d/?utm_source=timescale-cloud-momentum-announcement&amp;utm_medium=blog&amp;utm_campaign=pr-2020&amp;utm_content=ESA-blog-story">European Space Agency</a>, for high-resolution studies of the Sun and inner heliosphere</li><li><a href="https://www.timescale.com/case-studies/senseforce/?utm_source=timescale-cloud-momentum-announcement&amp;utm_medium=blog&amp;utm_campaign=pr-2020&amp;utm_content=senseforce-story">Senseforce</a>, as a centralized datastore for all their industrial IoT data and machine metrics data</li><li><a href="https://www.timescale.com/case-studies/sentinel-marine-solutions/?utm_source=timescale-cloud-momentum-announcement&amp;utm_medium=blog&amp;utm_campaign=pr-2020&amp;utm_content=sentinel-story">Sentinel Marine</a>, for maritime fleet management, managing boats and other assets</li><li><a href="https://www.timescale.com/case-studies/k6/?utm_source=timescale-cloud-momentum-announcement&amp;utm_medium=blog&amp;utm_campaign=pr-2020&amp;utm_content=k6-story">k6</a>, for powering their load testing SaaS service for developers, DevOps, QA, and SRE teams</li><li><a href="https://www.youtube.com/watch?v=1gTdBHPPAMk&amp;t=4s">Grillo</a>, for monitoring earthquakes in Mexico in real-time with low-cost sensors</li></ul><h2 id="ready-to-explore">Ready to Explore?</h2><p>To get started, you can <strong><a href="https://www.timescale.com/cloud-signup/?utm_source=timescale-cloud-momentum-announcement&amp;utm_medium=blog&amp;utm_campaign=pr-2020&amp;utm_content=cloud-signup">sign up for a free trial</a> </strong>right away ($300 in free credits), or <a href="https://www.timescale.com/cloud-demo/?utm_source=timescale-cloud-momentum-announcement&amp;utm_medium=blog&amp;utm_campaign=pr-2020&amp;utm_content=request-demo"><strong>book a 30 min technical demo with the team</strong> </a>to customize a plan for your scenario. 
</p><p>Please also <a href="https://slack.timescale.com/">reach out on our public Slack</a> at any time to ask questions, get best practices from Timescale Product and Engineering (as well as our active community of developers), and tell us what you think.<br></p>]]></content:encoded></item><item><title><![CDATA[Speed up Grafana by auto-switching between different aggregations, using PostgreSQL]]></title><description><![CDATA[Learn how (and why) to use PostgreSQL and TimescaleDB to speed up your Grafana drill downs: step-by-step guide to enable "auto-switching" between aggregations, depending on the time interval.]]></description><link>http://3.88.198.173:80/speed-up-grafana-autoswitching-postgresql/</link><guid isPermaLink="false">5fdb982d8946b00cbd801370</guid><category><![CDATA[Grafana]]></category><dc:creator><![CDATA[Avthar Sewrathan]]></dc:creator><pubDate>Tue, 11 Aug 2020 14:13:02 GMT</pubDate><media:content url="http://3.88.198.173/content/images/2020/08/Autoswtiching-header.gif" medium="image"/><content:encoded><![CDATA[<img src="http://3.88.198.173/content/images/2020/08/Autoswtiching-header.gif" alt="Speed up Grafana by auto-switching between different aggregations, using PostgreSQL"><p><em>Learn how (and why) to speed up your Grafana drill downs, using PostgreSQL to allow "auto-switching" between aggregations, depending on the time interval you select.</em></p><h2 id="the-problem-grafana-is-slow-to-load-visualizations-especially-for-non-aggregated-fine-grained-data">The problem: Grafana is slow to load visualizations, especially for non-aggregated, fine-grained data<br></h2><p>The <a href="https://grafana.com/">Grafana</a> UI is great for drilling down into your data. 
However, for large amounts of data with second, millisecond, or even nanosecond time granularity, it can be frustratingly slow and result in higher resource usage.</p><p>For example, take this graph of all New York City taxi rides during the month of January 2016:</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://lh4.googleusercontent.com/SqJRd0BqHiMit7Ep9QDlVvCXvANcKlbMFSATl84NbnZ1Vh1NQiFNBoqQT6a-KyD6FTJju0M2rvWXFqkxMZg6UDoQb75soblORBpilsHaxLeXfjcui9KthoOP1YNVysYIU7SS8CXO" class="kg-image" alt="Speed up Grafana by auto-switching between different aggregations, using PostgreSQL"><figcaption>Example of how slow drill downs into data can be</figcaption></figure><p>One common workaround: instead of querying raw data and aggregating on the fly, you query and visualize data from <em>aggregates </em>of your raw data (e.g., one minute, one hour, or one day rollups). </p><p>For PostgreSQL data sources, we do this by aggregating data into views and querying those instead, and for TimescaleDB, we use continuous aggregates – think “automatically refreshing Postgres views” (for more see the <a href="https://docs.timescale.com/latest/api?utm_source=grafana-autoswtiching&amp;utm_medium=blog&amp;utm_campaign=july-2020-advocacy&amp;utm_content=docs-aggs#continuous-aggregates">continuous aggregates docs</a>).</p><p>However, this often leads to several Grafana panels, each querying the same data aggregated at different granularities. For example, you might capture the same metric over time, but set up aggregates at various intervals, such as in minutely, hourly, and daily intervals. 
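</p><p>For a plain PostgreSQL data source, each such rollup can be an ordinary materialized view that you refresh on a schedule. Here is a minimal sketch, assuming the <code>rides</code> table used later in this tutorial (the view name is hypothetical):</p><figure class="kg-card kg-code-card"><pre><code class="language-SQL">-- Hourly rollup as a plain materialized view (hypothetical name)
CREATE MATERIALIZED VIEW rides_hourly_mv AS
    SELECT date_trunc('hour', pickup_datetime) AS hour, COUNT(*) AS ride_count
    FROM rides
    GROUP BY 1;

-- Materialized views don't refresh themselves; re-run this periodically
-- (e.g., from cron) to pick up newly inserted rows:
REFRESH MATERIALIZED VIEW rides_hourly_mv;</code></pre><figcaption>A plain-PostgreSQL sketch of pre-aggregating into a materialized view</figcaption></figure><p>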
</p><p>This then requires 3 separate panels, one for each aggregated interval.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://lh3.googleusercontent.com/BAbFKhlPleoYA7Ov7vyDUf-Po68AlgmVeuPYAKW5m92LVKhvDi5hILOabZIifbolv9Je9iz8RneHAZOc_to-O42YEE8lUBnY6CK_Pk0iQWqbqNVYHCFim4epwyR2sCNrmB5Oyj5g" class="kg-image" alt="Speed up Grafana by auto-switching between different aggregations, using PostgreSQL"><figcaption>Example of 3 panels all showing taxi rides over January 2016 but in different time granularities (daily, hourly, and per minute, from top to bottom).</figcaption></figure><p>...but, what if we could use <em>one universal panel</em> that could “automatically” switch between minutely, hourly, daily, or any other arbitrary aggregations of our data, depending on the time period we’d like to query and analyze? This would speed up queries and use resources like CPU more efficiently.</p><p>Enter the PostgreSQL <code>UNION ALL</code> operator…</p><h2 id="the-solution-use-postgres-union-all">The solution: Use Postgres <code>UNION ALL</code></h2><p>When we use PostgreSQL as our Grafana data source, we can write a single query that allows us to automatically switch between different aggregated views of our data (e.g., daily, hourly, weekly views, etc.) in the same Grafana visualization (!).</p><p>🔑 <strong>The key</strong>: we (1) use the <code>UNION ALL</code> operator to combine separate queries that pull data with different aggregations, and (2) then use the <code>WHERE</code> clause to switch the table (or continuous aggregate view) being queried, depending on the length of the time-interval selected (from either the timepicker, or by highlighting the time period in a graph). </p><p>This not only allows us to drill arbitrarily deep into our data, but also makes loading the data as efficient and fast as possible, saving time and CPU resources. 
(In Grafana, drilling into data is typically done by zooming in and out, highlighting the time period of interest in the graph as shown in the image below.) </p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://lh3.googleusercontent.com/wfayc3I9fPfFV-Qv2kEto9OiA4huQKbS7rUFf8ptoNhewCgQzip_kGfXdOd2f3CeaL4lsAf8SDjvh5COSGLqUCfu0gAgFTpDnmCtIqUUXEXC3FdBdCBAef5xvNa8l1GCHyRLHoP6" class="kg-image" alt="Speed up Grafana by auto-switching between different aggregations, using PostgreSQL"><figcaption>Example of auto-switching between different aggregations of data depending on the time interval selected. Learn how to create this example in the tutorial below.</figcaption></figure><h2 id="try-it-yourself-implementation-in-grafana-sample-queries">Try It Yourself: Implementation in Grafana &amp; sample queries<br></h2><p>To help you get up and running with <code>UNION ALL</code>, I’ve put together a short step-by-step guide and a few sample queries (which you can modify to suit your project, app, and the metrics you care about).</p><h3 id="scenario">Scenario</h3><p>We’ll take the use case of monitoring IoT devices, specifically taxis equipped with sensors. 
For reference, we’ll use a dataset that contains all New York City taxi ride activity for the month of January 2016, from the <a href="https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page">New York Taxi and Limousine Commission</a> (NYC TLC).</p><h3 id="prerequisites">Prerequisites</h3><ul><li><a href="https://www.timescale.com/products/?utm_source=timescale-grafana-autoswitching&amp;utm_medium=blog&amp;utm_campaign=aug-2020-advocacy&amp;utm_content=products-page">TimescaleDB instance</a> (Timescale Cloud or self-hosted) running PostgreSQL 11+</li><li><a href="https://grafana.com/">Grafana instance</a> (cloud or self-hosted)</li><li>TimescaleDB instance connected to Grafana (see <a href="https://docs.timescale.com/latest/getting-started/installation-grafana/?utm_source=timescale-grafana-autoswitching&amp;utm_medium=blog&amp;utm_campaign=aug-2020-advocacy&amp;utm_content=grafana-setup-tutorial">this tutorial</a> for more)</li><li>To load the sample dataset into TimescaleDB, complete<a href="https://docs.timescale.com/latest/tutorials/tutorial-hello-timescale/?utm_source=timescale-grafana-autoswitching&amp;utm_medium=blog&amp;utm_campaign=aug-2020-advocacy&amp;utm_content=hello-timescale-tutorial"> Mission 1 in this tutorial</a>, which takes you through downloading the .CSV file and inserting the data into the database.</li><li>Use the queries below to create 2 continuous aggregates. These will be the aggregate views we switch between in our Grafana visualization:</li></ul><p>To create daily aggregates:</p><figure class="kg-card kg-code-card"><pre><code class="language-SQL">CREATE VIEW rides_daily
WITH (timescaledb.continuous, timescaledb.refresh_interval = '1 day')
AS
    SELECT time_bucket('1 day', pickup_datetime) AS day, COUNT(*) AS ride_count
    FROM rides
    GROUP BY day;</code></pre><figcaption>SQL query to create daily aggregates of rides during January 2016</figcaption></figure><p>This computes a roll up of the total number of rides taken during each day during the time-period of our data (January 2016).<br></p><p>To create hourly aggregates:</p><figure class="kg-card kg-code-card"><pre><code class="language-SQL">CREATE VIEW rides_hourly
WITH (timescaledb.continuous, timescaledb.refresh_interval = '1 hour')
AS
    SELECT time_bucket('1 hour', pickup_datetime) AS hour, COUNT(*) AS ride_count
    FROM rides
    GROUP BY hour;</code></pre><figcaption>SQL query to create hourly aggregates of rides during January 2016</figcaption></figure><p>This computes a roll up of the total number of rides taken during each hour during the time-period of our data.</p><p>For more on how continuous aggregates work, see <a href="https://docs.timescale.com/latest/using-timescaledb/continuous-aggregates/?utm_source=timescale-grafana-autoswitching&amp;utm_medium=blog&amp;utm_campaign=aug-2020-advocacy&amp;utm_content=continuous-aggs-tutorial">these docs</a>.</p><h2 id="example-1-auto-switch-between-daily-aggregate-hourly-aggregate-and-raw-data">Example 1: Auto-switch between daily aggregate, hourly aggregate, and raw data<br></h2><p>In the example below, we have a query using <code>UNION ALL</code>, where we only select a specific table or view, depending on the length of time selected interval in the Grafana UI (controlled by the <code>$__timeFrom</code> and <code>$__timeTo</code> macros in Grafana).</p><p>As the comments in the code below show, we use daily aggregates for intervals greater than 14 days, hourly aggregates for intervals between 3 and 14 days, and per minute aggregates calculated on the fly from raw data for intervals less than 3 days:</p><p><strong>Switching between daily aggregation, hourly aggregation and minutely aggregations on raw data</strong></p><figure class="kg-card kg-code-card"><pre><code class="language-SQL">-- Use Daily aggregate for intervals greater than 14 days
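-- Note: $__timeFrom() and $__timeTo() are Grafana macros that expand to the
-- dashboard's selected time range, and $__timeFilter(column) expands to a
-- "column BETWEEN from AND to" filter on that column.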
SELECT day AS time, ride_count, 'daily' AS metric
FROM rides_daily
WHERE  $__timeTo()::timestamp - $__timeFrom()::timestamp &gt; '14 days'::interval AND $__timeFilter(day)
UNION ALL
-- Use hourly aggregate for intervals between 3 and 14 days
SELECT hour, ride_count, 'hourly' AS metric
FROM rides_hourly
WHERE  $__timeTo()::timestamp - $__timeFrom()::timestamp BETWEEN '3 days'::interval AND '14 days'::interval AND $__timeFilter(hour)
UNION ALL
-- Use raw data (rolled up into minute intervals) for intervals between 0 and 3 days
SELECT * FROM
    (SELECT time_bucket('1m', pickup_datetime) AS time, count(*) AS ride_count, 'minute' AS metric
    FROM rides
    WHERE  $__timeTo()::timestamp - $__timeFrom()::timestamp &lt; '3 days'::interval AND $__timeFilter(pickup_datetime)
    GROUP BY 1) minute
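-- Only one branch's interval condition is true for any selected time range,
-- so exactly one of the three sources above contributes rows; ORDER BY
-- sorts the combined result by the time column.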
ORDER BY 1;</code></pre><figcaption>Query to switch between daily aggregation, hourly aggregation, and per minute aggregations created on the fly from raw data</figcaption></figure><p></p><p><strong>This produces the following behavior in our Grafana panels:</strong></p><p>Querying daily aggregates for intervals greater than 14 days:</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://lh5.googleusercontent.com/BcFdSuAMlcPbEx9A7zNOIO-3UzxFq98uqbni3XJDHosQYS-p-Cg4BDkboDegaV_hRykid68fhFOk2Kd22BYCdvW5htSVpeFYYB2unVY5qnkrNd6QQbfU5roDtgguH38b_Kk-KutH" class="kg-image" alt="Speed up Grafana by auto-switching between different aggregations, using PostgreSQL"><figcaption>The graph is powered by the daily aggregate view for intervals greater than 14 days</figcaption></figure><p></p><p>Querying hourly aggregates for intervals between 3-14 days:</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://lh5.googleusercontent.com/CS4AK8lWoqCKZ6j1dZ06Sxcs6v6X-r5MgUVNOMRzkFdfjwHGpzMHv-qY6QDB6_4g6oE1mrIyXvaS2K9389-U1khvslFvN9HCQMnnNmbnzf1y8AhAZw-llxoFTDVC8mjNBcLFNrdK" class="kg-image" alt="Speed up Grafana by auto-switching between different aggregations, using PostgreSQL"><figcaption>The graph is powered by the hourly aggregate view for intervals between 3 and 14 days</figcaption></figure><p></p><p>Querying raw data for intervals less than 3 days:</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://lh6.googleusercontent.com/tNw64g5RXA5kBxwfF19fmjmj3-w2W2v2cZit-w0YGhb2qAdMLZQbesWutzpLmHJ_RdlIz8w4-TT46Fq5nJdTLJU2Z-6jczFxmf47LlD8HWv5vVDovUjnUihqXtNHtNh8aIVgGcxT" class="kg-image" alt="Speed up Grafana by auto-switching between different aggregations, using PostgreSQL"><figcaption>The graph is powered by rolling up raw data into 1 minute intervals on the fly for intervals less than 3 days</figcaption></figure><p></p><p>This allows you to automatically switch between different aggregations of data, depending 
on the length of the time interval selected. Notice how the granularity of the data gets richer as we drill down from looking at data over the month of January to looking at data in a single day:</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://lh3.googleusercontent.com/wfayc3I9fPfFV-Qv2kEto9OiA4huQKbS7rUFf8ptoNhewCgQzip_kGfXdOd2f3CeaL4lsAf8SDjvh5COSGLqUCfu0gAgFTpDnmCtIqUUXEXC3FdBdCBAef5xvNa8l1GCHyRLHoP6" class="kg-image" alt="Speed up Grafana by auto-switching between different aggregations, using PostgreSQL"><figcaption>Demo of automatically switching between daily, hourly, and minute aggregations of data, depending on time interval selected</figcaption></figure><h2 id="example-2-auto-switch-between-daily-hourly-and-10-minute-aggregates">Example 2: Auto-switch between daily, hourly, and 10 minute aggregates<br></h2><p>Querying only from continuous aggregates allows us to speed up our dashboards even further. You might not want to directly query the hypertable that houses your raw data, as the queries may be slower, due to things like new data being inserted into the hypertable. </p><p>The following example shows a query for switching between aggregations of different granularity without using the raw data hypertable at all (unlike Example 1, which does on-the-fly rollups of raw data). </p><p>First, let’s create 10 minute rollups of the raw data:<br></p><figure class="kg-card kg-code-card"><pre><code class="language-SQL">CREATE VIEW rides_10mins
WITH (timescaledb.continuous, timescaledb.refresh_interval = '10 minute')
AS
    SELECT time_bucket('10 minutes', pickup_datetime) AS bucket, COUNT(*) AS ride_count
    FROM rides
    GROUP BY bucket;</code></pre><figcaption>Query to create 10 minute rollups of data in a continuous aggregate</figcaption></figure><p><br><strong>Switching between daily aggregation, hourly aggregation, and 10 minute aggregations (no raw data involved)</strong></p><figure class="kg-card kg-code-card"><pre><code class="language-SQL">-- Use Daily aggregate for intervals greater than 14 days
SELECT day as time, ride_count, 'daily' AS metric
FROM rides_daily
WHERE  $__timeTo()::timestamp - $__timeFrom()::timestamp &gt; '14 days'::interval AND  $__timeFilter(day)
UNION ALL
-- Use hourly aggregate for intervals between 3 and 14 days
SELECT hour, ride_count, 'hourly' AS metric
FROM rides_hourly
WHERE $__timeTo()::timestamp - $__timeFrom()::timestamp BETWEEN '3 days'::interval AND '14 days'::interval AND  $__timeFilter(hour)
UNION ALL
-- Use 10 minute aggregate for intervals between 0 and 3 days
SELECT bucket, ride_count, '10min' AS metric
FROM rides_10mins
WHERE $__timeTo()::timestamp - $__timeFrom()::timestamp &lt; '3 days'::interval AND  $__timeFilter(bucket)
ORDER BY 1; </code></pre><figcaption>Query to switch between daily, hourly, and 10 minute aggregations, all using continuous aggregates</figcaption></figure><p>In this post, we saw how to use <code>UNION ALL</code> to automatically switch which aggregate view we’re querying, based on the time interval selected, so that we can do more efficient drill downs and make Grafana faster.</p><p>You can find more information about the <code>UNION ALL</code> operator and how it works in this <a href="https://www.postgresqltutorial.com/postgresql-union/">PostgreSQL tutorial</a> - from the aptly named PostgreSQLtutorial.com - and the <a href="https://www.postgresql.org/docs/8.3/queries-union.html">“official” PostgreSQL documentation</a>.</p><p>That’s it! You can modify this code to change the aggregates you query and time intervals, as well as the metrics you want to visualize, to suit your needs and projects.</p><p>Happy auto-switching!</p><h2 id="next-steps">Next Steps</h2><p>In this tutorial, we learned how to use PostgreSQL <code>UNION ALL</code> to solve a common Grafana issue: slow-loading dashboards when we want to query fine-grained raw data (like millisecond performance metrics). </p><p>The result: you create graphs that enable you to automatically switch between different aggregations of your data. This allows you to drill down into your metrics as quickly as possible, saving time <em>and</em> CPU resources!</p><h3 id="learn-more">Learn More</h3><p>Want more Grafana tips? 
Explore our <a href="https://docs.timescale.com/latest/tutorials/tutorial-grafana/?utm_source=timescale-grafana-autoswitching&amp;utm_medium=blog&amp;utm_campaign=aug-2020-advocacy&amp;utm_content=grafana-tutorial-hp">Grafana tutorials</a> (I recommend this one on <a href="https://docs.timescale.com/latest/tutorials/tutorial-grafana-variables/?utm_source=timescale-grafana-autoswitching&amp;utm_medium=blog&amp;utm_campaign=aug-2020-advocacy&amp;utm_content=grafana-variables-tutorial">variables</a> and this one on <a href="https://docs.timescale.com/latest/tutorials/tutorial-howto-visualize-missing-data-grafana/?utm_source=timescale-grafana-autoswitching&amp;utm_medium=blog&amp;utm_campaign=aug-2020-advocacy&amp;utm_content=visualize-missing-grafana-tutorial">visualizing missing data</a>).</p><p>Need a database to power your dashboarding and data analysis? <strong><a href="https://www.timescale.com/cloud-signup/?utm_source=timescale-grafana-autoswitching&amp;utm_medium=blog&amp;utm_campaign=aug-2020-advocacy&amp;utm_content=cloud-signup-page">Get started with Timescale Cloud</a></strong> (it’s our fast, easy-to-use, and reliable time-series database built on PostgreSQL, available in 75+ cloud regions). 
When you sign up, you’ll get $300 in credits to get you up and running.</p>]]></content:encoded></item><item><title><![CDATA[How I power a (successful) crypto trading bot with TimescaleDB]]></title><description><![CDATA[Learn how Felipe Queis uses TimescaleDB to power his crypto trading bot – and how his side project influenced his team’s decision to adopt TimescaleDB for their work *and* resulted in several successful crypto trades.]]></description><link>http://3.88.198.173:80/how-i-power-a-successful-crypto-trading-bot-with-timescaledb/</link><guid isPermaLink="false">5fdb982d8946b00cbd80136f</guid><category><![CDATA[Dev Q&A]]></category><dc:creator><![CDATA[Lacey Butler]]></dc:creator><pubDate>Fri, 07 Aug 2020 23:16:00 GMT</pubDate><media:content url="http://3.88.198.173/content/images/2020/07/ewan-kennedy-S8ZzjfWwer4-unsplash.jpg" medium="image"/><content:encoded><![CDATA[<img src="http://3.88.198.173/content/images/2020/07/ewan-kennedy-S8ZzjfWwer4-unsplash.jpg" alt="How I power a (successful) crypto trading bot with TimescaleDB"><p><em>This is an installment of our “Community Member Spotlight” series, where we invite TimescaleDB community members to share their work, shining a light on their success and inspiring others with new ways to use technology to solve problems.</em></p><p><em>In this edition, <a href="https://www.linkedin.com/in/fqueis">Felipe Queis</a>, a senior full-stack engineer for a Brazilian government traffic institution, joins us to share how he uses TimescaleDB to power his crypto trading bot – and how his side project influenced his team’s decision to adopt TimescaleDB for their work. </em><br></p><p>My first experience with crypto wasn’t under very good circumstances: the servers a friend takes care of at his job were infected with ransomware – and the malware demanded he pay the ransom in a cryptocurrency called Monero (XMR). 
</p><p>After this not-so-friendly introduction, I started to study how the technology behind cryptocurrencies works, and I fell in love with it. I was already interested in the stock market, so I joined the familiar (stock market) with the novel (crypto). To test the knowledge I’d learned from my stock market books, I started creating a simple <a href="https://www.investopedia.com/terms/m/macd.asp">Moving Average Convergence Divergence (MACD)</a> crossover bot.</p><p>This worked for a while, but I quickly realized that I should - and could - make the bot a lot better.</p><p>Now, the project that I started as a hobby has a capital management system, a combination of technical indicators, and sentiment analysis powered by machine learning. Between 10 March 2020 and 10 July 2020, my bot resulted in a <a href="https://terminal.tradesignalonline.com/wiki/3/Percent+Profitable">success rate</a> of 61.5%, <a href="https://therobusttrader.com/profit-factor/">profit factor</a> of 1.89, and cumulative gross result of approximately 487% (you can see a copy of all of my trades during this period in <a href="https://docs.google.com/spreadsheets/d/1ElkrnPfqNYCV4s44QsAM7fwyFu-FNgjOsJTScmA5qWk/edit">this Google Sheet report</a>).</p><h2 id="about-me"><strong>About me</strong></h2><p>I'm 29 years old, and I’ve worked at a governmental traffic institution in São Paulo, Brazil (where I also live) as a senior full-stack developer since 2012.</p><p>In my day job, my main task at the moment is processing and storing the stream of information from Optical Character Recognition (OCR)-equipped speed cameras that capture data from thousands of vehicles as they travel our state highways. Our data stack uses technologies like Java, Node.js, Kafka, and TimescaleDB. </p><p>(For reference, I started using TimescaleDB for my hobby project, and, after experiencing its performance and scale with my bot, I proposed we use it at my organization. 
We’ve found that it brings together the best of both worlds: time-series in a SQL database <strong>and </strong>open source).</p><p>I started to develop my crypto trading bot in mid-2017, about six months after my first encounter with the crypto ecosystem – and I’ve continued working on it in my spare time for the last two and a half years. </p><p><em>Editor’s Note: Felipe recently hosted a </em><a href="https://www.reddit.com/r/algotrading/comments/hpimu5/after_25_year_a_lot_of_coding_refactoring_testing/"><em>Reddit AMA (Ask Me Anything)</em></a><em> to share how he’s finally “perfected” his model, plus his experiences and advice for aspiring crypto developers and traders. </em></p><h2 id="about-the-project"><strong>About the project</strong></h2><p>I needed a bot that gave me a high-performance, scalable way to calculate technical indicators and process sentiment data in real-time. </p><p>To do everything I need for my technical indicator calculations, I collect <a href="https://en.wikipedia.org/wiki/Candlestick_chart">candlestick chart</a> data and market depth via an always-up websocket connection that tracks every Bitcoin market on the <a href="https://www.binance.com/">Binance exchange</a> (~215 in total, 182 of them tradeable at the moment).</p><p>The machine learning sentiment analysis started as a simple experiment to see if external news affected the market. For example: if a famous person in the crypto ecosystem tweeted that a big exchange was hacked, the price would probably fall and affect the whole market. Likewise, very good news should impact the price in a positive way. I calculated sentiment analysis scores in real-time, as soon as new data was ingested from sources like Twitter, Reddit, RSS feeds, etc. 
Then, using these scores, I could determine market conditions at the moment.</p><p>Now, I combine these two components with a weighted average, 60% technical indicators and 40% sentiment analysis.<br></p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="http://3.88.198.173/content/images/2020/07/dashboard--1--1.png" class="kg-image" alt="How I power a (successful) crypto trading bot with TimescaleDB" srcset="http://3.88.198.173/content/images/size/w600/2020/07/dashboard--1--1.png 600w, http://3.88.198.173/content/images/size/w1000/2020/07/dashboard--1--1.png 1000w, http://3.88.198.173/content/images/size/w1600/2020/07/dashboard--1--1.png 1600w, http://3.88.198.173/content/images/2020/07/dashboard--1--1.png 1838w" sizes="(min-width: 720px) 720px"><figcaption>Felipe's TradingBot dashboard, where he tracks all ongoing trades and results</figcaption></figure><figure class="kg-card kg-image-card kg-card-hascaption"><img src="http://3.88.198.173/content/images/2020/07/weekly-sheet.png" class="kg-image" alt="How I power a (successful) crypto trading bot with TimescaleDB" srcset="http://3.88.198.173/content/images/size/w600/2020/07/weekly-sheet.png 600w, http://3.88.198.173/content/images/size/w1000/2020/07/weekly-sheet.png 1000w, http://3.88.198.173/content/images/2020/07/weekly-sheet.png 1234w" sizes="(min-width: 720px) 720px"><figcaption>Quick breakdown of Felipe’s results and success rates week-over-week (for the period of 10 March 2020 - 10 July 2020)</figcaption></figure><h2 id="using-timescaledb"><strong>Using TimescaleDB</strong></h2><p>At the beginning, I tried to save the collected data in simple files, but quickly realized that wasn’t a good way to store and process this data. I started looking for an alternative: a performant database.</p><p>I went through several databases, and all of them always lacked something I wound up needing to continue my project. 
I tried MongoDB, InfluxDB, and Druid, but none of them 100% met my needs.</p><p>Of the databases I tried, <strong>InfluxDB was a good option; however, every query that I tried to run was painful, due to their own query language (InfluxQL)</strong>. </p><p>As soon as my series started to grow exponentially to higher levels, the server didn't have enough memory to handle them all in real-time. This is because the current <a href="https://docs.influxdata.com/influxdb/v1.8/concepts/storage_engine/#the-influxdb-storage-engine-and-the-time-structured-merge-tree-tsm">InfluxDB TSM</a> storage engine requires more and more allocated memory for each series. I have a large number of unique metrics, so the process ran out of available memory quickly.</p><p>I handle somewhat large amounts of data every day, especially on days with many market movements. </p><p><strong>On average, I’m ingesting around 20k records/market, or 3.6 million total records, per day (20k*182 markets).</strong></p><blockquote>This is where TimescaleDB started to shine for me. 
It gave me fast real-time aggregations, built-in time-series functions, high ingestion rates – and it didn’t require elevated memory usage to do all of this.</blockquote><p><em><strong>Editor’s Note:</strong> For more about how Flux compares to SQL and deciding which one is right for you, <a href="http://3.88.198.173/blog/sql-vs-flux-influxdb-query-language-time-series-database-290977a01a8a/?utm_source=timescale-cryptobot-qa&amp;utm_medium=blog&amp;utm_campaign=aug-2020-advocacy&amp;utm_content=sql-v-flux-blog">see our blog post exploring the strengths and weaknesses of each</a>.</em></p><p><em>To learn more about how TimescaleDB real-time aggregations work (as well as how they compare to vanilla PostgreSQL), see<a href="http://3.88.198.173/blog/achieving-the-best-of-both-worlds-ensuring-up-to-date-results-with-real-time-aggregation/?utm_source=timescale-cryptobot-qa&amp;utm_medium=blog&amp;utm_campaign=aug-2020-advocacy&amp;utm_content=rta-details-blog"> this blog post and mini-tutorial</a>.</em></p><p>In addition to this raw market data, a common use case for me is to analyze the data in different time frames (e.g., 1min, 5min, 1hr, etc.). I maintain these records in a pre-computed aggregate to increase my query performance and allow me to make faster decisions about whether or not to enter a position.</p><p>For example, here’s a simple query that I use a lot to follow the performance of my trades on a daily or weekly basis (daily in this case):</p><pre><code class="language-SQL">SELECT time_group, total_trades, positive_trades, 
	negative_trades,
	ROUND(100 * (positive_trades / total_trades), 2) AS success_rate, profit as gross_profit,
    ROUND((profit - (total_trades * 0.15)), 2) AS net_profit
FROM (
	SELECT time_bucket('1 day', buy_at::TIMESTAMP)::DATE AS time_group, COUNT(*) AS total_trades, 
		SUM(CASE WHEN profit &gt;  0 THEN 1 ELSE 0 END)::NUMERIC AS positive_trades, 
		SUM(CASE WHEN profit &lt;= 0 THEN 1 ELSE 0 END)::NUMERIC AS negative_trades,
		ROUND(SUM(profit), 2) AS profit 
	FROM trade
	GROUP BY time_group ORDER BY time_group 
) T ORDER BY time_group
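
-- A hypothetical weekly variant of the same report (a sketch: the only change
-- from the daily query above is bucketing trades by 7 days instead of 1 day):
SELECT time_group, total_trades, positive_trades, negative_trades,
	ROUND(100 * (positive_trades / total_trades), 2) AS success_rate,
	profit AS gross_profit,
	ROUND((profit - (total_trades * 0.15)), 2) AS net_profit
FROM (
	SELECT time_bucket('7 days', buy_at::TIMESTAMP)::DATE AS time_group, COUNT(*) AS total_trades,
		SUM(CASE WHEN profit &gt; 0 THEN 1 ELSE 0 END)::NUMERIC AS positive_trades,
		SUM(CASE WHEN profit &lt;= 0 THEN 1 ELSE 0 END)::NUMERIC AS negative_trades,
		ROUND(SUM(profit), 2) AS profit
	FROM trade
	GROUP BY time_group
) T ORDER BY time_group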
</code></pre><p></p><p>And, I often use this function to <a href="https://www.investopedia.com/terms/a/atr.asp">measure market volatility, decomposing the range of a market pair in a period</a>:</p><pre><code class="language-SQL">CREATE OR REPLACE FUNCTION tr(_symbol TEXT, _till INTERVAL)
	RETURNS TABLE(date TIMESTAMP WITHOUT TIME ZONE, result NUMERIC(9,8), percent NUMERIC(9,8)) LANGUAGE plpgsql AS $$ DECLARE BEGIN

RETURN QUERY 
	WITH candlestick AS ( SELECT * FROM candlestick c WHERE c.symbol = _symbol AND c.time &gt; NOW() - _till )
	SELECT d.time, (GREATEST(a, b, c)) :: NUMERIC(9,8) as result, (GREATEST(a, b, c) / d.close) :: NUMERIC(9,8) as percent FROM ( 
		SELECT today.time, today.close, today.high - today.low as a,
      		COALESCE(ABS(today.high - yesterday.close), 0) b,
      		COALESCE(ABS(today.low - yesterday.close), 0) c FROM candlestick today
      	LEFT JOIN LATERAL ( 
			  SELECT yesterday.close FROM candlestick yesterday WHERE yesterday.time &lt; today.time ORDER BY yesterday.time DESC LIMIT 1 
		) yesterday ON TRUE
    WHERE today.time &gt; NOW() - _till) d;
END; $$;
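
-- tr() can also be called on its own; for example (a hypothetical sketch,
-- reusing the same arguments as the atr() example below), the per-candle
-- true range for the BNB/BTC market over the last 4 hours:
SELECT * FROM tr('BNBBTC', '4 HOURS') ORDER BY date;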

CREATE OR REPLACE FUNCTION atr(_interval INT, _symbol TEXT, _till INTERVAL)
	RETURNS TABLE(date TIMESTAMP WITHOUT TIME ZONE, result NUMERIC(9,8), percent NUMERIC(9,8)) LANGUAGE plpgsql AS $$ DECLARE BEGIN
	
RETURN QUERY
	WITH true_range AS ( SELECT * FROM tr(_symbol, _till) )
	SELECT tr.date, avg.sma result, avg.sma_percent percent FROM true_range tr
	INNER JOIN LATERAL ( SELECT avg(lat.result) sma, avg(lat.percent) sma_percent
		FROM (
			   SELECT * FROM true_range inr
			   WHERE inr.date &lt;= tr.date
			   ORDER BY inr.date DESC
			   LIMIT _interval
			 ) lat
		) avg ON TRUE
  WHERE tr.date &gt; NOW() - _till ORDER BY tr.date;
END; $$;

SELECT * FROM atr(14, 'BNBBTC', '4 HOURS') ORDER BY date
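
-- A hypothetical follow-up: surface the single most volatile bucket
-- in the window by sorting on the percentage true range instead
SELECT * FROM atr(14, 'BNBBTC', '4 HOURS') ORDER BY percent DESC LIMIT 1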
</code></pre><p></p><p><strong>With TimescaleDB, my query response time is in the milliseconds, even with this huge amount of data. </strong></p><p><em>Editor’s Note: To learn more about how TimescaleDB works with cryptocurrency and practice running your own analysis, <a href="https://docs.timescale.com/latest/tutorials/analyze-cryptocurrency-data/?utm_source=timescale-cryptobot-qa&amp;utm_medium=blog&amp;utm_campaign=aug-2020-advocacy&amp;utm_content=crypto-tutorial">check out our step-by-step tutorial</a>. We used these instructions to <a href="http://3.88.198.173/blog/analyzing-bitcoin-ethereum-and-4100-other-cryptocurrencies-using-postgresql-and-timescaledb/?utm_source=timescale-cryptobot-qa&amp;utm_medium=blog&amp;utm_campaign=aug-2020-advocacy&amp;utm_content=crypto-analysis-blog">analyze 4100+ cryptocurrencies, see historical trends, and answer questions</a>.</em></p><h2 id="current-deployment-future-plans"><strong>Current Deployment &amp; Future Plans</strong></h2><p>To develop my bot and all its capabilities, I used Node.js as my main programming language and various libraries: <a href="https://www.npmjs.com/package/cote">Cote</a> to communicate between all my modules without overengineering, <a href="https://www.npmjs.com/package/@tensorflow/tfjs">TensorFlow</a> to train and deploy all my machine learning models, and <a href="https://www.npmjs.com/package/tulind">tulind</a> for technical indicator calculation, as well as various others.</p><p>I modified some to meet my needs and created some from scratch, including a candlestick recognition pattern, a level calculator for support/resistance, and Fibonacci retracement.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="http://3.88.198.173/content/images/2020/07/new_diagram--2-.jpg" class="kg-image" alt="How I power a (successful) crypto trading bot with TimescaleDB" srcset="http://3.88.198.173/content/images/size/w600/2020/07/new_diagram--2-.jpg 600w, 
http://3.88.198.173/content/images/size/w1000/2020/07/new_diagram--2-.jpg 1000w, http://3.88.198.173/content/images/2020/07/new_diagram--2-.jpg 1600w" sizes="(min-width: 720px) 720px"><figcaption>Current TradingBot architecture + breakdown of various Node.js libraries</figcaption></figure><p>Today, I have a total of 55 markets (which are re-evaluated every month, based on trade simulation performance) that trade simultaneously 24/7; when all my strategy conditions are met, a trade is automatically opened. The bot respects my capital management system, which is basically to limit myself to 10 open positions and only use 10% of the available capital at a given time. To keep track of the results of an open trade, I use dynamic <a href="https://www.investopedia.com/terms/t/trailingstop.asp">Trailing Stop Loss</a> and <a href="https://walloftraders.com/faq/what-is-a-trailing-take-profit">Trailing Take Profit</a>.</p><p>The process of re-evaluating a market requires a second instance of my bot that runs in the background and uses my main strategy to simulate trades in all Bitcoin markets. When it detects that a market is doing well, based on the metrics I track, that market enters the main bot instance and starts live trading. The same applies to those that are performing poorly; as soon as the main instance of my bot detects things are going badly, the market is removed from the main instance and the second instance begins tracking it. If it improves, it's added back in. </p><p>As every developer likely knows all too well, building software is a process of constant improvement. 
Right now, I’m trying to improve my capital management system using the <a href="https://www.investopedia.com/articles/trading/04/091504.asp">Kelly Criterion</a>, as <a href="https://www.reddit.com/r/algotrading/comments/hpimu5/after_25_year_a_lot_of_coding_refactoring_testing/fxuddwf/">suggested by a user</a> in my Reddit post (thanks, btw :)).</p><h2 id="getting-started-advice-resources"><strong>Getting started advice &amp; resources</strong></h2><p>For my use case, I’ve found TimescaleDB is a powerful and solid choice: it’s fast with reliable ingest rates, efficiently stores and compresses a huge dataset in a way that’s manageable and cost-effective, and gives me real-time aggregation functionality. </p><p>The <a href="https://www.timescale.com/?utm_source=timescale-cryptobot-qa&amp;utm_medium=blog&amp;utm_campaign=aug-2020-advocacy&amp;utm_content=hp">Timescale website</a>, the <a href="https://docs.timescale.com/latest/using-timescaledb/?utm_source=timescale-cryptobot-qa&amp;utm_medium=blog&amp;utm_campaign=aug-2020-advocacy&amp;utm_content=using-timescaledb-docs">"using TimescaleDB" core documentation</a>, and <a href="http://3.88.198.173/blog/analyzing-bitcoin-ethereum-and-4100-other-cryptocurrencies-using-postgresql-and-timescaledb/?utm_source=timescale-cryptobot-qa&amp;utm_medium=blog&amp;utm_campaign=aug-2020-advocacy&amp;utm_content=crypto-analysis-blog">this blog post about managing and processing huge time-series datasets</a> are all pretty easy to understand and follow – and the TimescaleDB team is responsive and helpful (and they always show up in community discussions, like <a href="https://www.reddit.com/r/algotrading/comments/hpimu5/after_25_year_a_lot_of_coding_refactoring_testing/">mine on Reddit</a>).</p><p>It’s been easy and straightforward to scale, without adding any new technologies to the stack. 
And, as an SQL user, TimescaleDB adds very little maintenance overhead, especially compared to learning or maintaining a new database or language.</p><p></p><p><em>We’d like to thank Felipe for sharing his story, as well as for his work to evangelize the power of time-series data to developers everywhere. His success with this project is an amazing example of how we can use data to fuel real-world decisions – and we congratulate him on his success 🎉.</em></p><p><em>We’re always keen to feature new community projects and stories on our blog. If you have a story or project you’d like to share, reach out on Slack (</em><a href="https://timescaledb.slack.com/archives/DPFTYT9E0"><em>@lacey butler</em></a><em>), and we’ll go from there.</em></p><p><em>Additionally, if you’re looking for more ways to get involved and show your expertise, check out the <a href="https://www.timescale.com/timescale-heroes/?utm_source=timescale-cryptobot-qa&amp;utm_medium=blog&amp;utm_campaign=aug-2020-advocacy&amp;utm_content=heroes-program">Timescale Heroes</a> program.</em></p>]]></content:encoded></item><item><title><![CDATA[How to visualize timeshifts to compare metrics over time in Grafana using PostgreSQL]]></title><description><![CDATA[Learn how (and why) to combine PostgreSQL, TimescaleDB, and Grafana to visualize timeshifts and compare how your time-series metrics change over time.]]></description><link>http://3.88.198.173:80/grafana-postgres-timeshift/</link><guid isPermaLink="false">5fdb982d8946b00cbd801371</guid><category><![CDATA[Grafana]]></category><dc:creator><![CDATA[Avthar Sewrathan]]></dc:creator><pubDate>Thu, 06 Aug 2020 14:18:14 GMT</pubDate><media:content url="http://3.88.198.173/content/images/2020/08/Timeshift-heading.jpg" medium="image"/><content:encoded><![CDATA[<img src="http://3.88.198.173/content/images/2020/08/Timeshift-heading.jpg" alt="How to visualize timeshifts to compare metrics over time in Grafana using PostgreSQL"><p><em>Learn how (and why) to combine 
PostgreSQL, TimescaleDB, and Grafana to visualize timeshifts and compare how your metrics change over time.</em></p><h2 id="the-problem-comparing-metrics-over-time-aka-timeshifting-">The problem: Comparing metrics over time (aka timeshifting)</h2><p>When we’re doing real-time monitoring or historical analysis, we often want to visually compare the value of a metric NOW to the value X days, weeks, hours, or months ago (or, in other words, we want to compare its value at the current time to its value <em>timeshifted</em> one or more intervals of time ago).</p><p>This is known as a <em>timeshift: </em>comparing a metric against itself, but for a different time period.</p><p>This is especially common in DevOps, IoT, and user behavior analysis scenarios, where we want to understand if things like upticks or downticks are seasonal, or a result of something new – as well as a host of other questions that require us to analyze how certain metrics change over time.</p><p>For example, take the case of monitoring taxi rides. On any given day, we might ask things like: how does the ride activity today compare with activity over the last 3 days? Or, how does ride activity this Friday compare to Friday last week?  What about the week before? Or the same time last year? These questions about taxi rides could easily apply to our website uptime metrics, our CPU utilization, and so forth... </p><p>One (painful) way to answer these questions might be to create separate graphs for each time interval and manually compare them by eye. 
However, this isn’t very efficient, and manual comparison can be mentally taxing.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://lh3.googleusercontent.com/4A-qBOjuY9jwRfiSMmSXKz5wpX1m1ZTI7qwJPgKgJOW7VvT9nf97IsTd6f7Sw0vhHTfdZF7e7WL2Va57zpUipx1yexV1_nti8SW59ZSD_zvbRUN9MdqWyrxgyhnIZL6THrpc9PFr" class="kg-image" alt="How to visualize timeshifts to compare metrics over time in Grafana using PostgreSQL"><figcaption>Graphs comparing taxi rides taken in week 1 and week 2 of January 2016. Notice how difficult it is to compare ride activity between the two graphs.</figcaption></figure><h2 id="the-solution-use-postgresql-lateral-join">The solution: Use PostgreSQL LATERAL JOIN<br></h2><p>A better way would be to have all trend lines (both for current activity and timeshifted activity) on a single graph. However, in <a href="https://grafana.com/">Grafana</a>, this isn't always possible, depending on which datasource you use. For example, Grafana’s <a href="https://community.grafana.com/t/advanced-graphing-part2-visualize-timeshift/365">Graphite datasource supports timeshift natively</a>, but many others do not.</p><p>For the PostgreSQL datasource, timeshifting is possible, and the best way to create time-shifted graphs is to use PostgreSQL’s <code>LATERAL JOIN</code> feature.</p><p>Using <code>LATERAL JOIN</code>, we can create timeshifted graphs for monitoring and historical analysis like these:</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://lh3.googleusercontent.com/Y8bMFPqaejpDFO_A3yb_tUYJ2DooF1S0yMRSnwyPizgOfku-0Xp6t6_tzM7wo7Qx59nJk9x0mBdEirUYarkhDPXQe7WnbBZc8sxyjLIm1T3UbhWpLUP_J4QeUn-AZgEdRh9dxPe7" class="kg-image" alt="How to visualize timeshifts to compare metrics over time in Grafana using PostgreSQL"><figcaption>Timeshifted graph showing taxi rides for today (green) and last 3 days.</figcaption></figure><figure class="kg-card kg-image-card kg-card-hascaption"><img 
src="https://lh6.googleusercontent.com/vpBbIBzb17xk5Ihcmjwya9Xn2WKRiSGw0zOtFw0XPV55GDzVnXxeDEdCQdIa_AjIOJZKou0erDxVqJQJKWR3I6s9LxUf8iwxE5j3BmazmZifN_vL5wIW9LJLzeiQ7rMLDVt765fw" class="kg-image" alt="How to visualize timeshifts to compare metrics over time in Grafana using PostgreSQL"><figcaption>Timeshifted graph showing taxi rides for a given day (yellow line) and previous week (green line)</figcaption></figure><h2 id="try-it-yourself-implementation-in-grafana-sample-queries">Try It Yourself: Implementation in Grafana &amp; Sample Queries</h2><p>To help you get the hang of creating timeshifted graphs on a sample dataset before applying it to your own projects, I’ve put together this handy step-by-step guide.</p><h3 id="scenario">Scenario</h3><p>We’ll use the use case of monitoring IoT devices, specifically taxis equipped with location-detecting sensors. Our dataset comes from the <a href="https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page">New York Taxi and Limousine Commission</a> (NYC TLC) for the month of January 2016.</p><h3 id="prerequisites">Prerequisites</h3><ul><li><a href="https://www.timescale.com/products/?utm_source=timescale-grafana-timeshift&amp;utm_medium=blog&amp;utm_campaign=july-2020-advocacy&amp;utm_content=products-page">TimescaleDB instance</a> (Timescale Cloud or self-hosted), running PostgreSQL 11+</li><li><a href="https://grafana.com/">Grafana instance</a> (cloud or self-hosted)</li><li>TimescaleDB instance connected to Grafana (see <a href="https://docs.timescale.com/latest/getting-started/installation-grafana?utm_source=timescale-grafana-timeshift&amp;utm_medium=blog&amp;utm_campaign=july-2020-advocacy&amp;utm_content=docs-grafana-tutorial">our Grafana setup tutorial</a>)</li><li>To load the taxi dataset into TimescaleDB, complete <a href="https://docs.timescale.com/latest/tutorials/tutorial-hello-timescale?utm_source=timescale-grafana-timeshift&amp;utm_medium=blog&amp;utm_campaign=july-2020-advocacy&amp;utm_content=docs-hello-timescale"> 
Mission 1 in this tutorial</a>, which will take you through downloading the .CSV file and inserting the data into your database.</li></ul><h2 id="example-1-building-a-3-day-timeshift">Example 1: Building a 3 Day Timeshift<br></h2><p>Let’s say we wanted to answer: “<strong>how does taxi ride activity today compare with the activity from the previous 3 days?”</strong></p><p>Here’s the full query, with annotations, showing how to use the PostgreSQL <code>LATERAL JOIN</code> function to create a graph that displays the <em>current number</em> of rides, as well as <em>timeshifted rides from the previous 3 days</em>.</p><figure class="kg-card kg-code-card"><pre><code class="language-SQL">-- What to name the series
SELECT time, ride_count, CASE WHEN step = 0 THEN 'today' ELSE (-interval)::text END AS metric
FROM
-- sub-query to generate the intervals
    ( SELECT step, (step||'day')::interval AS interval FROM generate_series(0,3) g(step)) g_offsets
    JOIN LATERAL (
-- subquery to select the rides 
    SELECT
-- adding set interval to time values
      time_bucket('15m',pickup_datetime + interval)::timestamptz AS time, count(*) AS ride_count FROM rides
-- subtract value of interval from time to plot
-- today = 0, 1 day ago = 1, etc
    WHERE pickup_datetime BETWEEN $__timeFrom()::timestamptz - interval AND $__timeTo()::timestamptz - interval
    GROUP BY 1
    ORDER BY 1
    ) l ON true</code></pre><figcaption>Query to plot rides in 15 minute intervals, with timeshifts for the previous 3 days</figcaption></figure><p>This produces the following graph:</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://lh5.googleusercontent.com/dZ6_tELSHsQ4clE9Ugz8cv2ZMs38Ab4dzOw6wWX_rFRBvAvbVSE24TRNJ4pk6MGi223bjetDYO0nVGZTr36FQT1QicmixrkCBymffp56h8L211x7GcnHvbRu5irnD3di4VpDoqZS" class="kg-image" alt="How to visualize timeshifts to compare metrics over time in Grafana using PostgreSQL"><figcaption>Graph showing taxi rides taken in January 2016, timeshifted to compare rides today to prior three days. Today’s rides shown in green, -1 day in red, -2 days in blue, -3 days in yellow.</figcaption></figure><p>If we zoom into a 2 day time period (by selecting it using the timepicker or highlighting it in the graph), we can see how timeshifting allows us to compare ride activity, simply by hovering over the graph at any given time interval:</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://lh3.googleusercontent.com/fD9bDAARPCm0lQORF0iPnfQlpsA6Vgdz7JFKhwxd9f2y3Es3d0-O1hrTNBvbOsus4_oltrCdAGL41JU2t0hYTHEdFkDeiBaxQVDPKcsn7qKsfu6aVvMY0pzX84b-bu3iU0XU4wbb" class="kg-image" alt="How to visualize timeshifts to compare metrics over time in Grafana using PostgreSQL"><figcaption>Graph showing taxi rides taken in January 12 and January 13 2016, timeshifted to compare rides today to the prior three days, zoomed in to an arbitrary 2 day period. 
Today’s rides shown in green, -1 day in red, -2 days in blue, -3 days in yellow.</figcaption></figure><h3 id="how-the-query-works-">How the query works:</h3><p>In this query, the <code>LATERAL JOIN</code> functions like a “for each” loop, making each row of the sub-query <em>before</em> the <code>LATERAL JOIN</code> available to the sub-query which comes after it.</p><p>In this case, the query <strong>before</strong> the <code>LATERAL JOIN</code> generates the intervals we want to compare ride activity over. We generate the intervals of 0, 1, 2, and 3 days, since we want to compare ride behaviour on any given day to that of the previous 3 days:</p><figure class="kg-card kg-code-card"><pre><code class="language-SQL">-- sub-query to generate the intervals
( SELECT step, (step||'day')::interval AS interval 
FROM generate_series(0,3) g(step)) g_offsets</code></pre><figcaption>Query to generate intervals to compare ride activity over</figcaption></figure><p>In the query <strong>after</strong> the <code>LATERAL JOIN</code>, we plot the number of rides in our time period of interest in 15 minute time buckets. Notice how we use our <code>interval</code> value from the previous sub-query: by adding and subtracting the <code>interval</code> in the <code>time_bucket</code> function and in the <code>WHERE</code> clause to filter the time-range for the rides selected, we’re able to get the correct values for current and timeshifted intervals:<br></p><figure class="kg-card kg-code-card"><pre><code class="language-SQL">-- subquery to select the rides 
(SELECT
-- adding set interval to time values
  time_bucket('15m',pickup_datetime + interval)::timestamptz AS time, count(*) AS ride_count 
FROM rides
-- subtract value of interval from time to plot
-- today = 0, 1 day ago = 1, etc
WHERE pickup_datetime BETWEEN $__timeFrom()::timestamptz - interval 
AND $__timeTo()::timestamptz - interval
GROUP BY 1
ORDER BY 1)</code></pre><figcaption>Query to plot the number of rides in 15 minute time buckets for current and timeshifted periods</figcaption></figure><p>For more on <code>LATERAL JOIN</code>, see this useful <a href="https://heap.io/blog/engineering/postgresqls-powerful-new-join-type-lateral">tutorial</a> from the folks at Heap and the official PostgreSQL <a href="https://www.postgresql.org/docs/9.4/queries-table-expressions.html">docs</a>.</p><h3 id="visual-pro-tip-add-a-series-override">Visual Pro-tip: Add a series override</h3><p>To make it easier to distinguish between rides for any given day and rides from the previous 3 days, we can apply a series override to modify the appearance of the timeshifted lines:</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://lh5.googleusercontent.com/F_8-vgq2tCAoVBh-2QxpkLJZ3b2qXLOnO-VSy4S8BJSER4awoGO7YeV_DjbDwj9vsllcCOQFuge8v9yv56fM5pWoBotnqZPRoog0eHC8Lfyq_M6YG3zDVYEd0CldiFocDOFC4qys" class="kg-image" alt="How to visualize timeshifts to compare metrics over time in Grafana using PostgreSQL"><figcaption>Series override parameters to distinguish between real and timeshifted lines</figcaption></figure><p>Using the parameters above, we apply a series override that gives the timeshifted series a smaller line width than the real line, allowing us to distinguish between them more easily. 
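</p><p>(An aside: if you’d like to see the “for each” behavior of <code>LATERAL</code> in isolation, here’s a minimal, hypothetical query that runs against any PostgreSQL instance, with no taxi dataset required; the column names are illustrative only.)</p><figure class="kg-card kg-code-card"><pre><code class="language-SQL">-- For each row produced by generate_series, the lateral
-- sub-query runs once and can reference that row's step value
SELECT g.step, l.shifted
FROM generate_series(0,3) g(step)
JOIN LATERAL (
    SELECT now() - (g.step||' day')::interval AS shifted
) l ON true;</code></pre><figcaption>Standalone sketch of LATERAL: the sub-query executes once per generated row, returning one timestamp per offset</figcaption></figure><p>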
We can then adjust the look of the non-timeshifted line under the Display settings: in the image below, line width is set to 5 and line area to 2:</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://lh3.googleusercontent.com/Y8bMFPqaejpDFO_A3yb_tUYJ2DooF1S0yMRSnwyPizgOfku-0Xp6t6_tzM7wo7Qx59nJk9x0mBdEirUYarkhDPXQe7WnbBZc8sxyjLIm1T3UbhWpLUP_J4QeUn-AZgEdRh9dxPe7" class="kg-image" alt="How to visualize timeshifts to compare metrics over time in Grafana using PostgreSQL"><figcaption>Final graph for current rides and previous 3 day timeshift with visual treatment applied</figcaption></figure><h2 id="example-2-building-a-1-week-timeshifts">Example 2: Building a 1 Week Timeshift<br></h2><p>Next, we want to answer: <strong>“How does the activity this week compare to last week?”</strong></p><p>In this example, we create a graph to display the current number of rides, as well as a timeshifted line to graph the rides from the previous week. </p><p>Much of the query is the same as in Example 1; the only differences are (1) the interval definition changes from <code>day</code> to <code>week</code> and (2) the series we generate only has two values, 0 and 1, since we only want to compare to the previous week (vs. the 3 day period in the prior example).</p><figure class="kg-card kg-code-card"><pre><code class="language-SQL">SELECT time, ride_count, 
	CASE WHEN step = 0 THEN 'today' ELSE (-interval)::text END AS metric
FROM
    ( SELECT step, (step||'week')::interval AS interval FROM generate_series(0,1) g(step)) g_offsets
JOIN LATERAL (
    SELECT
      time_bucket('15m',pickup_datetime + interval)::timestamptz AS time, count(*) AS ride_count FROM rides
    WHERE
      pickup_datetime BETWEEN $__timeFrom()::timestamptz - interval AND $__timeTo()::timestamptz - interval
    GROUP BY 1
    ORDER BY 1
) l ON true</code></pre><figcaption>Query to plot current rides and 1 week timeshifted rides</figcaption></figure><p>This produces the following graph:</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://lh5.googleusercontent.com/mqvZrAUfOJOh2Ha0yQ2UruifXbozX8-8UbX6iKqkgTxvEf9338HHGR64bZ80ED440Xr6ZkkiMgqodhvqoFfrVj8dndLR_JVEzd1xdgCTaZaGe7XlpGeBKeE-k9iBSQs7T1CK5n-7" class="kg-image" alt="How to visualize timeshifts to compare metrics over time in Grafana using PostgreSQL"><figcaption>Graph showing taxi rides taken in January 2016, time-shifted to compare rides today to the previous week, zoomed in to an arbitrary 5 day period. Rides for a given day are shown in green and rides from that day in the previous week are shown in yellow</figcaption></figure><h3 id="visual-pro-tip-add-a-series-override-1">Visual Pro-tip: Add a series override</h3><p>To make it easier to distinguish between rides for any given day and rides from the previous week, we can apply a series override that modifies the appearance of the timeshifted lines:</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://lh6.googleusercontent.com/vpBbIBzb17xk5Ihcmjwya9Xn2WKRiSGw0zOtFw0XPV55GDzVnXxeDEdCQdIa_AjIOJZKou0erDxVqJQJKWR3I6s9LxUf8iwxE5j3BmazmZifN_vL5wIW9LJLzeiQ7rMLDVt765fw" class="kg-image" alt="How to visualize timeshifts to compare metrics over time in Grafana using PostgreSQL"><figcaption>Graph showing current rides and 1 week timeshifted rides with series override applied to visually distinguish between timeshifted and non-timeshifted lines</figcaption></figure><p>To achieve this, we set the line area to 0 (under Display settings), and then apply a series override to the timeshifted series. 
</p><p>In the series override settings, we:</p><ul><li>Set the line fill to 2, giving us a shadow look</li><li>Set the line width to 0, leaving the <em>non</em>-timeshifted graph as the only series with a solid line, making it more distinguishable.</li></ul><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://lh3.googleusercontent.com/PwYFwf195GZnZd6s6aorBB0JQ0gvynu9W7A8itnUrgQbwLObpFgzZO7vQxUMVmJGK4_vUaue6HFvl1pVKTua2yk03pSoYtN5T46_eWSt9cAb5Qp9JelBqqrEu5jLyXRk2PydY9Fx" class="kg-image" alt="How to visualize timeshifts to compare metrics over time in Grafana using PostgreSQL"><figcaption>Series override settings for making time-shifted lines more distinguishable</figcaption></figure><h2 id="next-steps">Next Steps</h2><p>In this tutorial, we covered what timeshifting is, how it works, and how to use PostgreSQL <code>LATERAL JOIN</code>, TimescaleDB, and Grafana to visualize timeshifts to easily compare data across two (or more!) time periods.</p><p>⏰ To modify the query to timeshift any arbitrary number of minutes, hours, days, months, or years, change the parameters of <code>generate_series</code> and the <code>interval</code> definition (while it’s most common to compare metrics NOW to previous periods, you can use time-shifting to compare ANY two time periods).</p><p>Happy timeshifting!</p><h3 id="learn-more">Learn More</h3><p>Want more Grafana tips? 
Explore our <a href="https://docs.timescale.com/latest/tutorials/tutorial-grafana?utm_source=timescale-grafana-timeshift&amp;utm_medium=blog&amp;utm_campaign=july-2020-advocacy&amp;utm_content=docs-grafana-tutorials">Grafana tutorials</a> (I recommend this one on <a href="https://docs.timescale.com/latest/tutorials/tutorial-grafana-variables?utm_source=timescale-grafana-timeshift&amp;utm_medium=blog&amp;utm_campaign=july-2020-advocacy&amp;utm_content=docs-grafana-variables">variables</a> and this one on <a href="https://docs.timescale.com/latest/tutorials/tutorial-howto-visualize-missing-data-grafana?utm_source=timescale-grafana-timeshift&amp;utm_medium=blog&amp;utm_campaign=july-2020-advocacy&amp;utm_content=docs-grafana-missing-data">visualizing missing data</a>).</p><p>Need a database to power your dashboarding and data analysis? <strong><a href="https://www.timescale.com/cloud-signup/?utm_source=timescale-grafana-timeshift&amp;utm_medium=blog&amp;utm_campaign=july-2020-advocacy&amp;utm_content=cloud-signup">Get started with Timescale Cloud</a></strong> (it’s our fast, easy-to-use, and reliable time-series database built on PostgreSQL, available in 75+ cloud regions). 
When you sign up, you’ll get $300 in credits to get you up and running.</p>]]></content:encoded></item><item><title><![CDATA[Timescale Newsletter Roundup: July Edition]]></title><description><![CDATA[Get a quick breakdown of all the latest product updates, technical content, #webinarwednesday sessions, and TimescaleDB tips to help you take your time-series data skills to the next level 🔥.]]></description><link>http://3.88.198.173:80/timescale-newsletter-roundup-july-edition/</link><guid isPermaLink="false">5fdb982d8946b00cbd80136e</guid><category><![CDATA[Newsletter]]></category><dc:creator><![CDATA[Lacey Butler]]></dc:creator><pubDate>Mon, 03 Aug 2020 22:10:43 GMT</pubDate><media:content url="http://3.88.198.173/content/images/2020/07/elizabeth-kay-9szCcOw4BWo-unsplash.jpg" medium="image"/><content:encoded><![CDATA[<img src="http://3.88.198.173/content/images/2020/07/elizabeth-kay-9szCcOw4BWo-unsplash.jpg" alt="Timescale Newsletter Roundup: July Edition"><p>Get a quick breakdown of all the latest product updates, technical content, #webinarwednesday sessions, and TimescaleDB tips to help you take your time-series data skills to the next level <strong>🔥</strong>.</p><p>We’re always releasing new features, creating new documentation and tutorials, and hosting virtual sessions to help developers do amazing things with their data. 
And, to make it easy for our community members to discover and get the resources they need to power their projects, teams, or business with analytics, we round up our favorite new pieces in our biweekly newsletter.</p><p>We’re on a mission to teach the world about time-series data, supporting and growing communities around the world.</p><p>And, sharing educational resources as broadly as possible is one way to do just that :).</p><p><br><strong>Here’s a snapshot of the content we shared with our readers this month (<a href="https://www.timescale.com/signup/newsletter/?utm_source=timescale-july-newsletter-roundup&amp;utm_medium=blog&amp;utm_campaign=aug-2020-advocacy&amp;utm_content=subscribe-newsletter">subscribe</a> to get updates straight to your inbox)</strong></p><h2 id="new-technical-content-videos-tutorials"><strong>New technical content, videos &amp; tutorials</strong></h2><p><strong>[Session Replay]: </strong><a href="https://youtu.be/bnxSRyF2fnc"><strong>Guide to Grafana Part IV: Advanced Topics &amp; Time-Series Pro Tips </strong></a><strong>&gt;&gt;</strong></p><p>Catch our latest Grafana session to learn how to solve common Grafana issues, get answers to frequently asked questions, and continue on your journey to dashboarding mastery. 
You’ll get step-by-step demos and pro tips for time-shifting, using Postgres to efficiently drill into your metrics, and more.</p><ul><li>💻 <a href="https://github.com/timescale/examples/blob/master/grafana-guide/advanced-tips/webinar-demo-queries.sql">Get the demo queries on GitHub</a> to recreate demos and customize for your projects.</li><li>🥇 <a href="https://docs.timescale.com/latest/tutorials/tutorial-grafana/?utm_source=timescale-july-newsletter-roundup&amp;utm_medium=blog&amp;utm_campaign=aug-2020-advocacy&amp;utm_content=grafana-tutorial-hp">Explore all of our Grafana tutorials</a>.</li><li>Have questions? We’d love to hear from you - <a href="http://slack.timescale.com">reach out on Slack</a>.</li></ul><p><strong>[Live-coding session]: <a href="https://www.timescale.com/webinar/5-ways-to-improve-yourpostgresql-insert-performance/?utm_source=timescale-july-newsletter-roundup&amp;utm_medium=blog&amp;utm_campaign=aug-2020-advocacy&amp;utm_content=5-insert-tips-webinar-rsvp">5 Ways to Improve Your Postgres INSERT Performance</a> &gt;&gt;</strong></p><p>What’s even better than using Postgres? Learning ways to make it faster. Join <a href="https://twitter.com/avthars">@avthars</a> on Wed, August 19 as he breaks down 5 techniques to improve your database ingest speed (plus best practices, pro tips, and technical resources).</p><ul><li><strong>🗓 <a href="https://www.timescale.com/webinar/5-ways-to-improve-yourpostgresql-insert-performance/?utm_source=timescale-july-newsletter-roundup&amp;utm_medium=blog&amp;utm_campaign=aug-2020-advocacy&amp;utm_content=5-insert-tips-webinar-rsvp">RSVP now</a></strong></li><li>Can’t attend live? 
Register anyway &amp; we’ll send you the recording + demos within 48 hours.</li></ul><p><strong>[NEW]: <a href="http://3.88.198.173/blog/13-tips-to-improve-postgresql-insert-performance/?utm_source=timescale-july-newsletter-roundup&amp;utm_medium=blog&amp;utm_campaign=aug-2020-advocacy&amp;utm_content=13-insert-tips-blog">13 tips to improve your INSERT performance using PostgreSQL</a> &gt;&gt;</strong></p><p>We have a lot of experience optimizing ingest performance for ourselves and community members, and, in this post, Mike shares a “cheatsheet” full of his favorite best practices and tips (includes advice for vanilla Postgres, plus a few TimescaleDB-specific ones).</p><ul><li>Have questions or want to add a tip to our list? We’d love to hear from you - <a href="http://slack.timescale.com">reach out on Slack</a>.</li><li>...and join our "5 Ways to Improve Your Postgres INSERT Performance" webinar (details above) to see a few of them in action.</li></ul><h2 id="new-remote-friendly-events-community"><strong>New #remote-friendly events &amp; community</strong></h2><p><br><strong>[Meet Timescale CTO]: <a href="https://www.timescale.com/office-hours/?utm_source=timescale-july-newsletter-roundup&amp;utm_medium=blog&amp;utm_campaign=aug-2020-advocacy&amp;utm_content=office-hours-rsvp">Join us for Office Hours on Tues, August 4</a> &gt;&gt;</strong></p><p>If you have questions about database best practices, upcoming TimescaleDB releases, or want to know more about distributed computing topics, Office Hours - hosted by Timescale CTO Mike Freedman - is for you.</p><ul><li>👉 <a href="https://www.timescale.com/office-hours/?utm_source=timescale-july-newsletter-roundup&amp;utm_medium=blog&amp;utm_campaign=aug-2020-advocacy&amp;utm_content=office-hours-rsvp">Reserve your spot on Tuesday, Aug 4th</a> (space is limited).</li><li>💬 If you can’t join, but have a question, <a href="https://slack.timescale.com/">reach out to our engineering team on Slack</a>.</li></ul><figure 
class="kg-card kg-image-card kg-card-hascaption"><img src="http://3.88.198.173/content/images/2020/07/WsprDaemon-database-diagram.jpg" class="kg-image" alt="Timescale Newsletter Roundup: July Edition"><figcaption>Simplified WsprDaemon architecture, showing data routes from <a href="http://wsprnet.org/drupal/">wsprnet</a> and 3rd party interfaces</figcaption></figure><p><strong>[Community Spotlight #1] <a href="http://3.88.198.173/blog/wsprdaemon-combines-timescaledb-grafana-analyze-radio/?utm_source=timescale-july-newsletter-roundup&amp;utm_medium=blog&amp;utm_campaign=aug-2020-advocacy&amp;utm_content=wsprdaemon-dev-qa-blog">How WsprDaemon combines TimescaleDB and Grafana to analyze radio transmissions</a> &gt;&gt;</strong></p><p>Time-series data is everywhere – and, in this Developer Q &amp; A, our friends at WsprDaemon share how they use TimescaleDB, Grafana, and various datasets to analyze radio transmissions, spot and compare trends in space weather, and more. (plus why they chose TimescaleDB over Influx 🙌).</p><ul><li>📡 <a href="http://wsprdaemon.org/">Check out the WsprDaemon project, quickstarts, and more</a>.</li><li>📣 Have a story to share? <a href="https://slack.timescale.com/">Reach out on Slack</a> or <a href="https://twitter.com/TimescaleDB">@TimescaleDB</a> and we’ll make it happen.</li></ul><p><strong>[Community Spotlight #2]: <a href="https://www.timescale.com/case-studies/transferwise/?utm_source=timescale-july-newsletter-roundup&amp;utm_medium=blog&amp;utm_campaign=aug-2020-advocacy&amp;utm_content=transferwise-story">TransferWise</a> &gt;&gt;</strong></p><p>We have amazing customers all over the world, including our friends <a href="https://twitter.com/TransferWise">@TransferWise</a>, the fintech startup that 8 million+ customers trust to move billions of dollars per day. 
Learn how (and why) they selected TimescaleDB as their scalable time-series database of choice.</p><p><strong>[ICYMI]:</strong><a href="https://blog.metabrainz.org/2020/07/22/listenbrainz-moves-to-timescaledb/"><strong> ListenBrainz moves to TimescaleDB</strong></a><strong> &gt;&gt;</strong></p><p>We always love to hear about how and why community members use TimescaleDB – and, ListenBrainz’ migration to TimescaleDB made quite the splash on <a href="https://news.ycombinator.com/item?id=23915509">Hacker News</a>.</p><ul><li>🙏 to the team for sharing your story &amp; welcome to the community!</li></ul><h2 id="product-updates-reading-list-etc-">Product Updates, Reading List &amp; Etc.</h2><p><strong>[Product Development]:<a href="https://console.forge.timescale.com/signup?utm_source=timescale-july-newsletter-roundup&amp;utm_medium=blog&amp;utm_campaign=aug-2020-advocacy&amp;utm_content=forge-signup"> Timescale Forge is now in Public Preview </a> &gt;&gt;</strong></p><p>We’re excited to unveil the Public Preview of Timescale Forge, the easiest way to get TimescaleDB. 
Timescale Forge is a fully-managed cloud service and includes features to independently scale compute and storage, pause/resume instances, use native integrations, and (much) more coming soon.</p><ul><li><a href="https://console.forge.timescale.com/signup?utm_source=timescale-july-newsletter-roundup&amp;utm_medium=blog&amp;utm_campaign=aug-2020-advocacy&amp;utm_content=forge-signup">Sign up to try it out</a> (💯free for 30 days) - we’d love your feedback 🙏.</li><li><a href="https://www.timescale.com/signup/release-notes/?utm_source=timescale-july-newsletter-roundup&amp;utm_medium=blog&amp;utm_campaign=aug-2020-advocacy&amp;utm_content=release-notes-subscribe">Subscribe to our Release Notes</a> to get new releases and updates straight to your inbox.</li><li>We're proud to now give you the choice of <strong>two</strong> fully-managed relational database services for time-series data 🎊.</li><li>If you need Microsoft Azure or Google Cloud support or advanced enterprise features (VPC peering, SOC2, HIPAA compliance), <strong><a href="https://www.timescale.com/cloud-signup/?utm_source=timescale-july-newsletter-roundup&amp;utm_medium=blog&amp;utm_campaign=aug-2020-advocacy&amp;utm_content=cloud-sign-up">continue to use Timescale Cloud</a></strong>.</li></ul><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://hs-3409477.f.hubspotemail.net/hub/3409477/hubfs/Timescale_Forge_July.png?upscale=true&amp;width=1120&amp;upscale=true&amp;name=Timescale_Forge_July.png" class="kg-image" alt="Timescale Newsletter Roundup: July Edition"><figcaption>Quick peek at the Timescale Forge UI</figcaption></figure><p><strong>[Product Updates]: </strong><a href="https://github.com/timescale/timescaledb/releases"><strong>TimescaleDB 1.7.2 Release now available </strong></a><strong> &gt;&gt;</strong></p><p>We just released TimescaleDB 1.7.2, which includes medium-priority updates to continuous aggregates, downsampling, and compression and support for fast pruning of inlined 
functions.</p><ul><li><a href="https://github.com/timescale/timescaledb/releases">Get the code and full changelog on GitHub </a>(if you haven’t already, consider giving us a ⭐ while you’re there 🙏).</li><li><a href="https://www.timescale.com/signup/release-notes/?utm_source=timescale-july-newsletter-roundup&amp;utm_medium=blog&amp;utm_campaign=aug-2020-advocacy&amp;utm_content=release-notes-subscribe">Subscribe to our Release Notes</a> to get new releases delivered straight to your inbox.</li></ul><p><strong>[TimescaleDB Tip #1]: <a href="http://3.88.198.173/blog/how-to-proactively-manage-long-term-data-storage-with-downsampling/?utm_source=timescale-july-newsletter-roundup&amp;utm_medium=blog&amp;utm_campaign=aug-2020-advocacy&amp;utm_content=downsampling-blog">Use downsampling to roll up and retain the data you need</a> &gt;&gt;</strong></p><p>Combine continuous aggregates with data retention policies to retain the data you need &amp; delete fine-grained data when it’s no longer necessary. Result: you save $ on long-term data storage costs, without sacrificing the ability to analyze your metrics.</p><p><strong>[Timescale Tip #2]:</strong> <strong><a href="https://docs.timescale.com/latest/tutorials/?utm_source=timescale-july-newsletter-roundup&amp;utm_medium=blog&amp;utm_campaign=aug-2020-advocacy&amp;utm_content=tutorial-hp">Check out TimescaleDB Quick Starts</a> (Python, Ruby &amp; Node.js) to jumpstart your time-series analysis &gt;&gt;</strong></p><p>We’ve created a few step-by-step guides to get you up and running with TimescaleDB. 
Each one takes you through connecting your app, creating tables, inserting rows, and executing your first time-series analysis query.</p><ul><li>💎 <a href="https://docs.timescale.com/latest/tutorials/quickstart-ruby/?utm_source=timescale-july-newsletter-roundup&amp;utm_medium=blog&amp;utm_campaign=aug-2020-advocacy&amp;utm_content=ruby-quickstart">Go to Ruby on Rails</a></li><li>🐍 <a href="https://docs.timescale.com/latest/tutorials/quickstart-python/?utm_source=timescale-july-newsletter-roundup&amp;utm_medium=blog&amp;utm_campaign=aug-2020-advocacy&amp;utm_content=python-quickstart">Go to Python</a></li><li>💚 <a href="https://docs.timescale.com/latest/tutorials/quickstart-node/?utm_source=timescale-july-newsletter-roundup&amp;utm_medium=blog&amp;utm_campaign=aug-2020-advocacy&amp;utm_content=node-quickstart">Go to Node.js</a><br></li></ul><p><strong>[Reading List]: <a href="http://3.88.198.173/blog/grafana-series-override/?utm_source=timescale-july-newsletter-roundup&amp;utm_medium=blog&amp;utm_campaign=aug-2020-advocacy&amp;utm_content=grafana-series-override-blog">How to build more accurate Grafana trend lines: plot two variables with <code>series-override</code></a> &gt;&gt;</strong></p><p>Trying to plot multiple metrics on one Grafana graph, but running into issues? Learn how to use <code>series-override</code> to solve the problem of distorted scale with this handy how-to post.</p><p><strong>[Reading List]</strong>: <strong><a href="http://3.88.198.173/blog/build-an-application-monitoring-stack-with-timescaledb-telegraf-grafana/?utm_source=timescale-july-newsletter-roundup&amp;utm_medium=blog&amp;utm_campaign=aug-2020-advocacy&amp;utm_content=apm-monitoring-grafana-telegraf-blog">Build an application monitoring stack with TimescaleDB, Telegraf &amp; Grafana</a> &gt;&gt;</strong></p><p>APM doesn't have to be rigid and expensive, and, in this post, we show you how to build a monitoring stack using open-source tools (Telegraf, Grafana, and TimescaleDB). 
You’ll get step-by-step guidance for instrumenting data collection, storing metrics, and visualizing them for real-time and historical analysis.</p><p><strong>[Reading List]: <a href="http://3.88.198.173/blog/use-composite-indexes-to-speed-up-time-series-queries-sql-8ca2df6b3aaa/?utm_source=timescale-july-newsletter-roundup&amp;utm_medium=blog&amp;utm_campaign=aug-2020-advocacy&amp;utm_content=composite-indexes-blog">How to use composite indexes to speed up time-series queries</a> &gt;&gt;</strong></p><p>Fact: the right index can reduce your query time by two or three orders of magnitude. In this old-but-good post from our Product team, you’ll learn how to build the right indexes for time-series data (3+ examples to help you get started!).</p><p><strong>[Reading List]:</strong> <strong><a href="http://3.88.198.173/blog/tip-tuesday-february-2020/?utm_source=timescale-july-newsletter-roundup&amp;utm_medium=blog&amp;utm_campaign=aug-2020-advocacy&amp;utm_content=tip-tuesday-blog">Check out 4 ways to use SQL for time-series analysis</a> &gt;&gt;</strong></p><p>From schema design to SQL tool recommendations, we’ve rounded up quick and handy ways to use SQL to work with and get more from your time-series data.</p><p><strong>[Team Timescale]: </strong><a href="http://shop.timescale.com"><strong>Announcing Timescale Shop</strong></a><strong> &gt;&gt;</strong></p><p>Over the years, community members continually asked when we’d launch a “swag store” – and the day has arrived!</p><ul><li>🐯 <a href="http://shop.timescale.com">Check out universal favorites and a few new limited edition items,</a> ready to ship directly to you.</li><li>📦 Due to COVID-19 precautions, shipments from our supplier will be slightly delayed (but hopefully your items are worth it :)).</li></ul><p><strong>[Timescale Team Fun]: </strong>Timescale People Manager Mel continues to introduce amazing team-building activities to appeal to all sensibilities, as 
well as deliver on old favorites, like Munch 'N Learns and Book Club. </p><p>Here's a quick peek at a few of our latest remote team bonding ideas for inspiration 💡.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://lh4.googleusercontent.com/JtVU0bheKb-Q36O7PDex_fyPyVDEaNS-wiqKD9Fpx05H9PdPhsO5Qyeg9qqL4lYsBPnC0jaCjmMsAHRyAlblMa_ZTGfFhA5J8f5SYw4U1Om51Go7QYcpgpFZzDw6D_ZGMqjD0O6x" class="kg-image" alt="Timescale Newsletter Roundup: July Edition"><figcaption>To learn more about how we organize Munch 'N Learns, see past topics, and get other virtual team-building ideas, <a href="http://3.88.198.173/blog/how-were-building-a-remote-first-team-culture-aka-virtual-event-ideas-that-youre-welcome-to-steal/?utm_source=timescale-july-newsletter-roundup&amp;utm_medium=blog&amp;utm_campaign=aug-2020-advocacy&amp;utm_content=team-building-tips-blog">check out Mel's Virtual Event Guide</a>.</figcaption></figure><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://lh4.googleusercontent.com/3Nefqi12dLsgGSIfQCGCDIz7pqot2rwfkzCq-UmDJyiKZFRuPxgPqepilahT8HYtYvttvsLUqv0jJKqq7l6qywyh80L3C8oaVRgnnkkTPU6-AmJmUYH9x0vb7cxKHmgDJawbZwTN" class="kg-image" alt="Timescale Newsletter Roundup: July Edition"><figcaption>📚 Timescale Book Club kicked off our monthly poll to solicit new read suggestions… and the results are in! 
If you’d like to follow along, check out “<a href="http://momtestbook.com/">The Mom Test</a>.”</figcaption></figure><figure class="kg-card kg-image-card kg-card-hascaption"><img src="http://3.88.198.173/content/images/2020/08/Screen-Shot-2020-08-03-at-5.37.33-PM.png" class="kg-image" alt="Timescale Newsletter Roundup: July Edition" srcset="http://3.88.198.173/content/images/size/w600/2020/08/Screen-Shot-2020-08-03-at-5.37.33-PM.png 600w, http://3.88.198.173/content/images/size/w1000/2020/08/Screen-Shot-2020-08-03-at-5.37.33-PM.png 1000w, http://3.88.198.173/content/images/2020/08/Screen-Shot-2020-08-03-at-5.37.33-PM.png 1130w" sizes="(min-width: 720px) 720px"><figcaption>For those looking to step up their cooking game, Team Timescale uses our #social-cooking Slack channel for recipe inspiration, ideas, and questions. Pictured here: Mike's world-famous homemade pizza and berry crumble.</figcaption></figure><h2 id="wrapping-up">Wrapping Up</h2><p>And, that concludes this month’s newsletter roundup. We’ll continue to release new content, events, and more - posting monthly updates for everyone.</p><p>If you’d like to get updates as soon as they’re available, <strong><a href="https://www.timescale.com/signup/newsletter/?utm_source=timescale-july-newsletter-roundup&amp;utm_medium=blog&amp;utm_campaign=aug-2020-advocacy&amp;utm_content=subscribe-newsletter">subscribe to our newsletter</a></strong> (2x monthly emails, prepared with 💛 and no fluff or jargon, promise).</p><p>Happy building!</p>]]></content:encoded></item></channel></rss>