TimescaleDB 2.0: A multi-node, petabyte-scale, completely free relational database for time-series
After two years of dedicated engineering and user feedback, TimescaleDB 2.0 is finally here, setting a new bar for time-series databases – and it’s completely free.
Time-series data is everywhere. Whether you are monitoring your software stack, users, manufacturing line, home, vehicle, stock and cryptocurrency portfolio, air quality in your house, or just your health in the middle of a pandemic, you are collecting time-series data. As software continues to relentlessly penetrate our lives and businesses, time-series data is becoming even more ubiquitous and mission-critical.
At the same time, relational databases, that old stalwart, are making a comeback as the database of choice for software applications. Despite years of NoSQL hype, the top 4 databases in use today are all relational databases. In addition, PostgreSQL is the fastest growing database over the last year (yes, growing faster than even MongoDB).
What developers need is a new kind of database, purpose-built for time-series workloads while fully embracing the relational model. After all, your time-series data doesn’t exist in a vacuum. Being able to correlate it with technical metadata, business data, and outcomes is critical to understanding how your software, systems, operations, and business change over time.
Building that database has always been our mission: to help developers store and analyze time-series data in a fast, reliable, and cost-effective way, so that they can focus on their core application and delight their users.
Since launching 3.5 years ago, TimescaleDB has proven itself as the leading relational database for time-series data, engineered on top of PostgreSQL, and offered via free software or as a fully-managed service on AWS, Azure, and GCP.
In that time, the TimescaleDB community has become the largest developer community for time-series data: tens of millions of downloads; over 500,000 active databases; organizations like AppDynamics, Bosch, Cisco, Comcast, Credit Suisse, DigitalOcean, Dow Chemical, Electronic Arts, Fujitsu, IBM, Microsoft, Rackspace, Schneider Electric, Samsung, Siemens, Uber, Walmart, Warner Music, WebEx, and thousands of others; all in addition to the PostgreSQL community and ecosystem.
Today, with TimescaleDB 2.0, we are marking a major milestone in our journey.
With this 2.0 release, TimescaleDB is now a distributed, multi-node, petabyte-scale relational database for time-series. And, we are making everything in this release completely free. This is the culmination of two years of dedicated engineering effort, as well as significant user feedback on several previous betas.
In fact, users have already been running multi-node TimescaleDB in continuous daily use for many months, including a 22-node cluster by a Fortune 100 tech company ingesting more than a billion rows per day:
- "We continuously ingest telemetry events into TimescaleDB 2.0 to monitor and analyze huge numbers of sessions. We've been running TimescaleDB multi-node across 22 servers for almost the past year, ingesting more than a billion rows of data per day. TimescaleDB's performance, scale, relational and SQL capabilities, and ability to handle complex data have been a real winner." – Rahul, Technical Leader at Fortune 100 tech company
- “Netskope prides itself on speed and scalability, and we rely heavily on time-series data to plan, monitor, and troubleshoot our global network of thousands of servers. With TimescaleDB, we tap into the ubiquitous PostgreSQL ecosystem and use TimescaleDB's continuous aggregates and other built-in time-series functions for real-time analytics and advanced historical analysis. Now with multi-node TimescaleDB, we get the horizontal scalability and rapid ingest throughput we need to monitor and manage our systems at scale, now and in the future.” – Mark S. Reibert, Ph.D., Systems Architect at Netskope, Inc.
TimescaleDB 2.0 also includes:
- Updated, more permissive licensing: making all of our enterprise features free and granting more rights to users.
- Substantial improvements to Continuous Aggregates: improving APIs and giving users greater control over the process.
- User-Defined Actions (new feature!): users can now define custom behaviors inside the database and schedule them using our job scheduling system.
- New and improved informational views: including over hypertables, chunks, policies, and job scheduling.
The TimescaleDB 2.0 Release Candidate is available immediately for self-managed software installations, with General Availability expected in late 2020. TimescaleDB 2.0 will be available on our hosted time-series services at that time. If you’re already using TimescaleDB, we’ve created detailed documentation to simplify and speed up your migration.
- Download TimescaleDB 2.0 to get started right away.
- Read the release overview guide (including changes in this release)
- Read the upgrade documentation (for existing software users migrating from TimescaleDB 1.x).
We also encourage you to join our 5,000+ member Slack community for any questions, to learn more, and to meet like-minded developers – we’re active in all channels and here to help.
(While Ajay and Mike are listed as authors of this post, full credit and a big round of applause go to members of the Timescale database team for their hours, weeks, and months of dedication and commitment to shipping high-quality code: Erik Nordström, Gayathri Ayyappan, Ruslan Fomkin, Mats Kindahl, Sven Klemm, Brian Rowe, and Dmitry Simonenko.)
We’d like to give a massive thank you to all of our beta testers; from reporting issues to sharing feedback and suggesting features, you all played a big role in making TimescaleDB 2.0 the best possible experience for developers.
To learn more about TimescaleDB 2.0, time-series data, and why we believe relational databases are the past and future of software development, please read on.
Relational databases are dead. Long live relational databases.
For about 30 years, from the mid-1970s to the mid-2000s, if you were developing software, you used a relational database. From System R (1974) to Oracle (1979), SQL Server (1989), and later open-source options like MySQL (1995) and PostgreSQL (1996), relational databases were the standard for any new application.
About 15 years ago, this all changed. Non-relational databases, sometimes also called “NoSQL” databases, became fashionable. A lot of this usage was legitimately necessary. New Internet giants built new systems to handle data volumes that were previously unfathomable, e.g., Google with MapReduce (2004) and Bigtable (2006); Amazon with Dynamo (2007). But a lot of NoSQL adoption was a knee-jerk reaction, along the lines of, “relational databases don’t scale, so I need a NoSQL database.”
Yet most companies are not Google or Amazon. And it turns out the ability to store data in a way that preserves the relationships in your dataset is valuable. After decades of usage in production, most relational databases are battle-hardened and typically more reliable than their NoSQL cousins. SQL has also re-emerged as the universal language for data analysis, and is the third most widely used language today (after JavaScript and HTML/CSS).
Today, the top 4 databases in use are still all relational databases. In particular, PostgreSQL is the fastest growing database over the last year (yes, growing faster than even MongoDB). Some of this is from developers switching back; some from developers who never left relational databases. So don’t call it a comeback - relational databases have been here for years (h/t James Todd Smith).
Most importantly, relational databases can, in fact, scale. We see this in the more recent wave of “NewSQL” databases. Google again led the way almost a decade ago, with a geo-replicated relational database announced in their first Spanner paper (2012) (whose authors include the original MapReduce authors), followed by other pioneers like CockroachDB (2014) and Yugabyte (2016). And with TimescaleDB (2017), we have built a relational database that scales for time-series data.
What is time-series data?
Simply put, time-series is the measurement of something across time. But, to dig a little deeper, time-series data is the measurement of how something changes.
Here is a simple example:
If I send you $10, then a traditional bank database would atomically debit my account and credit your account. Then, if you send me $10, the same process happens in reverse.
At the end of this process, our bank balances would look the same, so the bank might think, “Oh, nothing happened.” And that’s what a traditional database would show you.
But, with a time-series database, the bank could see, “Hey, these two people keep sending each other $10 - maybe they’re friends, maybe they’re roommates, maybe there’s something else going on.” That level of granularity, the measurement of how something changes, is what time-series enables.
In other words, time-series datasets track changes to the overall system as INSERTs, not UPDATEs, to capture more information about what is happening.
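To make the bank example concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table names, account names, and timestamps are made up for illustration; this is not how any real bank (or TimescaleDB) models transfers. It only shows that the UPDATE-based snapshot ends up unchanged while the INSERT-based history keeps both events.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Snapshot table: one row per account, overwritten in place.
    CREATE TABLE balances (account TEXT PRIMARY KEY, amount INTEGER);
    -- Time-series table: one row per transfer, append-only.
    CREATE TABLE transfers (ts TEXT, sender TEXT, receiver TEXT, amount INTEGER);
    INSERT INTO balances VALUES ('alice', 100), ('bob', 100);
""")

def transfer(ts, sender, receiver, amount):
    """Apply a transfer both ways: UPDATE the snapshot, INSERT the event."""
    with conn:  # single transaction, so debit and credit happen atomically
        conn.execute("UPDATE balances SET amount = amount - ? WHERE account = ?",
                     (amount, sender))
        conn.execute("UPDATE balances SET amount = amount + ? WHERE account = ?",
                     (amount, receiver))
        conn.execute("INSERT INTO transfers VALUES (?, ?, ?, ?)",
                     (ts, sender, receiver, amount))

transfer("2020-10-01T09:00:00", "alice", "bob", 10)
transfer("2020-10-02T09:00:00", "bob", "alice", 10)

# The snapshot looks as if nothing happened...
balances = dict(conn.execute("SELECT account, amount FROM balances"))
# ...while the time-series table still holds both transfers.
history = conn.execute(
    "SELECT ts, sender, receiver, amount FROM transfers ORDER BY ts").fetchall()
print(balances)
print(history)
```

The snapshot table answers "what is the balance now?", while the append-only table can also answer "how did it get there?".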
Time-series used to be niche, isolated to industries like finance, process manufacturing (e.g., oil and gas, chemicals, plastics), or power and utilities. But in the last few years, time-series workloads have exploded (the fastest growing category in the past 24 months). This is partly due to the growth in IT monitoring and IoT, but there are also many other new sources of time-series data: cryptocurrencies, gaming, machine learning, and more.
What is happening is that everyone wants to make better data-driven decisions faster, which means collecting data at the highest fidelity possible. Time-series is the highest fidelity of data you can capture, because it tells you exactly how things are changing over time. While traditional datasets give you static snapshots, time-series data provides the dynamic movie of what’s happening across your system: e.g., your software, your physical power plant, your game, your customers inside your application.
Time-series is no longer some niche workload. It’s everywhere. In fact, all data is time-series data - if you are able to store it at that fidelity. Of course, that’s the problem with collecting time-series data: it’s relentless. By performing all these inserts, as opposed to updates, you end up with a lot more data, at higher volumes and velocities than ever before. You quickly get to tables in the billions of rows. For a traditional database, this creates challenges around performance and scalability.
That’s where TimescaleDB comes in.
What is TimescaleDB?
TimescaleDB is the leading relational database for time-series data. Engineered on top of PostgreSQL, Timescale is available via free software or as a fully-managed service on AWS, Azure, and GCP.
TimescaleDB is purpose-built for time-series workloads, so that you can get orders of magnitude better performance at a fraction of the cost, along with a much better developer experience. This means massive scale (hundreds of billions of rows and millions of inserts per second on a single server), 94%+ native compression, 10-100x faster queries than PostgreSQL, InfluxDB, Cassandra, and MongoDB – all while maintaining the reliability, ease-of-use, SQL interface, and overall goodness of PostgreSQL.
Today, there are several options for storing time-series data. However, most are non-relational systems that are essentially glorified metric stores, focused on storing numerical data and not the broad spectrum of data types (nor the rich representation of relationships between datasets) that time-series workloads need.
In April 2017, we launched TimescaleDB into this world full of non-relational metric stores as the first time-series database that supported full SQL. Since then, many others have copied our SQL approach to time-series (including some that are very suspiciously named, *cough* Amazon Timestream *cough*), but no one has been able to replicate the true relational foundation and community of TimescaleDB.
As a result, in just 3.5 years, TimescaleDB has come a long way, now with tens of millions of downloads and over 500,000 active databases. The TimescaleDB developer community includes organizations like AppDynamics, Bosch, Cisco, Comcast, DigitalOcean, Dow Chemical, Electronic Arts, Fujitsu, IBM, Microsoft, Rackspace, Schneider Electric, Samsung, Siemens, Uber, Walmart, Warner Music, WebEx, and thousands of others.
In addition to this dedicated community, we also benefit from the vast PostgreSQL community and ecosystem. Altogether, the TimescaleDB community is the largest developer community for time-series data.
🤔 But I thought <insert skepticism here>
Ever since we launched TimescaleDB, we’ve met skepticism. After all, building a time-series database on PostgreSQL is a non-obvious, somewhat heretical decision. Yet with each release, we continue to disprove our haters and delight our users. Because, as it turns out, building a scalable relational database for time-series isn’t impossible – it’s just hard. But, with our talented team and passionate users, we’re doing it.
Myth 1: A relational database can’t scale as well as a non-relational database
Fact: We outperform non-relational (and other relational) databases for time-series data. Versus Cassandra, 10x higher inserts, 1000x faster queries. Versus Mongo, 20% higher inserts, 1400x faster queries. Versus InfluxDB, higher inserts, faster queries, and better reliability. (Unlike all of these options, we also support full SQL, which allows users to run complex analysis on their data using a programming language and the tools they already know.)
Myth 2: Relational databases take up too much disk space (or, row-oriented databases can’t compress as well as columnar databases)
Fact: It is possible to build columnar compression in a row-oriented database, which is what we have done. TimescaleDB employs several best-in-class compression algorithms, including delta-delta, Gorilla, and Simple-8b RLE, allowing us to achieve 94%+ native compression.
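As a rough illustration of why such algorithms work well on time-series columns, here is a simplified sketch of delta-of-delta encoding (what the text above calls delta-delta) in Python. It is an assumption-laden toy, not TimescaleDB's implementation: the function names are hypothetical, and the real system goes on to pack the resulting small integers into compact bit representations.

```python
def delta_of_delta_encode(values):
    """Return [first value, first delta, delta-of-deltas...] for a list of ints."""
    if len(values) < 2:
        return list(values)
    deltas = [b - a for a, b in zip(values, values[1:])]
    dods = [b - a for a, b in zip(deltas, deltas[1:])]
    return [values[0], deltas[0]] + dods

def delta_of_delta_decode(encoded):
    """Invert delta_of_delta_encode."""
    if len(encoded) < 2:
        return list(encoded)
    values = [encoded[0], encoded[0] + encoded[1]]
    delta = encoded[1]
    for dod in encoded[2:]:
        delta += dod                      # rebuild the running delta
        values.append(values[-1] + delta)  # rebuild the original value
    return values

# Timestamps sampled roughly every 10 seconds, with one late arrival.
timestamps = [1000, 1010, 1020, 1030, 1041, 1051]
encoded = delta_of_delta_encode(timestamps)
print(encoded)  # [1000, 10, 0, 0, 1, -1]
```

Because regularly sampled timestamps produce mostly zeros and other tiny values after this transform, a follow-up integer packing step has far fewer bits to store than the raw column would need.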
Myth 3: Non-relational databases don’t require schemas, which makes development easier
Fact: Every database, relational or non-relational, uses a schema to store data. The only difference is whether you have the ability to modify that schema and optimize it for your use. However, having a schema automatically generated for you is useful. We are already exploring automatic schemas: e.g., see Promscale, our new analytical platform for Prometheus built on TimescaleDB, which stores data in a dynamically auto-generated schema highly optimized for metrics. More to come.
Myth 4: A relational database can’t scale-out across multiple machines
Fact: NewSQL databases (mentioned above) are disproving this myth for transactional workloads. And today, we are disproving this myth for time-series workloads, with TimescaleDB 2.0.
TimescaleDB 2.0: Multi-node, petabyte-scale, and completely free
As mentioned above, customers have already been running multi-node TimescaleDB in continuous daily use for many months, including a 22-server cluster by a Fortune 100 tech company ingesting more than a billion rows per day.
Introducing distributed hypertables
To achieve multi-node, TimescaleDB 2.0 introduces the concept of a distributed hypertable.
A regular hypertable, one of our original innovations, is a virtual table in TimescaleDB that automatically partitions data into many sub-tables (“chunks”) on a single machine, continuously creating new ones as necessary, yet provides the illusion of a single continuous table across all time.
A distributed hypertable is a hypertable that automatically partitions data into chunks across multiple machines, while still maintaining the illusion (and user-experience) of a single continuous table across all time.
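The idea of partitioning by time into chunks and spreading those chunks across machines can be sketched in a few lines of Python. Everything here is a hypothetical illustration: the chunk width, the node names, the device_id partitioning key, and the CRC32 hash are assumptions chosen for the example, not TimescaleDB's actual partitioning logic.

```python
import zlib
from datetime import datetime, timedelta, timezone

CHUNK_INTERVAL = timedelta(days=7)       # hypothetical chunk width
DATA_NODES = ["dn1", "dn2", "dn3"]       # hypothetical data node names
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)

def chunk_start(ts):
    """Align a timestamp to the start of the time chunk that contains it."""
    n = (ts - EPOCH) // CHUNK_INTERVAL   # whole chunk intervals since the epoch
    return EPOCH + n * CHUNK_INTERVAL

def data_node_for(device_id):
    """Pick a data node by hashing a (hypothetical) space-partitioning key."""
    return DATA_NODES[zlib.crc32(device_id.encode()) % len(DATA_NODES)]

def route(ts, device_id):
    """Return (data node, chunk start) for one incoming row."""
    return data_node_for(device_id), chunk_start(ts)

t1 = datetime(2020, 10, 29, 12, 0, tzinfo=timezone.utc)
t2 = t1 + timedelta(hours=3)
print(route(t1, "sensor-1"))
print(route(t2, "sensor-1"))   # same device, same week: same node and chunk
```

The caller only ever sees one logical table; the routing function is what decides, row by row, which chunk on which machine receives the data.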
The architecture consists of an access node (AN), which stores metadata for the distributed hypertable and performs query planning across the cluster, and a set of data nodes (DN), which store subsets of the distributed hypertable dataset and execute queries locally. TimescaleDB remains a single piece of software for operational simplicity; these roles as described are established by executing database commands within TimescaleDB (e.g., on a server that should act as an access node, you run add_data_node, pointing to the hostnames of the data nodes, and then create_distributed_hypertable).
Currently, you can add any number of data nodes for horizontal scalability, as well as leverage existing Postgres physical replication on data nodes for fault tolerance (we are also working on more native replication for future releases; see below).
The access node can also be physically replicated for high availability, and future releases will focus on further scaling out the read and write paths for TimescaleDB multi-node.
Insert and query benchmarks
As a result, while a traditional hypertable scales to 1-2 million metrics per second and 100 terabytes of data, a distributed hypertable scales to ingest 10+ million metrics per second and store petabytes of data.
Distributed hypertables also take advantage of query parallelization, employing full/partial aggregates and push-downs, to achieve much faster queries.
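A small Python sketch can show what a partial aggregate is and why pushing it down helps. The function names, node names, and sample readings below are hypothetical, and the sketch ignores the real query planner entirely: each data node reduces its own rows to a small (sum, count) state, and the access node combines those states into the final average.

```python
def partial_avg(rows):
    """Runs on a data node: reduce local rows to a (sum, count) state."""
    return sum(rows), len(rows)

def finalize_avg(partials):
    """Runs on the access node: combine partial states into one average."""
    total = sum(s for s, _ in partials)
    count = sum(c for _, c in partials)
    return total / count if count else None

# Hypothetical temperature readings held by three data nodes.
node_rows = {
    "dn1": [20.0, 22.0],
    "dn2": [18.0],
    "dn3": [21.0, 19.0, 20.0],
}

partials = [partial_avg(rows) for rows in node_rows.values()]
result = finalize_avg(partials)
print(result)  # 20.0
```

Only one small tuple per node crosses the network instead of every row. Note that shipping (sum, count) rather than each node's own average matters: averaging the three per-node averages here would give a different, wrong answer because the nodes hold different numbers of rows.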
What’s next for distributed hypertables?
We are already hard at work improving upon this initial release of distributed hypertables with the next series of features:
- Replication: Currently every node needs its own replication (using primary/backup physical replication). Cluster-wide replication across data nodes, built natively for TimescaleDB, is in development.
- Rebalancing: Currently, when new data nodes are elastically added to an existing distributed hypertable, new chunks are created across the available nodes, and queries are routed accordingly to be repartitioning-aware. Existing chunks, however, are not yet rebalanced across nodes; that capability, which is related to native replication, is also in development.
- Backup: Each node can be backed up and restored, but there is currently no consistent restore-point snapshot across the whole cluster. Cluster-wide backup is also in development.
- Compression: Compression currently must be performed on a per-chunk basis. In the future, compression policies on the access node, which then propagate to each data node, will be possible.
Some features, such as continuous aggregates and time_bucket_gapfill, do not currently work on distributed hypertables. Those are also in development.
Check out the below explainer video for a breakdown of how distributed hypertables work, when and why you'd use them, best practices, and more.
What else is new in TimescaleDB 2.0?
While distributed hypertables are the biggest component of this release, TimescaleDB 2.0 also includes:
- Updated, more permissive licensing: making all of our enterprise features free and granting more rights to users.
- Substantial improvements to Continuous Aggregates: improving APIs and giving users greater control over the process.
- User-Defined Actions (new feature!): users can now define custom behaviors inside the database and schedule them using our job scheduling system.
- New and improved informational views: including over hypertables, chunks, policies, and job scheduling.
Updated licensing (everything is now free!)
TimescaleDB 2.0 introduces an update to the Timescale License, our source-available license that governs most of our advanced capabilities, including native compression, multi-node, continuous aggregates, and more.
This update makes all of our enterprise features free, and provides expanded rights to users, reinforcing our commitment to our community. Notably, this update adds the “right-to-repair”, the “right-to-improve”, and eliminates the paid enterprise tier and usage limits altogether (thus establishing that all of our software will be available for free). (More in this announcement.)
Continuous Aggregates 2.0
Continuous Aggregates are an existing capability (introduced 1.5 years ago with TimescaleDB 1.3) that automatically calculates the results of a query in the background and materializes the results, leading to vastly faster query times. They are somewhat similar to PostgreSQL materialized views, but unlike a materialized view, Continuous Aggregates do not need to be refreshed manually; views are automatically refreshed in the background as new data is added, or old data is modified. (See our Continuous Aggregates documentation for more details.)
TimescaleDB 2.0 includes substantial improvements (and, as a result, some breaking API changes) to Continuous Aggregates:
- Updated APIs that separate function and policies, giving users greater control of the Continuous Aggregation process. For example, a Continuous Aggregate can now be manually refreshed over a given range. One common user request has been to materialize recent data but leave historical data to manual refreshes. Now that is possible.
- The separation of function and policies also makes this feature more amenable to distributed operation in the future (e.g., multinode). For instance, a policy on an Access Node can trigger refreshes on Data Nodes.
- There are also other improvements that resolve other user issues (e.g., bugs, strange behavior).
- As a result, there are some breaking API changes to Continuous Aggregates (highlighted here in the documentation).
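The separation between a refresh function and a refresh policy can be illustrated with a toy model in Python. This is a sketch under stated assumptions, not TimescaleDB's mechanism or API: the names refresh and refresh_policy, the one-hour bucket width, and the in-memory data structures are all hypothetical.

```python
from collections import defaultdict

BUCKET = 3600        # bucket width in seconds (hypothetical)
raw = []             # (timestamp_seconds, value) rows of the raw table
materialized = {}    # bucket_start -> (sum, count), the stored aggregate

def refresh(start, end):
    """Recompute every bucket that lies fully inside [start, end)."""
    first = -(-start // BUCKET) * BUCKET   # round start up to a bucket boundary
    last = (end // BUCKET) * BUCKET        # round end down to a bucket boundary
    fresh = defaultdict(lambda: [0.0, 0])
    for ts, value in raw:
        if first <= ts < last:
            bucket = (ts // BUCKET) * BUCKET
            fresh[bucket][0] += value
            fresh[bucket][1] += 1
    for bucket in range(first, last, BUCKET):
        if bucket in fresh:
            materialized[bucket] = tuple(fresh[bucket])
        else:
            materialized.pop(bucket, None)  # no rows left in this bucket

def refresh_policy(now, start_offset, end_offset):
    """What a scheduled policy would do: refresh only a recent window."""
    refresh(now - start_offset, now - end_offset)

raw.extend([(100, 1.0), (200, 3.0), (3700, 5.0), (7300, 7.0)])

# The policy keeps the last two hours up to date...
refresh_policy(now=10800, start_offset=7200, end_offset=0)
after_policy = dict(materialized)

# ...and the historical bucket is materialized only when refreshed manually.
refresh(0, 3600)
print(after_policy)
print(materialized)
```

Because the policy is just a scheduled call to the same refresh function with a sliding window, users can keep recent data fresh automatically and decide for themselves when older ranges are worth recomputing.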
User-Defined Actions (New feature!)
Previously, TimescaleDB offered standard policies that let users define a schedule to run predefined actions, e.g., for data retention, compression, and continuous aggregates.
TimescaleDB 2.0 introduces the idea of a User-Defined Action (UDA). Users can now run functions and procedures implemented in a language of their choice (e.g., SQL, PL/pgSQL, C, PL/Python, or even PL/Perl) on a schedule within TimescaleDB. This allows users to automate periodic tasks that are not covered by existing policies, and even to enhance existing policies with additional functionality. Users can now also schedule predefined actions themselves, in case they need greater flexibility than what the standard policies provide. (See our User-Defined Actions documentation for more details.)
For example, you can create more generic data retention policies, data tiering policies, joint downsampling and compression policies, and more, all set to run on a schedule you define ahead of time within TimescaleDB.
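To show the general shape of the idea, here is a toy scheduler in Python that runs a user-supplied retention action on an interval. It is only a conceptual sketch: the add_job and run_due_jobs helpers, the job and config layout, and the chunk dictionary are hypothetical stand-ins written for this example and should not be read as TimescaleDB's job scheduling API.

```python
jobs = []  # each job: action, interval, config, next_start (all in seconds)

def add_job(action, interval, config=None, start=0):
    """Register a user-defined action to run every `interval` seconds."""
    jobs.append({"action": action, "interval": interval,
                 "config": config or {}, "next_start": start})
    return len(jobs)  # hypothetical job id

def run_due_jobs(now):
    """Run every job whose next start time has arrived; return their ids."""
    ran = []
    for job_id, job in enumerate(jobs, start=1):
        if job["next_start"] <= now:
            job["action"](job_id, job["config"], now)
            job["next_start"] = now + job["interval"]
            ran.append(job_id)
    return ran

# Hypothetical chunks, keyed by name, valued by the end of their time range.
chunks = {"chunk_1": 86400, "chunk_2": 172800, "chunk_3": 259200}

def drop_old_chunks(job_id, config, now):
    """User-defined action: a generic retention rule driven by its config."""
    cutoff = now - config["drop_after"]
    for name in [n for n, end in chunks.items() if end <= cutoff]:
        del chunks[name]

job_id = add_job(drop_old_chunks, interval=3600, config={"drop_after": 86400})
first_run = run_due_jobs(now=259200)          # job is due, old chunks dropped
second_run = run_due_jobs(now=259200 + 1800)  # not due again for an hour
print(first_run, second_run, chunks)
```

The action is ordinary user code that receives its configuration at run time, which is what makes it possible to express rules the built-in policies do not cover.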
Informational views (NEW and improved views)
TimescaleDB 2.0 also introduces new and updated informational views, including over hypertables, chunks, policies, and job scheduling.
How to get started with TimescaleDB 2.0
The TimescaleDB 2.0 Release Candidate is available immediately for self-managed software installations, with General Availability expected in late 2020. TimescaleDB 2.0 will be available on our hosted time-series services at that time. If you’re already using TimescaleDB, we’ve created detailed documentation to simplify and speed up your migration.
- Download TimescaleDB 2.0 to get started right away.
- Read the release overview guide (including changes in this release)
- Read the upgrade documentation (for existing software users migrating from TimescaleDB 1.x).
Join us for Office Hours on Tuesday, November 10th to ask your questions live, directly to the Timescale team.
From there, we also encourage you to join our 5,000+ member Slack community for any questions, to learn more, and to meet like-minded developers – we’re active in all channels and here to help.