How to manage Prometheus high-availability with PostgreSQL + TimescaleDB

Prometheus has become one of the most commonly used open-source monitoring and alerting systems. What makes Prometheus so great is that it solves monitoring problems in a simple and effective way. Yet there are still challenges with operating Prometheus, including long-term storage and managing high-availability.

In a previous post, we described how and why to use PostgreSQL + TimescaleDB to enable long-term storage for Prometheus (as well as to simplify operations and add query power/flexibility).

In this post, we describe how PostgreSQL + TimescaleDB can also help with managing high-availability.

There are several reasons to add high-availability (HA) to Prometheus deployments:

  • Avoiding data loss and a single point of failure
  • Avoiding missing any critical alerts generated by an Alertmanager
  • Replicating Prometheus data (although the data will not be fully consistent between replicas — more on this later)

Setting up HA for Prometheus requires running two or more identically configured Prometheus servers in parallel. Yet this simple setup does come with limitations.

One problem is that it can be hard to keep data in sync, so parallel Prometheus instances often do not have identical data. There are a few reasons for this. One is that scrape times differ, since each instance schedules its scrapes on its own clock. Another is that, in the case of instance failures, either of the Prometheus nodes may miss some data, which means that in the long run none of the Prometheus instances would have a complete data set. In other words, there would be no single source of truth.

Another problem is with remote storage. When running HA Prometheus with remote storage, one can decide to just have a separate remote storage for each Prometheus instance. But this approach, while valid, does introduce other weaknesses: again, no ground truth; the complexity of operating multiple remote storage instances; and hardware costs.

Given these limitations, it would seem that having shared remote storage across Prometheus nodes would be beneficial: it would establish a single ground truth while minimizing operational overhead.

This is the problem that version 0.4 of our prometheus-postgresql-adapter aims to address. In this latest version, we add support for leader election to allow one to use a single PostgreSQL/TimescaleDB instance as a shared remote storage and the single source of ground truth for an HA Prometheus deployment. And, in particular, we do this using some of the concurrency control features native to PostgreSQL.

Running our new adapter is as simple as adding a few configuration parameters. Let’s look at an example to see how it works.

How it works

At a high level, our HA setup looks like this: each Prometheus instance talks to its own adapter, which writes data to TimescaleDB. To prevent duplicate writes, one adapter is elected as the leader, and only the leader is allowed to write to the database. The leader election happens in the background between adapters within the same HA group (more on leader election after the example).

Let’s see it in action. Here is an example `prometheus.yml` for the first Prometheus instance:

global:
 scrape_interval: 5s
 evaluation_interval: 10s
scrape_configs:
 - job_name: prometheus
   static_configs:
     - targets: ['localhost:9090']
remote_write:
 - url: "http://localhost:9201/write"
remote_read:
 - url: "http://localhost:9201/read"
   read_recent: true

And here is prometheus.yml for the second Prometheus instance:

global:
 scrape_interval: 5s
 evaluation_interval: 10s
scrape_configs:
 - job_name: prometheus
   static_configs:
     - targets: ['localhost:9091']
remote_write:
 - url: "http://localhost:9202/write"
remote_read:
 - url: "http://localhost:9202/read"
   read_recent: true

Nothing fancy here, as this is pretty basic Prometheus configuration. The main difference between the two configs is the remote write/read URLs (each instance sends data to its own adapter, hence the different port numbers).

Now let’s start two Prometheus instances, pointing each one at its own config file (here the two configs above are saved as prometheus1.yml and prometheus2.yml next to the binary):

./prometheus --config.file=prometheus1.yml --storage.tsdb.path=data1/
./prometheus --config.file=prometheus2.yml --web.listen-address=:9091 --storage.tsdb.path=data2/

And let’s start two Prometheus adapters:

./prometheus-postgresql-adapter -log.level=debug -leader-election.pg-advisory-lock-id=1 -leader-election.pg-advisory-lock.prometheus-timeout=6s
./prometheus-postgresql-adapter -log.level=debug -web.listen-address=:9202 -leader-election.pg-advisory-lock-id=1 -leader-election.pg-advisory-lock.prometheus-timeout=6s

Since both adapters belong to the same Prometheus HA group, they run with the same value for -leader-election.pg-advisory-lock-id=1.

We set leader-election.pg-advisory-lock.prometheus-timeout to 6s because our Prometheus scrape interval is 5s, which means that each adapter should get a request from Prometheus every 5 seconds. If a request doesn’t come within 6 seconds, the adapter will assume that its Prometheus instance died, and the adapter will resign from being a leader.
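
As a rough sketch of how such a timeout can be enforced (this is just the idea, not the adapter’s actual implementation; the names touch and watchdog and the demo main are illustrative): record the time of each incoming write request, and have a background check resign leadership once that timestamp is older than the configured timeout.

package main

import (
	"log"
	"sync/atomic"
	"time"
)

// promTimeout mirrors -leader-election.pg-advisory-lock.prometheus-timeout.
const promTimeout = 6 * time.Second

// lastRequest holds the UnixNano timestamp of the most recent write request.
var lastRequest atomic.Int64

// touch would be called from the adapter's /write handler every time Prometheus sends samples.
func touch() { lastRequest.Store(time.Now().UnixNano()) }

// watchdog resigns leadership once Prometheus has been silent for longer than promTimeout.
// The resign callback stands in for releasing the PostgreSQL advisory lock.
func watchdog(resign func()) {
	ticker := time.NewTicker(time.Second)
	defer ticker.Stop()
	for range ticker.C {
		if time.Since(time.Unix(0, lastRequest.Load())) > promTimeout {
			log.Printf("no request from Prometheus for more than %s, resigning leadership", promTimeout)
			resign()
			return
		}
	}
}

func main() {
	touch() // pretend Prometheus just sent a write request
	go watchdog(func() { log.Println("advisory lock released") })
	time.Sleep(10 * time.Second) // no further requests arrive, so the watchdog fires after ~6s
}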

If we look into the log of the first adapter, we should see something like:

level=debug caller=log.go:21 msg="Lock obtained for group id 1"
level=info caller=log.go:25 msg="Initialized leader election based on PostgreSQL advisory lock"
level=info caller=log.go:25 msg="Starting up..."
level=info caller=log.go:25 msg=Listening addr=:9201
level=debug caller=log.go:21 msg="Wrote samples" count=51 duration=0.061607509
level=debug caller=log.go:21 msg="Wrote samples" count=81 duration=0.020131404
level=info caller=log.go:25 msg="Samples write throughput" samples/sec=51
level=debug caller=log.go:21 msg="Wrote samples" count=81 duration=0.014755038
level=info caller=log.go:25 msg="Samples write throughput" samples/sec=81

In other words, we can see that the first adapter has obtained the lock and become the leader (i.e., only the first adapter can write to the database).
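
If you want to confirm this from the database side, PostgreSQL exposes held advisory locks in the pg_locks view. Below is a small Go sketch that lists them; the connection string is a placeholder, and any SQL client pointed at the adapters’ database works just as well:

package main

import (
	"database/sql"
	"fmt"
	"log"

	_ "github.com/lib/pq" // PostgreSQL driver (one common choice)
)

func main() {
	// Placeholder connection string; use the same database the adapters write to.
	db, err := sql.Open("postgres", "postgres://postgres@localhost:5432/postgres?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// Each row shows which backend (pid) holds an advisory lock;
	// classid and objid together encode the lock id.
	rows, err := db.Query(
		`SELECT pid, classid::text, objid::text, granted FROM pg_locks WHERE locktype = 'advisory'`)
	if err != nil {
		log.Fatal(err)
	}
	defer rows.Close()

	for rows.Next() {
		var pid int
		var classid, objid string
		var granted bool
		if err := rows.Scan(&pid, &classid, &objid, &granted); err != nil {
			log.Fatal(err)
		}
		fmt.Printf("backend pid=%d holds advisory lock classid=%s objid=%s granted=%v\n",
			pid, classid, objid, granted)
	}
	if err := rows.Err(); err != nil {
		log.Fatal(err)
	}
}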

Now let’s stop the first Prometheus instance and see what happens. Log output from the first adapter:

level=warn caller=log.go:29 msg="Prometheus timeout exceeded" timeout=6s
level=warn caller=log.go:29 msg="Scheduled election is paused. Instance is removed from election pool"
level=info caller=log.go:25 msg="Instance is no longer a leader"
level=debug caller=log.go:21 msg="Scheduled election is paused. Instance can't become a leader until scheduled election is resumed (until Prometheus comes up again)"

Log output from the second adapter:

level=debug caller=log.go:21 msg="Lock obtained for group id 1"
level=debug caller=log.go:21 msg="Wrote samples" count=100 duration=0.035791517
level=info caller=log.go:25 msg="Samples write throughput" samples/sec=0
level=debug caller=log.go:21 msg="Wrote samples" count=100 duration=0.013215907
level=info caller=log.go:25 msg="Samples write throughput" samples/sec=100
level=debug caller=log.go:21 msg="Wrote samples" count=100 duration=0.010737419

As we can see from the above log, the first adapter hasn’t received a request within 6 seconds, and so decided to resign from the leader role. The built-in adaptive failover mechanism has taken care of re-electing the leader: the second adapter was automatically elected and started writing metrics into the database.

Let’s dig a bit deeper and see how adapter leader election works.

How adapter leader election works

Since each Prometheus instance has its own adapter, we need to find a way to coordinate between the adapters and allow only one to write into PostgreSQL/TimescaleDB. This is actually a common problem in distributed systems where one needs to select one node as a leader and allow writes only from that node. In other words, we need a mechanism for leader election.

There are several ways of electing a leader in distributed systems: using a consensus algorithm like Raft, or simply racing to acquire a shared mutex (lock). If we wanted stronger semantics, we might choose an approach like the former and run a cluster of adapters. But given the weaker semantic needs here, we instead adopted something simpler with far less operational complexity.

Before jumping into details, let’s try to understand what semantics you get when running HA Prometheus.

Prometheus is a sampling-based metrics system, which means it doesn’t care about individual events so much as about overall trends.

“In general a metrics-based monitoring system [like Prometheus] is not suitable for anything that cares about individual increments” — Prometheus developer Brian Brazil

As we mentioned earlier, there are no guarantees that multiple Prometheus instances (even within the same high-availability group) will have the same data. Because each Prometheus instance attaches a timestamp to every sample, and that timestamp comes from its own local clock rather than the clock of the server or infrastructure it is monitoring, it is even more likely that the data will not be consistent. As a result, simply eliminating identical, repeated values (e.g., using UPSERTs and UNIQUE constraints) will not solve the problem.

In this context, we decided to implement leader election based on a shared lock. ZooKeeper and Consul are commonly used for coordinating between nodes, but there are other options that are more readily available.

In fact, PostgreSQL supports locks with application-defined meaning, called advisory locks. Since we already store metrics data in PostgreSQL/TimescaleDB, it also makes sense to use PostgreSQL for coordination among our adapters.

Here’s how it works. Once started, each adapter instance tries to obtain the lock, and the first one to succeed becomes the leader. The lock is exclusive, so only one instance can hold it at a time. If an instance (either adapter or Prometheus) goes down, it will automatically drop the lock so another instance can take over and become the leader. The advantage of the PostgreSQL advisory lock is that no additional metadata needs to be maintained, since PostgreSQL manages the lock lifecycle.
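
To make the mechanics concrete, here is a minimal Go sketch of advisory-lock-based election using database/sql and PostgreSQL’s pg_try_advisory_lock and pg_advisory_unlock functions. It illustrates the technique rather than reproducing the adapter’s actual code, and the connection string is a placeholder:

package main

import (
	"context"
	"database/sql"
	"log"
	"time"

	_ "github.com/lib/pq" // PostgreSQL driver (one common choice)
)

const groupLockID = 1 // same role as -leader-election.pg-advisory-lock-id

func main() {
	ctx := context.Background()

	// Placeholder connection string; point it at the database the adapters write to.
	db, err := sql.Open("postgres", "postgres://postgres@localhost:5432/postgres?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// Advisory locks are held by a session, so pin a single connection from the pool.
	conn, err := db.Conn(ctx)
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()

	// Race to become the leader of this HA group.
	for {
		var gotLock bool
		// pg_try_advisory_lock returns true only for the first session to grab this id;
		// every other session gets false and simply retries.
		if err := conn.QueryRowContext(ctx,
			"SELECT pg_try_advisory_lock($1)", groupLockID).Scan(&gotLock); err != nil {
			log.Fatal(err)
		}
		if gotLock {
			break
		}
		time.Sleep(time.Second) // another adapter is the leader; try again later
	}
	log.Printf("lock obtained for group id %d; this instance is the leader and may write", groupLockID)

	// ... only the leader writes samples to TimescaleDB here ...

	// Resigning (e.g. because Prometheus stopped sending requests) is just releasing the
	// lock; it is also released automatically if this session dies.
	if _, err := conn.ExecContext(ctx,
		"SELECT pg_advisory_unlock($1)", groupLockID); err != nil {
		log.Fatal(err)
	}
}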

Adapters do not persist any data, so there are no strong guarantees during failover (i.e., some data loss/duplication is possible during the election window). This seems like a reasonable trade-off if you take into account the semantics of Prometheus data explained above.

Alternatives

There are also other options for adapter coordination.

If you are running your services in Kubernetes, then you can use the etcd key-value store that already backs all of the cluster data. There is an existing Docker image that implements leader election based on etcd; it can run as a sidecar next to an adapter and provide leader information through an HTTP endpoint. The main drawback of this solution is that it requires an etcd cluster.
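
For comparison, here is a minimal sketch of etcd-based election using the etcd Go client’s concurrency package. It illustrates the general approach rather than that sidecar image; the endpoint, key prefix, and instance name below are placeholders:

package main

import (
	"context"
	"log"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
	"go.etcd.io/etcd/client/v3/concurrency"
)

func main() {
	// Placeholder endpoint for the cluster's etcd.
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"localhost:2379"},
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()

	// A session keeps a lease alive; if this process dies, the lease expires and
	// leadership moves on, much like dropping the advisory lock.
	session, err := concurrency.NewSession(cli)
	if err != nil {
		log.Fatal(err)
	}
	defer session.Close()

	election := concurrency.NewElection(session, "/prometheus-adapter/leader")
	// Campaign blocks until this instance becomes the leader.
	if err := election.Campaign(context.Background(), "adapter-1"); err != nil {
		log.Fatal(err)
	}
	log.Println("this adapter is now the leader and may write to the database")
}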

To support additional leader election mechanisms (e.g., based on etcd, ZooKeeper, etc.), we’ve provided a REST interface that lets you easily plug in an external leader election mechanism if needed.

Read and write performance

The latest version of our adapter can be configured to run in read-only mode. This is useful if you’re using PostgreSQL/TimescaleDB replication: you can point the adapter to read replicas to get better read performance.

We’ve also recently improved adapter write performance: the latest adapter includes optimizations, introduced in version 0.3, that yield 6x higher write throughput than before.

Next steps

To get started with the adapter, have a look at the tutorial, or visit our GitHub and grab the latest binaries.

If you are new to TimescaleDB, you can learn more here, and follow our installation instructions to get started.


Like this post and interested in learning more? Check out our GitHub, join our Slack community, and sign up for the community mailing list below.