The ClickHouse Mystery: Uncovering a Hidden Bottleneck in Cloudflare's Billing System


Cloudflare relies heavily on ClickHouse, an open-source OLAP database, to process millions of queries daily that determine billing for its customers. The pipeline underpins hundreds of millions of dollars in revenue and feeds fraud detection, so any slowdown has serious consequences. When a routine migration caused daily aggregation jobs to grind to a halt, the usual suspects—I/O, memory, rows scanned—showed nothing wrong. This is the story of how engineers discovered a subtle bottleneck deep inside ClickHouse’s internals and the three patches that resolved it.

1. What is Cloudflare's ClickHouse-based billing pipeline and why is it so critical?

At Cloudflare, ClickHouse handles over a hundred petabytes of data across dozens of clusters. One of its most vital uses is the billing pipeline: every day, millions of queries are executed to calculate usage charges for each customer. If these aggregation jobs finish late, invoices become difficult to reconcile and downstream systems like fraud detection suffer. The pipeline processes data from hundreds of internal teams and supports a revenue stream worth hundreds of millions of dollars. Any delay in the pipeline has immediate financial and operational impacts, making its reliability paramount.

Source: blog.cloudflare.com

2. What is the Ready-Analytics system and how does it work?

To simplify onboarding for internal teams, Cloudflare built Ready-Analytics in early 2022. Instead of designing custom tables, teams stream data into a single, massive table that uses a standardized schema—20 float fields, 20 string fields, a timestamp, and an indexID. Data is sorted by a primary key: (namespace, indexID, timestamp). The indexID is a string that allows each namespace to define its own optimal sort order. This system became extremely popular, growing to over 2 PiB by December 2024 and ingesting millions of rows per second. However, it had a critical flaw: a one-size-fits-all retention policy.
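
The standardized layout described above can be sketched as a generated DDL string. This is a hypothetical reconstruction: the table and column names are assumptions, and only the shape (20 float fields, 20 string fields, a timestamp, an indexID, sorted by (namespace, indexID, timestamp)) comes from the text.

```python
def ready_analytics_ddl(table: str = "ready_analytics.events") -> str:
    """Build a CREATE TABLE statement matching the standardized schema
    described in the text. Names are illustrative, not Cloudflare's."""
    float_cols = [f"double{i} Float64" for i in range(1, 21)]
    string_cols = [f"string{i} String" for i in range(1, 21)]
    columns = (
        ["namespace String", "indexID String", "timestamp DateTime"]
        + float_cols
        + string_cols
    )
    return (
        f"CREATE TABLE {table} (\n    "
        + ",\n    ".join(columns)
        + "\n) ENGINE = MergeTree\nORDER BY (namespace, indexID, timestamp)"
    )

ddl = ready_analytics_ddl()
print(ddl)
```

Because indexID comes right after namespace in the sort key, each namespace effectively controls how its own rows are clustered on disk.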

3. Why did the billing pipeline suddenly slow down after a migration?

After a scheduled migration, the daily aggregation jobs that drive Cloudflare’s billing became significantly slower. All standard diagnostic checks—I/O, memory, rows scanned, parts read—appeared normal. The slowdown was unexpected because the migration itself seemed routine. The real cause was a hidden bottleneck within ClickHouse’s internal query execution engine, triggered by a combination of the large, flat table design and the new retention policies. The engineers had to dig deep into ClickHouse internals to find the root cause.

4. How did engineers uncover the hidden bottleneck inside ClickHouse?

After exhausting usual performance checks, the team turned to advanced profiling tools inside ClickHouse. They discovered that the slowdown stemmed from a subtle interaction between the primary key’s sort order and ClickHouse’s partition pruning logic. The indexID field, while useful for per-namespace sorting, caused ClickHouse to scan more data than necessary when queries spanned multiple namespaces. The bottleneck was buried in how ClickHouse handled granularity and mark ranges during aggregation queries. By tracing query execution plans and analyzing merge-tree internals, they pinpointed three areas that needed patching.
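
The effect of granularity and mark ranges can be illustrated with a toy model. ClickHouse's sparse primary index keeps one mark per granule (8192 rows by default), and any granule containing a possibly matching row must be read in full. The numbers below are invented for illustration, not taken from Cloudflare's workload.

```python
def granules_touched(matching_rows, granule_rows=8192):
    """Granules containing at least one matching row; under this toy model
    each such granule must be read in full."""
    return len({row // granule_rows for row in matching_rows})

TOTAL_ROWS = 1_000_000

# Case 1: the sort order clusters the 10k matching rows together
# (e.g. a filter on the leading key columns).
contiguous_granules = granules_touched(range(500_000, 510_000))

# Case 2: the same 10k matching rows are scattered through the table
# (e.g. a filter the (namespace, indexID, timestamp) order does not
# cluster), so nearly every granule holds a match.
scattered_granules = granules_touched(range(0, TOTAL_ROWS, 100))

print(contiguous_granules, "granules vs", scattered_granules, "granules")
```

The same 10,000 matching rows cost 2 granules when clustered but 123 when scattered, which is the kind of hidden read amplification a multi-namespace query can suffer when the sort key does not cluster its predicate.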


5. What were the three patches that fixed the ClickHouse bottleneck?

Cloudflare engineers contributed three patches to ClickHouse to resolve the issue. The first improved the index granularity for primary keys with high cardinality string fields like indexID, reducing unnecessary scans. The second optimized partition pruning when multiple namespaces are queried, preventing redundant reads. The third enhanced query concurrency handling by adjusting internal threading limits. Together, these patches restored the billing pipeline’s performance and even improved overall query speed for other workloads on the Ready-Analytics platform. All three were upstreamed to the open-source ClickHouse project.
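
The text does not give the patch details, but the granularity trade-off behind the first patch can be shown with back-of-the-envelope arithmetic: for a lookup matching only a few rows on a high-cardinality key, the rows actually read scale with the granule size, while the in-memory index grows as granules shrink. The formula below is a simplified worst-case model, not ClickHouse's exact accounting.

```python
import math

def rows_read(matching_rows: int, granule_rows: int) -> int:
    """Worst-case rows read for a narrow lookup: the match can straddle a
    granule boundary, so allow one extra granule."""
    granules = math.ceil(matching_rows / granule_rows) + 1
    return granules * granule_rows

# A lookup matching 10 rows under different index granularities:
for g in (8192, 1024, 256):
    print(f"granularity {g}: up to {rows_read(10, g)} rows read")
```

Smaller granules mean fewer wasted rows per lookup but more marks to keep in memory (roughly total_rows / granule_rows of them), which is why granularity tuning is a balance rather than a free win.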

6. What was the problem with the one-size-fits-all retention policy?

The original Ready-Analytics system used a 31-day partition-based retention policy applied uniformly to all namespaces. While simple, this prevented teams that needed longer retention (due to legal or contractual obligations) from using the platform, while teams that needed only a few days of storage wasted resources. This limitation pushed many use cases back to conventional, more complex custom-table setups. The new per-namespace retention system allows each team to define its own time-to-live, improving flexibility and efficiency. However, implementing this change required careful rebalancing of ClickHouse internals, which indirectly led to the discovery of the hidden bottleneck.
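
The per-namespace policy can be sketched as a lookup with a fallback to the old uniform default. The namespace names and TTL values here are made up for illustration; only the 31-day default and the idea of per-namespace time-to-live come from the text.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical per-namespace retention overrides (illustrative values).
RETENTION_DAYS = {"http_requests": 90, "dns_samples": 7}
DEFAULT_RETENTION_DAYS = 31  # the old one-size-fits-all policy

def is_expired(namespace: str, row_ts: datetime, now: datetime) -> bool:
    """A row expires once it is older than its namespace's TTL,
    falling back to the uniform 31-day default."""
    ttl = timedelta(days=RETENTION_DAYS.get(namespace, DEFAULT_RETENTION_DAYS))
    return now - row_ts > ttl

now = datetime(2024, 12, 1, tzinfo=timezone.utc)
old = now - timedelta(days=40)
print(is_expired("dns_samples", old, now))    # 7-day TTL  -> True
print(is_expired("http_requests", old, now))  # 90-day TTL -> False
```

In practice ClickHouse expresses this kind of rule declaratively (e.g. via TTL clauses or partition drops) rather than row-by-row in application code; the sketch only captures the policy logic.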