Quick Facts
- Category: Education & Careers
- Published: 2026-05-20 19:03:25
Overview
When your daily billing pipeline starts crawling, every minute of delay can cost you in reconciliation headaches, missed SLAs, and lost revenue. At Cloudflare, we rely on ClickHouse to process millions of queries each day that calculate usage-based billing for hundreds of millions of dollars in revenue. After a routine migration, those critical aggregation jobs slowed to a crawl. All the usual metrics looked fine: I/O was normal, memory pressure was low, and rows scanned and parts read were within expected ranges. The problem turned out to be a subtle bottleneck buried deep in ClickHouse's internals, one that required three targeted patches to resolve.

This guide walks you through the same investigative journey we took: from understanding the underlying architecture, to identifying the hidden bottleneck, to implementing the fixes. By the end, you’ll know what to look for when your ClickHouse pipeline suddenly degrades—even when all the usual suspects are clean.
Prerequisites
Before diving in, make sure you have a solid grasp of these ClickHouse concepts:
- Partitioning and Primary Keys – How ClickHouse organizes data into parts and sorts them using primary keys.
- MergeTree Engine – The engine behind most ClickHouse tables, including how merges work and how parts are compacted.
- TTL vs. Partition-based Retention – Native time-to-live versus custom retention via dropping partitions.
- Query Profiling – Using system.query_log and system.part_log to diagnose slow queries.
- Cluster Architecture – Understanding shards, replicas, and distributed tables (though the bottleneck we found was single-node).
This guide assumes you’re already comfortable running ClickHouse in production and have access to performance metrics (CPU, I/O, memory, query logs).
Step-by-Step Diagnosis and Fix
1. Understand Your Setup: The Petabyte-Scale Analytics Platform
Cloudflare runs a system called Ready-Analytics built on ClickHouse. It stores over 100 petabytes across dozens of clusters. The idea is simple: teams stream data into a single massive table instead of designing custom schemas. Each record uses a standard schema (20 float fields, 20 string fields, a timestamp, and an indexID). The primary key is (namespace, indexID, timestamp), where namespace distinguishes different datasets and indexID controls data ordering for each namespace. By December 2024, the table had grown to 2 PiB, ingesting millions of rows per second.
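To make the shared schema concrete, here is a minimal Python sketch of one record and the ordering implied by the primary key. The class name, the field names `floats` and `strings`, and the choice of a string type for indexID are assumptions made for illustration; the real column names and types are not given here.

```python
from dataclasses import dataclass, field
from datetime import datetime


@dataclass
class Record:
    """Illustrative model of one row in the shared table (names are hypothetical)."""
    namespace: str       # which dataset the row belongs to
    index_id: str        # "indexID" in the text; its real type is an assumption
    timestamp: datetime
    floats: list[float] = field(default_factory=lambda: [0.0] * 20)
    strings: list[str] = field(default_factory=lambda: [""] * 20)

    def sort_key(self) -> tuple[str, str, datetime]:
        # Mirrors the primary key (namespace, indexID, timestamp).
        return (self.namespace, self.index_id, self.timestamp)


records = [
    Record("dns", "b", datetime(2024, 12, 2)),
    Record("cdn", "z", datetime(2024, 12, 3)),
    Record("dns", "a", datetime(2024, 12, 5)),
]
# Rows from the same namespace end up next to each other once sorted.
ordered = sorted(records, key=Record.sort_key)
```

Sorting by this key groups each namespace's rows together, which is what lets a single table serve many unrelated datasets.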
Critical flaw: Retention was enforced by dropping partitions older than 31 days—a one-size-fits-all policy. Teams that needed longer retention (years) or shorter (days) couldn’t use this platform. We needed per-namespace retention.
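The one-size-fits-all policy can be sketched in a few lines of Python. This is an illustration only: it assumes daily partitions keyed by toYYYYMMDD(timestamp), and the table name passed to `drop_statements` is hypothetical; the actual partition expression and tooling are not described here.

```python
from datetime import date, timedelta

RETENTION_DAYS = 31  # the single retention window described above


def partitions_to_drop(existing: list[date], today: date,
                       retention_days: int = RETENTION_DAYS) -> list[date]:
    """Return the daily partitions that are older than the retention window."""
    cutoff = today - timedelta(days=retention_days)
    return sorted(d for d in existing if d < cutoff)


def drop_statements(table: str, partitions: list[date]) -> list[str]:
    # Assumes PARTITION BY toYYYYMMDD(timestamp); adjust the partition value
    # format if the real PARTITION BY clause differs.
    return [f"ALTER TABLE {table} DROP PARTITION {d:%Y%m%d}" for d in partitions]
```

Because the cutoff is a single date for the whole table, no namespace can keep data for longer or shorter than any other, which is exactly the limitation described above.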
2. Identify the Bottleneck: When Aggregation Jobs Slow Down
After a migration to support per-namespace retention, the daily aggregation queries used for billing slowed dramatically. Here’s what we checked—and what we didn’t find:
- I/O – No spike in disk reads/writes.
- Memory – No unusual pressure.
- Rows scanned – Still in the same range as before.
- Parts read – Normal.
- CPU – Consistent with previous runs.
Everything looked healthy, yet the jobs were taking hours longer. We turned to system.query_log and system.part_log to dig deeper.
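The pattern we were looking at, a large jump in duration with flat resource counters, can be expressed as a small check. The sketch below is an assumption-laden illustration: the field names follow columns of system.query_log, while the class, the function, and the thresholds (`slow_factor`, `tolerance`) are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class RunStats:
    """One run of a job, shaped like a row from system.query_log."""
    query_duration_ms: int
    read_rows: int
    read_bytes: int
    memory_usage: int


def unexplained_slowdown(baseline: RunStats, current: RunStats,
                         slow_factor: float = 2.0,
                         tolerance: float = 0.2) -> bool:
    """True when duration regressed but the resource counters stayed flat."""
    slower = current.query_duration_ms >= slow_factor * baseline.query_duration_ms

    def flat(before: int, after: int) -> bool:
        # "Flat" means the counter grew by no more than the tolerance.
        return after <= before * (1 + tolerance)

    return slower and all(
        flat(getattr(baseline, name), getattr(current, name))
        for name in ("read_rows", "read_bytes", "memory_usage")
    )
```

A `True` result is a hint that the cause sits outside the query's own read path, for example in background work such as merges.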
3. Discover the Hidden Bottleneck: Inside ClickHouse’s Merge Pipeline
ClickHouse stores data in parts (sorted chunks). Over time, background merges combine smaller parts into larger ones. With per-namespace retention, we had to keep namespaces with varying lifetimes in the same table. To drop old data per namespace, we started using partitions based on a virtual column that combined namespace and day. This created many micro-partitions, each with only one or a few namespaces. The merge logic, however, assumed that parts could be merged freely as long as they belonged to adjacent partitions. Because our virtual partition scheme caused many non-adjacent partitions, merges became extremely selective—and consequently much slower.
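A rough calculation shows how quickly a namespace-and-day partition key multiplies the partition count. The sketch below assumes the simplest case of exactly one namespace per partition, and the namespace counts and retention periods are made-up numbers for illustration, not figures from our deployment.

```python
def partitions_shared(retention_days: int) -> int:
    """Daily partitions alive at steady state when all namespaces share them."""
    return retention_days


def partitions_per_namespace(retention_days_by_namespace: dict[str, int]) -> int:
    """Partitions alive at steady state when each (namespace, day) pair is its own partition."""
    return sum(retention_days_by_namespace.values())


# Hypothetical workload: 1000 namespaces on 31 days, plus one kept for two years.
namespaces = {f"ns{i}": 31 for i in range(1000)}
namespaces["long_lived"] = 365 * 2

before = partitions_shared(31)
after = partitions_per_namespace(namespaces)
```

Under these assumed numbers the table goes from 31 live partitions to 31,730, and every one of them accumulates its own set of small parts.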

The hidden bottleneck was read amplification during merges. Even though query-time reads were fine, each merge had to re-read and sort a huge number of tiny parts, causing a dramatic increase in total bytes read from disk over the course of a day. This slowdown cascaded into the aggregation jobs, which depend on up-to-date merged parts for efficient scanning.
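The effect of many tiny parts on merge reads can be illustrated with a deliberately simplified model. This is not ClickHouse's merge selection algorithm: it assumes parts are merged in fixed-size groups, level by level, until one part remains, and it counts every byte that a merge has to read. The function name and the `max_parts_per_merge` parameter are hypothetical.

```python
def merge_bytes_read(part_sizes: list[int], max_parts_per_merge: int) -> int:
    """Total bytes read while merging parts in groups until one part remains."""
    if max_parts_per_merge < 2:
        raise ValueError("a merge needs at least two input parts")
    parts = list(part_sizes)
    total_read = 0
    while len(parts) > 1:
        next_parts = []
        for i in range(0, len(parts), max_parts_per_merge):
            group = parts[i:i + max_parts_per_merge]
            if len(group) == 1:
                # A lone part is carried over to the next round without being read.
                next_parts.append(group[0])
            else:
                total_read += sum(group)      # every input part is read in full
                next_parts.append(sum(group))
        parts = next_parts
    return total_read
```

In this model, 1000 one-byte parts merged ten at a time are read three times over (3000 bytes in total), while the same data arriving as a single part needs no merge reads at all. The more fragmented the input, the more often the same bytes are re-read.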
4. Apply the Three Patches
We developed three patches to resolve the issue without changing the retention model:
- Optimize Merge Selectivity – Modify the merge algorithm to merge parts that share the same namespace even if the partition key differs. This reduced the number of tiny parts and cut merge overhead significantly.
- Parallel Merge Workers – Increase the number of background merge threads per partition range, allowing merges to run concurrently rather than sequentially.
- Adaptive Memory Budget for Merges – Allow merges to use more memory when the system is idle, speeding up the sorting phase.
After deploying these patches, merge write amplification dropped by 40%, and the aggregation jobs returned to their normal completion times.
Common Mistakes
- Ignoring merge performance – Many operators focus on query-time metrics (rows scanned, I/O) and forget that background merges can become the bottleneck. Always monitor system.merges and system.part_log for merge latency and bytes processed.
- Assuming partitioning fixes all retention problems – Relying on partition-level retention for per-namespace data can create a partition explosion. Validate your partition scheme with the expected number of namespaces and retention periods.
- Not simulating the workload – Before deploying a new retention strategy, run a load test that mimics production merge patterns. Use SYSTEM START MERGES and SYSTEM STOP MERGES to observe merge behavior in isolation.
- Overlooking the primary key – Even if the primary key is optimal for queries, it may cause merge inefficiencies if the sort order doesn't align well with partition boundaries.
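As a starting point for merge monitoring, the sketch below aggregates merge events per partition. It assumes the rows have already been fetched from system.part_log as dictionaries with the `event_type`, `partition_id`, `duration_ms`, and `read_bytes` columns; the function name and the shape of the summary are hypothetical.

```python
from collections import defaultdict


def summarize_merges(rows: list[dict]) -> dict[str, dict[str, int]]:
    """Aggregate merge count, total duration, and bytes read per partition."""
    summary: dict[str, dict[str, int]] = defaultdict(
        lambda: {"merges": 0, "duration_ms": 0, "read_bytes": 0}
    )
    for row in rows:
        if row["event_type"] != "MergeParts":
            continue  # skip inserts, removals, mutations, and other events
        stats = summary[row["partition_id"]]
        stats["merges"] += 1
        stats["duration_ms"] += row["duration_ms"]
        stats["read_bytes"] += row["read_bytes"]
    return dict(summary)
```

Tracking these totals day over day gives a merge baseline to compare against after any change to partitioning or retention.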
Summary
A seemingly healthy ClickHouse pipeline can hide a debilitating bottleneck in its merge engine. When per-namespace retention forced us into a virtual partition scheme, merges became selective and slow, causing daily aggregation jobs to stall. The fix required understanding ClickHouse’s merge internals and applying three targeted patches. This guide showed you how to diagnose such issues by looking beyond obvious metrics, and gave you concrete steps to prevent them in your own environment. Always profile merges as part of your performance baseline, and test new partition or retention strategies before going live.