SigNoz: ClickHouse Disk Full Incident & Remediation
Date: 2026-06-19
Environment: ai71-aps1-stg-superhive (ap-south-1, staging)
Severity: High — SigNoz observability stack down
Symptoms
- 3 `signoz-k8s-infra-otel-agent` pods stuck in `CrashLoopBackOff` (1500+ restarts)
- `signoz-otel-collector` pod with 1300+ restarts, readiness/liveness probes failing on port `13133`
- Agents logging: `connection refused` on `signoz-otel-collector.observability.svc.cluster.local:4317`
Root Cause
ClickHouse disk 100% full (98 GiB / 98 GiB) on all 3 replicas, caused by ClickHouse internal system diagnostic tables accumulating with no TTL or only a 30-day TTL, alongside ~50 GiB of SigNoz log data:
| Table | Size | TTL before fix |
|---|---|---|
| `system.text_log` | ~15 GiB / node | None |
| `system.query_log` | ~8 GiB / node | 30 days |
| `system.processors_profile_log` | ~7 GiB / node | 30 days |
| `system.trace_log` | ~5 GiB / node | 30 days |
| `system.part_log` | ~4 GiB / node | 30 days |
| `system.metric_log` | ~1.4 GiB / node | None |
| `system.asynchronous_metric_log` | ~480 MiB / node | None |
| `system.query_views_log` | ~165 MiB / node | None |
| `signoz_logs.logs_v2` | ~50 GiB / node | Dynamic (`_retention_days` column) |
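As a quick sanity check on the table above, the per-node sizes can be summed. This is a minimal sketch that uses only the rounded figures from the table; `sum_gib` is a hypothetical helper name introduced here for illustration.

```shell
# Sum "<table> <value> <unit>" lines (unit is GiB or MiB) and print the total in GiB.
# sum_gib is a hypothetical helper, not part of ClickHouse or SigNoz.
sum_gib() {
  awk '{ v = $2; if ($3 == "MiB") v /= 1024; total += v }
       END { printf "%.1f\n", total }'
}

# Rounded per-node sizes copied from the table above.
system_tables='system.text_log 15 GiB
system.query_log 8 GiB
system.processors_profile_log 7 GiB
system.trace_log 5 GiB
system.part_log 4 GiB
system.metric_log 1.4 GiB
system.asynchronous_metric_log 480 MiB
system.query_views_log 165 MiB'

# System diagnostic tables only.
printf '%s\n' "$system_tables" | sum_gib                               # prints 41.0
# System tables plus the SigNoz logs table.
printf '%s\nsignoz_logs.logs_v2 50 GiB\n' "$system_tables" | sum_gib   # prints 91.0
```

On these figures the eight system tables account for roughly 41 GiB per node, and together with `signoz_logs.logs_v2` the listed tables cover about 91 GiB of the 98 GiB volume.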
Cascade failure:
1. ClickHouse disk full → code: 243, Cannot reserve N bytes, not enough space
2. signoz-otel-collector cannot write metrics/logs/traces → crashes
3. Port 4317 on the collector goes down
4. otel-agent pods cannot export telemetry → CrashLoopBackOff
Remediation Steps
Step 1: Bootstrap free space (disk too full for TRUNCATE)
ClickHouse TRUNCATE requires scratch space; at 100% the command itself fails with code: 243.
On each node, delete the oldest partition directories directly from the store/ path (NOT via the data/ symlink — ClickHouse re-flushes through symlinks on shutdown):
# Resolve the real store path from the symlink
STORE=$(kubectl exec -n observability chi-signozch-clickhouse-signozch-0-0-0 -- \
  bash -c "readlink /var/lib/clickhouse/data/system/text_log | sed 's|../../||'")
FULL=/var/lib/clickhouse/$STORE

# Delete oldest (May) partitions
kubectl exec -n observability chi-signozch-clickhouse-signozch-0-0-0 -- \
  bash -c "find '$FULL' -mindepth 1 -maxdepth 1 -name '202605_*' -exec rm -rf {} +"
This immediately reclaims ~7 GiB (no restart required when deleting from store/ directly).
Why `store/` and not `data/`? `data/system/text_log/` is a symlink to `store/<uuid>/`. If you delete via the symlink path while ClickHouse is running, ClickHouse re-creates the partition on shutdown by flushing its write buffer. Deleting from `store/` bypasses this.
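Because `rm -rf` on part directories cannot be undone, it may be worth listing what the `202605_*` pattern matches before deleting. The following is a minimal sketch of such a dry run; `list_partition_dirs` is a hypothetical helper, and it assumes `find` and `sort` are available in the ClickHouse container.

```shell
# Dry run: list part directories directly under a table's store path whose
# names start with a partition prefix (e.g. 202605). Nothing is deleted.
# list_partition_dirs is a hypothetical helper introduced for illustration.
list_partition_dirs() {
  local table_dir="$1" prefix="$2"
  find "$table_dir" -mindepth 1 -maxdepth 1 -type d -name "${prefix}_*" | sort
}

# Usage, from a shell inside the pod after defining the function there
# (FULL is the resolved store path from Step 1):
#   list_partition_dirs "$FULL" 202605
#   list_partition_dirs "$FULL" 202605 | xargs du -sh   # sizes per part directory
```

Restricting the match to directories (`-type d`) keeps stray files with a similar name out of the list.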
Step 2: Truncate all system tables via ClickHouse client
Run on all 3 nodes:
for pod in chi-signozch-clickhouse-signozch-0-0-0 \
           chi-signozch-clickhouse-signozch-0-1-0 \
           chi-signozch-clickhouse-signozch-0-2-0; do
  kubectl exec -n observability "$pod" -- \
    clickhouse-client --password "<password>" --multiquery --query "
      TRUNCATE TABLE IF EXISTS system.text_log;
      TRUNCATE TABLE IF EXISTS system.query_log;
      TRUNCATE TABLE IF EXISTS system.processors_profile_log;
      TRUNCATE TABLE IF EXISTS system.trace_log;
      TRUNCATE TABLE IF EXISTS system.part_log;
      TRUNCATE TABLE IF EXISTS system.metric_log;
      TRUNCATE TABLE IF EXISTS system.asynchronous_metric_log;
      TRUNCATE TABLE IF EXISTS system.query_views_log;
    "
done
Result: disk dropped from 100% to 62% (61 GiB / 98 GiB) on all nodes, freeing ~38 GiB per node.
Step 3: Restart stuck otel-agent pods
kubectl delete pod -n observability \
  signoz-k8s-infra-otel-agent-6f2bz \
  signoz-k8s-infra-otel-agent-jss2x \
  signoz-k8s-infra-otel-agent-rtwtr
DaemonSet recreates them immediately; they come up Running once the collector is healthy.
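Pod name suffixes change on every recreation, so the hard-coded names above only apply to this incident. A minimal sketch of selecting the stuck pods by status instead is shown here; `crashlooping_pods` is a hypothetical helper that parses the default `kubectl get pods` table, which is assumed to have the columns NAME, READY, STATUS, RESTARTS, AGE.

```shell
# Print names of pods in CrashLoopBackOff whose name starts with a given prefix.
# Reads the default `kubectl get pods` table on stdin (header on line 1).
# crashlooping_pods is a hypothetical helper introduced for illustration.
crashlooping_pods() {
  local prefix="$1"
  awk -v prefix="$prefix" \
    'NR > 1 && $3 == "CrashLoopBackOff" && index($1, prefix) == 1 { print $1 }'
}

# Usage against the cluster (review the list before piping it to delete):
#   kubectl get pods -n observability | crashlooping_pods signoz-k8s-infra-otel-agent
#   kubectl get pods -n observability | crashlooping_pods signoz-k8s-infra-otel-agent \
#     | xargs kubectl delete pod -n observability
```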
Permanent Fix: 7-Day TTL on System Tables
Applied on all 3 ClickHouse nodes (system tables are local MergeTree, not replicated):
for pod in chi-signozch-clickhouse-signozch-0-0-0 \
           chi-signozch-clickhouse-signozch-0-1-0 \
           chi-signozch-clickhouse-signozch-0-2-0; do
  kubectl exec -n observability "$pod" -- \
    clickhouse-client --password "<password>" --multiquery --query "
      ALTER TABLE system.text_log MODIFY TTL event_date + INTERVAL 7 DAY DELETE;
      ALTER TABLE system.metric_log MODIFY TTL event_date + INTERVAL 7 DAY DELETE;
      ALTER TABLE system.asynchronous_metric_log MODIFY TTL event_date + INTERVAL 7 DAY DELETE;
      ALTER TABLE system.query_views_log MODIFY TTL event_date + INTERVAL 7 DAY DELETE;
      ALTER TABLE system.query_log MODIFY TTL event_date + INTERVAL 7 DAY DELETE;
      ALTER TABLE system.processors_profile_log MODIFY TTL event_date + INTERVAL 7 DAY DELETE;
      ALTER TABLE system.trace_log MODIFY TTL event_date + INTERVAL 7 DAY DELETE;
      ALTER TABLE system.part_log MODIFY TTL event_date + INTERVAL 7 DAY DELETE;
    "
done
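The eight statements above differ only in the table name, so the retention period can be kept in one place by generating them. This is a sketch under that assumption; `ttl_statements` is a hypothetical helper, and the statement text is copied from the commands above.

```shell
# Emit one MODIFY TTL statement per system table for a given retention in days.
# ttl_statements is a hypothetical helper introduced for illustration.
ttl_statements() {
  local days="$1"; shift
  local table
  for table in "$@"; do
    printf 'ALTER TABLE system.%s MODIFY TTL event_date + INTERVAL %d DAY DELETE;\n' \
      "$table" "$days"
  done
}

# Usage: review the generated SQL first, then feed it to clickhouse-client on stdin.
#   ttl_statements 7 text_log metric_log asynchronous_metric_log query_views_log \
#     query_log processors_profile_log trace_log part_log
#   ttl_statements 7 text_log metric_log \
#     | kubectl exec -i -n observability "$pod" -- \
#         clickhouse-client --password "<password>" --multiquery
```

Changing the retention later then means changing a single argument rather than eight literals.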
Force immediate cleanup (don't wait for background merge):
for pod in ...; do
  kubectl exec -n observability "$pod" -- \
    clickhouse-client --password "<password>" --multiquery --query "
      ALTER TABLE system.text_log MATERIALIZE TTL;
      ALTER TABLE system.metric_log MATERIALIZE TTL;
      ALTER TABLE system.asynchronous_metric_log MATERIALIZE TTL;
      ALTER TABLE system.query_views_log MATERIALIZE TTL;
      ALTER TABLE system.query_log MATERIALIZE TTL;
      ALTER TABLE system.processors_profile_log MATERIALIZE TTL;
      ALTER TABLE system.trace_log MATERIALIZE TTL;
      ALTER TABLE system.part_log MATERIALIZE TTL;
    "
done
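To confirm the TTL took effect on each node, one option is to inspect the table definitions. The sketch below assumes that the `engine_full` column of `system.tables` includes the `TTL` clause once one is set; `tables_without_ttl` is a hypothetical helper that flags any table where it is missing.

```shell
# Read tab-separated "<table>\t<engine_full>" lines on stdin and print the
# tables whose definition contains no TTL clause.
# tables_without_ttl is a hypothetical helper introduced for illustration.
tables_without_ttl() {
  awk -F '\t' '$2 !~ / TTL / { print $1 }'
}

# Usage (empty output means every listed table has a TTL):
#   kubectl exec -n observability chi-signozch-clickhouse-signozch-0-0-0 -- \
#     clickhouse-client --password "<password>" --query "
#       SELECT name, engine_full FROM system.tables
#       WHERE database = 'system' AND name LIKE '%_log'
#       FORMAT TSV" | tables_without_ttl
```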
Disk Usage Verification
kubectl exec -n observability chi-signozch-clickhouse-signozch-0-0-0 -- \
  clickhouse-client --password "<password>" --query "
    SELECT database, table, formatReadableSize(sum(bytes_on_disk)) AS size
    FROM system.parts
    GROUP BY database, table
    ORDER BY sum(bytes_on_disk) DESC
    LIMIT 20
  "

kubectl exec -n observability chi-signozch-clickhouse-signozch-0-0-0 -- \
  df -h /var/lib/clickhouse
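For a scripted check across nodes, the usage percentage can be extracted from `df` rather than read by eye. This is a minimal sketch; `disk_used_pct` is a hypothetical helper, and it assumes POSIX-format output (`df -P`), where the fifth column of the second line is the capacity percentage.

```shell
# Print the used-capacity percentage (digits only) from `df -P` output on stdin.
# disk_used_pct is a hypothetical helper introduced for illustration.
disk_used_pct() {
  awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# Usage: warn when a node is above a chosen threshold (70 here, as an example).
#   pct=$(kubectl exec -n observability chi-signozch-clickhouse-signozch-0-0-0 -- \
#           df -P /var/lib/clickhouse | disk_used_pct)
#   [ "$pct" -le 70 ] || echo "ClickHouse disk at ${pct}%"
```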
Future Recommendations
| Action | Priority |
|---|---|
| Expand ClickHouse EBS volumes from 98 GiB (`signoz_logs.logs_v2` alone is ~50 GiB and growing at ~1.6 GiB/day) | High |
| Add a CloudWatch / Prometheus alert on ClickHouse disk usage > 70% | High |
| Review `signoz_logs.logs_v2` default `_retention_days` value via SigNoz retention settings | Medium |
| Consider ClickHouse storage tiering (hot/cold) for older log data | Low |
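To put the first two recommendations in perspective, the remaining headroom can be estimated from figures already in this report: 98 GiB capacity, 61 GiB used after remediation, and ~1.6 GiB/day growth of `signoz_logs.logs_v2`. The sketch below assumes constant growth and that nothing else grows or is reclaimed, so it is only a rough estimate; `days_until` is a hypothetical helper.

```shell
# Days until disk usage reaches a target percentage at a constant growth rate.
# Arguments: capacity_gib used_gib growth_gib_per_day target_percent
# days_until is a hypothetical helper introduced for illustration.
days_until() {
  awk -v cap="$1" -v used="$2" -v rate="$3" -v pct="$4" \
    'BEGIN { printf "%.0f\n", (cap * pct / 100 - used) / rate }'
}

days_until 98 61 1.6 100   # prints 23: days until the volume is full again
days_until 98 61 1.6 70    # prints 5: days until the proposed 70% alert fires
```

On these assumptions the volume would fill again in a little over three weeks, and a 70% alert would fire within about five days of the cleanup, which suggests sizing the volume and the alert threshold together.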
Key Files
- ClickHouse namespace: `observability`
- ClickHouse StatefulSets: `chi-signozch-clickhouse-signozch-0-{0,1,2}`
- ClickHouse secret: `kubectl get secret -n observability signoz-clickhouse-secret`
- SigNoz pod: `signoz-0`
- Kubeconfig: `environments/aws/superhive/.kubeconfig-ai71-aps1-stg-superhive`