SigNoz — ClickHouse Disk Full Incident & Remediation

Date: 2026-06-19
Environment: ai71-aps1-stg-superhive (ap-south-1, staging)
Severity: High (SigNoz observability stack down)


Symptoms

  • 3 signoz-k8s-infra-otel-agent pods stuck in CrashLoopBackOff (1500+ restarts)
  • signoz-otel-collector pod with 1300+ restarts, readiness/liveness probes failing on port 13133
  • Agents logging: connection refused on signoz-otel-collector.observability.svc.cluster.local:4317

Root Cause

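The ClickHouse data volume was 100% full (98 GiB / 98 GiB) on all 3 replicas. signoz_logs.logs_v2 was the largest single table, but roughly 40 GiB per node was held by ClickHouse's internal system log tables, which had either no TTL or only a 30-day TTL: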

Table                             Size (per node)   TTL before fix
system.text_log                   ~15 GiB           None
system.query_log                  ~8 GiB            30 days
system.processors_profile_log     ~7 GiB            30 days
system.trace_log                  ~5 GiB            30 days
system.part_log                   ~4 GiB            30 days
system.metric_log                 ~1.4 GiB          None
system.asynchronous_metric_log    ~480 MiB          None
system.query_views_log            ~165 MiB          None
signoz_logs.logs_v2               ~50 GiB           Dynamic (_retention_days column)
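As a quick sanity check on the figures in the table, the system log rows can be summed to estimate how much space per node was reclaimable without touching SigNoz data. This is a minimal sketch using only the approximate sizes listed above; the function name is illustrative.

```shell
#!/usr/bin/env bash
# Sum the approximate per-node sizes of the system log tables from the table
# above. Values are in GiB; the two MiB figures are converted by dividing by 1024.
system_log_total_gib() {
  awk 'BEGIN {
    total = 15 + 8 + 7 + 5 + 4 + 1.4 + 480/1024 + 165/1024
    printf "%.1f\n", total
  }'
}

system_log_total_gib
```

The result is about 41 GiB per node, a little over 40% of the 98 GiB volume.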

Cascade failure:

  1. ClickHouse disk full → code: 243, Cannot reserve N bytes, not enough space
  2. signoz-otel-collector cannot write metrics/logs/traces → crashes
  3. Port 4317 on the collector goes down
  4. otel-agent pods cannot export telemetry → CrashLoopBackOff


Remediation Steps

Step 1 — Bootstrap free space (disk too full for TRUNCATE)

ClickHouse TRUNCATE requires scratch space; at 100% the command itself fails with code: 243.

On each node, delete the oldest partition directories directly from the store/ path (NOT via the data/ symlink — ClickHouse re-flushes through symlinks on shutdown):

# Resolve the real store/ path from the data/ symlink
# (assumes the link target is relative, i.e. ../../store/<prefix>/<uuid>/)
STORE=$(kubectl exec -n observability chi-signozch-clickhouse-signozch-0-0-0 -- \
  bash -c "readlink /var/lib/clickhouse/data/system/text_log | sed 's|^\.\./\.\./||'")
FULL=/var/lib/clickhouse/$STORE

# Delete the oldest (May 2026) part directories, whose names start with partition ID 202605
kubectl exec -n observability chi-signozch-clickhouse-signozch-0-0-0 -- \
  bash -c "find '$FULL' -mindepth 1 -maxdepth 1 -name '202605_*' -exec rm -rf {} +"

This immediately reclaims ~7 GiB (no restart required when deleting from store/ directly).

Why store/ and not data/? data/system/text_log/ is a symlink to store/<uuid>/. If you delete via the symlink path while ClickHouse is running, ClickHouse re-creates the partition on shutdown by flushing its write buffer. Deleting from store/ bypasses this.
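Because the rm -rf in Step 1 is irreversible, it can help to list the matching part directories first. The following is a minimal dry-run sketch, not part of the original procedure; the function name and its two arguments are illustrative.

```shell
#!/usr/bin/env bash
# Dry run: list part directories directly under a table's store/ path whose
# names start with a partition prefix (e.g. 202605). Nothing is deleted.
#   $1 = table directory (the resolved store/ path)
#   $2 = partition prefix
list_parts_by_prefix() {
  local table_dir="$1" prefix="$2"
  find "$table_dir" -mindepth 1 -maxdepth 1 -type d -name "${prefix}_*" | sort
}

# Hypothetical usage against a pod (not executed here); declare -f ships the
# function definition into the remote shell:
#   kubectl exec -n observability chi-signozch-clickhouse-signozch-0-0-0 -- \
#     bash -c "$(declare -f list_parts_by_prefix); list_parts_by_prefix '$FULL' 202605"
```

Reviewing this list before running the delete confirms that only the intended month's parts match the pattern.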

Step 2 — Truncate all system tables via ClickHouse client

Run on all 3 nodes:

for pod in chi-signozch-clickhouse-signozch-0-0-0 \
           chi-signozch-clickhouse-signozch-0-1-0 \
           chi-signozch-clickhouse-signozch-0-2-0; do
  kubectl exec -n observability $pod -- \
    clickhouse-client --password "<password>" --multiquery --query "
      TRUNCATE TABLE IF EXISTS system.text_log;
      TRUNCATE TABLE IF EXISTS system.query_log;
      TRUNCATE TABLE IF EXISTS system.processors_profile_log;
      TRUNCATE TABLE IF EXISTS system.trace_log;
      TRUNCATE TABLE IF EXISTS system.part_log;
      TRUNCATE TABLE IF EXISTS system.metric_log;
      TRUNCATE TABLE IF EXISTS system.asynchronous_metric_log;
      TRUNCATE TABLE IF EXISTS system.query_views_log;
    "
done

Result: disk usage dropped from 100% to 62% (61 GiB / 98 GiB) on all nodes, freeing ~37 GiB per node.

Step 3 — Restart stuck otel-agent pods

kubectl delete pod -n observability \
  signoz-k8s-infra-otel-agent-6f2bz \
  signoz-k8s-infra-otel-agent-jss2x \
  signoz-k8s-infra-otel-agent-rtwtr

DaemonSet recreates them immediately; they come up Running once the collector is healthy.
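The pod name suffixes above are specific to this incident. A hedged alternative is to derive the list from the current pod status rather than hard-coding names. The sketch below assumes the default column layout of kubectl get pods --no-headers (NAME READY STATUS RESTARTS AGE); the function name is illustrative.

```shell
#!/usr/bin/env bash
# Print the names of pods whose STATUS column is CrashLoopBackOff.
# Expects `kubectl get pods --no-headers` output on stdin.
crashlooping_pods() {
  awk '$3 == "CrashLoopBackOff" { print $1 }'
}

# Hypothetical usage (not executed here; xargs -r skips the delete when the
# list is empty and is a GNU extension):
#   kubectl get pods -n observability --no-headers \
#     | crashlooping_pods \
#     | grep '^signoz-k8s-infra-otel-agent-' \
#     | xargs -r kubectl delete pod -n observability
```

Filtering on the agent name prefix keeps the delete scoped to the DaemonSet pods and leaves the collector and ClickHouse pods alone.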


Permanent Fix — 7-Day TTL on System Tables

Applied on all 3 ClickHouse nodes (system tables are local MergeTree, not replicated):

for pod in chi-signozch-clickhouse-signozch-0-0-0 \
           chi-signozch-clickhouse-signozch-0-1-0 \
           chi-signozch-clickhouse-signozch-0-2-0; do
  kubectl exec -n observability $pod -- \
    clickhouse-client --password "<password>" --multiquery --query "
      ALTER TABLE system.text_log                MODIFY TTL event_date + INTERVAL 7 DAY DELETE;
      ALTER TABLE system.metric_log              MODIFY TTL event_date + INTERVAL 7 DAY DELETE;
      ALTER TABLE system.asynchronous_metric_log MODIFY TTL event_date + INTERVAL 7 DAY DELETE;
      ALTER TABLE system.query_views_log         MODIFY TTL event_date + INTERVAL 7 DAY DELETE;
      ALTER TABLE system.query_log               MODIFY TTL event_date + INTERVAL 7 DAY DELETE;
      ALTER TABLE system.processors_profile_log  MODIFY TTL event_date + INTERVAL 7 DAY DELETE;
      ALTER TABLE system.trace_log               MODIFY TTL event_date + INTERVAL 7 DAY DELETE;
      ALTER TABLE system.part_log                MODIFY TTL event_date + INTERVAL 7 DAY DELETE;
    "
done
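The eight ALTER statements differ only in the table name, so they can be generated from a list, which makes it harder to miss a table when the retention changes. This is a sketch only; gen_ttl_sql is a hypothetical helper, and it assumes every listed table has an event_date column, as the statements above do.

```shell
#!/usr/bin/env bash
# Emit one ALTER ... MODIFY TTL statement per system log table.
#   $1 = retention in days; remaining args = table names in the system database
gen_ttl_sql() {
  local days="$1"; shift
  local t
  for t in "$@"; do
    printf 'ALTER TABLE system.%s MODIFY TTL event_date + INTERVAL %s DAY DELETE;\n' "$t" "$days"
  done
}

# Hypothetical usage (not executed here): pipe the statements to the client.
#   gen_ttl_sql 7 text_log metric_log asynchronous_metric_log query_views_log \
#               query_log processors_profile_log trace_log part_log \
#     | kubectl exec -i -n observability "$pod" -- \
#         clickhouse-client --password "<password>" --multiquery
```

Afterwards, running SHOW CREATE TABLE system.text_log on each node should show the new TTL clause if the change took effect.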

Force immediate cleanup (don't wait for background merge):

for pod in ...; do
  kubectl exec -n observability $pod -- \
    clickhouse-client --password "<password>" --multiquery --query "
      ALTER TABLE system.text_log                MATERIALIZE TTL;
      ALTER TABLE system.metric_log              MATERIALIZE TTL;
      ALTER TABLE system.asynchronous_metric_log MATERIALIZE TTL;
      ALTER TABLE system.query_views_log         MATERIALIZE TTL;
      ALTER TABLE system.query_log               MATERIALIZE TTL;
      ALTER TABLE system.processors_profile_log  MATERIALIZE TTL;
      ALTER TABLE system.trace_log               MATERIALIZE TTL;
      ALTER TABLE system.part_log                MATERIALIZE TTL;
    "
done

Disk Usage Verification

kubectl exec -n observability chi-signozch-clickhouse-signozch-0-0-0 -- \
  clickhouse-client --password "<password>" --query "
    SELECT database, table, formatReadableSize(sum(bytes_on_disk)) AS size
    FROM system.parts
    GROUP BY database, table
    ORDER BY sum(bytes_on_disk) DESC
    LIMIT 20
  "

kubectl exec -n observability chi-signozch-clickhouse-signozch-0-0-0 -- \
  df -h /var/lib/clickhouse
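For a scripted check, the df output can be compared against a threshold instead of being read by eye. The sketch below assumes df -P (POSIX output format, one line per filesystem) rather than df -h; the function name and the threshold of 70 are illustrative.

```shell
#!/usr/bin/env bash
# Read `df -P <path>` output on stdin, print the usage percentage, and
# return 0 if it is above the threshold, 1 otherwise (or if no data line).
#   $1 = threshold percentage (integer)
disk_over_threshold() {
  local threshold="$1"
  awk -v limit="$threshold" '
    NR == 2 { pct = $5; sub(/%/, "", pct); print pct; over = (pct + 0 > limit + 0) }
    END     { exit (over ? 0 : 1) }
  '
}

# Hypothetical usage (not executed here):
#   kubectl exec -n observability chi-signozch-clickhouse-signozch-0-0-0 -- \
#     df -P /var/lib/clickhouse | disk_over_threshold 70 && echo "disk above 70%"
```

Running this against all three pods gives a quick pass/fail view until a proper alert is in place.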

Future Recommendations

Priority   Action
High       Expand ClickHouse EBS volumes beyond 98 GiB; signoz_logs.logs_v2 alone is ~50 GiB and growing at ~1.6 GiB/day
High       Add a CloudWatch / Prometheus alert on ClickHouse disk usage > 70%
Medium     Review the default _retention_days value for signoz_logs.logs_v2 via SigNoz retention settings
Low        Consider ClickHouse storage tiering (hot/cold) for older log data
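To gauge the urgency of the first two recommendations, the remaining headroom can be estimated from the figures in this report: 61 GiB used after cleanup, a 98 GiB volume, and ~1.6 GiB/day of growth. This is a rough sketch that assumes constant linear growth, which overstates the risk once log retention starts expiring old data; the function name is illustrative.

```shell
#!/usr/bin/env bash
# Days until disk usage reaches a target, assuming constant linear growth.
#   $1 = used GiB, $2 = target GiB, $3 = growth in GiB/day
days_until() {
  awk -v used="$1" -v target="$2" -v rate="$3" \
    'BEGIN { printf "%.1f\n", (target - used) / rate }'
}

days_until 61 68.6 1.6   # until the proposed 70% alert threshold (70% of 98 GiB = 68.6 GiB)
days_until 61 98 1.6     # until the volume is completely full
```

Under these assumptions the 70% alert would fire in under 5 days and the volume would fill in roughly 23 days, which supports treating the volume expansion as high priority.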

Key Files

  • ClickHouse namespace: observability
  • ClickHouse StatefulSets: chi-signozch-clickhouse-signozch-0-{0,1,2}
  • ClickHouse secret: kubectl get secret -n observability signoz-clickhouse-secret
  • SigNoz pod: signoz-0
  • Kubeconfig: environments/aws/superhive/.kubeconfig-ai71-aps1-stg-superhive