SigNoz: ClickHouse Disk Full Incident & Remediation

Date: 2026-06-19
Environment: ai71-aps1-stg-superhive (ap-south-1, staging)
Severity: High (SigNoz observability stack down)

Symptoms

3 signoz-k8s-infra-otel-agent pods stuck in CrashLoopBackOff (1500+ restarts)
signoz-otel-collector pod with 1300+ restarts, readiness/liveness probes failing on port 13133
Agents logging: connection refused on signoz-otel-collector.observability.svc.cluster.local:4317

Root Cause

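The ClickHouse disk was 100% full (98 GiB / 98 GiB) on all 3 replicas. The avoidable part of that usage came from ClickHouse's internal system diagnostic tables, which had either no TTL or a 30-day TTL and had accumulated roughly 41 GiB per node on top of the SigNoz log data: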

Table	Size	TTL before fix
`system.text_log`	~15 GiB / node	None
`system.query_log`	~8 GiB / node	30 days
`system.processors_profile_log`	~7 GiB / node	30 days
`system.trace_log`	~5 GiB / node	30 days
`system.part_log`	~4 GiB / node	30 days
`system.metric_log`	~1.4 GiB / node	None
`system.asynchronous_metric_log`	~480 MiB / node	None
`system.query_views_log`	~165 MiB / node	None
`signoz_logs.logs_v2`	~50 GiB / node	Dynamic (`_retention_days` column)
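
As a sanity check on the table, the per-node sizes of the system log tables can be summed. This is a minimal sketch that hard-codes the approximate figures from the table; `sum_gib` is an illustrative helper, not part of any tool used in the incident.

```shell
# Sum "<size> <unit>" pairs from stdin and print the total in GiB (one decimal).
# MiB values are converted to GiB; anything else is treated as GiB.
sum_gib() {
  awk '{ v = $1; if ($2 == "MiB") v = v / 1024; total += v }
       END { printf "%.1f\n", total }'
}

# Approximate per-node sizes of the eight system.* tables listed above.
system_tables_gib=$(sum_gib <<'EOF'
15 GiB
8 GiB
7 GiB
5 GiB
4 GiB
1.4 GiB
480 MiB
165 MiB
EOF
)
echo "system.* log tables: ~${system_tables_gib} GiB per node"
```

Together with the ~50 GiB in `signoz_logs.logs_v2`, this accounts for about 91 GiB of the 98 GiB volume on each node.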

Cascade failure:

1. ClickHouse disk full → code: 243, Cannot reserve N bytes, not enough space
2. signoz-otel-collector cannot write metrics/logs/traces → crashes
3. Port 4317 on the collector goes down
4. otel-agent pods cannot export telemetry → CrashLoopBackOff
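
One way to tell this cascade apart from an unrelated collector fault is to look for the error signature in the collector's logs. This is a minimal sketch: `has_disk_full_error` is an illustrative helper, and the Deployment name in the example is an assumption inferred from the pod name, not confirmed by this report.

```shell
# Succeeds (exit 0) if the log text on stdin contains the "code: 243"
# signature quoted in step 1 of the cascade; fails otherwise.
has_disk_full_error() {
  grep -qi 'code: 243'
}

# Example (assumed Deployment name; adjust to the actual workload):
#   kubectl logs -n observability deploy/signoz-otel-collector --tail=500 \
#     | has_disk_full_error && echo "collector is failing on ClickHouse disk space"
```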

Remediation Steps

Step 1: Bootstrap free space (disk too full for TRUNCATE)

ClickHouse TRUNCATE requires scratch space; at 100% the command itself fails with code: 243.

On each node, delete the oldest partition directories directly from the store/ path (NOT via the data/ symlink — ClickHouse re-flushes through symlinks on shutdown):

# Resolve the real store path behind the data/ symlink.
# readlink -f prints the absolute target, so no path rewriting is needed.
FULL=$(kubectl exec -n observability chi-signozch-clickhouse-signozch-0-0-0 -- \
  readlink -f /var/lib/clickhouse/data/system/text_log)

# Delete oldest (May) partitions
kubectl exec -n observability chi-signozch-clickhouse-signozch-0-0-0 -- \
  bash -c "find '$FULL' -mindepth 1 -maxdepth 1 -name '202605_*' -exec rm -rf {} +"

This immediately reclaims ~7 GiB (no restart required when deleting from store/ directly).

Why store/ and not data/? data/system/text_log/ is a symlink to store/<uuid>/. If you delete via the symlink path while ClickHouse is running, ClickHouse re-creates the partition on shutdown by flushing its write buffer. Deleting from store/ bypasses this.
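
Because `rm -rf` on part directories is irreversible, it can help to list exactly what the `find` pattern matches before deleting anything. This is a minimal dry-run sketch; `list_parts_for_partition` is a hypothetical helper, and the `202605` prefix mirrors the May partitions removed in this step.

```shell
# List (without deleting) the part directories directly under DIR whose
# names start with the given partition prefix, e.g. 202605 for May 2026.
list_parts_for_partition() {
  dir=$1
  prefix=$2
  find "$dir" -mindepth 1 -maxdepth 1 -type d -name "${prefix}_*" | sort
}

# Equivalent check against the pod, reusing $FULL from this step:
#   kubectl exec -n observability chi-signozch-clickhouse-signozch-0-0-0 -- \
#     find "$FULL" -mindepth 1 -maxdepth 1 -type d -name '202605_*'
```

Only once the listing shows the expected partitions, and nothing else, would the `-exec rm -rf {} +` variant be run.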

Step 2: Truncate all system tables via ClickHouse client

Run on all 3 nodes:

for pod in chi-signozch-clickhouse-signozch-0-0-0 \
           chi-signozch-clickhouse-signozch-0-1-0 \
           chi-signozch-clickhouse-signozch-0-2-0; do
  kubectl exec -n observability $pod -- \
    clickhouse-client --password "<password>" --multiquery --query "
      TRUNCATE TABLE IF EXISTS system.text_log;
      TRUNCATE TABLE IF EXISTS system.query_log;
      TRUNCATE TABLE IF EXISTS system.processors_profile_log;
      TRUNCATE TABLE IF EXISTS system.trace_log;
      TRUNCATE TABLE IF EXISTS system.part_log;
      TRUNCATE TABLE IF EXISTS system.metric_log;
      TRUNCATE TABLE IF EXISTS system.asynchronous_metric_log;
      TRUNCATE TABLE IF EXISTS system.query_views_log;
    "
done

Result: disk dropped from 100% to 62% (61 GiB / 98 GiB) on all nodes, freeing ~38 GiB per node.

Step 3: Restart stuck otel-agent pods

kubectl delete pod -n observability \
  signoz-k8s-infra-otel-agent-6f2bz \
  signoz-k8s-infra-otel-agent-jss2x \
  signoz-k8s-infra-otel-agent-rtwtr

DaemonSet recreates them immediately; they come up Running once the collector is healthy.

Permanent Fix: 7-Day TTL on System Tables

Applied on all 3 ClickHouse nodes (system tables are local MergeTree, not replicated):

for pod in chi-signozch-clickhouse-signozch-0-0-0 \
           chi-signozch-clickhouse-signozch-0-1-0 \
           chi-signozch-clickhouse-signozch-0-2-0; do
  kubectl exec -n observability $pod -- \
    clickhouse-client --password "<password>" --multiquery --query "
      ALTER TABLE system.text_log                MODIFY TTL event_date + INTERVAL 7 DAY DELETE;
      ALTER TABLE system.metric_log              MODIFY TTL event_date + INTERVAL 7 DAY DELETE;
      ALTER TABLE system.asynchronous_metric_log MODIFY TTL event_date + INTERVAL 7 DAY DELETE;
      ALTER TABLE system.query_views_log         MODIFY TTL event_date + INTERVAL 7 DAY DELETE;
      ALTER TABLE system.query_log               MODIFY TTL event_date + INTERVAL 7 DAY DELETE;
      ALTER TABLE system.processors_profile_log  MODIFY TTL event_date + INTERVAL 7 DAY DELETE;
      ALTER TABLE system.trace_log               MODIFY TTL event_date + INTERVAL 7 DAY DELETE;
      ALTER TABLE system.part_log                MODIFY TTL event_date + INTERVAL 7 DAY DELETE;
    "
done
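
The eight near-identical statements above could also be generated from a table list, which makes it harder to miss a table or mistype the interval. This is a sketch only; `gen_ttl_sql` is a hypothetical helper, and the `event_date` column is taken from the statements above.

```shell
# Print one MODIFY TTL statement per system log table.
# Usage: gen_ttl_sql <days> <table>...
gen_ttl_sql() {
  days=$1
  shift
  for table in "$@"; do
    printf 'ALTER TABLE system.%s MODIFY TTL event_date + INTERVAL %s DAY DELETE;\n' \
      "$table" "$days"
  done
}

# Example: pipe the generated SQL into clickhouse-client on one pod.
#   gen_ttl_sql 7 text_log metric_log asynchronous_metric_log query_views_log \
#                 query_log processors_profile_log trace_log part_log \
#     | kubectl exec -i -n observability "$pod" -- \
#         clickhouse-client --password "<password>" --multiquery
```

Running `SHOW CREATE TABLE system.text_log` afterwards is one way to check that the TTL clause is present on each node.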

Force immediate cleanup (don't wait for background merge):

for pod in ...; do
  kubectl exec -n observability $pod -- \
    clickhouse-client --password "<password>" --multiquery --query "
      ALTER TABLE system.text_log                MATERIALIZE TTL;
      ALTER TABLE system.metric_log              MATERIALIZE TTL;
      ALTER TABLE system.asynchronous_metric_log MATERIALIZE TTL;
      ALTER TABLE system.query_views_log         MATERIALIZE TTL;
      ALTER TABLE system.query_log               MATERIALIZE TTL;
      ALTER TABLE system.processors_profile_log  MATERIALIZE TTL;
      ALTER TABLE system.trace_log               MATERIALIZE TTL;
      ALTER TABLE system.part_log                MATERIALIZE TTL;
    "
done
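
`MATERIALIZE TTL` is executed asynchronously as a mutation, so the command returning does not mean the rewrite has finished. Progress can be followed through `system.mutations`. This is a minimal sketch; `pending_system_mutations` is an illustrative helper that reuses the namespace and password placeholder from the commands above.

```shell
# Print the number of unfinished mutations on system.* tables for one pod.
# A result of 0 means every MATERIALIZE TTL issued above has completed.
pending_system_mutations() {
  pod=$1
  kubectl exec -n observability "$pod" -- \
    clickhouse-client --password "<password>" --query \
    "SELECT count() FROM system.mutations WHERE database = 'system' AND NOT is_done"
}

# Example:
#   pending_system_mutations chi-signozch-clickhouse-signozch-0-0-0
```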

Disk Usage Verification

kubectl exec -n observability chi-signozch-clickhouse-signozch-0-0-0 -- \
  clickhouse-client --password "<password>" --query "
    SELECT database, table, formatReadableSize(sum(bytes_on_disk)) AS size
    FROM system.parts
    GROUP BY database, table
    ORDER BY sum(bytes_on_disk) DESC
    LIMIT 20
  "

kubectl exec -n observability chi-signozch-clickhouse-signozch-0-0-0 -- \
  df -h /var/lib/clickhouse
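
For a scripted check across nodes, the usage percentage can be pulled out of `df` and compared against a threshold. This is a sketch under assumptions: both helper names are illustrative, and `df -P` is used instead of `df -h` so that each filesystem stays on a single, parseable line.

```shell
# Read `df -P <path>` output on stdin and print the Capacity column
# of the first filesystem line as a bare integer (e.g. 62).
df_use_percent() {
  awk 'NR == 2 { gsub(/%/, "", $5); print $5 }'
}

# Exit 0 if USED (percent) is at or above THRESHOLD (percent), 1 otherwise.
disk_over_threshold() {
  used=$1
  threshold=$2
  [ "$used" -ge "$threshold" ]
}

# Example for one pod, using a 70% threshold:
#   pct=$(kubectl exec -n observability chi-signozch-clickhouse-signozch-0-0-0 -- \
#           df -P /var/lib/clickhouse | df_use_percent)
#   disk_over_threshold "$pct" 70 && echo "disk at ${pct}%, above threshold"
```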

Future Recommendations

Action	Priority
Expand ClickHouse EBS volumes from 98 GiB — `signoz_logs.logs_v2` alone is ~50 GiB and growing at ~1.6 GiB/day	High
Add a CloudWatch / Prometheus alert on ClickHouse disk usage > 70%	High
Review `signoz_logs.logs_v2` default `_retention_days` value via SigNoz retention settings	Medium
Consider ClickHouse storage tiering (hot/cold) for older log data	Low
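
To put the priorities in context, the remaining runway can be estimated from the figures in this report: 98 GiB capacity, 61 GiB used after cleanup, and ~1.6 GiB/day of log growth. The sketch below assumes growth stays constant and comes only from `signoz_logs.logs_v2`; `days_until_percent` is a hypothetical helper.

```shell
# Whole days until a volume reaches a given fill level at a constant growth rate.
# Arguments: capacity_gib used_gib growth_gib_per_day target_percent
days_until_percent() {
  awk -v cap="$1" -v used="$2" -v rate="$3" -v pct="$4" 'BEGIN {
    headroom = cap * pct / 100 - used   # GiB left before the target level
    if (headroom < 0) headroom = 0
    print int(headroom / rate)          # round down to whole days
  }'
}

days_until_percent 98 61 1.6 70    # days until the 70% alert level
days_until_percent 98 61 1.6 100   # days until the volume is full
```

Under these assumptions the 70% alert level is reached in about 4 days and the volume fills again in about 23 days, which is why volume expansion and retention review should not wait.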

Key Files

ClickHouse namespace: observability
ClickHouse StatefulSets: chi-signozch-clickhouse-signozch-0-{0,1,2}
ClickHouse secret: kubectl get secret -n observability signoz-clickhouse-secret
SigNoz pod: signoz-0
Kubeconfig: environments/aws/superhive/.kubeconfig-ai71-aps1-stg-superhive