This is article 49 in the Big Data series. This article systematically explains five classic Redis problems in high-concurrency scenarios and their solutions.


Cache Penetration

Problem Description

Requested data exists in neither the cache nor the database, so every request bypasses the cache and hits the database directly, creating useless query pressure. This is common in malicious attacks that flood the service with random or invalid IDs.

Solutions

Solution 1: Cache Null Values

When database query result is empty, write null or empty marker value to cache with short TTL (30-60 seconds) to prevent repeated database hits.
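The flow above can be sketched as follows. This is a minimal illustration, not a production implementation: a `HashMap` stands in for Redis (so no real TTL is applied), and `NULL_MARKER` and `queryDb` are hypothetical names.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

// Sketch: cache null results so repeated misses for the same key skip the DB.
class NullCacheSketch {
    static final String NULL_MARKER = "__NULL__";     // sentinel for "absent in DB"
    static final Map<String, String> cache = new HashMap<>();
    static int dbQueries = 0;                         // counts real DB hits

    // Stand-in for the database: only "user:1" exists.
    static String queryDb(String key) {
        dbQueries++;
        return key.equals("user:1") ? "Alice" : null;
    }

    static Optional<String> get(String key) {
        String cached = cache.get(key);
        if (cached != null) {
            // Cached null marker: key is known to be absent, do not hit the DB.
            return NULL_MARKER.equals(cached) ? Optional.empty() : Optional.of(cached);
        }
        String value = queryDb(key);
        // Cache the miss too; in Redis this entry would get a short TTL (30-60 s).
        cache.put(key, value == null ? NULL_MARKER : value);
        return Optional.ofNullable(value);
    }
}
```

Note that the null entry must carry a short TTL in real Redis, otherwise a key that is later inserted into the database would keep serving the stale null marker.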

Solution 2: Bloom Filter

Deploy a Bloom filter in front of the cache layer and load all valid keys into it in advance. When a request arrives, query the Bloom filter first: keys it reports as non-existent are rejected immediately, without touching the cache or the database.

Bloom filter principle: uses a bit array of length m and k independent hash functions. When writing an element, the k hashed positions are set to 1; when querying, if any position is 0 the element definitely doesn't exist; if all are 1 it probably exists (there is a false positive rate).

False positive rate formula: p ≈ (1 - e^(-kn/m))^k, where n is the number of inserted elements. The rate can be reduced by increasing m; for fixed m and n, the optimal number of hash functions is k = (m/n)·ln 2.
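The principle above can be shown in a few lines of Java. This is a teaching sketch, not a replacement for a library implementation (e.g. Guava's `BloomFilter`): it derives the k positions from two base hashes via double hashing, which is a common trick but an assumption here, not something the article prescribes.

```java
import java.util.BitSet;

// Minimal Bloom filter sketch: a bit array of length m, with k bit positions
// per element derived from two base hashes (double hashing).
class BloomSketch {
    private final BitSet bits;
    private final int m, k;

    BloomSketch(int m, int k) {
        this.bits = new BitSet(m);
        this.m = m;
        this.k = k;
    }

    private int position(String key, int i) {
        int h1 = key.hashCode();
        int h2 = (h1 >>> 16) | 1;             // force h2 odd so the k positions differ
        return Math.floorMod(h1 + i * h2, m);
    }

    void add(String key) {
        for (int i = 0; i < k; i++) bits.set(position(key, i));   // set k bits to 1
    }

    // false means "definitely absent"; true means "probably present".
    boolean mightContain(String key) {
        for (int i = 0; i < k; i++)
            if (!bits.get(position(key, i))) return false;        // any 0 bit => absent
        return true;
    }
}
```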

Solution 3: Parameter Validation + Rate Limiting

Enforce strict validation of request parameter legality in business layer, combine with interface rate limiting and abnormal request monitoring to intercept illegal requests in advance.

Cache Avalanche

Problem Description

A large number of keys expire simultaneously, or the Redis cluster goes down, so massive request volume penetrates to the database at once; in the extreme case this triggers a cascading system failure.

Solutions

Solution 1: Stagger Expiration Times

Avoid setting same TTL when batch writing, add random offset on base time:

// baseTime is the base TTL in seconds; the random 0-300 s offset staggers expirations
int expireTime = baseTime + ThreadLocalRandom.current().nextInt(0, 300);
redisTemplate.expire(key, expireTime, TimeUnit.SECONDS);

Solution 2: Multi-Level Cache

Build three levels of protection: local cache (Caffeine/Guava) → distributed cache (Redis) → database. When Redis is unavailable, the local cache serves as a fallback.
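The lookup chain can be sketched like this. Plain `HashMap`s stand in for Caffeine and Redis, and the `redisUp` flag simulates an outage; the point is the fallback order, not the storage.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Sketch of a local cache -> Redis -> database lookup chain.
class MultiLevelCacheSketch {
    final Map<String, String> local = new HashMap<>();   // L1: in-process (Caffeine)
    final Map<String, String> redis = new HashMap<>();   // L2: distributed (Redis)
    final Function<String, String> db;                   // L3: source of truth
    boolean redisUp = true;                              // simulate a Redis outage

    MultiLevelCacheSketch(Function<String, String> db) { this.db = db; }

    String get(String key) {
        String v = local.get(key);
        if (v != null) return v;                         // L1 hit
        if (redisUp) {
            v = redis.get(key);
            if (v != null) { local.put(key, v); return v; }   // L2 hit, backfill L1
        }
        v = db.apply(key);                               // fall through to the DB
        if (v != null) {
            local.put(key, v);                           // backfill both levels
            if (redisUp) redis.put(key, v);
        }
        return v;
    }
}
```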

Solution 3: High-Availability Architecture

  • Deploy Redis Sentinel or Redis Cluster to avoid single point of failure
  • Integrate circuit breaker and degradation in business layer (like Sentinel, Hystrix), when Redis is unavailable degrade to database or return fallback data
  • Pre-warm critical data during off-peak hours

Cache Breakdown

Problem Description

At the moment a single hot key expires, massive concurrent requests experience cache miss simultaneously, all hit database to rebuild cache, creating “thundering herd” effect.

Solutions

Solution 1: Mutex Lock

On a cache miss, only one thread is allowed to acquire a distributed lock, query the database, and rebuild the cache; other threads wait or spin-retry. This ensures the database is queried only once per miss.
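A sketch of the rebuild path, with the usual double-check after acquiring the lock. A local `ReentrantLock` stands in for the distributed lock (in Redis this would typically be `SET lockKey token NX PX`), and `loadFromDb` is an illustrative name.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.locks.ReentrantLock;

// Sketch of mutex-protected cache rebuild on a hot-key miss.
class MutexRebuildSketch {
    static final ConcurrentHashMap<String, String> cache = new ConcurrentHashMap<>();
    static final ReentrantLock lock = new ReentrantLock();   // stand-in for a distributed lock
    static int dbLoads = 0;

    static String loadFromDb(String key) { dbLoads++; return "value-of-" + key; }

    static String get(String key) {
        String v = cache.get(key);
        if (v != null) return v;                 // cache hit
        lock.lock();                             // only one thread rebuilds
        try {
            v = cache.get(key);                  // double-check: another thread may have rebuilt
            if (v == null) {
                v = loadFromDb(key);             // exactly one DB query per miss storm
                cache.put(key, v);
            }
            return v;
        } finally {
            lock.unlock();
        }
    }
}
```

The double-check after `lock.lock()` is what prevents every waiting thread from re-querying the database once the first rebuilder releases the lock.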

Solution 2: Never Expire (Logical Expiration)

The hot key has no physical expiration time; instead, a logical expiration timestamp is stored inside the value. A background async task refreshes the cache before the logical expiration passes, so readers always hit the cache (possibly briefly reading stale data).
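The value-plus-timestamp structure can be sketched as below. The `Entry` type and `read` helper are illustrative; the async refresh itself is deliberately omitted.

```java
// Sketch of logical expiration: the key never expires in Redis; the value
// carries its own expiry timestamp, and a background task refreshes it when stale.
class LogicalExpireSketch {
    static class Entry {
        final String value;
        final long logicalExpireAt;   // epoch millis

        Entry(String value, long logicalExpireAt) {
            this.value = value;
            this.logicalExpireAt = logicalExpireAt;
        }

        boolean isStale(long now) { return now >= logicalExpireAt; }
    }

    // The reader always gets a value back immediately; staleness only signals
    // that an async rebuild should be scheduled (not shown here).
    static String read(Entry e, long now) {
        if (e.isStale(now)) {
            // schedule the async rebuild here; meanwhile serve the stale value
        }
        return e.value;
    }
}
```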

Solution 3: Early Renewal

Monitor remaining TTL of key, actively refresh before expiration to fundamentally eliminate expiration gap.

Data Consistency Problem

Delayed Double Delete Strategy

1. Delete cache before updating database
2. Update database
3. Wait roughly 200 ms to 2 s (letting in-flight read requests from other threads complete)
4. Delete cache again (clear possibly rewritten old data)
5. Set reasonable TTL on cache as final fallback
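The steps above can be sketched as follows. `cacheDelete` and `dbUpdate` are illustrative stand-ins, and in production the second delete is usually pushed to an async queue or delayed task rather than sleeping on the request thread.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the delayed double delete write path.
class DoubleDeleteSketch {
    static final List<String> ops = new ArrayList<>();   // records the operation order

    static void cacheDelete(String key) { ops.add("DEL " + key); }
    static void dbUpdate(String key, String v) { ops.add("UPDATE " + key + "=" + v); }

    static void update(String key, String value) {
        cacheDelete(key);            // 1. delete cache first
        dbUpdate(key, value);        // 2. update the database
        try {
            Thread.sleep(50);        // 3. wait (200 ms - 2 s in practice; shortened here)
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        cacheDelete(key);            // 4. delete again to evict stale data re-read meanwhile
        // 5. the cache entry itself also carries a TTL as the final fallback
    }
}
```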

More thorough solution is to listen to database Binlog (like Canal), precisely invalidate corresponding cache when data change events occur, achieving near real-time cache consistency.

Hot Key Problem

Problem Description

Massive requests concentrate on the same key, exceeding single Redis node’s network bandwidth or CPU processing capability, which may cause node crash and trigger avalanche.

Detection Methods

  • Offline: redis-cli --hotkeys (based on LFU statistics, need to enable corresponding config)
  • Online: MONITOR command captures request traffic (has performance impact, use with caution)
  • Stream computing: Integrate Flink/Spark for real-time access frequency statistics, write hot key info to ZooKeeper to notify application layer

Solutions

  1. Local cache fallback: Replicate hot key to application process memory (Caffeine), accept cost of brief data inconsistency
  2. Sharded reading: Replicate hot key to multiple Redis nodes (key_1, key_2…key_N), randomly select node when reading
  3. Rate limiting + circuit breaker: Rate limit keys with abnormally high access frequency, and use a circuit breaker to protect the backend

Big Key Problem

Problem Description

A single key's value is too large (common rules of thumb: a String over 10KB, or a collection with more than 5000 elements), which causes:

  • Uneven memory distribution, affects cluster data migration and rebalance
  • Long blocking time for read/write operations, other request latencies increase
  • DEL on big key directly blocks main thread (synchronous operation)

Detection Methods

  • redis-cli --bigkeys: Scan entire keyspace (time-consuming when data volume is large)
  • RDB file analysis tools (like rdbtools, redis-rdb-tools): Offline analysis, doesn’t affect production

Solutions

  1. Split big key: Split large String into multiple keys (e.g., sharded storage), split large Hash/List/Set by hash bucketing
  2. External storage: Store extra large values (images, documents, serialized objects) in MongoDB or CDN, only store reference ID in Redis
  3. Lazy deletion: Use UNLINK instead of DEL, asynchronously delete big key in background to avoid blocking main thread
# Safely delete big key
UNLINK big_key_name
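For the splitting strategy (item 1 above), hash bucketing reduces to one piece of key arithmetic, sketched here with an illustrative helper name: each field of a large logical Hash is routed to one of N smaller sub-keys.

```java
// Sketch of splitting a large Hash by bucketing: fields are spread across
// bucketCount sub-keys so no single key grows unbounded.
class BigKeySplitSketch {
    // Map a field of logical hash "bigKey" to one of bucketCount smaller keys,
    // e.g. "user:profile" -> "user:profile:0" .. "user:profile:15".
    static String bucketKey(String bigKey, String field, int bucketCount) {
        int bucket = Math.floorMod(field.hashCode(), bucketCount);
        return bigKey + ":" + bucket;
    }
}
```

Reads and writes both compute the bucket from the field name, so routing stays deterministic; the cost is that operations spanning the whole logical hash (like HGETALL) now require N smaller calls.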

Summary

| Problem | Root Cause | Primary Solution |
| --- | --- | --- |
| Cache Penetration | Requests for non-existent keys | Bloom filter |
| Cache Avalanche | Many keys expire simultaneously | Random TTL + multi-level cache |
| Cache Breakdown | Concurrent misses at the moment a hot key expires | Mutex lock / never expire |
| Hot Key | Traffic concentrated on a single node | Local cache + sharded reading |
| Big Key | Single key's data volume too large | Split + UNLINK delete |