Caching adds a fast storage layer between consumers and the primary data source to reduce repeated expensive reads and improve response times.
It is most effective when reads dominate writes and when the system can tolerate bounded staleness with explicit invalidation and expiry behavior.
How It Works
Caching places a fast-access storage layer (near-memory or distributed) between the data consumer and the data source. When a request arrives, the cache is checked first. If the data is present (hit), it is returned immediately. If not (miss), it is fetched from the source, stored in the cache, and then returned.
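The hit/miss flow described above can be sketched in a few lines of Python; the plain dict and `load_from_source` are placeholder assumptions standing in for a real cache store and backend call:

```python
cache = {}

def load_from_source(key):
    # Placeholder for an expensive read (database query, remote API call, ...).
    return f"value-for-{key}"

def get(key):
    if key in cache:                    # hit: return immediately
        return cache[key]
    value = load_from_source(key)       # miss: fetch from the source
    cache[key] = value                  # populate the cache for later requests
    return value
```

The first call for a key pays the full cost of the source read; subsequent calls are served from memory until the entry is evicted or invalidated.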
Common population and write strategies:
- Read-through: The cache itself fetches missing data from the source on a cache miss.
- Write-through: Data is written to the cache and the source synchronously in the same operation, keeping them in sync.
- Write-behind (write-back): Data is written to the cache first and asynchronously propagated to the source, improving write performance but risking data loss if the cache fails before propagation.
- Cache-aside (lazy loading): The application manages the cache explicitly, checking it before reads and updating it after writes.

Common expiry and eviction policies:
- TTL (Time-To-Live): Entries expire and are removed automatically after a fixed duration.
- LRU (Least Recently Used): When the cache is full, the least recently accessed entries are evicted first to make room for new data.
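The two eviction policies listed above are often combined. A toy sketch, assuming `collections.OrderedDict` as the backing store (class and parameter names are illustrative, not a production implementation):

```python
import time
from collections import OrderedDict

class TTLLRUCache:
    """Toy cache: entries expire after `ttl` seconds (TTL), and the least
    recently used entry is evicted once `capacity` is exceeded (LRU)."""

    def __init__(self, capacity=128, ttl=60.0):
        self.capacity = capacity
        self.ttl = ttl
        self._data = OrderedDict()          # key -> (value, expiry_timestamp)

    def get(self, key):
        item = self._data.get(key)
        if item is None:
            return None
        value, expires_at = item
        if time.monotonic() > expires_at:   # TTL: drop expired entries
            del self._data[key]
            return None
        self._data.move_to_end(key)         # LRU: mark as most recently used
        return value

    def put(self, key, value):
        self._data[key] = (value, time.monotonic() + self.ttl)
        self._data.move_to_end(key)
        if len(self._data) > self.capacity: # evict the least recently used
            self._data.popitem(last=False)
```

Production caches (Caffeine, Redis) implement the same ideas with far more care around concurrency and memory accounting.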
Failure Modes
- Stale data: Serving outdated information because the source changed but the cache has not yet expired or been invalidated.
- Cache stampede (thundering herd): Many concurrent requests for the same expired key miss simultaneously, overwhelming the backend with redundant recomputations.
- Cold start: After a restart or deployment, the cache is empty, causing all initial requests to hit the backend at once.
- Memory pressure: An unbounded cache consumes excessive memory, leading to garbage collection pauses or system instability.
- Cache poisoning: Incorrect, unauthorized, or corrupted data enters the cache and is served repeatedly to multiple users.
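A common mitigation for the stampede failure mode is per-key locking (sometimes called single-flight): only one caller recomputes a missing key while concurrent callers wait and reuse the result. A minimal sketch, assuming a thread-based application and hypothetical `compute` callback:

```python
import threading

cache = {}
locks = {}
locks_guard = threading.Lock()  # protects the per-key lock table

def get_with_single_flight(key, compute):
    """Only one thread recomputes a missing key; the rest wait and reuse it."""
    value = cache.get(key)
    if value is not None:
        return value
    with locks_guard:                       # one lock object per key
        lock = locks.setdefault(key, threading.Lock())
    with lock:
        value = cache.get(key)              # re-check: another thread may have
        if value is None:                   # filled the entry while we waited
            value = compute(key)
            cache[key] = value
        return value
```

Distributed variants use a shared lock (e.g., in Redis) or serve slightly stale data while one worker refreshes the entry in the background.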
Verification
- Under realistic load, p99 response time for cached endpoints stays below the target threshold (e.g., < 100 ms) with a cache hit ratio of at least 80%.
- Latency reduction: Verify that p95 response time is at least 50% lower with caching enabled compared to a non-cached baseline.
- Invalidation check: After a write to the source, the cache reflects the new value within the agreed TTL window or immediately upon explicit invalidation.
- Failure simulation: Verify that the system degrades gracefully and does not crash when the caching layer (e.g., Redis) is unavailable.
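The availability check can be exercised by wrapping cache access so that errors fall back to the source instead of propagating. A sketch under stated assumptions: `CacheUnavailable`, `cache_get`, and `load_from_source` are hypothetical stand-ins for the real client error and callbacks:

```python
class CacheUnavailable(Exception):
    """Stand-in for a cache client error (e.g., a Redis connection failure)."""

def get_with_fallback(key, cache_get, load_from_source):
    """Serve from the cache when possible; degrade to the source on failure."""
    try:
        value = cache_get(key)
        if value is not None:
            return value
    except CacheUnavailable:
        pass                      # cache down: degrade gracefully, don't crash
    return load_from_source(key)
```

A chaos-style test then simulates the outage by injecting a `cache_get` that always raises, and asserts that requests still succeed (at backend latency).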
Variants and Related Tactics
- In-process cache: Data stored in application memory (e.g., hash maps, Caffeine). Fastest access, but not shared across instances.
- Distributed cache: Shared cache across multiple application instances (e.g., Redis, Memcached, Hazelcast). Enables horizontal scaling and a shared view of cached data across the cluster.
- CDN caching: Content cached at edge locations close to users. Essential for static or semi-static assets in global applications.
- HTTP caching: Browser and proxy caches using standard headers (Cache-Control, ETag).
- Memoization: Caching the results of pure functions based on their input parameters.