Understanding Workloads to Schedule KVCache - A Methodological Approach
Introduction
KVCache has become a cornerstone optimization in large language model serving, enabling systems to reuse intermediate key-value computations across requests. However, the effectiveness of KVCache fundamentally depends on scheduling decisions that are inherently workload-dependent. The paper KVCache Cache in the Wild: Characterizing and Optimizing KVCache Cache at a Large Cloud Provider offers not just a solution but, more importantly, a systematic methodology for understanding workloads that can inform scheduling design.
The core insight is deceptively simple: optimal scheduling policies cannot be designed in a vacuum. They must emerge from a deep understanding of how real workloads behave and, crucially, why they behave that way.
Part 1: The KVCache Scheduling Challenge
KVCache scheduling involves two fundamental aspects:
Lifecycle management encompasses the entire data flow of KVCache blocks from creation to deletion. This includes deciding whether to cache newly computed blocks, managing their movement across storage tiers (HBM, CPU memory, SSD), and determining when to evict them. Each storage tier offers different capacity-throughput tradeoffs, and optimal scheduling must also estimate the required size for each tier to avoid waste while ensuring performance.
Aggressive optimizations like prefetching attempt to proactively move blocks from lower tiers back to HBM before they're needed. These strategies leverage temporal and spatial locality patterns but must balance potential performance gains against bandwidth waste and system complexity.
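To make these lifecycle decisions concrete, here is a minimal sketch of a tier-placement and prefetch policy. The tier names follow the storage hierarchy above, but the block fields, thresholds, and helper names are illustrative assumptions, not the paper's actual policy:

```python
from dataclasses import dataclass

@dataclass
class Block:
    block_id: str
    reuse_prob: float       # estimated probability of reuse in the near future
    predicted_gap_s: float  # predicted seconds until the next access

def place_block(block: Block) -> str:
    """Toy lifecycle decision: hot blocks stay in HBM, warm blocks spill
    to CPU memory, cold blocks go to SSD, and very cold blocks are not
    cached at all. Thresholds are invented for illustration."""
    if block.reuse_prob > 0.5:
        return "HBM"
    if block.reuse_prob > 0.1:
        return "CPU"
    if block.reuse_prob > 0.01:
        return "SSD"
    return "EVICT"

def should_prefetch(block: Block) -> bool:
    """Toy prefetch rule: pull a block back into HBM only when the next
    access looks imminent, since a wrong guess wastes tier bandwidth."""
    return block.predicted_gap_s < 1.0 and block.reuse_prob > 0.3
```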
The challenge is that these decisions depend entirely on workload characteristics - what works for API-driven workloads may fail catastrophically for interactive chat sessions.
Part 2: Workload Analysis Methodology
Data Collection Philosophy
The methodology begins with an iterative refinement process. Initial data collection reveals unexpected patterns, which prompt collection of additional contextual information. This feedback loop prevents both over-collection and blind spots. Each iteration sharpens the understanding of what data truly matters for scheduling decisions.
Pattern Modeling Approach
The first step in understanding workloads is attempting to model their behavior with simple mathematical formulations. The authors initially tried to fit overall access patterns with exponential distributions for reuse probability over time.
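As a minimal sketch of what such a fit looks like, the snippet below fits an exponential survival curve P(reuse after t) = exp(-λt) to inter-reuse gaps. The synthetic trace data, variable names, and fitting choice (least squares via scipy.optimize.curve_fit) are assumptions for illustration, not the paper's procedure:

```python
import numpy as np
from scipy import optimize

# Stand-in data: gaps (seconds) between consecutive hits on the same
# KVCache block, as would be extracted from a real access trace.
rng = np.random.default_rng(0)
reuse_gaps = rng.exponential(scale=30.0, size=5000)

def survival(t: np.ndarray, lam: float) -> np.ndarray:
    """P(block is reused after time t) under an exponential model."""
    return np.exp(-lam * t)

# Empirical survival curve: fraction of gaps longer than each threshold.
ts = np.linspace(0, reuse_gaps.max(), 200)
empirical = np.array([(reuse_gaps > t).mean() for t in ts])

# Least-squares fit of the decay rate lambda.
popt, _ = optimize.curve_fit(survival, ts, empirical, p0=[0.01])
lam = popt[0]
print(f"fitted decay rate: {lam:.4f} /s (mean reuse gap ~ {1/lam:.1f} s)")
```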
This simplicity is methodological discipline. Complex models might fit observed data more precisely, but they risk overfitting to specific traces. Simple models that fit reasonably well likely capture fundamental behaviors that generalize across deployments.
When simple models failed to adequately capture behavior across all requests, this signaled the need for categorization. Poor model fit isn't failure - it's information indicating that the workload contains distinct sub-populations with different behaviors.
Workload Categorization Strategy
Upon discovering that unified models couldn't capture workload behavior, the authors split requests into categories: single-turn versus multi-turn, API calls versus chat interactions, and different application types. Each split was motivated by model fit improvement.
But the crucial step was validating these categories by identifying root causes. They discovered API requests exhibited high reuse because developers typically hardcode identical system prompts. File-understanding requests showed longer reuse times because users need time to process document summaries. These aren't just patterns - they're behavioral explanations that validate the categorization.
This creates a powerful refinement cycle (sketched in code after the list):
- Attempt simple model fitting on aggregate data.
- Identify subsets where models fail.
- Split into categories that improve model fit.
- Validate categories by finding behavioral reasons.
The end result: multiple simple models, each capturing a behaviorally-distinct workload component.
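Here is a minimal sketch of one pass of that cycle, assuming exponential reuse models and invented category names. It exploits the fact that a true exponential has a coefficient of variation of 1, so a large deviation flags a category that mixes distinct behaviors and should be split further:

```python
import numpy as np

def fit_error(gaps: np.ndarray) -> float:
    """Goodness-of-fit of a single-exponential model via a moment check:
    for an exponential, std/mean == 1, so deviation from 1 suggests the
    sample mixes sub-populations with different decay rates."""
    return abs(gaps.std() / gaps.mean() - 1.0)

def refine(traces: dict[str, np.ndarray], tol: float = 0.2) -> dict[str, float]:
    """One pass of the refinement cycle: fit each category, flag the ones
    where one exponential fits poorly and a further split is warranted.
    Category labels are assumptions, not taken from the paper."""
    rates = {}
    for name, gaps in traces.items():
        err = fit_error(gaps)
        if err > tol:
            print(f"{name}: poor fit (err={err:.2f}) -> split and re-validate")
        else:
            rates[name] = 1.0 / gaps.mean()  # MLE decay rate for exponential
    return rates

# Toy traces: 'mixed' blends two behaviors, so it gets flagged for splitting.
rng = np.random.default_rng(0)
traces = {
    "api": rng.exponential(5.0, 2000),
    "chat": rng.exponential(60.0, 2000),
    "mixed": np.concatenate([rng.exponential(5.0, 1000),
                             rng.exponential(300.0, 1000)]),
}
print(refine(traces))
```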
From Models to System Design
With validated models for each workload category, system design becomes principled rather than heuristic. The exponential decay in reuse probability directly informs eviction priorities - blocks from categories with faster decay should be evicted first when capacity is constrained.
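A hedged sketch of how fitted decay rates could translate into an eviction order follows; the per-category rates, category names, and block tuples are all invented for illustration:

```python
import math

# Hypothetical fitted decay rates (1/s) per category; a larger rate means
# reuse probability falls off faster after the last access.
DECAY = {"api": 1 / 5.0, "chat": 1 / 60.0, "file": 1 / 300.0}

def reuse_probability(category: str, idle_s: float) -> float:
    """Estimated P(reuse) for a block that has been idle for idle_s
    seconds, under that category's exponential model."""
    return math.exp(-DECAY[category] * idle_s)

def eviction_order(blocks):
    """Evict lowest-reuse-probability blocks first.
    Each block is (block_id, category, idle_seconds)."""
    return sorted(blocks, key=lambda b: reuse_probability(b[1], b[2]))

# Given equal idle time, the fast-decaying 'api' block is evicted first.
print(eviction_order([("b1", "api", 20.0),
                      ("b2", "chat", 20.0),
                      ("b3", "file", 20.0)]))
```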
The models also enable capacity planning. By understanding the lifespan distribution of each category and their relative frequencies, the system can estimate the required cache size for each storage tier. For instance, discovering that API workloads have short but intense reuse patterns suggests that a smaller but faster cache tier serves them best.
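As a back-of-the-envelope example of such capacity planning, Little's law gives the steady-state number of live blocks per category as arrival rate times mean lifespan. All the rates, lifespans, and the block size below are invented numbers, not measurements from the paper:

```python
# Per-category workload parameters (assumed, for illustration only).
CATEGORIES = {
    # name: (blocks created per second, mean useful lifespan in seconds)
    "api":  (500.0, 5.0),
    "chat": (200.0, 60.0),
    "file": (20.0, 300.0),
}
BLOCK_BYTES = 2 * 1024 * 1024  # assumed KVCache block size: 2 MiB

for name, (rate, lifespan) in CATEGORIES.items():
    # Little's law: live blocks = arrival rate x mean lifespan.
    live_blocks = rate * lifespan
    print(f"{name:>4}: ~{live_blocks:,.0f} live blocks "
          f"(~{live_blocks * BLOCK_BYTES / 2**30:.1f} GiB)")
```

Note how the high-rate API category still has the smallest footprint because its reuse window is short, matching the intuition that a small, fast tier suffices for it.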