Hybrid Cache: HNSW + Milvus
The Hybrid Cache combines an in-memory HNSW index for fast search with a Milvus vector database for scalable, persistent storage.
Overview
The hybrid architecture provides:
- Fast search via in-memory HNSW index
- Scalable storage via Milvus vector database
- Persistence with Milvus as the source of truth
- Hot data caching with local document cache
Architecture
┌──────────────────────────────────────────────────┐
│                   Hybrid Cache                   │
├──────────────────────────────────────────────────┤
│  ┌─────────────────┐       ┌──────────────────┐  │
│  │   In-Memory     │       │   Local Cache    │  │
│  │   HNSW Index    │ ◄─────┤   (Hot Data)     │  │
│  └────────┬────────┘       └──────────────────┘  │
│           │                                      │
│           │ ID Mapping                           │
│           ▼                                      │
│  ┌──────────────────────────────────────────┐    │
│  │          Milvus Vector Database          │    │
│  └──────────────────────────────────────────┘    │
└──────────────────────────────────────────────────┘
How It Works
Write Path (AddEntry)
When adding a cache entry:
- Generate embedding using the configured embedding model
- Write entry to Milvus for persistence
- Add entry to in-memory HNSW index (if space is available)
- Add document to local cache
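The ordering of these steps can be sketched as follows. This is a minimal illustration, not the project's implementation: `Entry`, `persistentStore`, `hybridSketch`, and `toyEmbed` are hypothetical names, a plain map stands in for both Milvus and the HNSW index, and the embedding function is a placeholder rather than a real model.

```go
package main

import (
	"errors"
	"fmt"
)

// Entry is a hypothetical cache record used only in this sketch.
type Entry struct {
	ID        string
	Query     string
	Response  []byte
	Embedding []float32
}

// persistentStore stands in for Milvus; the real client API is not shown here.
type persistentStore struct{ rows map[string]Entry }

func (s *persistentStore) insert(e Entry) error {
	if e.ID == "" {
		return errors.New("entry has no ID")
	}
	s.rows[e.ID] = e
	return nil
}

// hybridSketch mirrors the three tiers described above.
type hybridSketch struct {
	store      *persistentStore
	index      map[string][]float32 // stand-in for the in-memory HNSW index
	localCache map[string]Entry     // hot document cache
	maxIndex   int                  // plays the role of max_memory_entries
	embed      func(string) []float32
}

// addEntry follows the write path: embed, persist, index if room, cache.
func (h *hybridSketch) addEntry(id, query string, response []byte) error {
	e := Entry{ID: id, Query: query, Response: response, Embedding: h.embed(query)}
	// Persist first, so the source of truth never lags the in-memory tiers.
	if err := h.store.insert(e); err != nil {
		return fmt.Errorf("persist entry: %w", err)
	}
	if len(h.index) < h.maxIndex {
		h.index[id] = e.Embedding
	}
	h.localCache[id] = e
	return nil
}

// toyEmbed is a placeholder embedding, not a real model.
func toyEmbed(s string) []float32 {
	return []float32{float32(len(s)), 1}
}

func newHybridSketch(maxIndex int) *hybridSketch {
	return &hybridSketch{
		store:      &persistentStore{rows: map[string]Entry{}},
		index:      map[string][]float32{},
		localCache: map[string]Entry{},
		maxIndex:   maxIndex,
		embed:      toyEmbed,
	}
}

func main() {
	h := newHybridSketch(1)
	for _, q := range []string{"first query", "second query"} {
		if err := h.addEntry(q, q, []byte("answer")); err != nil {
			fmt.Println("add failed:", err)
		}
	}
	fmt.Println("persisted:", len(h.store.rows), "indexed:", len(h.index))
}
```

With an index capacity of 1, both entries are persisted but only the first one is indexed in memory, which is the situation the "if space is available" condition describes.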
Read Path (FindSimilar)
When searching for a similar query:
- Generate query embedding
- Search HNSW index for nearest neighbors
- Check local cache for matching documents
  - If found in local cache: return immediately (hot path)
  - If not found: fetch from Milvus (cold path)
- Cache fetched documents in local cache for future queries
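The hot and cold paths can be illustrated with a small lookup routine. This is a sketch under stated assumptions: `reader`, `doc`, and `findSimilar` are hypothetical names, a brute-force scan stands in for the approximate HNSW search, and a map stands in for Milvus. The query is assumed to be already embedded.

```go
package main

import "fmt"

// doc is a hypothetical cached document.
type doc struct {
	ID       string
	Response string
}

// reader sketches the lookup tiers; all names are illustrative.
type reader struct {
	vectors    map[string][]float32 // stand-in for the HNSW index (ID -> embedding)
	localCache map[string]doc       // hot documents
	remote     map[string]doc       // stand-in for Milvus
	threshold  float32              // plays the role of similarity_threshold
}

// dot returns the inner product of two equal-length vectors.
func dot(a, b []float32) float32 {
	var s float32
	for i := range a {
		s += a[i] * b[i]
	}
	return s
}

// nearest does a brute-force scan; a real HNSW search is approximate and much faster.
func (r *reader) nearest(q []float32) (string, float32, bool) {
	bestID, best, found := "", float32(0), false
	for id, v := range r.vectors {
		if s := dot(q, v); !found || s > best {
			bestID, best, found = id, s, true
		}
	}
	return bestID, best, found
}

// findSimilar returns the response, the path that served it, and whether it was a hit.
func (r *reader) findSimilar(q []float32) (string, string, bool) {
	id, score, ok := r.nearest(q)
	if !ok || score < r.threshold {
		return "", "miss", false
	}
	if d, hit := r.localCache[id]; hit {
		return d.Response, "hot", true // served from memory
	}
	d, hit := r.remote[id]
	if !hit {
		return "", "miss", false
	}
	r.localCache[id] = d // keep it hot for the next query
	return d.Response, "cold", true
}

func main() {
	r := &reader{
		vectors:    map[string][]float32{"a": {1, 0}, "b": {0, 1}},
		localCache: map[string]doc{},
		remote: map[string]doc{
			"a": {ID: "a", Response: "answer A"},
			"b": {ID: "b", Response: "answer B"},
		},
		threshold: 0.85,
	}
	for i := 0; i < 2; i++ {
		resp, path, _ := r.findSimilar([]float32{0.9, 0.1})
		fmt.Println(path, resp)
	}
}
```

The first lookup takes the cold path and populates the local cache; the identical second lookup is served from the hot path.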
Memory Management
- HNSW Index: Limited to a configured maximum number of entries
- Local Cache: Limited to a configured number of documents
- Eviction: FIFO policy when limits are reached
- Data Persistence: All data remains in Milvus regardless of memory limits
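A FIFO-bounded cache of the kind described above might look like the following. This is a minimal sketch, not the project's code: `fifoCache` and its methods are hypothetical, and values are plain strings for brevity. Because the persistent store keeps everything, an evicted key only loses its in-memory copy.

```go
package main

import "fmt"

// fifoCache is a bounded map that evicts the oldest inserted key first.
type fifoCache struct {
	capacity int
	order    []string // insertion order, oldest first
	items    map[string]string
}

func newFIFOCache(capacity int) *fifoCache {
	return &fifoCache{capacity: capacity, items: make(map[string]string)}
}

// put inserts or updates a key and reports the evicted key, if any.
func (c *fifoCache) put(key, value string) (evicted string, ok bool) {
	if _, exists := c.items[key]; exists {
		c.items[key] = value // update in place; FIFO position is unchanged
		return "", false
	}
	if c.capacity <= 0 {
		return "", false // a zero-capacity cache stores nothing
	}
	if len(c.items) >= c.capacity {
		evicted, ok = c.order[0], true
		c.order = c.order[1:]
		delete(c.items, evicted)
	}
	c.order = append(c.order, key)
	c.items[key] = value
	return evicted, ok
}

// get reads a key without affecting eviction order (unlike LRU).
func (c *fifoCache) get(key string) (string, bool) {
	v, ok := c.items[key]
	return v, ok
}

func main() {
	c := newFIFOCache(2)
	c.put("a", "1")
	c.put("b", "2")
	if evicted, ok := c.put("c", "3"); ok {
		fmt.Println("evicted:", evicted)
	}
}
```

Unlike LRU, reads do not refresh an entry's position, so a frequently read entry is still evicted once enough newer entries arrive.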
Configuration
Basic Configuration
semantic_cache:
  enabled: true
  backend_type: "hybrid"
  similarity_threshold: 0.85
  ttl_seconds: 3600
  # Hybrid-specific settings
  max_memory_entries: 100000  # Max entries in HNSW
  local_cache_size: 1000      # Local document cache size
  # HNSW parameters
  hnsw_m: 16
  hnsw_ef_construction: 200
  # Milvus configuration
  backend_config_path: "config/milvus.yaml"
Configuration Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| backend_type | string | - | Must be "hybrid" |
| similarity_threshold | float | 0.85 | Minimum similarity for a cache hit |
| max_memory_entries | int | 100000 | Maximum number of entries in the HNSW index |
| local_cache_size | int | 1000 | Number of documents kept in the hot document cache |
| hnsw_m | int | 16 | Number of bi-directional links per node (HNSW M) |
| hnsw_ef_construction | int | 200 | Candidate list size during index construction (higher improves quality, slows inserts) |
| backend_config_path | string | - | Path to the Milvus config file |
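To get a feel for how max_memory_entries and hnsw_m translate into memory, a back-of-the-envelope estimate helps. The sketch below is only a rough lower bound under explicit assumptions: float32 embeddings, 4-byte neighbor IDs, up to 2·M links per node on the bottom layer (the value recommended in the HNSW paper), and no accounting for upper layers or Go runtime overhead. The function name is hypothetical, and the dimension of 384 is taken from the example Milvus configuration.

```go
package main

import "fmt"

// estimateIndexBytes returns a rough lower bound on index memory.
// Assumptions: float32 vectors, uint32 neighbor IDs, 2*m links per node
// on layer 0, upper layers and allocator overhead ignored.
func estimateIndexBytes(entries, dim, m int) (vectorBytes, linkBytes int64) {
	const bytesPerFloat32 = 4
	const bytesPerNeighborID = 4
	vectorBytes = int64(entries) * int64(dim) * bytesPerFloat32
	linkBytes = int64(entries) * int64(2*m) * bytesPerNeighborID
	return vectorBytes, linkBytes
}

func main() {
	v, l := estimateIndexBytes(100000, 384, 16)
	fmt.Printf("vectors: %.1f MiB, layer-0 links: %.1f MiB\n",
		float64(v)/(1<<20), float64(l)/(1<<20))
}
```

With the default values this comes to roughly 146.5 MiB for the vectors and 12.2 MiB for the layer-0 links, so the vectors dominate and the embedding dimension matters more than hnsw_m when sizing max_memory_entries.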
Milvus Configuration
Create config/milvus.yaml:
milvus:
  address: "localhost:19530"
  collection_name: "semantic_cache"
  dimension: 384
  index_type: "HNSW"
  metric_type: "IP"
  params:
    M: 16
    efConstruction: 200
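With metric_type set to "IP", scores are raw inner products, which equal cosine similarity only when the embeddings are L2-normalized. Whether the configured embedding model already emits normalized vectors is an assumption worth checking, since a threshold such as 0.85 is only meaningful on a bounded scale. The helper names below are hypothetical and the sketch is independent of any Milvus client.

```go
package main

import (
	"fmt"
	"math"
)

// normalize returns a unit-length copy of v (or all zeros for a zero vector).
func normalize(v []float32) []float32 {
	var sum float64
	for _, x := range v {
		sum += float64(x) * float64(x)
	}
	out := make([]float32, len(v))
	norm := math.Sqrt(sum)
	if norm == 0 {
		return out
	}
	for i, x := range v {
		out[i] = float32(float64(x) / norm)
	}
	return out
}

// innerProduct assumes a and b have the same length.
func innerProduct(a, b []float32) float32 {
	var s float32
	for i := range a {
		s += a[i] * b[i]
	}
	return s
}

// isCacheHit compares cosine similarity (IP of normalized vectors) to the threshold.
func isCacheHit(a, b []float32, threshold float32) bool {
	return innerProduct(normalize(a), normalize(b)) >= threshold
}

func main() {
	a, b := []float32{3, 4}, []float32{4, 3}
	fmt.Printf("raw IP: %.2f\n", innerProduct(a, b))
	fmt.Printf("normalized IP: %.2f\n", innerProduct(normalize(a), normalize(b)))
	fmt.Println("hit at 0.85:", isCacheHit(a, b, 0.85))
}
```

The raw inner product of the two example vectors is 24, far outside the 0 to 1 range, while the normalized value is 0.96, which can be compared directly against similarity_threshold.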
Example Usage
Go Code
package main

import (
	"fmt"
	"log"

	"github.com/vllm-project/semantic-router/src/semantic-router/pkg/cache"
)

func main() {
	// Initialize hybrid cache
	options := cache.HybridCacheOptions{
		Enabled:             true,
		SimilarityThreshold: 0.85,
		TTLSeconds:          3600,
		MaxMemoryEntries:    100000,
		HNSWM:               16,
		HNSWEfConstruction:  200,
		MilvusConfigPath:    "config/milvus.yaml",
		LocalCacheSize:      1000,
	}
	hybridCache, err := cache.NewHybridCache(options)
	if err != nil {
		log.Fatalf("Failed to create hybrid cache: %v", err)
	}
	defer hybridCache.Close()

	// Add cache entry
	err = hybridCache.AddEntry(
		"request-id-123",
		"gpt-4",
		"What is quantum computing?",
		[]byte(`{"prompt": "What is quantum computing?"}`),
		[]byte(`{"response": "Quantum computing is..."}`),
	)
	if err != nil {
		// Log instead of exiting so the deferred Close still runs.
		log.Printf("Failed to add cache entry: %v", err)
	}

	// Search for similar query
	response, found, err := hybridCache.FindSimilar(
		"gpt-4",
		"Explain quantum computers",
	)
	if err != nil {
		log.Printf("Cache lookup failed: %v", err)
	} else if found {
		fmt.Printf("Cache hit! Response: %s\n", string(response))
	}

	// Get statistics
	stats := hybridCache.GetStats()
	fmt.Printf("Total entries in HNSW: %d\n", stats.TotalEntries)
	fmt.Printf("Hit ratio: %.2f%%\n", stats.HitRatio*100)
}
Monitoring and Metrics
The hybrid cache exposes metrics for monitoring:
stats := hybridCache.GetStats()
// Available metrics
fmt.Println(stats.TotalEntries) // Entries in HNSW index
fmt.Println(stats.HitCount)     // Total cache hits
fmt.Println(stats.MissCount)    // Total cache misses
fmt.Println(stats.HitRatio)     // Hit ratio (0.0 - 1.0)
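The relationship between these fields is presumably HitRatio = HitCount / (HitCount + MissCount); that formula is an assumption here, not something the API description confirms. A minimal, concurrency-safe way to maintain such counters is sketched below with a hypothetical `counters` type built on the standard `sync/atomic` package (requires Go 1.19 or later for `atomic.Int64`).

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// counters tracks hits and misses without a mutex.
type counters struct {
	hits   atomic.Int64
	misses atomic.Int64
}

func (c *counters) recordHit()  { c.hits.Add(1) }
func (c *counters) recordMiss() { c.misses.Add(1) }

// hitRatio returns hits / (hits + misses), or 0 before any lookup.
func (c *counters) hitRatio() float64 {
	h, m := c.hits.Load(), c.misses.Load()
	if h+m == 0 {
		return 0 // avoid dividing by zero on a fresh cache
	}
	return float64(h) / float64(h+m)
}

func main() {
	var c counters
	c.recordHit()
	c.recordHit()
	c.recordHit()
	c.recordMiss()
	fmt.Printf("hit ratio: %.2f\n", c.hitRatio())
}
```

Three hits and one miss give a ratio of 0.75. Guarding the zero-lookup case keeps a freshly started instance from reporting NaN.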
Prometheus Metrics
# Cache entries in HNSW
semantic_cache_entries{backend="hybrid"}
# Cache operations
semantic_cache_operations_total{backend="hybrid",operation="find_similar",status="hit_local"}
semantic_cache_operations_total{backend="hybrid",operation="find_similar",status="hit_milvus"}
semantic_cache_operations_total{backend="hybrid",operation="find_similar",status="miss"}
# Cache hit ratio
semantic_cache_hit_ratio{backend="hybrid"}
Multi-Instance Deployment
The hybrid cache supports multi-instance deployments where each instance maintains its own HNSW index and local cache, but shares Milvus for persistence and data consistency:
┌─────────────┐   ┌─────────────┐   ┌─────────────┐
│ Instance 1  │   │ Instance 2  │   │ Instance 3  │
│ HNSW Cache  │   │ HNSW Cache  │   │ HNSW Cache  │
└──────┬──────┘   └──────┬──────┘   └──────┬──────┘
       │                 │                 │
       └─────────────────┼─────────────────┘
                         │
                  ┌──────▼──────┐
                  │   Milvus    │
                  │  (Shared)   │
                  └─────────────┘