
Hybrid Cache: HNSW + Milvus

The Hybrid Cache combines an in-memory HNSW index for fast search with a Milvus vector database for scalable, persistent storage.

Overview

The hybrid architecture provides:

  • Fast search via in-memory HNSW index
  • Scalable storage via Milvus vector database
  • Persistence with Milvus as the source of truth
  • Hot data caching with local document cache

Architecture

┌──────────────────────────────────────────────────┐
│                   Hybrid Cache                   │
├──────────────────────────────────────────────────┤
│  ┌─────────────────┐      ┌──────────────────┐   │
│  │   In-Memory     │      │   Local Cache    │   │
│  │   HNSW Index    │◄─────┤   (Hot Data)     │   │
│  └────────┬────────┘      └──────────────────┘   │
│           │                                      │
│           │ ID Mapping                           │
│           ▼                                      │
│  ┌──────────────────────────────────────────┐    │
│  │          Milvus Vector Database          │    │
│  └──────────────────────────────────────────┘    │
└──────────────────────────────────────────────────┘

How It Works

Write Path (AddEntry)

When adding a cache entry:

  1. Generate embedding using the configured embedding model
  2. Write entry to Milvus for persistence
  3. Add entry to in-memory HNSW index (if space is available)
  4. Add document to local cache
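
The four write steps can be sketched as follows. This is a minimal illustration under stated assumptions, not the project's implementation: `entry`, `fakeMilvus`, `hybrid`, `newHybrid`, and the toy embedding function are hypothetical stand-ins, and a plain map takes the place of the HNSW graph.

```go
package main

import (
	"errors"
	"fmt"
)

// entry is a hypothetical cache record; the real schema is not shown here.
type entry struct {
	ID        string
	Query     string
	Response  []byte
	Embedding []float32
}

// fakeMilvus stands in for the Milvus collection (the source of truth).
type fakeMilvus struct{ rows map[string]entry }

func (m *fakeMilvus) insert(e entry) error {
	if e.ID == "" {
		return errors.New("entry ID must not be empty")
	}
	m.rows[e.ID] = e
	return nil
}

// hybrid wires the three tiers together. The map-based index is a
// placeholder for a real HNSW graph.
type hybrid struct {
	embed      func(text string) ([]float32, error)
	milvus     *fakeMilvus
	index      map[string][]float32
	maxEntries int
	local      map[string]entry
}

func newHybrid(maxEntries int) *hybrid {
	return &hybrid{
		// Toy embedding so the sketch runs without a model.
		embed: func(text string) ([]float32, error) {
			return []float32{float32(len(text)), 1}, nil
		},
		milvus:     &fakeMilvus{rows: map[string]entry{}},
		index:      map[string][]float32{},
		maxEntries: maxEntries,
		local:      map[string]entry{},
	}
}

func (h *hybrid) addEntry(id, query string, response []byte) error {
	vec, err := h.embed(query) // 1. generate embedding
	if err != nil {
		return fmt.Errorf("embed query: %w", err)
	}
	e := entry{ID: id, Query: query, Response: response, Embedding: vec}
	if err := h.milvus.insert(e); err != nil { // 2. persist first
		return fmt.Errorf("write to Milvus: %w", err)
	}
	if len(h.index) < h.maxEntries { // 3. index only if space is available
		h.index[id] = vec
	}
	h.local[id] = e // 4. keep the document hot (size limit omitted here)
	return nil
}

func main() {
	h := newHybrid(1)
	for _, id := range []string{"req-1", "req-2"} {
		if err := h.addEntry(id, "What is quantum computing?", []byte("...")); err != nil {
			fmt.Println("add failed:", err)
		}
	}
	fmt.Printf("milvus=%d hnsw=%d local=%d\n", len(h.milvus.rows), len(h.index), len(h.local))
}
```

Writing to the persistent store before touching the in-memory tiers means a failed write leaves no entry in memory that is missing from Milvus.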

Read Path (FindSimilar)

When searching for a similar query:

  1. Generate query embedding
  2. Search HNSW index for nearest neighbors
  3. Check local cache for matching documents
    • If found in local cache: return immediately (hot path)
    • If not found: fetch from Milvus (cold path)
  4. Cache fetched documents in local cache for future queries
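
A minimal sketch of the hot and cold paths is shown below. The `reader` type, its `fetch` function (standing in for a Milvus lookup by ID), and the brute-force scan that replaces HNSW search are all assumptions made for illustration. The query embedding is passed in precomputed, so step 1 is omitted, and the status strings reuse the label values listed under Prometheus Metrics.

```go
package main

import "fmt"

type document struct {
	ID       string
	Response []byte
}

// reader holds the in-memory tiers plus a hypothetical fetch function
// standing in for a Milvus lookup by ID.
type reader struct {
	vectors   map[string][]float32 // placeholder for the HNSW index
	local     map[string]document  // hot document cache
	fetch     func(id string) (document, bool)
	threshold float32
}

func dot(a, b []float32) float32 {
	var sum float32
	for i := 0; i < len(a) && i < len(b); i++ {
		sum += a[i] * b[i]
	}
	return sum
}

// findSimilar takes a precomputed query embedding and returns the response
// plus a status: "hit_local", "hit_milvus", or "miss".
func (r *reader) findSimilar(query []float32) ([]byte, string) {
	bestID, bestScore, found := "", float32(0), false
	for id, vec := range r.vectors { // 2. brute-force scan in place of HNSW
		if score := dot(query, vec); !found || score > bestScore {
			bestID, bestScore, found = id, score, true
		}
	}
	if !found || bestScore < r.threshold {
		return nil, "miss"
	}
	if doc, ok := r.local[bestID]; ok { // 3a. hot path
		return doc.Response, "hit_local"
	}
	doc, ok := r.fetch(bestID) // 3b. cold path
	if !ok {
		return nil, "miss"
	}
	r.local[bestID] = doc // 4. cache for future queries
	return doc.Response, "hit_milvus"
}

func main() {
	r := &reader{
		vectors: map[string][]float32{"doc-1": {1, 0}},
		local:   map[string]document{},
		fetch: func(id string) (document, bool) {
			return document{ID: id, Response: []byte("cached answer")}, true
		},
		threshold: 0.85,
	}
	for i := 0; i < 2; i++ {
		resp, status := r.findSimilar([]float32{1, 0})
		fmt.Printf("%s: %s\n", status, resp)
	}
}
```

In this sketch the first lookup takes the cold path and the second is served from the local cache, because step 4 stored the fetched document.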

Memory Management

  • HNSW Index: Limited to a configured maximum number of entries
  • Local Cache: Limited to a configured number of documents
  • Eviction: FIFO policy when limits are reached
  • Data Persistence: All data remains in Milvus regardless of memory limits
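
To make the FIFO policy concrete, here is a small sketch of a bounded cache that evicts the oldest inserted key when full. The `fifoCache` type and its methods are hypothetical and are not taken from the project's code.

```go
package main

import "fmt"

// fifoCache evicts entries strictly in insertion order once capacity is hit.
type fifoCache struct {
	capacity int
	order    []string // keys, oldest first
	items    map[string][]byte
}

func newFIFOCache(capacity int) *fifoCache {
	return &fifoCache{capacity: capacity, items: make(map[string][]byte)}
}

func (c *fifoCache) put(key string, value []byte) {
	if c.capacity <= 0 {
		return
	}
	if _, exists := c.items[key]; exists {
		c.items[key] = value // update in place; position in the queue is unchanged
		return
	}
	if len(c.order) >= c.capacity {
		oldest := c.order[0]
		c.order = c.order[1:]
		delete(c.items, oldest)
	}
	c.order = append(c.order, key)
	c.items[key] = value
}

// get does not touch the eviction order; that is what separates FIFO from LRU.
func (c *fifoCache) get(key string) ([]byte, bool) {
	v, ok := c.items[key]
	return v, ok
}

func main() {
	c := newFIFOCache(2)
	c.put("a", []byte("1"))
	c.put("b", []byte("2"))
	c.get("a") // reading "a" does not protect it from eviction
	c.put("c", []byte("3"))
	_, hasA := c.get("a")
	_, hasB := c.get("b")
	fmt.Println(hasA, hasB) // prints "false true"
}
```

Because evicted entries still exist in Milvus, an eviction here costs only a later cold-path fetch, not data loss.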

Configuration

Basic Configuration

semantic_cache:
  enabled: true
  backend_type: "hybrid"
  similarity_threshold: 0.85
  ttl_seconds: 3600

  # Hybrid-specific settings
  max_memory_entries: 100000  # Max entries in HNSW
  local_cache_size: 1000      # Local document cache size

  # HNSW parameters
  hnsw_m: 16
  hnsw_ef_construction: 200

  # Milvus configuration
  backend_config_path: "config/milvus.yaml"

Configuration Parameters

Parameter             Type    Default  Description
--------------------  ------  -------  --------------------------------
backend_type          string  -        Must be "hybrid"
similarity_threshold  float   0.85     Minimum similarity for cache hit
max_memory_entries    int     100000   Max entries in HNSW index
local_cache_size      int     1000     Hot document cache size
hnsw_m                int     16       HNSW bi-directional links
hnsw_ef_construction  int     200      HNSW construction quality
backend_config_path   string  -        Path to Milvus config file

Milvus Configuration

Create config/milvus.yaml:

milvus:
  address: "localhost:19530"
  collection_name: "semantic_cache"
  dimension: 384
  index_type: "HNSW"
  metric_type: "IP"
  params:
    M: 16
    efConstruction: 200
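
The "IP" metric scores vectors by inner product. For vectors scaled to unit length, the inner product equals cosine similarity, which is one way to read a `similarity_threshold` of 0.85. This document does not state whether embeddings are normalized, so the sketch below assumes they are and normalizes explicitly. `normalize`, `innerProduct`, and `isCacheHit` are hypothetical helper names.

```go
package main

import (
	"fmt"
	"math"
)

// normalize returns a unit-length copy of v (or all zeros for a zero vector).
func normalize(v []float32) []float32 {
	var sumSquares float64
	for _, x := range v {
		sumSquares += float64(x) * float64(x)
	}
	out := make([]float32, len(v))
	norm := math.Sqrt(sumSquares)
	if norm == 0 {
		return out
	}
	for i, x := range v {
		out[i] = float32(float64(x) / norm)
	}
	return out
}

// innerProduct is the score the "IP" metric ranks by.
func innerProduct(a, b []float32) float32 {
	var sum float32
	for i := 0; i < len(a) && i < len(b); i++ {
		sum += a[i] * b[i]
	}
	return sum
}

// isCacheHit applies the threshold to the inner product of unit vectors,
// which is the cosine similarity of the original vectors.
func isCacheHit(query, cached []float32, threshold float32) bool {
	return innerProduct(normalize(query), normalize(cached)) >= threshold
}

func main() {
	q := []float32{1, 0}
	fmt.Println(isCacheHit(q, []float32{2, 0}, 0.85)) // same direction
	fmt.Println(isCacheHit(q, []float32{1, 1}, 0.85)) // 45 degrees apart
}
```

Without normalization, inner product also grows with vector magnitude, so a fixed threshold such as 0.85 would not have a stable meaning.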

Example Usage

Go Code

package main

import (
	"fmt"
	"log"

	"github.com/vllm-project/semantic-router/src/semantic-router/pkg/cache"
)

func main() {
	// Initialize hybrid cache
	options := cache.HybridCacheOptions{
		Enabled:             true,
		SimilarityThreshold: 0.85,
		TTLSeconds:          3600,
		MaxMemoryEntries:    100000,
		HNSWM:               16,
		HNSWEfConstruction:  200,
		MilvusConfigPath:    "config/milvus.yaml",
		LocalCacheSize:      1000,
	}

	hybridCache, err := cache.NewHybridCache(options)
	if err != nil {
		log.Fatalf("Failed to create hybrid cache: %v", err)
	}
	defer hybridCache.Close()

	// Add cache entry
	err = hybridCache.AddEntry(
		"request-id-123",
		"gpt-4",
		"What is quantum computing?",
		[]byte(`{"prompt": "What is quantum computing?"}`),
		[]byte(`{"response": "Quantum computing is..."}`),
	)
	if err != nil {
		log.Printf("Failed to add cache entry: %v", err)
	}

	// Search for similar query
	response, found, err := hybridCache.FindSimilar(
		"gpt-4",
		"Explain quantum computers",
	)
	if err != nil {
		log.Printf("Cache lookup failed: %v", err)
	} else if found {
		fmt.Printf("Cache hit! Response: %s\n", string(response))
	}

	// Get statistics
	stats := hybridCache.GetStats()
	fmt.Printf("Total entries in HNSW: %d\n", stats.TotalEntries)
	fmt.Printf("Hit ratio: %.2f%%\n", stats.HitRatio*100)
}

Monitoring and Metrics

The hybrid cache exposes metrics for monitoring:

stats := hybridCache.GetStats()

// Available metrics
stats.TotalEntries // Entries in HNSW index
stats.HitCount // Total cache hits
stats.MissCount // Total cache misses
stats.HitRatio // Hit ratio (0.0 - 1.0)

Prometheus Metrics

# Cache entries in HNSW
semantic_cache_entries{backend="hybrid"}

# Cache operations
semantic_cache_operations_total{backend="hybrid",operation="find_similar",status="hit_local"}
semantic_cache_operations_total{backend="hybrid",operation="find_similar",status="hit_milvus"}
semantic_cache_operations_total{backend="hybrid",operation="find_similar",status="miss"}

# Cache hit ratio
semantic_cache_hit_ratio{backend="hybrid"}
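
The three `find_similar` statuses can be combined into a hit ratio and a hot-path share. The sketch below assumes the ratio is defined as hits divided by total lookups; how the exported gauge is actually computed is not stated here. The `lookupCounters` type is hypothetical.

```go
package main

import "fmt"

// lookupCounters mirrors the three find_similar status values.
type lookupCounters struct {
	HitLocal  uint64
	HitMilvus uint64
	Miss      uint64
}

// hitRatio is (local hits + Milvus hits) / all lookups, in the range 0.0 to 1.0.
func (c lookupCounters) hitRatio() float64 {
	total := c.HitLocal + c.HitMilvus + c.Miss
	if total == 0 {
		return 0
	}
	return float64(c.HitLocal+c.HitMilvus) / float64(total)
}

// localShare is the fraction of hits served from the local cache (hot path).
func (c lookupCounters) localShare() float64 {
	hits := c.HitLocal + c.HitMilvus
	if hits == 0 {
		return 0
	}
	return float64(c.HitLocal) / float64(hits)
}

func main() {
	c := lookupCounters{HitLocal: 3, HitMilvus: 1, Miss: 4}
	fmt.Printf("hit ratio: %.2f, served locally: %.2f\n", c.hitRatio(), c.localShare())
}
```

A low local share alongside a healthy hit ratio suggests that most hits are taking the cold path, in which case a larger `local_cache_size` may be worth testing.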

Multi-Instance Deployment

The hybrid cache supports multi-instance deployments where each instance maintains its own HNSW index and local cache, but shares Milvus for persistence and data consistency:

┌─────────────┐   ┌─────────────┐   ┌─────────────┐
│ Instance 1  │   │ Instance 2  │   │ Instance 3  │
│ HNSW Cache  │   │ HNSW Cache  │   │ HNSW Cache  │
└──────┬──────┘   └──────┬──────┘   └──────┬──────┘
       │                 │                 │
       └─────────────────┼─────────────────┘
                         │
                  ┌──────▼──────┐
                  │   Milvus    │
                  │  (Shared)   │
                  └─────────────┘

See Also