Hybrid Cache: HNSW + Milvus

The Hybrid Cache combines an in-memory HNSW index for fast search with a Milvus vector database for scalable, persistent storage.

Overview

The hybrid architecture provides:

Fast search via in-memory HNSW index
Scalable storage via Milvus vector database
Persistence with Milvus as the source of truth
Hot data caching with local document cache

Architecture

┌──────────────────────────────────────────────────┐
│                  Hybrid Cache                     │
├──────────────────────────────────────────────────┤
│  ┌─────────────────┐      ┌──────────────────┐  │
│  │  In-Memory      │      │   Local Cache    │  │
│  │  HNSW Index     │◄─────┤   (Hot Data)     │  │
│  └────────┬────────┘      └──────────────────┘  │
│           │                                       │
│           │ ID Mapping                           │
│           ▼                                       │
│  ┌──────────────────────────────────────────┐   │
│  │         Milvus Vector Database           │   │
│  └──────────────────────────────────────────┘   │
└──────────────────────────────────────────────────┘

How It Works

Write Path (AddEntry)

When adding a cache entry:

Generate embedding using the configured embedding model
Write entry to Milvus for persistence
Add entry to in-memory HNSW index (if space is available)
Add document to local cache

Read Path (FindSimilar)

When searching for a similar query:

Generate query embedding
Search HNSW index for nearest neighbors
Check local cache for matching documents
- If found in local cache: return immediately (hot path)
- If not found: fetch from Milvus (cold path)
Cache fetched documents in local cache for future queries

Memory Management

HNSW Index: Limited to a configured maximum number of entries
Local Cache: Limited to a configured number of documents
Eviction: FIFO policy when limits are reached
Data Persistence: All data remains in Milvus regardless of memory limits

Configuration

Basic Configuration

semantic_cache:
  enabled: true
  backend_type: "hybrid"
  similarity_threshold: 0.85
  ttl_seconds: 3600
  
  # Hybrid-specific settings
  max_memory_entries: 100000  # Max entries in HNSW
  local_cache_size: 1000      # Local document cache size
  
  # HNSW parameters
  hnsw_m: 16
  hnsw_ef_construction: 200
  
  # Milvus configuration
  backend_config_path: "config/milvus.yaml"

Configuration Parameters

Parameter	Type	Default	Description
`backend_type`	string	-	Must be `"hybrid"`
`similarity_threshold`	float	0.85	Minimum similarity for cache hit
`max_memory_entries`	int	100000	Max entries in HNSW index
`local_cache_size`	int	1000	Hot document cache size
`hnsw_m`	int	16	HNSW bi-directional links
`hnsw_ef_construction`	int	200	HNSW construction quality
`backend_config_path`	string	-	Path to Milvus config file

Milvus Configuration

Create config/milvus.yaml:

milvus:
  address: "localhost:19530"
  collection_name: "semantic_cache"
  dimension: 384
  index_type: "HNSW"
  metric_type: "IP"
  params:
    M: 16
    efConstruction: 200

Example Usage

Go Code

import "github.com/vllm-project/semantic-router/src/semantic-router/pkg/cache"

// Initialize hybrid cache
options := cache.HybridCacheOptions{
    Enabled:             true,
    SimilarityThreshold: 0.85,
    TTLSeconds:          3600,
    MaxMemoryEntries:    100000,
    HNSWM:               16,
    HNSWEfConstruction:  200,
    MilvusConfigPath:    "config/milvus.yaml",
    LocalCacheSize:      1000,
}

hybridCache, err := cache.NewHybridCache(options)
if err != nil {
    log.Fatalf("Failed to create hybrid cache: %v", err)
}
defer hybridCache.Close()

// Add cache entry
err = hybridCache.AddEntry(
    "request-id-123",
    "gpt-4",
    "What is quantum computing?",
    []byte(`{"prompt": "What is quantum computing?"}`),
    []byte(`{"response": "Quantum computing is..."}`),
)

// Search for similar query
response, found, err := hybridCache.FindSimilar(
    "gpt-4",
    "Explain quantum computers",
)
if found {
    fmt.Printf("Cache hit! Response: %s\n", string(response))
}

// Get statistics
stats := hybridCache.GetStats()
fmt.Printf("Total entries in HNSW: %d\n", stats.TotalEntries)
fmt.Printf("Hit ratio: %.2f%%\n", stats.HitRatio * 100)

Monitoring and Metrics

The hybrid cache exposes metrics for monitoring:

stats := hybridCache.GetStats()

// Available metrics
stats.TotalEntries  // Entries in HNSW index
stats.HitCount      // Total cache hits
stats.MissCount     // Total cache misses
stats.HitRatio      // Hit ratio (0.0 - 1.0)

Prometheus Metrics

# Cache entries in HNSW
semantic_cache_entries{backend="hybrid"}

# Cache operations
semantic_cache_operations_total{backend="hybrid",operation="find_similar",status="hit_local"}
semantic_cache_operations_total{backend="hybrid",operation="find_similar",status="hit_milvus"}
semantic_cache_operations_total{backend="hybrid",operation="find_similar",status="miss"}

# Cache hit ratio
semantic_cache_hit_ratio{backend="hybrid"}

Multi-Instance Deployment

The hybrid cache supports multi-instance deployments where each instance maintains its own HNSW index and local cache, but shares Milvus for persistence and data consistency:

┌─────────────┐   ┌─────────────┐   ┌─────────────┐
│  Instance 1 │   │  Instance 2 │   │  Instance 3 │
│  HNSW Cache │   │  HNSW Cache │   │  HNSW Cache │
└──────┬──────┘   └──────┬──────┘   └──────┬──────┘
       │                 │                 │
       └─────────────────┼─────────────────┘
                         │
                  ┌──────▼──────┐
                  │   Milvus    │
                  │  (Shared)   │
                  └─────────────┘

Overview​

Architecture​

How It Works​

Write Path (AddEntry)​

Read Path (FindSimilar)​

Memory Management​

Configuration​

Basic Configuration​

Configuration Parameters​

Milvus Configuration​

Example Usage​

Go Code​

Monitoring and Metrics​

Prometheus Metrics​

Multi-Instance Deployment​

See Also​