# semantic-cache
Category: cache · Cloud + Local · Status: v1 — production
## What it does

Embeds each incoming request, searches the vector index for similar past requests, and returns the cached response when similarity exceeds the threshold. On a hit, the provider call is skipped entirely.

For workloads with repeated questions in different phrasings, semantic-cache is the highest-impact module you can run. A typical customer-support bot sees a 30–50% hit rate.
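The hit/miss decision reduces to a vector comparison against a threshold. A minimal sketch, in the spirit of the pure-JS cosine fallback the Local backend can use (function names here are illustrative, not the module's actual API):

```typescript
// Cosine similarity between two embedding vectors (illustrative).
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// A request is a hit if any cached embedding clears the configured threshold.
function isHit(
  requestEmbedding: number[],
  cachedEmbeddings: number[][],
  similarity = 0.85,
): boolean {
  return cachedEmbeddings.some(
    (c) => cosineSimilarity(requestEmbedding, c) >= similarity,
  );
}
```

Raising `similarity` trades hit rate for precision, which is exactly the knob the examples below turn.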
## When to use it
- ✅ Customer support chatbots (lots of similar questions)
- ✅ Documentation Q&A
- ✅ Repetitive coding queries
- ✅ Any read-heavy workload
- ❌ Creative writing (similar prompts deserve different responses)
- ❌ Real-time data queries (stock prices, weather)
- ❌ Prompts with embedded timestamps or unique tokens
## Configuration
```yaml
semantic-cache:
  similarity: 0.85           # 0.0 – 1.0; cosine similarity threshold
  ttlSeconds: 3600           # cache lifetime
  embeddingModel: 'voyage-3'
  excludePaths: []           # request paths to skip caching (regex)
  scope: 'user'              # 'user' | 'global' — share across users?
```

## Metrics emitted

- `cache.semantic.hit` (boolean)
- `cache.semantic.score` (0.0–1.0; the best score found, even on a miss)
- `cache.semantic.lookup_ms` (number)
- `cache.semantic.write_ms` (number; post-hook only)
## Examples
High precision — almost never a false positive, low hit rate:

```yaml
semantic-cache:
  similarity: 0.95
```

Balanced default — good for most apps:

```yaml
semantic-cache:
  similarity: 0.85
  ttlSeconds: 3600
```

Aggressive — max savings, occasional false positives acceptable:

```yaml
semantic-cache:
  similarity: 0.75
  ttlSeconds: 86400
```

## How it works
1. **Pre hook:**
   - Serialize the request (system + messages, normalized).
   - Embed it.
   - `vectorSearch` against the cached embeddings table for this user (or global if `scope: global`).
   - If the best score ≥ `similarity`: return the cached response with `cache.semantic.hit = true` and skip the provider call.
   - If miss: continue the pipeline and attach `cache.semantic.score` for visibility.
2. **Post hook** (only on a miss):
   - Store the request embedding + response with `ttlSeconds`.
   - Skipped if the response was an error (`stop_reason === 'error'`).
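The pre-hook steps above can be sketched as follows. `embed` and `lookup` are hypothetical stand-ins injected for illustration (sync here for brevity; the module's real internals are not shown), and the normalization shown is a simplification:

```typescript
// Illustrative pre-hook: serialize → embed → search → hit/miss decision.
type CacheRequest = { system: string; messages: string[] };
type Match = { response: string; score: number } | null;

function preHook(
  req: CacheRequest,
  cfg: { similarity: number },
  embed: (text: string) => number[],
  lookup: (embedding: number[]) => Match,
): { hit: boolean; response?: string; score: number } {
  // 1. Serialize the request (system + messages, normalized).
  const serialized = [req.system, ...req.messages].join("\n").trim().toLowerCase();
  // 2. Embed it and search the cached embeddings for the best match.
  const best = lookup(embed(serialized));
  // 3. Hit: return the cached response; the provider call is skipped.
  if (best && best.score >= cfg.similarity) {
    return { hit: true, response: best.response, score: best.score };
  }
  // 4. Miss: continue the pipeline, attaching the best score for visibility.
  return { hit: false, score: best?.score ?? 0 };
}
```

Note that even on a miss the best score is surfaced, which is what makes the `cache.semantic.score` metric useful for tuning the threshold.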
## Cache scope
| Setting | Behavior |
|---|---|
| `scope: 'user'` (default) | Cache hits only match within the same `userId`. Each user has their own private cache. |
| `scope: 'global'` | Cache hits match across all users. Use for documentation Q&A or other public-info workloads. |
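One way to picture the scope setting is as a namespace prefix on cache lookups — a sketch under that assumption (the function name is hypothetical):

```typescript
// Illustrative: derive the cache namespace from scope + userId.
// 'user' keeps each user's cache private; 'global' shares one namespace.
function cacheNamespace(scope: "user" | "global", userId: string): string {
  return scope === "global" ? "global" : `user:${userId}`;
}
```

Under this model, a lookup only ever searches embeddings stored in the same namespace, which is why `scope: 'user'` can never leak one user's cached responses to another.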
## Streaming
Cache hits on streaming requests are replayed as a synthetic SSE stream. Your client sees `message_start` → `content_block_*` → `message_stop` events just like a real stream — only faster and free.
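A minimal sketch of what such a synthetic replay could look like, assuming the cached response is plain text and chunking it into small deltas (the event framing mirrors the names above; payload shapes are simplified):

```typescript
// Illustrative: replay a cached response as a synthetic SSE event stream.
function replayAsSSE(cachedText: string): string[] {
  const events: string[] = [];
  events.push(`event: message_start\ndata: {"type":"message_start"}\n`);
  events.push(`event: content_block_start\ndata: {"type":"content_block_start","index":0}\n`);
  // Chunk the cached text so the client sees incremental deltas, like a real stream.
  for (let i = 0; i < cachedText.length; i += 16) {
    const delta = JSON.stringify({
      type: "content_block_delta",
      index: 0,
      delta: { text: cachedText.slice(i, i + 16) },
    });
    events.push(`event: content_block_delta\ndata: ${delta}\n`);
  }
  events.push(`event: content_block_stop\ndata: {"type":"content_block_stop","index":0}\n`);
  events.push(`event: message_stop\ndata: {"type":"message_stop"}\n`);
  return events;
}
```

Because the deltas concatenate back to the full cached text, a client that accumulates stream chunks sees exactly the same final message as a live provider call.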
## Cloud vs Local
| Mode | Backend |
|---|---|
| Cloud | Postgres + pgvector (`vector(1536)`, ivfflat index) |
| Local | SQLite + sqlite-vec (or a pure-JS cosine fallback) |
Both backends store the request embedding next to the cached response so a single round trip handles lookup + retrieval.
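For the Cloud backend, a single-round-trip lookup might look like the query below. This is a sketch only — the table and column names are hypothetical; what is real is pgvector's `<=>` cosine-distance operator, so similarity is `1 - distance`:

```typescript
// Illustrative lookup query for the Postgres + pgvector backend.
// semantic_cache, namespace, expires_at, etc. are hypothetical names.
const LOOKUP_SQL = `
  SELECT response,
         1 - (embedding <=> $1::vector) AS score
  FROM semantic_cache
  WHERE namespace = $2
    AND expires_at > now()
  ORDER BY embedding <=> $1::vector
  LIMIT 1;
`;
```

Ordering by distance and taking `LIMIT 1` returns the best match and its cached response together, which is what makes lookup + retrieval a single round trip.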
Don’t set `scope: 'global'` on workloads that include user-specific data in the prompt. The serializer normalizes whitespace and casing but does not redact user IDs, names, or other PII.