# semantic-cache
Category: cache · Cloud + Local · Status: v1 — production
## What it does

Embeds each incoming request, searches the vector index for similar past requests, and returns the cached response when similarity exceeds the threshold. On a hit, the provider call is skipped entirely.

For workloads with repeated questions in different phrasings, semantic-cache is the highest-impact module you can run. A typical customer-support bot sees a 30–50% hit rate.
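The hit/miss decision reduces to a vector comparison against a threshold. A minimal sketch, in the spirit of the pure-JS cosine fallback the Local backend can use (function names here are illustrative, not the module's actual API):

```typescript
// Cosine similarity between two embedding vectors (illustrative).
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// A request is a hit if any cached embedding clears the configured threshold.
function isHit(
  requestEmbedding: number[],
  cachedEmbeddings: number[][],
  similarity = 0.85,
): boolean {
  return cachedEmbeddings.some(
    (c) => cosineSimilarity(requestEmbedding, c) >= similarity,
  );
}
```

Raising `similarity` trades hit rate for precision, which is exactly the knob the examples below turn.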
## When to use it
- ✅ Customer support chatbots (lots of similar questions)
- ✅ Documentation Q&A
- ✅ Repetitive coding queries
- ✅ Any read-heavy workload
- ❌ Creative writing (similar prompts deserve different responses)
- ❌ Real-time data queries (stock prices, weather)
- ❌ Prompts with embedded timestamps or unique tokens
## Configuration
```yaml
semantic-cache:
  similarity: 0.85           # 0.0 – 1.0; cosine similarity threshold
  ttlSeconds: 3600           # cache lifetime
  embeddingModel: 'voyage-3'
  excludePaths: []           # request paths to skip caching (regex)
  scope: 'user'              # 'user' | 'global' — share across users?
```

## Metrics emitted

- `cache.semantic.hit` (boolean)
- `cache.semantic.score` (0.0–1.0; the best score found, even on a miss)
- `cache.semantic.lookup_ms` (number)
- `cache.semantic.write_ms` (number; post-hook only)
## Examples
High precision — almost never a false positive, low hit rate:

```yaml
semantic-cache:
  similarity: 0.95
```

Balanced default — good for most apps:

```yaml
semantic-cache:
  similarity: 0.85
  ttlSeconds: 3600
```

Aggressive — max savings, occasional false positives acceptable:

```yaml
semantic-cache:
  similarity: 0.75
  ttlSeconds: 86400
```

## How it works
1. **Pre hook:**
   - Serialize the request (system + messages, normalized).
   - Embed it.
   - `vectorSearch` against the cached embeddings table for this user (or global if `scope: global`).
   - If the best score ≥ `similarity`: return the cached response with `cache.semantic.hit = true` and skip the provider call.
   - If miss: continue the pipeline and attach `cache.semantic.score` for visibility.
2. **Post hook** (only on a miss):
   - Store the request embedding + response with `ttlSeconds`.
   - Skipped if the response was an error (`stop_reason === 'error'`).
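The pre-hook steps above can be sketched as follows. `embed` and `lookup` are hypothetical stand-ins injected for illustration (sync here for brevity; the module's real internals are not shown), and the normalization shown is a simplification:

```typescript
// Illustrative pre-hook: serialize → embed → search → hit/miss decision.
type CacheRequest = { system: string; messages: string[] };
type Match = { response: string; score: number } | null;

function preHook(
  req: CacheRequest,
  cfg: { similarity: number },
  embed: (text: string) => number[],
  lookup: (embedding: number[]) => Match,
): { hit: boolean; response?: string; score: number } {
  // 1. Serialize the request (system + messages, normalized).
  const serialized = [req.system, ...req.messages].join("\n").trim().toLowerCase();
  // 2. Embed it and search the cached embeddings for the best match.
  const best = lookup(embed(serialized));
  // 3. Hit: return the cached response; the provider call is skipped.
  if (best && best.score >= cfg.similarity) {
    return { hit: true, response: best.response, score: best.score };
  }
  // 4. Miss: continue the pipeline, attaching the best score for visibility.
  return { hit: false, score: best?.score ?? 0 };
}
```

Note that even on a miss the best score is surfaced, which is what makes the `cache.semantic.score` metric useful for tuning the threshold.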
## Cache scope
| Setting | Behavior |
|---|---|
| `scope: 'user'` (default) | Cache hits only match within the same `userId`. Each user has their own private cache. |
| `scope: 'global'` | Cache hits match across all users. Use for documentation Q&A or other public-info workloads. |
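One way to picture the scope setting is as a namespace prefix on cache lookups — a sketch under that assumption (the function name is hypothetical):

```typescript
// Illustrative: derive the cache namespace from scope + userId.
// 'user' keeps each user's cache private; 'global' shares one namespace.
function cacheNamespace(scope: "user" | "global", userId: string): string {
  return scope === "global" ? "global" : `user:${userId}`;
}
```

Under this model, a lookup only ever searches embeddings stored in the same namespace, which is why `scope: 'user'` can never leak one user's cached responses to another.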
## Streaming
Cache hits on streaming requests are replayed as a synthetic SSE stream. Your client sees `message_start` → `content_block_*` → `message_stop` events just like a real stream — only faster and free.
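A minimal sketch of what such a synthetic replay could look like, assuming the cached response is plain text and chunking it into small deltas (the event framing mirrors the names above; payload shapes are simplified):

```typescript
// Illustrative: replay a cached response as a synthetic SSE event stream.
function replayAsSSE(cachedText: string): string[] {
  const events: string[] = [];
  events.push(`event: message_start\ndata: {"type":"message_start"}\n`);
  events.push(`event: content_block_start\ndata: {"type":"content_block_start","index":0}\n`);
  // Chunk the cached text so the client sees incremental deltas, like a real stream.
  for (let i = 0; i < cachedText.length; i += 16) {
    const delta = JSON.stringify({
      type: "content_block_delta",
      index: 0,
      delta: { text: cachedText.slice(i, i + 16) },
    });
    events.push(`event: content_block_delta\ndata: ${delta}\n`);
  }
  events.push(`event: content_block_stop\ndata: {"type":"content_block_stop","index":0}\n`);
  events.push(`event: message_stop\ndata: {"type":"message_stop"}\n`);
  return events;
}
```

Because the deltas concatenate back to the full cached text, a client that accumulates stream chunks sees exactly the same final message as a live provider call.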
## Cloud vs Local
| Mode | Backend |
|---|---|
| Cloud | Postgres + pgvector (`vector(1536)`, ivfflat index) |
| Local | SQLite + sqlite-vec (or a pure-JS cosine fallback) |
Both backends store the request embedding next to the cached response so a single round trip handles lookup + retrieval.
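For the Cloud backend, a single-round-trip lookup might look like the query below. This is a sketch only — the table and column names are hypothetical; what is real is pgvector's `<=>` cosine-distance operator, so similarity is `1 - distance`:

```typescript
// Illustrative lookup query for the Postgres + pgvector backend.
// semantic_cache, namespace, expires_at, etc. are hypothetical names.
const LOOKUP_SQL = `
  SELECT response,
         1 - (embedding <=> $1::vector) AS score
  FROM semantic_cache
  WHERE namespace = $2
    AND expires_at > now()
  ORDER BY embedding <=> $1::vector
  LIMIT 1;
`;
```

Ordering by distance and taking `LIMIT 1` returns the best match and its cached response together, which is what makes lookup + retrieval a single round trip.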
Don’t set `scope: 'global'` on workloads that include user-specific data in the prompt. The serializer normalizes whitespace and casing but does not redact user IDs, names, or other PII.