
semantic-cache

Category: cache · Cloud + Local · Status: v1 — production

Embeds each incoming request, searches the vector index for similar past requests, and returns the cached response if similarity exceeds the threshold. On a hit, the provider call is skipped entirely.

What it does

For workloads with repeated questions in different phrasings, semantic-cache is the highest-impact module you can run. A typical customer-support bot sees a 30–50% hit rate.

When to use it

✅ Customer support chatbots (lots of similar questions)
✅ Documentation Q&A
✅ Repetitive coding queries
✅ Any read-heavy workload

❌ Creative writing (similar prompts deserve different responses)
❌ Real-time data queries (stock price, weather)
❌ Prompts with embedded timestamps or unique tokens

Configuration

```yaml
semantic-cache:
  similarity: 0.85            # 0.0 - 1.0; cosine similarity threshold
  ttlSeconds: 3600            # cache lifetime
  embeddingModel: 'voyage-3'
  excludePaths: []            # request paths to skip caching (regex)
  scope: 'user'               # 'user' | 'global' — share across users?
```

Metrics emitted

  • cache.semantic.hit (boolean)
  • cache.semantic.score (0.0–1.0; the best score found, even on miss)
  • cache.semantic.lookup_ms (number)
  • cache.semantic.write_ms (number; post-hook only)
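These metric names can feed a dashboard or alert. As a sketch, here is one way a consumer might derive a hit rate from a stream of emitted events; the `MetricEvent` shape is an assumption for illustration, not the module's actual event format:

```typescript
// Hypothetical event shape — the real emitted format is not specified here.
interface MetricEvent {
  name: string;
  value: number | boolean;
}

// Fraction of lookups that were hits, derived from cache.semantic.hit events.
function hitRate(events: MetricEvent[]): number {
  const hits = events.filter((e) => e.name === "cache.semantic.hit");
  if (hits.length === 0) return 0;
  const positive = hits.filter((e) => e.value === true).length;
  return positive / hits.length;
}
```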

Examples

High precision — almost never false-positive, low hit rate:

```yaml
semantic-cache:
  similarity: 0.95
```

Balanced default — good for most apps:

```yaml
semantic-cache:
  similarity: 0.85
  ttlSeconds: 3600
```

Aggressive — max savings, occasional false positives acceptable:

```yaml
semantic-cache:
  similarity: 0.75
  ttlSeconds: 86400
```

How it works

  1. Pre hook:

    • Serialize the request (system + messages, normalized).
    • Embed it.
    • vectorSearch against the cached embeddings table for this user (or global if scope: global).
    • If best score ≥ similarity: return cached response with cache.semantic.hit = true. Skip provider call.
    • If miss: continue pipeline, attach cache.semantic.score for visibility.
  2. Post hook (only on miss):

    • Store the request embedding + response with ttlSeconds.
    • Skipped if the response was an error (stop_reason === 'error').
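The pre-hook decision above can be sketched in TypeScript. This is a minimal in-memory version assuming a plain array index and brute-force cosine scoring — the real module delegates the search to pgvector or sqlite-vec, and the names here are illustrative:

```typescript
interface CacheEntry {
  embedding: number[];
  response: string;
  expiresAt: number; // epoch ms, derived from ttlSeconds at write time
}

// Cosine similarity between two equal-length vectors.
function cosine(a: number[], b: number[]): number {
  let dot = 0;
  let na = 0;
  let nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

// Returns the cached response on a hit; on a miss, returns the best
// score found so it can be attached as cache.semantic.score.
function lookup(
  queryEmbedding: number[],
  index: CacheEntry[],
  similarity: number,
  now = Date.now(),
): { hit: boolean; score: number; response?: string } {
  let best: CacheEntry | undefined;
  let bestScore = 0;
  for (const entry of index) {
    if (entry.expiresAt <= now) continue; // expired by ttlSeconds
    const score = cosine(queryEmbedding, entry.embedding);
    if (score > bestScore) {
      bestScore = score;
      best = entry;
    }
  }
  if (best && bestScore >= similarity) {
    return { hit: true, score: bestScore, response: best.response };
  }
  return { hit: false, score: bestScore };
}
```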

Cache scope

| Setting | Behavior |
| --- | --- |
| `scope: 'user'` (default) | Cache hits only match within the same `userId`. Each user has their own private cache. |
| `scope: 'global'` | Cache hits match across all users. Use for documentation Q&A or other public-info workloads. |
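The scope setting effectively narrows which rows the vector search may consider. A minimal sketch, assuming an illustrative row shape (the real backends filter in SQL rather than in application code):

```typescript
interface IndexedRow {
  userId: string;
  embedding: number[];
}

// With scope 'user', only the caller's own rows are candidates;
// with scope 'global', every cached row is.
function candidates(
  rows: IndexedRow[],
  scope: "user" | "global",
  userId: string,
): IndexedRow[] {
  return scope === "global" ? rows : rows.filter((r) => r.userId === userId);
}
```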

Streaming

Cache hits on streaming requests are replayed as a synthetic SSE stream. Your client sees message_start → content_block_* → message_stop events just like a real stream — only faster and free.
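A synthetic replay can be sketched as a generator that chunks the cached text into delta events. The event names follow the sequence described above; the chunk size and exact payload fields here are illustrative assumptions, not the module's wire format:

```typescript
// Replay a cached text response as a synthetic SSE stream.
function* replayAsSse(cachedText: string, chunkSize = 64): Generator<string> {
  const frame = (event: string, data: object) =>
    `event: ${event}\ndata: ${JSON.stringify(data)}\n\n`;
  yield frame("message_start", { type: "message_start" });
  yield frame("content_block_start", { type: "content_block_start", index: 0 });
  for (let i = 0; i < cachedText.length; i += chunkSize) {
    yield frame("content_block_delta", {
      type: "content_block_delta",
      index: 0,
      delta: { type: "text_delta", text: cachedText.slice(i, i + chunkSize) },
    });
  }
  yield frame("content_block_stop", { type: "content_block_stop", index: 0 });
  yield frame("message_stop", { type: "message_stop" });
}
```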

Cloud vs Local

| Mode | Backend |
| --- | --- |
| Cloud | Postgres + pgvector (`vector(1536)`, ivfflat index) |
| Local | SQLite + sqlite-vec (or pure-JS cosine fallback) |

Both backends store the request embedding next to the cached response so a single round trip handles lookup + retrieval.

Don’t set scope: 'global' on workloads that include user-specific data in the prompt. The serializer normalizes whitespace + casing but does not redact user IDs, names, or PII.
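To make the caveat concrete, whitespace-and-casing normalization amounts to something like the sketch below (the real serializer's exact rules are not specified here). Note that identifiers pass through untouched, which is exactly why `scope: 'global'` is unsafe for prompts containing user-specific data:

```typescript
// Normalize whitespace and casing only — deliberately does NOT redact
// user IDs, names, or other PII embedded in the prompt.
function normalize(text: string): string {
  return text.trim().replace(/\s+/g, " ").toLowerCase();
}
```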

Source

packages/modules-core/src/semantic-cache.ts
