# Budget-first
For workloads where the cost line on your invoice matters more than getting the very best response. Aggressive caching, tight cost caps, no fancy injection.
## What this pipeline is good at
- Cuts the LLM bill by 50–80% on most workloads.
- Provides a hard ceiling — your bill cannot exceed the daily cap.
- (v1.1) Routes to the cheapest model that’s likely to handle each task.
## The pipeline

Env var:

```
PRXY_PIPE='exact-cache,semantic-cache,cost-guard'
```

When v1.1 ships, add `router` and `prompt-optimizer`:

```
PRXY_PIPE='exact-cache,semantic-cache,prompt-optimizer,cost-guard,router'
```

## Why this order
- Both caches first — never burn money on something you’ve answered.
- `cost-guard` last (in pre) — gates anything that survives the caches.
- (v1.1) `prompt-optimizer` before `cost-guard` — restructure for the max provider cache discount, then check whether the (now-cheaper) request fits the budget.
- (v1.1) `router` last — picks the cheapest model when nothing else short-circuited. (The full ordering is sketched below.)
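If it helps to see that ordering as code, here is a minimal sketch of the pre stage. The function and parameter names are assumptions made for this page, not prxy.monster's internals; only the stage order mirrors the pipeline above.

```python
# Illustrative sketch of the pre-stage ordering -- not prxy.monster's actual
# internals. Each stage either short-circuits with a cached response, blocks
# the request, or hands it to the next stage.

def run_pre_stages(prompt, est_cost_usd, exact_cache, semantic_cache,
                   spent_today_usd, daily_cap_usd):
    # 1. exact-cache: byte-identical prompt seen before -> free answer.
    if prompt in exact_cache:
        return exact_cache[prompt]

    # 2. semantic-cache: near-duplicate above the similarity threshold ->
    #    reuse the cached answer (see the trade-off section below).
    hit = semantic_cache.lookup(prompt)
    if hit is not None:
        return hit

    # 3. cost-guard last: only requests that survived both caches count
    #    against the daily cap.
    if spent_today_usd + est_cost_usd > daily_cap_usd:
        raise RuntimeError("daily cap reached; request blocked")

    # (v1.1) prompt-optimizer would run just before cost-guard, and router
    # would pick the cheapest capable model for whatever reaches this point.
    return None  # nothing short-circuited: forward to the provider
```

Each stage only ever sees the traffic the cheaper stages in front of it couldn't absorb, which is the whole point of the ordering.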
## Trade-off: aggressive caching means occasional false positives
A semantic-cache similarity threshold of 0.78 will return a cached response for paraphrases that aren’t quite identical. For workloads where “good enough” beats “perfect”, this is the right knob. For workloads where it isn’t (creative writing, real-time data), bump the threshold to 0.92+ or remove `semantic-cache` entirely.
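As a concrete picture of what that threshold does, here is a small sketch of a similarity-gated lookup. The embedding function and in-memory store are assumptions made for the example; only the 0.78 and 0.92+ values come from the text above.

```python
import math

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

class SemanticCache:
    def __init__(self, embed_fn, threshold=0.78):
        self.embed_fn = embed_fn    # any sentence-embedding model (assumption)
        self.threshold = threshold  # 0.78 = aggressive, 0.92+ = conservative
        self.entries = []           # (embedding, cached_response) pairs

    def lookup(self, prompt):
        if not self.entries:
            return None
        query = self.embed_fn(prompt)
        emb, response = max(self.entries, key=lambda e: cosine(query, e[0]))
        # Close enough to a previous prompt: serve the cached answer even
        # though the wording isn't identical. This is the false-positive risk.
        return response if cosine(query, emb) >= self.threshold else None

    def store(self, prompt, response):
        self.entries.append((self.embed_fn(prompt), response))
```

Raising `threshold` toward 0.92+ trades hit rate for precision, which is exactly the knob the paragraph above describes.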
## Cost math
10,000 requests/month, average $0.015 each:
| Without prxy.monster | With this pipeline |
|---|---|
| $150/mo | $30–60/mo (at a 60–80% cache hit rate) |
Plus the cap: a runaway bug in your code costs at most $25 (the daily cap), not $5,000.
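The same arithmetic written out — the request volume, per-request cost, and cap come from this page; the hit rates are the 60–80% range above:

```python
requests_per_month = 10_000
avg_cost_usd = 0.015

baseline = requests_per_month * avg_cost_usd             # $150/mo without prxy.monster

for hit_rate in (0.60, 0.70, 0.80):
    paid_requests = requests_per_month * (1 - hit_rate)  # requests that reach a provider
    print(f"{hit_rate:.0%} cache hit rate -> ${paid_requests * avg_cost_usd:.0f}/mo")
# 60% -> $60/mo, 70% -> $45/mo, 80% -> $30/mo

# The hard ceiling: cost-guard stops forwarding requests once the day's spend
# reaches the cap, so the worst case for one bad day is the cap itself.
daily_cap_usd = 25
print(f"worst case from a bug: ${daily_cap_usd}/day, not $5,000")
```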