# Budget-first
For workloads where the cost line on your invoice matters more than getting the very best response. Aggressive caching, tight cost caps, no fancy injection.
## What this pipeline is good at
- Cuts the LLM bill by 50–80% on most workloads.
- Provides a hard ceiling — your bill cannot exceed the daily cap.
- (v1.1) Routes to the cheapest model that’s likely to handle each task.
## The pipeline

Env var:

```
PRXY_PIPE='exact-cache,semantic-cache,cost-guard'
```

When v1.1 ships, add `router` and `prompt-optimizer`:

```
PRXY_PIPE='exact-cache,semantic-cache,prompt-optimizer,cost-guard,router'
```

## Why this order
- Both caches first — never burn money on something you’ve answered.
- `cost-guard` last (in pre) — gates anything that survives the caches.
- (v1.1) `prompt-optimizer` before `cost-guard` — restructure for the max provider cache discount, then check whether the (now-cheaper) request fits the budget.
- (v1.1) `router` last — picks the cheapest model when nothing else short-circuited. (The full ordering is sketched below.)
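If it helps to see that ordering as code, here is a minimal sketch of the pre stage. The function and parameter names are assumptions made for this page, not prxy.monster's internals; only the stage order mirrors the pipeline above.

```python
# Illustrative sketch of the pre-stage ordering -- not prxy.monster's actual
# internals. Each stage either short-circuits with a cached response, blocks
# the request, or hands it to the next stage.

def run_pre_stages(prompt, est_cost_usd, exact_cache, semantic_cache,
                   spent_today_usd, daily_cap_usd):
    # 1. exact-cache: byte-identical prompt seen before -> free answer.
    if prompt in exact_cache:
        return exact_cache[prompt]

    # 2. semantic-cache: near-duplicate above the similarity threshold ->
    #    reuse the cached answer (see the trade-off section below).
    hit = semantic_cache.lookup(prompt)
    if hit is not None:
        return hit

    # 3. cost-guard last: only requests that survived both caches count
    #    against the daily cap.
    if spent_today_usd + est_cost_usd > daily_cap_usd:
        raise RuntimeError("daily cap reached; request blocked")

    # (v1.1) prompt-optimizer would run just before cost-guard, and router
    # would pick the cheapest capable model for whatever reaches this point.
    return None  # nothing short-circuited: forward to the provider
```

Each stage only ever sees the traffic the cheaper stages in front of it couldn't absorb, which is the whole point of the ordering.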
## Trade-off: aggressive caching means occasional false positives
A semantic-cache similarity threshold of 0.78 will return a cached response for paraphrases that aren’t quite identical. For workloads where “good enough” beats “perfect”, this is the right knob. For workloads where it isn’t (creative writing, real-time data), bump the threshold to 0.92+ or remove `semantic-cache` entirely.
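As a concrete picture of what that threshold does, here is a small sketch of a similarity-gated lookup. The embedding function and in-memory store are assumptions made for the example; only the 0.78 and 0.92+ values come from the text above.

```python
import math

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

class SemanticCache:
    def __init__(self, embed_fn, threshold=0.78):
        self.embed_fn = embed_fn    # any sentence-embedding model (assumption)
        self.threshold = threshold  # 0.78 = aggressive, 0.92+ = conservative
        self.entries = []           # (embedding, cached_response) pairs

    def lookup(self, prompt):
        if not self.entries:
            return None
        query = self.embed_fn(prompt)
        emb, response = max(self.entries, key=lambda e: cosine(query, e[0]))
        # Close enough to a previous prompt: serve the cached answer even
        # though the wording isn't identical. This is the false-positive risk.
        return response if cosine(query, emb) >= self.threshold else None

    def store(self, prompt, response):
        self.entries.append((self.embed_fn(prompt), response))
```

Raising `threshold` toward 0.92+ trades hit rate for precision, which is exactly the knob the paragraph above describes.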
## Cost math
10,000 requests/month, average $0.015 each:
| Without prxy.monster | With this pipeline |
|---|---|
| $150/mo | $30–60/mo (at a 60–80% cache hit rate) |
Plus the cap: a runaway bug in your code costs at most $25 (the daily cap), not $5,000.
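The same arithmetic written out — the request volume, per-request cost, and cap come from this page; the hit rates are the 60–80% range above:

```python
requests_per_month = 10_000
avg_cost_usd = 0.015

baseline = requests_per_month * avg_cost_usd             # $150/mo without prxy.monster

for hit_rate in (0.60, 0.70, 0.80):
    paid_requests = requests_per_month * (1 - hit_rate)  # requests that reach a provider
    print(f"{hit_rate:.0%} cache hit rate -> ${paid_requests * avg_cost_usd:.0f}/mo")
# 60% -> $60/mo, 70% -> $45/mo, 80% -> $30/mo

# The hard ceiling: cost-guard stops forwarding requests once the day's spend
# reaches the cap, so the worst case for one bad day is the cap itself.
daily_cap_usd = 25
print(f"worst case from a bug: ${daily_cap_usd}/day, not $5,000")
```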