Using prxy.monster with LlamaIndex (TypeScript)
LlamaIndex TS uses `@llamaindex/openai` for the OpenAI provider, which accepts a `baseURL` via `additionalSessionOptions`. Set it on the `OpenAI` LLM instance — every query engine, retriever, and agent that uses it inherits the routing.
Install
```bash
npm install llamaindex @llamaindex/openai
```

Configure
```ts
import { OpenAI } from '@llamaindex/openai';
import { Settings } from 'llamaindex';

Settings.llm = new OpenAI({
  model: 'gpt-4o',
  apiKey: process.env.OPENAI_API_KEY, // your prxy key
  additionalSessionOptions: {
    baseURL: 'https://api.prxy.monster/v1',
  },
});
```

Or per-LLM (without setting a global default):
```ts
const llm = new OpenAI({
  model: 'gpt-4o',
  apiKey: process.env.OPENAI_API_KEY,
  additionalSessionOptions: {
    baseURL: 'https://api.prxy.monster/v1',
  },
});

const queryEngine = index.asQueryEngine({ llm });
```
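For instance, here is a minimal sketch that builds a small in-memory index and routes only that engine's LLM calls through the proxy (the document text and question are placeholders; the `{ llm }` override follows the snippet above, so verify it against your installed version):

```ts
import { Document, VectorStoreIndex } from 'llamaindex';

// Build a tiny in-memory index. Embedding calls made here use whichever
// embedding model is configured (see the Embeddings section below).
const index = await VectorStoreIndex.fromDocuments([
  new Document({ text: 'prxy.monster sits between your app and the OpenAI API.' }),
]);

// Only this engine's LLM calls use the proxied `llm` instance; other
// engines keep using Settings.llm (or the default) unless you pass it too.
const queryEngine = index.asQueryEngine({ llm });
const response = await queryEngine.query({ query: 'What does prxy.monster do?' });
console.log(response.toString());
```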
Code change
The `additionalSessionOptions: { baseURL: ... }` line is the only diff. If you set the env vars instead:
```bash
export OPENAI_BASE_URL=https://api.prxy.monster/v1
export OPENAI_API_KEY=prxy_live_xxxxxxxxxxxxxxxxxxxxxxxx
```

…then no code change at all — the underlying OpenAI client picks it up.
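With those variables exported, the setup carries no proxy-specific options at all; a sketch assuming nothing else overrides the environment:

```ts
import { OpenAI } from '@llamaindex/openai';
import { Settings } from 'llamaindex';

// No proxy-specific options in code: the underlying OpenAI client reads
// OPENAI_BASE_URL and OPENAI_API_KEY from the environment.
Settings.llm = new OpenAI({ model: 'gpt-4o' });
```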
Verify
```bash
curl https://api.prxy.monster/health
```

Run any query — a successful response confirms routing.
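To probe from Node rather than the shell, a minimal check with the built-in fetch (Node 18+; this only assumes the health endpoint returns a 2xx status):

```ts
// Ping the proxy before wiring it into LlamaIndex.
const res = await fetch('https://api.prxy.monster/health');
console.log(res.ok ? 'prxy.monster is reachable' : `health check failed: ${res.status}`);
```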
What you get
- Semantic cache for repeated retrievals — same question across sessions returns cached answers.
- Pattern memory — successful query patterns get learned.
- Infinite context — long-running query engines stop hitting context limits via the `ipc` module.
- Cost guards — hard per-request budget caps before the OpenAI bill arrives.
Embeddings
LlamaIndex’s `OpenAIEmbedding` also accepts `additionalSessionOptions.baseURL`. Both your LLM calls and your embedding calls then route through prxy.monster:
```ts
import { OpenAIEmbedding } from '@llamaindex/openai';

Settings.embedModel = new OpenAIEmbedding({
  model: 'text-embedding-3-small',
  apiKey: process.env.OPENAI_API_KEY,
  additionalSessionOptions: {
    baseURL: 'https://api.prxy.monster/v1',
  },
});
```
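Ingestion makes a handy smoke test for the embedding route, since building an index is embedding-only traffic. A sketch with placeholder text (the retriever call shape can differ slightly between LlamaIndex TS versions):

```ts
import { Document, VectorStoreIndex } from 'llamaindex';

// Every chunk embedded during ingestion is an embeddings request that
// now goes to https://api.prxy.monster/v1.
const index = await VectorStoreIndex.fromDocuments([
  new Document({ text: 'Deploys happen on Tuesdays. Rollbacks need an approver.' }),
]);

// A retriever embeds the query but makes no LLM call, so this exercises
// only the embedding path through the proxy.
const retriever = index.asRetriever();
const nodes = await retriever.retrieve({ query: 'When do deploys happen?' });
console.log(`${nodes.length} nodes retrieved`);
```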
Recommended pipeline
For RAG-heavy LlamaIndex apps:
```bash
PRXY_PIPE=semantic-cache,exact-cache,patterns,cost-guard
```

exact-cache sits early in the pipeline because RAG often produces identical retrieval prompts.
Common issues
- `Settings.llm` is global — set it once at app startup, before instantiating any indexes / engines.
- `asQueryEngine()` builds a fresh LLM call each query — every call goes through prxy.monster as expected.
- Streaming — `queryEngine.query({ stream: true })` works; cache hits replay as synthetic SSE (see the sketch below).
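A streaming sketch, assuming an existing `queryEngine`. The field exposed on each streamed chunk varies across LlamaIndex TS versions (`response` here; `message.content` in some releases), so check your version's types:

```ts
// stream: true yields an async iterable of partial responses; a cache hit
// from prxy.monster is replayed through the same interface.
const stream = await queryEngine.query({
  query: 'Summarize the handbook section on deploys.',
  stream: true,
});

for await (const chunk of stream) {
  process.stdout.write(chunk.response);
}
process.stdout.write('\n');
```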
Full example
Adapt `examples/openai-quickstart` — replace the OpenAI client with the LlamaIndex `Settings.llm` setup above.
Verify the exact option name (`additionalSessionOptions.baseURL`) with the LlamaIndex TS docs for your installed version.
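Assembling the snippets above into a single file, a minimal end-to-end sketch (Node 18+, ESM; model names match the examples above and the document text is a placeholder):

```ts
import { OpenAI, OpenAIEmbedding } from '@llamaindex/openai';
import { Document, Settings, VectorStoreIndex } from 'llamaindex';

const prxyOptions = { baseURL: 'https://api.prxy.monster/v1' };

// All LLM and embedding traffic routes through prxy.monster.
Settings.llm = new OpenAI({
  model: 'gpt-4o',
  apiKey: process.env.OPENAI_API_KEY,
  additionalSessionOptions: prxyOptions,
});
Settings.embedModel = new OpenAIEmbedding({
  model: 'text-embedding-3-small',
  apiKey: process.env.OPENAI_API_KEY,
  additionalSessionOptions: prxyOptions,
});

// Ingest, then query: embeddings at build time, LLM calls at query time.
const index = await VectorStoreIndex.fromDocuments([
  new Document({ text: 'prxy.monster adds caching, pattern memory, and cost guards.' }),
]);
const queryEngine = index.asQueryEngine();
const response = await queryEngine.query({ query: 'What does prxy.monster add?' });
console.log(response.toString());
```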