Using prxy.monster with LlamaIndex (TypeScript)
LlamaIndex TS uses `@llamaindex/openai` for the OpenAI provider, which accepts a `baseURL` via `additionalSessionOptions`. Set it on the `OpenAI` LLM instance — every query engine, retriever, and agent that uses it inherits the routing.
Install
```bash
npm install llamaindex @llamaindex/openai
```

Configure
```ts
import { OpenAI } from '@llamaindex/openai';
import { Settings } from 'llamaindex';

Settings.llm = new OpenAI({
  model: 'gpt-4o',
  apiKey: process.env.OPENAI_API_KEY, // your prxy key
  additionalSessionOptions: {
    baseURL: 'https://api.prxy.monster/v1',
  },
});
```

Or per-LLM (without setting a global default):
```ts
const llm = new OpenAI({
  model: 'gpt-4o',
  apiKey: process.env.OPENAI_API_KEY,
  additionalSessionOptions: {
    baseURL: 'https://api.prxy.monster/v1',
  },
});

const queryEngine = index.asQueryEngine({ llm });
```
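For instance, here is a minimal sketch that builds a small in-memory index and routes only that engine's LLM calls through the proxy (the document text and question are placeholders; the `{ llm }` override follows the snippet above, so verify it against your installed version):

```ts
import { Document, VectorStoreIndex } from 'llamaindex';

// Build a tiny in-memory index. Embedding calls made here use whichever
// embedding model is configured (see the Embeddings section below).
const index = await VectorStoreIndex.fromDocuments([
  new Document({ text: 'prxy.monster sits between your app and the OpenAI API.' }),
]);

// Only this engine's LLM calls use the proxied `llm` instance; other
// engines keep using Settings.llm (or the default) unless you pass it too.
const queryEngine = index.asQueryEngine({ llm });
const response = await queryEngine.query({ query: 'What does prxy.monster do?' });
console.log(response.toString());
```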
Code change
The `additionalSessionOptions: { baseURL: ... }` line is the only diff. If you set the env vars instead:
```bash
export OPENAI_BASE_URL=https://api.prxy.monster/v1
export OPENAI_API_KEY=prxy_live_xxxxxxxxxxxxxxxxxxxxxxxx
```

…then no code change at all — the underlying OpenAI client picks it up.
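With those variables exported, the setup carries no proxy-specific options at all; a sketch assuming nothing else overrides the environment:

```ts
import { OpenAI } from '@llamaindex/openai';
import { Settings } from 'llamaindex';

// No proxy-specific options in code: the underlying OpenAI client reads
// OPENAI_BASE_URL and OPENAI_API_KEY from the environment.
Settings.llm = new OpenAI({ model: 'gpt-4o' });
```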
Verify
```bash
curl https://api.prxy.monster/health
```

Run any query — a successful response confirms routing.
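To probe from Node rather than the shell, a minimal check with the built-in fetch (Node 18+; this only assumes the health endpoint returns a 2xx status):

```ts
// Ping the proxy before wiring it into LlamaIndex.
const res = await fetch('https://api.prxy.monster/health');
console.log(res.ok ? 'prxy.monster is reachable' : `health check failed: ${res.status}`);
```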
What you get
- Semantic cache for repeated retrievals — same question across sessions returns cached answers.
- Pattern memory — successful query patterns get learned.
- Infinite context — long-running query engines stop hitting context limits via the `ipc` module.
- Cost guards — hard per-request budget caps before the OpenAI bill arrives.
Embeddings
LlamaIndex’s `OpenAIEmbedding` also accepts `additionalSessionOptions.baseURL`. Both your LLM calls and your embedding calls then route through prxy.monster:
```ts
import { OpenAIEmbedding } from '@llamaindex/openai';

Settings.embedModel = new OpenAIEmbedding({
  model: 'text-embedding-3-small',
  apiKey: process.env.OPENAI_API_KEY,
  additionalSessionOptions: {
    baseURL: 'https://api.prxy.monster/v1',
  },
});
```
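Ingestion makes a handy smoke test for the embedding route, since building an index is embedding-only traffic. A sketch with placeholder text (the retriever call shape can differ slightly between LlamaIndex TS versions):

```ts
import { Document, VectorStoreIndex } from 'llamaindex';

// Every chunk embedded during ingestion is an embeddings request that
// now goes to https://api.prxy.monster/v1.
const index = await VectorStoreIndex.fromDocuments([
  new Document({ text: 'Deploys happen on Tuesdays. Rollbacks need an approver.' }),
]);

// A retriever embeds the query but makes no LLM call, so this exercises
// only the embedding path through the proxy.
const retriever = index.asRetriever();
const nodes = await retriever.retrieve({ query: 'When do deploys happen?' });
console.log(`${nodes.length} nodes retrieved`);
```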
Recommended pipeline
For RAG-heavy LlamaIndex apps:
```bash
PRXY_PIPE=semantic-cache,exact-cache,patterns,cost-guard
```

exact-cache sits early in the pipeline because RAG often produces identical retrieval prompts.
Common issues
- `Settings.llm` is global — set it once at app startup, before instantiating any indexes / engines.
- `asQueryEngine()` builds a fresh LLM call each query — every call goes through prxy.monster as expected.
- Streaming — `queryEngine.query({ stream: true })` works; cache hits replay as synthetic SSE (see the sketch below).
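A streaming sketch, assuming an existing `queryEngine`. The field exposed on each streamed chunk varies across LlamaIndex TS versions (`response` here; `message.content` in some releases), so check your version's types:

```ts
// stream: true yields an async iterable of partial responses; a cache hit
// from prxy.monster is replayed through the same interface.
const stream = await queryEngine.query({
  query: 'Summarize the handbook section on deploys.',
  stream: true,
});

for await (const chunk of stream) {
  process.stdout.write(chunk.response);
}
process.stdout.write('\n');
```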
Full example
Adapt `examples/openai-quickstart` — replace the OpenAI client with the LlamaIndex `Settings.llm` setup above.
Verify the exact option name (`additionalSessionOptions.baseURL`) with the LlamaIndex TS docs for your installed version.
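Assembling the snippets above into a single file, a minimal end-to-end sketch (Node 18+, ESM; model names match the examples above and the document text is a placeholder):

```ts
import { OpenAI, OpenAIEmbedding } from '@llamaindex/openai';
import { Document, Settings, VectorStoreIndex } from 'llamaindex';

const prxyOptions = { baseURL: 'https://api.prxy.monster/v1' };

// All LLM and embedding traffic routes through prxy.monster.
Settings.llm = new OpenAI({
  model: 'gpt-4o',
  apiKey: process.env.OPENAI_API_KEY,
  additionalSessionOptions: prxyOptions,
});
Settings.embedModel = new OpenAIEmbedding({
  model: 'text-embedding-3-small',
  apiKey: process.env.OPENAI_API_KEY,
  additionalSessionOptions: prxyOptions,
});

// Ingest, then query: embeddings at build time, LLM calls at query time.
const index = await VectorStoreIndex.fromDocuments([
  new Document({ text: 'prxy.monster adds caching, pattern memory, and cost guards.' }),
]);
const queryEngine = index.asQueryEngine();
const response = await queryEngine.query({ query: 'What does prxy.monster add?' });
console.log(response.toString());
```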