Using prxy.monster with LlamaIndex (Python)
LlamaIndex Python’s OpenAI LLM class accepts an api_base constructor arg. Set it once, and every query engine / retriever / agent inherits the routing.
Install
```
pip install llama-index llama-index-llms-openai llama-index-embeddings-openai
```

Configure

```python
from llama_index.llms.openai import OpenAI
from llama_index.core import Settings

Settings.llm = OpenAI(
    model="gpt-4o",
    api_base="https://api.prxy.monster/v1",
    api_key="prxy_live_xxxxxxxxxxxxxxxxxxxxxxxx",
)
```
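Anything built afterwards picks this up without an explicit llm argument; a minimal sketch, assuming an index already exists:

```python
# No llm= argument needed: the engine falls back to Settings.llm,
# so every call routes through prxy.monster.
query_engine = index.as_query_engine()
print(query_engine.query("What does the contract say about renewal?"))
```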
Or per-LLM (no global default):

```python
llm = OpenAI(
    model="gpt-4o",
    api_base="https://api.prxy.monster/v1",
    api_key="prxy_live_xxxxxxxxxxxxxxxxxxxxxxxx",
)
query_engine = index.as_query_engine(llm=llm)
```
Or via env vars (no constructor change):

```
export OPENAI_BASE_URL=https://api.prxy.monster/v1
export OPENAI_API_KEY=prxy_live_xxxxxxxxxxxxxxxxxxxxxxxx
```

```python
from llama_index.core import Settings
from llama_index.llms.openai import OpenAI

Settings.llm = OpenAI(model="gpt-4o")  # picks up env vars
```
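If you’d rather not touch the shell, setting the variables in-process works too; a sketch:

```python
import os

# Set these before instantiating OpenAI(...) so the client sees them.
os.environ["OPENAI_BASE_URL"] = "https://api.prxy.monster/v1"
os.environ["OPENAI_API_KEY"] = "prxy_live_xxxxxxxxxxxxxxxxxxxxxxxx"
```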
Verify

```
curl https://api.prxy.monster/health
```

Run a query — a successful response confirms routing.
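The quickest in-code check is a bare completion through the configured LLM; any reply confirms the route:

```python
from llama_index.core import Settings

# One round trip through prxy.monster; no index required.
print(Settings.llm.complete("Reply with OK."))
```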
What you get
- Semantic cache for retrievals (RAG often produces near-identical follow-up queries).
- Pattern memory — successful retrieval / answer patterns get learned.
- Infinite context — ipc keeps multi-turn QueryEngine sessions from hitting the wall (see the sketch after this list).
- Cost guards — per-request and per-day budget caps.
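To see the multi-turn point in practice, a minimal sketch (assumes an index built as above):

```python
# Each turn appends to the conversation; ipc is what keeps
# long sessions from exhausting the context window.
chat_engine = index.as_chat_engine()
print(chat_engine.chat("What does the document say about pricing?"))
print(chat_engine.chat("How does that compare to last quarter?"))
```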
Embeddings
The same api_base arg works for OpenAIEmbedding:

```python
from llama_index.embeddings.openai import OpenAIEmbedding

Settings.embed_model = OpenAIEmbedding(
    model="text-embedding-3-small",
    api_base="https://api.prxy.monster/v1",
    api_key="prxy_live_xxxxxxxxxxxxxxxxxxxxxxxx",
)
```

Both your LLM calls AND your embedding calls route through prxy.monster.
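To exercise both paths at once, a sketch (the ./data directory is a placeholder):

```python
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex

# from_documents calls the embedding endpoint; query calls the LLM.
documents = SimpleDirectoryReader("./data").load_data()
index = VectorStoreIndex.from_documents(documents)
print(index.as_query_engine().query("Summarize the documents."))
```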
Anthropic provider
If you’re using llama-index-llms-anthropic, the same pattern applies — pass the prxy base URL:

```python
from llama_index.llms.anthropic import Anthropic

Settings.llm = Anthropic(
    model="claude-sonnet-4-6",
    base_url="https://api.prxy.monster",
    api_key="prxy_live_xxxxxxxxxxxxxxxxxxxxxxxx",
)
```
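Anthropic has no embeddings endpoint, so a RAG setup would typically keep the OpenAI embedder from above alongside it, both behind the proxy; a sketch:

```python
from llama_index.core import Settings
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.llms.anthropic import Anthropic

# Claude generates, OpenAI embeds, prxy.monster fronts both.
Settings.llm = Anthropic(
    model="claude-sonnet-4-6",
    base_url="https://api.prxy.monster",
    api_key="prxy_live_xxxxxxxxxxxxxxxxxxxxxxxx",
)
Settings.embed_model = OpenAIEmbedding(
    model="text-embedding-3-small",
    api_base="https://api.prxy.monster/v1",
    api_key="prxy_live_xxxxxxxxxxxxxxxxxxxxxxxx",
)
```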
Recommended pipeline

For RAG:

```
PRXY_PIPE=exact-cache,semantic-cache,patterns,cost-guard
```

For agents with tool use:

```
PRXY_PIPE=mcp-optimizer,semantic-cache,patterns,ipc
```
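For a RAG service, the full environment in one place (assuming, as with the exports above, that PRXY_PIPE is read from the app’s environment):

```
export OPENAI_BASE_URL=https://api.prxy.monster/v1
export OPENAI_API_KEY=prxy_live_xxxxxxxxxxxxxxxxxxxxxxxx
export PRXY_PIPE=exact-cache,semantic-cache,patterns,cost-guard
```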
Common issues
- Settings.llm is process-global — set it at app startup, before instantiating indexes.
- Async APIs (aquery, astream_query) — work identically; see the sketch after this list.
- Function calling (llm.predict_and_call(tools=...)) — pass-through.
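A sketch of both, assuming the routed Settings.llm and the index from above (the multiply tool is a made-up example):

```python
import asyncio

from llama_index.core import Settings
from llama_index.core.tools import FunctionTool

async def main():
    # Same engine, awaited instead of blocking.
    print(await index.as_query_engine().aquery("What changed in v2?"))

asyncio.run(main())

def multiply(a: float, b: float) -> float:
    """Multiply two numbers."""
    return a * b

# Tool calls pass through the proxy unchanged.
tool = FunctionTool.from_defaults(fn=multiply)
print(Settings.llm.predict_and_call([tool], "What is 12 * 7?"))
```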
Full example
Adapt examples/openai-quickstart — replace the OpenAI client with Settings.llm = OpenAI(...) as shown above.
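A self-contained version of that adaptation, as a sketch (the path and the question are placeholders):

```python
from llama_index.core import Settings, SimpleDirectoryReader, VectorStoreIndex
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.llms.openai import OpenAI

# Route the LLM and the embedder through prxy.monster.
Settings.llm = OpenAI(
    model="gpt-4o",
    api_base="https://api.prxy.monster/v1",
    api_key="prxy_live_xxxxxxxxxxxxxxxxxxxxxxxx",
)
Settings.embed_model = OpenAIEmbedding(
    model="text-embedding-3-small",
    api_base="https://api.prxy.monster/v1",
    api_key="prxy_live_xxxxxxxxxxxxxxxxxxxxxxxx",
)

documents = SimpleDirectoryReader("./data").load_data()
index = VectorStoreIndex.from_documents(documents)
print(index.as_query_engine().query("What is this project about?"))
```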
Verify the exact constructor argument name with the LlamaIndex Python docs for your installed version. api_base is stable across recent versions.