AI Fusion docs

An OpenAI-compatible gateway aggregating free-tier API access across 5 providers and 57 models.

1. Quick start

Three steps to your first request:

  1. Sign up for a free tenant.
  2. Visit Tokens and create a project token (afp_live_…). Copy it — the plaintext is shown once.
  3. Make a request:
curl https://ai.viktorarsov.com/v1/chat/completions \
  -H "Authorization: Bearer afp_live_…" \
  -H "Content-Type: application/json" \
  -d '{"model":"gateway:fast-free","messages":[{"role":"user","content":"hi"}]}'
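The same call can be mirrored in Python with only the standard library. This is a sketch: it builds the request but does not send it, and `afp_live_REPLACE_ME` is a placeholder for your real token. Because the gateway is OpenAI-compatible, the official OpenAI SDKs pointed at the gateway base URL should also work.

```python
import json
import urllib.request

GATEWAY_URL = "https://ai.viktorarsov.com/v1/chat/completions"

def build_chat_request(token, model, messages):
    """Build (but do not send) an OpenAI-style chat completion request."""
    body = json.dumps({"model": model, "messages": messages}).encode()
    return urllib.request.Request(
        GATEWAY_URL,
        data=body,
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )

req = build_chat_request(
    "afp_live_REPLACE_ME",
    "gateway:fast-free",
    [{"role": "user", "content": "hi"}],
)
# urllib.request.urlopen(req) would actually send it; left out so the
# snippet runs without a real token.
```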

2. Endpoints

All paths are mounted under the root host. See /docs (FastAPI Swagger) for full request and response schemas.

| Method | Path | Description |
|--------|------|-------------|
| POST | /v1/chat/completions | OpenAI-compatible chat completion (streaming + non-streaming). |
| POST | /v1/chat/completions/progress | Same payload, but emits an SSE feed of routing attempts as the spiral proceeds. |
| POST | /v1/messages | Anthropic-compatible Messages API ingress. Translated to OpenAI internally. |
| POST | /v1beta/models/{model}:generateContent | Gemini-native ingress. Translated to OpenAI internally. |
| POST | /v1/embeddings | Generate vector embeddings. |
| POST | /v1/moderations | OpenAI-compatible moderation pass. |
| POST | /v1/images/generations | Image generation (where the provider supports it). |
| GET | /v1/models | List models the calling tenant can route to (including aliases). |
| GET | /v1/usage | Per-tenant usage rollup (tokens, cost, savings). |
| POST | /v1/feedback | Submit thumbs-up / thumbs-down on a prior request. |
| GET | /healthz | Liveness + dependency probe. |

3. Authentication

Every /v1/* call must include a project token in the standard bearer header:

Authorization: Bearer afp_live_xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx

Tokens are issued from Dashboard → Tokens. Only the SHA-256 hash is stored server-side; the plaintext is shown once on creation.

Channel pinning suffix

Append :<channel> to your model string to pin to a routing channel — for example gateway:fast-free:dev isolates a dev workload's quota windows from your prod traffic. Channels are bookkeeping only; they don't change which provider answers.
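Client-side, the suffix is plain string concatenation. The helpers below are illustrative only (not the gateway's own parsing) and assume the two-part gateway:<alias> form used by the aliases in this doc:

```python
def pin_channel(model: str, channel: str) -> str:
    """Append a routing-channel suffix to a model string."""
    return f"{model}:{channel}"

def split_channel(pinned: str):
    """Strip a trailing channel from a 'gateway:<alias>:<channel>' string.

    Illustrative sketch: assumes the alias itself is 'gateway:<name>'
    with exactly one colon, which holds for the aliases listed here.
    Returns (model, channel) where channel is None if no suffix is set.
    """
    parts = pinned.split(":")
    if pinned.startswith("gateway:") and len(parts) == 3:
        return ":".join(parts[:2]), parts[2]
    return pinned, None
```

For example, `pin_channel("gateway:fast-free", "dev")` yields the pinned string `gateway:fast-free:dev` shown above.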

4. Models & aliases

You can address models three ways:

  • Alias: gateway:fast-free. Resolves to a ranked list of (provider, model) pairs.
  • Provider-qualified: groq/llama-3.1-8b-instant. Pins to one provider.
  • Bare: llama-3.3-70b. Matches across providers by display name.

Aliases

| Slug | Description | Resolves to |
|------|-------------|-------------|
| auto | Spiral rotation: T1→T5 across all 5 sources × all keys. The DEFAULT alias. | nvidia_nim/mistralai/mistral-large-3-675b-instruct-2512 → cerebras/qwen-3-235b-a22b-instruct-2507 → gemini/gemini-2.5-pro → groq/openai/gpt-oss-120b +146 more |
| best-free-chat | Best-quality free chat: try premium-feel models first, fall through to high-volume. | gemini/gemini-2.5-pro → nvidia_nim/mistralai/mistral-large-3-675b-instruct-2512 → openrouter/deepseek/deepseek-chat-v3-0324:free → gemini/gemini-2.5-flash +3 more |
| cheap-paid | Cheapest paid fallback when all free quotas are exhausted. | groq/llama-3.1-8b-instant → gemini/gemini-2.5-flash-lite → groq/openai/gpt-oss-120b |
| code-free | Coding-tuned free models (autocomplete, refactor, explain). | nvidia_nim/qwen/qwen3-coder-480b-a35b-instruct → openrouter/qwen/qwen3-coder:free → groq/openai/gpt-oss-120b → cerebras/qwen-3-235b-a22b-instruct-2507 |
| fast-free | Low-latency cheap/free models for high-volume simple tasks. | groq/llama-3.1-8b-instant → cerebras/llama3.1-8b → gemini/gemini-2.5-flash-lite → openrouter/google/gemma-4-31b-it:free +1 more |
| long-context | Models with very large context windows (>200K). | gemini/gemini-2.5-pro → gemini/gemini-2.5-flash → openrouter/meta-llama/llama-4-scout:free → openrouter/qwen/qwen3-coder:free |
| reasoning | Stronger reasoning models. | nvidia_nim/deepseek-ai/deepseek-v3.2 → nvidia_nim/moonshotai/kimi-k2-thinking → nvidia_nim/qwen/qwen3-next-80b-a3b-thinking → openrouter/deepseek/deepseek-r1-zero:free +1 more |
| vision | Multimodal (image input) capable free models. | gemini/gemini-2.5-flash → gemini/gemini-2.5-flash-lite → nvidia_nim/meta/llama-4-maverick-17b-128e-instruct → openrouter/meta-llama/llama-4-scout:free +1 more |

Providers

| Provider | Slug | Adapter | Base URL |
|----------|------|---------|----------|
| OpenRouter | openrouter | openrouter | https://openrouter.ai/api/v1 |
| Google Gemini (AI Studio) | gemini | gemini_native | https://generativelanguage.googleapis.com/v1beta |
| Groq | groq | openai_compat | https://api.groq.com/openai/v1 |
| Cerebras | cerebras | openai_compat | https://api.cerebras.ai/v1 |
| NVIDIA NIM (build.nvidia.com) | nvidia_nim | openai_compat | https://integrate.api.nvidia.com/v1 |

Models

| Display name | Model id | Context | Tier | Capabilities |
|--------------|----------|---------|------|--------------|
| Whisper Large v3 (Groq) | whisper-large-v3 | 448 | audio | |
| Whisper Large v3 Turbo (Groq) | whisper-large-v3-turbo | 448 | audio | |
| Gemini 3.1 Pro (paid) | gemini-3.1-pro | 1048576 | chat | tools vision json stream |
| GPT-4.1 (direct) | gpt-4.1 | 1048576 | chat | tools vision json stream |
| Claude 3.5 Sonnet (direct) | claude-3-5-sonnet-20241022 | 200000 | chat | tools vision json stream |
| GPT-4o (direct) | gpt-4o | 128000 | chat | tools vision json stream |
| Mistral Large 3 675B (NIM) | mistralai/mistral-large-3-675b-instruct-2512 | 128000 | chat | tools stream |
| Claude 3 Opus (direct) | claude-3-opus-20240229 | 200000 | chat | tools vision json stream |
| Llama 3.1 405B (NIM) | meta/llama-3.1-405b-instruct | 128000 | chat | tools stream |
| Nemotron Ultra 253B (NIM) | nvidia/llama-3.1-nemotron-ultra-253b-v1 | 128000 | chat | tools stream |
| Gemini 2.5 Pro | gemini-2.5-pro | 1048576 | chat | tools vision json stream |
| DeepSeek V3.2 (NIM) | deepseek-ai/deepseek-v3.2 | 128000 | chat | tools stream |
| Hermes 3 Llama 3.1 405B (free) | nousresearch/hermes-3-llama-3.1-405b:free | 131072 | chat | tools stream |
| Llama 4 Maverick (NIM, vision) | meta/llama-4-maverick-17b-128e-instruct | 128000 | chat | tools vision stream |
| Nemotron 3 Super 120B (free) | nvidia/nemotron-3-super-120b-a12b:free | 262144 | chat | tools stream |
| Qwen3 235B A22B (Cerebras) | qwen-3-235b-a22b-instruct-2507 | 65536 | chat | tools stream |
| Arcee Trinity Large 400B (free) | arcee-ai/trinity-large-preview:free | 131072 | chat | tools stream |
| DeepSeek V3.1 Terminus (NIM) | deepseek-ai/deepseek-v3.1-terminus | 128000 | chat | tools stream |
| Gemini 3.1 Flash (paid) | gemini-3.1-flash | 1048576 | chat | tools vision json stream |
| Qwen3-Next 80B (free) | qwen/qwen3-next-80b-a3b-instruct:free | 262144 | chat | tools stream |
| Kimi K2 (NIM) | moonshotai/kimi-k2-instruct | 128000 | chat | tools stream |
| Gemini 2.5 Flash | gemini-2.5-flash | 1048576 | chat | tools vision json stream |
| GPT-OSS 120B (OR, free) | openai/gpt-oss-120b:free | 131072 | chat | tools json stream |
| Llama 3.3 70B | llama-3.3-70b-versatile | 131072 | chat | tools json stream |
| Groq Compound (agentic) | groq/compound | 131072 | chat | tools stream |
| Llama 3.3 70B (NIM) | meta/llama-3.3-70b-instruct | 128000 | chat | tools stream |
| GPT-OSS 120B (Groq) | openai/gpt-oss-120b | 131072 | chat | tools json stream |
| Gemma 4 31B (free) | google/gemma-4-31b-it:free | 262144 | chat | tools stream |
| GPT-OSS 120B (NIM) | openai/gpt-oss-120b | 128000 | chat | tools json stream |
| Nemotron Super 49B v1.5 (NIM) | nvidia/llama-3.3-nemotron-super-49b-v1.5 | 128000 | chat | tools stream |
| Llama 3.3 70B (free) | meta-llama/llama-3.3-70b-instruct:free | 65536 | chat | tools stream |
| Llama 4 Scout 17B (Groq) | meta-llama/llama-4-scout-17b-16e-instruct | 131072 | chat | tools vision stream |
| Z.ai GLM 4.5 Air (free) | z-ai/glm-4.5-air:free | 131072 | chat | tools stream |
| Gemma 4 26B (free) | google/gemma-4-26b-a4b-it:free | 262144 | chat | tools stream |
| Minimax M2.5 (free) | minimax/minimax-m2.5:free | 196608 | chat | tools stream |
| Nemotron 3 Nano 30B (free) | nvidia/nemotron-3-nano-30b-a3b:free | 256000 | chat | tools stream |
| Qwen3 32B (Groq) | qwen/qwen3-32b | 131072 | chat | tools json stream |
| Nemotron Nano 12B VL (free, vision) | nvidia/nemotron-nano-12b-v2-vl:free | 128000 | chat | tools vision stream |
| Gemma 3 27B (free) | google/gemma-3-27b-it:free | 131072 | chat | tools stream |
| Qwen3 Coder 480B (NIM) | qwen/qwen3-coder-480b-a35b-instruct | 128000 | code | tools stream |
| Qwen3 Coder 480B (free) | qwen/qwen3-coder:free | 262144 | code | tools stream |
| Claude 3.5 Haiku (direct) | claude-3-5-haiku-20241022 | 200000 | fast | tools vision json stream |
| GPT-4o mini (direct) | gpt-4o-mini | 128000 | fast | tools vision json stream |
| Groq Compound Mini (agentic-fast) | groq/compound-mini | 131072 | fast | tools stream |
| Gemini 2.0 Flash (deprecates 2026-06-01) | gemini-2.0-flash | 1048576 | fast | tools vision json stream |
| GPT-OSS 20B (OR, free) | openai/gpt-oss-20b:free | 131072 | fast | tools json stream |
| GPT-OSS 20B (NIM) | openai/gpt-oss-20b | 128000 | fast | tools json stream |
| GPT-OSS 20B (Groq) | openai/gpt-oss-20b | 131072 | fast | tools json stream |
| Nemotron Nano 9B (free) | nvidia/nemotron-nano-9b-v2:free | 128000 | fast | tools stream |
| Llama 3.1 8B Instant | llama-3.1-8b-instant | 131072 | fast | tools json stream |
| Gemini 3.1 Flash-Lite (paid) | gemini-3.1-flash-lite | 1048576 | fast | tools vision json stream |
| Llama 3.2 3B (free) | meta-llama/llama-3.2-3b-instruct:free | 131072 | fast | tools stream |
| Gemini 2.5 Flash-Lite | gemini-2.5-flash-lite | 1048576 | fast | tools vision json stream |
| Llama 3.1 8B (Cerebras) | llama3.1-8b | 8192 | fast | tools stream |
| o3-mini (direct, reasoning) | o3-mini | 200000 | reason | tools json stream |
| Kimi K2 Thinking (NIM, reasoning) | moonshotai/kimi-k2-thinking | 128000 | reason | tools stream |
| Qwen3-Next 80B Thinking (NIM) | qwen/qwen3-next-80b-a3b-thinking | 128000 | reason | tools stream |

5. Live progress streaming

POST /v1/chat/completions/progress accepts the same JSON body as /v1/chat/completions but always responds with text/event-stream. Each SSE event has a type field describing what just happened in the spiral:

event: progress
data: {"type":"attempt","provider":"groq","model":"llama-3.1-8b-instant","key":"k_3"}

event: progress
data: {"type":"failure","provider":"groq","status":429,"reason":"rate_limited"}

event: progress
data: {"type":"attempt","provider":"cerebras","model":"llama-3.3-70b","key":"k_1"}

event: progress
data: {"type":"success","provider":"cerebras","latency_ms":612,"tokens":248}

event: done
data: {"choices":[{"message":{"role":"assistant","content":"…"}}]}

Clients should keep the connection open until they see an event: done frame.
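A minimal client for this feed only needs to split frames on blank lines and JSON-decode each data: line. The parser below is a sketch that assumes one event: line and one data: line per frame, as in the example above:

```python
import json

def parse_sse(stream_text: str):
    """Split a text/event-stream body into (event, data) pairs.

    Minimal sketch: assumes frames are separated by blank lines and each
    frame has one 'event:' and one 'data:' line, as in the example feed.
    """
    events = []
    for frame in stream_text.strip().split("\n\n"):
        event, data = None, None
        for line in frame.splitlines():
            if line.startswith("event:"):
                event = line[len("event:"):].strip()
            elif line.startswith("data:"):
                data = json.loads(line[len("data:"):].strip())
        events.append((event, data))
    return events

# Sample feed shaped like the example above.
sample = (
    'event: progress\n'
    'data: {"type":"attempt","provider":"groq",'
    '"model":"llama-3.1-8b-instant","key":"k_3"}\n'
    '\n'
    'event: done\n'
    'data: {"choices":[{"message":{"role":"assistant","content":"hi"}}]}\n'
)
frames = parse_sse(sample)
```

A real client would feed the response body into this incrementally and stop once it sees the done frame.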

6. Cross-format ingress

The same model pool can be reached using non-OpenAI request shapes:

  • POST /v1/messages — Anthropic Messages API surface. Same auth header, response shape matches Anthropic's.
  • POST /v1beta/models/{model}:generateContent — Google Gemini REST shape. Pass an x-goog-api-key header containing your afp_live_… token, or use ?key=….

Both surfaces translate to the OpenAI-compatible internal pipeline so the same routing, quotas, and webhooks apply.

7. Quotas & rate limits

Three layers cooperate:

  • Plan caps — monthly request and token ceilings, set per plan.
  • RPM cap — sliding-window per project token, enforced in Redis.
  • Per-key daily quotas — enforced atomically inside the routing layer, with cooldown until provider midnight on 429/401.

When all keys are cooling, the gateway parks the request for up to max_wait_ms (passed via the x-gateway-max-wait-ms header), then either serves it or returns 429 with a next_slot_eta_s body.
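Client code can turn that 429 body into a retry delay. The helper below is a sketch; it assumes next_slot_eta_s is a numeric JSON field, as described above:

```python
import json

def next_retry_delay(status: int, body: bytes, default: float = 1.0) -> float:
    """Pick a retry delay for a parked-then-rejected request.

    Sketch: on 429, prefer the next_slot_eta_s hint from the response
    body; otherwise fall back to a caller-supplied default.
    """
    if status == 429:
        try:
            eta = json.loads(body).get("next_slot_eta_s")
            if eta is not None:
                return float(eta)
        except (ValueError, AttributeError):
            pass  # unparsable or non-object body: use the default
    return default
```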

8. Error codes

| Status | Meaning | What to do |
|--------|---------|------------|
| 200 | OK | Read body and headers. |
| 400 | Malformed request body | Validate JSON shape. |
| 401 | Missing or invalid project token | Re-issue the token in the dashboard. |
| 403 | Token disabled or tenant disabled | Contact admin. |
| 404 | Unknown model or alias | List models via /v1/models. |
| 429 | Plan, RPM, or pool exhausted | Honor Retry-After or next_slot_eta_s. |
| 500 | Internal error | Retry; surface the request id from the response header. |
| 502 | Upstream provider returned a bad payload | Retry with another model. |
| 503 | No healthy keys for any provider in the resolution list | Wait or use a different alias. |
| 504 | Upstream timeout | Retry; consider a smaller model. |
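A retry wrapper over this table might look like the sketch below. `send` is any callable returning (status, headers, body) and is an assumption of this example, not part of the gateway API:

```python
import time

# Statuses this doc marks as retryable.
RETRYABLE = {429, 500, 502, 503, 504}

def call_with_retries(send, max_attempts: int = 3):
    """Retry a gateway call on retryable statuses, honoring Retry-After.

    Illustrative sketch: `send` is a zero-argument callable returning
    (status, headers, body); falls back to exponential backoff when no
    Retry-After header is present.
    """
    for attempt in range(max_attempts):
        status, headers, body = send()
        if status not in RETRYABLE:
            return status, headers, body
        if attempt < max_attempts - 1:
            time.sleep(float(headers.get("Retry-After", 2 ** attempt)))
    return status, headers, body
```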

9. Webhooks

Configure target URLs in Dashboard → Webhooks. We POST a JSON body and an HMAC-SHA256 signature in X-AFP-Signature:

{
  "type": "request.completed",
  "tenant_id": "…",
  "request_id": "…",
  "model": "groq/llama-3.1-8b-instant",
  "tokens_in": 41,
  "tokens_out": 207,
  "cost_usd": 0.0000,
  "savings_usd": 0.00031,
  "latency_ms": 612,
  "ts": "2026-04-21T10:14:33Z"
}

Event types:

  • request.completed — every successful chat / embeddings / image call.
  • request.failed — every terminal failure after the spiral exhausts.
  • quota.threshold — fired at 80% and 100% of the monthly cap.
  • invoice.created — emitted at the start of each billing period.
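Receivers should verify the signature by recomputing the HMAC over the raw (unparsed) request body. The sketch below assumes, since this doc does not specify the encoding, that X-AFP-Signature carries a lowercase hex digest of HMAC-SHA256(secret, body):

```python
import hashlib
import hmac

def verify_signature(secret: bytes, raw_body: bytes, signature: str) -> bool:
    """Recompute HMAC-SHA256 over the raw body and compare in constant time.

    Assumption (not confirmed by this doc): the X-AFP-Signature header
    carries a lowercase hex digest.
    """
    expected = hmac.new(secret, raw_body, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, signature)
```

Always verify against the exact bytes received; re-serializing the parsed JSON can change key order or whitespace and break the comparison.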

10. Response headers we add

Every successful response includes:

| Header | Value |
|--------|-------|
| x-gateway-provider | Slug of the provider that answered (e.g. groq). |
| x-gateway-model | The concrete model id served (post-alias-resolution). |
| x-gateway-latency-ms | End-to-end latency including all retries. |
| x-gateway-tokens | in/out token counts, comma-separated. |
| x-gateway-attempts | How many (provider, key) attempts were made. |
| x-gateway-cost-usd | Computed cost using the model's per-MTok pricing. |
| x-gateway-request-id | Stable id you can quote in support tickets. |
| Retry-After | Seconds to wait, on 429 / 503 only. |
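Reading these back client-side is straightforward. The helper below is illustrative and assumes x-gateway-tokens is formatted as "in,out" per the table above:

```python
def parse_gateway_headers(headers: dict) -> dict:
    """Pull the x-gateway-* diagnostics out of a response-header mapping.

    Illustrative sketch: header names and formats are taken from the
    table above; x-gateway-tokens is assumed to be 'in,out'.
    """
    tokens_in, tokens_out = (int(n) for n in headers["x-gateway-tokens"].split(","))
    return {
        "provider": headers.get("x-gateway-provider"),
        "model": headers.get("x-gateway-model"),
        "latency_ms": int(headers["x-gateway-latency-ms"]),
        "tokens_in": tokens_in,
        "tokens_out": tokens_out,
        "attempts": int(headers["x-gateway-attempts"]),
        "cost_usd": float(headers["x-gateway-cost-usd"]),
        "request_id": headers.get("x-gateway-request-id"),
    }
```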