AI Fusion docs
An OpenAI-compatible gateway that aggregates free-tier API access across 5 providers and 57 models.
1. Quick start
Three steps to your first request:
- Sign up for a free tenant.
- Visit Tokens and create a project token (`afp_live_…`). Copy it — the plaintext is shown once.
- Make a request:
```bash
curl https://ai.viktorarsov.com/v1/chat/completions \
  -H "Authorization: Bearer afp_live_…" \
  -H "Content-Type: application/json" \
  -d '{"model":"gateway:fast-free","messages":[{"role":"user","content":"hi"}]}'
```
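The same request can be assembled from any HTTP client. A minimal Python sketch that mirrors the curl call above (the helper name `build_chat_request` is ours, not part of the gateway):

```python
import json

GATEWAY_URL = "https://ai.viktorarsov.com/v1/chat/completions"

def build_chat_request(token: str, model: str, content: str):
    """Return (url, headers, body) for a chat completion, mirroring the curl example."""
    headers = {
        "Authorization": f"Bearer {token}",
        "Content-Type": "application/json",
    }
    body = json.dumps({"model": model,
                       "messages": [{"role": "user", "content": content}]})
    return GATEWAY_URL, headers, body
```

Pass the result to any HTTP library; because the surface is OpenAI-compatible, official OpenAI SDKs pointed at this base URL should also work.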
2. Endpoints
All paths are mounted under the root host. See /docs (FastAPI Swagger) for full request and response schemas.
| Method | Path | Description |
|---|---|---|
| POST | /v1/chat/completions | OpenAI-compatible chat completion (streaming + non-streaming). |
| POST | /v1/chat/completions/progress | Same payload, but emits an SSE feed of routing attempts as the spiral proceeds. |
| POST | /v1/messages | Anthropic-compatible Messages API ingress. Translated to OpenAI internally. |
| POST | /v1beta/models/{model}:generateContent | Gemini-native ingress. Translated to OpenAI internally. |
| POST | /v1/embeddings | Generate vector embeddings. |
| POST | /v1/moderations | OpenAI-compatible moderation pass. |
| POST | /v1/images/generations | Image generation (where provider supports it). |
| GET | /v1/models | List models the calling tenant can route to (including aliases). |
| GET | /v1/usage | Per-tenant usage rollup (tokens, cost, savings). |
| POST | /v1/feedback | Submit thumbs-up / thumbs-down on a prior request. |
| GET | /healthz | Liveness + dependency probe. |
3. Authentication
Every /v1/* call must include a project token in the standard bearer header:
Authorization: Bearer afp_live_xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
Tokens are issued from Dashboard → Tokens. Only the SHA-256 hash is stored server-side; the plaintext is shown once on creation.
Channel pinning suffix
Append :<channel> to your model string to pin to a routing channel — for example gateway:fast-free:dev isolates a dev workload's quota windows from your prod traffic. Channels are bookkeeping only; they don't change which provider answers.
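Because the channel is a plain suffix, pinning is just string concatenation; a tiny sketch (helper name is ours):

```python
def pin_channel(model: str, channel: str) -> str:
    """Append a ':<channel>' routing-channel suffix to any model string.

    Channels are bookkeeping only; they don't change which provider answers.
    """
    return f"{model}:{channel}"

# Isolate a dev workload's quota windows from prod traffic:
payload = {"model": pin_channel("gateway:fast-free", "dev"),
           "messages": [{"role": "user", "content": "hi"}]}
```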
4. Models & aliases
You can address models three ways:
- Alias — `gateway:fast-free`. Resolves to a ranked list of (provider, model) pairs.
- Provider-qualified — `groq/llama-3.1-8b-instant`. Pins to one provider.
- Bare — `llama-3.3-70b`. Matches across providers by display name.
Aliases
| Slug | Description | Resolves to |
|---|---|---|
| auto | Spiral rotation: T1→T5 across all 5 sources × all keys. The DEFAULT alias. | nvidia_nim/mistralai/mistral-large-3-675b-instruct-2512 → cerebras/qwen-3-235b-a22b-instruct-2507 → gemini/gemini-2.5-pro → groq/openai/gpt-oss-120b +146 more |
| best-free-chat | Best-quality free chat: try premium-feel models first, fall through to high-volume. | gemini/gemini-2.5-pro → nvidia_nim/mistralai/mistral-large-3-675b-instruct-2512 → openrouter/deepseek/deepseek-chat-v3-0324:free → gemini/gemini-2.5-flash +3 more |
| cheap-paid | Cheapest paid fallback when all free quotas exhausted. | groq/llama-3.1-8b-instant → gemini/gemini-2.5-flash-lite → groq/openai/gpt-oss-120b |
| code-free | Coding-tuned free models (autocomplete, refactor, explain). | nvidia_nim/qwen/qwen3-coder-480b-a35b-instruct → openrouter/qwen/qwen3-coder:free → groq/openai/gpt-oss-120b → cerebras/qwen-3-235b-a22b-instruct-2507 |
| fast-free | Low-latency cheap/free models for high-volume simple tasks. | groq/llama-3.1-8b-instant → cerebras/llama3.1-8b → gemini/gemini-2.5-flash-lite → openrouter/google/gemma-4-31b-it:free +1 more |
| long-context | Models with very large context windows (>200K). | gemini/gemini-2.5-pro → gemini/gemini-2.5-flash → openrouter/meta-llama/llama-4-scout:free → openrouter/qwen/qwen3-coder:free |
| reasoning | Stronger reasoning models. | nvidia_nim/deepseek-ai/deepseek-v3.2 → nvidia_nim/moonshotai/kimi-k2-thinking → nvidia_nim/qwen/qwen3-next-80b-a3b-thinking → openrouter/deepseek/deepseek-r1-zero:free +1 more |
| vision | Multimodal (image input) capable free models. | gemini/gemini-2.5-flash → gemini/gemini-2.5-flash-lite → nvidia_nim/meta/llama-4-maverick-17b-128e-instruct → openrouter/meta-llama/llama-4-scout:free +1 more |
Providers
| Provider | Adapter | Base URL |
|---|---|---|
| OpenRouter openrouter | openrouter | https://openrouter.ai/api/v1 |
| Google Gemini (AI Studio) gemini | gemini_native | https://generativelanguage.googleapis.com/v1beta |
| Groq groq | openai_compat | https://api.groq.com/openai/v1 |
| Cerebras cerebras | openai_compat | https://api.cerebras.ai/v1 |
| NVIDIA NIM (build.nvidia.com) nvidia_nim | openai_compat | https://integrate.api.nvidia.com/v1 |
Models
| Display name | Model id | Context | Tier | Capabilities |
|---|---|---|---|---|
| Whisper Large v3 (Groq) | whisper-large-v3 | 448 | audio | |
| Whisper Large v3 Turbo (Groq) | whisper-large-v3-turbo | 448 | audio | |
| Gemini 3.1 Pro (paid) | gemini-3.1-pro | 1048576 | chat | tools vision json stream |
| GPT-4.1 (direct) | gpt-4.1 | 1048576 | chat | tools vision json stream |
| Claude 3.5 Sonnet (direct) | claude-3-5-sonnet-20241022 | 200000 | chat | tools vision json stream |
| GPT-4o (direct) | gpt-4o | 128000 | chat | tools vision json stream |
| Mistral Large 3 675B (NIM) | mistralai/mistral-large-3-675b-instruct-2512 | 128000 | chat | tools stream |
| Claude 3 Opus (direct) | claude-3-opus-20240229 | 200000 | chat | tools vision json stream |
| Llama 3.1 405B (NIM) | meta/llama-3.1-405b-instruct | 128000 | chat | tools stream |
| Nemotron Ultra 253B (NIM) | nvidia/llama-3.1-nemotron-ultra-253b-v1 | 128000 | chat | tools stream |
| Gemini 2.5 Pro | gemini-2.5-pro | 1048576 | chat | tools vision json stream |
| DeepSeek V3.2 (NIM) | deepseek-ai/deepseek-v3.2 | 128000 | chat | tools stream |
| Hermes 3 Llama 3.1 405B (free) | nousresearch/hermes-3-llama-3.1-405b:free | 131072 | chat | tools stream |
| Llama 4 Maverick (NIM, vision) | meta/llama-4-maverick-17b-128e-instruct | 128000 | chat | tools vision stream |
| Nemotron 3 Super 120B (free) | nvidia/nemotron-3-super-120b-a12b:free | 262144 | chat | tools stream |
| Qwen3 235B A22B (Cerebras) | qwen-3-235b-a22b-instruct-2507 | 65536 | chat | tools stream |
| Arcee Trinity Large 400B (free) | arcee-ai/trinity-large-preview:free | 131072 | chat | tools stream |
| DeepSeek V3.1 Terminus (NIM) | deepseek-ai/deepseek-v3.1-terminus | 128000 | chat | tools stream |
| Gemini 3.1 Flash (paid) | gemini-3.1-flash | 1048576 | chat | tools vision json stream |
| Qwen3-Next 80B (free) | qwen/qwen3-next-80b-a3b-instruct:free | 262144 | chat | tools stream |
| Kimi K2 (NIM) | moonshotai/kimi-k2-instruct | 128000 | chat | tools stream |
| Gemini 2.5 Flash | gemini-2.5-flash | 1048576 | chat | tools vision json stream |
| GPT-OSS 120B (OR, free) | openai/gpt-oss-120b:free | 131072 | chat | tools json stream |
| Llama 3.3 70B | llama-3.3-70b-versatile | 131072 | chat | tools json stream |
| Groq Compound (agentic) | groq/compound | 131072 | chat | tools stream |
| Llama 3.3 70B (NIM) | meta/llama-3.3-70b-instruct | 128000 | chat | tools stream |
| GPT-OSS 120B (Groq) | openai/gpt-oss-120b | 131072 | chat | tools json stream |
| Gemma 4 31B (free) | google/gemma-4-31b-it:free | 262144 | chat | tools stream |
| GPT-OSS 120B (NIM) | openai/gpt-oss-120b | 128000 | chat | tools json stream |
| Nemotron Super 49B v1.5 (NIM) | nvidia/llama-3.3-nemotron-super-49b-v1.5 | 128000 | chat | tools stream |
| Llama 3.3 70B (free) | meta-llama/llama-3.3-70b-instruct:free | 65536 | chat | tools stream |
| Llama 4 Scout 17B (Groq) | meta-llama/llama-4-scout-17b-16e-instruct | 131072 | chat | tools vision stream |
| Z.ai GLM 4.5 Air (free) | z-ai/glm-4.5-air:free | 131072 | chat | tools stream |
| Gemma 4 26B (free) | google/gemma-4-26b-a4b-it:free | 262144 | chat | tools stream |
| Minimax M2.5 (free) | minimax/minimax-m2.5:free | 196608 | chat | tools stream |
| Nemotron 3 Nano 30B (free) | nvidia/nemotron-3-nano-30b-a3b:free | 256000 | chat | tools stream |
| Qwen3 32B (Groq) | qwen/qwen3-32b | 131072 | chat | tools json stream |
| Nemotron Nano 12B VL (free, vision) | nvidia/nemotron-nano-12b-v2-vl:free | 128000 | chat | tools vision stream |
| Gemma 3 27B (free) | google/gemma-3-27b-it:free | 131072 | chat | tools stream |
| Qwen3 Coder 480B (NIM) | qwen/qwen3-coder-480b-a35b-instruct | 128000 | code | tools stream |
| Qwen3 Coder 480B (free) | qwen/qwen3-coder:free | 262144 | code | tools stream |
| Claude 3.5 Haiku (direct) | claude-3-5-haiku-20241022 | 200000 | fast | tools vision json stream |
| GPT-4o mini (direct) | gpt-4o-mini | 128000 | fast | tools vision json stream |
| Groq Compound Mini (agentic-fast) | groq/compound-mini | 131072 | fast | tools stream |
| Gemini 2.0 Flash (deprecates 2026-06-01) | gemini-2.0-flash | 1048576 | fast | tools vision json stream |
| GPT-OSS 20B (OR, free) | openai/gpt-oss-20b:free | 131072 | fast | tools json stream |
| GPT-OSS 20B (NIM) | openai/gpt-oss-20b | 128000 | fast | tools json stream |
| GPT-OSS 20B (Groq) | openai/gpt-oss-20b | 131072 | fast | tools json stream |
| Nemotron Nano 9B (free) | nvidia/nemotron-nano-9b-v2:free | 128000 | fast | tools stream |
| Llama 3.1 8B Instant | llama-3.1-8b-instant | 131072 | fast | tools json stream |
| Gemini 3.1 Flash-Lite (paid) | gemini-3.1-flash-lite | 1048576 | fast | tools vision json stream |
| Llama 3.2 3B (free) | meta-llama/llama-3.2-3b-instruct:free | 131072 | fast | tools stream |
| Gemini 2.5 Flash-Lite | gemini-2.5-flash-lite | 1048576 | fast | tools vision json stream |
| Llama 3.1 8B (Cerebras) | llama3.1-8b | 8192 | fast | tools stream |
| o3-mini (direct, reasoning) | o3-mini | 200000 | reason | tools json stream |
| Kimi K2 Thinking (NIM, reasoning) | moonshotai/kimi-k2-thinking | 128000 | reason | tools stream |
| Qwen3-Next 80B Thinking (NIM) | qwen/qwen3-next-80b-a3b-thinking | 128000 | reason | tools stream |
5. Live progress streaming
POST /v1/chat/completions/progress accepts the same JSON body as /v1/chat/completions but always responds with text/event-stream. Each SSE event has a type field describing what just happened in the spiral:
```
event: progress
data: {"type":"attempt","provider":"groq","model":"llama-3.1-8b-instant","key":"k_3"}

event: progress
data: {"type":"failure","provider":"groq","status":429,"reason":"rate_limited"}

event: progress
data: {"type":"attempt","provider":"cerebras","model":"llama-3.3-70b","key":"k_1"}

event: progress
data: {"type":"success","provider":"cerebras","latency_ms":612,"tokens":248}

event: done
data: {"choices":[{"message":{"role":"assistant","content":"…"}}]}
```
Clients should keep the connection open until they see an event: done frame.
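Consuming the feed amounts to splitting frames on blank lines and JSON-decoding each `data:` payload. A minimal parser sketch (no SSE library; field names as shown in the sample above):

```python
import json

def parse_sse(raw: str):
    """Yield (event, data) pairs from a raw text/event-stream buffer."""
    for frame in raw.strip().split("\n\n"):
        event, data = None, None
        for line in frame.splitlines():
            if line.startswith("event:"):
                event = line[len("event:"):].strip()
            elif line.startswith("data:"):
                data = json.loads(line[len("data:"):].strip())
        if event is not None:
            yield event, data
```

A real client would read the stream incrementally and stop once it sees the `done` event.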
6. Cross-format ingress
The same model pool can be reached using non-OpenAI request shapes:
- `POST /v1/messages` — Anthropic Messages API surface. Same auth header; the response shape matches Anthropic's.
- `POST /v1beta/models/{model}:generateContent` — Google Gemini REST shape. Pass an `x-goog-api-key` header containing your `afp_live_…` token, or use `?key=…`.
Both surfaces translate to the OpenAI-compatible internal pipeline so the same routing, quotas, and webhooks apply.
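For the Gemini-shaped surface, authentication can ride in a header or the query string. A small sketch of both options (the helper name is ours):

```python
def gemini_request(host: str, model: str, token: str, use_query_key: bool = False):
    """Build (url, headers) for the Gemini-native generateContent ingress."""
    url = f"{host}/v1beta/models/{model}:generateContent"
    headers = {"Content-Type": "application/json"}
    if use_query_key:
        url += f"?key={token}"          # ?key=… form
    else:
        headers["x-goog-api-key"] = token  # header form
    return url, headers
```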
7. Quotas & rate limits
Three layers cooperate:
- Plan caps — monthly request and token ceilings, set per plan.
- RPM cap — sliding-window per project token, enforced in Redis.
- Per-key daily quotas — enforced atomically inside the routing layer, with cooldown until provider midnight on 429/401.
When all keys are cooling down, the gateway parks the request for up to `max_wait_ms` (passed via the `x-gateway-max-wait-ms` header), then either serves it or returns 429 with a `next_slot_eta_s` field in the body.
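Client-side, a 429 carrying `next_slot_eta_s` can be turned into a sleep-or-give-up decision. A hedged sketch (the budget logic is ours, not gateway behavior):

```python
def seconds_to_wait(body: dict, budget_s: float):
    """Given a 429 body with next_slot_eta_s, return how long to sleep
    before retrying, or None if the wait exceeds the caller's budget."""
    eta = body.get("next_slot_eta_s")
    if eta is None or eta > budget_s:
        return None
    return float(eta)
```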
8. Error codes
| Status | Meaning | What to do |
|---|---|---|
| 200 | OK | Read body and headers. |
| 400 | Malformed request body | Validate JSON shape. |
| 401 | Missing or invalid project token | Re-issue token in dashboard. |
| 403 | Token disabled or tenant disabled | Contact admin. |
| 404 | Unknown model or alias | List models via /v1/models. |
| 429 | Plan, RPM, or pool exhausted | Honor Retry-After or next_slot_eta_s. |
| 500 | Internal error | Retry; surface request id from response header. |
| 502 | Upstream provider returned bad payload | Retry with another model. |
| 503 | No healthy keys for any provider in the resolution list | Wait or use a different alias. |
| 504 | Upstream timeout | Retry; consider a smaller model. |
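The table above implies a simple retry policy: transient statuses are worth retrying, client errors are not. A minimal classifier sketch:

```python
# Statuses the table marks as retryable (quota, internal, upstream issues).
RETRYABLE = {429, 500, 502, 503, 504}

def should_retry(status: int) -> bool:
    """Retry transient statuses; fail fast on 4xx client errors."""
    return status in RETRYABLE
```

On 429 and 503, honor `Retry-After` (or `next_slot_eta_s`) before retrying rather than retrying immediately.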
9. Webhooks
Configure target URLs in Dashboard → Webhooks. We POST a JSON body and an HMAC-SHA256 signature in X-AFP-Signature:
```json
{
  "type": "request.completed",
  "tenant_id": "…",
  "request_id": "…",
  "model": "groq/llama-3.1-8b-instant",
  "tokens_in": 41,
  "tokens_out": 207,
  "cost_usd": 0.0000,
  "savings_usd": 0.00031,
  "latency_ms": 612,
  "ts": "2026-04-21T10:14:33Z"
}
```
Event types:
- `request.completed` — every successful chat / embeddings / image call.
- `request.failed` — every terminal failure after the spiral exhausts.
- `quota.threshold` — fired at 80% and 100% of the monthly cap.
- `invoice.created` — emitted at the start of each billing period.
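Receivers should verify the signature over the raw request body before trusting the payload. A sketch assuming `X-AFP-Signature` carries the hex HMAC-SHA256 digest of the body (the exact encoding isn't specified above, so treat the hex assumption as ours):

```python
import hashlib
import hmac

def verify_webhook(secret: bytes, raw_body: bytes, signature_header: str) -> bool:
    """Recompute HMAC-SHA256 over the raw body and compare it, in constant
    time, against the X-AFP-Signature header value."""
    expected = hmac.new(secret, raw_body, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, signature_header)
```

Always verify against the raw bytes as received; re-serializing the JSON first can change key order or whitespace and break the comparison.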
10. Response headers we add
Every successful response includes:
| Header | Value |
|---|---|
| x-gateway-provider | Slug of the provider that answered (e.g. groq). |
| x-gateway-model | The concrete model id served (post-alias-resolution). |
| x-gateway-latency-ms | End-to-end latency including all retries. |
| x-gateway-tokens | in/out token counts, comma-separated. |
| x-gateway-attempts | How many (provider, key) attempts were made. |
| x-gateway-cost-usd | Computed cost using the model's per-MTok pricing. |
| x-gateway-request-id | Stable id you can quote in support tickets. |
| Retry-After | Seconds to wait, on 429 / 503 only. |
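The `x-gateway-tokens` header can be split into input/output counts. A tiny sketch assuming the documented comma-separated `in,out` layout:

```python
def parse_gateway_tokens(value: str):
    """Split the x-gateway-tokens header ('in,out') into two ints."""
    tokens_in, tokens_out = value.split(",", 1)
    return int(tokens_in), int(tokens_out)
```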