Model Performance and Pricing
Note
All prices are in CHF per 1,000 tokens. MMLU scores indicate model performance on the Massive Multitask Language Understanding benchmark.
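The cost of a single request is the sum of its input and output token counts, each divided by 1,000 and multiplied by the corresponding price. The sketch below only illustrates this arithmetic; the price dictionary and the `request_cost_chf` helper are hypothetical (values copied from the tables that follow), not part of any SDK.

```python
# Illustrative cost arithmetic only; prices are CHF per 1,000 tokens,
# copied from the pricing tables below. Not an official SDK helper.
PRICES_CHF_PER_1K = {
    "gpt-4o":       {"input": 0.0038, "output": 0.0150},
    "llama3-swiss": {"input": 0.0150, "output": 0.0450},
}

def request_cost_chf(model: str, input_tokens: int, output_tokens: int) -> float:
    """Cost of one request in CHF, given token counts."""
    price = PRICES_CHF_PER_1K[model]
    return (input_tokens / 1000) * price["input"] + (output_tokens / 1000) * price["output"]

# Example: 2,000 input tokens and 500 output tokens on gpt-4o
# -> 2 * 0.0038 + 0.5 * 0.0150 = 0.0151 CHF
print(f"{request_cost_chf('gpt-4o', 2000, 500):.4f}")
```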
High Performance Models
| Model | Input (CHF / 1K tokens) | Output (CHF / 1K tokens) | MMLU |
| --- | --- | --- | --- |
| llama3-swiss 🇨🇭 | 0.0150 | 0.0450 | 85.2% |
| gpt-4o | 0.0038 | 0.0150 | 92.3% |
| claude-sonnet | 0.0045 | 0.0225 | 88.7% |
Balanced Models
| Model | Input (CHF / 1K tokens) | Output (CHF / 1K tokens) | MMLU |
| --- | --- | --- | --- |
| llama-swiss-medium 🇨🇭 | 0.0075 | 0.0150 | 79.2% |
| mixtral-swiss-big 🇨🇭 | 0.0015 | 0.0045 | N/A |
| mistral-medium | 0.0041 | 0.0122 | 77.3% |
| mixtral-swiss-medium 🇨🇭 | 0.0045 | 0.0150 | 77.3% |
| gpt-4 | 0.0450 | 0.0900 | 86.5% |
| claude-opus | 0.0225 | 0.1125 | 86.8% |
Efficient Models
| Model | Input (CHF / 1K tokens) | Output (CHF / 1K tokens) | MMLU |
| --- | --- | --- | --- |
| gpt-3.5-turbo-1106 | 0.0015 | 0.0030 | ~70% |
| mistral-tiny | 0.0004 | 0.0013 | 60.1% |
| mistral-small | 0.0002 | 0.0005 | 70.6% |
Uncategorized Models
| Model | Input (CHF / 1K tokens) | Output (CHF / 1K tokens) | MMLU |
| --- | --- | --- | --- |
| gpt-4o-mini | 0.0002 | 0.0009 | |
| gpt-4.1 | 0.0030 | 0.0120 | |
| gpt-4-1106-preview | 0.0150 | 0.0450 | |
| gpt-4-0125-preview | 0.0150 | 0.0450 | |
| gpt-4-turbo | 0.0150 | 0.0450 | |
| gpt-4.5 | 0.1125 | 0.2250 | |
| gpt-5 | 0.0019 | 0.0150 | |
| gpt-swiss | 0.0200 | 0.0600 | |
| o1-mini | 0.0017 | 0.0017 | |
| o1-pro | 0.2250 | 0.9000 | |
| o1-review | 0.0225 | 0.0900 | |
| o3-mini | 0.0017 | 0.0066 | |
| mistral-swiss | 0.0450 | 0.0900 | |
| mistral-small-swiss | 0.0002 | 0.0005 | |
| gemini-2-5-pro-preview | 0.0038 | 0.0225 | |
| deepseek-r1 | 0.0300 | 0.0900 | |
| deepseek-reasoner | 0.0008 | 0.0033 | |
| deepseek-chat | 0.0004 | 0.0017 | |
| gemma-3 | 0.0170 | 0.0400 | |
| gemma-3-swiss | 0.0200 | 0.0600 | |
| qwen-3-fast | 0.0005 | 0.0014 | |
| qwen-3 | 0.0003 | 0.0009 | |
Performance Notes
MMLU scores marked with (1) indicate single-shot performance
Scores marked with (5-shot) use few-shot learning
N/A indicates pending benchmark data
MMLU scores
While MMLU scores provide a useful metric for comparing language model capabilities, they represent only one dimension of performance. These scores primarily measure how well models handle a standardized set of tasks; they do not fully capture broader skills such as multilingual comprehension, information retention, domain-specific usage, programming proficiency, or complex reasoning. In practice, different tasks place different demands on a model's underlying architecture and training data, so performance can vary considerably across these domains. As a result, MMLU should be seen as a helpful indicator rather than a definitive measure of a model's overall quality or suitability for a given application. Source: LLM Leaderboard
Rate Limiting
All endpoints share a combined spending limit of CHF 50 per month. If you would like to increase it, please contact support with your estimated usage and use case.
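Because the limit applies to combined spend across all endpoints, it can be useful to track usage on the client side as well. The sketch below is a hypothetical budget guard built only on the CHF 50 figure above; the class and its methods are not part of the API, and actual enforcement happens on the server.

```python
# Hypothetical client-side budget guard; the platform enforces the real
# CHF 50/month limit server-side, this only helps avoid surprises.
class BudgetTracker:
    def __init__(self, limit_chf: float = 50.0):
        self.limit_chf = limit_chf
        self.spent_chf = 0.0

    def record(self, cost_chf: float) -> None:
        """Add the cost of a completed request to the running total."""
        self.spent_chf += cost_chf

    def remaining(self) -> float:
        """CHF left before the monthly ceiling is reached."""
        return max(self.limit_chf - self.spent_chf, 0.0)

    def can_afford(self, estimated_cost_chf: float) -> bool:
        """Check an estimated request cost against the monthly ceiling."""
        return self.spent_chf + estimated_cost_chf <= self.limit_chf

tracker = BudgetTracker()
tracker.record(12.75)
print(tracker.remaining())       # 37.25
print(tracker.can_afford(40.0))  # False
```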
