Querying Available Models
Each data[].id in the response is a value you can pass as the model parameter.

Chat Models

| Model | Context | Features | Use Cases |
|---|---|---|---|
| gpt-4o | 128K | Multimodal, well-rounded | Complex tasks, image + text understanding |
| gpt-4o-mini | 128K | Fast, low cost | High-concurrency scenarios, simple conversations |
| gpt-4.1 | 1M | Ultra-long context | Long document processing, codebase analysis |
| gpt-4.1-mini | 1M | Long context + low cost | Long document summarization |
| gpt-4.1-nano | 1M | Fastest and cheapest | Lightweight tasks like classification and extraction |
| o3 | 200K | Enhanced reasoning | Complex reasoning, scientific questions |
| o4-mini | 200K | Reasoning + low cost | Daily reasoning tasks |
| claude-sonnet-4-20250514 | 200K | Strong at coding and reasoning | Code generation, analysis |
| claude-opus-4-20250514 | 200K | Top-tier reasoning, deep analysis | Complex professional tasks, long-form writing |
| gemini-2.5-pro | 1M | Ultra-long context + multimodal | Long documents, multimodal analysis |

Embedding Models

| Model | Dimensions | Description |
|---|---|---|
| text-embedding-3-large | 3072 (reducible) | High precision, recommended for production |
| text-embedding-3-small | 1536 (reducible) | Fast, low cost |
| text-embedding-ada-002 | 1536 | Legacy compatibility |

Image Models

| Model | Max Size | Special Features |
|---|---|---|
| gpt-image-1 | 1536x1024 | Transparent background, moderation level control |
| dall-e-3 | 1792x1024 | High resolution, style options |
| dall-e-2 | 1024x1024 | Basic generation, multiple outputs |

Audio Models

| Model | Use | Description |
|---|---|---|
| tts-1 | Text-to-speech | Standard quality |
| tts-1-hd | Text-to-speech | High-definition quality |
| gpt-4o-mini-tts | Text-to-speech | Supports style instructions |
| whisper-1 | Speech-to-text / Translation | Multilingual support |

Video Models

| Model | Description |
|---|---|
| veo-2 | Google video generation |
| sora | OpenAI text/image to video |

Rerank Models

| Model | Description |
|---|---|
| cohere-rerank-v3 | Cohere reranking, recommended for RAG scenarios |

Moderation Models

| Model | Description |
|---|---|
| omni-moderation-latest | Multimodal moderation, supports text + image |

Note: The models actually available are those returned by GET /v1/models. The platform continues to add new models.
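
Because the model list changes over time, it can help to read it from GET /v1/models rather than hard-coding names. The following is a minimal sketch, not the platform's official client: the `base_url` argument and the Bearer authorization scheme are assumptions, and the sample payload is hand-written for illustration.

```python
import json
import urllib.request


def extract_model_ids(payload: dict) -> list[str]:
    """Return every data[].id found in a GET /v1/models response body."""
    return [entry["id"] for entry in payload.get("data", []) if "id" in entry]


def fetch_model_ids(base_url: str, api_key: str) -> list[str]:
    """Call GET /v1/models and return the model names it reports.

    base_url and the Bearer auth header are assumptions; adjust them
    to match your deployment.
    """
    request = urllib.request.Request(
        f"{base_url.rstrip('/')}/v1/models",
        headers={"Authorization": f"Bearer {api_key}"},
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return extract_model_ids(json.load(response))


# Offline illustration with a hand-written payload (not real platform output).
sample = {"data": [{"id": "gpt-4o"}, {"id": "gpt-4o-mini"}]}
print(extract_model_ids(sample))
```

Keeping the parsing step separate from the network call makes it easy to check the extraction logic without a live Key.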
2. Quota and Billing
Billing Method
Nexhina bills by Token usage, with different prices for different models.
- Input Tokens (prompt_tokens): Content you send to the model
- Output Tokens (completion_tokens): Content generated by the model
- Generally, the unit price of output Tokens is higher than that of input Tokens
What is a Token
A Token is the basic unit of text processed by a model. A rough conversion:

| Language | 1 Token ≈ |
|---|---|
| English | 4 characters / 0.75 words |
| Chinese | 1-2 characters |
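
The rough conversion above can be turned into a quick estimator for budgeting. This is only a heuristic sketch: the function name is hypothetical, the 1.5 characters-per-Token figure for Chinese is an assumed midpoint of the 1-2 range, and real tokenizers will give different counts.

```python
import math


def estimate_tokens(text: str) -> int:
    """Roughly estimate the Token count of mixed English/Chinese text."""
    # Count characters in the main CJK Unified Ideographs block.
    cjk_chars = sum(1 for ch in text if "\u4e00" <= ch <= "\u9fff")
    other_chars = len(text) - cjk_chars
    # English: about 4 characters per Token.
    # Chinese: 1-2 characters per Token; 1.5 is an assumed midpoint.
    return math.ceil(other_chars / 4 + cjk_chars / 1.5)


print(estimate_tokens("Summarize the following document in three bullet points."))
```

For billing purposes, treat the result as an order-of-magnitude guide and rely on the usage field in the API response for actual counts.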
Model Multipliers
Different models have different prices, calculated via multipliers. Using gpt-4o-mini as the baseline (1x):

| Model | Input Multiplier | Output Multiplier | Description |
|---|---|---|---|
| gpt-4o-mini | 1x | 1x | Baseline |
| gpt-4o | 5x | 15x | Strong capability, high price |
| gpt-4.1 | 10x | 30x | Long context |
| claude-sonnet-4 | 6x | 30x | Strong at coding |
| claude-opus-4 | 15x | 75x | Top-tier reasoning |

Multipliers are for reference only; actual values are subject to the backend configuration. Administrators can adjust them in Operations Settings → Model Pricing.
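
To compare the relative cost of a request across models, the multipliers can be applied to the Token counts. The sketch below assumes both multipliers scale one shared baseline unit (one gpt-4o-mini Token); the table does not state this explicitly, so treat the result as a relative comparison and not a price. The function and dictionary names are hypothetical, and the values are copied from the reference table, not from a live configuration.

```python
# (input multiplier, output multiplier), copied from the reference table.
MULTIPLIERS = {
    "gpt-4o-mini": (1, 1),
    "gpt-4o": (5, 15),
    "gpt-4.1": (10, 30),
    "claude-sonnet-4": (6, 30),
    "claude-opus-4": (15, 75),
}


def weighted_usage(model: str, prompt_tokens: int, completion_tokens: int) -> int:
    """Return usage in baseline units under the shared-baseline assumption."""
    input_mult, output_mult = MULTIPLIERS[model]
    return prompt_tokens * input_mult + completion_tokens * output_mult


# 1000 input + 200 output Tokens on gpt-4o: 1000*5 + 200*15 = 8000 units,
# versus 1000*1 + 200*1 = 1200 units on gpt-4o-mini.
print(weighted_usage("gpt-4o", 1000, 200))
print(weighted_usage("gpt-4o-mini", 1000, 200))
```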
Querying Quota
- Log in to the admin console and check a Key's used and remaining quota under Token Management
- Or read the usage field in the API response to get the consumption of the current request in real time
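
Reading the usage field on every response makes it possible to track consumption on the client side. A minimal sketch follows; the class name is hypothetical, and it only relies on the prompt_tokens and completion_tokens fields named earlier in this section.

```python
from dataclasses import dataclass


@dataclass
class UsageTotals:
    """Running totals of Tokens reported in API responses."""

    prompt_tokens: int = 0
    completion_tokens: int = 0

    def add(self, response_body: dict) -> None:
        """Add the usage reported by one response; ignore responses without it."""
        usage = response_body.get("usage") or {}
        self.prompt_tokens += int(usage.get("prompt_tokens", 0))
        self.completion_tokens += int(usage.get("completion_tokens", 0))


totals = UsageTotals()
# Hand-written response bodies, for illustration only.
totals.add({"usage": {"prompt_tokens": 120, "completion_tokens": 45}})
totals.add({"usage": {"prompt_tokens": 80, "completion_tokens": 30}})
print(totals)
```

The client-side totals are only a convenience; the admin console remains the source of truth for remaining quota.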
Quota Exhausted
After a Key's quota is used up, requests will return:

{
  "error": {
    "message": "Insufficient quota",
    "type": "insufficient_quota",
    "code": "insufficient_quota"
  }
}

The HTTP status code is 402. At this point, you need to top up or switch to a Key with quota.

Billing for Different Interfaces

| Interface | Billing Basis |
|---|---|
| Chat / Responses | Input + Output Tokens |
| Embeddings | Input Tokens |
| Images | Per image and per model, not Token-based |
| Audio TTS | Per input character count |
| Audio STT / Translation | Per audio duration |
| Video | Per request |
| Moderation | Input Tokens (usually a small amount) |
| Rerank | Input Tokens |
3. Rate Limits
Limit Dimensions

| Dimension | Meaning |
|---|---|
| RPM | Requests Per Minute |
| TPM | Tokens Per Minute |
Limit Rules
- Limits apply per API Key; each Key is counted independently
- Administrators can set different limits for different token groups in the backend
- Default limits vary by deployment configuration; contact the administrator for specific values
Rate Limit Exceeded Response

{
  "error": {
    "message": "Rate limit reached for default",
    "type": "rate_limit_error",
    "code": "rate_limit_exceeded"
  }
}

The HTTP status code is 429.
Rate limit information is returned via HTTP response headers:

| Header | Meaning |
|---|---|
| X-RateLimit-Limit | Total limit for the current window |
| X-RateLimit-Remaining | Remaining requests in the current window |
| X-RateLimit-Reset | Limit reset time (Unix timestamp) |
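
The 402 and 429 responses call for different reactions, and the X-RateLimit-* headers say how long to wait. The sketch below shows one way a client might interpret them. The function names and returned labels are hypothetical, and it assumes the header values are integer strings, which this document does not guarantee.

```python
import time
from typing import Mapping, Optional


def classify_error(status: int, body: dict) -> str:
    """Map an error response to the reaction described in this document."""
    code = (body.get("error") or {}).get("code")
    if status == 402 or code == "insufficient_quota":
        return "top_up_or_switch_key"
    if status == 429 or code == "rate_limit_exceeded":
        return "back_off_and_retry"
    return "unhandled"


def read_rate_limit(headers: Mapping[str, str], now: Optional[float] = None) -> dict:
    """Extract the X-RateLimit-* values; missing headers become None."""
    # Header names are case-insensitive in HTTP, so normalize them first.
    lowered = {name.lower(): value for name, value in headers.items()}
    now = time.time() if now is None else now

    def as_int(name: str) -> Optional[int]:
        raw = lowered.get(name)
        return int(raw) if raw is not None else None

    reset_at = as_int("x-ratelimit-reset")  # Unix timestamp
    return {
        "limit": as_int("x-ratelimit-limit"),
        "remaining": as_int("x-ratelimit-remaining"),
        "seconds_until_reset": None if reset_at is None else max(0.0, reset_at - now),
    }
```

A caller could pause for `seconds_until_reset` once `remaining` reaches zero instead of sending a request that is likely to be rejected.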
Strategies
1. Read Response Headers: Check X-RateLimit-Remaining after each request to anticipate when you are about to hit the limit
2. Client-Side Throttling: Implement local rate limiting on the client; don't wait for a 429 before slowing down
3. Exponential Backoff: After receiving a 429, wait 1s → 2s → 4s → 8s before retrying
4. Multi-Key Rotation: Configure multiple Keys and rotate them to increase total throughput
5. Reduce Wasted Tokens: Streamline prompts and avoid duplicate context
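
Exponential backoff and multi-Key rotation can be combined in a small retry wrapper. The following is a sketch under stated assumptions, not a reference implementation: `send` is a hypothetical callable that performs one request with the given Key and returns `(status_code, body)`, and the `sleep` parameter is injectable so the logic can be exercised without real delays.

```python
import itertools
import time
from typing import Callable, Sequence, Tuple

Response = Tuple[int, dict]


def call_with_backoff(
    send: Callable[[str], Response],
    api_keys: Sequence[str],
    delays: Sequence[float] = (1, 2, 4, 8),
    sleep: Callable[[float], None] = time.sleep,
) -> Response:
    """Retry on HTTP 429 with exponential backoff, rotating Keys each attempt."""
    if not api_keys:
        raise ValueError("at least one API Key is required")
    keys = itertools.cycle(api_keys)  # round-robin rotation across Keys
    status, body = send(next(keys))
    for delay in delays:
        if status != 429:
            break  # success or a non-rate-limit error: stop retrying
        sleep(delay)  # wait 1s, 2s, 4s, 8s between attempts
        status, body = send(next(keys))
    return status, body
```

With the default delays the wrapper makes at most five attempts and returns the last response, so the caller still needs to handle a final 429. Adding random jitter to each delay is a common refinement when many clients retry at once.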
Batch Calls
Dimensionality reduction is a feature of OpenAI's text-embedding-3 series. Other embedding models use their fixed dimensions.
Reducing dimensions costs some precision. We recommend starting with a high dimension and lowering it gradually while checking result quality.
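
A request builder can enforce the rule that only the text-embedding-3 series accepts a reduced dimension. The sketch below uses the `input` and `dimensions` field names from OpenAI's Embeddings API; that this platform forwards `dimensions` unchanged is an assumption, and the function name is hypothetical.

```python
from typing import Optional


def build_embedding_request(model: str, text: str, dimensions: Optional[int] = None) -> dict:
    """Build an embeddings request body, allowing reduction only where supported."""
    payload = {"model": model, "input": text}
    if dimensions is not None:
        if not model.startswith("text-embedding-3"):
            # Other embedding models use their fixed dimensions.
            raise ValueError(f"{model} does not support dimensionality reduction")
        payload["dimensions"] = dimensions
    return payload


print(build_embedding_request("text-embedding-3-large", "hello", dimensions=1024))
```

Starting at the model's full size and lowering `dimensions` step by step while measuring retrieval quality follows the recommendation above.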

| Parameter | Recommendation |
|---|---|
| model | Default to cohere-rerank-v3, currently the most general-purpose option |
| top_n | Usually 3-5; there is no need to return more |
| return_documents | Set to true to avoid looking up the original text by index |
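
The recommendations above translate into a request body along the following lines. Only model, top_n, and return_documents come from the table; the `query` and `documents` field names are assumptions modeled on Cohere's rerank API, and the function name is hypothetical.

```python
from typing import Sequence


def build_rerank_request(
    query: str,
    documents: Sequence[str],
    top_n: int = 3,
    model: str = "cohere-rerank-v3",
    return_documents: bool = True,
) -> dict:
    """Build a rerank request body using the recommended defaults."""
    if not documents:
        raise ValueError("documents must not be empty")
    return {
        "model": model,
        "query": query,  # assumed field name
        "documents": list(documents),  # assumed field name
        # Never ask for more results than there are candidate documents.
        "top_n": min(top_n, len(documents)),
        "return_documents": return_documents,
    }


print(build_rerank_request("What is a Token?", ["doc A", "doc B"], top_n=5))
```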
Modified at 2026-06-30 09:30:09