Quick to use

Querying Available Models

Call GET /v1/models to list the models that are currently available. Each data[].id in the response is a valid value for the model parameter.
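As a minimal sketch of reading that response, the snippet below pulls every data[].id out of a body shaped like an OpenAI-compatible model list. The sample body and the helper name `list_model_ids` are illustrative assumptions; only the data[].id field is described in this document.

```python
import json

# Sample body shaped like an OpenAI-compatible GET /v1/models response.
# Only data[].id is documented here; the other fields are illustrative.
SAMPLE_RESPONSE = """
{
  "object": "list",
  "data": [
    {"id": "gpt-4o", "object": "model"},
    {"id": "gpt-4o-mini", "object": "model"},
    {"id": "text-embedding-3-small", "object": "model"}
  ]
}
"""


def list_model_ids(body: str) -> list:
    """Return every data[].id found in a model-list response body."""
    payload = json.loads(body)
    return [entry["id"] for entry in payload.get("data", [])]


model_ids = list_model_ids(SAMPLE_RESPONSE)
print(model_ids)
```

Any of the returned strings can then be passed as the model parameter of a request.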

Chat Models

Model	Context	Features	Use Cases
`gpt-4o`	128K	Multimodal, well-rounded	Complex tasks, image + text understanding
`gpt-4o-mini`	128K	Fast, low cost	High-concurrency scenarios, simple conversations
`gpt-4.1`	1M	Ultra-long context	Long document processing, codebase analysis
`gpt-4.1-mini`	1M	Long context + low cost	Long document summarization
`gpt-4.1-nano`	1M	Fastest and cheapest	Lightweight tasks like classification and extraction
`o3`	200K	Enhanced reasoning	Complex reasoning, scientific questions
`o4-mini`	200K	Reasoning + low cost	Daily reasoning tasks
`claude-sonnet-4-20250514`	200K	Strong at coding and reasoning	Code generation, analysis
`claude-opus-4-20250514`	200K	Top-tier reasoning, deep analysis	Complex professional tasks, long-form writing
`gemini-2.5-pro`	1M	Ultra-long context + multimodal	Long documents, multimodal analysis

Embedding Models

Model	Dimensions	Description
`text-embedding-3-large`	3072 (reducible)	High precision, recommended for production
`text-embedding-3-small`	1536 (reducible)	Fast, low cost
`text-embedding-ada-002`	1536	Legacy compatibility

Image Models

Model	Max Size	Special Features
`gpt-image-1`	1536x1024	Transparent background, moderation level control
`dall-e-3`	1792x1024	High resolution, style options
`dall-e-2`	1024x1024	Basic generation, multiple outputs

Audio Models

Model	Use	Description
`tts-1`	Text-to-speech	Standard quality
`tts-1-hd`	Text-to-speech	High-definition quality
`gpt-4o-mini-tts`	Text-to-speech	Supports style instructions
`whisper-1`	Speech-to-text / Translation	Multilingual support

Video Models

Model	Description
`veo-2`	Google video generation
`sora`	OpenAI text/image to video

Rerank Models

Model	Description
`cohere-rerank-v3`	Cohere reranking, recommended for RAG scenarios

Moderation Models

Model	Description
`omni-moderation-latest`	Multimodal moderation, supports text + image

Note: The tables above are a reference. The models actually available to you are those returned by GET /v1/models, and the platform continues to add new models.

2. Quota and Billing

Billing Method

Nexhina bills by Token usage, with different prices for different models.

Input Tokens (prompt_tokens): Content you send to the model

Output Tokens (completion_tokens): Content generated by the model

Output Tokens generally have a higher unit price than input Tokens.

What is a Token

A Token is the basic unit of text processed by a model. A rough conversion:

Language	1 Token ≈
English	4 characters / 0.75 words
Chinese	1-2 characters
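The rough conversion above can be turned into a quick estimator. This is only a sketch under stated assumptions: it uses 4 characters per Token for non-Chinese text and 1.5 characters per Token (the midpoint of 1-2) for Chinese characters, and the helper name `estimate_tokens` is hypothetical. Real counts depend on each model's tokenizer.

```python
def estimate_tokens(text: str) -> int:
    """Rough Token estimate based on the conversion table, not a real tokenizer."""
    # Count characters in the main CJK Unified Ideographs block as Chinese.
    chinese = sum(1 for ch in text if "\u4e00" <= ch <= "\u9fff")
    other = len(text) - chinese
    # ~4 characters per Token for English-like text, ~1.5 for Chinese.
    return round(other / 4 + chinese / 1.5)


print(estimate_tokens("Summarize the following report in three bullet points."))
```

An estimate like this is useful for budgeting before a request is sent; the usage field in the response remains the authoritative count.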

Model Multipliers

Different models have different prices, calculated via multipliers. Using gpt-4o-mini as the baseline (1x):

Model	Input Multiplier	Output Multiplier	Description
`gpt-4o-mini`	1x	1x	Baseline
`gpt-4o`	5x	15x	Strong capability, high price
`gpt-4.1`	10x	30x	Long context
`claude-sonnet-4`	6x	30x	Strong at coding
`claude-opus-4`	15x	75x	Top-tier reasoning

Multipliers are for reference only; actual values are subject to the backend configuration. Administrators can adjust them in Operations Settings → Model Pricing.
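To illustrate how the multipliers scale a request, the sketch below converts a request's Token counts into baseline-equivalent Tokens. It assumes the input multiplier is relative to the baseline input price and the output multiplier to the baseline output price, so the two results are kept separate rather than summed. The values are copied from the reference table and the function name `baseline_equivalent` is hypothetical.

```python
# Reference multipliers from the table above: (input, output).
# Actual values are set in the backend configuration and may differ.
MULTIPLIERS = {
    "gpt-4o-mini": (1, 1),
    "gpt-4o": (5, 15),
    "gpt-4.1": (10, 30),
    "claude-sonnet-4": (6, 30),
    "claude-opus-4": (15, 75),
}


def baseline_equivalent(model: str, prompt_tokens: int, completion_tokens: int) -> tuple:
    """Return (input, output) usage expressed in gpt-4o-mini-equivalent Tokens."""
    input_mult, output_mult = MULTIPLIERS[model]
    return prompt_tokens * input_mult, completion_tokens * output_mult


# 1,000 input and 200 output Tokens on gpt-4o cost as much as
# 5,000 input and 3,000 output Tokens on the baseline model.
print(baseline_equivalent("gpt-4o", 1000, 200))
```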

Querying Quota

You can also read the consumption of the current request in real time from the usage field in the API response.
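A minimal sketch of reading that field follows. The prompt_tokens and completion_tokens names come from the Billing Method section; the total_tokens field and the sample body are assumptions based on the usual OpenAI-compatible response shape, and `read_usage` is a hypothetical helper.

```python
import json

# Illustrative response body; only the usage field matters here.
SAMPLE_BODY = (
    '{"id": "chatcmpl-example", '
    '"usage": {"prompt_tokens": 12, "completion_tokens": 30, "total_tokens": 42}}'
)


def read_usage(body: str) -> dict:
    """Extract Token consumption from a response body, tolerating missing fields."""
    usage = json.loads(body).get("usage") or {}
    prompt = usage.get("prompt_tokens", 0)
    completion = usage.get("completion_tokens", 0)
    return {
        "prompt_tokens": prompt,
        "completion_tokens": completion,
        # Fall back to the sum if the server omits total_tokens.
        "total_tokens": usage.get("total_tokens", prompt + completion),
    }


print(read_usage(SAMPLE_BODY))
```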

Quota Exhausted

After a Key's quota is used up, requests will return:

{
  "error": {
    "message": "Insufficient quota",
    "type": "insufficient_quota",
    "code": "insufficient_quota"
  }
}

The HTTP status code is 402. To continue, top up the quota or switch to a Key that still has quota.
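A client can recognize this condition by checking both the status code and the error code, as in the sketch below. The helper name `is_quota_exhausted` is hypothetical; the status code and error fields are the ones shown above.

```python
import json


def is_quota_exhausted(status_code: int, body: str) -> bool:
    """Return True when a response matches the documented quota-exhausted error."""
    if status_code != 402:
        return False
    try:
        payload = json.loads(body)
    except json.JSONDecodeError:
        return False
    if not isinstance(payload, dict):
        return False
    error = payload.get("error")
    return isinstance(error, dict) and error.get("code") == "insufficient_quota"


body = '{"error": {"message": "Insufficient quota", "type": "insufficient_quota", "code": "insufficient_quota"}}'
print(is_quota_exhausted(402, body))
```

Retrying with the same Key will not help in this case, so a check like this is a natural place to trigger a top-up alert or a Key switch.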

Billing for Different Interfaces

Interface	Billing Basis
Chat / Responses	Input + Output Tokens
Embeddings	Input Tokens
Images	Per image and per model, not Token-based
Audio TTS	Per input character count
Audio STT / Translation	Per audio duration
Video	Per request
Moderation	Input Tokens (usually a small amount)
Rerank	Input Tokens

3. Rate Limits

Limit Dimensions

Dimension	Meaning
RPM	Requests Per Minute
TPM	Tokens Per Minute

Limit Rules

Limits apply per API Key; each Key is counted independently

Administrators can set different limits for different token groups in the backend

Default limits vary by deployment configuration; contact the administrator for specific values

Rate Limit Exceeded Response

{
  "error": {
    "message": "Rate limit reached for default",
    "type": "rate_limit_error",
    "code": "rate_limit_exceeded"
  }
}

The HTTP status code is 429.

Response Headers

Rate limit information is returned via HTTP response headers:

Header	Meaning
`X-RateLimit-Limit`	Total limit for the current window
`X-RateLimit-Remaining`	Remaining requests in the current window
`X-RateLimit-Reset`	Limit reset time (Unix timestamp)
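The headers above can be read as follows. This sketch assumes all three headers are present and carry integer values, and that the reset value is in seconds; `parse_rate_limit` and its `now` parameter (used to keep the example deterministic) are hypothetical names.

```python
import time
from typing import Mapping, Optional


def parse_rate_limit(headers: Mapping[str, str], now: Optional[float] = None) -> dict:
    """Extract the X-RateLimit-* values and compute the time left in the window."""
    # Header names are case-insensitive, so normalize before lookup.
    lowered = {name.lower(): value for name, value in headers.items()}
    limit = int(lowered["x-ratelimit-limit"])
    remaining = int(lowered["x-ratelimit-remaining"])
    reset_at = int(lowered["x-ratelimit-reset"])  # Unix timestamp
    current = time.time() if now is None else now
    return {
        "limit": limit,
        "remaining": remaining,
        "seconds_until_reset": max(0.0, reset_at - current),
        "should_pause": remaining == 0,
    }


sample_headers = {
    "X-RateLimit-Limit": "60",
    "X-RateLimit-Remaining": "0",
    "X-RateLimit-Reset": "1700000030",
}
print(parse_rate_limit(sample_headers, now=1700000000.0))
```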

Strategies

Read Response Headers: Check X-RateLimit-Remaining after each request so you can slow down before the limit is reached

Client-Side Throttling: Implement local rate limiting on the client; don't wait for a 429 to slow down

Exponential Backoff: After receiving a 429, wait 1s, then 2s, 4s, and 8s between successive retries

Multi-Key Rotation: Configure multiple Keys and rotate them to increase total throughput

Reduce Wasted Tokens: Streamline prompts and avoid sending duplicate context
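Two of these strategies, exponential backoff and multi-Key rotation, can be combined in one retry loop. The sketch below is one possible arrangement, not a prescribed implementation: `send` is a hypothetical callable that takes an API Key and returns a (status, body) pair, and the `sleep` parameter exists only so the wait can be replaced in tests.

```python
import itertools
import time
from typing import Callable, Sequence, Tuple


def call_with_backoff(
    send: Callable[[str], Tuple[int, str]],
    keys: Sequence[str],
    delays: Sequence[float] = (1, 2, 4, 8),
    sleep: Callable[[float], None] = time.sleep,
) -> Tuple[int, str]:
    """Retry on 429, rotating Keys and waiting 1s, 2s, 4s, 8s between attempts."""
    key_cycle = itertools.cycle(keys)
    # One initial attempt plus one retry per delay.
    for delay in (*delays, None):
        status, body = send(next(key_cycle))
        if status != 429:
            return status, body
        if delay is not None:
            sleep(delay)
    raise RuntimeError("still rate limited after all retries")


# Demo with a fake sender that is rate limited twice, then succeeds.
responses = iter([(429, ""), (429, ""), (200, "ok")])
used_keys = []
waits = []


def fake_send(key: str) -> Tuple[int, str]:
    used_keys.append(key)
    return next(responses)


result = call_with_backoff(fake_send, ["key-a", "key-b"], sleep=waits.append)
print(result, used_keys, waits)
```

Rotating on every attempt spreads load across Keys; a variant that only switches Keys after a 429 would work equally well with the same loop.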

Batch Calls

Dimensionality reduction is a feature of OpenAI's text-embedding-3 series. Other embedding models use their fixed dimensions.

Reducing dimensions loses precision. We recommend starting from a high dimension and gradually lowering it based on the results.
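To compare precision at several sizes without re-requesting embeddings, a full-size vector can be shortened locally by truncating it and re-normalizing to unit length. This is a sketch of that technique, assuming the vector came from a text-embedding-3 model; `shorten_embedding` is a hypothetical helper, and whether this platform forwards a request-side dimensions parameter is not stated in this document.

```python
import math
from typing import List, Sequence


def shorten_embedding(vector: Sequence[float], dims: int) -> List[float]:
    """Keep the first `dims` components and rescale the result to unit length."""
    if not 0 < dims <= len(vector):
        raise ValueError("dims must be between 1 and the vector length")
    head = list(vector[:dims])
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0:
        return head  # nothing to normalize
    return [x / norm for x in head]


full = [3.0, 4.0, 12.0]  # toy stand-in for a 3072-dimension embedding
print(shorten_embedding(full, 2))
```

Running retrieval quality checks at, for example, 3072, 1024, and 256 dimensions shows where the precision loss becomes unacceptable for a given workload.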

Parameter	Recommendation
`model`	Default to `cohere-rerank-v3`, currently the most general-purpose option
`top_n`	Usually 3-5, no need to return too many
`return_documents`	Set to `true` to avoid looking up the original text by index
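The recommendations above can be applied when building a rerank request body. In this sketch, model, top_n, and return_documents come from the table; the query and documents fields are assumptions based on the Cohere-style rerank request shape, and `build_rerank_request` is a hypothetical helper.

```python
import json
from typing import Sequence


def build_rerank_request(
    query: str,
    documents: Sequence[str],
    top_n: int = 3,
    model: str = "cohere-rerank-v3",
    return_documents: bool = True,
) -> dict:
    """Assemble a rerank request body using the recommended parameter values."""
    if not documents:
        raise ValueError("documents must not be empty")
    return {
        "model": model,
        "query": query,            # assumed field name
        "documents": list(documents),  # assumed field name
        # Never ask for more results than there are candidates.
        "top_n": min(top_n, len(documents)),
        "return_documents": return_documents,
    }


payload = build_rerank_request(
    "How is quota billed?",
    ["Billing is Token-based.", "Rate limits use RPM and TPM."],
)
print(json.dumps(payload, indent=2))
```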
