
Quick to use

Querying Available Models#

Call GET /v1/models to list the models available to your Key. Each data[].id in the response is a valid value for the model parameter.
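As an illustration, the model IDs could be read from the list response roughly as follows. This is a minimal sketch: the base URL `https://api.example.com` is a placeholder, and the Bearer-token header is an assumption about how authentication works on your deployment. Only the `GET /v1/models` path and the `data[].id` field come from this page.

```python
import json
import urllib.request

BASE_URL = "https://api.example.com"  # assumption: replace with your deployment's request URL


def extract_model_ids(payload: dict) -> list:
    """Return every data[].id from a GET /v1/models response body."""
    return [item["id"] for item in payload.get("data", []) if "id" in item]


def list_models(api_key: str, base_url: str = BASE_URL) -> list:
    """Fetch the model list and return the IDs usable as the `model` parameter."""
    request = urllib.request.Request(
        f"{base_url}/v1/models",
        headers={"Authorization": f"Bearer {api_key}"},  # assumption: Bearer-token auth
    )
    with urllib.request.urlopen(request) as response:
        return extract_model_ids(json.load(response))
```

`extract_model_ids` is kept separate from the network call so the parsing step can be checked without a live Key.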

Chat Models#

| Model | Context | Features | Use Cases |
| --- | --- | --- | --- |
| gpt-4o | 128K | Multimodal, well-rounded | Complex tasks, image + text understanding |
| gpt-4o-mini | 128K | Fast, low cost | High-concurrency scenarios, simple conversations |
| gpt-4.1 | 1M | Ultra-long context | Long document processing, codebase analysis |
| gpt-4.1-mini | 1M | Long context + low cost | Long document summarization |
| gpt-4.1-nano | 1M | Fastest and cheapest | Lightweight tasks like classification and extraction |
| o3 | 200K | Enhanced reasoning | Complex reasoning, scientific questions |
| o4-mini | 200K | Reasoning + low cost | Daily reasoning tasks |
| claude-sonnet-4-20250514 | 200K | Strong at coding and reasoning | Code generation, analysis |
| claude-opus-4-20250514 | 200K | Top-tier reasoning, deep analysis | Complex professional tasks, long-form writing |
| gemini-2.5-pro | 1M | Ultra-long context + multimodal | Long documents, multimodal analysis |

Embedding Models#

| Model | Dimensions | Description |
| --- | --- | --- |
| text-embedding-3-large | 3072 (reducible) | High precision, recommended for production |
| text-embedding-3-small | 1536 (reducible) | Fast, low cost |
| text-embedding-ada-002 | 1536 | Legacy compatibility |

Image Models#

| Model | Max Size | Special Features |
| --- | --- | --- |
| gpt-image-1 | 1536x1024 | Transparent background, moderation level control |
| dall-e-3 | 1792x1024 | High resolution, style options |
| dall-e-2 | 1024x1024 | Basic generation, multiple outputs |

Audio Models#

| Model | Use | Description |
| --- | --- | --- |
| tts-1 | Text-to-speech | Standard quality |
| tts-1-hd | Text-to-speech | High-definition quality |
| gpt-4o-mini-tts | Text-to-speech | Supports style instructions |
| whisper-1 | Speech-to-text / Translation | Multilingual support |

Video Models#

| Model | Description |
| --- | --- |
| veo-2 | Google video generation |
| sora | OpenAI text/image to video |

Rerank Models#

| Model | Description |
| --- | --- |
| cohere-rerank-v3 | Cohere reranking, recommended for RAG scenarios |

Moderation Models#

| Model | Description |
| --- | --- |
| omni-moderation-latest | Multimodal moderation, supports text + image |

Note: The models actually available are those returned by GET /v1/models. The platform will continue to add new models.

2. Quota and Billing#

Billing Method#

Nexhina bills by token usage, and prices differ by model.
• Input tokens (prompt_tokens): the content you send to the model
• Output tokens (completion_tokens): the content the model generates
Output tokens generally have a higher unit price than input tokens.

What is a Token#

A Token is the basic unit of text processed by a model. A rough conversion:
| Language | 1 Token ≈ |
| --- | --- |
| English | 4 characters / 0.75 words |
| Chinese | 1-2 characters |
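The rough conversion can be turned into a quick estimator for sizing prompts before sending them. This is only a sketch based on the ratios above. The function names are hypothetical, and real counts depend on each model's tokenizer, so treat the result as an approximation.

```python
import math


def estimate_english_tokens(text: str) -> int:
    """Rough estimate: about 4 English characters per token."""
    return math.ceil(len(text) / 4)


def estimate_chinese_token_range(text: str) -> tuple:
    """Rough (low, high) range: each token covers 1-2 Chinese characters."""
    count = len(text)
    return math.ceil(count / 2), count
```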

Model Multipliers#

Different models have different prices, calculated via multipliers. Using gpt-4o-mini as the baseline (1x):
| Model | Input Multiplier | Output Multiplier | Description |
| --- | --- | --- | --- |
| gpt-4o-mini | 1x | 1x | Baseline |
| gpt-4o | 5x | 15x | Strong capability, high price |
| gpt-4.1 | 10x | 30x | Long context |
| claude-sonnet-4 | 6x | 30x | Strong at coding |
| claude-opus-4 | 15x | 75x | Top-tier reasoning |

Multipliers are for reference only; actual values are subject to the backend configuration. Administrators can adjust them in Operations Settings → Model Pricing.
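To compare models, the multipliers can be applied to a request's token counts. The sketch below assumes that a multiplier scales token usage linearly relative to gpt-4o-mini, which this page implies but does not spell out. The function name is hypothetical, and the values are the reference multipliers listed here, not the live backend configuration.

```python
# Reference multipliers (input, output) from this page; the backend may differ.
MULTIPLIERS = {
    "gpt-4o-mini": (1, 1),
    "gpt-4o": (5, 15),
    "gpt-4.1": (10, 30),
    "claude-sonnet-4": (6, 30),
    "claude-opus-4": (15, 75),
}


def baseline_equivalent_tokens(model: str, prompt_tokens: int, completion_tokens: int) -> tuple:
    """Express usage as (input, output) token counts priced at the gpt-4o-mini baseline."""
    input_multiplier, output_multiplier = MULTIPLIERS[model]
    return prompt_tokens * input_multiplier, completion_tokens * output_multiplier
```

Input and output are returned separately because output tokens usually have a different unit price than input tokens, so the two figures should not be summed directly.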

Querying Quota#

• In the admin console, open Token Management to see the used and remaining quota for a Key.
• Alternatively, read the usage field in each API response to get that request's consumption in real time.
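Tracking consumption on the client side could look like the sketch below. It assumes the usage object carries the prompt_tokens and completion_tokens fields named in the billing section. The `UsageTotals` class is a hypothetical helper, not part of the API.

```python
from dataclasses import dataclass


@dataclass
class UsageTotals:
    """Running totals of the usage field across several API responses."""

    prompt_tokens: int = 0
    completion_tokens: int = 0

    def add(self, response_body: dict) -> None:
        # A response without a usage field contributes nothing.
        usage = response_body.get("usage") or {}
        self.prompt_tokens += usage.get("prompt_tokens", 0)
        self.completion_tokens += usage.get("completion_tokens", 0)
```

Calling `totals.add(body)` after each request gives a local running count that can be compared with the figures shown in Token Management.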

Quota Exhausted#

After a Key's quota is used up, requests will return:
{
  "error": {
    "message": "Insufficient quota",
    "type": "insufficient_quota",
    "code": "insufficient_quota"
  }
}
The HTTP status code is 402. At this point, you need to top up or switch to a Key with quota.
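A client can detect this condition and fall back to another Key. The sketch below is one possible approach: it checks the 402 status and the insufficient_quota code from the response shown here, and the helper names are hypothetical.

```python
from typing import Optional


def is_quota_exhausted(status_code: int, body: dict) -> bool:
    """True when the response matches the quota-exhausted error format."""
    error = body.get("error") or {}
    return status_code == 402 and error.get("code") == "insufficient_quota"


def next_usable_key(keys: list, exhausted: set) -> Optional[str]:
    """Return the first Key not yet marked as exhausted, or None if all are used up."""
    for key in keys:
        if key not in exhausted:
            return key
    return None
```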

Billing for Different Interfaces#

| Interface | Billing Basis |
| --- | --- |
| Chat / Responses | Input + Output Tokens |
| Embeddings | Input Tokens |
| Images | Per image and per model, not Token-based |
| Audio TTS | Per input character count |
| Audio STT / Translation | Per audio duration |
| Video | Per request |
| Moderation | Input Tokens (usually a small amount) |
| Rerank | Input Tokens |

3. Rate Limits#

Limit Dimensions#

| Dimension | Meaning |
| --- | --- |
| RPM | Requests Per Minute |
| TPM | Tokens Per Minute |

Limit Rules#

• Limits apply per API Key; different Keys are counted independently.
• Administrators can set different limits for different token groups in the backend.
• Default limits vary by deployment configuration; contact the administrator for specific values.

Rate Limit Exceeded Response#

{
  "error": {
    "message": "Rate limit reached for default",
    "type": "rate_limit_error",
    "code": "rate_limit_exceeded"
  }
}
The HTTP status code is 429.

Response Headers#

Rate limit information is returned via HTTP response headers:
| Header | Meaning |
| --- | --- |
| X-RateLimit-Limit | Total limit for the current window |
| X-RateLimit-Remaining | Remaining requests in the current window |
| X-RateLimit-Reset | Limit reset time (Unix timestamp) |
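These headers could be parsed as in the sketch below. It assumes each value is a plain integer string and that the reset timestamp is in seconds. Both are assumptions worth checking against a real response. The function names are hypothetical.

```python
import time
from typing import Mapping, Optional


def parse_rate_limit(headers: Mapping[str, str]) -> dict:
    """Read the three rate-limit headers, ignoring header-name case."""
    lowered = {name.lower(): value for name, value in headers.items()}
    return {
        "limit": int(lowered["x-ratelimit-limit"]),
        "remaining": int(lowered["x-ratelimit-remaining"]),
        "reset": int(lowered["x-ratelimit-reset"]),  # assumption: Unix time in seconds
    }


def seconds_until_reset(info: dict, now: Optional[float] = None) -> float:
    """Seconds left until the window resets, never negative."""
    current = time.time() if now is None else now
    return max(0.0, info["reset"] - current)
```

When `remaining` reaches 0, pausing for `seconds_until_reset(info)` avoids sending requests that would be rejected.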

Strategies#

1. Read Response Headers: Check X-RateLimit-Remaining after each request so you can anticipate the limit before hitting it.
2. Client-Side Throttling: Rate-limit locally on the client instead of waiting for a 429 before slowing down.
3. Exponential Backoff: After receiving a 429, wait 1s → 2s → 4s → 8s before retrying.
4. Multi-Key Rotation: Configure multiple Keys and rotate them to increase total throughput.
5. Reduce Wasted Tokens: Streamline prompts and avoid duplicate context.
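Exponential backoff and multi-key rotation can be combined in a small retry helper. The sketch below is one possible shape, not a prescribed implementation. `send` is a hypothetical callable you supply that takes a Key and returns `(status_code, body)`, and `keys` must be a non-empty list.

```python
import itertools
import time
from typing import Callable, List, Tuple


def backoff_delays(base: float = 1.0, retries: int = 4) -> List[float]:
    """Delays of 1s, 2s, 4s, 8s for the default arguments."""
    return [base * 2 ** attempt for attempt in range(retries)]


def call_with_backoff(
    send: Callable[[str], Tuple[int, dict]],
    keys: List[str],
    sleep: Callable[[float], None] = time.sleep,
    retries: int = 4,
) -> Tuple[int, dict]:
    """Call send(key); on 429, wait with exponential backoff and retry with the next Key."""
    key_cycle = itertools.cycle(keys)
    status, body = send(next(key_cycle))
    for delay in backoff_delays(retries=retries):
        if status != 429:
            break
        sleep(delay)
        status, body = send(next(key_cycle))
    return status, body
```

Passing `sleep` as a parameter keeps the helper testable without real waiting. If every retry still returns 429, the last response is handed back to the caller.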

Batch Calls#

Dimensionality reduction is a feature of OpenAI's text-embedding-3 series. Other embedding models use their fixed dimensions.
Reducing dimensions lowers precision. We recommend starting with a high dimension and lowering it gradually based on the results.
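A request body with a reduced dimension might be built as in the sketch below. The `input` and `dimensions` field names follow OpenAI's embeddings API and are assumed to carry over to this platform; this page does not confirm them. The function name is hypothetical.

```python
from typing import List, Optional


def build_embedding_request(
    texts: List[str],
    model: str = "text-embedding-3-large",
    dimensions: Optional[int] = None,
) -> dict:
    """Build an embeddings request body, adding `dimensions` only when requested."""
    payload = {"model": model, "input": texts}
    if dimensions is not None:
        # Only the text-embedding-3 series supports reduced dimensions.
        if not model.startswith("text-embedding-3"):
            raise ValueError(f"{model} uses a fixed dimension")
        payload["dimensions"] = dimensions
    return payload
```

Starting with the default dimension and then trying smaller values on the same texts makes it easier to see where retrieval quality begins to drop.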

| Parameter | Recommendation |
| --- | --- |
| model | Default to cohere-rerank-v3, currently the most universal |
| top_n | Usually 3-5, no need to return too many |
| return_documents | Set to true to avoid looking up the original text by index |
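The recommendations above could translate into a request body like the sketch below. `model`, `top_n`, and `return_documents` come from the table. The `query` and `documents` field names are assumptions based on Cohere-style rerank APIs, and the function name is hypothetical.

```python
from typing import List


def build_rerank_request(
    query: str,
    documents: List[str],
    top_n: int = 3,
    model: str = "cohere-rerank-v3",
    return_documents: bool = True,
) -> dict:
    """Build a rerank request body using the recommended defaults."""
    return {
        "model": model,
        "query": query,          # assumption: field name not confirmed on this page
        "documents": documents,  # assumption: field name not confirmed on this page
        "top_n": min(top_n, len(documents)),  # never ask for more results than documents
        "return_documents": return_documents,
    }
```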
Modified at 2026-06-30 09:30:09