Running Inference

How to use Inferia with the OpenAI SDK

Inferia provides an OpenAI-compatible API, allowing you to use existing tools and libraries with minimal configuration.

Prerequisites

  1. Inferia Stack Running: Ensure the Inference Gateway is up (default port: 8001).
  2. API Key: Generate an API Key from the Inferia Dashboard.
  3. Deployment Name: You need the name of the model deployment you created (e.g., llama-3-8b).

[!NOTE] Automatic RAG & Templating
Prompt templates and RAG configurations are linked to the Deployment in the dashboard. You do not need to pass extra parameters in your API call; the backend automatically applies the correct template and context based on the model name.
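The examples below hardcode the API key for brevity. In practice you may prefer to load it from the environment so it stays out of source control. A minimal sketch (the INFERIA_API_KEY and INFERIA_BASE_URL names are illustrative, not something Inferia requires):

import os

# Illustrative variable names; use whatever matches your own shell exports.
INFERIA_API_KEY = os.environ["INFERIA_API_KEY"]
INFERIA_BASE_URL = os.environ.get("INFERIA_BASE_URL", "http://localhost:8001/v1")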

Using the OpenAI Python SDK

Install the official OpenAI library:

pip install openai

Configure the client to point to your Inferia instance:

from openai import OpenAI

# 1. Point to Inferia Gateway (Port 8001)
# 2. Use your generated API Key
client = OpenAI(
    base_url="http://localhost:8001/v1",
    api_key="sk-inferia-..." 
)

response = client.chat.completions.create(
    model="llama-3-8b",  # Use your specific Deployment Name here
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain quantum computing in one sentence."}
    ],
    stream=True
)

for chunk in response:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="")
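If you do not need token-by-token output, you can drop stream=True and read the full reply in one call. This is standard OpenAI SDK behavior and should work unchanged against the Inferia gateway:

# Non-streaming: the full message arrives in a single response object.
response = client.chat.completions.create(
    model="llama-3-8b",  # your Deployment Name
    messages=[
        {"role": "user", "content": "Explain quantum computing in one sentence."}
    ]
)

print(response.choices[0].message.content)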

Using cURL

You can also test endpoints directly from the terminal:

curl -X POST http://localhost:8001/v1/chat/completions \
  -H "Authorization: Bearer sk-inferia-..." \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama-3-8b",
    "messages": [
      {"role": "user", "content": "Hello!"}
    ]
  }'
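The same request can be sent from any HTTP client. Below is a minimal sketch using Python's requests library, assuming the gateway returns the standard OpenAI chat-completions schema with the reply under choices[0].message.content:

import requests

# Same request as the cURL example, using plain HTTP instead of the OpenAI SDK.
resp = requests.post(
    "http://localhost:8001/v1/chat/completions",
    headers={
        "Authorization": "Bearer sk-inferia-...",
        "Content-Type": "application/json",
    },
    json={
        "model": "llama-3-8b",
        "messages": [{"role": "user", "content": "Hello!"}],
    },
    timeout=60,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])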
