First Deployment
Step-by-step guide to deploying your first LLM
This guide will walk you through provisioning compute, deploying a model, and running your first inference request using InferiaLLM.
Prerequisites
- You have InferiaLLM services running (inferia api-start).
- You have access to the Admin Dashboard (default: http://localhost:3001).
- You have the necessary provider credentials configured (e.g., a Nosana wallet).
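If you're not sure whether the services are up, a quick reachability check against the default ports is usually enough. This is a minimal sketch assuming the defaults used in this guide (Dashboard on port 3001, Inference Gateway on port 8001 as in Step 4); adjust if you changed them.

```bash
# Check that the Admin Dashboard (3001) and Inference Gateway (8001) respond.
for url in http://localhost:3001 http://localhost:8001; do
  code=$(curl -s -o /dev/null -w "%{http_code}" "$url")
  echo "$url -> HTTP $code"
done
```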
Step 1: Provision Compute (Pools)
Before deploying a model, you need a compute resource. In InferiaLLM, these are managed in Pools.
- Navigate to the Pools section in the Dashboard sidebar.
- Click Create New Pool.
- Select Provider: Choose a provider (e.g., Nosana).
- Configuration:
- Select the desired GPU type (e.g., NVIDIA A10G, A100).
- Set the quantity/size of the pool.
- Click Provision.
- The system will request resources from the provider. Wait for the status to change to Active.
Step 2: Create a Deployment
Once you have active compute, you can deploy a model onto it.
- Navigate to the Deployments section.
- Click New Deployment.
- Select Job Type: Choose Inference.
- Select Engine: Choose an optimization engine (e.g., vLLM for high-throughput serving).
- Configure Model:
- Deployment Name: Enter a unique name (e.g., my-first-llama). Important: This name will be used as the model parameter in your API calls.
- Source: Specify the model weights (e.g., the HuggingFace ID meta-llama/Llama-2-7b-chat-hf).
- Select Pool: Assign the deployment to the pool you created in Step 1.
- Click Deploy.
- The system will pull the model and start the inference server. Wait for the status to be RUNNING.
Step 3: Generate an API Key
To access your deployment securely, you need an API Key.
- Navigate to API Keys in the settings or sidebar.
- Click Create New Key.
- Give it a name (e.g., "Development Key").
- Copy the generated key (e.g., sk-inf-...). You won't be able to see it again.
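For the request in Step 4, it's convenient to keep the key in an environment variable. The value below is a placeholder; paste the sk-inf-... key you just copied.

```bash
# Store the key for the current shell session (placeholder value shown).
export API_KEY="sk-inf-..."
```

As an optional sanity check, you can try listing the models served by the gateway. This assumes the Inference Gateway also mirrors the OpenAI /v1/models route; that route is not confirmed by this guide, so skip this if it returns a 404.

```bash
# Hypothetical: list models served by the gateway (assumes an OpenAI-style /v1/models route).
curl http://localhost:8001/v1/models \
  -H "Authorization: Bearer $API_KEY"
```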
Step 4: Run Inference
Now you can send requests to the Inference Gateway using your new deployment.
Endpoint: http://localhost:8001/v1/chat/completions
Example Request (cURL)
Replace $API_KEY with your key and use the Deployment Name you set in Step 2.
curl http://localhost:8001/v1/chat/completions \
-H "Content-Type: application/json" \
-H "Authorization: Bearer <YOUR_API_KEY>" \
-d '{
"model": "my-first-llama",
"messages": [
{
"role": "user",
"content": "Hello! Tell me a fun fact about space."
}
],
"temperature": 0.7
}'
Response
You should receive a JSON response with the model's generated text:
{
"id": "chatcmpl-...",
"object": "chat.completion",
"created": 1678900000,
"model": "my-first-llama",
"choices": [
{
"index": 0,
"message": {
"role": "assistant",
"content": "Did you know that a day on Venus is longer than a year on Venus? ..."
},
"finish_reason": "stop"
}
]
}
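If you only want the generated text, you can pipe the response through jq (requires jq installed); the field path below follows the response structure shown above.

```bash
# Same request as above, printing only the assistant's reply.
curl -s http://localhost:8001/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $API_KEY" \
  -d '{
    "model": "my-first-llama",
    "messages": [{"role": "user", "content": "Hello! Tell me a fun fact about space."}]
  }' | jq -r '.choices[0].message.content'
```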