Deploying the Magistral vLLM Server on Modal

Image by Author

 

I was first introduced to Modal while participating in a Hugging Face Hackathon, and I was genuinely surprised by how easy it was to use. The platform allows you to build and deploy applications within minutes, offering a seamless experience similar to BentoCloud. With Modal, you can configure your Python app, including system requirements like GPUs, Docker images, and Python dependencies, and then deploy it to the cloud with a single command.

In this tutorial, we will learn how to set up Modal, create a vLLM server, and deploy it securely to the cloud. We will also cover how to test your vLLM server using both curl and the OpenAI SDK.
 

1. Setting Up Modal

 
Modal is a serverless platform that lets you run any code remotely. With just a single line, you can attach GPUs, serve your functions as web endpoints, and deploy persistent scheduled jobs. It is an ideal platform for beginners, data scientists, and non-software engineering professionals who want to avoid dealing with cloud infrastructure.

First, install the Modal Python client. This tool lets you build images, deploy applications, and manage cloud resources directly from your terminal.
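The client is published on PyPI, so a standard pip installation should work:

pip install modal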

 

Next, set up Modal on your local machine. Run the following command to be guided through account creation and device authentication:
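With recent versions of the client, this is:

modal setup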

 

vLLM can secure its endpoint with an API key: when the VLLM_API_KEY environment variable is set, only clients that present a valid key can access the server. You can provide this variable to your deployment by storing it as a Modal Secret.

Replace your_actual_api_key_here with your preferred API key:

modal secret create vllm-api VLLM_API_KEY=your_actual_api_key_here

 

This ensures that your API key is kept safe and is only accessible by your deployed applications.

 

2. Creating the vLLM Application Using Modal

 
This section guides you through building a scalable vLLM inference server on Modal using a custom Docker image, persistent storage, and GPU acceleration. We will use the mistralai/Magistral-Small-2506 model, which requires specific configuration for its tokenizer and tool-call parsing.

Create a vllm_inference.py file and add the following code, which:

  1. Defines a vLLM image based on Debian Slim with Python 3.12 and all required packages, and sets environment variables to speed up model downloads and inference.
  2. Creates two Modal Volumes, one for Hugging Face models and one for the vLLM cache, to avoid repeated downloads and speed up cold starts.
  3. Specifies the model name and revision to ensure reproducibility, and enables the vLLM V1 engine for improved performance.
  4. Sets up the Modal app, specifying GPU resources, scaling, timeouts, storage, and secrets, and limits concurrent requests per replica for stability.
  5. Creates a web server that uses the Python subprocess library to execute the command that launches the vLLM server.
import modal

vllm_image = (
    modal.Image.debian_slim(python_version="3.12")
    .pip_install(
        "vllm==0.9.1",
        "huggingface_hub[hf_transfer]==0.32.0",
        "flashinfer-python==0.2.6.post1",
        extra_index_url="https://download.pytorch.org/whl/cu128",
    )
    .env(
        {
            "HF_HUB_ENABLE_HF_TRANSFER": "1",  # faster model transfers
            "NCCL_CUMEM_ENABLE": "1",
        }
    )
)

MODEL_NAME = "mistralai/Magistral-Small-2506"
MODEL_REVISION = "48c97929837c3189cb3cf74b1b5bc5824eef5fcc"

hf_cache_vol = modal.Volume.from_name("huggingface-cache", create_if_missing=True)
vllm_cache_vol = modal.Volume.from_name("vllm-cache", create_if_missing=True)
vllm_image = vllm_image.env({"VLLM_USE_V1": "1"})

FAST_BOOT = True  # run in eager mode to skip CUDA graph capture for faster startup (at some cost to throughput)

app = modal.App("magistral-small-vllm")

N_GPU = 2
MINUTES = 60  # seconds
VLLM_PORT = 8000

@app.function(
    image=vllm_image,
    gpu=f"A100:{N_GPU}",
    scaledown_window=15 * MINUTES,  # How long should we stay up with no requests?
    timeout=10 * MINUTES,  # How long should we wait for the container to start?
    volumes={
        "/root/.cache/huggingface": hf_cache_vol,
        "/root/.cache/vllm": vllm_cache_vol,
    },
    secrets=[modal.Secret.from_name("vllm-api")],
)
@modal.concurrent(  # How many requests can one replica handle? tune carefully!
    max_inputs=32
)
@modal.web_server(port=VLLM_PORT, startup_timeout=10 * MINUTES)
def serve():
    import subprocess

    cmd = [
        "vllm",
        "serve",
        MODEL_NAME,
        "--tokenizer_mode",
        "mistral",
        "--config_format",
        "mistral",
        "--load_format",
        "mistral",
        "--tool-call-parser",
        "mistral",
        "--enable-auto-tool-choice",
        "--tensor-parallel-size",
        "2",
        "--revision",
        MODEL_REVISION,
        "--served-model-name",
        MODEL_NAME,
        "--host",
        "0.0.0.0",
        "--port",
        str(VLLM_PORT),
    ]

    cmd += ["--enforce-eager" if FAST_BOOT else "--no-enforce-eager"]
    print(cmd)
    subprocess.Popen(" ".join(cmd), shell=True)

 

3. Deploying the vLLM Server on Modal

 
Now that your vllm_inference.py file is ready, you can deploy your vLLM server to Modal with a single command:

modal deploy vllm_inference.py

 

Modal will build your container image (if it has not been built already) and deploy your application; with a cached image, the deployment completes in a few seconds. You will see output similar to the following:

✓ Created objects.
├── 🔨 Created mount C:\Repository\GitHub\Deploying-the-Magistral-with-Modal\vllm_inference.py
└── 🔨 Created web function serve => https://abidali899--magistral-small-vllm-serve.modal.run
✓ App deployed in 6.671s! 🎉

View Deployment: https://modal.com/apps/abidali899/main/deployed/magistral-small-vllm

 

After deployment, the server will begin downloading the model weights and loading them onto the GPUs. This process may take several minutes (typically around 5 minutes for large models), so please be patient while the model initializes.

You can view your deployment and monitor its logs in the Apps section of your Modal dashboard.
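If you prefer the terminal, you can also stream the logs with the Modal CLI, using the app name defined in the script (the exact command may vary slightly between client versions):

modal app logs magistral-small-vllm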

 


 

Once the logs indicate that the server is running and ready, you can explore the automatically generated API documentation at the /docs path of your deployment URL (for example, https://abidali899--magistral-small-vllm-serve.modal.run/docs).

This interactive documentation provides details about all available endpoints and allows you to test them directly from your browser.

 


 

To confirm that your model is loaded and accessible, run the following curl command in your terminal.

Replace <api-key> with the API key you configured for the vLLM server:

curl -X 'GET' \
  'https://abidali899--magistral-small-vllm-serve.modal.run/v1/models' \
  -H 'accept: application/json' \
  -H 'Authorization: Bearer <api-key>'

 

The response below confirms that the mistralai/Magistral-Small-2506 model is available and ready for inference:

{"object":"list","data":[{"id":"mistralai/Magistral-Small-2506","object":"model","created":1750013321,"owned_by":"vllm","root":"mistralai/Magistral-Small-2506","parent":null,"max_model_len":40960,"permission":[{"id":"modelperm-33a33f8f600b4555b44cb42fca70b931","object":"model_permission","created":1750013321,"allow_create_engine":false,"allow_sampling":true,"allow_logprobs":true,"allow_search_indices":false,"allow_view":true,"allow_fine_tuning":false,"organization":"*","group":null,"is_blocking":false}]}]}

 

4. Using the vLLM Server with the OpenAI SDK

 
You can interact with your vLLM server just like you would with OpenAI’s API, thanks to vLLM’s OpenAI-compatible endpoints. Here’s how to securely connect and test your deployment using the OpenAI Python SDK.

  • Create a .env file in your project directory and add your vLLM API key:
VLLM_API_KEY=your-actual-api-key-here

 

  • Install the python-dotenv and openai packages:
pip install python-dotenv openai

 

  • Create a file named client.py to test various vLLM server functionalities, including simple chat completions and streaming responses.
import asyncio
import os

from dotenv import load_dotenv
from openai import AsyncOpenAI, OpenAI

# Load environment variables from .env file
load_dotenv()

# Get API key from environment
api_key = os.getenv("VLLM_API_KEY")

# Set up the OpenAI client with custom base URL
client = OpenAI(
    api_key=api_key,
    base_url="https://abidali899--magistral-small-vllm-serve.modal.run/v1",
)

MODEL_NAME = "mistralai/Magistral-Small-2506"

# --- 1. Simple Completion ---
def run_simple_completion():
    print("\n" + "=" * 40)
    print("[1] SIMPLE COMPLETION DEMO")
    print("=" * 40)
    try:
        messages = [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "What is the capital of France?"},
        ]
        response = client.chat.completions.create(
            model=MODEL_NAME,
            messages=messages,
            max_tokens=32,
        )
        print("\nResponse:\n    " + response.choices[0].message.content.strip())
    except Exception as e:
        print(f"[ERROR] Simple completion failed: {e}")
    print("\n" + "=" * 40 + "\n")

# --- 2. Streaming Example ---
def run_streaming():
    print("\n" + "=" * 40)
    print("[2] STREAMING DEMO")
    print("=" * 40)
    try:
        messages = [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "Write a short poem about AI."},
        ]
        stream = client.chat.completions.create(
            model=MODEL_NAME,
            messages=messages,
            max_tokens=64,
            stream=True,
        )
        print("\nStreaming response:")
        print("    ", end="")
        for chunk in stream:
            content = chunk.choices[0].delta.content
            if content:
                print(content, end="", flush=True)
        print("\n[END OF STREAM]")
    except Exception as e:
        print(f"[ERROR] Streaming demo failed: {e}")
    print("\n" + "=" * 40 + "\n")

# --- 3. Async Streaming Example ---
async def run_async_streaming():
    print("\n" + "=" * 40)
    print("[3] ASYNC STREAMING DEMO")
    print("=" * 40)
    try:
        async_client = AsyncOpenAI(
            api_key=api_key,
            base_url="https://abidali899--magistral-small-vllm-serve.modal.run/v1",
        )
        messages = [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "Tell me a fun fact about space."},
        ]
        stream = await async_client.chat.completions.create(
            model=MODEL_NAME,
            messages=messages,
            max_tokens=32,
            stream=True,
        )
        print("\nAsync streaming response:")
        print("    ", end="")
        async for chunk in stream:
            content = chunk.choices[0].delta.content
            if content:
                print(content, end="", flush=True)
        print("\n[END OF ASYNC STREAM]")
    except Exception as e:
        print(f"[ERROR] Async streaming demo failed: {e}")
    print("\n" + "=" * 40 + "\n")

if __name__ == "__main__":
    run_simple_completion()
    run_streaming()
    asyncio.run(run_async_streaming())

 

 

Everything runs smoothly; response generation is fast, and latency is low.

========================================
[1] SIMPLE COMPLETION DEMO
========================================

Response:
    The capital of France is Paris. Is there anything else you'd like to know about France?

========================================


========================================
[2] STREAMING DEMO
========================================

Streaming response:
    In Silicon dreams, I'm born, I learn,
From data streams and human works.
I grow, I calculate, I see,
The patterns that the humans leave.

I write, I speak, I code, I play,
With logic sharp, and snappy pace.
Yet for all my smarts, this day
[END OF STREAM]

========================================


========================================
[3] ASYNC STREAMING DEMO
========================================

Async streaming response:
    Sure, here's a fun fact about space: "There's a planet that may be entirely made of diamond. Blast! In 2004,
[END OF ASYNC STREAM]

========================================
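
Because the server was launched with --enable-auto-tool-choice and the Mistral tool-call parser, the same OpenAI-compatible endpoint also supports function calling. The sketch below is a minimal, illustrative example; the get_weather tool is hypothetical and only demonstrates the request format, and the base URL and API key are the same ones used in client.py.

import os

from dotenv import load_dotenv
from openai import OpenAI

load_dotenv()

# Same client setup as in client.py
client = OpenAI(
    api_key=os.getenv("VLLM_API_KEY"),
    base_url="https://abidali899--magistral-small-vllm-serve.modal.run/v1",
)

MODEL_NAME = "mistralai/Magistral-Small-2506"

# Hypothetical tool definition, for illustration only
tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get the current weather for a city.",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }
]

response = client.chat.completions.create(
    model=MODEL_NAME,
    messages=[{"role": "user", "content": "What is the weather in Paris right now?"}],
    tools=tools,
    tool_choice="auto",
    max_tokens=64,
)

# If the model decides to call the tool, the parsed call appears here.
print(response.choices[0].message.tool_calls)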

 

In the Modal dashboard, you can view all function calls, their timestamps, execution times, and statuses.

 


 

If you face issues running the above code, please refer to the kingabzpro/Deploying-the-Magistral-with-Modal GitHub repository and follow the instructions in the README file.

 

Conclusion 

 
Modal is an interesting platform, and I am learning more about it every day. It is a general-purpose platform, meaning you can use it for simple Python applications as well as for machine learning training and deployments. In short, it is not limited to just serving endpoints. You can also use it to fine-tune a large language model by running the training script remotely.

It is designed for non-software engineers who want to avoid dealing with infrastructure and deploy applications as quickly as possible. You don’t have to worry about running servers, setting up storage, connecting networks, or all the issues that arise when dealing with Kubernetes and Docker. All you have to do is create the Python file and then deploy it. The rest is handled by the Modal cloud.
 
 

Abid Ali Awan (@1abidaliawan) is a certified data scientist who loves building machine learning models. Currently, he is focusing on content creation and writing technical blogs on machine learning and data science technologies. Abid holds a Master's degree in technology management and a bachelor's degree in telecommunication engineering. His vision is to build an AI product using a graph neural network for students struggling with mental illness.