Integrating DeepSeek R1 with FastAPI for real-time inference involves several steps, including setting up the environment, configuring Ollama to manage the DeepSeek R1 model, and creating a FastAPI application to handle requests and responses. Here's a detailed guide on how to achieve this integration:
Step 1: Setting Up the Environment
First, ensure you have Python installed on your system. You will need FastAPI and Uvicorn to build and run the application, plus the `requests` library to call the model server from your endpoints. Note that the `ollama` package on PyPI is only the Python client; the Ollama application itself, which downloads and serves the DeepSeek R1 model locally, is installed separately in Step 2.
```bash
pip install fastapi uvicorn requests ollama
```
Step 2: Configuring Ollama for DeepSeek R1
Ollama simplifies the process of downloading and running large language models like DeepSeek R1 locally. Follow these steps to configure Ollama:
1. Install Ollama: Ollama is a standalone application rather than a pip package; download the installer from https://ollama.com or use the official install script for your platform.
2. Download the DeepSeek R1 model: Pull the model weights with `ollama pull deepseek-r1`. You can request a specific distilled size with a tag such as `deepseek-r1:7b` (check the Ollama model library for the available tags).
3. Run the model: Start the Ollama server with `ollama serve` (desktop installs usually keep it running in the background automatically). The server listens on port 11434 and loads DeepSeek R1 on the first request. A typical command sequence is sketched below.
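The following is a minimal sketch of those commands, assuming a Linux or macOS machine; the install script URL and the untagged model name are the commonly documented defaults, so check the Ollama documentation for your platform and the exact model tag you want:

```bash
# Install the Ollama application (Linux/macOS install script; Windows users run the installer from ollama.com)
curl -fsSL https://ollama.com/install.sh | sh

# Download the DeepSeek R1 model weights (append a tag such as :7b for a specific distilled size)
ollama pull deepseek-r1

# Start the Ollama server; it listens on http://localhost:11434 by default
ollama serve
```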
Step 3: Creating a FastAPI Application
Now, let's create a FastAPI application to integrate with the DeepSeek R1 model served by Ollama.
Basic FastAPI Setup
Create a new Python file for your FastAPI app, e.g., `main.py`, and add the following code:
```python
from fastapi import FastAPI
from pydantic import BaseModel
import requests

app = FastAPI()

class InputData(BaseModel):
    text: str

# Ollama's OpenAI-compatible API is served on http://localhost:11434/v1/ by default;
# adjust the base URL if your Ollama instance runs elsewhere.
OLLAMA_URL = "http://localhost:11434/v1/completions"

def get_prediction(text: str):
    # Send the prompt to the DeepSeek R1 model served by Ollama.
    response = requests.post(
        OLLAMA_URL,
        json={"prompt": text, "model": "deepseek-r1"},
    )
    response.raise_for_status()
    return response.json()

@app.post("/predict/")
async def predict(data: InputData):
    prediction = get_prediction(data.text)
    return {"prediction": prediction}
```
Enhancing with Streaming Responses
For real-time inference, you might want to use streaming responses. This involves setting up an endpoint that forwards chunked responses from the model as they are generated. Here's how you can extend the code to achieve this:
```python
# The additions below go in the same main.py; `app`, `requests`, and OLLAMA_URL
# are already defined in the basic setup above.
from fastapi import Query
from fastapi.responses import StreamingResponse

def stream_text(text: str):
    # Assuming the Ollama completions endpoint supports streamed responses
    # when 'stream': True is set; adjust the URL and parameters as needed.
    response = requests.post(
        OLLAMA_URL,
        json={"prompt": text, "model": "deepseek-r1", "stream": True},
        stream=True,
    )
    # Forward raw chunks to the client as they arrive.
    for chunk in response.iter_content(chunk_size=1024):
        if chunk:
            yield chunk

@app.post("/stream-predict/")
async def stream_predict(text: str = Query(...)):
    return StreamingResponse(stream_text(text), media_type="text/plain")
```
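Since this endpoint takes the prompt as a query parameter, it can be tested as follows; the `-N` flag tells curl not to buffer the streamed output:

```bash
curl -N -X POST "http://localhost:8000/stream-predict/?text=Write%20a%20haiku%20about%20streaming"
```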
Asynchronous Processing
To further enhance performance, utilize FastAPI's asynchronous capabilities. This allows for non-blocking calls to your AI model, which is crucial for maintaining performance under load.
```python
import asyncio
from functools import partial

async def get_prediction_async(text: str):
    # requests is a blocking library, so the call is offloaded to the default
    # thread pool. run_in_executor does not accept keyword arguments for the
    # callable, which is why functools.partial is used here.
    loop = asyncio.get_running_loop()
    response = await loop.run_in_executor(
        None,
        partial(
            requests.post,
            OLLAMA_URL,
            json={"prompt": text, "model": "deepseek-r1"},
        ),
    )
    return response.json()

@app.post("/async-predict/")
async def async_predict(data: InputData):
    prediction = await get_prediction_async(data.text)
    return {"prediction": prediction}
```
Step 4: Running the FastAPI Application
To run your FastAPI application, use Uvicorn (the `--reload` flag enables auto-reloading during development and should be dropped in production):
```bash
uvicorn main:app --host 0.0.0.0 --port 8000 --reload
```
Your application is now ready to handle real-time inference requests using the DeepSeek R1 model managed by Ollama.
Conclusion
Integrating DeepSeek R1 with FastAPI offers a powerful solution for real-time AI inference. By leveraging Ollama to manage the model and FastAPI for creating a robust web service, you can build scalable and efficient AI-powered applications. This setup allows for privacy, low latency, and customization, making it ideal for various use cases such as customer support, data analysis, and automation[1][4][5].
Citations:
[1] https://vadim.blog/deepseek-r1-ollama-fastapi
[2] https://www.linkedin.com/pulse/exploring-deepseek-comprehensive-installation-guide-r1-model-vjmhc
[3] https://www.restack.io/p/real-time-ai-inference-answer-fastapi-applications-cat-ai
[4] https://618media.com/en/blog/integrating-deepseek-r1-into-existing-systems-a-guide/
[5] https://blog.stackademic.com/integrating-deepseek-r1-with-fastapi-building-an-ai-powered-resume-analyzer-code-demo-4e1cc29cdc6e
[6] https://dev.to/ajmal_hasan/setting-up-ollama-running-deepseek-r1-locally-for-a-powerful-rag-system-4pd4
[7] https://www.restack.io/p/real-time-ai-inference-answer-fastapi-data-handling-cat-ai
[8] https://aws.amazon.com/blogs/machine-learning/deploy-deepseek-r1-distilled-models-on-amazon-sagemaker-using-a-large-model-inference-container/
[9] https://www.byteplus.com/en/topic/397556
[10] https://www.evidentlyai.com/blog/fastapi-tutorial