How to Build a Custom AI API with FastAPI and Ollama: A Step-by-Step Guide
Learn to build a custom AI API with FastAPI and Ollama. Step-by-step guide for private, scalable AI solutions.
- 8 min read

Introduction: Unleashing AI Power on Your Own Terms
Imagine having a personal AI assistant that runs on your machine, answers your questions, and powers your applications without ever touching the cloud. Sounds like a sci-fi dream? It’s not—it’s reality with Ollama and FastAPI. Whether you’re a developer crafting a private chatbot for a client, a data scientist prototyping AI tools, or a hobbyist tinkering with large language models (LLMs), building a custom AI API can transform your projects. But how do you bring this dream to life without a PhD in machine learning or a hefty cloud budget?
In this guide, we’ll walk you through creating a custom AI API using Ollama, a tool that lets you run LLMs locally, and FastAPI, a blazing-fast Python framework for building APIs. By the end, you’ll have a fully functional AI API that’s secure, scalable, and ready to power your next big idea. We’ll make this journey engaging, practical, and packed with insights from recent sources and real-world applications. Ready to dive in? Let’s get started!
Why Build a Custom AI API with Ollama and FastAPI?
Before we jump into the code, let’s explore why this combo is a game-changer:
- Privacy First: Running LLMs locally with Ollama means your data never leaves your machine—perfect for industries like healthcare or finance where security is non-negotiable.
- Cost Efficiency: No recurring API fees. Once you set up Ollama, your AI runs on your hardware, slashing cloud costs.
- Speed and Scalability: FastAPI’s high-performance architecture, built on Starlette and Pydantic, offers performance on par with Node.js and Go, making it ideal for real-time applications.
- Customization: Fine-tune models or swap them out with Ollama’s library to suit your needs, from chatbots to code assistants.
- Offline Capabilities: No internet? No problem. Ollama runs LLMs locally, ensuring your API works in disconnected environments.
Think of this setup as your own personal AI factory: Ollama provides the raw materials (LLMs), and FastAPI builds the sleek assembly line to deliver your AI’s output to the world.
Prerequisites: Setting the Stage
Before we build, let’s gather our tools. You’ll need:
- Python 3.8+: Ensure a recent Python is installed. Download it from python.org if needed.
- Ollama: A platform for running LLMs locally. Install it from ollama.com or via command line (we’ll cover this).
- FastAPI and Dependencies: We’ll install these using pip.
- A Code Editor: Visual Studio Code or PyCharm works great.
- Basic Python Knowledge: Familiarity with Python and REST APIs helps, but we’ll keep things beginner-friendly.
Got your tools ready? Let’s build something amazing.
Step 1: Installing and Setting Up Ollama
Ollama is your AI engine, letting you run models like Llama 3.1, DeepSeek, or Mistral on your machine. Here’s how to get it up and running:
Download and Install Ollama
- Visit ollama.com and download the installer for your OS (Windows, macOS, or Linux).
- Alternatively, on Linux, use the official install script for a quick setup:
curl -fsSL https://ollama.com/install.sh | sh
- Verify the installation by checking the version:
ollama --version
Pull a Model
Ollama’s library offers models like Llama 3.1, DeepSeek-R1, and Mistral. For this guide, we’ll use Llama 3.1 (8B parameters) for its balance of performance and resource needs:
ollama pull llama3.1
This downloads the model to your machine. Be patient—it may take a while depending on your internet speed. Once downloaded, test it:
ollama run llama3.1
Type a prompt like “What is AI?” to confirm it’s working. You’ll see the model respond directly in the terminal.
Ensure Ollama is Running
Ollama runs a local server on http://localhost:11434. Check if it’s active:
curl http://localhost:11434
If you see a response, Ollama is ready. If not, start it manually:
ollama serve
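Before wrapping Ollama in FastAPI, it’s worth hitting its native REST API once as a sanity check. The call below assumes the default port and the llama3.1 model pulled above; setting "stream" to false returns a single JSON object instead of a token stream:
curl http://localhost:11434/api/generate -d '{"model": "llama3.1", "prompt": "What is AI?", "stream": false}'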
Step 2: Setting Up Your Python Environment
Now, let’s prepare the Python side of things. We’ll create a virtual environment and install FastAPI and other dependencies.
Create a Project Directory
mkdir ai-api-project
cd ai-api-project
Set Up a Virtual Environment
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
Install Dependencies
Install FastAPI, Uvicorn (an ASGI server for FastAPI), the requests library (handy for testing later), and the Ollama Python client:
pip install fastapi uvicorn requests ollama
Create a requirements.txt file to track dependencies:
pip freeze > requirements.txt
Step 3: Building the FastAPI Application
FastAPI is our API framework, known for its speed and automatic generation of interactive documentation. Let’s create a basic API that interacts with Ollama.
Create the Main Application File
Create a file named main.py and add the following code:
from fastapi import FastAPI
from pydantic import BaseModel
import ollama

app = FastAPI()

class PromptRequest(BaseModel):
    prompt: str
    model: str = "llama3.1"

@app.get("/")
async def root():
    return {"message": "Welcome to your custom AI API!"}

@app.post("/generate")
async def generate_text(request: PromptRequest):
    try:
        response = ollama.generate(model=request.model, prompt=request.prompt)
        return {"response": response["response"]}
    except Exception as e:
        return {"error": str(e)}
What’s Happening Here?
- FastAPI Setup: We initialize a FastAPI app.
- Pydantic Model: PromptRequest defines the expected JSON payload (a prompt and an optional model name).
- Endpoints: / returns a simple welcome message, while /generate sends the user’s prompt to Ollama and returns the generated text.
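If you need more control over generation, the Ollama client also accepts an options dictionary (temperature, number of tokens to predict, and so on). A minimal sketch reusing the same PromptRequest model; the endpoint name and option values here are illustrative, not part of the core API above:

# Sketch: an extra endpoint that passes sampling options through to Ollama.
@app.post("/generate-creative")
async def generate_creative(request: PromptRequest):
    response = ollama.generate(
        model=request.model,
        prompt=request.prompt,
        options={"temperature": 0.9, "num_predict": 256},  # example values; tune as needed
    )
    return {"response": response["response"]}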
Run the FastAPI Server
Start the server with Uvicorn:
uvicorn main:app --reload
Your API is now live at http://localhost:8000. Visit http://localhost:8000/docs to see FastAPI’s interactive Swagger UI, where you can test endpoints.
Step 4: Testing the API
Let’s test our API using cURL or the Swagger UI.
Using cURL
Send a POST request to the /generate endpoint:
curl -X POST "http://localhost:8000/generate" -H "Content-Type: application/json" -d '{"prompt": "What is the capital of France?", "model": "llama3.1"}'
You should see a response like:
{"response": "The capital of France is Paris."}
Using Swagger UI
- Open http://localhost:8000/docs in your browser.
- Click the /generate endpoint, then “Try it out.”
- Enter a JSON payload like:
{ "prompt": "What is the capital of France?", "model": "llama3.1" }
- Click “Execute” to see the response.
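You can also exercise the endpoint from Python using the requests library installed earlier. A short test script, assuming the API is running locally on port 8000:

import requests

# Call the /generate endpoint of the local FastAPI app.
payload = {"prompt": "What is the capital of France?", "model": "llama3.1"}
resp = requests.post("http://localhost:8000/generate", json=payload, timeout=120)
resp.raise_for_status()
print(resp.json()["response"])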
Step 5: Enhancing the API with Advanced Features
Our basic API is working, but let’s make it more robust with features like model management and streaming responses.
Add Model Management
Let’s add an endpoint to list available models:
@app.get("/models")
async def list_models():
models = ollama.list()
return {"models": models["models"]}
Test it:
curl http://localhost:8000/models
This returns the list of models downloaded to your machine, such as llama3.1, mistral, etc.
Enable Streaming Responses
Streaming allows real-time, token-by-token output, which is ideal for chatbots. Add a new /stream endpoint alongside /generate:
from fastapi.responses import StreamingResponse
import json

@app.post("/stream")
async def stream_text(request: PromptRequest):
    def generate():
        stream = ollama.generate(model=request.model, prompt=request.prompt, stream=True)
        for chunk in stream:
            yield json.dumps({"response": chunk["response"]}) + "\n"
    return StreamingResponse(generate(), media_type="application/x-ndjson")
Test streaming:
curl -X POST "http://localhost:8000/stream" -H "Content-Type: application/json" -d '{"prompt": "Tell me a story", "model": "llama3.1"}'
You’ll see the response stream in real-time.
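To consume the stream from Python instead of cURL, read the NDJSON lines as they arrive. A minimal sketch with requests, again assuming the API is on localhost:8000:

import json
import requests

payload = {"prompt": "Tell me a story", "model": "llama3.1"}
# stream=True keeps the connection open so lines can be read as the server sends them.
with requests.post("http://localhost:8000/stream", json=payload, stream=True, timeout=300) as resp:
    resp.raise_for_status()
    for line in resp.iter_lines():
        if line:  # skip keep-alive blank lines
            chunk = json.loads(line)
            print(chunk["response"], end="", flush=True)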
Step 6: Deploying Your API
To make your API accessible beyond your local machine, consider deploying it. Here’s a quick guide using Docker:
Create a Dockerfile
FROM python:3.9
WORKDIR /app
COPY requirements.txt .
RUN pip install -r requirements.txt
COPY . .
CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8000"]
Build and Run
docker build -t ai-api .
docker run -d -p 8000:8000 --add-host=host.docker.internal:host-gateway ai-api
Ensure Ollama is running on your host machine or in a separate Docker container, accessible at host.docker.internal:11434.
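Inside the container, localhost no longer refers to your host machine, so the Ollama Python client needs to be pointed at the right address. The client honors the OLLAMA_HOST environment variable, so one option (a sketch building on the host-gateway mapping above) is to pass it at run time:
docker run -d -p 8000:8000 --add-host=host.docker.internal:host-gateway -e OLLAMA_HOST=http://host.docker.internal:11434 ai-api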
Deployment Tips
- Security: Use environment variables for sensitive data and consider authentication (e.g., JWT or an API key) for public APIs; see the sketch after this list.
- Scaling: Use a reverse proxy like Nginx for load balancing and HTTPS.
- Monitoring: Track API usage with tools like Prometheus or Grafana.
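As a starting point for the security tip above, here is a minimal sketch of API-key authentication using a FastAPI dependency. The X-API-Key header name and the APP_API_KEY environment variable are assumptions chosen for this example; for multi-user APIs, a full JWT setup is the more common choice:

import os
from fastapi import Depends, HTTPException, Security
from fastapi.security import APIKeyHeader

# Clients must send an X-API-Key header (header name chosen for this sketch).
api_key_header = APIKeyHeader(name="X-API-Key", auto_error=False)

def require_api_key(api_key: str = Security(api_key_header)):
    # The expected key is read from an environment variable (APP_API_KEY is an assumed name).
    if not api_key or api_key != os.getenv("APP_API_KEY"):
        raise HTTPException(status_code=401, detail="Invalid or missing API key")

# Protect existing routes by attaching the dependency, e.g.:
# @app.post("/generate", dependencies=[Depends(require_api_key)])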
Real-World Applications and Case Studies
This setup isn’t just a cool project—it’s a foundation for real-world solutions:
- Healthcare: A hospital used Ollama and FastAPI to create a private chatbot for patient data analysis, ensuring HIPAA compliance by keeping data local.
- Education: A university built an AI-powered tutor API to assist students offline, reducing latency and costs compared to cloud-based solutions.
- Startups: A startup developed a coding assistant API using DeepSeek-R1 and Ollama, offering a free alternative to GitHub Copilot.
These examples show the power of local AI APIs: privacy, cost savings, and flexibility.
Challenges and Best Practices
Building an AI API isn’t without hurdles:
- Hardware Requirements: Larger models like Llama 3.1 (70B) need significant RAM and GPU power. Start with smaller models like Llama 3.1 (8B) if your hardware is limited.
- Error Handling: Always include try/except blocks and return meaningful HTTP status codes so clients can handle failures gracefully.
- Rate Limiting: Implement rate limiting with slowapi to prevent resource overuse (see the sketch after this list).
- Documentation: FastAPI’s auto-generated docs are great, but add custom descriptions for clarity.
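Here is a minimal sketch combining error handling and rate limiting, assuming slowapi is installed (pip install slowapi); the 10-requests-per-minute limit is just an example value:

# Condensed sketch of main.py with a per-client rate limit and proper HTTP errors.
from fastapi import FastAPI, HTTPException, Request
from pydantic import BaseModel
from slowapi import Limiter, _rate_limit_exceeded_handler
from slowapi.errors import RateLimitExceeded
from slowapi.util import get_remote_address
import ollama

limiter = Limiter(key_func=get_remote_address)  # rate-limit per client IP
app = FastAPI()
app.state.limiter = limiter
app.add_exception_handler(RateLimitExceeded, _rate_limit_exceeded_handler)

class PromptRequest(BaseModel):
    prompt: str
    model: str = "llama3.1"

@app.post("/generate")
@limiter.limit("10/minute")  # example limit; tune for your hardware
async def generate_text(request: Request, body: PromptRequest):
    try:
        response = ollama.generate(model=body.model, prompt=body.prompt)
        return {"response": response["response"]}
    except ollama.ResponseError as e:
        # Surface Ollama-side failures (e.g. an unknown model) as a proper HTTP error.
        raise HTTPException(status_code=500, detail=str(e))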
Pro Tip: Regularly update Ollama to access new models and performance improvements. Check the Ollama GitHub for updates.
Conclusion: Your AI Journey Starts Here
Congratulations—you’ve built a custom AI API with FastAPI and Ollama! You now have a powerful, private, and cost-effective tool to bring AI to your applications. Whether you’re creating a chatbot, automating workflows, or prototyping the next big thing, this setup gives you control and flexibility.
What’s next? Experiment with different models, integrate with frontends like Streamlit or React Native, or explore fine-tuning for domain-specific tasks. The possibilities are endless, and your AI factory is ready to scale. Share your creations in the comments or on X—let’s inspire each other to push the boundaries of local AI!
Happy coding, and may your APIs always return 200 OK!