How it works
under the hood.

A containerized stack of open-source services. Each component is isolated, replaceable, and independently scalable.

// overview

Request lifecycle.

Every request stays on your network. Here's the path from browser to model and back.

1

User sends a request

Through the React web UI — upload a file, type a question, or trigger a TTS job.

2

API Gateway routes it

Django backend validates the request, authenticates the session, and routes to the appropriate service handler.

3

Context is assembled

For RAG queries: embeddings are generated, relevant chunks are retrieved from the vector store, and a prompt is constructed with context.

4

Model engine processes

The request hits your chosen inference engine — Ollama, vLLM, LM Studio, or any OpenAI-compatible server running on your hardware.

5

Response streams back

Tokens stream back in real-time via SSE. No data leaves your network. Everything is logged locally.

// stack

The full stack.

Four core containers, each with a single responsibility. Swap any component without touching the others.

01 Frontend

Web UI

Next.js application served through a Caddy reverse proxy. Provides the full user interface — chat, file uploads, voice transcription, model management, and settings. Communicates with the Django backend via REST API and Server-Sent Events for real-time streaming.

Next.js 14 TypeScript Tailwind CSS Caddy
PORT 80
02 Backend

API Gateway

Django REST API that orchestrates all services. Handles authentication, file parsing, RAG pipeline, prompt construction, and routes requests to the correct model engine or service. All business logic lives here.

Python 3.12 Django Django REST Framework Gunicorn
PORT 8000
03 Inference

Model Engine

Ollama runs as a bundled container by default, giving you instant access to hundreds of open-source models. You can also point Local AI at a host-installed Ollama instance or any other OpenAI-compatible inference server by changing a single environment variable.

Ollama (bundled default) or any OpenAI-compatible server
PORT 11434
04 Storage

Storage Layer

PostgreSQL stores all application data — users, chat history, documents, and RAG embeddings. A separate RAG service handles document ingestion, chunking, and vector search. All data is persisted in named Docker volumes on your machine.

PostgreSQL 16 pgvector Docker Volumes
PORT 5433
// configuration

One compose file. Four services.

Here's what the default docker-compose.yml looks like.

docker-compose.yml
# local-ai.run — Docker Compose

version: "3.9"
services:

# ── Web UI ──
ui:
image: localai/ui:latest
ports: ["3000:3000"]
depends_on: [api]

# ── API Gateway ──
api:
image: localai/api:latest
ports: ["8000:8000"]
environment:
MODEL_ENGINE: ollama
VECTOR_STORE: chromadb
depends_on: [ollama, chromadb]

# ── Model Engine ──
ollama:
image: ollama/ollama:latest
ports: ["11434:11434"]
volumes: [ollama_data:/root/.ollama]

# ── Vector Store ──
chromadb:
image: chromadb/chroma:latest
ports: ["8001:8000"]
volumes: [chroma_data:/chroma/chroma]
// security & privacy

Your data never leaves.

local-ai is designed from the ground up for air-gapped, on-premise deployment.

🔒

Zero external calls

No telemetry, no analytics, no outbound network requests. Runs fully offline after initial Docker pull.

🛡️

Local storage only

All files, embeddings, chat history, and generated outputs stay in Docker volumes on your machine.

🔑

Optional auth layer

Built-in session auth for multi-user setups. Drop in your own SSO/LDAP provider via environment config.

📋

Audit logging

Every query, file upload, and model call is logged locally. Export logs for compliance and review.

See it in action.

Install in under 2 minutes and explore the stack yourself.

Install local-ai View Source Code