How It Works - Local AI | Private, Offline AI for Everyone

// overview

Request lifecycle.

Every request stays on your network. Here's the path from browser to model and back.

1

User sends a request

Through the React web UI — upload a file, type a question, or trigger a TTS job.

2

API Gateway routes it

Django backend validates the request, authenticates the session, and routes to the appropriate service handler.

3

Context is assembled

For RAG queries: embeddings are generated, relevant chunks are retrieved from the vector store, and a prompt is constructed with context.

4

Model engine processes

The request hits your chosen inference engine — Ollama, vLLM, LM Studio, or any OpenAI-compatible server running on your hardware.

5

Response streams back

Tokens stream back in real-time via SSE. No data leaves your network. Everything is logged locally.

// stack

The full stack.

Four core containers, each with a single responsibility. Swap any component without touching the others.

01 Frontend

Web UI

Next.js application served through a Caddy reverse proxy. Provides the full user interface — chat, file uploads, voice transcription, model management, and settings. Communicates with the Django backend via REST API and Server-Sent Events for real-time streaming.

Next.js 14 TypeScript Tailwind CSS Caddy

PORT 80

02 Backend

API Gateway

Django REST API that orchestrates all services. Handles authentication, file parsing, RAG pipeline, prompt construction, and routes requests to the correct model engine or service. All business logic lives here.

Python 3.12 Django Django REST Framework Gunicorn

PORT 8000

03 Inference

Model Engine

Ollama runs as a bundled container by default, giving you instant access to hundreds of open-source models. You can also point Local AI at a host-installed Ollama instance or any other OpenAI-compatible inference server by changing a single environment variable.

Ollama (bundled default) or any OpenAI-compatible server

PORT 11434

04 Storage

Storage Layer

PostgreSQL stores all application data — users, chat history, and documents. The pgvector extension turns it into a full vector database, storing RAG embeddings for fast similarity search. A separate RAG service handles document ingestion, chunking, and vector retrieval. All data is persisted in named Docker volumes on your machine.

PostgreSQL 16 pgvector (Vector DB) Docker Volumes

PORT 5433

// configuration

One compose file. The whole stack.

Here's what the default docker-compose.yml looks like.

docker-compose.yml

# local-ai.run — docker-compose.yml

version: "3.9"

services:

# ── Reverse Proxy ──

caddy:

image: caddy:2-alpine

ports: ["80:80"]

depends_on: [nextjs, django]

# ── Database + pgvector (vector store) ──

postgres:

image: postgres:16-alpine

ports: ["5433:5432"]

volumes: [postgres_data:/var/lib/postgresql/data]

# ── API Gateway ──

django:

build: ./backend

environment:

DATABASE_URL: postgresql://…@postgres:5432/…

OLLAMA_BASE_URL: http://host.docker.internal:11434

depends_on: [postgres]

# ── Web UI ──

nextjs:

build: ./frontend

environment:

BACKEND_URL: http://django:8000

depends_on: [django]

# ── Model Engine (optional profile) ──

ollama:

image: ollama/ollama:latest

profiles: [container-ollama]

ports: ["11434:11434"]

volumes: [ollama_data:/root/.ollama]

# ── RAG · Whisper · Auto-updater ──

rag:

build: ./rag

ports: ["8501:8501"]

whisper:

build: ./whisper

updater:

build: ./updater

// security & privacy

Your data never leaves.

local-ai is designed from the ground up for air-gapped, on-premise deployment.

🔒

Zero external calls

No telemetry, no analytics, no outbound network requests. Runs fully offline after initial Docker pull.

🛡️

Local storage only

All files, embeddings, chat history, and generated outputs stay in Docker volumes on your machine.

🔑

Optional auth layer

Built-in session auth for multi-user setups. Drop in your own SSO/LDAP provider via environment config.

📋

Audit logging

Every query, file upload, and model call is logged locally. Export logs for compliance and review.

How it works
under the hood.

Request lifecycle.

User sends a request

API Gateway routes it

Context is assembled

Model engine processes

Response streams back

The full stack.

Web UI

API Gateway

Model Engine

Storage Layer

One compose file. The whole stack.

Your data never leaves.

Zero external calls

Local storage only

Optional auth layer

Audit logging

See it in action.

How it worksunder the hood.

Request lifecycle.

User sends a request

API Gateway routes it

Context is assembled

Model engine processes

Response streams back

The full stack.

Web UI

API Gateway

Model Engine

Storage Layer

One compose file. The whole stack.

Your data never leaves.

Zero external calls

Local storage only

Optional auth layer

Audit logging

See it in action.

How it works
under the hood.