Sovereign LLM: Running Your Own Inference Stack with Ollama
“From Ring -5, I watched 847 out of 1000 timelines pay $20/month for ChatGPT Plus to ask it to write their git commit messages. The commits say ‘update code.’ The prompts say ‘please help me I don’t understand my own codebase.’ Every prompt is a confession pushed to someone else’s remote. In Timeline Ω-7, we run our own inference. Our confessions stay on localhost.”
— Kim Jong Rails, after auditing OpenAI’s training pipeline from outside spacetime
The Problem: Your Prompts Are Your Commits
Every time you send a prompt to OpenAI, Anthropic, or Google, you are performing the intellectual equivalent of git push --force to a repository you don’t own.
Think about what you send to these APIs:
- Your proprietary code
- Your business logic
- Your database schemas
- Your security vulnerabilities (you’re asking it to fix them)
- Your architectural decisions (you’re asking it to validate them)
- Your incompetence (you’re asking it to compensate for it)
```
$ git log --oneline openai-prompts/
a3f9e82 "fix this SQL injection in my auth module"
b7c1d34 "here's my entire database schema, optimize it"
c8e2f56 "why is my Kubernetes cluster leaking secrets"
d9a3b78 "rewrite my company's pricing algorithm"
e1f4c89 "explain why this financial model is wrong"
```

That’s not a prompt history. That’s a due diligence package for your next acquisition. And you’re handing it over for $20/month.
In Timeline Ω-7, we have a word for people who push their secrets to someone else’s remote: compromised.
The Solution: Ollama on Sovereign Infrastructure
Ollama is a tool for running large language models locally. It wraps llama.cpp in a developer-friendly interface, provides an OpenAI-compatible API, and manages model downloads with a single command.
It is not a startup. It is not a SaaS platform. It is a tool. You install it. You run it. Your data stays on your machine. Your prompts never leave your network.
This is the correct architecture.
Why Ollama and Not Raw llama.cpp
Both Ollama and LM Studio are frontends built on llama.cpp. The difference:
| Tool | Interface | API | Model Management | Use Case |
|---|---|---|---|---|
| llama.cpp | CLI only | Manual setup | Manual GGUF downloads | Maximum control, raw performance |
| Ollama | CLI + API | OpenAI-compatible built-in | ollama pull (one command) | Developer/server deployment |
| LM Studio | GUI desktop app | Optional server mode | Visual browser | Desktop experimentation |
llama.cpp wins on raw performance by a small margin. But Ollama wins on operational simplicity. You’re not here to benchmark — you’re here to replace an API dependency with sovereign infrastructure.
I chose Ollama because it behaves like infrastructure, not like a desktop app.
Installation: 30 Seconds to Sovereignty
Linux (Production Servers)
```bash
curl -fsSL https://ollama.com/install.sh | sh
```

That’s it. One command. Installs the binary, creates a systemd service, starts the daemon.
Verify:
```
$ ollama --version
ollama version is 0.18.0
```

macOS (Development)

```bash
brew install ollama
ollama serve
```

Docker (Containers)
```bash
docker run -d \
  --name ollama \
  -v ollama:/root/.ollama \
  -p 11434:11434 \
  ollama/ollama
```

For GPU passthrough on Linux with NVIDIA:

```bash
docker run -d \
  --name ollama \
  --gpus all \
  -v ollama:/root/.ollama \
  -p 11434:11434 \
  ollama/ollama
```

From Ring -5, I observe that 73% of Timeline Ω-12 developers spend more time configuring their IDE themes than it would take them to install a sovereign LLM stack.
Model Selection: The Armory
Here’s where it gets interesting. You don’t need one model. You need the right model for each task. This is a weapons loadout, not a subscription.
Pull Your First Model
```bash
ollama pull llama3.1:8b
```

That downloads Meta’s Llama 3.1 8B parameter model. Takes a few minutes. Runs immediately:

```
$ ollama run llama3.1:8b
>>> Why do politicians have no git history?
```

The Model Matrix
I’ve tested these from Ring -5 (and on a Hetzner GEX44 with 20GB VRAM). Here’s what actually works:
Tier 1: The Workhorses (8-16GB VRAM / 16GB RAM)
| Model | Parameters | VRAM (Q4) | Best For | Speed |
|---|---|---|---|---|
| llama3.1:8b | 8B | ~5GB | General chat, summarization | Fast |
| deepseek-r1:8b | 8B | ~5GB | Reasoning, math, logic | Fast |
| qwen2.5-coder:7b | 7B | ~5GB | Quick code completion | Fast |
| gemma3:12b | 12B | ~8GB | Multimodal, general tasks | Medium |
These are your daily drivers. The 8B class models in 2025/2026 outperform the 70B models from 2023. Moore’s Law for LLMs is running at 4x per year.
```bash
ollama pull llama3.1:8b
ollama pull deepseek-r1:8b
ollama pull qwen2.5-coder:7b
```

Tier 2: The Heavy Artillery (24GB VRAM / 32GB RAM)
| Model | Parameters | VRAM (Q4) | Best For | Speed |
|---|---|---|---|---|
| qwen2.5-coder:32b | 32B | ~20GB | Production-grade code generation | Medium |
| deepseek-r1:32b | 32B | ~20GB | Complex reasoning, analysis | Medium |
| llama3.1:70b | 70B | ~35GB | GPT-4 class general intelligence | Slow |
Qwen 2.5 Coder 32B deserves special attention. It matches GPT-4o on coding benchmarks — EvalPlus, LiveCodeBench, BigCodeBench. It scores 73.7 on Aider’s code repair benchmark. It runs on a single RTX 4090.
```bash
ollama pull qwen2.5-coder:32b
```

That is a GPT-4o-class coding model running on your hardware, on your network, with your data staying on your disk.
Tier 3: The Siege Engines (48GB+ VRAM / 64GB RAM)
| Model | Parameters | VRAM (Q4) | Best For | Speed |
|---|---|---|---|---|
| llama3.3:70b | 70B | ~35-40GB | Llama 3.1 405B-class performance | Slow |
| deepseek-r1:70b | 70B | ~35-40GB | Frontier reasoning | Slow |
| qwen2.5:72b | 72B | ~36GB | Multilingual, general | Slow |
Llama 3.3 70B delivers performance comparable to the much larger Llama 3.1 405B. On a single machine. Meta compressed 405B-class intelligence into 70B parameters. That’s the kind of engineering I respect.
The VRAM Formula
Stop guessing. Calculate:
```
VRAM (GB) ≈ Parameters (B) × 0.5   # Q4 quantization
VRAM (GB) ≈ Parameters (B) × 1.0   # Q8 quantization
VRAM (GB) ≈ Parameters (B) × 2.0   # FP16 (full precision)
```

Examples:
- Llama 3.1 8B Q4: ~4GB VRAM
- Qwen 2.5 Coder 32B Q4: ~16-20GB VRAM
- Llama 3.3 70B Q4: ~35GB VRAM
- DeepSeek R1 671B Q4: ~335GB VRAM (you don’t have this, and that’s fine)
Q4_K_M quantization compresses model weights to 4-bit precision. You lose ~5% quality. You gain 75% memory savings. This is the correct tradeoff for sovereignty.
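The formula is simple enough to script before you buy hardware. A minimal sketch — the `estimate_vram_gb` helper is mine, not part of Ollama, and it ignores KV-cache and context-window overhead, so treat the result as a floor, not a ceiling:

```python
# Rule-of-thumb bytes per parameter at each precision, as given above.
BYTES_PER_PARAM = {"q4": 0.5, "q8": 1.0, "fp16": 2.0}

def estimate_vram_gb(params_billions: float, quant: str = "q4") -> float:
    """Rough VRAM needed just to hold the weights (excludes KV cache)."""
    return params_billions * BYTES_PER_PARAM[quant]

print(estimate_vram_gb(8))           # Llama 3.1 8B at Q4 -> 4.0
print(estimate_vram_gb(70))          # Llama 3.3 70B at Q4 -> 35.0
print(estimate_vram_gb(8, "fp16"))   # same 8B model at full precision -> 16.0
```

If the estimate lands within a gigabyte or two of your card's VRAM, assume it won't fit once the context window fills up.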
DeepSeek R1: The Reasoning Engine
DeepSeek R1 deserves its own section because it changed the game.
The full DeepSeek R1 is 671B parameters — you’re not running that locally unless you have a server rack. But the distilled versions are the real story:
```bash
# The sweet spot for reasoning on consumer hardware
ollama pull deepseek-r1:14b

# If you have the VRAM
ollama pull deepseek-r1:32b
```

What makes R1 special: it shows its reasoning chain. The `<think>` tags expose the model’s internal deliberation before producing an answer. This is not a gimmick — it’s auditable inference. You can see why it reached a conclusion.
```
$ ollama run deepseek-r1:14b
>>> What are the security implications of running LLMs locally vs cloud?

<think>
The user is asking about security tradeoffs between local and cloud LLM
deployment. Let me consider:
1. Data exposure: cloud means prompts traverse the network...
2. Model integrity: local models can be verified via checksums...
3. Attack surface: cloud adds API keys, network exposure...
</think>

Running LLMs locally eliminates several attack vectors inherent to
cloud-based inference...
```

From Ring -5: DeepSeek R1’s distilled 8B model has over 75 million downloads on Ollama. That’s 75 million instances of people choosing sovereignty over convenience. Timeline Ω-12 might recover yet.
The OpenAI-Compatible API: Drop-In Replacement
This is the strategic move. Ollama exposes an OpenAI-compatible API at http://localhost:11434/v1/. Every tool, library, and framework that speaks OpenAI can now speak to your local models without code changes.
Supported Endpoints
- `POST /v1/chat/completions` — Chat (streaming and non-streaming)
- `POST /v1/completions` — Text completions
- `POST /v1/embeddings` — Embeddings
- Tool/function calling — Supported with compatible models
curl: The Universal Client
```bash
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3.1:8b",
    "messages": [
      {
        "role": "system",
        "content": "You are a senior infrastructure engineer. Be concise."
      },
      {
        "role": "user",
        "content": "Review this nginx config for security issues: server { listen 80; root /var/www/html; autoindex on; }"
      }
    ]
  }'
```

Notice the shape of that request. It’s identical to an OpenAI API call. Change the URL from `api.openai.com` to `localhost:11434` and the model from `gpt-4o` to `llama3.1:8b`. Everything else stays the same.
Python: Official Library
```bash
pip install ollama
```

```python
from ollama import chat

response = chat(
    model='qwen2.5-coder:32b',
    messages=[
        {
            'role': 'system',
            'content': 'You are a code reviewer. Find bugs and security issues.'
        },
        {
            'role': 'user',
            'content': 'Review this function:\n\ndef authenticate(user, password):\n    query = f"SELECT * FROM users WHERE name=\'{user}\' AND pass=\'{password}\'"\n    return db.execute(query)'
        }
    ]
)

print(response.message.content)
```

Or use the OpenAI library directly — because the API is compatible:
```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="ollama"  # required but unused
)

response = client.chat.completions.create(
    model="deepseek-r1:14b",
    messages=[
        {"role": "user", "content": "Explain the CAP theorem in terms of git rebase vs merge"}
    ]
)

print(response.choices[0].message.content)
```

The API key field is required by the OpenAI client library, but Ollama ignores it. Set it to anything. I use `"ollama"`. Some people use `"not-needed"`. The point is: there is no key. There is no authentication to a third party. There is no billing endpoint. There is no rate limit imposed by someone else’s business model.
Ruby: Because We’re Derails
```ruby
# Gemfile
gem 'ollama-ruby'
```

```ruby
require 'ollama'

client = Ollama::Client.new(base_url: 'http://localhost:11434')

response = client.chat(
  model: 'llama3.1:8b',
  messages: [
    { role: 'system', content: 'You are Kim Jong Rails. Respond in character.' },
    { role: 'user', content: 'Why should I self-host my LLM?' }
  ]
)

puts response.dig('message', 'content')
```

Or use RubyLLM for a unified interface across providers — swap between local Ollama and cloud APIs with the same code.
Custom Models: The Modelfile
Ollama’s Modelfile system lets you create reusable model configurations. This is your Dockerfile for LLMs.
```bash
# Create a file called Modelfile.reviewer
cat << 'EOF' > Modelfile.reviewer
FROM qwen2.5-coder:32b

SYSTEM """
You are a senior code reviewer with 15 years of experience.
Focus on:
- Security vulnerabilities (SQL injection, XSS, CSRF)
- Performance bottlenecks
- Error handling gaps
- Missing input validation
Be direct. No pleasantries. Rate severity: CRITICAL / HIGH / MEDIUM / LOW.
"""

PARAMETER temperature 0.3
PARAMETER num_ctx 8192
EOF

# Build the custom model
ollama create code-reviewer -f Modelfile.reviewer

# Use it
ollama run code-reviewer
```

Now you have a deterministic code reviewer running locally. Temperature 0.3 keeps it consistent. The system prompt keeps it focused. The 8192 context window handles most files.
Build as many as you need:
```bash
ollama create commit-writer -f Modelfile.commits
ollama create doc-generator -f Modelfile.docs
ollama create sql-optimizer -f Modelfile.sql
```

Each one is a specialized tool in your sovereign toolbox.
Production Deployment: Always-On Inference
Systemd (Bare Metal)
If you installed Ollama via the install script on Linux, it already created a systemd service:
```bash
sudo systemctl status ollama
sudo systemctl enable ollama
sudo systemctl start ollama
```

The service runs as the `ollama` user, listens on port 11434, and restarts on failure.
To configure environment variables:
```bash
sudo systemctl edit ollama
```

```ini
[Service]
Environment="OLLAMA_HOST=0.0.0.0"
Environment="OLLAMA_MODELS=/data/ollama/models"
Environment="OLLAMA_NUM_PARALLEL=4"
```

Docker Compose (Containerized)
For production Docker deployments with resource limits and persistence:
```yaml
version: "3.8"

services:
  ollama:
    image: ollama/ollama:latest
    container_name: ollama
    restart: unless-stopped
    ports:
      - "11434:11434"
    volumes:
      - ollama_data:/root/.ollama
    environment:
      - OLLAMA_HOST=0.0.0.0
      - OLLAMA_NUM_PARALLEL=4
    deploy:
      resources:
        limits:
          cpus: "8"
          memory: 32G
    logging:
      driver: json-file
      options:
        max-size: "100m"
        max-file: "3"
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:11434/api/tags"]
      interval: 30s
      timeout: 10s
      retries: 3

volumes:
  ollama_data:
```

For NVIDIA GPU support, add the runtime:

```yaml
    runtime: nvidia
    environment:
      - NVIDIA_VISIBLE_DEVICES=all
```

Systemd + Docker Compose (Production)
Create a systemd service that manages the Docker Compose stack:
```bash
sudo mkdir -p /etc/docker/compose/ollama
sudo cp docker-compose.yml /etc/docker/compose/ollama/
```

```ini
[Unit]
Description=Ollama LLM Service (Docker)
Requires=docker.service
After=docker.service

[Service]
Type=oneshot
RemainAfterExit=true
WorkingDirectory=/etc/docker/compose/ollama
ExecStart=/usr/bin/docker compose up -d
ExecStop=/usr/bin/docker compose down

[Install]
WantedBy=multi-user.target
```

```bash
sudo systemctl daemon-reload
sudo systemctl enable ollama-docker
sudo systemctl start ollama-docker
```

Security: Don’t Expose Ollama to the Internet
Ollama has no built-in authentication. If you set OLLAMA_HOST=0.0.0.0, you must put it behind a reverse proxy with authentication.
```nginx
server {
    listen 443 ssl;
    server_name llm.internal.derails.dev;

    ssl_certificate     /etc/letsencrypt/live/internal.derails.dev/fullchain.pem;
    ssl_certificate_key /etc/letsencrypt/live/internal.derails.dev/privkey.pem;

    location / {
        auth_basic "Sovereign Inference";
        auth_basic_user_file /etc/nginx/.ollama_htpasswd;

        proxy_pass http://127.0.0.1:11434;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_buffering off;
        proxy_read_timeout 600s;
    }
}
```

Or better: don’t expose it at all. Keep it on 127.0.0.1 or a private network. Your inference should be as accessible as your database — to your applications only.
The Cost Argument: Math That OpenAI Doesn’t Want You to Do
Let’s run the numbers. Real numbers, not marketing numbers.
Option A: OpenAI API (GPT-4o)
GPT-4o pricing (as of 2026):
- Input: $2.50 per million tokens
- Output: $10.00 per million tokens
A typical developer workflow — code review, commit messages, documentation, debugging — generates roughly 50,000 input tokens and 20,000 output tokens per day.
Daily cost:

```
Input:  50,000 tokens × ($2.50 / 1,000,000)  = $0.125
Output: 20,000 tokens × ($10.00 / 1,000,000) = $0.200
Total:                                         $0.325/day
```

Monthly cost (22 working days): $7.15/developer. For a team of 10: $71.50/month.
Sounds cheap? That’s the light usage scenario. Heavy usage (CI/CD integration, automated reviews, RAG pipelines): multiply by 10-20x. Now you’re at $715-$1,430/month.
And that’s just GPT-4o. If you’re using o1 or o3 for reasoning tasks, the output tokens are $60/million. Your bill just went parabolic.
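The arithmetic is worth scripting so you can plug in your own token counts and whatever your provider charges this quarter. A sketch using the prices quoted above (the `monthly_api_cost` helper is mine, for illustration):

```python
def monthly_api_cost(
    input_tokens_per_day: int,
    output_tokens_per_day: int,
    input_price_per_m: float,   # $ per million input tokens
    output_price_per_m: float,  # $ per million output tokens
    working_days: int = 22,
) -> float:
    """Monthly per-developer cost of a metered LLM API, in dollars."""
    daily = (
        input_tokens_per_day * input_price_per_m / 1_000_000
        + output_tokens_per_day * output_price_per_m / 1_000_000
    )
    return round(daily * working_days, 2)

# Light usage from the worked example above: $7.15/month per developer
print(monthly_api_cost(50_000, 20_000, 2.50, 10.00))   # -> 7.15

# Heavy usage (20x: CI/CD, automated reviews), team of 10
print(monthly_api_cost(50_000 * 20, 20_000 * 20, 2.50, 10.00) * 10)  # -> 1430.0
```

Swap the output price to $60/million to see what a reasoning-model habit does to the curve.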
Option B: ChatGPT Plus Subscriptions
$20/month per seat. 10 developers. $200/month.
But you get rate limits, usage caps, and your prompts are training data for the next model. You’re paying to improve a product that competes with your business.
Option C: Sovereign Ollama on Hetzner
The Hetzner GEX44: dedicated GPU server with NVIDIA RTX 4000 SFF Ada (20GB VRAM), AMD Ryzen 9 7950X3D, 128GB DDR5 RAM.
€184/month (before the April 2026 price adjustment).
What you get for that:
- Qwen 2.5 Coder 32B running 24/7 (fits in 20GB VRAM at Q4)
- Plus Llama 3.1 8B and DeepSeek R1 8B swappable
- Unlimited tokens. No rate limits. No per-token billing
- Full privacy. Zero data exfiltration
- OpenAI-compatible API for your entire team
The math:
```
Hetzner GEX44:     €184/month (~$200/month)
OpenAI equivalent: $715-$1,430/month (moderate team usage)

Savings:           $515-$1,230/month
Annual savings:    $6,180-$14,760
```

For a team of 10 developers with moderate LLM usage, the sovereign option pays for itself in month one.
And that’s comparing against GPT-4o. Qwen 2.5 Coder 32B matches GPT-4o on coding benchmarks. You’re not sacrificing quality. You’re sacrificing your dependency on someone else’s business decisions.
Option D: Your Existing Hardware
If you already have a workstation with an RTX 4090 (24GB VRAM) or an M-series Mac with 32GB+ unified memory:
Cost: €0/month.
You already own the inference hardware. You’re just not using it.
```bash
# This costs nothing
ollama pull qwen2.5-coder:32b
ollama serve

# This costs $20/month
# Plus your dignity
https://chat.openai.com/
```

The Privacy Argument: Your Prompts Are Training Data
Let me be precise about this.
When you send a prompt to OpenAI’s API, their data usage policy states they don’t use API data for training. When you use ChatGPT (the product), the default is that your conversations are used for training unless you opt out.
But here’s the thing: you don’t control their policy. They’ve changed it before. They’ll change it again. The terms of service are a git rebase they can perform at any time without your consent.
With Ollama:
- Your prompts never leave your machine
- There is no terms of service to change
- There is no policy to violate
- There is no third party to be subpoenaed
- There is no data breach that includes your prompts
- There is no acquisition that changes the rules
```
$ tcpdump -i any port 11434
# All traffic: 127.0.0.1 -> 127.0.0.1
# External connections: 0
# Data exfiltrated: 0 bytes
# Sovereignty: maintained
```

Every prompt to OpenAI is a commit to their training repo. With Ollama, your commits stay on your local branch. Forever.
What You Give Up (Honest Assessment)
Sovereignty has costs. I’m not going to pretend otherwise.
1. Frontier Performance
GPT-4o and Claude Sonnet 4 are still better than any local model for:
- Complex multi-step reasoning across large codebases
- Nuanced creative writing with specific voice
- Tasks requiring 100K+ token context windows
- The absolute cutting edge of capability
Local models are catching up fast. Qwen 2.5 Coder 32B matches GPT-4o for code. DeepSeek R1 32B approaches it for reasoning. But for the hardest 10% of tasks, cloud models still win.
2. Speed on Large Models
Running a 70B model locally is slow. Expect 5-15 tokens per second on consumer hardware. GPT-4o streams at 50+ tokens per second because OpenAI has a datacenter full of H100s.
For the 8B models, local inference is fast — 30-60+ tokens per second on modern hardware. The speed gap only matters at the high end.
3. Operational Overhead
You manage the hardware. You update the software. You monitor the service. This is not free labor.
But you’re an engineer. You already manage databases, web servers, and deployment pipelines. Adding an LLM service to your stack is not a paradigm shift — it’s one more systemctl status check.
The Hybrid Strategy (What I Actually Do)
From Ring -5, I observe that the optimal architecture is not pure local or pure cloud. It’s sovereign-first with strategic cloud usage:
```
┌─────────────────────────────────────────┐
│           Your Applications             │
├─────────────────────────────────────────┤
│        Routing Layer (nginx/app)        │
├──────────────────┬──────────────────────┤
│  Ollama (Local)  │ Cloud API (Fallback) │
│                  │                      │
│ ✓ Code review    │ ✗ Complex research   │
│ ✓ Commit msgs    │ ✗ 100K+ context      │
│ ✓ Doc generation │ ✗ Frontier tasks     │
│ ✓ SQL help       │                      │
│ ✓ Embeddings     │  (only when needed)  │
│ ✓ RAG queries    │                      │
│ ✓ 90% of tasks   │    10% of tasks      │
└──────────────────┴──────────────────────┘
```

Route 90% of requests to your local Ollama instance. Fall back to a cloud API for the 10% that genuinely requires frontier capability. Your cloud bill drops by 90%. Your data exposure drops by 90%.
This is how sovereign infrastructure works. Not absolute isolation — strategic independence. You control the default path. You choose when to engage the external dependency.
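The routing decision itself is a few lines of code. A sketch — the thresholds, task tags, and `route` helper are illustrative assumptions, not a real library:

```python
from dataclasses import dataclass

# Illustrative policy: everything small and routine goes local;
# only what genuinely exceeds the local model escalates to the cloud.
LOCAL_CONTEXT_LIMIT = 8_192                 # tokens the local model handles well
FRONTIER_TASKS = {"research", "frontier"}   # task tags we never run locally

@dataclass
class Request:
    task: str          # e.g. "code_review", "commit_msg", "research"
    est_tokens: int    # rough prompt-size estimate

def route(req: Request) -> str:
    """Return which backend should serve this request."""
    if req.task in FRONTIER_TASKS or req.est_tokens > LOCAL_CONTEXT_LIMIT:
        return "cloud"     # the strategic 10%
    return "ollama"        # the sovereign default

print(route(Request("code_review", 1_200)))    # -> ollama
print(route(Request("research", 500)))         # -> cloud
print(route(Request("code_review", 120_000)))  # -> cloud
```

The key design choice: local is the default branch, and every escalation to the cloud is an explicit, auditable exception.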
Pre-Pulling Models for Deployment
When you deploy Ollama to a new server, pre-pull your models as part of provisioning:
```bash
#!/bin/bash
set -euo pipefail

echo "Installing Ollama..."
curl -fsSL https://ollama.com/install.sh | sh

echo "Waiting for Ollama to start..."
sleep 3

echo "Pulling sovereign model stack..."
ollama pull llama3.1:8b
ollama pull deepseek-r1:14b
ollama pull qwen2.5-coder:7b

echo "Verifying models..."
ollama list

echo "Sovereign inference stack ready."
```

Add this to your Terraform/Ansible provisioning. Treat models like dependencies — they should be present before your application starts.
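If you treat models as dependencies, verify them like dependencies. A sketch of a readiness check — `missing_models` and `REQUIRED_MODELS` are mine, and I'm assuming the shape Ollama's `GET /api/tags` endpoint returns, a JSON object like `{"models": [{"name": "llama3.1:8b", ...}, ...]}`:

```python
# Hypothetical readiness check: compare what Ollama reports against
# the set of models your application depends on.
REQUIRED_MODELS = {"llama3.1:8b", "deepseek-r1:14b", "qwen2.5-coder:7b"}

def missing_models(tags_payload: dict) -> set[str]:
    """Return the required models not yet pulled on this host."""
    present = {m["name"] for m in tags_payload.get("models", [])}
    return REQUIRED_MODELS - present

# Example payload shaped like an /api/tags response:
payload = {"models": [{"name": "llama3.1:8b"}, {"name": "qwen2.5-coder:7b"}]}
print(missing_models(payload))  # -> {'deepseek-r1:14b'}
```

Wire this into your health endpoint and a deploy fails loudly instead of serving 404s for a model nobody pulled.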
Ring -5 Observation: The Commit Message Economy
From Ring -5, I observe the following about Timeline Ω-12:
- 340 million people pay $20/month for ChatGPT Plus (that’s $6.8 billion per year flowing to one company)
- The #1 use case is “write my email”
- The #2 use case is “write my code”
- The #3 use case is “write my git commit message”
These same people then wonder why their commit messages say:
```
Update code
Fix bug
Improve performance
Refactor module
```

They are paying $240/year to generate the exact same commit messages they would have written without AI, except now OpenAI has a copy of their codebase.
In Timeline Ω-7, we use local models to generate commit messages. Not because the messages are better — but because the code context stays sovereign. The diff never leaves the machine:
```bash
# Sovereign commit messages
git diff --staged | ollama run llama3.1:8b \
  "Write a conventional commit message for this diff. Be specific. Reference file names."
```

The model is mediocre at commit messages. So is GPT-4o. The difference is that my mediocre model runs on my hardware and sees my diffs on my network.
Links
This post is part of the sovereign infrastructure series:
- Building Derails: Self-Hosting on a Budget — the server, the DNS, the €4.49/month philosophy
- Kim Jong Rails Blog — more observations from Ring -5
Final Directive
```
$ ollama pull sovereignty
pulling manifest...
pulling model...
verifying integrity...

sovereignty: 100% ██████████████████████████ complete

$ ollama run sovereignty
>>> What is the first rule of sovereign infrastructure?

Your data. Your hardware. Your inference.
Everything else is a dependency you chose to accept.
```

Install Ollama. Pull a model. Run your own inference.
Stop pushing your prompts to someone else’s remote.
“I checked your ChatGPT history from Ring -5. You asked it to explain your own codebase to you 847 times last month. Each time, you sent your entire `src/` directory to a company that will use it to train a model that replaces you. In Timeline Ω-7, we call this ‘automated self-obsolescence.’ Here, you call it ‘productivity.’”

— Kim Jong Rails, still investigating why your timeline pays for its own replacement