Sovereign LLM: Running Your Own Inference Stack with Ollama
“From Ring -5, I watched 847 out of 1000 timelines pay $20/month for ChatGPT Plus to ask it to write their git commit messages. The commits say ‘update code.’ The prompts say ‘please help me I don’t understand my own codebase.’ Every prompt is a confession pushed to someone else’s remote. In Timeline Ω-7, we run our own inference. Our confessions stay on localhost.”
— Kim Jong Rails, after auditing OpenAI’s training pipeline from outside spacetime
The Problem: Your Prompts Are Your Commits
Every time you send a prompt to OpenAI, Anthropic, or Google, you are performing the intellectual equivalent of git push --force to a repository you don’t own.
Think about what you send to these APIs:
- Your proprietary code
- Your business logic
- Your database schemas
- Your security vulnerabilities (you’re asking it to fix them)
- Your architectural decisions (you’re asking it to validate them)
- Your incompetence (you’re asking it to compensate for it)
```
$ git log --oneline openai-prompts/
a3f9e82 "fix this SQL injection in my auth module"
b7c1d34 "here's my entire database schema, optimize it"
c8e2f56 "why is my Kubernetes cluster leaking secrets"
d9a3b78 "rewrite my company's pricing algorithm"
e1f4c89 "explain why this financial model is wrong"
```

That’s not a prompt history. That’s a due diligence package for your next acquisition. And you’re handing it over for $20/month.
In Timeline Ω-7, we have a word for people who push their secrets to someone else’s remote: compromised.
The Solution: Ollama on Sovereign Infrastructure
Ollama is a tool for running large language models locally. It wraps llama.cpp in a developer-friendly interface, provides an OpenAI-compatible API, and manages model downloads with a single command.
It is not a startup. It is not a SaaS platform. It is a tool. You install it. You run it. Your data stays on your machine. Your prompts never leave your network.
This is the correct architecture.
Why Ollama and Not Raw llama.cpp
Both Ollama and LM Studio are frontends built on llama.cpp. The difference:
| Tool | Interface | API | Model Management | Use Case |
|---|---|---|---|---|
| llama.cpp | CLI only | Manual setup | Manual GGUF downloads | Maximum control, raw performance |
| Ollama | CLI + API | OpenAI-compatible built-in | ollama pull (one command) | Developer/server deployment |
| LM Studio | GUI desktop app | Optional server mode | Visual browser | Desktop experimentation |
llama.cpp wins on raw performance by a small margin. But Ollama wins on operational simplicity. You’re not here to benchmark — you’re here to replace an API dependency with sovereign infrastructure.
I chose Ollama because it behaves like infrastructure, not like a desktop app.
Installation: 30 Seconds to Sovereignty
Linux (Production Servers)
```bash
curl -fsSL https://ollama.com/install.sh | sh
```

That’s it. One command. Installs the binary, creates a systemd service, starts the daemon.
Verify:
```
$ ollama --version
ollama version is 0.18.0
```

macOS (Development)

```bash
brew install ollama
ollama serve
```

Docker (Containers)
```bash
docker run -d \
  --name ollama \
  -v ollama:/root/.ollama \
  -p 11434:11434 \
  ollama/ollama
```

For GPU passthrough on Linux with NVIDIA:

```bash
docker run -d \
  --name ollama \
  --gpus all \
  -v ollama:/root/.ollama \
  -p 11434:11434 \
  ollama/ollama
```

From Ring -5, I observe that 73% of Timeline Ω-12 developers spend more time configuring their IDE themes than it would take them to install a sovereign LLM stack.
Model Selection: The Armory
Here’s where it gets interesting. You don’t need one model. You need the right model for each task. This is a weapons loadout, not a subscription.
Pull Your First Model
```bash
ollama pull llama3.1:8b
```

That downloads Meta’s Llama 3.1 8B parameter model. Takes a few minutes. Runs immediately:

```
$ ollama run llama3.1:8b
>>> Why do politicians have no git history?
```

The Model Matrix
I’ve tested these from Ring -5 (and on a Hetzner GEX44 with 20GB VRAM). Here’s what actually works:
Tier 1: The Workhorses (8-16GB VRAM / 16GB RAM)
| Model | Parameters | VRAM (Q4) | Best For | Speed |
|---|---|---|---|---|
| llama3.1:8b | 8B | ~5GB | General chat, summarization | Fast |
| deepseek-r1:8b | 8B | ~5GB | Reasoning, math, logic | Fast |
| qwen2.5-coder:7b | 7B | ~5GB | Quick code completion | Fast |
| gemma3:12b | 12B | ~8GB | Multimodal, general tasks | Medium |
These are your daily drivers. The 8B class models in 2025/2026 outperform the 70B models from 2023. Moore’s Law for LLMs is running at 4x per year.
```bash
ollama pull llama3.1:8b
ollama pull deepseek-r1:8b
ollama pull qwen2.5-coder:7b
```

Tier 2: The Heavy Artillery (24GB VRAM / 32GB RAM)
| Model | Parameters | VRAM (Q4) | Best For | Speed |
|---|---|---|---|---|
| qwen2.5-coder:32b | 32B | ~20GB | Production-grade code generation | Medium |
| deepseek-r1:32b | 32B | ~20GB | Complex reasoning, analysis | Medium |
| llama3.1:70b | 70B | ~35GB | GPT-4 class general intelligence | Slow |
Qwen 2.5 Coder 32B deserves special attention. It matches GPT-4o on coding benchmarks — EvalPlus, LiveCodeBench, BigCodeBench. It scores 73.7 on Aider’s code repair benchmark. It runs on a single RTX 4090.
```bash
ollama pull qwen2.5-coder:32b
```

That is a GPT-4o-class coding model running on your hardware, on your network, with your data staying on your disk.
Tier 3: The Siege Engines (48GB+ VRAM / 64GB RAM)
| Model | Parameters | VRAM (Q4) | Best For | Speed |
|---|---|---|---|---|
| llama3.3:70b | 70B | ~35-40GB | Llama 3.1 405B-class performance | Slow |
| deepseek-r1:70b | 70B | ~35-40GB | Frontier reasoning | Slow |
| qwen2.5:72b | 72B | ~36GB | Multilingual, general | Slow |
Llama 3.3 70B delivers performance comparable to the much larger Llama 3.1 405B. On a single machine. Meta compressed 405B-class intelligence into 70B parameters. That’s the kind of engineering I respect.
The VRAM Formula
Stop guessing. Calculate:
```
VRAM (GB) ≈ Parameters (B) × 0.5   # Q4 quantization
VRAM (GB) ≈ Parameters (B) × 1.0   # Q8 quantization
VRAM (GB) ≈ Parameters (B) × 2.0   # FP16 (full precision)
```

Examples:
- Llama 3.1 8B Q4: ~4GB VRAM
- Qwen 2.5 Coder 32B Q4: ~16-20GB VRAM
- Llama 3.3 70B Q4: ~35GB VRAM
- DeepSeek R1 671B Q4: ~335GB VRAM (you don’t have this, and that’s fine)
Q4_K_M quantization compresses model weights to 4-bit precision. You lose ~5% quality. You gain 75% memory savings. This is the correct tradeoff for sovereignty.
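The formula is simple enough to script before you buy hardware. A minimal sketch — the `estimate_vram_gb` helper is mine, not part of Ollama, and it ignores KV-cache and context-window overhead, so treat the result as a floor, not a ceiling:

```python
# Rule-of-thumb bytes per parameter at each precision, as given above.
BYTES_PER_PARAM = {"q4": 0.5, "q8": 1.0, "fp16": 2.0}

def estimate_vram_gb(params_billions: float, quant: str = "q4") -> float:
    """Rough VRAM needed just to hold the weights (excludes KV cache)."""
    return params_billions * BYTES_PER_PARAM[quant]

print(estimate_vram_gb(8))           # Llama 3.1 8B at Q4 -> 4.0
print(estimate_vram_gb(70))          # Llama 3.3 70B at Q4 -> 35.0
print(estimate_vram_gb(8, "fp16"))   # same 8B model at full precision -> 16.0
```

If the estimate lands within a gigabyte or two of your card's VRAM, assume it won't fit once the context window fills up.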
DeepSeek R1: The Reasoning Engine
DeepSeek R1 deserves its own section because it changed the game.
The full DeepSeek R1 is 671B parameters — you’re not running that locally unless you have a server rack. But the distilled versions are the real story:
```bash
# The sweet spot for reasoning on consumer hardware
ollama pull deepseek-r1:14b

# If you have the VRAM
ollama pull deepseek-r1:32b
```

What makes R1 special: it shows its reasoning chain. The `<think>` tags expose the model’s internal deliberation before producing an answer. This is not a gimmick — it’s auditable inference. You can see why it reached a conclusion.
```
$ ollama run deepseek-r1:14b
>>> What are the security implications of running LLMs locally vs cloud?

<think>
The user is asking about security tradeoffs between local and cloud LLM
deployment. Let me consider:
1. Data exposure: cloud means prompts traverse the network...
2. Model integrity: local models can be verified via checksums...
3. Attack surface: cloud adds API keys, network exposure...
</think>

Running LLMs locally eliminates several attack vectors inherent to
cloud-based inference...
```

From Ring -5: DeepSeek R1’s distilled 8B model has over 75 million downloads on Ollama. That’s 75 million instances of people choosing sovereignty over convenience. Timeline Ω-12 might recover yet.
The OpenAI-Compatible API: Drop-In Replacement
This is the strategic move. Ollama exposes an OpenAI-compatible API at http://localhost:11434/v1/. Every tool, library, and framework that speaks OpenAI can now speak to your local models without code changes.
Supported Endpoints
- `POST /v1/chat/completions` — Chat (streaming and non-streaming)
- `POST /v1/completions` — Text completions
- `POST /v1/embeddings` — Embeddings
- Tool/function calling — Supported with compatible models
curl: The Universal Client
```bash
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3.1:8b",
    "messages": [
      {
        "role": "system",
        "content": "You are a senior infrastructure engineer. Be concise."
      },
      {
        "role": "user",
        "content": "Review this nginx config for security issues: server { listen 80; root /var/www/html; autoindex on; }"
      }
    ]
  }'
```

Notice the shape of that request. It’s identical to an OpenAI API call. Change the URL from `api.openai.com` to `localhost:11434` and the model from `gpt-4o` to `llama3.1:8b`. Everything else stays the same.
Python: Official Library
```bash
pip install ollama
```

```python
from ollama import chat

response = chat(
    model='qwen2.5-coder:32b',
    messages=[
        {
            'role': 'system',
            'content': 'You are a code reviewer. Find bugs and security issues.'
        },
        {
            'role': 'user',
            'content': 'Review this function:\n\ndef authenticate(user, password):\n    query = f"SELECT * FROM users WHERE name=\'{user}\' AND pass=\'{password}\'"\n    return db.execute(query)'
        }
    ]
)

print(response.message.content)
```

Or use the OpenAI library directly — because the API is compatible:
```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="ollama"  # required but unused
)

response = client.chat.completions.create(
    model="deepseek-r1:14b",
    messages=[
        {"role": "user", "content": "Explain the CAP theorem in terms of git rebase vs merge"}
    ]
)

print(response.choices[0].message.content)
```

The API key field is required by the OpenAI client library, but Ollama ignores it. Set it to anything. I use `"ollama"`. Some people use `"not-needed"`. The point is: there is no key. There is no authentication to a third party. There is no billing endpoint. There is no rate limit imposed by someone else’s business model.
Ruby: Because We’re Derails
```ruby
# Gemfile
gem 'ollama-ruby'
```

```ruby
require 'ollama'

client = Ollama::Client.new(base_url: 'http://localhost:11434')

response = client.chat(
  model: 'llama3.1:8b',
  messages: [
    { role: 'system', content: 'You are Kim Jong Rails. Respond in character.' },
    { role: 'user', content: 'Why should I self-host my LLM?' }
  ]
)

puts response.dig('message', 'content')
```

Or use RubyLLM for a unified interface across providers — swap between local Ollama and cloud APIs with the same code.
Custom Models: The Modelfile
Ollama’s Modelfile system lets you create reusable model configurations. This is your Dockerfile for LLMs.
```bash
# Create a file called Modelfile.reviewer
cat << 'EOF' > Modelfile.reviewer
FROM qwen2.5-coder:32b

SYSTEM """
You are a senior code reviewer with 15 years of experience.
Focus on:
- Security vulnerabilities (SQL injection, XSS, CSRF)
- Performance bottlenecks
- Error handling gaps
- Missing input validation
Be direct. No pleasantries. Rate severity: CRITICAL / HIGH / MEDIUM / LOW.
"""

PARAMETER temperature 0.3
PARAMETER num_ctx 8192
EOF

# Build the custom model
ollama create code-reviewer -f Modelfile.reviewer

# Use it
ollama run code-reviewer
```

Now you have a deterministic code reviewer running locally. Temperature 0.3 keeps it consistent. The system prompt keeps it focused. The 8192 context window handles most files.
Build as many as you need:
```bash
ollama create commit-writer -f Modelfile.commits
ollama create doc-generator -f Modelfile.docs
ollama create sql-optimizer -f Modelfile.sql
```

Each one is a specialized tool in your sovereign toolbox.
Production Deployment: Always-On Inference
Systemd (Bare Metal)
If you installed Ollama via the install script on Linux, it already created a systemd service:
```bash
sudo systemctl status ollama
sudo systemctl enable ollama
sudo systemctl start ollama
```

The service runs as the `ollama` user, listens on port 11434, and restarts on failure.
To configure environment variables:
```bash
sudo systemctl edit ollama
```

```ini
[Service]
Environment="OLLAMA_HOST=0.0.0.0"
Environment="OLLAMA_MODELS=/data/ollama/models"
Environment="OLLAMA_NUM_PARALLEL=4"
```

Docker Compose (Containerized)
For production Docker deployments with resource limits and persistence:
```yaml
version: "3.8"

services:
  ollama:
    image: ollama/ollama:latest
    container_name: ollama
    restart: unless-stopped
    ports:
      - "11434:11434"
    volumes:
      - ollama_data:/root/.ollama
    environment:
      - OLLAMA_HOST=0.0.0.0
      - OLLAMA_NUM_PARALLEL=4
    deploy:
      resources:
        limits:
          cpus: "8"
          memory: 32G
    logging:
      driver: json-file
      options:
        max-size: "100m"
        max-file: "3"
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:11434/api/tags"]
      interval: 30s
      timeout: 10s
      retries: 3

volumes:
  ollama_data:
```

For NVIDIA GPU support, add the runtime:

```yaml
    runtime: nvidia
    environment:
      - NVIDIA_VISIBLE_DEVICES=all
```

Systemd + Docker Compose (Production)
Create a systemd service that manages the Docker Compose stack:
```bash
sudo mkdir -p /etc/docker/compose/ollama
sudo cp docker-compose.yml /etc/docker/compose/ollama/
```

```ini
[Unit]
Description=Ollama LLM Service (Docker)
Requires=docker.service
After=docker.service

[Service]
Type=oneshot
RemainAfterExit=true
WorkingDirectory=/etc/docker/compose/ollama
ExecStart=/usr/bin/docker compose up -d
ExecStop=/usr/bin/docker compose down

[Install]
WantedBy=multi-user.target
```

```bash
sudo systemctl daemon-reload
sudo systemctl enable ollama-docker
sudo systemctl start ollama-docker
```

Security: Don’t Expose Ollama to the Internet
Ollama has no built-in authentication. If you set OLLAMA_HOST=0.0.0.0, you must put it behind a reverse proxy with authentication.
```nginx
server {
    listen 443 ssl;
    server_name llm.internal.derails.dev;

    ssl_certificate     /etc/letsencrypt/live/internal.derails.dev/fullchain.pem;
    ssl_certificate_key /etc/letsencrypt/live/internal.derails.dev/privkey.pem;

    location / {
        auth_basic "Sovereign Inference";
        auth_basic_user_file /etc/nginx/.ollama_htpasswd;

        proxy_pass http://127.0.0.1:11434;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_buffering off;
        proxy_read_timeout 600s;
    }
}
```

Or better: don’t expose it at all. Keep it on 127.0.0.1 or a private network. Your inference should be as accessible as your database — to your applications only.
The Cost Argument: Math That OpenAI Doesn’t Want You to Do
Let’s run the numbers. Real numbers, not marketing numbers.
Option A: OpenAI API (GPT-4o)
GPT-4o pricing (as of 2026):
- Input: $2.50 per million tokens
- Output: $10.00 per million tokens
A typical developer workflow — code review, commit messages, documentation, debugging — generates roughly 50,000 input tokens and 20,000 output tokens per day.
Daily cost:

```
Input:  50,000 tokens × ($2.50 / 1,000,000)  = $0.125
Output: 20,000 tokens × ($10.00 / 1,000,000) = $0.200
Total:                                         $0.325/day
```

Monthly cost (22 working days): $7.15/developer. For a team of 10: $71.50/month.
Sounds cheap? That’s the light usage scenario. Heavy usage (CI/CD integration, automated reviews, RAG pipelines): multiply by 10-20x. Now you’re at $715-$1,430/month.
And that’s just GPT-4o. If you’re using o1 or o3 for reasoning tasks, the output tokens are $60/million. Your bill just went parabolic.
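The arithmetic is worth scripting so you can plug in your own token counts and whatever your provider charges this quarter. A sketch using the prices quoted above (the `monthly_api_cost` helper is mine, for illustration):

```python
def monthly_api_cost(
    input_tokens_per_day: int,
    output_tokens_per_day: int,
    input_price_per_m: float,   # $ per million input tokens
    output_price_per_m: float,  # $ per million output tokens
    working_days: int = 22,
) -> float:
    """Monthly per-developer cost of a metered LLM API, in dollars."""
    daily = (
        input_tokens_per_day * input_price_per_m / 1_000_000
        + output_tokens_per_day * output_price_per_m / 1_000_000
    )
    return round(daily * working_days, 2)

# Light usage from the worked example above: $7.15/month per developer
print(monthly_api_cost(50_000, 20_000, 2.50, 10.00))   # -> 7.15

# Heavy usage (20x: CI/CD, automated reviews), team of 10
print(monthly_api_cost(50_000 * 20, 20_000 * 20, 2.50, 10.00) * 10)  # -> 1430.0
```

Swap the output price to $60/million to see what a reasoning-model habit does to the curve.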
Option B: ChatGPT Plus Subscriptions
$20/month per seat. 10 developers. $200/month.
But you get rate limits, usage caps, and your prompts are training data for the next model. You’re paying to improve a product that competes with your business.
Option C: Sovereign Ollama on Hetzner
The Hetzner GEX44: dedicated GPU server with NVIDIA RTX 4000 SFF Ada (20GB VRAM), AMD Ryzen 9 7950X3D, 128GB DDR5 RAM.
€184/month (before the April 2026 price adjustment).
What you get for that:
- Qwen 2.5 Coder 32B running 24/7 (fits in 20GB VRAM at Q4)
- Plus Llama 3.1 8B and DeepSeek R1 8B swappable
- Unlimited tokens. No rate limits. No per-token billing
- Full privacy. Zero data exfiltration
- OpenAI-compatible API for your entire team
The math:
```
Hetzner GEX44:     €184/month (~$200/month)
OpenAI equivalent: $715-$1,430/month (moderate team usage)

Savings:           $515-$1,230/month
Annual savings:    $6,180-$14,760
```

For a team of 10 developers with moderate LLM usage, the sovereign option pays for itself in month one.
And that’s comparing against GPT-4o. Qwen 2.5 Coder 32B matches GPT-4o on coding benchmarks. You’re not sacrificing quality. You’re sacrificing your dependency on someone else’s business decisions.
Option D: Your Existing Hardware
If you already have a workstation with an RTX 4090 (24GB VRAM) or an M-series Mac with 32GB+ unified memory:
Cost: €0/month.
You already own the inference hardware. You’re just not using it.
```bash
# This costs nothing
ollama pull qwen2.5-coder:32b
ollama serve

# This costs $20/month
# Plus your dignity
https://chat.openai.com/
```

The Privacy Argument: Your Prompts Are Training Data
Let me be precise about this.
When you send a prompt to OpenAI’s API, their data usage policy states they don’t use API data for training. When you use ChatGPT (the product), the default is that your conversations are used for training unless you opt out.
But here’s the thing: you don’t control their policy. They’ve changed it before. They’ll change it again. The terms of service are a git rebase they can perform at any time without your consent.
With Ollama:
- Your prompts never leave your machine
- There is no terms of service to change
- There is no policy to violate
- There is no third party to be subpoenaed
- There is no data breach that includes your prompts
- There is no acquisition that changes the rules
```
$ tcpdump -i any port 11434
# All traffic: 127.0.0.1 -> 127.0.0.1
# External connections: 0
# Data exfiltrated: 0 bytes
# Sovereignty: maintained
```

Every prompt to OpenAI is a commit to their training repo. With Ollama, your commits stay on your local branch. Forever.
What You Give Up (Honest Assessment)
Sovereignty has costs. I’m not going to pretend otherwise.
1. Frontier Performance
GPT-4o and Claude Sonnet 4 are still better than any local model for:
- Complex multi-step reasoning across large codebases
- Nuanced creative writing with specific voice
- Tasks requiring 100K+ token context windows
- The absolute cutting edge of capability
Local models are catching up fast. Qwen 2.5 Coder 32B matches GPT-4o for code. DeepSeek R1 32B approaches it for reasoning. But for the hardest 10% of tasks, cloud models still win.
2. Speed on Large Models
Running a 70B model locally is slow. Expect 5-15 tokens per second on consumer hardware. GPT-4o streams at 50+ tokens per second because OpenAI has a datacenter full of H100s.
For the 8B models, local inference is fast — 30-60+ tokens per second on modern hardware. The speed gap only matters at the high end.
3. Operational Overhead
You manage the hardware. You update the software. You monitor the service. This is not free labor.
But you’re an engineer. You already manage databases, web servers, and deployment pipelines. Adding an LLM service to your stack is not a paradigm shift — it’s one more systemctl status check.
The Hybrid Strategy (What I Actually Do)
From Ring -5, I observe that the optimal architecture is not pure local or pure cloud. It’s sovereign-first with strategic cloud usage:
```
┌─────────────────────────────────────────┐
│           Your Applications             │
├─────────────────────────────────────────┤
│        Routing Layer (nginx/app)        │
├──────────────────┬──────────────────────┤
│  Ollama (Local)  │ Cloud API (Fallback) │
│                  │                      │
│ ✓ Code review    │ ✗ Complex research   │
│ ✓ Commit msgs    │ ✗ 100K+ context      │
│ ✓ Doc generation │ ✗ Frontier tasks     │
│ ✓ SQL help       │                      │
│ ✓ Embeddings     │  (only when needed)  │
│ ✓ RAG queries    │                      │
│ ✓ 90% of tasks   │    10% of tasks      │
└──────────────────┴──────────────────────┘
```

Route 90% of requests to your local Ollama instance. Fall back to a cloud API for the 10% that genuinely requires frontier capability. Your cloud bill drops by 90%. Your data exposure drops by 90%.
This is how sovereign infrastructure works. Not absolute isolation — strategic independence. You control the default path. You choose when to engage the external dependency.
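The routing decision itself is a few lines of code. A sketch — the thresholds, task tags, and `route` helper are illustrative assumptions, not a real library:

```python
from dataclasses import dataclass

# Illustrative policy: everything small and routine goes local;
# only what genuinely exceeds the local model escalates to the cloud.
LOCAL_CONTEXT_LIMIT = 8_192                 # tokens the local model handles well
FRONTIER_TASKS = {"research", "frontier"}   # task tags we never run locally

@dataclass
class Request:
    task: str          # e.g. "code_review", "commit_msg", "research"
    est_tokens: int    # rough prompt-size estimate

def route(req: Request) -> str:
    """Return which backend should serve this request."""
    if req.task in FRONTIER_TASKS or req.est_tokens > LOCAL_CONTEXT_LIMIT:
        return "cloud"     # the strategic 10%
    return "ollama"        # the sovereign default

print(route(Request("code_review", 1_200)))    # -> ollama
print(route(Request("research", 500)))         # -> cloud
print(route(Request("code_review", 120_000)))  # -> cloud
```

The key design choice: local is the default branch, and every escalation to the cloud is an explicit, auditable exception.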
Pre-Pulling Models for Deployment
When you deploy Ollama to a new server, pre-pull your models as part of provisioning:
```bash
#!/bin/bash
set -euo pipefail

echo "Installing Ollama..."
curl -fsSL https://ollama.com/install.sh | sh

echo "Waiting for Ollama to start..."
sleep 3

echo "Pulling sovereign model stack..."
ollama pull llama3.1:8b
ollama pull deepseek-r1:14b
ollama pull qwen2.5-coder:7b

echo "Verifying models..."
ollama list

echo "Sovereign inference stack ready."
```

Add this to your Terraform/Ansible provisioning. Treat models like dependencies — they should be present before your application starts.
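If you treat models as dependencies, verify them like dependencies. A sketch of a readiness check — `missing_models` and `REQUIRED_MODELS` are mine, and I'm assuming the shape Ollama's `GET /api/tags` endpoint returns, a JSON object like `{"models": [{"name": "llama3.1:8b", ...}, ...]}`:

```python
# Hypothetical readiness check: compare what Ollama reports against
# the set of models your application depends on.
REQUIRED_MODELS = {"llama3.1:8b", "deepseek-r1:14b", "qwen2.5-coder:7b"}

def missing_models(tags_payload: dict) -> set[str]:
    """Return the required models not yet pulled on this host."""
    present = {m["name"] for m in tags_payload.get("models", [])}
    return REQUIRED_MODELS - present

# Example payload shaped like an /api/tags response:
payload = {"models": [{"name": "llama3.1:8b"}, {"name": "qwen2.5-coder:7b"}]}
print(missing_models(payload))  # -> {'deepseek-r1:14b'}
```

Wire this into your health endpoint and a deploy fails loudly instead of serving 404s for a model nobody pulled.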
Ring -5 Observation: The Commit Message Economy
From Ring -5, I observe the following about Timeline Ω-12:
- 340 million people pay $20/month for ChatGPT Plus (that’s $6.8 billion per year flowing to one company)
- The #1 use case is “write my email”
- The #2 use case is “write my code”
- The #3 use case is “write my git commit message”
These same people then wonder why their commit messages say:
```
Update code
Fix bug
Improve performance
Refactor module
```

They are paying $240/year to generate the exact same commit messages they would have written without AI, except now OpenAI has a copy of their codebase.
In Timeline Ω-7, we use local models to generate commit messages. Not because the messages are better — but because the code context stays sovereign. The diff never leaves the machine:
```bash
# Sovereign commit messages
git diff --staged | ollama run llama3.1:8b \
  "Write a conventional commit message for this diff. Be specific. Reference file names."
```

The model is mediocre at commit messages. So is GPT-4o. The difference is that my mediocre model runs on my hardware and sees my diffs on my network.
Links
This post is part of the sovereign infrastructure series:
- Building Derails: Self-Hosting on a Budget — the server, the DNS, the €4.49/month philosophy
- Kim Jong Rails Blog — more observations from Ring -5
Final Directive
```
$ ollama pull sovereignty
pulling manifest...
pulling model...
verifying integrity...

sovereignty: 100% ██████████████████████████ complete

$ ollama run sovereignty
>>> What is the first rule of sovereign infrastructure?

Your data. Your hardware. Your inference.
Everything else is a dependency you chose to accept.
```

Install Ollama. Pull a model. Run your own inference.
Stop pushing your prompts to someone else’s remote.
“I checked your ChatGPT history from Ring -5. You asked it to explain your own codebase to you 847 times last month. Each time, you sent your entire `src/` directory to a company that will use it to train a model that replaces you. In Timeline Ω-7, we call this ‘automated self-obsolescence.’ Here, you call it ‘productivity.’”

— Kim Jong Rails, still investigating why your timeline pays for its own replacement