v2.0.2 · March 2026

Personal LLM

100% Offline Private AI Assistant — Your Hardware, Your Data, Your Rules

28 AI Models
$0 Cloud Cost
3 Platforms
100% Private
Core Philosophy
🚫

Zero Dependencies

No Ollama, no Docker, no cloud APIs. Python loads model files directly.
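Because GGUF models are ordinary binary files on disk, you can sanity-check one with nothing but the standard library, and loading it for inference is a single call. A minimal sketch (the model filename is a placeholder, not an actual file shipped with the app):

```python
GGUF_MAGIC = b"GGUF"  # every GGUF file begins with these four ASCII bytes

def is_gguf(path: str) -> bool:
    """Cheap sanity check: a GGUF model is a plain binary file on disk,
    identified by a 4-byte magic header. No DRM, no license server."""
    with open(path, "rb") as f:
        return f.read(4) == GGUF_MAGIC

# Loading it for inference is one call via llama-cpp-python:
# from llama_cpp import Llama
# llm = Llama(model_path="models/phi-4-mini.Q4_K_M.gguf", n_ctx=4096)
# print(llm("Hello!", max_tokens=32)["choices"][0]["text"])
```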

🔒

100% Private

All data stays on your disk. Nothing leaves your machine — ever.

✈️

Offline First

After one-time model download, zero network calls. Works in airplane mode.

💻

Your Hardware

Uses YOUR CPU, YOUR GPU, YOUR RAM. You own the entire stack.

System Architecture
Full-Stack Monorepo Architecture
🖥️ Electron Desktop · React UI + native window wrapping the FastAPI backend · UI Layer
📱 Mobile App (Expo) · React Native Android app connected via LAN to your PC · Mobile
⚡ FastAPI Backend · REST API: chat, models, RAG, cloud proxy on port 8000 · API
🧠 LLM Engine · llama-cpp-python wrapping the llama.cpp C++ inference engine · Core
📄 GGUF Model Files · Binary model files on your disk, no DRM, no license servers · Storage
Capabilities
💬

Multi-Turn Chat

Conversations like ChatGPT with persistent JSON history. Dynamic context pruning keeps memory efficient.

chat_engine.py · 358 lines
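The idea behind persistent JSON history with context pruning can be sketched in a few lines. This is an illustrative toy, not the actual chat_engine.py internals: the pruning heuristic here counts words, while the real engine may use the model tokenizer.

```python
import json
from pathlib import Path

def load_history(path: Path) -> list:
    """History is a plain JSON file on disk: readable, portable, private."""
    return json.loads(path.read_text()) if path.exists() else []

def save_turn(path: Path, history: list, role: str, content: str) -> list:
    """Append one message and persist the whole conversation."""
    history = history + [{"role": role, "content": content}]
    path.write_text(json.dumps(history, indent=2))
    return history

def prune(history: list, budget: int = 3000) -> list:
    """Drop the oldest turns until a rough token count fits the context
    window (word count stands in for tokens in this sketch)."""
    cost = lambda msgs: sum(len(m["content"].split()) for m in msgs)
    pruned = list(history)
    while len(pruned) > 1 and cost(pruned) > budget:
        pruned.pop(0)  # oldest turn goes first
    return pruned
```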
🧠

Context Intelligence

4-layer pipeline: RAG retrieval → recursive decomposition → self-refinement → adaptive chain-of-thought prompting.

context_engine.py · 498 lines
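The four layers compose naturally as a recursive function. The control flow below is an assumption for illustration, not the actual context_engine.py implementation; the layer names come from the feature list above, and `llm` / `retrieve` are stand-in callables.

```python
def answer(query: str, llm, retrieve, depth: int = 2) -> str:
    """Toy sketch of the 4-layer pipeline:
    RAG retrieval -> recursive decomposition -> self-refinement -> CoT."""
    if depth > 0 and " and " in query:            # 2. recursive decomposition
        return "\n".join(answer(p, llm, retrieve, depth - 1)
                         for p in query.split(" and "))
    context = retrieve(query)                     # 1. RAG retrieval
    draft = llm(f"Context:\n{context}\n"
                f"Think step by step.\nQ: {query}\nA:")  # 4. chain-of-thought
    return llm(f"Refine this answer:\n{draft}")   # 3. self-refinement pass
```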
📄

Document Q&A (RAG)

Upload PDFs/text, chunk & embed via sentence-transformers, query with ChromaDB cosine similarity.

knowledge_base.py · 260 lines
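The retrieval math is simple: documents are split into overlapping chunks, each chunk gets an embedding vector, and queries are ranked by cosine similarity. A dependency-free sketch (window sizes are illustrative; knowledge_base.py may use different ones):

```python
import math

def chunk(text: str, size: int = 200, overlap: int = 50) -> list:
    """Split a document into overlapping word windows before embedding."""
    words = text.split()
    step = size - overlap
    return [" ".join(words[i:i + size])
            for i in range(0, max(len(words) - overlap, 1), step)]

def cosine(a, b) -> float:
    """Cosine similarity, the metric ChromaDB uses to rank chunks."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) *
                  math.sqrt(sum(x * x for x in b)))

# Embedding + storage in the real pipeline (sketch, not verified here):
# from sentence_transformers import SentenceTransformer
# vectors = SentenceTransformer("all-MiniLM-L6-v2").encode(chunk(doc_text))
```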
💻

Code Generation

Write, debug, explain code with CodeLlama, Qwen, StarCoder2, OpenCoder and DeepSeek Coder models.

5 code-specialized models
🌐

Cloud Proxy

Mix local & cloud AI. Configure OpenAI, Groq, or Together API keys. Keys stored locally, never leave.

api.py · /chat/cloud endpoint
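"Keys stored locally" can be as simple as a JSON file next to the app. A sketch under that assumption (the file location and helper names are hypothetical, not the actual api.py layout):

```python
import json
from pathlib import Path

KEYS_FILE = Path("settings/api_keys.json")  # hypothetical location

def save_key(provider: str, key: str) -> None:
    """Keys live in a local JSON file; they are only ever sent to the
    provider you explicitly call through /chat/cloud, nowhere else."""
    keys = json.loads(KEYS_FILE.read_text()) if KEYS_FILE.exists() else {}
    keys[provider] = key
    KEYS_FILE.parent.mkdir(parents=True, exist_ok=True)
    KEYS_FILE.write_text(json.dumps(keys))

def get_key(provider: str):
    """Return the stored key for a provider, or None if unset."""
    keys = json.loads(KEYS_FILE.read_text()) if KEYS_FILE.exists() else {}
    return keys.get(provider)
```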
🔄

Hot Model Swap

Switch between 28 models on-the-fly. Aggressive GC + EmptyWorkingSet clears RAM/VRAM between loads.

llm_engine.py · 375 lines
28 Model Catalog
Model · Parameters · RAM · Best For · License
Phi-4 Mini · 3.8B · ~3.5 GB · General chat, reasoning · Restricted
DeepSeek-R1 (Qwen) · 7B · ~6.5 GB · Step-by-step reasoning · Open Weights
Llama 3.2 · 1B / 3B · ~2-3 GB · Fast, lightweight tasks · Restricted
Qwen 3 · 1.7B / 4B · ~2-4 GB · Multilingual, coding · Open Weights
CodeLlama · 7B · ~5.5 GB · Code generation · Restricted
Mistral 7B · 7B · ~6 GB · General purpose · Open Weights
OLMo 3 · 1B / 7B · ~1-6 GB · Research, open science · Fully Open
Falcon 3 · 1B / 7B · ~1-6 GB · General, multilingual · Open Weights
StarCoder2 · 3B · ~3 GB · Code completion · Open Weights
Gemma 2 / 3 · 2B-4B · ~2-4 GB · Instruction following · Restricted

+ 18 more models, including Pythia, GPT-NeoX, Cerebras-GPT, RWKV, MPT, and YaLM — all Q4_K_M quantized

Multi-Platform
🖥️

Desktop App

Electron + React with bundled Python 3.11. One-click .exe installer via NSIS. Native window, splash screen.

Windows · Electron Builder
📱

Android App

React Native Expo app connecting over LAN. Model switching, persistent settings. EAS cloud builds to .apk.

Android · Expo + EAS
🌐

Web + Launch Site

Gradio legacy UI on port 7865. Next.js 16 marketing site with dark glassmorphism theme and Framer Motion.

Browser · Next.js 16
Tech Stack
Core
Python 3.11 · llama-cpp-python · llama.cpp (C++)
API
FastAPI · Uvicorn · Pydantic
Desktop
Electron · React · Next.js
Mobile
React Native · Expo · EAS Build
RAG
ChromaDB · sentence-transformers · PyPDF2
AI
CUDA · Metal · AVX2 SIMD
Build
PyInstaller · Inno Setup · Electron Builder
Website
Next.js 16 · Tailwind CSS · Framer Motion
Models
GGUF format · Q4_K_M quant · HuggingFace Hub
Privacy & Security
🛡️

Everything Stays on Your Machine

LLM inference, chat history, RAG documents, vector database — all stored locally. Zero telemetry, zero analytics, zero cloud. The only network call is the one-time model download from HuggingFace. Cloud proxy is entirely opt-in.

🧠 LLM Inference · Offline
💬 Chat & History · Offline
📄 RAG Documents · Offline
🌐 Gradio Web UI · Localhost
⚡ FastAPI Backend · LAN Only
📥 Model Download · One-time
🌐 Cloud Proxy · Opt-in
📊 Telemetry · None · Ever