v2.0.2 · March 2026

Personal LLM

100% Offline Private AI Assistant — Your Hardware, Your Data, Your Rules

28 AI Models
$0 Cloud Cost
3 Platforms
100% Private
Core Philosophy
🚫

Zero Dependencies

No Ollama, no Docker, no cloud APIs. Python loads model files directly.
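Because GGUF models are ordinary binary files on disk, you can sanity-check one with nothing but the standard library, and loading it for inference is a single call. A minimal sketch (the model filename is a placeholder, not an actual file shipped with the app):

```python
GGUF_MAGIC = b"GGUF"  # every GGUF file begins with these four ASCII bytes

def is_gguf(path: str) -> bool:
    """Cheap sanity check: a GGUF model is a plain binary file on disk,
    identified by a 4-byte magic header. No DRM, no license server."""
    with open(path, "rb") as f:
        return f.read(4) == GGUF_MAGIC

# Loading it for inference is one call via llama-cpp-python:
# from llama_cpp import Llama
# llm = Llama(model_path="models/phi-4-mini.Q4_K_M.gguf", n_ctx=4096)
# print(llm("Hello!", max_tokens=32)["choices"][0]["text"])
```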

🔒

100% Private

All data stays on your disk. Nothing leaves your machine — ever.

✈️

Offline First

After one-time model download, zero network calls. Works in airplane mode.

💻

Your Hardware

Uses YOUR CPU, YOUR GPU, YOUR RAM. You own the entire stack.

System Architecture
Full-Stack Monorepo Architecture
🖥️ Electron Desktop · React UI + native window wrapping the FastAPI backend · UI Layer
📱 Mobile App (Expo) · React Native Android app connected via LAN to your PC · Mobile
⚡ FastAPI Backend · REST API: chat, models, RAG, cloud proxy on port 8000 · API
🧠 LLM Engine · llama-cpp-python wrapping the llama.cpp C++ inference engine · Core
📄 GGUF Model Files · Binary model files on your disk, no DRM, no license servers · Storage
Capabilities
💬

Multi-Turn Chat

Conversations like ChatGPT with persistent JSON history. Dynamic context pruning keeps memory efficient.

chat_engine.py · 358 lines
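The idea behind persistent JSON history with context pruning can be sketched in a few lines. This is an illustrative toy, not the actual chat_engine.py internals: the pruning heuristic here counts words, while the real engine may use the model tokenizer.

```python
import json
from pathlib import Path

def load_history(path: Path) -> list:
    """History is a plain JSON file on disk: readable, portable, private."""
    return json.loads(path.read_text()) if path.exists() else []

def save_turn(path: Path, history: list, role: str, content: str) -> list:
    """Append one message and persist the whole conversation."""
    history = history + [{"role": role, "content": content}]
    path.write_text(json.dumps(history, indent=2))
    return history

def prune(history: list, budget: int = 3000) -> list:
    """Drop the oldest turns until a rough token count fits the context
    window (word count stands in for tokens in this sketch)."""
    cost = lambda msgs: sum(len(m["content"].split()) for m in msgs)
    pruned = list(history)
    while len(pruned) > 1 and cost(pruned) > budget:
        pruned.pop(0)  # oldest turn goes first
    return pruned
```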
🧠

Context Intelligence

4-layer pipeline: RAG retrieval → recursive decomposition → self-refinement → adaptive chain-of-thought prompting.

context_engine.py · 498 lines
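The four layers compose naturally as a recursive function. The control flow below is an assumption for illustration, not the actual context_engine.py implementation; the layer names come from the feature list above, and `llm` / `retrieve` are stand-in callables.

```python
def answer(query: str, llm, retrieve, depth: int = 2) -> str:
    """Toy sketch of the 4-layer pipeline:
    RAG retrieval -> recursive decomposition -> self-refinement -> CoT."""
    if depth > 0 and " and " in query:            # 2. recursive decomposition
        return "\n".join(answer(p, llm, retrieve, depth - 1)
                         for p in query.split(" and "))
    context = retrieve(query)                     # 1. RAG retrieval
    draft = llm(f"Context:\n{context}\n"
                f"Think step by step.\nQ: {query}\nA:")  # 4. chain-of-thought
    return llm(f"Refine this answer:\n{draft}")   # 3. self-refinement pass
```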
📄

Document Q&A (RAG)

Upload PDFs/text, chunk & embed via sentence-transformers, query with ChromaDB cosine similarity.

knowledge_base.py · 260 lines
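The retrieval math is simple: documents are split into overlapping chunks, each chunk gets an embedding vector, and queries are ranked by cosine similarity. A dependency-free sketch (window sizes are illustrative; knowledge_base.py may use different ones):

```python
import math

def chunk(text: str, size: int = 200, overlap: int = 50) -> list:
    """Split a document into overlapping word windows before embedding."""
    words = text.split()
    step = size - overlap
    return [" ".join(words[i:i + size])
            for i in range(0, max(len(words) - overlap, 1), step)]

def cosine(a, b) -> float:
    """Cosine similarity, the metric ChromaDB uses to rank chunks."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) *
                  math.sqrt(sum(x * x for x in b)))

# Embedding + storage in the real pipeline (sketch, not verified here):
# from sentence_transformers import SentenceTransformer
# vectors = SentenceTransformer("all-MiniLM-L6-v2").encode(chunk(doc_text))
```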
💻

Code Generation

Write, debug, explain code with CodeLlama, Qwen, StarCoder2, OpenCoder and DeepSeek Coder models.

5 code-specialized models
🌐

Cloud Proxy

Mix local & cloud AI. Configure OpenAI, Groq, or Together API keys. Keys stored locally, never leave.

api.py · /chat/cloud endpoint
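"Keys stored locally" can be as simple as a JSON file next to the app. A sketch under that assumption (the file location and helper names are hypothetical, not the actual api.py layout):

```python
import json
from pathlib import Path

KEYS_FILE = Path("settings/api_keys.json")  # hypothetical location

def save_key(provider: str, key: str) -> None:
    """Keys live in a local JSON file; they are only ever sent to the
    provider you explicitly call through /chat/cloud, nowhere else."""
    keys = json.loads(KEYS_FILE.read_text()) if KEYS_FILE.exists() else {}
    keys[provider] = key
    KEYS_FILE.parent.mkdir(parents=True, exist_ok=True)
    KEYS_FILE.write_text(json.dumps(keys))

def get_key(provider: str):
    """Return the stored key for a provider, or None if unset."""
    keys = json.loads(KEYS_FILE.read_text()) if KEYS_FILE.exists() else {}
    return keys.get(provider)
```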
🔄

Hot Model Swap

Switch between 28 models on-the-fly. Aggressive GC + EmptyWorkingSet clears RAM/VRAM between loads.

llm_engine.py · 375 lines
28 Model Catalog
Model · Parameters · RAM · Best For · License
Phi-4 Mini · 3.8B · ~3.5 GB · General chat, reasoning · Restricted
DeepSeek-R1 (Qwen) · 7B · ~6.5 GB · Step-by-step reasoning · Open Weights
Llama 3.2 · 1B / 3B · ~2-3 GB · Fast, lightweight tasks · Restricted
Qwen 3 · 1.7B / 4B · ~2-4 GB · Multilingual, coding · Open Weights
CodeLlama · 7B · ~5.5 GB · Code generation · Restricted
Mistral 7B · 7B · ~6 GB · General purpose · Open Weights
OLMo 3 · 1B / 7B · ~1-6 GB · Research, open science · Fully Open
Falcon 3 · 1B / 7B · ~1-6 GB · General, multilingual · Open Weights
StarCoder2 · 3B · ~3 GB · Code completion · Open Weights
Gemma 2 / 3 · 2B-4B · ~2-4 GB · Instruction following · Restricted

+ 18 more models, including Pythia, GPT-NeoX, Cerebras-GPT, RWKV, MPT, and YaLM — all Q4_K_M quantized

Multi-Platform
🖥️

Desktop App

Electron + React with bundled Python 3.11. One-click .exe installer via NSIS. Native window, splash screen.

Windows · Electron Builder
📱

Android App

React Native Expo app connecting over LAN. Model switching, persistent settings. EAS cloud builds to .apk.

Android · Expo + EAS
🌐

Web + Launch Site

Gradio legacy UI on port 7865. Next.js 16 marketing site with dark glassmorphism theme and Framer Motion.

Browser · Next.js 16
Tech Stack
Core
Python 3.11 · llama-cpp-python · llama.cpp (C++)
API
FastAPI · Uvicorn · Pydantic
Desktop
Electron · React · Next.js
Mobile
React Native · Expo · EAS Build
RAG
ChromaDB · sentence-transformers · PyPDF2
AI
CUDA · Metal · AVX2 SIMD
Build
PyInstaller · Inno Setup · Electron Builder
Website
Next.js 16 · Tailwind CSS · Framer Motion
Models
GGUF format · Q4_K_M quant · HuggingFace Hub
Privacy & Security
🛡️

Everything Stays on Your Machine

LLM inference, chat history, RAG documents, vector database — all stored locally. Zero telemetry, zero analytics, zero cloud. The only network call is the one-time model download from HuggingFace. Cloud proxy is entirely opt-in.

🧠 LLM Inference · Offline
💬 Chat & History · Offline
📄 RAG Documents · Offline
🌐 Gradio Web UI · Localhost
⚡ FastAPI Backend · LAN Only
📥 Model Download · One-time
🌐 Cloud Proxy · Opt-in
📊 Telemetry · None · Ever