Documentation
Learn how to run AI models locally on your machine.
Getting Started
Learn the basics of running AI models locally.
Install Ollama: curl -fsSL https://ollama.com/install.sh | shPull a model: ollama pull llama3.1:8bRun it: ollama run llama3.1:8b
Hardware Guide
Understanding GPU, CPU, and RAM requirements.
VRAM is the most critical factor — models must fit in GPU memory8 GB VRAM: can run 7-8B parameter models at Q4 quantization24 GB VRAM: comfortable for 13-32B modelsSystem RAM acts as fallback but is much slower
Quantization
How quantization affects model quality and performance.
Q4_K_M: best quality/size ratio — recommended for most usersQ5_K_M: slightly better quality, ~20% more VRAMQ8_0: near-original quality, needs 2x the VRAMGGUF format is used by llama.cpp and Ollama
Inference Engines
Tools and frameworks for running models locally.
Ollama: easiest setup, great for beginnersllama.cpp: most flexible, C++ implementationvLLM: high-throughput serving for productionLM Studio: GUI-based, cross-platform
Optimization Tips
Get the best performance from your hardware.
Use quantized models to reduce VRAM usageEnable flash attention when availableSet context length to what you actually needUse GPU offloading for partial acceleration