🚀

Getting Started

Learn the basics of running AI models locally.

  • Install Ollama: curl -fsSL https://ollama.com/install.sh | sh
  • Pull a model: ollama pull llama3.1:8b
  • Run it: ollama run llama3.1:8b
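
Once a model is pulled, it can also be queried from code. The sketch below assumes the Ollama server is running locally on its default port (11434) and uses its /api/generate endpoint with streaming turned off. The helper names build_request and generate are illustrative, not part of Ollama.

```python
import json
import urllib.request

# Assumption: Ollama is running locally on its default port, 11434.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_request(prompt: str, model: str = "llama3.1:8b") -> urllib.request.Request:
    """Build a non-streaming POST request for Ollama's /api/generate endpoint."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt: str, model: str = "llama3.1:8b") -> str:
    """Send the prompt and return the generated text (needs a running server)."""
    with urllib.request.urlopen(build_request(prompt, model)) as resp:
        # With "stream": False the server replies with a single JSON object
        # whose "response" field holds the full completion.
        return json.loads(resp.read())["response"]
```

With the server running, print(generate("Explain quantization in one sentence.")) should print the model's reply.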
🖥️

Hardware Guide

Understanding GPU, CPU, and RAM requirements.

  • VRAM is the most critical factor — models must fit in GPU memory
  • 8 GB VRAM: can run 7-8B parameter models at Q4 quantization
  • 24 GB VRAM: comfortable for 13-32B models at Q4 quantization
  • System RAM acts as a fallback when a model does not fit in VRAM, but inference is much slower
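
The VRAM figures above can be sanity-checked with rough arithmetic: weight size is about parameters times bits per weight, divided by 8. The sketch below is a back-of-the-envelope estimate, not a measurement. It assumes about 4.85 bits per weight for Q4 and a flat 1.5 GB allowance for the KV cache and runtime buffers. Both numbers are assumptions, and the function names are illustrative.

```python
def weights_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the model weights in GB (1 GB = 1e9 bytes)."""
    # params_billion * 1e9 weights * bits / 8 bytes, divided by 1e9 bytes per GB
    return params_billion * bits_per_weight / 8


def fits_in_vram(
    params_billion: float,
    bits_per_weight: float,
    vram_gb: float,
    overhead_gb: float = 1.5,  # assumed allowance for KV cache and buffers
) -> bool:
    """Return True if the weights plus the assumed overhead fit in VRAM."""
    return weights_gb(params_billion, bits_per_weight) + overhead_gb <= vram_gb


Q4_BITS = 4.85  # assumed average bits per weight for a Q4 quantization

for params, vram in [(8, 8), (13, 8), (32, 24), (70, 24)]:
    verdict = "fits" if fits_in_vram(params, Q4_BITS, vram) else "does not fit"
    print(f"{params}B at Q4 on {vram} GB VRAM: {verdict}")
```

Under these assumptions an 8B model fits in 8 GB and a 32B model fits in 24 GB, while 13B on 8 GB and 70B on 24 GB do not.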
📦

Quantization

How quantization affects model quality and performance.

  • Q4_K_M: best quality/size ratio — recommended for most users
  • Q5_K_M: slightly better quality, ~20% more VRAM
  • Q8_0: near-original quality, needs roughly 1.75x the VRAM of Q4_K_M
  • GGUF format is used by llama.cpp and Ollama
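
To see how the quantization levels compare in size, the sketch below works from approximate bits per weight. Q8_0 stores 8.5 bits per weight. The Q4_K_M and Q5_K_M figures are mixed-precision averages that vary slightly by model, so treat them as assumptions. The names BITS_PER_WEIGHT, size_relative_to, and compare are illustrative.

```python
# Approximate bits per weight. Q4_K_M and Q5_K_M mix precisions across
# tensors, so these two values are assumed averages, not exact constants.
BITS_PER_WEIGHT = {"Q4_K_M": 4.85, "Q5_K_M": 5.69, "Q8_0": 8.5, "F16": 16.0}


def size_relative_to(quant: str, baseline: str = "Q4_K_M") -> float:
    """Size of `quant` as a multiple of `baseline` for the same model."""
    return BITS_PER_WEIGHT[quant] / BITS_PER_WEIGHT[baseline]


def compare(params_billion: float) -> list[tuple[str, float, float]]:
    """Return (name, approximate weight size in GB, ratio vs Q4_K_M) per format."""
    return [
        (name, params_billion * bits / 8, size_relative_to(name))
        for name, bits in BITS_PER_WEIGHT.items()
    ]


for name, gb, ratio in compare(8):
    print(f"{name:7s} {gb:5.1f} GB  {ratio:.2f}x")
```

With these figures, Q5_K_M comes out a little under 20% larger than Q4_K_M, and Q8_0 about 1.75 times larger.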
⚙️

Inference Engines

Tools and frameworks for running models locally.

  • Ollama: easiest setup, great for beginners
  • llama.cpp: most flexible, C++ implementation
  • vLLM: high-throughput serving for production
  • LM Studio: GUI-based, cross-platform

Optimization Tips

Get the best performance from your hardware.

  • Use quantized models to reduce VRAM usage
  • Enable flash attention when available
  • Set context length to what you actually need, since the KV cache grows with every token of context
  • Use partial GPU offloading when a model does not fully fit in VRAM: run as many layers as fit on the GPU and the rest on the CPU
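
Context length matters because the key/value cache grows linearly with it. The sketch below estimates cache size using the published Llama 3.1 8B shape (32 layers, 8 KV heads, head dimension 128). It assumes an unquantized 16-bit cache, and the function name kv_cache_bytes is illustrative.

```python
def kv_cache_bytes(
    context_len: int,
    n_layers: int = 32,        # Llama 3.1 8B
    n_kv_heads: int = 8,       # grouped-query attention
    head_dim: int = 128,
    bytes_per_value: int = 2,  # assumption: 16-bit cache, no cache quantization
) -> int:
    """Approximate KV cache size in bytes for one sequence."""
    # The leading 2 counts one key tensor and one value tensor per layer.
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_value * context_len


for ctx in (2048, 8192, 32768, 131072):
    print(f"{ctx:>7} tokens: {kv_cache_bytes(ctx) / 2**30:.2f} GiB")
# prints:
#    2048 tokens: 0.25 GiB
#    8192 tokens: 1.00 GiB
#   32768 tokens: 4.00 GiB
#  131072 tokens: 16.00 GiB
```

Under these assumptions, an 8K context costs about 1 GiB on top of the weights, while the full 128K context would need about 16 GiB for the cache alone.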