A 70B parameter model needs ~140GB in fp16. With 4-bit quantization, that drops to ~35GB. With aggressive 2-bit schemes, under 20GB. The question isn't whether to quantize — it's how far you can push before quality collapses.
Quantization reduces the precision of model weights from floating point (16 or 32 bits) to lower bit-widths: 8, 4, 3, or even 2 bits per weight. The methods differ in how they map weights to those levels, but the memory win is the same: weight storage shrinks roughly in proportion to the bits used.
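The arithmetic behind the numbers in the opener is just parameters × bits / 8 bytes. A quick sketch (the function name is illustrative; real deployments also need memory for per-group scales, the KV cache, and activations):

```python
# Weight-only memory estimate: parameters * bits / 8 bytes (decimal GB).
def weight_memory_gb(n_params: float, bits: float) -> float:
    return n_params * bits / 8 / 1e9

for bits in (16, 8, 4, 2):
    print(f"{bits:>2}-bit: {weight_memory_gb(70e9, bits):.1f} GB")
# 16-bit: 140.0 GB, 8-bit: 70.0 GB, 4-bit: 35.0 GB, 2-bit: 17.5 GB
```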
Quantization quality doesn't degrade linearly. There's a cliff:
Bit-width | Perplexity delta vs. fp16 | Practical impact
----------+---------------------------+----------------------------
fp16      | baseline                  | Full quality
8-bit     | +0.01-0.05                | Negligible loss
4-bit     | +0.1-0.3                  | Minor quality loss
3-bit     | +0.5-2.0                  | Noticeable on complex tasks
2-bit     | +3.0-10.0                 | Significant degradation
1.5-bit   | +15-50+                   | Barely functional
The cliff typically hits between 3 and 2 bits. Above 3 bits, you can quantize almost any model with minimal impact; below 3 bits, you need specialized techniques.
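To see the mechanism, here's a minimal round-to-nearest sketch over a toy weight group. It isn't any production quantizer (the K-quants used later are considerably smarter), and the function name is illustrative, but halving the number of representable levels at each step is exactly what drives the error growth behind the cliff:

```python
import numpy as np

# Round-to-nearest (RTN) quantization of one weight group at a given bit-width.
def rtn_quantize(w: np.ndarray, bits: int) -> np.ndarray:
    levels = 2 ** bits - 1
    lo, hi = float(w.min()), float(w.max())
    scale = (hi - lo) / levels
    codes = np.round((w - lo) / scale)   # integer codes in [0, levels]
    return codes * scale + lo            # dequantized approximation

rng = np.random.default_rng(0)
w = rng.normal(0.0, 0.02, size=4096).astype(np.float32)  # toy weight group

for bits in (8, 4, 3, 2):
    err = float(np.abs(w - rtn_quantize(w, bits)).mean())
    print(f"{bits}-bit mean abs error: {err:.6f}")
```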
TurboQuant uses a learned codebook approach: instead of mapping each weight to a fixed set of quantization levels, it learns an optimal set of representative values for each layer. This allows effective 2.5-bit quantization with quality comparable to standard 3-bit methods.
The key innovation is group-wise codebook learning: weights are grouped (typically 128 per group), and each group gets its own codebook optimized for that group's distribution. This captures the fact that different layers and even different regions within layers have very different weight distributions.
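A minimal sketch of that group-wise codebook idea, using plain k-means-style (Lloyd) iterations per 128-weight group. This is illustrative only, not TurboQuant's actual training procedure; the fractional 2.5-bit budget is simplified to rounding the codebook size, whereas real fractional-bit schemes rely on packing several codes together:

```python
import numpy as np

def fit_codebook(group: np.ndarray, bits: float = 2.5, iters: int = 20) -> np.ndarray:
    """Fit a small codebook to one weight group with Lloyd iterations."""
    k = int(round(2 ** bits))                             # ~6 entries at "2.5 bits"
    codebook = np.quantile(group, np.linspace(0, 1, k))   # spread initial centroids
    for _ in range(iters):
        assign = np.abs(group[:, None] - codebook[None, :]).argmin(axis=1)
        for j in range(k):
            if np.any(assign == j):
                codebook[j] = group[assign == j].mean()
    return codebook

def quantize_groupwise(weights: np.ndarray, group_size: int = 128) -> np.ndarray:
    """Quantize a flat weight array, one codebook per group."""
    out = np.empty_like(weights)
    for start in range(0, weights.size, group_size):
        g = weights[start:start + group_size]
        cb = fit_codebook(g)
        out[start:start + group_size] = cb[np.abs(g[:, None] - cb[None, :]).argmin(axis=1)]
    return out

rng = np.random.default_rng(1)
w = rng.normal(0.0, 0.02, size=1024)
print("mean abs error:", np.abs(w - quantize_groupwise(w)).mean())
```

Because each group's centroids follow that group's own value distribution, outlier-heavy groups get centroids where their mass actually is, which is the point of learning codebooks per group rather than per layer.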
# Using llama.cpp with Q4_K_M quantization
# Download GGUF quantized model
wget https://huggingface.co/TheBloke/Llama-2-70B-GGUF/resolve/main/llama-2-70b.Q4_K_M.gguf
# Run with GPU offloading (24GB VRAM = ~40 layers on GPU)
./llama-server -m llama-2-70b.Q4_K_M.gguf \
  -ngl 40 \
  -c 4096 \
  --host 0.0.0.0 --port 8080
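Once the server is running, you can query it over HTTP. A minimal Python client sketch, assuming a recent llama.cpp build that exposes the OpenAI-compatible /v1/chat/completions route (older builds serve a /completion endpoint instead; adjust accordingly):

```python
import requests

resp = requests.post(
    "http://localhost:8080/v1/chat/completions",
    json={
        "messages": [
            {"role": "user", "content": "In one sentence, what does 4-bit quantization trade away?"}
        ],
        "max_tokens": 128,
        "temperature": 0.7,
    },
    timeout=300,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```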