How to Run Local LLMs with Llama.cpp: Complete Guide

pookie · Beginner · 🧠 Large Language Models · 6mo ago
In this guide, you'll learn how to run LLMs locally using llama.cpp. It covers everything from model preparation (what a GGUF file is, how to convert an LLM to GGUF, and how to quantize an LLM) to local LLM inference. This is a complete llama.cpp tutorial, so we also cover how to run LoRA adapters, how to benchmark your models, and how to use llama.cpp bindings to add LLM inference to the applications you build. We also compare llama.cpp against popular alternatives such as Ollama and vLLM. After watching this video you will know ever…
Watch on YouTube ↗

Chapters (15)

Why run LLMs locally?
1:00 What is llama.cpp?
2:10 llama.cpp vs Ollama vs vLLM vs LM Studio
5:30 Tour of the llama.cpp repo
8:40 How to build / install llama.cpp
19:20 How to run LLMs locally with llama.cpp
32:10 How to benchmark LLMs
35:14 Structured outputs with grammars and JSON schema
37:20 Memory mapping (--no-mmap, --mlock)
41:10 How to create a GGUF model with llama.cpp
45:33 How to quantize an LLM
49:30 How to use a LoRA adapter
57:00 How to merge a LoRA with the base model
1:01:00 How to use llama.cpp bindings to build applications
1:06:50 Outro
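As a quick reference for the build and run chapters, the typical flow looks roughly like the sketch below. Binary names are as they appear in recent llama.cpp releases (the main binary was renamed from `main` to `llama-cli` in 2024), and the model path is a placeholder, so adjust for your checkout:

```shell
# Clone and build llama.cpp with CMake (CPU-only; add -DGGML_CUDA=ON for NVIDIA GPUs)
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
cmake -B build
cmake --build build --config Release

# Run a GGUF model interactively (-n limits the number of generated tokens)
./build/bin/llama-cli -m models/my-model-Q4_K_M.gguf -p "Hello" -n 128
```

For serving instead of one-off runs, `llama-server -m models/my-model-Q4_K_M.gguf --port 8080` exposes an OpenAI-compatible HTTP API.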
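For the GGUF creation and quantization chapters, a hedged sketch of the usual two-step pipeline (script and tool names as they appear in the current repo; `convert_hf_to_gguf.py` replaced the older `convert.py`, and the Hugging Face model directory is a placeholder):

```shell
# Step 1: convert a Hugging Face model directory to GGUF at full/F16 precision
python convert_hf_to_gguf.py /path/to/hf-model --outfile my-model-f16.gguf

# Step 2: quantize the F16 GGUF down to 4-bit
# (Q4_K_M is a common quality/size trade-off; run llama-quantize with no
# arguments to list all supported quantization types)
./build/bin/llama-quantize my-model-f16.gguf my-model-Q4_K_M.gguf Q4_K_M
```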
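The LoRA, merge, and benchmark chapters map onto dedicated tools in the llama.cpp build directory. A sketch, assuming the adapter has already been converted to GGUF form and using placeholder file names:

```shell
# Apply a LoRA adapter on top of a base model at load time
./build/bin/llama-cli -m base-model.gguf --lora my-adapter.gguf -p "Hello"

# Bake the adapter into the base weights to produce a standalone merged model
./build/bin/llama-export-lora -m base-model.gguf --lora my-adapter.gguf -o merged-model.gguf

# Benchmark prompt-processing and token-generation throughput
./build/bin/llama-bench -m merged-model.gguf
```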
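For the bindings chapter, one widely used option is the llama-cpp-python wrapper. A minimal sketch, assuming that package is installed (`pip install llama-cpp-python`) and a local GGUF file exists at the placeholder path:

```python
# Minimal llama-cpp-python sketch: load a local GGUF model and complete a prompt.
# model_path is a placeholder; n_ctx sets the context window in tokens.
from llama_cpp import Llama

llm = Llama(model_path="my-model-Q4_K_M.gguf", n_ctx=2048)
out = llm("Q: What is llama.cpp? A:", max_tokens=64, stop=["Q:"])
print(out["choices"][0]["text"])
```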