My M5 Max, Gemma 4, MLX LOCAL Stack. (This KILLS MODEL PROVIDERS)

IndyDevDan · Beginner · 🧠 Large Language Models · 2h ago
Model providers DON'T want you to see this video. The M5 Max just exposed the dirty secret of the cloud LLM economy: you're renting what you could already OWN. 🔥 While Anthropic and OpenAI APIs go down AGAIN mid-recording, my local stack keeps shipping. Private. Cheap. Fast. On-device. This is the beginning of the end for the API rental racket.

🎥 FEATURED LINKS:
• MLX, Gemma 4, Qwen 3.6, Pi agent live-bench codebase: https://github.com/disler/live-bench
• Tactical Agentic Coding: https://agenticengineer.com/tactical-agentic-coding?y=00Y-p62sk0s

📚 RESOURCES
• NVIDIA NVFP4: https://developer.nvidia.com/blog/introducing-nvfp4-for-efficient-and-accurate-low-precision-inference/
• Apple M5 GPU Neural Accelerators: https://machinelearning.apple.com/research/exploring-llms-mlx-m5
• mlx-lm: https://github.com/ml-explore/mlx-lm
• Ollama Gemma 4 model: https://ollama.com/library/gemma4
• Ollama MLX blog: https://ollama.com/blog/mlx
• Pi coding agent: http://pi.dev
• Gemma 4 26B NVFP4: https://huggingface.co/mlx-community/gemma-4-26b-a4b-it-nvfp4
• Vitalik on secure LLMs: https://vitalik.eth.limo/general/2026/04/02/secure_llms.html

⚡ Here's the uncomfortable truth most engineers are ignoring: you're paying a premium for cloud inference when the M5 Max, M4 Max, or even older Apple Silicon you already own can run state-of-the-art local LLMs RIGHT NOW. Gemma 4, Qwen 3.5, and MLX variants optimized for Apple AI hardware are quietly eating the model providers' lunch.

🧠 In this head-to-head benchmark, I pit the M5 Max against the M4 Max across three brutal local inference tests: raw prompt throughput, context scaling with Graph Walks, and full agentic coding workflows via the Pi coding agent. The results will reshape how you think about local agents.

💣 THE CONTROVERSIAL FINDING: If you're running GGUF models on Apple Silicon in 2026, you're leaving 2x performance on the table. MLX smokes GGUF. Not by a little. By a LOT. 118 tokens per second vs 60. Almost double the pre-fill speed.
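To make the throughput numbers concrete, here's a minimal sketch of the kind of pre-fill/decode measurement the benchmark describes, using the mlx_lm Python package (pip install mlx-lm). The model id is the NVFP4 Gemma 4 conversion from the links above and is an assumption; swap in any MLX model you have locally, and treat the one-token run as a rough proxy for pre-fill cost.

```python
# Rough local-throughput measurement with mlx_lm; a sketch, not the
# video's actual harness. Assumes the NVFP4 Gemma 4 conversion linked
# above -- substitute any MLX model you have on disk.
import time
from mlx_lm import load, generate

MODEL_ID = "mlx-community/gemma-4-26b-a4b-it-nvfp4"  # assumed, from the links above

model, tokenizer = load(MODEL_ID)
prompt = "Explain the difference between pre-fill and decode in LLM inference."

# A one-token generation is dominated by prompt processing, so its wall
# time approximates the pre-fill cost for this prompt.
t0 = time.perf_counter()
generate(model, tokenizer, prompt=prompt, max_tokens=1)
prefill_s = time.perf_counter() - t0

# A longer generation amortizes pre-fill; subtracting it out gives an
# approximate decode rate in tokens per second.
t0 = time.perf_counter()
text = generate(model, tokenizer, prompt=prompt, max_tokens=256)
total_s = time.perf_counter() - t0

gen_tokens = len(tokenizer.encode(text))  # approximate generated-token count
decode_tps = gen_tokens / max(total_s - prefill_s, 1e-6)
print(f"pre-fill ≈ {prefill_s:.2f}s, decode ≈ {decode_tps:.1f} tok/s")
```

Run the same script on an M4 Max and an M5 Max and you have the skeleton of the head-to-head the video performs; repeat it at growing prompt lengths to probe context scaling.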
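For the MLX-vs-GGUF claim, the apples-to-apples counterpart would time a GGUF quantization of the same model through llama-cpp-python. The .gguf filename below is a placeholder (the description doesn't name one), and this measures end-to-end throughput rather than split pre-fill/decode:

```python
# GGUF-side counterpart via llama-cpp-python (pip install llama-cpp-python).
# The model file is a placeholder; point it at whichever quantization you
# want to compare against the MLX run, with the same prompt and token budget.
import time
from llama_cpp import Llama

llm = Llama(model_path="gemma-4-26b-it.Q4_K_M.gguf",  # placeholder path
            n_ctx=4096, verbose=False)

prompt = "Explain the difference between pre-fill and decode in LLM inference."

t0 = time.perf_counter()
out = llm(prompt, max_tokens=256)
elapsed = time.perf_counter() - t0

gen_tokens = out["usage"]["completion_tokens"]
print(f"end-to-end ≈ {gen_tokens / elapsed:.1f} tok/s")
```

Keeping the prompt, token budget, and machine fixed across both runs is what makes a 118-vs-60 style comparison meaningful.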

Related AI Lessons

What Claude Mythos Actually Is and Why Anthropic Is Not Releasing It
Learn about Anthropic's decision not to release Claude Mythos, its most capable AI model, and the potential implications of this choice
Medium · AI
Part 4: What Does Reddit Think About AI Therapy? BERTopic Reveals the Hidden Themes
Explore hidden themes in 20,000 Reddit posts about AI therapy using BERTopic, and discover what users are really saying
Medium · AI
How to Deploy an Open Source LLM Reliably on Kubernetes
Learn to deploy an open-source LLM reliably on Kubernetes using Mistral 7B, Ollama, Prometheus, and Grafana
Medium · DevOps
Up next
5 Levels of AI Agents - From Simple LLM Calls to Multi-Agent Systems
Dave Ebbelaar (LLM Eng)