📰 Dev.to · plasmon
Articles from Dev.to · plasmon · 24 articles · Updated every 3 hours

Dev.to · plasmon
3d ago
Ollama, LM Studio, and GPT4All Are All Just llama.cpp — Here's Why Performance Still Differs

Dev.to · plasmon
4d ago
99.8% of LLM Inference Power Isn't Spent on Computation
When people debate LLM inference...

Dev.to · plasmon
4d ago
Q4 KV Cache Fit 32K Context into 8GB VRAM — Only Math Broke
The biggest VRAM hog in LLM...

Dev.to · plasmon
4d ago
HBM4 Didn't Break the Memory Wall — It Just Moved It
HBM bandwidth has doubled every...

Dev.to · plasmon
4d ago
Running Just One LLM on 8GB VRAM Is a Waste

Dev.to · plasmon
4d ago
Light Just Cut KV Cache Memory Traffic to 1/16th
The bottleneck in long-context LLM...

Dev.to · plasmon
5d ago
They Routed Power Through the Back of the Chip and 30% IR Drop Vanished

Dev.to · plasmon
6d ago
Letting AI Control RAG Search Improved Accuracy by 79%
Most RAG (Retrieval-Augmented...

Dev.to · plasmon
6d ago
If Memory Could Compute, Would We Still Need GPUs?
The bottleneck for LLM inference isn't...

Dev.to · plasmon
1w ago
I Couldn't Build a Local LLM PC for $1,300 — Budget Tiers and the VRAM Cliffs Between Them

Dev.to · plasmon
1w ago
8-Bit Quantization Destroyed 92% of Code Generation — The Culprit Wasn't Bit Count

Dev.to · plasmon
1w ago
ML Hit 99% Accuracy on Yield Prediction — The Factory Floor Ignored It
The pitch to bring...

Dev.to · plasmon
1w ago
3 Classifiers, 3 Answers: Why CoT Faithfulness Scores Are Meaningless
LLM Chain-of-Thought...

Dev.to · plasmon
1w ago
Parameter Count Is the Worst Way to Pick a Model on 8GB VRAM
I've been running local LLMs...

Dev.to · plasmon
1w ago
The Memory Bandwidth Gap Is 49x and Growing — Why Local LLMs Hit a Ceiling
The Wall I Hit on an RTX 4060 Was a Bandwidth Wall. Running Qwen3.5-9B on an RTX 4060 8GB...

Dev.to · plasmon
1w ago
MoE Beat Dense 27B by 2.4x on 8GB VRAM — The 35B-A3B Benchmark Nobody Expected
Start with the benchmarks. In a previous article, I compared three Qwen3.5 models on the...

Dev.to · plasmon
1w ago
I Designed a Memory System for Claude Code — 'Forgetting' Was the Hardest Part
Everyone talks about making AI remember things. Handoff prompts. System instructions. Memory files....

Dev.to · plasmon
1w ago
80% of LLM 'Thinking' Is a Lie — What CoT Faithfulness Research Actually Shows
When You're Reading CoT, the Model Is Thinking Something Else. Thinking models are...

Dev.to · plasmon
2w ago
I Let Claude Code Run My Tech Blog. A Fake Article Passed Every Quality Check.
I've been letting Claude Code autonomously run a tech blog. Topic selection, article generation,...

Dev.to · plasmon
2w ago
Still Picking API vs Local LLM by Gut Feeling? A Framework With Real Benchmarks

Dev.to · plasmon
2w ago
I Tried Speculative Decoding on RTX 4060 8GB — Every Config Was Slower Than Baseline

Dev.to · plasmon
2w ago
What Happens When You Bring LLMs Into a Semiconductor FAB — 5 ArXiv Papers, Brutally Honest Reviews
ArXiv papers on semiconductor manufacturing x AI have been surging. From late 2024 onward, proposals...

Dev.to · plasmon
2w ago
I Built a Fully Local Paper RAG on an RTX 4060 8GB — BGE-M3 + Qwen2.5-32B + ChromaDB