📰 Dev.to · João André Gomes Marques

Articles from Dev.to · João André Gomes Marques · 47 articles · Updated every 3 hours

SDK v0.2.9: Output Verification, Attestations, Preflight and Budgets
Dev.to · João André Gomes Marques 4d ago
v0.2.9 is out on PyPI. Four new things, all driven by what people asked for after shipping agents to...
Scan MCP tool definitions for prompt injection before your agent calls them
Dev.to · João André Gomes Marques 6d ago
MCP servers expose tools to AI agents. But those tool definitions can contain prompt injection,...
Three tiers of enforcement for AI agents - strong, bounded, detectable
Dev.to · João André Gomes Marques 1w ago
Most AI agent frameworks give you zero enforcement. Your agent can call any tool, take any action,...
asqav-mcp is now on Docker Hub
Dev.to · João André Gomes Marques 1w ago
asqav-mcp is now on Docker Hub. The MCP server that gives AI agents governance capabilities - policy...
Asqav vs Microsoft Agent Governance Toolkit - what is the difference
Dev.to · João André Gomes Marques 1w ago
Microsoft released the Agent Governance Toolkit (AGT) on April 2, 2026. I built Asqav, an open source...
Why the E8 lattice is the perfect quantizer for KV caches
Dev.to · João André Gomes Marques 1w ago
Most quantizers are chosen for convenience. E8 was chosen because the math demanded it — and then it...
Running 1M-token context on a single GPU (the math)
Dev.to · João André Gomes Marques 1w ago
Most people dismiss million-token context windows as a hardware problem. It is not. It is a math...
NexusQuant is now on PyPI, HuggingFace, and 9 awesome lists
Dev.to · João André Gomes Marques 1w ago
This week we shipped everything. Here is the full list. What went out the door PyPI...
Why attention-aware eviction beats random eviction (with data)
Dev.to · João André Gomes Marques 1w ago
At high eviction rates, choosing which tokens to drop matters enormously. Here is what the numbers...
One line of Python to extend your LLM's context window 10x
Dev.to · João André Gomes Marques 1w ago
Your LLM is running out of memory at 128K tokens. Here is the fix. from nexusquant import...
The 12 approaches I tested before finding one that works
Dev.to · João André Gomes Marques 1w ago
I keep seeing ML papers that only show the final method. No dead ends, no "we tried X and it was a...
NexusQuant: memory compression for LLMs - a practical guide
Dev.to · João André Gomes Marques 1w ago
In this guide we will explore the...
How to compress your LLM's KV cache 33x without training
Dev.to · João André Gomes Marques 1w ago
If you have ever tried to run a...
KV cache memory calculator: how much does your LLM actually use?
Dev.to · João André Gomes Marques 1w ago
Before you can compress something, you need to know how big it is. Most engineers know the KV cache...
How to benchmark NexusQuant on your own model
Dev.to · João André Gomes Marques 1w ago
Running benchmarks on someone else's hardware tells you very little. This guide shows you how to...
What I Learned Testing 12 Compression Approaches That Failed
Dev.to · João André Gomes Marques 1w ago
The most useful research I've...
The Math Behind E8 Lattice Quantization (with Code)
Dev.to · João André Gomes Marques 1w ago
Standard scalar quantization - what...
How Much GPU Memory Does NexusQuant Actually Save?
Dev.to · João André Gomes Marques 1w ago
KV cache compression numbers like "10x"...
How to deploy NexusQuant in production (and what's missing)
Dev.to · João André Gomes Marques 1w ago
This post is a practical deployment guide. Install, configuration, how to pick the right eviction...
NexusQuant vs KVTC vs TurboQuant vs CommVQ — honest comparison
Dev.to · João André Gomes Marques 1w ago
There are now enough KV cache compression papers that "we beat the competition" is meaningless...