📰 Dev.to · benzsevern
Articles from Dev.to · benzsevern · 15 articles · Updated every 3 hours

Dev.to · benzsevern
2d ago
Reconciling 15 OSS Vulnerability Databases: What They Actually Cover
Cross-database ER across OSV, GHSA, PyPA, RustSec, Go vulndb — 869k records, 608k canonical vulns, and one structural blind spot.

Dev.to · benzsevern
2d ago
Wallet Attribution at Scale: ER on 13M Blockchain Records
Running entity resolution across 10 public blockchain attribution datasets surfaces cross-jurisdictional sanctions and universal infrastructure patterns.

Dev.to · benzsevern
3d ago
The OSS ER Bargain: What Entity Resolution Actually Costs You
Benchmarking dedupe vs...

Dev.to · benzsevern
4d ago
Golden Suite + MCP: Giving AI Agents a Data Cleaning Toolkit
An AI agent can write SQL, draft an email, and refactor a repo. Ask it to deduplicate a 50,000-row...

Dev.to · benzsevern
4d ago
From Dirty CSV to Golden Records: A Python Walkthrough
Download a government CSV, load it into pandas, and you'll find "MEMORIAL HOSPITAL" listed twelve...

Dev.to · benzsevern
1w ago
GoldenMatch vs. Splink vs. Dedupe vs. RecordLinkage: A Practical Comparison
We ran four Python entity resolution libraries on the same three datasets — Febrl, DBLP-ACM, and 10K real voter records. Here's where each shines.

Dev.to · benzsevern
1w ago
GoldenMatch vs. BPID: Testing Against an EMNLP Benchmark
We benchmarked GoldenMatch on Amazon's BPID dataset — 10,000 adversarial PII pairs. With DOB parsing and Vertex AI embeddings, we hit 0.750 F1 — matching Ditto

Dev.to · benzsevern
1w ago
Deduplicating 401,000 Equipment Auction Records with LLM Calibration
We ran GoldenMatch on 401,125 bulldozer auction records from Kaggle. Iterative LLM calibration learned the optimal match threshold from just 200 pairs (~$0.01).

Dev.to · benzsevern
1w ago
AI-Powered Deduplication: How LLMs Supercharge the Golden Suite
Enable LLM boost across GoldenCheck, GoldenFlow, and GoldenMatch to catch what fuzzy matching misses — with real costs under $0.10.

Dev.to · benzsevern
1w ago
Getting Started with GoldenPipe: Clean Data in Your Python Backend
Add a production-ready data quality pipeline to your Python backend in 5 minutes. One pip install, one function call, zero config.

Dev.to · benzsevern
1w ago
Entity Resolution on 208,000 Real Records with the Golden Suite
We ran the full Golden Suite pipeline on 208,505 real NC voter registration records. 61 quality findings, 197K addresses cleaned, 10,718 duplicate clusters found...

Dev.to · benzsevern
1w ago
10 Data Problems Every Pipeline Hits (and the One-Liner Fixes)
The same 10 data quality issues show up in every dataset. Here's what they look like and how to fix each in one line.

Dev.to · benzsevern
2w ago
Two Hospitals Matched Patient Records Without Sharing a Single Name
Privacy-preserving record linkage with bloom filters. 92% accuracy. Zero raw data exchanged.

Dev.to · benzsevern
2w ago
I Deduplicated 100K Records in 12 Seconds With One Command
How GoldenMatch auto-detects columns, picks scoring algorithms, and hits 97% F1 with zero configuration.

Dev.to · benzsevern
3w ago
How to Deduplicate 100,000 Records in 13 Seconds with Python
You have a CSV with duplicate records. Maybe it's customer data exported from two CRMs, a product...