📰 Dev.to · benzsevern

Articles from Dev.to · benzsevern · 15 articles · Updated every 3 hours

From Dirty CSV to Golden Records: A Python Walkthrough
Dev.to · benzsevern 4d ago
Download a government CSV, load it into pandas, and you'll find "MEMORIAL HOSPITAL" listed twelve...
GoldenMatch vs. Splink vs. Dedupe vs. RecordLinkage: A Practical Comparison
Dev.to · benzsevern 1w ago
We ran four Python entity resolution libraries on the same three datasets — Febrl, DBLP-ACM, and 10K real voter records. Here's where each shines.
GoldenMatch vs. BPID: Testing Against an EMNLP Benchmark
Dev.to · benzsevern 1w ago
We benchmarked GoldenMatch on Amazon's BPID dataset — 10,000 adversarial PII pairs. With DOB parsing and Vertex AI embeddings, we hit 0.750 F1 — matching Ditto
Deduplicating 401,000 Equipment Auction Records with LLM Calibration
Dev.to · benzsevern 1w ago
We ran GoldenMatch on 401,125 bulldozer auction records from Kaggle. Iterative LLM calibration learned the optimal match threshold from just 200 pairs (~$0.01).
AI-Powered Deduplication: How LLMs Supercharge the Golden Suite
Dev.to · benzsevern 1w ago
Enable LLM boost across GoldenCheck, GoldenFlow, and GoldenMatch to catch what fuzzy matching misses — with real costs under $0.10.
Getting Started with GoldenPipe: Clean Data in Your Python Backend
Dev.to · benzsevern 1w ago
Add a production-ready data quality pipeline to your Python backend in 5 minutes. One pip install, one function call, zero config.
Entity Resolution on 208,000 Real Records with the Golden Suite
Dev.to · benzsevern 1w ago
We ran the full Golden Suite pipeline on 208,505 real NC voter registration records. 61 quality findings, 197K addresses cleaned, 10,718 duplicate clusters foun
10 Data Problems Every Pipeline Hits (and the One-Liner Fixes)
Dev.to · benzsevern 1w ago
The same 10 data quality issues show up in every dataset. Here's what they look like and how to fix each in one line.
Two Hospitals Matched Patient Records Without Sharing a Single Name
Dev.to · benzsevern 2w ago
Privacy-preserving record linkage with Bloom filters. 92% accuracy. Zero raw data exchanged.
I Deduplicated 100K Records in 12 Seconds With One Command
Dev.to · benzsevern 2w ago
How GoldenMatch auto-detects columns, picks scoring algorithms, and hits 97% F1 with zero configuration.
How to Deduplicate 100,000 Records in 13 Seconds with Python
Dev.to · benzsevern 3w ago
You have a CSV with duplicate records. Maybe it's customer data exported from two CRMs, a product...