This article focuses on the technical and operational issues that most often break web data collection projects.
To understand what actually goes wrong, we analyzed 82 discussion threads (questions, issues, and conversations) from Stack Overflow, Reddit, GitHub Issues, Hacker News, and niche and regional platforms.
...
Let’s say you run a background screening platform.
You pull data from courts, registries, vendors, and public sources. One day, you discover that one person shows up three times in your system:
...
A 2024 peer-reviewed study in Criminology found that about 60% of people had at least one false positive in their reports, and nearly 90% had at least one false negative. These errors are systemic, so you have to design around them.
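To make the duplicate-person scenario concrete, here is a minimal sketch of grouping records by a normalized name and date of birth. The field names (`name`, `dob`, `source`), the helper functions, and the sample records are all hypothetical and made up for illustration; a real screening pipeline would likely need fuzzier matching and more identifiers.

```python
import unicodedata
from collections import defaultdict


def normalize_name(name: str) -> str:
    """Lowercase, strip accents and punctuation, and sort the name tokens."""
    decomposed = unicodedata.normalize("NFKD", name)
    without_accents = "".join(
        ch for ch in decomposed if not unicodedata.combining(ch)
    )
    cleaned = "".join(
        ch if ch.isalnum() else " " for ch in without_accents.lower()
    )
    # Sorting tokens makes "Alvarez, Jose" and "Jose Alvarez" compare equal.
    return " ".join(sorted(cleaned.split()))


def group_records(records):
    """Group record sources that share a normalized name and date of birth."""
    groups = defaultdict(list)
    for record in records:
        key = (normalize_name(record["name"]), record["dob"])
        groups[key].append(record["source"])
    return dict(groups)


# Made-up sample data: three spellings of one person, plus a namesake
# with a different date of birth.
records = [
    {"name": "José Alvarez", "dob": "1985-03-12", "source": "court"},
    {"name": "ALVAREZ, Jose", "dob": "1985-03-12", "source": "registry"},
    {"name": "Jose  Alvarez", "dob": "1985-03-12", "source": "vendor"},
    {"name": "Jose Alvarez", "dob": "1990-07-01", "source": "court"},
]

groups = group_records(records)
for key, sources in groups.items():
    print(key, sources)
```

An exact-match key like this cuts both ways: two different people who share a name and birth date get merged (a false positive), while a single typo in either field splits one person into two (a false negative). That trade-off is why matching rules need to be tested against both kinds of error.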
...