TIL I accidentally trained an AI model on 47,000 images all labeled 'cat' before I realized my dataset was broken

I was messing around with a tiny image classifier for a side project using TensorFlow. Thought I had a good mix of object categories. After 6 hours of training I hit 99% accuracy and felt like a genius. Then I checked the dataset labels and realized I'd tagged every single image as 'cat' by accident because of a dumb folder naming mistake. The accuracy number meant nothing, since it was scored against those same broken labels. The model didn't learn anything about different objects; it just learned to always say cat. Has anyone else spent a whole day on a model only to realize the data was garbage the whole time?
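For anyone who wants to catch this before burning 6 hours: a minimal sketch of a label sanity check to run before training. It only assumes you can get your labels as a plain list of strings; the function name `label_report` and the `max_share` threshold of 0.9 are made up for illustration, not part of TensorFlow.

```python
from collections import Counter


def label_report(labels, max_share=0.9):
    """Count labels and flag distributions that look suspicious.

    `labels` is a list of class-name strings.
    `max_share` is a hypothetical threshold: warn if one class
    makes up more than this fraction of the dataset.
    """
    counts = Counter(labels)
    total = sum(counts.values())
    warnings = []

    if total == 0:
        warnings.append("no labels found")
        return counts, warnings

    # A classifier needs at least two classes to learn a distinction.
    if len(counts) < 2:
        only = next(iter(counts))
        warnings.append(f"only one class present: {only!r}")

    # One dominant class makes raw accuracy misleading.
    top_label, top_count = counts.most_common(1)[0]
    if top_count / total > max_share:
        warnings.append(
            f"class {top_label!r} is {top_count / total:.0%} of the data"
        )

    return counts, warnings
```

Calling `label_report(["cat"] * 47000)` would return two warnings, which is the kind of thing you want to see before hitting train, not after.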

3 comments


evan_grant70 22d ago Most Upvoted

A few years ago I would have blamed sloppy coding, but now I know data failures happen way more often than code failures.

casey342 22d ago Most Upvoted

Haven't you noticed that cleaning up data usually turns into a much bigger job than fixing the code itself? I've had projects where I spent weeks scrubbing bad customer records before I could even start bug fixing. What really helped me was building simple validation checks right into the input forms so garbage data never made it into the database in the first place. Even a basic dropdown instead of a free text field saved me hours of cleanup later. The other thing I found was taking screenshots of bad data patterns and sharing them with the team, so everyone knew what to avoid. It's way easier to prevent data rot than to fix it after it's been sitting there for months.
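A rough sketch of the "validate at input" idea, assuming a customer record arrives as a plain dict. The field names (`name`, `plan`) and the allowed plan values are hypothetical and only stand in for whatever a real form would collect; the fixed set plays the role of the dropdown.

```python
# Hypothetical set of allowed values, the code equivalent of a dropdown.
ALLOWED_PLANS = {"free", "pro", "enterprise"}


def validate_customer(record):
    """Return a list of error messages; an empty list means the record is OK."""
    errors = []

    # Reject missing or whitespace-only names.
    name = str(record.get("name", "")).strip()
    if not name:
        errors.append("name is required")

    # Only accept values from the fixed set, never free text.
    plan = record.get("plan")
    if plan not in ALLOWED_PLANS:
        errors.append(f"plan must be one of {sorted(ALLOWED_PLANS)}, got {plan!r}")

    return errors
```

The point is that a record like `{"name": " ", "plan": "Pro"}` gets rejected at the door with two errors instead of being cleaned up months later.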

jackson.max 21d ago

@evan_grant70 probably nailed it: data failure sneaks up on you way more than code bugs. A model that learns to always say cat is still technically learning something, just not what you wanted. Maybe it's not the end of the world if you caught it before shipping.
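One cheap way to spot an "always say cat" model is to compare its accuracy with what you'd get by always guessing the most common label. A minimal sketch, assuming labels are a plain list; `majority_baseline_accuracy` is a made-up helper name, not a library function.

```python
from collections import Counter


def majority_baseline_accuracy(labels):
    """Accuracy of a 'model' that always predicts the most common label."""
    if not labels:
        raise ValueError("labels must not be empty")
    counts = Counter(labels)
    # most_common(1) returns [(label, count)] for the largest class.
    _, top_count = counts.most_common(1)[0]
    return top_count / len(labels)
```

If the trained model's accuracy is about the same as this baseline, it probably hasn't learned anything beyond the class frequencies.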