Spent 3 hours training an AI model on the wrong dataset last Tuesday

I was trying to get a custom image classifier to recognize different types of office chairs for a project. Pulled a dataset from some random GitHub repo and spent hours fine-tuning the model. Turns out the labels were all mixed up - it was labeling swivel chairs as 'executive' and basic stools as 'task chairs'. What's your method for checking dataset quality before you start training?

3 comments

lucashenderson 24d ago

Last summer I grabbed a dataset labeled "furniture types" off some random repo and spent two hours training a model to tell tables apart from chairs. It ended up classifying everything with four legs as a table, including actual chairs. @charlieh74 you're right that flipping through 10-20 samples is way faster than learning the hard way, but sometimes the urge to just YOLO it is too strong.

charlieh74 24d ago

@rowan_ross out here calling me out and honestly he's not wrong. Grabbing a random GitHub dataset is basically gambling with your time, but sometimes you just gotta YOLO it and hope for the best. My method nowadays is to just flip through 10-20 samples real quick before committing. It usually takes like 5 minutes and saves me from your exact situation. Swivel chairs labeled as executive chairs is comedy gold though, at least you got a funny story out of it.

rowan_ross 24d ago

What kind of lunatic just grabs a random dataset off GitHub and hopes for the best? You basically asked for garbage in, garbage out when you skipped even a basic sanity check. Next time maybe spot check 20-30 images and see if the labels actually match what you're looking for before burning three hours.
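
For anyone who wants to script the spot check suggested above, here is a minimal sketch. It assumes the dataset is already loaded as a list of (file path, label) pairs. The function name `sample_for_spot_check`, the `per_label` parameter, and the file names are made up for illustration and don't come from any library. It pulls a few random files per label so that rare classes get looked at too, instead of only whatever happens to be at the top of the folder.

```python
import random
from collections import defaultdict


def sample_for_spot_check(items, per_label=5, seed=0):
    """Return up to `per_label` randomly chosen paths for each label.

    `items` is an iterable of (path, label) pairs (an assumed format).
    """
    by_label = defaultdict(list)
    for path, label in items:
        by_label[label].append(path)

    rng = random.Random(seed)  # fixed seed so the same sample can be re-checked
    picks = {}
    for label in sorted(by_label):
        paths = by_label[label]
        k = min(per_label, len(paths))  # small classes: take everything
        picks[label] = sorted(rng.sample(paths, k))
    return picks


if __name__ == "__main__":
    # Hypothetical listing; in practice this would come from the dataset's
    # annotation file or folder structure.
    items = [
        ("chairs/0001.jpg", "executive"),
        ("chairs/0002.jpg", "executive"),
        ("chairs/0003.jpg", "task chair"),
        ("chairs/0004.jpg", "stool"),
    ]
    for label, paths in sample_for_spot_check(items, per_label=2).items():
        print(label, paths)
```

Open the printed files next to their labels and look for mismatches. With 5 per label and a handful of classes that lands in the same 20-30 image range mentioned in the comments. It won't catch every bad label, but it should surface a systematic mix-up like swivel chairs filed under 'executive'.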