25

Just read that most AI training datasets have less than 5% non-English content

I was digging through a paper from Hugging Face last night and found out that something like 95% of the data used to train big language models is in English. That blew my mind, because I always assumed they'd have more balanced coverage, especially given how global AI use is now. It makes me wonder how well these tools actually work for people in places like Nigeria or Brazil, where English isn't the primary language. Has anyone else come across stats about this?
3 comments

parker_hall5
Isn't that kind of wild and a little unfair to the rest of the world?
2
casey342 4d ago
I mean, I get where you're coming from, but the rest of the world gets to play the same game we do; they just don't want to spend the money. Every country has its own labs and universities that could be doing this stuff if they wanted to. It's not like we're hiding the blueprints or anything. We're just the ones willing to foot the bill for the research and development.
4
the_cameron
My friend in Japan said the SAME thing when I told him about our new hardware lab.
1