AI companies typically build their AI models on lots of publicly available content, from YouTube videos to newspaper articles. But many of these content hosts have now started to put up restrictions on their content.
Those new restrictions could bring about a "crisis" that would make these AI models less effective, according to a study.
The researchers performed an audit of 14,000 websites that are scraped by prominent AI training data sets. The intriguing result: about 28 percent "of the most actively maintained, critical sources" on the internet are now "fully restricted from use."
"If respected or enforced, these restrictions are rapidly biasing the diversity, freshness, and scaling laws for general-purpose AI systems," the researchers write.
It's understandable that content hosts would put restrictions on their cache of now-valuable data. AI companies have taken this publicly available material, much of it copyrighted, and are using it to make money without permission. This has understandably upset many.
The arrogance on display, and the resulting blowback, have created a "consent in crisis," as the study researchers call it, meaning the once freewheeling internet with no walls is becoming a thing of the past, and AI models will be more biased, less diverse and less fresh.
Some companies are now hoping to work around these constraints by using synthetic data, which is essentially data generated by AI, but so far that's been a [...]
Time will tell how the whole thing shakes out. One thing's for sure, though: stockpiles of training data are becoming more valuable, and more scarce, than ever.