Posted on Feb 28, 2024

Decentralized Intelligence Pt. 1

AI training data is frequently thought of as a publicly available commodity: something scraped off the internet, bundled into a big file, and fed to a model like digital food.

While scraping certainly plays a role, there's another, often-forgotten component critical to training data: the data collectors.

There is a significant amount of human labor that goes into creating, cleaning, ranking, and enriching datasets. The unsung heroes are the thousands, if not hundreds of thousands, of individuals who annotate training data and guide models through human feedback. Many of these data collectors work for large digital labor platforms (DLPs) capturing images, annotating text, labeling videos, and carrying out many other critical micro-tasks to power our modern AI products.

The demand for data collection is compounding as billions of dollars flow into generative AI startups that are eager to train models as fast as possible. But sourcing data is getting harder. Media platforms are clamping down and filing lawsuits against AI startups harvesting their data. Publicly available stockpiles of high-quality data are rapidly being exhausted. New data types that can’t be scraped off the internet are now needed to advance AI. And content creators are fed up with AI profiting off their creativity while providing nothing in return.

Because of these dynamics, the AI industry has turned to DLPs, which have seen tremendous success as a result. DLPs allow AI startups to quickly source structured training data while avoiding lawsuits and public scrutiny. But DLPs also face meaningful challenges. Many of these organizations lack the scale, diversity, and capabilities required by AI startups. Moreover, DLPs are notorious for their poor working conditions and misaligned incentives, contributing to data quality issues. In short, DLPs are struggling to keep up as the secular demand for data is exploding.

At the same time, the rapid progress of AI development has sparked an important discussion on governance. Who owns AGI? Who determines the truth? AI leaders and academics are now advocating for a decentralized model to counterbalance the danger of a centralized AGI. But what does this model look like?

To responsibly accelerate AI into new frontiers, we need a better, more decentralized way to source and govern human intelligence at scale. Empirical scaling laws have shown that model performance is a function of compute and data: scaling these resources produces better models, and there are no signs of this trend topping out. But collecting more low-quality data doesn't help. We need massive volumes of high-quality data, which requires more contributors with varying backgrounds, skill sets, and perspectives. We need a platform that can equitably coordinate these contributors. And we need fast, cheap, attractive incentives that turn contributors into motivated co-owners.

We need decentralized intelligence platforms.