If 2023 was the year of parameter scaling in AI, 2024 has been the year of investing in data. Nearly every major AI lab has acknowledged that we’re beginning to see diminishing returns on model scaling (whether due to resource constraints or to truly diminishing returns from adding new parameters).
On the data side, however, we’re still in the very early innings of performance improvement. To date, the majority of AI models have been trained by throwing essentially the entire internet of text data at the wall and seeing what sticks.
This creates two obvious issues: i) everyone has the exact same data; and ii) there’s a ton of bad and duplicative data in there that’s skewing outputs.
It’s even worse at the application layer: because everyone is using frontier models to power their apps (whether closed or open source), every competitor is serving up the same results to customers. While many players in these categories claim that they’re fine-tuning their models on proprietary data, the reality is that this is usually dramatically overstated (in other words, companies rarely have enough data that the juice of fine-tuning is worth the squeeze).
To combat these claims of commoditization, AI companies from the model layer to the application layer have begun to trumpet their own unique “data assets,” each insisting that their data is better than everyone else’s.
But what does a “data asset” even mean? In short, it means you have valuable data that nobody else has, and that would be very difficult (or impossible) for others to replicate.
However, like all triumphant claims, it can be difficult to separate the signal from the noise when everyone is saying the same thing. Some of this is certainly real, but much of it is still slideware.
Today, I want to lay out some of the genuine forms of data assets that we’re seeing AI companies begin to leverage.
The different flavors of proprietary data
It’s worth noting that while many large, incumbent companies do have proprietary data & distribution, no startup has any unique data at day zero.
Below are some of the strategies we’ve seen work to solve this problem:
Proprietary Data Partnerships: Partnering with someone who actually does have proprietary data, so you get unique access. Obvious, but fast and powerful
We’ve increasingly seen compelling examples of this in regulated spaces like Healthcare and Insurance, where incumbents have deeply valuable customer data, but need to bring in outside AI expertise to get the most out of it
These partnerships often come with a revenue sharing component
Relatedly, we’re increasingly seeing companies simply purchase or license unique data outright. If you know which data is the most valuable, sometimes the best strategy is to just buy it before anyone else does
E.g., OpenAI licensing data from Shutterstock to train its image models
Proprietary Usage: As many AI products go viral, communities have begun to emerge where user prompts generate large amounts of proprietary outputs. These outputs can then be leveraged to retrain or fine-tune an existing model, creating truly unique data
This effect is most pronounced in AI companies that have a community element, where an active ecosystem continually adds to the data flywheel (e.g., Midjourney or Viggle)
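To make that flywheel concrete, here’s a minimal Python sketch of turning logged prompt/output pairs into a fine-tuning dataset. The event fields, the upvote signal, and the chat-style JSONL format are illustrative assumptions rather than any particular company’s pipeline.

```python
import json
from dataclasses import dataclass


@dataclass
class UsageEvent:
    """One prompt/output pair captured from the product (fields are illustrative)."""
    prompt: str
    output: str
    upvoted: bool  # e.g., the user kept, shared, or upscaled the result


def build_finetune_file(events: list[UsageEvent], path: str) -> int:
    """Keep only positively-signaled generations and write them as chat-style
    JSONL examples, a common input format for fine-tuning jobs."""
    kept = 0
    with open(path, "w") as f:
        for e in events:
            if not e.upvoted:
                continue  # drop generations the community rejected
            record = {
                "messages": [
                    {"role": "user", "content": e.prompt},
                    {"role": "assistant", "content": e.output},
                ]
            }
            f.write(json.dumps(record) + "\n")
            kept += 1
    return kept


if __name__ == "__main__":
    events = [
        UsageEvent("a watercolor fox in the snow", "<generation metadata>", True),
        UsageEvent("same but photorealistic", "<generation metadata>", False),
    ]
    print(build_finetune_file(events, "community_finetune.jsonl"), "examples written")
```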
Enterprise-specific Data: By collecting data from your customers, you can leverage their private data to improve your offerings for them (and sometimes, all your other customers too)
This is probably the most talked-about strategy, but the hardest to shortcut. The reality is, you have to achieve this via brute force as you deploy into your customers, earn their trust, and deliver value to them before you earn access to their data
E.g., Writer is able to leverage internal memos and documents to help employees generate content using their organization’s specific jargon and context
While difficult to actually achieve (and frequently over-promised), this can result in incredibly sticky enterprise deployments
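For a sense of what this can look like mechanically, below is a generic retrieval sketch: rank a customer’s internal documents against a request and fold the best matches into the prompt, so outputs pick up the organization’s jargon. The TF-IDF retrieval and the sample documents are assumptions for illustration, not a description of Writer’s actual system.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


def retrieve_context(query: str, internal_docs: list[str], k: int = 2) -> list[str]:
    """Rank the customer's private documents against the query via TF-IDF similarity."""
    matrix = TfidfVectorizer().fit_transform(internal_docs + [query])
    scores = cosine_similarity(matrix[-1], matrix[:-1]).ravel()
    top = scores.argsort()[::-1][:k]
    return [internal_docs[i] for i in top]


def build_prompt(query: str, internal_docs: list[str]) -> str:
    """Ground the generation request in the retrieved internal context."""
    context = "\n---\n".join(retrieve_context(query, internal_docs))
    return (
        "Use the company's own terminology from the context below.\n"
        f"Context:\n{context}\n\nTask: {query}"
    )


docs = [
    "Q3 memo: our 'Atlas' tier replaces the legacy 'Pro' plan for enterprise buyers.",
    "Style guide: always refer to customers as 'members', never 'users'.",
]
print(build_prompt("Draft an announcement for the new pricing tier.", docs))
```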
User-specific Data: Beyond data specific to an enterprise, many AI products are now able to collect very specific data about end users themselves, and can serve them content in the formats that resonate with them best
The most obvious example of this comes from personal assistant apps, wherein results can be tailored to user preferences gleaned over time
Often, this requires creating and tracking individual user profiles, or using a single sign-on partner to track specific users within an organization
Another example of this comes via intent data, which is increasingly important in a cookie-less world. Companies like Firsthand are able to infer a user’s intent via their actions, and serve them up relevant content without tracking them across the web with cookies
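As a hedged sketch of what a per-user profile might look like in code: interaction signals accumulate into inferred preferences, which then condition how content is served back. The fields and signals below are hypothetical; they aren’t how Firsthand or any specific vendor actually models intent.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class UserProfile:
    """Per-user preferences inferred from first-party interactions (illustrative fields)."""
    user_id: str
    format_clicks: Counter = field(default_factory=Counter)   # e.g., "bullets" vs. "narrative"
    topic_interest: Counter = field(default_factory=Counter)  # inferred intent signals

    def record_interaction(self, fmt: str, topic: str) -> None:
        self.format_clicks[fmt] += 1
        self.topic_interest[topic] += 1

    def preferred_format(self) -> str:
        return self.format_clicks.most_common(1)[0][0] if self.format_clicks else "narrative"


profile = UserProfile("user-123")
profile.record_interaction(fmt="bullets", topic="pricing")
profile.record_interaction(fmt="bullets", topic="onboarding")
# Downstream, the app conditions its output on the inferred preferences:
print(f"Answer in {profile.preferred_format()} form, emphasizing "
      f"{profile.topic_interest.most_common(1)[0][0]}.")
```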
Synthetic Data: Any net-new data generated by AI instead of humans is usually referred to as synthetic data. While this has been referenced for years, we’re still in the early innings of anyone actually deploying this at scale
In an ironic example, The Information recently reported that OpenAI's competitors are using the o1 reasoning models to generate synthetic reasoning data for their own models to train on… all’s fair in love and war!
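The general pattern is simple enough to sketch: prompt a stronger “teacher” model for step-by-step reasoning, then store the traces as training data for a smaller model. The call_teacher_model helper below is a hypothetical placeholder for whatever provider API you would actually call.

```python
import json


def call_teacher_model(prompt: str) -> str:
    """Hypothetical stand-in for a call to a stronger 'teacher' model;
    in practice this would hit a provider's chat/completions API."""
    return "Step 1: ... Step 2: ... Answer: ..."  # placeholder output


def generate_synthetic_reasoning(problems: list[str], path: str) -> None:
    """Ask the teacher to show its work, then save (problem, reasoning) pairs
    as JSONL training examples for a student model."""
    with open(path, "w") as f:
        for problem in problems:
            prompt = f"Solve step by step, showing your reasoning:\n{problem}"
            reasoning = call_teacher_model(prompt)
            f.write(json.dumps({"prompt": problem, "completion": reasoning}) + "\n")


generate_synthetic_reasoning(["What is 17 * 24?"], "synthetic_reasoning.jsonl")
```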
Data Enrichment: This technique refers to taking your existing data and improving upon it (via things like data pruning, deduplication, stitching different data sources together, etc.)
Clearly, not all data is created equal, and if you can sort through and classify your data before dumping it into a model, you can meaningfully improve performance. This is a solution to the classic garbage-in, garbage-out problem
E.g., Datology is one of the most exciting examples of a company doing this at scale for some of the largest AI labs
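To ground what “sorting through your data” means in practice, here’s a toy deduplication sketch: exact duplicates are dropped by hashing normalized text, and near-duplicates by word-shingle overlap. This is purely illustrative; production pipelines rely on far more scalable techniques (e.g., MinHash/LSH) and richer quality signals than this.

```python
import hashlib
import re


def normalize(text: str) -> str:
    """Cheap canonicalization so trivially different copies hash identically."""
    return re.sub(r"\s+", " ", text.strip().lower())


def shingles(text: str, n: int = 5) -> set[str]:
    """Overlapping n-word windows used to measure near-duplicate similarity."""
    words = normalize(text).split()
    return {" ".join(words[i:i + n]) for i in range(max(1, len(words) - n + 1))}


def deduplicate(docs: list[str], near_dup_threshold: float = 0.8) -> list[str]:
    """Drop exact duplicates by hash, then near-duplicates by Jaccard overlap
    against already-kept documents (quadratic here; real systems use MinHash/LSH)."""
    seen_hashes: set[str] = set()
    kept: list[str] = []
    kept_shingles: list[set[str]] = []
    for doc in docs:
        h = hashlib.sha256(normalize(doc).encode()).hexdigest()
        if h in seen_hashes:
            continue
        s = shingles(doc)
        if any(len(s & t) / max(1, len(s | t)) > near_dup_threshold for t in kept_shingles):
            continue
        seen_hashes.add(h)
        kept.append(doc)
        kept_shingles.append(s)
    return kept


corpus = [
    "The quick brown fox jumps over the lazy dog.",
    "The  quick  brown fox jumps over the lazy dog.",  # exact dup after normalization
    "A completely different sentence about model training data.",
]
print(len(deduplicate(corpus)), "documents kept")  # -> 2
```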
Those are some of the most exciting data strategies I’ve seen AI startups begin to leverage, but as always, I’d love to hear from folks who are seeing other strategies work in exciting ways!