
Vana plans to let users rent out their Reddit data to train AI

In the generative AI boom, data is the new oil. So why couldn’t you sell yours?

From large tech companies to startups, AI makers license e-books, images, videos, audio files and more from data brokers, all with the goal of training more capable (and more legally defensible) AI-based products. Shutterstock has deals with Meta, Google, Amazon and Apple to supply millions of images for model training, while OpenAI has signed agreements with several news organizations to train its models on their archives.

In many cases, the individual creators and owners of that data haven’t seen a penny of the money changing hands. A startup called Vana wants to change that.

Anna Kazlauskas and Art Abal, who met in an MIT Media Lab class focused on building technology for emerging markets, co-founded Vana in 2021. Before Vana, Kazlauskas studied computer science and economics at MIT, which she ultimately left to launch a fintech automation startup, Iambiq, through Y Combinator. Abal, a corporate lawyer by training, was a partner at The Cadmus Group, a Boston-based consulting firm, before leading impact sourcing at the data annotation company Appen.

With Vana, Kazlauskas and Abal set out to build a platform that lets users “aggregate” their data – including chats, voice recordings and photos – into datasets that can then be used to train generative AI models. They also want to enable more personalized experiences – for example, daily motivational voice messages based on your wellness goals, or an art-generating app that understands your style preferences – by fine-tuning public models on that data.

“Vana’s infrastructure actually creates a treasure trove of user-owned data,” Kazlauskas told TechCrunch. “It does this by allowing users to aggregate their personal data in a non-custodial way…Vana allows users to own AI models and use their data in AI applications.”

Here is how Vana presents its platform and API to developers:

The Vana API connects a user’s cross-platform personal data… to allow you to personalize your application. Your application gains instant access to a user’s custom AI model or underlying data, simplifying onboarding and eliminating computational cost issues… We believe users should be able to import their personal data from walled gardens, like Instagram, Facebook and Google, to your application, so you can create an amazing personalized experience from the first time a user interacts with your consumer AI application.
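To make that pitch concrete, here is a rough sketch of what a developer integration along these lines might look like. Everything in it – the endpoint paths, field names and authentication scheme – is invented for illustration and is not Vana’s documented API.

    # Hypothetical sketch of a personal-data API integration; the routes and
    # fields below are placeholders, not Vana's real endpoints.
    import requests

    API_BASE = "https://api.example-personal-data.com/v1"  # placeholder base URL
    API_KEY = "YOUR_API_KEY"  # placeholder credential

    def get_personalized_reply(user_id: str, prompt: str) -> str:
        """Ask the service to answer a prompt using the user's personal AI model."""
        resp = requests.post(
            f"{API_BASE}/users/{user_id}/model/generate",  # assumed route
            headers={"Authorization": f"Bearer {API_KEY}"},
            json={"prompt": prompt},
            timeout=30,
        )
        resp.raise_for_status()
        return resp.json()["text"]  # assumed response field

The appeal for developers, per Vana’s pitch, is that personalization comes from the user’s own model and data rather than from information the app has to collect and store itself.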

Creating an account with Vana is fairly simple. After confirming your email, you can attach data to a digital avatar (such as selfies, a description of yourself and voice recordings) and explore apps built using Vana’s platform and datasets. The app selection ranges from ChatGPT-style chatbots and interactive storybooks to a Hinge profile generator.

Image credits: Vana

Now why, you might ask – in an age of heightened data privacy awareness and ransomware attacks – would anyone hand their personal information to an anonymous startup, much less a venture-backed one? (Vana has raised $20 million so far from Paradigm, Polychain Capital and other backers.) Can a for-profit company really be trusted not to abuse or mismanage the monetizable data it gets its hands on?

Vana Reddit DAO

Image credits: Vana

In response to that question, Kazlauskas emphasized that Vana’s main goal is for users to “take back control of their data,” noting that Vana users have the option to self-host their data rather than store it on Vana’s servers, and to control how their data is shared with apps and developers. She also argued that because Vana makes money by charging users a monthly subscription (starting at $3.99) and levying a “data transaction” fee on developers (e.g., for transferring datasets to train AI models), the company has no incentive to exploit users and the troves of personal data they bring with them.

“We want to create models owned and governed by users who all bring their data,” Kazlauskas said, “and allow users to bring their data and models with them into any application.”

Now, while Vana doesn’t sell user data to companies for training generative AI models (at least, that’s what it claims), it wants to let users do so themselves if they choose – starting with their posts on Reddit.

This month, Vana launched what it calls the Reddit Data DAO (decentralized autonomous organization), a program that pools multiple users’ Reddit data (including their karma and post history) and lets them decide together how that combined data is used. After connecting a Reddit account, submitting a data export request to Reddit and uploading the resulting data to the DAO, users gain the right to vote alongside other DAO members on decisions such as licensing the combined data to generative AI companies for shared profit.
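For readers who want a mental model of those mechanics, here is a toy sketch of how a collective licensing vote in a data DAO could work. It is purely illustrative – the quorum rule, one-member-one-vote weighting and data structures are assumptions, not Vana’s actual implementation.

    # Toy model of a data DAO's licensing vote; every rule here is an assumption.
    from dataclasses import dataclass, field

    @dataclass
    class LicensingProposal:
        description: str              # e.g. "License pooled Reddit data to an AI lab"
        votes_for: int = 0
        votes_against: int = 0
        voters: set = field(default_factory=set)

        def cast_vote(self, member_id: str, approve: bool) -> None:
            """Record one vote per member (one member, one vote is assumed)."""
            if member_id in self.voters:
                raise ValueError("member already voted")
            self.voters.add(member_id)
            if approve:
                self.votes_for += 1
            else:
                self.votes_against += 1

        def passes(self, total_members: int, quorum: float = 0.5) -> bool:
            """Pass if a quorum of members voted and a majority approved (assumed rule)."""
            turnout = len(self.voters) / total_members
            return turnout >= quorum and self.votes_for > self.votes_against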

This is something of a response to Reddit’s recent moves to commercialize data on its platform.

Previously, Reddit didn’t gate access to its posts and communities for generative AI training purposes. But it reversed course late last year, ahead of its IPO. Since the policy change, Reddit has raked in more than $203 million in licensing fees from companies including Google.

“The general idea (with the DAO) is to free user data from the major platforms that seek to hoard and monetize it,” Kazlauskas said. “This is a first, and part of our efforts to help people aggregate their data into user-owned datasets to train AI models.”

Unsurprisingly, Reddit – which doesn’t officially work with Vana – isn’t happy with the DAO.

Reddit has banned Vana’s subreddit dedicated to DAO discussions. And a Reddit spokesperson accused Vana of “exploiting” its data export system, designed to comply with data privacy regulations such as GDPR and the California Consumer Privacy Act.

“Our data arrangements allow us to put guardrails on these entities, even on public information,” the spokesperson told TechCrunch. “Reddit does not share non-public personal data with commercial companies, and when Redditors request the export of their data from us, they receive non-public personal data from us in accordance with applicable laws. Direct partnerships between Reddit and approved organizations come with clear terms and responsibilities, and those partnerships and agreements prevent the misuse and abuse of people’s data.”

But does Reddit have any real reason to worry?

Kazlauskas envisions the DAO growing large enough that it affects how much Reddit can charge customers for its data. That is far from happening, assuming it ever does; the DAO has just over 141,000 members, a tiny fraction of Reddit’s 73 million users. And some of those members may well be bots or duplicate accounts.

Then there is the question of how to fairly distribute the payments the DAO might receive from data buyers.

Currently, the DAO awards “tokens” – a cryptocurrency – to users in proportion to their Reddit karma. But karma may not be the best measure of the quality of contributions to the dataset, particularly for users in smaller Reddit communities with fewer opportunities to earn karma.
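As a back-of-the-envelope illustration of why that matters, here is what a straight karma-proportional split would look like. The exact formula Vana uses isn’t public; this pro-rata allocation is an assumption, but it shows how members of small communities end up with tiny shares.

    # Assumed pro-rata token allocation, proportional to Reddit karma.
    def allocate_tokens(karma_by_member: dict, total_tokens: float) -> dict:
        total_karma = sum(karma_by_member.values())
        return {
            member: total_tokens * karma / total_karma
            for member, karma in karma_by_member.items()
        }

    # A power user in big subreddits dwarfs a regular in a niche community.
    print(allocate_tokens({"power_user": 50_000, "niche_regular": 300}, 1_000))
    # {'power_user': ~994.0, 'niche_regular': ~6.0}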

Kazlauskas floats the idea that DAO members could choose to share their cross-platform and demographic data, which would make the DAO potentially more valuable and encourage signups. But it would also require users to trust Vana even more to handle their sensitive data responsibly.

Personally, I don’t see Vana’s DAO reaching critical mass; there are far too many obstacles standing in the way. But I suspect it won’t be the last grassroots attempt to assert control over the data increasingly being used to train generative AI models.

Startups like Spawning are working on ways for creators to set rules governing how their data is used for training, while vendors like Getty Images, Shutterstock and Adobe continue to experiment with compensation schemes. But no one has cracked the code yet. Can it even be cracked? Given the cutthroat nature of the generative AI industry, it’s certainly a tall order. But maybe someone will find a way – or policymakers will force one.
