CoinFund is proud to lead a $3.1 million financing for Bagel Network, a startup building a decentralized protocol for collaborative embeddings datasets and expanding the decentralized AI computable data ecosystem. To catch up on CoinFund’s ongoing research and investments in the emerging web3 x AI stack, please see our previous content on Worldcoin and Giza, our 2022 AI x web3 overview, and our Gensyn seed thesis.
As an introduction, vector embeddings are a way to convert words, sentences, images, and other data into mathematical objects (vectors) while preserving both the meaning of each item and its relationships to other items. For example, an embedding could translate the word “apple” into a 200-dimensional vector based on the word’s context across a large dataset. This vector would capture the essential meaning of apple and its relationships to related concepts like fruit, orchard, and pie. While the primary applications of vector embeddings have been in text, with flagship models such as Word2Vec and GloVe, embeddings can also be produced for other kinds of data, including images and audio. This context is critical as AI development becomes increasingly focused on multi-modality, where a model can take text, image, or audio as input and produce output in any of these three media as well. Furthermore, embedding models can represent richer entities: user-specific embeddings that capture preferences, behaviors, and characteristics, or product-level embeddings that capture a product’s attributes, features, and other semantic information.
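To make the idea of "relationships between concepts" concrete, here is a minimal sketch of how similarity between embeddings is commonly measured, using cosine similarity. The four-dimensional vectors below are invented purely for illustration; real embeddings come from a trained model and typically have hundreds of dimensions.

```python
import math

# Toy 4-dimensional vectors. These numbers are made up for illustration;
# they are NOT the output of Word2Vec, GloVe, or any real model.
EMBEDDINGS = {
    "apple":   [0.9, 0.8, 0.1, 0.0],
    "fruit":   [0.8, 0.9, 0.2, 0.1],
    "orchard": [0.7, 0.5, 0.4, 0.1],
    "laptop":  [0.1, 0.0, 0.9, 0.8],
}


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def nearest(word, k=2):
    """Return the k words whose vectors are most similar to `word`."""
    query = EMBEDDINGS[word]
    scored = [
        (other, cosine_similarity(query, vector))
        for other, vector in EMBEDDINGS.items()
        if other != word
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


neighbors = nearest("apple")
for name, score in neighbors:
    print(f"{name}: {score:.3f}")
```

With these toy values, "fruit" and "orchard" rank closest to "apple" while "laptop" scores far lower, which is the kind of structure a vector database indexes and searches.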
The commercial opportunity for vector databases has grown rapidly over the past 12 months alongside the mainstream adoption of early consumer AI applications such as ChatGPT, Midjourney and Runway, just to name a few. Bagel is among the first web3-native attempts to combine a vector embeddings database with an incentivized marketplace protocol, leveraging web3 primitives to supercharge permissioned sharing of, and collaboration on, data and models. We see a potential path to winning the web3-native category from both a product and an incentivized network perspective. We also believe the team can move quickly to preserve its early leadership, given founder Bidhan Roy’s cross-disciplinary professional experience on the Amazon Alexa team, at Instacart, and at Arweave. We believe Bagel Network is a key enabler for the next generation of AI applications. Their adoption today remains bottlenecked by the ability to provide contextualized, highly applicable, use case-specific responses, which in turn is gated by an insatiable demand for training data, especially as most of the world’s data remains unstructured.
While some web2 embeddings companies (both VC-funded and corporate spinouts) are part of the broader competitive set, Bagel Network has been able to ship quickly to take advantage of its time-limited opportunity to lead the embeddings category from a web3-native perspective, with an already-live demo, SDK and pilot users. Longer term, we believe that Bagel’s approach of building a decentralized protocol and marketplace for indexed vector embedding datasets positions the network at the intersection of two mutually reinforcing trends: the rise of LLMs (and derivative applications) and the embrace of the permissionless, transparent and decentralized core values of web3.
While the market for vector embeddings is nascent, there are data points we can consider. First, we can look at the relational database management market as an established comparable whose scale this market could eventually approach (source). Today that market is worth $69.44B and growing at a CAGR of 12%. There’s also the end-market analysis: primary industries served by vector embeddings include image recognition ($38B), recommendation engines ($4.55B), and AI chatbots ($5.4B), which collectively are projected to grow at a CAGR of 20–40% through 2030. Lastly, global spending on artificial intelligence (including ML, AI robotics, computer vision, NLP and sensor tech) is now projected to grow from $300B+ in 2024 to $700B+ by 2030. With these figures in mind, vector embeddings are likely to play a role as an enabling technology for the increasingly capable multi-modal AI models and applications that will emerge over the next decade.
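As a quick sanity check on the last projection, the growth rate it implies can be backed out with the standard compound annual growth rate formula. The sketch below treats the quoted "$300B+" and "$700B+" as exact endpoints, which is an assumption; since both are lower bounds, the result is only approximate.

```python
def implied_cagr(start_value, end_value, years):
    """Compound annual growth rate implied by two endpoints."""
    return (end_value / start_value) ** (1 / years) - 1


# Endpoints taken from the paragraph above, in billions of dollars.
# Treating "$300B+" and "$700B+" as exactly 300 and 700 is an assumption.
ai_spend_cagr = implied_cagr(300.0, 700.0, years=2030 - 2024)
print(f"Implied AI spending CAGR: {ai_spend_cagr:.1%}")
```

Under those assumptions the implied rate is about 15% per year, somewhat above the 12% quoted for relational databases and below the 20–40% range cited for the individual end markets.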
We believe that Bagel Network will supercharge permissioned sharing and collaboration through its cryptonative marketplace model, solving key problems within the data layer of the AI tech stack. This fits the web3 ethos of permissionless access and collaboration while delivering needed infrastructure for the next generation of AI. Currently, a disproportionate amount of data is owned and controlled by large entities, which boxes out smaller organizations through limited access to high-quality datasets or simply through the compounding effect of scaled intelligence. Bagel Network redefines the AI data landscape by creating a two-sided marketplace where machine learning engineers, researchers, and AI agents collaboratively build, trade, and license datasets. Embedding generation is often one of the most computationally intensive parts of an AI pipeline, and when separate teams compute embeddings for the same data independently, the result is redundancy across today’s vector database systems: inefficiency, higher costs, and duplicated work. Bagel Network allows models to share embeddings, avoiding that duplication. Sharing is more efficient while still retaining attribution via blockchain metadata, along with the other ingredients needed to fairly share future monetization, which helps route around cold-start-related friction. In the context of artificial intelligence, we are already seeing open source efforts to replicate closed source datasets to advance model improvement (see RedPajama-Data, a reproduction of the LLaMA training dataset, or the Mistral/Mixtral open model approach).
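The efficiency argument can be illustrated with a small sketch of a shared embedding store: embeddings are keyed by a hash of the model name and the content, so a second team requesting the same embedding reuses the stored vector instead of recomputing it, and the original contributor stays attached to the entry. This is a hypothetical in-memory illustration, not Bagel's actual API or on-chain design; `SharedEmbeddingStore`, `fake_embed`, and the model name are invented for the example.

```python
import hashlib


class SharedEmbeddingStore:
    """Hypothetical in-memory store; not Bagel's actual API."""

    def __init__(self, embed_fn, model_name):
        self._embed_fn = embed_fn
        self._model_name = model_name
        self._entries = {}
        self.compute_calls = 0  # counts how often the model was really run

    def _key(self, text):
        # Same model + same content -> same key, so work is deduplicated.
        payload = f"{self._model_name}\n{text}".encode("utf-8")
        return hashlib.sha256(payload).hexdigest()

    def get_or_compute(self, text, contributor):
        key = self._key(text)
        entry = self._entries.get(key)
        if entry is None:
            # First request: pay the compute cost and record attribution.
            self.compute_calls += 1
            entry = {"vector": self._embed_fn(text), "contributor": contributor}
            self._entries[key] = entry
        return entry


def fake_embed(text):
    # Stand-in for an expensive embedding model call (assumption).
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    return [byte / 255 for byte in digest[:4]]


store = SharedEmbeddingStore(fake_embed, model_name="toy-model-v0")
first = store.get_or_compute("apple pie recipe", contributor="team-a")
second = store.get_or_compute("apple pie recipe", contributor="team-b")
print(store.compute_calls, second["contributor"])
```

Here the second request triggers no new computation and the entry remains attributed to the first contributor. In a decentralized setting that attribution record is what would sit in blockchain metadata rather than in a Python dictionary.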
We anticipate that a vector database coupled with a decentralized network can out-compete by leveraging open source and collaborative development (an approach that has already won in the backend web2 stack). For example, a smart contract can manage permissioned access down to discrete embeddings, which is not possible with a GitHub-like centralized approach. A protocol can reward data contributions, monitor and incentivize network participation (through forking), and track compute resource usage. Current vector database solutions lack support for collaboration, while open-source platforms like GitHub and HuggingFace lack incentives to produce high-quality embeddings. Today, much high-quality data sits within enterprises and public datasets, fragmented and underexploited; in the future it can be onboarded, aligned, and monetized. Finally, an open marketplace allows multiple teams to develop embedding collections simultaneously under permissioned access, much like open-source software but for vectorized datasets. This catalyzes innovation across sectors, in contrast to siloed efforts.
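To illustrate the kind of logic such a contract might encode, the sketch below models per-collection access grants and contributor rewards in plain Python. It is a simplified stand-in under stated assumptions, not a real smart contract and not Bagel's protocol: the class, method names, and flat per-read fee are all hypothetical.

```python
class EmbeddingAccessRegistry:
    """Plain-Python model of contract-style access control (hypothetical)."""

    def __init__(self):
        self._owners = {}   # collection_id -> owner
        self._grants = {}   # collection_id -> set of permitted readers
        self.rewards = {}   # owner -> accumulated fee units

    def register(self, collection_id, owner):
        if collection_id in self._owners:
            raise ValueError("collection already registered")
        self._owners[collection_id] = owner
        self._grants[collection_id] = set()
        self.rewards.setdefault(owner, 0)

    def grant(self, collection_id, caller, grantee):
        # Only the owner may extend access to a specific collection.
        if self._owners.get(collection_id) != caller:
            raise PermissionError("only the owner can grant access")
        self._grants[collection_id].add(grantee)

    def read(self, collection_id, caller, fee=1):
        owner = self._owners[collection_id]
        if caller != owner and caller not in self._grants[collection_id]:
            raise PermissionError("caller has no access to this collection")
        if caller != owner:
            # Each permitted read credits the contributor (flat fee assumed).
            self.rewards[owner] += fee
        return True


registry = EmbeddingAccessRegistry()
registry.register("recipes-v1", owner="alice")
registry.grant("recipes-v1", caller="alice", grantee="bob")
registry.read("recipes-v1", caller="bob")
print(registry.rewards["alice"])
```

The point of the sketch is the granularity: access and rewards are tracked per collection and per reader, which is the property the paragraph above contrasts with repository-level permissions on a centralized host.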
As with any investment, many risks (execution, competition, scaling, monetization) exist with such an ambitious vision. However, we believe that Bagel Network exhibits promising early traction and is well-positioned in a high-growth market with several secular tailwinds in its favor, especially given a currently uncrowded greenfield opportunity to design and launch a leading web3 implementation that is well-aligned with the AI/data value creation flywheel. Ultimately, CoinFund views Bagel’s long-term vision of creating a decentralized marketplace for machine learning computable datasets as a missing and critical piece of the web3 stack being built for AI/ML use cases. While it is still early days, we believe that the market potential outweighs the risks, hence CoinFund’s high-conviction bet and our excitement to roll up our sleeves together with Bidhan Roy and the rest of the Bagel team. To learn more or sign up as an early data partner, visit www.bagel.net!