Open-Source AI — Challenges, Opportunities & Ecosystem

Abel Samot
Red River West
Mar 18, 2024 · 13 min read


After the ChatGPT wave, a new wave has been taking over the AI world: open source. As you can see in the graph below, over the last couple of months more and more people have been talking about open-source AI, with the emergence of models like Llama 2, Mixtral 8x7B, etc. that almost match the performance of the best closed-source LLMs from Google and OpenAI.

Google Trends: searches for “open-source artificial intelligence”

With the recent lawsuit between Elon Musk and OpenAI, the clash between open-source and closed-source AI has reached its climax.

So in this article (which I mostly wrote before this clash), I’ll try to give you a clearer picture of the impact of open source on the history and future of AI. Why are AI and open source intrinsically linked? What are the different layers that constitute the open-source AI ecosystem? Who are tomorrow’s leading startups in that field? Why do I believe in open-source AI? How can open-source AI models and tools be monetized? Are “open source” AI models really open source, and what interest do big tech companies have in publishing open-source models?

A brief history of open source AI

AI comes from research, which is open by nature. In his 1950 paper “Computing Machinery and Intelligence”, Alan Turing introduced AI to the world. Since then, the idea has become a reality thanks to thousands of researchers across the world who have openly collaborated to advance AI.

In the 2010s, deep learning and machine learning became far more popular, with a growing number of data scientists and new technological advances in NLP, computer vision, etc.

Open source not only had an important role in these technological advancements, but it also allowed developers around the world to use these technologies easily thanks to libraries such as TensorFlow (developed by Google), PyTorch (developed by Facebook), Keras, and Scikit-learn released with open-source licenses.

It was also an open-source project, BERT, developed by Google, that brought the spotlight to transformers and sparked the revolution that led to ChatGPT and all the models that amaze us today.

When OpenAI released ChatGPT and shocked the world, it chose a closed-source approach. At the time, many people believed that, given the computing costs needed to run these algorithms, open source couldn’t catch up. But with the release of models such as Llama 2, Falcon, Stable Diffusion, and Mistral 7B, we are seeing a paradigm shift.

Beyond open-source foundation models, several vendors have emerged over the years to offer open-source tools for different parts of the AI development process, from synthetic data platforms to AI deployment software and model monitoring platforms.

The open-source AI ecosystem

When people speak about “open-source AI”, they often refer to models like Llama 2 or Mistral. But the open-source AI ecosystem goes far beyond that. The mapping below shows the most important libraries and companies advancing the open-source AI ecosystem. All of the companies in the mapping have either released open-source tools or simplified the use of open-source models (though some of them are closed source themselves).

This mapping splits companies and frameworks into 10 categories:

  • Generative AI — LLM Developers: creators of open-source Transformer-based models, such as Mistral, Stability AI, and Meta, but also Google (with BERT), and others.
  • LLM Hosting & Deployment: Deploying and fine-tuning open-source LLMs is not easy, especially if you want to optimize your infrastructure to reduce your costs. That’s why a lot of actors like Together.AI or Baseten offer to do it for you and provide simple APIs to access any open-source LLMs, fine-tune them, etc.
  • AI Observability & Monitoring: AI observability and monitoring tools help in understanding and managing the performance of AI systems. This includes tracking the model’s decision-making process, ensuring the AI’s outputs are as expected, and identifying any issues or biases that might arise during operation.
  • Generative AI — Image Model Developers: similar to LLM developers, this category covers the creators of AI models that generate images. Today, Stability AI might be one of the only open-source companies whose performance matches that of DALL·E or Midjourney.
  • Vector Databases: Vector databases are specialized storage systems designed to efficiently handle and search through high-dimensional data points, such as those used in AI applications for recommendations, similarity searches, and more. They are crucial for supporting the backend of many AI-powered services.
  • AI Application Development & Deployment: This broad category (that could be split into 2 or 3) encompasses the tools, platforms, and practices for building and deploying AI-powered applications. It includes everything from the initial development of AI models to their integration into user-facing applications and services.
  • Privacy, Governance & Risk Management: As AI systems handle increasingly sensitive data and make more critical decisions, managing privacy, governance, and associated risks becomes crucial. This category covers the policies, technologies, and practices designed to ensure AI systems are used ethically, responsibly, and in compliance with laws and regulations.
  • Platforms to Train Open Source Vision AI Algorithms: These platforms offer the necessary infrastructure and tools for developing and training computer vision AI models, which can recognize patterns, objects, scenes, and more. While some of the best computer vision algorithms are open source thanks to frameworks like OpenCV and TensorFlow, most companies that provide the tools for training and utilizing these algorithms remain closed source. Although the majority of attention from investors, users, and entrepreneurs has been on LLMs in recent years, I believe that computer vision will also spark a significant revolution in various sectors including industry, agriculture, supply chain management, and much more.
  • Synthetic Training Data: To train AI models, especially in situations where real-world data is scarce, expensive, or sensitive, synthetic data can be used. This category involves the creation of artificial data that mimics real data, allowing AI models to learn and improve without compromising privacy or security.
  • Bonus — Most Popular AI Frameworks: Yes, VCs don’t usually include libraries that are not companies in their mapping, but as a data scientist, how could I do an open-source AI mapping without mentioning these frameworks? AI frameworks are libraries or tools that make it easier to develop AI models by providing pre-built components and structures. These frameworks support a wide range of AI development activities, from basic machine learning to advanced deep learning tasks.
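To make the vector-database category above concrete: the core operation these systems optimize is nearest-neighbor search over embeddings. Here is a minimal sketch in plain NumPy (a toy in-memory index with made-up vectors, not a real vector database, which would add approximate indexing, persistence, and filtering on top):

```python
import numpy as np

def cosine_top_k(index: np.ndarray, query: np.ndarray, k: int = 3) -> np.ndarray:
    """Return the indices of the k stored vectors most similar to the query."""
    # Normalize rows so that a dot product equals cosine similarity.
    index_norm = index / np.linalg.norm(index, axis=1, keepdims=True)
    query_norm = query / np.linalg.norm(query)
    scores = index_norm @ query_norm        # one similarity score per stored vector
    return np.argsort(scores)[::-1][:k]     # best matches first

# Toy "database" of four embeddings in a 3-dimensional space.
db = np.array([[1.0, 0.0, 0.0],
               [0.9, 0.1, 0.0],
               [0.0, 1.0, 0.0],
               [0.0, 0.0, 1.0]])
print(cosine_top_k(db, np.array([1.0, 0.05, 0.0]), k=2))  # → [0 1]
```

Real vector databases do essentially this over millions of embedding vectors, using approximate-nearest-neighbor indexes so the search stays fast.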

Why open-source the AI model layer?

AI is too powerful to be in the hands of a few big tech companies. I think many of us agree on that point.

Elon Musk said it himself: “The ‘open’ in OpenAI is all about open source”. OpenAI was created with that purpose, but it moved away from it to maximize profit, attract new investors, etc.

This is why Elon Musk is currently suing them, as reported in this TechCrunch article. Although I may not have all the details, it is understandable. Imagine donating $45 million to a foundation to prevent Google or any other big tech company from monopolizing AI in the future, only to have Microsoft gain access to the majority of the foundation’s work after it transformed into a for-profit company.

But if what might be the best AI company today chooses to become closed-source, what are the advantages of open-source?

  • Less Bias and More Transparency: One of the most significant advantages of open-source AI is its potential to reduce bias and increase transparency. Open-source models are available for anyone to review, critique, and improve, which means that a diverse range of developers from different backgrounds can contribute to making these models more equitable and less biased. This collective oversight helps in identifying and mitigating biases that may have been inadvertently introduced, ensuring that AI technologies are more inclusive and fair.
  • Bigger Community and Faster Time to Market: Open-source AI benefits from the contributions of a vast global community of developers and researchers. This accelerates innovation and reduces the time to market for new advancements. A bigger community means more minds working on solving complex problems, sharing insights, and pushing the boundaries of what AI can achieve.
  • More Privacy: Open-source models enable users to understand and control how their data is used and processed. This transparency fosters trust and allows for the development of AI applications that respect user privacy and data sovereignty. It also allows companies to keep their data internally by hosting the models in their infrastructure. They don’t have to trust anyone with their data which might be their most precious resource.
  • Smaller Models Outperform Larger Ones for Some Tasks: In the realm of AI, bigger isn’t always better. Open-source AI has demonstrated that smaller models can outperform their larger counterparts on specific tasks. Techniques like Low-Rank Adaptation of Large Language Models (LoRA) and QLoRA have revolutionized the way we fine-tune these models, allowing for faster iterations and optimization. This efficiency means that open-source AI can deliver high-quality results faster and with less computational power and resources, making AI more accessible and sustainable.
  • On-device AI: Perhaps one of the most exciting advancements facilitated by open-source AI is the ability to run AI models on your own devices, such as smartphones, IoT devices, and edge computing platforms. On-device AI ensures that user data remains private and secure, as processing occurs locally rather than being sent to a remote server. This approach not only enhances privacy but also reduces latency, providing users with instant, real-time interactions with AI applications. I believe that Apple, which recently released Ferret, an open-source multimodal language model that surpasses GPT-4 in some ways, will be a big contender in the AI race. And I’m quite sure that within three years, most of its devices will have these kinds of models natively integrated.
  • More customizability: One of the most significant advantages of open-source AI is its high degree of customizability. This trait is particularly valuable for businesses, researchers, and developers who must tailor AI models to fit specific requirements or integrate them into complex systems. Unlike proprietary AI solutions that are often “black boxes,” open-source AI allows users to modify the code at its most fundamental levels.
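To make the LoRA technique mentioned above concrete: instead of updating a full weight matrix W during fine-tuning, LoRA freezes W and trains only a low-rank correction B·A, which has a tiny fraction of the parameters. A minimal NumPy sketch (the dimensions and rank here are illustrative, not taken from any real model):

```python
import numpy as np

d, r = 1024, 8                # model dimension and (much smaller) LoRA rank
rng = np.random.default_rng(0)

W = rng.standard_normal((d, d))          # frozen pretrained weight matrix
A = rng.standard_normal((r, d)) * 0.01   # trainable low-rank factor
B = np.zeros((d, r))                     # zero-init so the update starts as a no-op

def lora_forward(x: np.ndarray, scaling: float = 1.0) -> np.ndarray:
    # Full fine-tuning would update all d*d entries of W;
    # LoRA trains only A and B:  W @ x + scaling * B @ (A @ x)
    return W @ x + scaling * (B @ (A @ x))

full_params = d * d        # 1,048,576 trainable parameters for full fine-tuning
lora_params = 2 * d * r    # 16,384 trainable parameters for LoRA
print(f"LoRA trains {100 * lora_params / full_params:.1f}% of the parameters")
```

At rank r = 8, the trainable parameters amount to roughly 1.6% of full fine-tuning, which is where the faster, cheaper iterations described above come from; QLoRA pushes this further by quantizing the frozen weights.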

Despite all these points, there is currently a significant conflict on X (formerly Twitter) between supporters and opponents of open-source AI. So I would like to finish this part by debunking two key arguments made by those who are against Open-Source AI:

  • Firstly, some argue that AI should not be open source due to security concerns, as ill-intentioned governments could exploit it. However, as Marc Andreessen pointed out in his recent post on X, if China wanted access to OpenAI’s closed-source algorithms, it could employ various spying methods. Even the Manhattan Project (which was far better protected than GPT-4) was infiltrated by Soviet spies (as you all saw in Oppenheimer). In my opinion, it’s hard to imagine anyone creating an algorithm that surpasses everything currently available solely by relying on an open-source project. After all, the creators of an open-source project are the ones most capable of building upon it. I believe the risks of granting a single company a monopoly on artificial intelligence outweigh any concerns about providing free access to these algorithms.
  • There is an argument that pursuing AGI through open-source means is impossible due to a lack of resources. First, I do not find AGI controlled solely by profit-driven companies desirable. Moreover, the examples of Meta and Mistral, which have released open-source models comparable to GPT-4, demonstrate the fallacy of this argument. I believe that if a product is truly transformative, there will always be ways to monetize it as an open-source project. It may be more challenging, but in the next part of this article, I will explore potential avenues for monetization.

How can Open-Source AI be monetized?

Monetizing an open-source model

We have seen a lot of companies emerge like Mistral and Stability AI raising hundreds of millions to build Open-Source AI models. But the question remains: how will they make money? 💰

The most obvious way for these companies to monetize is to sell a hosting service on top of their model, with an API to access it. Indeed, most people don’t want the hassle of hosting the model themselves. Both Stability and Mistral offer this service, but the problem is that anyone can do the same thing.

Beyond AWS and Google Cloud, which provide tools to host these algorithms relatively easily, Hugging Face (which can be seen as the “GitHub of AI”: the place developers go to find the best available models) also offers tools to host and train these models easily.

Moreover, numerous companies such as Together.ai, Replicate, Baseten, etc. provide APIs that let you use and fine-tune the best open-source algorithms easily, making it a plug-and-play experience at a lower cost. Plus, you have the freedom to switch to a new open-source algorithm whenever one surpasses the model you are currently using.

So why would you pay for Mistral’s or Stability’s API when you can pay another provider like Together.ai an equivalent (or lower) amount and have the flexibility to access any model?
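This plug-and-play flexibility is easy to picture: many hosting providers accept OpenAI-style chat-completion payloads, so switching models often amounts to changing a single field. The sketch below only builds such a payload, it sends nothing over the network, and the model identifiers are illustrative placeholders rather than any specific provider’s documented catalog:

```python
# Sketch: building an OpenAI-style chat request dict (no network call).

def build_chat_request(model: str, prompt: str) -> dict:
    """Assemble a chat-completion payload for an OpenAI-compatible endpoint."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 256,
    }

# Today's choice of open-source model...
req_a = build_chat_request("mistralai/Mistral-7B-Instruct", "Summarize open-source AI.")

# ...and swapping to a newer one is a one-field change, not a rewrite:
req_b = {**req_a, "model": "meta-llama/Llama-2-70b-chat"}
```

Because the payload shape is identical across providers and models, the switching cost for customers is close to zero, which is exactly why model creators struggle to charge a premium for their own hosted API.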

I believe this is why Stability AI just changed its business model and introduced a subscription fee for access to its most advanced models.

It’s a choice that might prove successful in terms of monetization; however, it raises questions about the future of open-source AI. Will it become a poorer version of AI?

In this case, we can’t even talk about open-core anymore as the core of the product (i.e. the best algorithms) is closed source.

Unfortunately, Mistral also seems to be taking this path, releasing its latest flagship model, Mistral Large, as closed source.

I’m optimistic and believe there are other solutions available. I hope that the Mistral team, known for their fast shipping and open-source philosophy, could rely more on these solutions in the future.

Potential solutions could include:

  1. Keeping the best models open-source and offering specialized models that are fine-tuned for specific industries.
  2. Building a large developer community around their algorithms and finding ways to monetize this community through selling developer tools, creating a marketplace, etc.

However, the question remains: would these solutions be sufficient to handle the valuations of these companies?

Monetizing open-source AI developer tools

Open source plays a crucial role in the evolving AI developer-tools ecosystem, and I believe these tools and platforms have significant monetization potential. A prime example of a successful company built on an open-source developer tool in this field is Databricks, founded on the open-source Apache Spark project and recently valued at $43 billion.

I truly believe that open source is the way to go when it comes to building developer tools. There’s so much potential for great businesses to emerge in this category, from vector databases (Qdrant, Weaviate, etc.) to MLOps, synthetic data generation, and model monitoring. I have no doubt that we’ll witness the rise of numerous open-source AI unicorns in the next decade.

But are these models really Open-Source? What are the interests of Big Tech companies?

Researchers across the world have raised concerns about the “open” designation of some of these models. Indeed, models like LLaMA, Alpaca, Vicuna, and Koala have not adopted licenses that allow unrestricted commercial use.

For example, Llama 2 is available for free download, modification, and deployment. However, it is not covered by a conventional open-source license: its license prohibits using it to train other language models and requires a special license for deployment in an app or service with more than 700 million monthly active users. This level of control gives Meta significant technical and strategic advantages; for instance, it lets the company benefit from improvements made by external developers when using the model in its own apps.

For most big tech companies, open-source is a way to gain a technical advantage by benefiting from a community of people using their inferior models and tweaking them while keeping the best versions for themselves.

“What our analysis points to is that openness not only doesn’t serve to ‘democratize’ AI,” Meredith Whittaker said in an article. “Indeed, we show that companies and institutions can and have leveraged ‘open’ technologies to entrench and expand centralized power.”

Unlocking the full potential of AI and avoiding its downsides requires more openness, yet some AI projects claiming to be open source impose restrictions that contradict the essence of openness. I believe we should be careful to differentiate the projects that are really open source from those that aren’t, because we might be in for an unpleasant surprise if we don’t.

Conclusion

A few months ago, during an AI conference in Paris, Eric Schmidt (the former CEO of Google) said: “In my history, Open-Source has always ultimately won”. That has always been the case, and it will be the case for AI too. I’m convinced that, as time goes by, closed- and open-source algorithms will reach very similar performance. This will allow anyone to access the best algorithms, modify them, and build incredible tools from them. But it will also reduce the value of these models.

The biggest economic winners will be the developer tools built around these models and the companies with enough proprietary data to get unique outputs from them.

In both categories, open source will be key, leading to a surge of open-source unicorns and decacorns over the next 5 to 10 years.

Thanks to my colleagues Olivier and Maxime for helping me with this article. If you want to discuss open-source AI, don’t hesitate to contact us (abel@redriverwest.com — olivier@redriverwest.com — maxime@redriverwest.com).
