Learnings Exploring the GPT/LLM Space

Notes from an exploratory trip in the US

Louis Coppey
Point Nine Land

--

Blending the Golden Gate with the NY skyline on Midjourney

At Point Nine, we’ve been partnering with startups building machine learning (ML) infrastructure (e.g. SuperAnnotate, Soda, and a soon-to-be-announced one) or applied AI applications (e.g. Intenseye and Podcastle) for quite some time now. I also wrote a few posts about AI startups back in 2017 and 2018 (Winning Strategies for Applied AI Companies, Routes to Defensibility for your AI Startup).

But for the past few weeks, and especially since the advent of Large Language Models (LLMs), the pace of ML adoption by consumers and enterprises has reached an inflection point. Some say, “We’ve reached the iPhone moment of AI.”

The graph below, showing the pace of ChatGPT adoption, illustrates this.

Using LLMs to generate code and build software much faster may also fundamentally reshape how software companies are being built.

These were reasons enough for us to wrap our heads around the state of things in the LLM space sooner rather than later.

Europe has been gaining momentum for the past few years, but it still feels like things are happening faster in the US, so I booked a trip from Paris to NY and SF and went to a bunch of LLM meetups organised by our friend Nathan (Air Street and Venture Partner at Point Nine). There, I spent time with founders, VCs, and employees of tech companies experimenting with LLMs. Below are 9 learnings.

1/ Attention and energy are at their peak, BUT it’s very hard to find the “signal” in (all that) “noise”

In both San Francisco and New York, there’s an LLM meetup every other day, and they are all full. The opportunity to interact with computers through natural language and build AI-based software is very exciting for existing software companies, for new ones, and for academia (both students and researchers). The pace at which AI-based features are being launched, AI-first startups founded, and new research papers published and implemented is incredible.

All that attention, however, makes it very hard to differentiate hype (i.e. people wanting to play with the software) from real, sustainable business opportunities.

Some examples:

  • Open-source projects like AutoGPT below gain stars at an unheard-of pace. The challenge is that GitHub stars now correlate less with actual usage, which makes it harder to assess the real business opportunity behind each of these projects. I heard multiple times that “GitHub stars don’t mean anything anymore”.
  • The traction of AI-based features at existing software companies like Notion, Ironclad, or Duolingo is mind-blowing. Notion AI is generating tens of millions in ARR at $8/month/user 2 months after launch and could get to $100M by year-end. Azure also reported 2,500 OpenAI customers already, up 10x quarter-over-quarter (!!).
  • AI-native GenAI companies have amazing traction: Synthesia, ElevenLabs, RunwayML, EvenUp, and, less recently, Jasper.ai/Copy.ai all grew very quickly from 0 to ca. $/€10Ms in ARR and raised growth rounds. Growth investors don’t have an easy time evaluating these companies because a lot of this early traction might disappear as quickly as it came.

All of the above speaks to the interest and the “wow effect” generated upfront, but it doesn’t necessarily speak to long-term business value.

The original transformer paper, which describes the technology underlying the lion’s share of these LLMs, was called “Attention Is All You Need”. Attention is, unfortunately, not all you need to build a sustainable business long-term. Long-term user retention and net dollar retention are still BIG question marks for most of these businesses.

2/ In this early phase, implementing LLMs requires software engineering skills, not ML skills

The amazing thing is that most of the people implementing LLMs today are software engineers, not ML experts, and these AI products can now be built very quickly. The performance of “off-the-shelf” LLMs is so good out of the gate that companies very often find their early LLM experiments good enough to launch after just 1–2 quarters of building. I heard that from product leaders at Kustomer, Notion, and Duolingo. If you’re a SaaS company experimenting with LLMs, don’t be shy about launching your products early.

That being said, if we look a bit further out, my intuition is that this will likely change as companies start iterating on these products. One could argue that in order to be really successful, these companies will need to work on evaluating the performance of their models, fine-tuning them, and adjusting their prompts (i.e. the inputs given to LLMs), and this will likely require some ML expertise.

The ease of use and the performance of LLMs are likely to drive very significant adoption of ML, including by companies that are not ML-literate. The question then is: how important will data collection and MLOps tooling be in this new phase?

3/ LLMOps

Speaking of MLOps, a new field called LLMOps is just getting started: what’s the right tooling to deploy these models in production?

Our friends at Base10 published a good map below:

To put it simply, the few categories that are emerging are:

i) vector databases, i.e. the infrastructure needed to store the numerical representations of the inputs given to these LLMs (called embeddings; a toy sketch follows this list),

ii) prompt orchestration/engineering companies: how to iterate on the prompts, i.e. the natural-language inputs given to LLMs, and, once the models are deployed,

iii) broader observability companies: how do companies monitor the results (outputs) these LLMs provide to their users?
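
To make category i) concrete, here’s a toy, in-memory sketch of what a vector store does under the hood. The embed() function below is a fake stand-in so the snippet runs offline; in practice you’d call an embedding model, and real products add indexing, persistence, and approximate nearest-neighbour search at scale.

```python
import numpy as np

# Toy in-memory "vector database": store (text, embedding) pairs and
# retrieve the entries most similar to a query via cosine similarity.

def embed(text: str, dim: int = 64) -> np.ndarray:
    # Fake, per-run-deterministic embedding so the sketch runs offline;
    # a real system would call an embedding model here.
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    v = rng.standard_normal(dim)
    return v / np.linalg.norm(v)

class ToyVectorStore:
    def __init__(self) -> None:
        self.texts: list[str] = []
        self.vectors: list[np.ndarray] = []

    def add(self, text: str) -> None:
        self.texts.append(text)
        self.vectors.append(embed(text))

    def query(self, text: str, k: int = 3) -> list[tuple[float, str]]:
        q = embed(text)
        # On unit vectors, cosine similarity is just a dot product.
        scores = [(float(v @ q), t) for v, t in zip(self.vectors, self.texts)]
        return sorted(scores, reverse=True)[:k]

store = ToyVectorStore()
for doc in ["refund policy", "pricing tiers", "API rate limits"]:
    store.add(doc)
print(store.query("how much does it cost?", k=1))
```

The typical pattern built on top of this (“retrieval-augmented generation”) is to embed your documents once, retrieve the ones most similar to a user’s question, and paste them into the prompt.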

Yet, the space is moving so fast that it’s still challenging to understand what will be absorbed into the foundation model offerings and what will become standalone categories. Foundation models could also evolve to the point where we interact with them very differently.

Will prompt chaining (giving multiple sequential inputs to LLMs) still exist if models become amazing at understanding the initial intent of the user? Not that clear.

4/ Optimizing the performance of LLMs brings interesting new challenges

One of the challenges that implementing these LLMs brings is understanding how to optimize the results they provide. Before the advent of LLMs, optimizing ML models meant labelling more data and refining the model being used. With LLMs, companies can also work on “prompt engineering”: adjusting the natural-language input given to the model.

The post below explains well that optimizing LLMs now means:

i) finding the right mechanism to evaluate the performance of the LLM (i.e. building “eval datasets”; a minimal sketch follows this list),

ii) fine-tuning the model with more data,

iii) working on prompt engineering (adjusting the prompts given to the LLMs), and,

iv) managing the costs associated with training LLMs and running inference.
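
On point i), here’s a minimal sketch of what an eval loop can look like. call_llm() is a hypothetical stand-in for whatever model API you use, and exact-match scoring is the crudest possible metric; real evals are usually fuzzier (similarity scores, human grading, or an LLM judging another LLM).

```python
# Minimal eval harness: run a prompt template over a small "eval dataset"
# and score the outputs. call_llm() is a hypothetical stand-in for your
# model API (OpenAI, a self-hosted model, etc.).

def call_llm(prompt: str) -> str:
    raise NotImplementedError("plug your model API in here")

EVAL_SET = [
    {"input": "2 + 2", "expected": "4"},
    {"input": "capital of France", "expected": "Paris"},
]

PROMPT_TEMPLATE = "Answer concisely: {input}"

def run_eval(template: str) -> float:
    hits = 0
    for example in EVAL_SET:
        output = call_llm(template.format(input=example["input"])).strip()
        hits += output == example["expected"]  # crude exact-match scoring
    return hits / len(EVAL_SET)

# Prompt engineering then becomes measurable: tweak the template,
# re-run the eval, and keep whichever version scores best.
# score = run_eval(PROMPT_TEMPLATE)
```

The point is less the code than the workflow: once you have an eval set, prompt changes, fine-tunes, and model swaps become comparable numbers instead of vibes.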

At this stage, even the largest software companies are neither fine-tuning their models nor equipped with sophisticated LLMOps tooling. They simply use GPT-4 and have built some (very) simple custom tooling. It’s just very early!

5/ UX/UI for LLM-based experiences

At the meetup in NY, Linus Lee, Senior Research Engineer at Notion, gave a great presentation about building LLM-based user experiences and user interfaces (UX/UI). Now that the user experience is probabilistic (vs. deterministic), i.e. we don’t know for certain what output the LLM will show the user, controlling and improving the user experience brings a whole new set of challenges.

How do you show users that you don’t know whether the LLM is right? How do you warn them? How do you ingest their feedback? How do you avoid disappointing them when the LLM is completely wrong?

In short, Linus’ answer is to create “playful experiences” where the user clearly understands that they’re interacting (i.e. playing) with an AI-based user interface that might… very well make mistakes.
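
One simple pattern that follows from this (my own illustration, not Notion’s implementation): frame every model output as a draft the user explicitly accepts, edits, or rejects, and log that decision as feedback you can evaluate against later.

```python
from dataclasses import dataclass

# Sketch of a "suggestion, not truth" interaction pattern: the LLM output
# is shown as an editable draft, and every user decision is logged as
# feedback that can later feed evals or fine-tuning.

@dataclass
class Suggestion:
    prompt: str
    draft: str  # raw LLM output, displayed as an editable suggestion

feedback_log: list[dict] = []

def resolve(suggestion: Suggestion, action: str, final_text: str = "") -> str:
    # action is "accept", "edit", or "reject", driven by the UI.
    kept = "" if action == "reject" else (final_text or suggestion.draft)
    feedback_log.append({
        "prompt": suggestion.prompt,
        "draft": suggestion.draft,
        "action": action,
        "final": kept,
    })
    return kept

s = Suggestion(prompt="Summarize this page", draft="A first-draft summary…")
text = resolve(s, action="edit", final_text="A corrected summary.")
```

The “playfulness” lives in the framing: a draft invites correction, whereas an answer presented as truth invites disappointment.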

6/ OpenAI or not OpenAI? Trade-offs in choosing the right LLMs

Most people I met mentioned that, from a performance standpoint, GPT-4 was much better than the other foundation models available. But many of them also mentioned that using GPT-4 brings multiple issues, such as:

  • cost: OpenAI is several times more expensive than competitors, and the cost of building LLM-based products on GPT-4 can quickly become prohibitive if you have a lot of (free) users to whom you can’t pass on costs (see the back-of-the-envelope after this list),
  • latency: using the biggest model is not always the best way to optimize for latency, and,
  • privacy: not every company is willing to share its data with OpenAI.
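
To make the cost point concrete, here’s an illustrative back-of-the-envelope. All the numbers below are assumptions for illustration (roughly the 2023 ballpark for a GPT-4-class vs. a GPT-3.5-class model), not actual pricing; plug in whatever your provider charges.

```python
# Back-of-the-envelope serving cost for a free tier. All constants are
# illustrative assumptions, not quoted prices.

FREE_USERS = 100_000
REQUESTS_PER_USER_PER_DAY = 5
TOKENS_PER_REQUEST = 1_500  # prompt + completion, assumed

PRICE_PER_1K_TOKENS = {
    "big_model": 0.05,     # assumed GPT-4-class blended price
    "small_model": 0.002,  # assumed GPT-3.5-class price
}

def monthly_cost(price_per_1k: float) -> float:
    tokens = FREE_USERS * REQUESTS_PER_USER_PER_DAY * TOKENS_PER_REQUEST * 30
    return tokens / 1_000 * price_per_1k

for model, price in PRICE_PER_1K_TOKENS.items():
    print(f"{model}: ${monthly_cost(price):,.0f}/month")
# big_model: $1,125,000/month -> prohibitive for a free tier
# small_model: $45,000/month
```

Under these assumptions, a GPT-4-class model burns over a million dollars a month on a modest free tier, while a smaller model cuts that by ~25x, which is exactly the trade-off pushing teams towards alternatives.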

This opens up very interesting opportunities for open-source, edge-running/less compute-intensive, and/or cheaper alternatives to OpenAI.

There are also a number of very well-funded LLM startups that want to compete with OpenAI, such as Anthropic, Cohere, or Stability AI, as well as LLMs built by other large tech companies, like Google’s Bard. These startups are now all raising (mega) growth rounds to fund compute costs, i.e. the cost of training their models and serving their customers (inference).

7/ Costs of labelling, training, and inference are really high today, but they’ll go down as models get optimized and hardware gets better

The cost of training LLMs is estimated to be in the tens, if not hundreds, of millions of dollars. Just in compute. Gigantic datasets (sometimes the whole Internet!) are being collected and fed as input to models with tens, now hundreds, of billions of parameters.

I’ve heard multiple times that OpenAI’s training costs are estimated in the hundreds of millions of dollars, if not billions. As for inference (allowing customers to use the OpenAI API once the model has been trained), serving ChatGPT’s (hundreds of millions of) users also costs OpenAI hundreds of millions. This partly explains the $10Bn equity deal between Microsoft and OpenAI, even before considering the fact that Microsoft provides OpenAI with free compute to run its models. Overall, it might very well be that this deal is rather in the tens of billions of dollars.

Looking ahead, at the SF meetup, one of the speakers who’s a Research Software Engineer at Google predicted a 3–5x reduction in GPT-4-like costs every year for the next 10 years.
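
Taken at face value, that prediction compounds into a dramatic drop. A quick sanity check (simply compounding the predicted yearly factor, assuming it holds for the full decade):

```python
# Compounding a predicted 3-5x yearly cost reduction over 10 years.
for factor in (3, 5):
    print(f"{factor}x/year for 10 years -> ~{factor ** 10:,}x cheaper overall")
# 3x/year -> ~59,049x cheaper; 5x/year -> ~9,765,625x cheaper
```

If anything close to that holds, today’s prohibitive training and inference bills shrink to rounding errors.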

This raises some interesting questions concerning:

i) the structure of the foundation model market in the long run (how consolidated will it be? Who will win?), and

ii) the key success factors for the foundation model startups that are popping up (data, model, feedback mechanism, focus on a specific part of the market).

8/ Will most ML models be LLM-based in the future?

This is more a thought experiment than anything else, but one I find interesting nonetheless. The big success of LLMs is partly driven by the fact that transformers are amazing at learning from text, that a lot of knowledge is stored in text data available on the Internet, and that a lot of computing resources are now available. On top of that, it’s now possible to use not just text but also images as inputs to LLMs by projecting images into a “text space”. This article on multi-modality explains it well.

But does that mean that most ML models will be based on Large Language Models in the future? I am not an ML researcher, but intuitively, I wouldn’t say so. Isn’t there a big loss of information when we transform an image into text? If you have any input on this, please write to me, I am curious :)

9/ AI safety

From the petition signed a few weeks back calling for a pause on AI research, to Yoshua Bengio’s and Ian Hogarth’s recent articles, to the State of AI Report’s safety section, much has been written about AI safety recently, so it would be strange to write about LLMs without mentioning it. The key question is: are we creating autonomous AI agents that will ultimately act against our interests?

I don’t have a firm opinion on this yet, other than observing that, as of today, LLMs can pass the Turing test (i.e. their output can be hard to distinguish from a human’s), and AutoGPT shows that LLMs can give themselves tasks (i.e. work autonomously). At the same time, there are very few LLMs used in production today, and they still make many (basic) mistakes. Interestingly, they’re also much better at augmenting white-collar workers performing knowledge-based tasks than at automating blue-collar tasks that require basic cognitive skills.

Conclusion

To sum up all of the topics above and focus on where actual business value might accrue, here are a few recommendations:

  • SaaS startups should build AI features quickly: In 1–2 quarters, existing SaaS startups may be able to use LLMs and launch AI features that will amaze their users, just by writing a few prompts and without any ML skills. As the growth curves of Notion’s, Ironclad’s, and Duolingo’s AI products show, this could generate very strong growth at the top of the funnel.
  • Having these AI-based features may become the status quo, making them questionable moats: Building these initial (great) features might be so easy that most existing players in a given market will have them relatively soon. Therefore, vertical SaaS startups going after specific industries with smaller TAMs but less competition might benefit from them even more than their horizontal counterparts.
  • The tooling space is exciting but incredibly early: Longer term, some large companies will be built providing LLM tooling (LLMOps) to help companies implement LLMs. The space is so early that it’s difficult to know what’s here to stay and what will disappear, but some large companies will emerge, for sure. If you’re building in this space or have a strong opinion, I’d love to chat.
  • AI-native startups work if the tolerance for failure in their use case is high: Because LLMs still fail, AI-native startups will have an easier time driving user adoption if i) they augment (rather than fully automate) humans, and ii) the tolerance for failure in their use case is high. It’s not by chance that the first to grow very fast are content-production machines for SEO (mistakes don’t matter that much) rather than cancer-detection algorithms (algos can’t fail). Zetta wrote this great article on product payoffs in ML a few years back. I think it still applies today.
  • Working on retention is critical: The initial experience of users interacting with AI-based features is often mind-blowing, but companies will need to retain these users long term. This will usually mean finding ways to improve the models, but also providing value to users through traditional software tactics: building a system of record, workflow software, etc. Traditional SaaS ain’t dead; it’s more alive than ever!

Huge thanks to the (tens of) people who have spent time educating me on all of the above, it’s been fun!

And just for fun, here’s a Midjourney experiment blending my face with a picture of an Andy Warhol painting I took at the SFMOMA. If you haven’t yet, play with Midjourney, it’s fun!
