Routes to Defensibility for your AI Startup

A simple framework for understanding the impact of data network effects and incumbents’ advantages in your industry

Louis Coppey
Point Nine Land

--

Intro

One thing you learn quickly when you enter VC is that investing is about finding moats. Why? I could not find a better way to explain it than by paraphrasing Gil Dibner in this post.

In short: VCs are looking for companies that could be worth hundreds of millions or billions of dollars within the next 5–10 years, and:

  • Projected future cash flows are a proxy for valuation,
  • Ability to generate profit is a proxy for projected future cash flow,
  • Moats are a proxy for the ability to generate profits.

Why are moats a proxy for the ability to generate profits? Simply because moats increase a firm’s bargaining power with both its suppliers and its customers, helping it raise prices and reduce costs in order to generate higher profits. The upshot of this simple reasoning is that VCs look for companies building moats.

Network effects at play in marketplaces are a great example of a moat. The more places there are to rent on Airbnb, the more likely it is that demand will come to the platform, attracting even more homeowners to list their places on Airbnb. The loop is closed.

This mechanism generates a winner-takes-all dynamic, and very often the largest player in such a market becomes far larger than its competitors. That, in short, is why investors love marketplaces: if you are lucky enough to pick the market winner, there is a very high chance you’ll see high returns.

I. What is so special about AI companies?

Now, the interesting point is that AI brings a new type of network effect that some call the “data network effect”. Machine learning algorithms need data to work. While the relationship is not linear (more on that later), the accuracy of the predictions and classifications made by machine learning algorithms increases as they ingest more data.

The mechanism works as follows: as a company adds more customers, it gains more data from each of them to train and refine its algorithms. With more data, the accuracy of the predictions goes up, and with it the overall quality of the product. A better product helps convince new customers to purchase it and contribute their data in turn. The loop is closed.
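
To make the non-linearity concrete, here is a minimal sketch on a synthetic classification task (all numbers below are illustrative assumptions, not from this post): accuracy improves as the training set grows, but with diminishing returns.

```python
# A minimal sketch of diminishing returns to data on a synthetic
# classification task. All numbers are illustrative, not from the post.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=20_000, n_features=20, n_informative=10,
                           flip_y=0.05, random_state=0)

for n in [100, 500, 2_500, 12_500]:  # e.g. data from a growing customer base
    acc = cross_val_score(LogisticRegression(max_iter=1000),
                          X[:n], y[:n], cv=5).mean()
    print(f"{n:>6} samples -> cross-validated accuracy {acc:.3f}")
# Accuracy climbs with more data, but each extra order of magnitude helps less.
```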

The great thing is that this mechanism helps AI companies move along the customer adoption lifecycle. Early adopters are more tolerant of initial bugs or sub-par performance. By contributing their data and their feedback, they help AI startups build better algorithms and win over later-stage adopters.

There is another self-reinforcing feedback loop at play, which we could call the “talent attraction loop”. The more data a company owns, the more attractive it is for a data scientist to come and work there, and hence the higher the chance that the team will attract great talent to help build the best ML product.

This quote from Yoshua Bengio sums it up well: “AI is a technology that naturally lends itself to a winner take all […] The country and company that dominates the technology will gain more power with time. More data and a larger customer base gives you an advantage that is hard to dislodge. Scientists want to go to the best places. The company with the best research labs will attract the best talent. It becomes a concentration of wealth and power.” (More here.)

The problem is that a startup initially owns no (or very little) data and relies on only a small number of talented individuals, most often just the founders. Just like it takes time and resources for a marketplace’s network effects to kick in, the reinforcing loop at play for AI companies requires initial data.

And who owns this data?

Incumbents.

That’s why several industry observers have argued that incumbents have an unfair advantage in riding the AI wave (more in this interview with Marc Andreessen).

The great news for AI investors is that it’s not as simple as that. In the next part, I outline a framework for thinking about incumbents’ advantages in AI.

II. A framework to think about incumbents’ advantages

A simple equation that might explain part of the success of AI companies, and which follows from the diagram in the previous part, could be:

Success in AI = Data + ML Talent + Algorithms

In plain English, successful and defensible AI companies have “large enough datasets that ML talent can use to create the best algorithms.”

A useful method to think about incumbents’ advantage in AI is to look at a 2x2 matrix — thanks to Sam Levan, founder of Madkudu, who first introduced it to me — with the amount of data available per use case on one axis and the nature of the companies currently addressing each use case (tech vs. non-tech) on the other. The idea is then to compare the results of this equation for incumbents vs. startups.

Take the use cases addressed by large tech companies where each potential client has a lot of data. Here, incumbents’ advantages are very strong. Beyond the typical incumbents’ advantages (e.g. access to customers, a greater ability to invest and lose money), large tech companies sit on piles of data that they have been accumulating for years.

They also benefit from both their brand and their large financial resources to attract the best machine learning talent, who will then develop the best algorithms. Incumbents’ score: 3/3.

It seems fairly clear that new startups should not go head-to-head with tech incumbents in this situation. It would be like going after Google from scratch.

But incumbents’ advantages are not strong only in this quadrant. Let’s now take the bottom-right part of the matrix: use cases addressed by non-tech companies where each potential customer already sits on large amounts of data. Think of a motorway operator owning years of toll data.

History has shown that data might matter even more than the algorithms themselves, especially since the emergence of deep learning. This article from Edge gives an interesting demonstration in this regard:

“The average elapsed time between key algorithm proposals and corresponding advances was about eighteen years, whereas the average elapsed time between key dataset availabilities and corresponding advances was less than three years, or about six times faster, suggesting that datasets might have been limiting factors in the advances.”

In addition, large tech companies are continuously open-sourcing new ML packages, thus turning algorithms into commodities, especially for object recognition, language models or speech — what we call generalised ML. Because of generalised ML, non-tech companies sitting on large datasets can get to relevant results using open-source packages generously pre-trained on tech companies’ datasets.
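
As an illustration of what this looks like in practice, here is a minimal sketch that reuses an open-source model pre-trained on a tech company’s dataset (a torchvision ResNet pre-trained on ImageNet) and retrains only the final layer on a small in-house dataset. The class count and training setup are assumptions.

```python
# A minimal sketch of "generalised ML": take a torchvision ResNet pre-trained
# on ImageNet, freeze its features, and retrain only the final layer on a
# small proprietary dataset. Assumes a recent torchvision (>= 0.13).
import torch
import torch.nn as nn
from torchvision import models

model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)  # pre-trained
for p in model.parameters():
    p.requires_grad = False                      # freeze pre-trained features
model.fc = nn.Linear(model.fc.in_features, 5)    # e.g. 5 in-house classes

optimizer = torch.optim.Adam(model.fc.parameters(), lr=1e-3)
# ...then run a standard training loop over your (small) in-house dataset,
# updating only model.fc. The heavy lifting was done on ImageNet already.
```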

To sum it up: a large company, which is not necessarily a tech company and might not have top-notch machine learning expertise in-house, could still build better AI products than a small startup with the best ML experts, simply because it has access to more data than that startup.

In our example, the motorway operator benefits from many competitive advantages protecting it from a small startup with very little initial data.

As a consequence, we should probably weight data more heavily than ML talent in the equation:

Success = Data*Data + ML Talent + Algorithms

Let’s now take the top-left part of the matrix: use cases where only a small amount of data is available per customer, but which are addressed by large tech companies. A good example is predicting the likelihood that a lead becomes a customer (lead scoring). Each potential client does not have enough data to build a good enough prediction themselves using generalised ML.

Each of them has perhaps a few hundred data points and a few dozen predictors sitting in their CRM or marketing automation tool. This is likely not enough, and creates a risk of overfitting the prediction to the company’s own data.
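
To make the overfitting risk concrete, here is a minimal sketch on synthetic data with roughly those proportions (a couple of hundred samples, a few dozen predictors; all numbers are illustrative assumptions):

```python
# A minimal sketch of overfitting with few samples and many predictors:
# ~200 leads and 30 features, most of them uninformative.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=200, n_features=30, n_informative=5,
                           random_state=0)

model = RandomForestClassifier(random_state=0).fit(X, y)
train_acc = model.score(X, y)
cv_acc = cross_val_score(RandomForestClassifier(random_state=0),
                         X, y, cv=5).mean()
print(f"train accuracy {train_acc:.2f} vs cross-validated {cv_acc:.2f}")
# A large gap (train near 1.00, CV much lower) signals a model fitted to
# noise: the kind of prediction a single company's data cannot support.
```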

They will hence need to purchase a product built on a larger dataset. The question is then whether the CRM provider or a startup is the right player to sell this product.

Incumbents’ advantages here are less clear and there might still be many opportunities for startups.

Especially if they can:

  • combine different data sources that large tech companies do not have (e.g. Salesforce does not have access to Hubspot data), or,
  • generate additional proprietary data (more in the next part).

That leaves the space where the biggest opportunity might be: segments that are not addressed by tech companies and where customers do not have access to a large enough dataset for generalised ML to work well. Some segments of agriculture and healthcare are good examples: no large tech company dominates the market, and each customer only has a small amount of data.

In the next part, I’ll dive deeper into the left-hand side of the matrix, especially the bottom-left square.

III. The new moats

The good news with the new equation ‘Success = Data*Data + ML Talent + Algorithms’ is that when data is initially available in smaller quantities (let’s say, after normalising, data < 1), its impact is more limited than in the initial equation. ML talent and algorithms then have a higher impact on the output, and incumbents have less of an unfair advantage.

The immediate consequence is that a startup with the right machine learning talent and innovative algorithms has a chance to thrive in a market where data is scarce.
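
As a toy illustration of this normalisation argument (all values below are assumptions), here is what the squared-data equation does to a hypothetical data-richer incumbent and a talent-rich startup competing in a data-scarce market:

```python
# A toy reading of 'Success = Data*Data + ML Talent + Algorithms', with every
# term normalised so that 1 = "average". All values are assumptions.
def success(data, talent, algorithms):
    return data ** 2 + talent + algorithms

# Hypothetical players in a data-scarce market (data < 1 for both):
incumbent = success(data=0.6, talent=1.0, algorithms=1.0)  # more data, average ML
startup = success(data=0.4, talent=1.8, algorithms=1.6)    # less data, strong ML

print(f"data terms: incumbent {0.6 ** 2:.2f}, startup {0.4 ** 2:.2f}")
print(f"scores: incumbent {incumbent:.2f}, startup {startup:.2f}")
# Squaring shrinks the already-scarce data terms (0.36 and 0.16), so the
# outcome is driven by talent and algorithms, where the startup can win.
```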

Here are three ways, which are not mutually exclusive, to overcome this scarcity problem.

# Case 1: Collecting data from many customers

While each company taken separately might not have a large enough dataset to build a great AI product, an AI startup pooling datasets from several customers might be the only company able to build a product that meets their expectations. Each individual player would give away its data to benefit from algorithms trained on the larger, pooled dataset.

Think of a SaaS solution for greenhouses that combines data from multiple greenhouses to derive the best prediction of yield. Each greenhouse owner is probably not sitting on a large enough dataset, but would benefit tremendously from an AI agent building a better forecast or even controlling the whole greenhouse.
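
Here is a minimal sketch of the pooling idea on synthetic greenhouse data (feature names, coefficients and sample sizes are all illustrative assumptions): the same model trained on one customer’s few data points versus everyone’s pooled data, evaluated on held-out data.

```python
# A minimal sketch of Case 1 (pooling data across customers) on synthetic
# greenhouse data. Features, coefficients and sizes are illustrative.
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)

def greenhouse(n):
    """Synthetic yield data: temperature, humidity, CO2 -> yield + noise."""
    X = rng.normal(size=(n, 3))
    y = (2.0 * X[:, 0] - 1.0 * X[:, 1] + 0.5 * X[:, 2]
         + rng.normal(scale=2.0, size=n))
    return X, y

datasets = [greenhouse(10) for _ in range(20)]   # 20 customers, 10 points each
X_pool = np.vstack([X for X, _ in datasets])
y_pool = np.concatenate([y for _, y in datasets])

X_test, y_test = greenhouse(1000)                # held-out data for evaluation

solo = Ridge().fit(*datasets[0])                 # trained on one customer only
pooled = Ridge().fit(X_pool, y_pool)             # trained on the pooled dataset
print(f"solo R^2 {solo.score(X_test, y_test):.3f} vs "
      f"pooled R^2 {pooled.score(X_test, y_test):.3f}")
```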

Tom Tunguz brings an interesting counterpoint in this post, applying a few lessons from the adtech world.

# Case 2: System(s) of Intelligence

If we push this reasoning a little further, another reason why large datasets are not available is that they are siloed not only between different customers, but also between different SaaS tools — some of them Systems of Engagement (a website, Slack), others Systems of Record (a marketing automation tool, a CRM).

An AI startup that sits between these two datasets might very well be in the best position to build the best prediction and become what Jerry Chen from Greylock calls in this post a System of Intelligence.

Think about the CRM use case again. Isn’t the way leads react to marketing collateral a good predictor of their likelihood to buy? The problem is that Salesforce does not have this data, because it’s locked in Hubspot’s database.

Similarly, Hubspot does not know at which pace leads have been moving through the sales pipeline. Hence, provided that data is scarce in this market (the left side of the matrix), neither Salesforce nor Hubspot is in the right position to build the best prediction. An AI startup building its prediction on both databases could beat Salesforce and Hubspot in this endeavor.
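
A hypothetical sketch of what such a System of Intelligence might look like in code (the file names, column names and model choice below are all assumptions for illustration, not a real integration):

```python
# A hypothetical System of Intelligence: join CRM records with marketing
# engagement data, then score leads on the combined features. File names,
# columns and the model choice are illustrative assumptions.
import pandas as pd
from sklearn.linear_model import LogisticRegression

crm = pd.read_csv("crm_export.csv")              # e.g. pipeline-stage history
marketing = pd.read_csv("marketing_export.csv")  # e.g. email opens, page views

leads = crm.merge(marketing, on="lead_email", how="inner")

features = ["days_in_stage", "n_stage_changes",  # CRM-side signals
            "email_opens", "content_downloads"]  # marketing-side signals
model = LogisticRegression(max_iter=1000).fit(leads[features],
                                              leads["converted"])

leads["score"] = model.predict_proba(leads[features])[:, 1]
# Neither database contains all four predictors; the value sits in the join.
```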

A good way to think about this is to view datasets as complementary assets in the value chain. New, seemingly harmless AI startups can partner with companies that incumbents will never want to partner with, and thus build complementary assets protecting them from those incumbents.

The flip side of this statement is that any company relying on a single source of non-proprietary data is much less defensible than one combining several.

At the end of the day, it all comes back to answering the question: “Who makes money with my data?” Is it the company generating the data? The company storing it? Or the company building the best ML product on top of it?

This is not new with AI startups, but it could take on a whole different dimension with AI as people realize the value of their data. The same way Twitter killed the companies developing alternative Twitter clients, Salesforce could kill any startup deriving too much value from data stored in Salesforce.

There is one last case, which solves the issue of data ownership.

# Case 3: Owning unique datasets of user-generated data

If a company can’t collect data from multiple customers or from multiple SaaS tools, or if that is simply not enough to build a good enough prediction, then it can try to generate additional data from its own SaaS offering. This is a unique opportunity to build a proprietary dataset that no incumbent owns.

Our portfolio company Juro, which develops contract management software, for example builds a unique dataset to understand a contract’s formation and negotiation.

IV. Learning curves

The whole reasoning can be summed up by drawing learning curves that answer the question: “How much time, effort or funding is required to reach an accuracy high enough to meet customers’ expectations?”

In a situation where data is not scarce, the learning curve ramps up early: with little time, effort and funding, the company can get enough data to meet customers’ expectations. Hence the defensibility is relatively limited. This applies in particular to cases where the data used is publicly available.

At the opposite end of the spectrum, in a situation where data is scarce, the curve ramps up only much later: a lot of time, effort and funding is required to reach a high enough accuracy. Hence the defensibility is strong.

The moat is all the stronger since customers will likely not contribute their data, and the data network effect won’t kick in until a long time has passed.
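
Since the exact shape of these curves is speculative, here is a minimal sketch plotting the two situations as hypothetical S-curves of accuracy against cumulative time, effort and funding (the threshold and slopes are assumptions):

```python
# An illustrative sketch of the two learning curves described above, modelled
# as logistic S-curves of accuracy vs. cumulative effort. Shapes and the
# "customer expectation" threshold are assumptions, not data.
import numpy as np
import matplotlib.pyplot as plt

effort = np.linspace(0, 10, 200)
abundant = 1 / (1 + np.exp(-2.0 * (effort - 1.5)))  # data abundant: fast ramp
scarce = 1 / (1 + np.exp(-1.0 * (effort - 6.0)))    # data scarce: slow ramp

plt.plot(effort, abundant, label="data abundant (weak moat)")
plt.plot(effort, scarce, label="data scarce (strong moat)")
plt.axhline(0.9, linestyle="--", color="grey", label="customers' expectations")
plt.xlabel("time / effort / funding")
plt.ylabel("accuracy")
plt.legend()
plt.show()
```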

It’s important to emphasize that these situations are quite theoretical; they are just here to provide a framework for thinking about the defensibility stemming from data network effects.

The second situation, where data is scarce, might create a lot of defensibility, but it might also be a difficult one to be in, as the company may have to wait until after its Series A to meet customers’ expectations.

It’s also a difficult situation for us as seed investors, because we don’t know what the curve will look like after the seed. These curves look like S-curves, but they might actually be different. Uncertainty remains as to whether the product will ever be good enough to provide value to customers.

One last thing: ML defensibility and SaaS defensibility are not mutually exclusive. A very long product roadmap, a superior UX, or user/data lock-in are still very important contributors to a company’s defensibility, beyond the one that stems from data network effects.

Hence, whatever situation you’re in, if you’re building an ML startup — don’t hesitate to get in touch, I’d love to have a chat! I am sure we’ll enjoy chatting about defensibility ;)

A huge thanks to Savina Van der Straten, Clement Vouillon, Alex Flamant and Nathan Benaich for reviewing early (and very rough) drafts of this post.

--
