Hello! I am going to write my master’s thesis this spring, on data markets. For that reason, I am starting to investigate the concept. This blog post is mostly for myself, as a way to digest what I have learned, but here it is in semi-structured form for anyone interested.

what are data markets?

Let’s start with a little motivating example. Say that you and I each own a supermarket. We both have a lot of data about our customers, like what goods they like to buy, how sales affect their purchasing habits, or when the supermarket has the most customers. I could probably improve my supermarket somewhat by having access to your data, and vice versa. Neither of us would ever give our data away freely, though, for several reasons:

  1. We are competitors, so by giving you my data I would worsen my position in the market.
  2. The transfer of data requires effort that would not be rewarded.
  3. Business owners generally don’t do things without being compensated; not successful ones, at least.

Data markets are an attempt to increase social welfare (an economics term for how well an economic system allocates resources among consumers and producers) by incentivizing people to share their data with each other. The goal is that by compensating data owners for sharing their data, both data owners and data consumers can have their circumstances improved. By constructing a marketplace with buyers and sellers, we can build a mechanism for bidding on and selling data, letting both parties earn more and improving the overall function of the system in the process.

data as an economic good

Data as a good has a bunch of problems associated with it that other goods do not have. The crux of the problem is really that data is infinitely replicable. The same data can be shared once or ten thousand times. This makes it much harder to assign value to data. In a data auction, several people can own very similar data. If all of this data is of similar value to a buyer, it is very hard to split the money fairly among the sellers. To further complicate the situation, one malicious seller can pretend to be several people, and duplicate their own data to try to increase their part of the pie. We’ll get to proposed solutions for this later.
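To make the replication attack concrete, here is a toy calculation under the (naive) assumption that the market splits a buyer’s payment equally among the sellers of similar data; all names and numbers here are made up:

```python
# A buyer pays 90 for data that three honest sellers each hold a copy of.
payment = 90.0
sellers = ["alice", "bob", "carol"]
print(payment / len(sellers))  # 30.0 each under an equal split

# Carol replicates her data under two fake identities.
sellers += ["carol_fake1", "carol_fake2"]
carol_share = sum(payment / len(sellers) for s in sellers if s.startswith("carol"))
print(carol_share)  # 54.0: replication grew Carol's part of the pie
```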

Another problem: since data costs nothing to reproduce, you would be willing to sell it for basically any amount of compensation (as long as the buyer is not in direct competition with you). But if you sell your data to the lowest bidder, they can turn around and share it with all the higher bidders, earning the returns that should have gone to you. There are legal protections against this in the form of e.g. copyright, but they still leave you at risk (see, for example, the current state of LLM training data sourcing). This makes it riskier to sell your data in raw form when there are multiple buyers. In these kinds of markets you would rather find a way to sell an output based on your data instead of selling the data directly.

You might also have privacy concerns about parting with your data that are unconnected to the direct financial effects of doing so. This mental model is most useful when thinking about individuals selling off data; for companies and other actors, I think the effects are often better thought of as some kind of indirect financial consequence. The privacy model shows up in research mostly when modeling individual participation in data markets.

dissection of a data market

A data market is a structure that takes in offers from sellers and bids from buyers and then gives out data to buyers and money to sellers. That is all. What we need to determine to create a data market is:

  1. A method of allocating data to buyers
  2. A method of distributing money between sellers

We then want these two methods to satisfy a lot of properties. We want the market to be fair in the sense we normally understand fairness, and we want everyone in the market to behave cooperatively, i.e. not lie about their data or about what they’re willing to pay for it.
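To make that structure concrete, here is a minimal sketch in Python. The names (`allocate`, `split_payment`) and the deliberately naive rules are my own placeholders, not taken from any particular paper:

```python
from dataclasses import dataclass

@dataclass
class Offer:
    seller_id: str
    dataset: list[float]  # stand-in for whatever the market trades

@dataclass
class Bid:
    buyer_id: str
    budget: float  # what the buyer is willing to pay

def allocate(offers: list[Offer], bids: list[Bid]) -> dict:
    """Method 1: decide which data each buyer receives.
    Naive rule: every buyer gets every dataset."""
    return {b.buyer_id: [o.dataset for o in offers] for b in bids}

def split_payment(offers: list[Offer], bids: list[Bid]) -> dict:
    """Method 2: decide how the buyers' money is divided among sellers.
    Naive rule: pool all budgets and split them equally."""
    pool = sum(b.budget for b in bids)
    return {o.seller_id: pool / len(offers) for o in offers}

offers = [Offer("your_supermarket", [1.0, 2.0]), Offer("my_supermarket", [3.0, 4.0])]
bids = [Bid("third_supermarket", 100.0)]
print(allocate(offers, bids))
print(split_payment(offers, bids))
```

Both naive rules fail the properties we care about; the equal split, for one, is exactly what the replication attack above exploits. The rest of the post is about doing better.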

A data market can also run many auctions over time. With our supermarket example from earlier, we might want to have an auction every day, or even every hour, to properly share up-to-date data between participants.

In some designs, the pricing in the market is based on the gains buyers derive from the data. These designs have good properties for markets with many buyers and many sellers, often in direct competition: think of our supermarkets from earlier, several competing wind farms, or any similar situation. Another avenue of data market research focuses instead on markets where the price is set by the seller’s loss of privacy. This could be something like a marketplace for individuals selling data to a marketing firm, since the sellers aren’t in direct competition with the buyer.

I’m going to primarily focus on markets intended for large actors on both sides, like companies or other similar actors composed of many people under one legal entity. I find these cases more interesting, since a lot of current systems could perform better if such actors shared data, but they never would without an incentive, being in direct competition. For that reason, I also find it more interesting if we can construct a pricing model based somehow on the gain of market participants, since I think that interfaces more naturally with the negative externality of giving data to a competitor.

an example outcome-focused market

A very good research paper (Anish Agarwal et al. (2019)1) introduced an algorithm for a real-time data marketplace that prices data based on the outcome it provides for a buyer. This achieves both a very nice way of avoiding sending raw data to buyers, who would otherwise be able to reuse the data for other tasks, and an algorithm that is robust to replication. I really recommend reading the paper if this research area interests you.

The marketplace they present receives offers from data sellers and bids from buyers, then uses an algorithm to allocate payoff to the sellers and data to the buyers. The algorithm is based on approximating the Shapley value, a concept from cooperative game theory that economists often use to represent a fair division of the value a group creates. By using the Shapley value, they get its fairness properties as a bundled deal, except one (read on), and so they have a market that is (mostly) fair as understood by the economic literature.
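For reference, the exact Shapley value pays each seller their average marginal contribution across all possible coalitions of sellers. A brute-force version is easy to write down, though it takes exponential time, which is why the paper approximates it. The coalition gains below are made up for illustration:

```python
from itertools import combinations
from math import factorial

def shapley(players, value):
    """Exact Shapley values: each player's marginal contribution to a
    coalition S, weighted over all coalitions. O(2^n), so toy-sized only."""
    n = len(players)
    phi = {}
    for i in players:
        others = [p for p in players if p != i]
        total = 0.0
        for k in range(len(others) + 1):
            for S in combinations(others, k):
                weight = factorial(k) * factorial(n - k - 1) / factorial(n)
                total += weight * (value(set(S) | {i}) - value(set(S)))
        phi[i] = total
    return phi

# Made-up example: the buyer's gain from each coalition of sellers' data.
# Sellers "a" and "b" hold near-duplicate data, so together they add little.
gains = {frozenset(): 0, frozenset("a"): 6, frozenset("b"): 6,
         frozenset("c"): 4, frozenset("ab"): 7, frozenset("ac"): 10,
         frozenset("bc"): 10, frozenset("abc"): 11}
print(shapley(["a", "b", "c"], lambda S: gains[frozenset(S)]))
# {'a': 3.5, 'b': 3.5, 'c': 4.0} -- payoffs sum to the full coalition's gain, 11
```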

The paper’s solution to replication robustness is to reduce each seller’s compensation based on the similarity of their data to other data in the market. While the resulting marketplace doesn’t reward data replication, it violates a property known as “balance”: the marketplace no longer guarantees that all the money paid by the buyers is distributed to market participants. This means that bad actors can’t make themselves better off by replicating data, but they can ruin the data market for all the other sellers. So while the scheme is theoretically “robust” to replication, in that replication won’t win you money, it is only surface-level robust: an actor who wants to ruin the market can still do so through replication.
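Here is a sketch of the penalty, assuming a similarity measure on datasets. The exponential down-weighting follows the shape of the paper’s construction, but the similarity function and λ below are placeholders of my own:

```python
import math

def replication_robust_payoffs(shapley_values, similarity, lam=1.0):
    """Down-weight each seller's Shapley payoff by the total similarity of
    their data to everyone else's. Replicas penalize each other, but the
    withheld money is not redistributed, so balance is lost."""
    payoffs = {}
    for i, phi in shapley_values.items():
        penalty = sum(similarity(i, j) for j in shapley_values if j != i)
        payoffs[i] = phi * math.exp(-lam * penalty)
    return payoffs

# Continuing the toy example: "a" and "b" hold near-identical data.
sim = lambda i, j: 0.9 if {i, j} == {"a", "b"} else 0.1
print(replication_robust_payoffs({"a": 3.5, "b": 3.5, "c": 4.0}, sim))
# roughly {'a': 1.29, 'b': 1.29, 'c': 3.27} -- only about 5.85 of the 11 paid out
```

Note how the near-duplicates hurt not just themselves but the total amount paid out, which is the balance violation in miniature.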

A very cool byproduct of the way they approximate the Shapley value is a method for determining feature importance in a machine learning model. They iteratively choose subsets of the features and, by looking at the loss, determine each feature’s importance from the performance of the subsets it is included in. I think some machine learning tasks might actually have something to gain from this approach. Maybe I’ll test it out and report my results some time.
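Here is a minimal Monte Carlo version of that idea, assuming a scikit-learn-style model. The generic permutation-sampling estimator below is my own simplification, not necessarily the exact sampling scheme from the paper:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def shapley_feature_importance(X, y, n_perms=200, seed=0):
    """Monte Carlo Shapley importance: for random feature orderings,
    credit each feature with the fit improvement (here R^2) it adds."""
    rng = np.random.default_rng(seed)
    n_features = X.shape[1]
    phi = np.zeros(n_features)

    def score(features):
        if not features:
            return 0.0
        model = LinearRegression().fit(X[:, features], y)
        return model.score(X[:, features], y)

    for _ in range(n_perms):
        order = rng.permutation(n_features)
        used, prev = [], 0.0
        for f in order:
            used.append(f)
            cur = score(used)
            phi[f] += cur - prev
            prev = cur
    return phi / n_perms

# Made-up data: y depends on features 0 and 1; feature 2 is pure noise.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))
y = 2 * X[:, 0] + X[:, 1] + 0.1 * rng.normal(size=200)
print(shapley_feature_importance(X, y))  # feature 2 should score near zero
```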

on from here

This was a short introduction to data markets. I’m going to spend much longer on them, and I’ll try to dive into some of the topics I only scratched the surface of here, like robustness to replication, fair markets, and auction theory. If I follow my own plan, a post on auction theory is coming next, hopefully within the next month.

other cool articles

https://arxiv.org/abs/2003.08345

D. Han, M. Wooldridge, A. Rogers, O. Ohrimenko and S. Tschiatschek, “Replication Robust Payoff Allocation in Submodular Cooperative Games,” in IEEE Transactions on Artificial Intelligence, 2022, doi: 10.1109/TAI.2022.3195686.

references

  1. Anish Agarwal, Munther Dahleh, and Tuhin Sarkar. 2019. A Marketplace for Data: An Algorithmic Solution. In Proceedings of the 2019 ACM Conference on Economics and Computation (EC ’19). Association for Computing Machinery, New York, NY, USA, 701–726. https://doi.org/10.1145/3328526.3329589