Ocean’s Top-Level Goals
Data Commons, Privacy, Data Regulations, and Governance in Ocean Protocol
Last week we introduced the Ocean Protocol. This post picks up from that: it outlines our high-level goals, and how we’re approaching them. We will answer:
- Should we go for a “data commons”, which is a win for humanity, or for paid data, which pays the bills?
- What’s the role of tokens?
- What about individual data privacy? In light of quantum computing, in light of regulations? What tech can help?
- What about network governance?
Data Commons; Connecting Data Suppliers & Consumers
Data producers like enterprises have tremendous data assets but don’t know how to unlock the data’s potential. Conversely, data consumers like AI startups are starving for data. Data marketplaces can connect data suppliers and consumers. Let’s make it super-easy for data marketplaces to start up (via a common protocol), drawing on the same massive data pool (via a common network).
If we do this right, the marketplaces can even handle a data commons, where data is free to use, yet you are rewarded if you contribute to the commons. This can complement paid data; free and paid can work hand-in-hand, making each other stronger.
Decentralized Control, Yet In One Logical Place
Many data providers must participate to make an ocean’s worth of data available. For that to happen, the marketplace must first feel fair. It can’t be controlled by any single entity that might pull the plug at any minute. Every time I think about the singular power that AWS has, I cringe.
And of course one good answer is to decentralize control. Spread power among the nodes and the token holders. Bake in trust (or at least cryptographically secure claims).
- How can we align incentives among participants, rather than having zero-sum competitions (when one wins, the other must lose)?
- How can we incentivize data suppliers and consumers to participate?
- How can we incentivize commons data?
- How can we incentivize referrals?
Tokens have emerged as a potent mechanism to address these questions.
The community of Ocean Token holders has a common incentive to increase the value of Ocean Tokens. This makes participation in the ecosystem a positive-sum game, i.e. win-win.
We use Ocean Tokens. These have many roles. They incentivize submission of data. They act as a currency to buy and sell data (but pricing can be in more stable currencies). They are used to curate towards quality participants and quality data, via staking so participants have skin-in-the-game. Staking also helps network governance. Later posts will elaborate.
Low Friction to Pricing
When someone wants to post some data, how do they price it? When a data consumer sees all the data sets available, how can they choose which to use? What about data “escaping” and people starting to use it for free, against the wishes of the data owner?
To answer, we’ve categorized data into several types:
- Commons data. Pricing is free. Easy!
- Fungible data. Units of the data are interchangeable, so pricing is easy too: use an exchange.
- Non-fungible data. This one has a million answers. But we can distill into a handful: fixed pricing, auction pricing, royalties, and programmable. The first three are good defaults; the last gives full flexibility for creative programmers.
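To make the non-fungible options concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the function names, the choice of a sealed-bid second-price auction as the auction variant, and the tiered discount used as the "programmable" example. It is not Ocean's pricing API.

```python
from typing import Optional, Sequence


def fixed_price(posted_price: float) -> float:
    """Fixed pricing: every buyer pays the same posted amount."""
    return posted_price


def second_price_auction(bids: Sequence[float], reserve: float = 0.0) -> Optional[float]:
    """Auction pricing (assumed variant: sealed-bid, second-price).

    Returns what the highest bidder pays, or None if no bid meets the reserve.
    """
    valid = sorted((b for b in bids if b >= reserve), reverse=True)
    if not valid:
        return None
    # The winner pays the second-highest valid bid, or the reserve if alone.
    return valid[1] if len(valid) > 1 else reserve


def royalty_due(downstream_revenue: float, rate: float) -> float:
    """Royalties: the supplier earns a share of revenue made with the data."""
    if not 0.0 <= rate <= 1.0:
        raise ValueError("rate must be between 0 and 1")
    return downstream_revenue * rate


def tiered_price(quantity: int, unit_price: float = 2.0) -> float:
    """Programmable pricing: any rule the supplier can code.

    Hypothetical example: half price on every unit beyond the first 100.
    """
    full = min(quantity, 100)
    discounted = max(quantity - 100, 0)
    return full * unit_price + discounted * unit_price * 0.5
```

The first three functions are the kind of defaults a marketplace could ship; the last stands in for whatever custom logic a supplier wants to write.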
There’s another dimension. When you “buy” data, do you really need to see it or do you just want to compute on it? This matters, because if you only want to compute on it, you can let the data stay on-premise (even encrypted!). This sweeps away many thorns: data protection worries, bandwidth limitations for massive datasets, and general unease about data leaving the premises. This deserves its own category…
Let’s get to the heart of why leaving data on-premise is useful.
Privacy must be a first-class citizen. The data providers must have both rights and control on the use of their data, with verifiable audit.
It’s not a good idea for the data to leave the premises if any of the following hold true.
- Laws. The data cannot leave the jurisdiction for legal reasons, like German medical data.
- Big data. The dataset is so massive that it doesn’t make sense to transmit it over the network.
- Queasiness. The data provider is uncomfortable releasing the data.
- Quantum. The data is highly sensitive even years from now, and should not get out with the advent of quantum computing. Here’s the scenario: Imagine if some sensitive data was encrypted and posted publicly. Perhaps it’s secure now. However, when quantum computing gets good enough for dedicated hackers to use (in say 5–20 years) then we want to make sure they don’t decrypt that sensitive data. We can’t just remove the encrypted data from the network either, because it may be stored immutably (IPFS/Filecoin style) or savvy hackers may already be making their own copies in preparation. In short: if data is still sensitive after a few years, storing it encrypted on a public net is not enough.
Several technologies can help. First, we can leverage crypto and blockchain technologies for access control.
- Traditional crypto. Encryption for data safety. Digital signatures for proof that X did Y.
- On-premise storage & firewalls. Store data behind firewalls. This way, when quantum computing goes mainstream, public encrypted data doesn’t get exposed.
- Role-Based Access Control (RBAC) aka Tokenized Write Permissions. Users are assigned roles as tokens. Those roles are assigned write permissions as tokens. RBAC is a BigchainDB add-on.
- Access provenance. Here, the blockchain stores provenance of what data was accessed when, by whom. This is also a recent BigchainDB feature.
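As a rough illustration of the last two items, the sketch below models roles and permissions as plain Python sets and keeps a hash-chained log of who accessed what, when. It is a toy under stated assumptions: the class and method names are hypothetical, nothing here is BigchainDB's API, and a real deployment would store these records as on-chain transactions rather than in a Python list.

```python
import hashlib
import json
from dataclasses import dataclass, field

GENESIS = "0" * 64  # placeholder "previous hash" for the first record


def _digest(record: dict) -> str:
    """Hash a record deterministically (keys sorted before serializing)."""
    return hashlib.sha256(json.dumps(record, sort_keys=True).encode()).hexdigest()


@dataclass
class AccessControl:
    user_roles: dict = field(default_factory=dict)        # user -> set of roles
    role_permissions: dict = field(default_factory=dict)  # role -> set of permissions
    log: list = field(default_factory=list)               # hash-chained access records

    def grant_role(self, user: str, role: str) -> None:
        self.user_roles.setdefault(user, set()).add(role)

    def grant_permission(self, role: str, permission: str) -> None:
        self.role_permissions.setdefault(role, set()).add(permission)

    def is_allowed(self, user: str, permission: str) -> bool:
        return any(
            permission in self.role_permissions.get(role, set())
            for role in self.user_roles.get(user, set())
        )

    def access(self, user: str, permission: str, dataset: str, timestamp: str) -> bool:
        """Check the permission and record the attempt, allowed or not."""
        allowed = self.is_allowed(user, permission)
        record = {
            "user": user, "permission": permission, "dataset": dataset,
            "timestamp": timestamp, "allowed": allowed,
            "prev": self.log[-1]["hash"] if self.log else GENESIS,
        }
        record["hash"] = _digest(record)  # hash covers every field above
        self.log.append(record)
        return allowed

    def verify_log(self) -> bool:
        """Return True only if no record was altered or reordered."""
        prev = GENESIS
        for record in self.log:
            body = {k: v for k, v in record.items() if k != "hash"}
            if record["prev"] != prev or _digest(body) != record["hash"]:
                return False
            prev = record["hash"]
        return True
```

Because each record embeds the hash of the one before it, editing any past entry breaks the chain, which is the property a blockchain-backed provenance trail relies on.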
Next, we can bring compute to the data, that is, on-premise compute:
- Federated learning. Update an AI model by sending training code & initial parameters to each data-holder, computing data-side, and sending back a gradient update to the model. The model must have decentralized control so that a single entity can’t choose privacy-compromising initial parameters (e.g. a vector of zeros). Example: OpenMined.
- Homomorphic encryption (HE) with federated learning. Make model gradient updates on encrypted local user data. This may sound like science fiction, but in fact it’s here now. The recent holdup has been inefficiency. But it turns out that HE is efficient with thoughtful choice of AI algorithms, e.g. rectified linear units rather than sigmoids. Example: OpenMined.
- Secure containers. One container can see the data, but not the network. Another container does model building and can’t see the network. To talk to the network, it must talk locally with a blockchain-powered “gateway” container that can see the network. Example: Amethix.
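To show the shape of the federated learning loop described above, here is a minimal pure-Python sketch for a linear model. It is an assumption-laden toy: the function names are hypothetical, it is not OpenMined's API, and it leaves out homomorphic encryption, secure aggregation, and the decentralized control of the model that the text calls for. The point is only that raw data stays with each holder and gradients are the only thing that travels.

```python
def local_gradient(weights, data):
    """Runs on the data holder's premises.

    data is a list of (features, target) pairs that never leaves the holder.
    Returns the mean-squared-error gradient for a linear model.
    """
    grad = [0.0] * len(weights)
    for features, target in data:
        prediction = sum(w * x for w, x in zip(weights, features))
        error = prediction - target
        for i, x in enumerate(features):
            grad[i] += 2.0 * error * x / len(data)
    return grad


def federated_round(weights, holders, learning_rate=0.1):
    """One round: send weights out, collect gradients, average, update."""
    gradients = [local_gradient(weights, data) for data in holders]
    averaged = [
        sum(g[i] for g in gradients) / len(gradients)
        for i in range(len(weights))
    ]
    return [w - learning_rate * g for w, g in zip(weights, averaged)]


def train(initial_weights, holders, rounds=50, learning_rate=0.1):
    """Repeat federated rounds starting from the given initial weights."""
    weights = list(initial_weights)
    for _ in range(rounds):
        weights = federated_round(weights, holders, learning_rate)
    return weights
```

For example, with two holders whose private data both follow y = 2x, `train([0.5], [[([1.0], 2.0), ([2.0], 4.0)], [([3.0], 6.0)]])` moves the single weight toward 2 without either holder revealing its rows.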
Finally, we can anonymize and obfuscate.
- Anonymization / Obfuscation. From a raw dataset, compute a new dataset from which the original dataset cannot be inferred, while retaining the ability to train a model. Example: Numerai’s use of GANs.
Ocean won’t have all of these in the beginning. But this is where we’re heading.
Here are some mouthfuls for you:
- The Health Insurance Portability and Accountability Act of 1996 (HIPAA) in the USA
- The General Data Protection Regulation (GDPR) in the EU
- The Sarbanes-Oxley Act of 2002 (SOX) in the USA
- HSBC Data Policy Communication for Singapore
We could potentially use something like what is outlined and publicly available here:
If you’re a developer or an entrepreneur who just wants to build, I hear you! However, these laws are incredibly important; they exist to protect you as a citizen. So we’re going out of our way to make it easy to be compliant. My favorite precedent is how Stripe streamlined PCI compliance. It used to cost $100K to do that, eek! Now Stripe makes it an API call away.
Tokenized Data with Legal Teeth
How do we prevent data fraud, and other attacks? Ocean will constrain and incentivize for good behavior using crypto, staking and more. It’s more efficient to use these tools than to default to an army of lawyers.
But bad behavior will still occur. So as a backup we go “old school”, by giving legal teeth to your ownership of the data. How? It’s just good ol’ copyright law, which is a branch of intellectual property (IP) law. You can claim copyright on data, and license data as IP. We tie these laws to blockchain using Ricardian Contracts. This is what we did for ascribe, then made it more flexible with COALA IP. We’re bringing COALA IP to Ocean.
Do you know those “updated terms of service” emails, that are never in your best interest? These happen in centralized data marketplaces too. The result can be pain for data suppliers and consumers trying to earn a living.
So let’s spread governance with decentralization. The main goal in blockchain network governance is: how do you update the protocol? (Especially for controversial proposals.)
Blockchain governance isn’t always better either. Small numbers of mining whales and token whales can control the show too. Oops!
There also must be means for fast decision-making when issues arise.
Let’s aim to do better. First is the easy stuff: ensure that stakeholders can see what’s happening, and have a say in decision making. Give diminishing returns on voting power for whales. We’ve been designing Ocean’s network governance around these goals. Once again, staking is our friend :) Later posts will elaborate.
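One way to picture "diminishing returns on voting power" is to weight each vote by the square root of the stake behind it. This is purely an illustrative assumption: the post does not say which function Ocean will use, and the names below are hypothetical.

```python
import math


def voting_power(stake: float) -> float:
    """Assumed rule: power grows with the square root of stake."""
    if stake < 0:
        raise ValueError("stake must be non-negative")
    return math.sqrt(stake)


def tally(votes):
    """Sum voting power for and against a proposal.

    votes is an iterable of (stake, in_favor) pairs.
    Returns (power_for, power_against).
    """
    power_for = 0.0
    power_against = 0.0
    for stake, in_favor in votes:
        if in_favor:
            power_for += voting_power(stake)
        else:
            power_against += voting_power(stake)
    return power_for, power_against


def passes(votes) -> bool:
    """A proposal passes if power in favor strictly exceeds power against."""
    power_for, power_against = tally(votes)
    return power_for > power_against
```

Under this rule a whale staking 10,000 tokens gets a power of 100, so 150 holders staking 1 token each can outvote it, even though they hold far fewer tokens. One known caveat of any such rule is that a whale could split its stake across many accounts to regain power, so it would need to be paired with some defense against that.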
We’re building Ocean because we believe that society needs equal opportunity to access data. We need to spread the control of data infrastructure and markets. This post outlined the top-level goals. More posts will follow — we have a lot to share. And to build! Let’s keep building the future we want.