Technical Guide to Ocean Compute-to-Data

An overview of our v2 release, Ocean Compute-to-Data

Manan Patel
Ocean Protocol

--

[Note from Nov 2021: some of the content in this post is obsolete, as V3 and later Ocean releases interact with Compute-to-Data in slightly different ways. Please refer to oceanprotocol.com/technology/compute-to-data for up-to-date info.]

With the v2 Compute-to-Data release, Ocean Protocol provides a means to exchange data while preserving privacy. This guide explains Compute-to-Data without requiring deep technical know-how.

Motivation

Private data is data that people or organizations keep to themselves. It can be any personal, personally identifiable, medical, lifestyle, financial, or otherwise sensitive or regulated information.

Benefits of Private Data. Private data can help research, leading to life-altering innovations in science and technology. For example, more data improves the predictive accuracy of modern Artificial Intelligence (AI) models. Private data is often considered the most valuable data because it’s so hard to get at, and using it can lead to potentially big payoffs.

Risks of Private Data. Sharing or selling private data comes with risk. What if you don’t get hired because of your private medical history? What if you are persecuted for private lifestyle choices? Large organizations that have massive datasets know their data is valuable — and potentially monetizable — but do not pursue the opportunity because of the risk of the data escaping and the related liability.

Resolving the Tradeoff. There appears to be a tradeoff between the benefits of using private data and the risks of exposing it. What if there were a way to get the benefits while minimizing the risks? This is the idea behind Compute-to-Data: let the data stay on-premise, yet allow third parties to run specific compute jobs on it to get useful analytics results, like an average or a trained AI model. The analytics results help in science, technology, or business contexts; yet the compute is sufficiently “aggregating” or “anonymizing” that the privacy risk is minimized.

Share or Sell. Compute-to-Data is meant to be useful for data sharing in science or technology contexts. It’s also meant to be useful for selling private data, while preserving privacy. This might look like a paradox at first glance but it’s not! The private data isn’t directly sold; rather, specific access to it is sold, access “for compute eyes only” rather than human eyes. So Compute-to-Data in data marketplaces is an opportunity for companies to monetize their data assets.

What’s new in Ocean Compute-to-Data?

Ocean Compute-to-Data works in Pacific (Ocean’s mainnet) and Nile (Ocean’s testnet). Here’s what Compute-to-Data introduces.

New Actors

Ocean Protocol has these actors:

  • Data Providers, who want to sell their data
  • Data Consumers, who want to buy data
  • Marketplaces, dApps that facilitate data exchange

Compute-to-Data adds a new actor, the Compute Provider.

  • Compute Provider sells compute on data, instead of the data itself. They can be the same actor as the Data Provider, or a separate actor that the Data Provider trusts to compute on the data. In the walkthrough below, the Data Provider also acts as the Compute Provider.

New Components

Ocean technology has several components. Operator Service and Operator Engine are new for v2 Compute-to-Data.

  • Operator Service — a micro-service in charge of managing workflows and executing requests. It communicates directly with, and takes orders from, Brizo (the data provider’s proxy server), and manages computation on the data that Brizo provides.
  • Operator Engine — a backend service in charge of orchestrating the compute infrastructure, using Kubernetes as the backend. The Operator Engine retrieves the workflows created by the Operator Service and manages the infrastructure needed to execute them. A minimal sketch of such a workflow follows this list.
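
To make the handoff concrete, here is a minimal sketch of the kind of workflow the Operator Service might register for the Operator Engine to pick up. The field names and the submit_workflow helper are illustrative assumptions, not the actual schema.

```python
# Illustrative sketch only: the field names and submit_workflow helper are
# assumptions for explanation, not the actual Operator Service schema.

def build_workflow(data_did: str, algo_did: str, job_id: str) -> dict:
    """Describe a compute job: which dataset, which algorithm, where results go."""
    return {
        "jobId": job_id,
        "stages": [{
            "input": [{"id": data_did}],       # dataset to compute on
            "algorithm": {"id": algo_did},     # script to run against it
            "output": {"publishTo": "s3"},     # results land in S3 (see below)
        }],
    }

def submit_workflow(workflow: dict) -> None:
    """Hypothetical: the Operator Service would persist this workflow in
    Kubernetes so the Operator Engine can spawn the pods that execute it."""
    print("submitting workflow", workflow["jobId"])

submit_workflow(build_workflow("did:op:data123", "did:op:algo456", "job-1"))
```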

New Asset Type

Previously, the dataset was the only asset type in Ocean metadata (the DDO). Compute-to-Data introduces a new asset type, the algorithm: a script that can be executed on datasets.
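
For intuition, an algorithm asset’s metadata might look roughly like the Python dict below. This is a simplified sketch; the real DDO schema has more fields, and the names here are assumptions.

```python
# Simplified sketch of algorithm asset metadata; the real DDO schema has
# more fields, and these names are assumptions for illustration.
algorithm_metadata = {
    "main": {
        "type": "algorithm",                 # the new asset type
        "name": "Train lane-detection model",
        "dateCreated": "2020-05-11T00:00:00Z",
        "files": [{"url": "https://example.org/train.py"}],  # hypothetical script
        "algorithm": {
            "language": "python",            # Compute-to-Data is language agnostic
            "format": "docker-image",
            "version": "0.1",
        },
    },
}
```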

How does Compute-to-Data work?

Let’s dive into how Compute-to-Data works with an example.

Suppose Arena is a major player in the automotive industry. Arena wants to create Autonomous Vehicles (AV) and needs a ton of data to train their AV’s AI models to operate efficiently. Unfortunately, Arena doesn’t have enough data to do so. They plan to purchase data from major automotive supplier Axios.

Axios is not interested in selling data to Arena because of customer privacy issues. Arena proposes to use Ocean Compute-to-Data, so Arena can build AI models on Axios’ data, without data ever leaving Axios servers. Axios agrees, as it allows them to monetize their data while preserving privacy.

1. Axios uses Ocean Compute-to-Data to set up their compute infrastructure (including Brizo, Operator Service, and Operator Engine). Then, Axios publishes their data assets onto Ocean and receives a DID (e.g. dataDID) for the published data asset.

Provider publishes data using marketplace
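
In code, step 1 might look roughly like the sketch below; OceanClient and its publish_asset method are hypothetical stand-ins for the real Ocean libraries, not their actual API.

```python
# Hypothetical sketch of publishing a data asset; OceanClient stands in for
# the real Ocean libraries and is not their actual API.

class OceanClient:
    def publish_asset(self, metadata: dict) -> str:
        """Register the asset's metadata on-chain and return its DID."""
        return "did:op:" + "0" * 64  # placeholder DID for illustration

ocean = OceanClient()
data_did = ocean.publish_asset({
    "main": {
        "type": "dataset",
        "name": "Axios vehicle sensor logs",  # hypothetical dataset
        # Only a compute service is attached, so the raw files never
        # leave Axios' servers.
    },
})
print("published dataset:", data_did)  # this is the dataDID used below
```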

2. Arena discovers the published data in an Ocean data marketplace using search, filtering, or browsing.

Consumer searches for data in a marketplace

3. Arena believes the data asset could prove useful in their AV development, based on the description. They purchase access to train their AI model on that data via the compute service.

Consumer sends compute access request to Brizo

4. As usual, Brizo (the digital proxy for the data/compute provider) performs the necessary validations, checking conditions like permission to consume, a signed service agreement, and confirmed payment.

Brizo performs configured checks on behalf of Provider

5. If any validation fails, Brizo asks Arena to perform the required steps before proceeding.

Brizo enforces mandatory actions needed from consumer, determined by the provider
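
A rough sketch of steps 4 and 5 from Brizo’s side; the function and field names below are assumptions for illustration, not Brizo’s actual code.

```python
# Illustrative only: these names are assumptions, not Brizo's actual API.

def validate_compute_request(request: dict) -> list:
    """Return the list of unmet conditions; empty means the consumer may proceed."""
    failures = []
    if not request.get("permission_granted"):
        failures.append("obtain permission to consume")
    if not request.get("agreement_signed"):
        failures.append("sign the service agreement")
    if not request.get("payment_confirmed"):
        failures.append("pay for the compute service")
    return failures

# On validation failure, Brizo tells the consumer what is still required.
pending = validate_compute_request({"permission_granted": True})
if pending:
    print("required actions:", ", ".join(pending))
```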

6. Arena publishes their algorithm onto Ocean and receives a DID (e.g. algoDID) for the algorithm.

NOTE — Compute-to-Data is language agnostic and supports all types of compute platforms, environments, and programming languages. Compute providers need to provide proper details about the type of compute service (platform, environment, CPU, RAM, etc.) they’re offering. Consumers can then browse the compute environments on offer, find providers that support their algorithms, and choose accordingly.

Consumer publishes Algorithm to be used to compute on provider’s data
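
As a rough illustration of the NOTE above, a compute provider’s service description might enumerate its environment like this; the field names are assumptions.

```python
# Assumed, simplified shape of a compute service description; the real
# metadata fields may differ.
compute_service = {
    "type": "compute",
    "environment": {
        "platform": "kubernetes",
        "cpu": 4,                 # vCPUs offered per job
        "memoryGb": 16,           # RAM available to the job
        "gpu": 0,
        "supportedLanguages": ["python", "r", "scala"],  # language agnostic
    },
    "priceOcean": 50,             # matches the 50 OCEAN payment in step 7
}
```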

7. Arena signs a Service Agreement and pays 50 OCEAN to the escrow contract (part of the Keeper contracts) as payment for the compute service.

Consumer signs Service Agreement and pays for the compute service access

8. Now that Arena has performed all mandatory steps needed by Axios, Arena sends a compute request back to Brizo.

NOTE — The consumer doesn’t need to send a compute request immediately after purchasing compute service access. They can send a compute request any time before the compute service access expires.

Consumer sends compute service request to provider
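
In sketch form, step 8’s compute request might look like the following; the endpoint path and parameter names are assumptions, not Brizo’s actual interface.

```python
import requests

# Illustrative only: the endpoint path and parameter names are assumptions.
def start_compute_job(brizo_url: str, agreement_id: str,
                      data_did: str, algo_did: str) -> str:
    """Ask Brizo to start a compute job; returns a job id to poll later."""
    response = requests.post(
        f"{brizo_url}/services/compute",
        json={
            "agreementId": agreement_id,   # proves purchase of compute access
            "dataDid": data_did,
            "algorithmDid": algo_did,
        },
    )
    response.raise_for_status()
    return response.json()["jobId"]
```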

9. Brizo verifies that payment and all other mandatory actions have been performed by Arena.

Brizo verifies all mandatory actions are performed

10. Once all actions are validated and completed, Brizo gets hold of the dataset and algorithm (using dataDID and algoDID, respectively) and instructs the Operator Service to initiate compute using the given algorithm on the given data.

NOTE — Since this whole process takes place on the data provider (aka compute provider) side, data remains private and is not revealed to the consumer.

Brizo instructs Operator Service to start compute

11. The Operator Service performs checks on all inputs and, once ready, instructs the Operator Engine to start the compute process with the given data and algorithm.

NOTE — The Operator Service and Operator Engine use a Kubernetes cluster for the compute execution. A consumer can only run one active job per compute service. Consumers can choose to restart the same job or start a new job once an active job has completed or been manually stopped.

After necessary validations, Operator Service instructs Operator Engine to initiate compute

12. Once the compute job executes successfully, the Operator Engine publishes the results to an AWS S3 bucket. The results consist of the output model and execution logs.

NOTE — When executed successfully, the compute service produces two types of results: (1) output and (2) execution logs. Consumers can choose either or both to be delivered to them upon completion. These results are published to AWS S3 storage, and an AWS S3 URL is shared with the consumer. The consumer can choose to download the results or move them to their own S3 storage.

Operator engine publishes results to S3 bucket upon compute completion

13. Arena can inquire about the compute job’s status at any time. Upon inquiry, Brizo gets the current status from the Operator Engine via the Operator Service. Once the job has successfully completed, Brizo, on behalf of Axios, shares the results (output model and logs) URL with Arena.

Consumer gets informed after inquiry about the status of compute job completion
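
From the consumer’s side, steps 13 and 14 might look like this polling loop; the get_job_status helper and its status values are assumptions for illustration.

```python
import time

# Hypothetical: get_job_status stands in for asking Brizo, which relays the
# Operator Engine's status via the Operator Service.
def get_job_status(job_id: str) -> dict:
    return {"status": "completed",
            "resultsUrl": "https://example-bucket.s3.amazonaws.com/job-1/results.zip"}

def wait_for_results(job_id: str, poll_seconds: int = 30) -> str:
    """Poll until the job completes, then return the S3 URL of the results."""
    while True:
        job = get_job_status(job_id)
        if job["status"] == "completed":
            return job["resultsUrl"]   # output model and execution logs
        time.sleep(poll_seconds)

print("download results from:", wait_for_results("job-1"))
```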

14. At this point, Arena can decide if they are satisfied with the results. If they are not satisfied, they can choose to restart compute execution with the same or a different algorithm, until the compute access expires.

Here is the complete architecture of Ocean Compute-to-Data.

Conclusion

This article has described how Compute-to-Data enables data providers to share or sell their valuable private data while preserving privacy.

The following GitHub repos provide more detailed technical information.

Once you’re ready to get started, jump into the following tutorials and test out Compute-to-Data for yourself!

If you are new to Ocean Protocol, you may find our on-boarding guide useful. It’s the ultimate guide to Ocean’s architecture and tech stack for newcomers.

Follow Ocean Protocol on Twitter, Telegram, LinkedIn, GitHub & Newsletter for project updates and announcements. And chat directly with other developers on Gitter.
