Here we introduce the background for the problem of decentralized data collaboration.

To see the overview of the concept doc, visit Overview.

When centralized solutions are not enough…

Data is the lifeblood of many research and industrial projects. To improve product or service quality, data collaboration has become a common practice, because having more data available generally yields better insights. Mature centralized solutions often require all participants to grant data access to a centralized party, usually the solution provider. Data privacy is protected by legal contracts, and participants can use only the functionality predefined by the solution provider.

However, centralized solutions rely on trust in the solution provider and suffer from two major drawbacks:

First, auditing the running software from the solution provider is technically challenging. Legal-level constraints are not enough to defend against unexpected and undetectable misuse of private data.
Second, industries like medicine and finance have stricter privacy regulations, which can prevent sharing data with a third party even if the third party is legally constrained.

Recent developments in privacy-enhancing technologies and innovations in applied cryptography provide privacy-enhanced alternatives for many specific applications. They can protect security and privacy even in a decentralized setting with less trust between different participants.

9 challenges in building decentralized solutions

However, the functionality of these decentralized alternatives is often more constrained, and their performance is often significantly lower, than that of centralized solutions. There are far fewer mature decentralized data collaboration solutions than centralized ones, mostly because building a complete decentralized solution poses many challenges:

Similar to distributed systems, decentralized systems also involve multiple machines.
- C1: The system is required to orchestrate the multi-host executions.
- C2: Machines need to be able to discover and communicate with each other.
- C3: As the number of users and the workload grow, the system needs to scale and, if needed, balance load across machines.
- C4: Convenient testing in the decentralized setting is also necessary for correctness and robustness.
In addition, decentralized systems require more consideration of security and privacy compared with distributed systems.
- C5: In distributed systems, one storage service is often sufficient and can be used by all other components. However, in a decentralized setting, data ownership requires private storage for each participant.
- C6: Access to that data also requires extra authentication.
- C7: As an initial step in most collaborations, all participants need to explicitly agree to an execution plan before starting.
Finally, the lack of interoperability prevents code reuse. Centralized solutions can be built directly upon existing libraries and packages.
- C8: For decentralized systems, however, differences in programming languages, data formats, communication interfaces, and other abstractions all make solution integration harder. Even after the initial integration is done, the extra effort needed for dependency upgrades and for swapping in alternative components makes long-term maintenance difficult.
- C9: In addition, cryptographic protocols often rely on one another. The lack of interoperability therefore also makes it harder to combine multiple decentralized solutions into more expressive ones.
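
To make C5 through C7 more concrete, here is a minimal Python sketch. It is not CoLink's API: every name in it (`Participant`, `issue_token`, `plan_digest`, `all_agreed`) is hypothetical and chosen only for illustration. It assumes each participant keeps its own in-memory store guarded by an HMAC-based access token, and that "agreeing to an execution plan" means every participant reports the SHA-256 digest of the same canonically encoded plan.

```python
import hashlib
import hmac
import json
import secrets


class Participant:
    """Hypothetical participant that owns its data (illustration only)."""

    def __init__(self, name):
        self.name = name
        self._key = secrets.token_bytes(32)  # secret key, never shared
        self._storage = {}                   # C5: private, per-participant storage

    def put(self, entry, value):
        self._storage[entry] = value

    def issue_token(self, entry):
        # C6: the owner grants access to one entry by MACing the entry name.
        return hmac.new(self._key, entry.encode(), hashlib.sha256).hexdigest()

    def read(self, entry, token):
        # The owner recomputes the MAC and compares in constant time.
        if not hmac.compare_digest(self.issue_token(entry), token):
            raise PermissionError(f"invalid token for {entry!r}")
        return self._storage[entry]


def plan_digest(plan):
    # Canonical JSON so that every participant hashes identical bytes.
    encoded = json.dumps(plan, sort_keys=True, separators=(",", ":")).encode()
    return hashlib.sha256(encoded).hexdigest()


def all_agreed(plan, participant_names, approvals):
    # C7: start only if every participant approved exactly this plan.
    digest = plan_digest(plan)
    return all(approvals.get(name) == digest for name in participant_names)
```

A task coordinator in this sketch would call `all_agreed` before launching anything, and a missing or mismatched approval from any single participant is enough to block the execution. A real deployment would likely replace the shared digest with per-participant digital signatures so that approvals cannot be forged.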

Facing all these challenges, in CoLink, we ask the following question:

❓ How can we accelerate the development and deployment of decentralized data collaboration?

We notice that many abstractions and building blocks are missing, so the same work is repeated each time a different decentralized solution is built.

Design goals for an ideal programming abstraction

Observing this gap, we analyze existing decentralized solutions, summarize the missing pieces, and design CoLink, a new programming abstraction for decentralized data collaboration.

Specifically, we come up with the following design goals: