Here we introduce the background for the problem of decentralized data collaboration.
To see the overview of the concept doc, visit Overview.
Data is the lifeblood of many research and industrial projects. To improve product or service quality, data collaboration has become a common practice, since access to more data generally yields better insights. Mature centralized solutions often require all participants to grant data access to a centralized party, usually the solution provider. Data privacy is protected by legal contracts, and participants can use only the functionalities predefined by the solution provider.
However, centralized solutions rely on trust in the solution provider and suffer from two major drawbacks:
Recent developments in privacy-enhancing technologies and innovations in applied cryptography provide privacy-enhanced alternatives for many specific applications. They can protect security and privacy even in a decentralized setting with less trust between different participants.
However, their functionality is often more constrained and their performance is often significantly lower than that of centralized solutions. There are far fewer mature decentralized data collaboration solutions than centralized ones, mostly because building a complete decentralized solution poses many challenges:
C1
: The system is required to orchestrate multi-host executions.

C2
: Machines need to be able to discover and communicate with each other.

C3
: As the number of users and the workload grow, the system needs to scale and balance load as needed.

C4
: Convenient testing in the decentralized setting is also necessary for correctness and robustness.

C5
: In distributed systems, one storage service is often sufficient and can be used by all other components. However, in a decentralized setting, data ownership requires private storage for each participant.

C6
: Data access also requires extra authentication.

C7
: As an initial step in most collaborations, all participants need to explicitly agree to an execution plan before starting.

C8
: However, for decentralized systems, the differences in programming languages, data formats, communication interfaces, and other abstractions all add difficulty to solution integration. Even after the initial integration is done, the extra effort of upgrading dependencies and swapping in alternatives makes long-term maintenance challenging.

C9
: In addition, cryptographic protocols often rely on one another. The lack of interoperability also limits the ability to combine multiple decentralized solutions.

Facing all these challenges, in CoLink, we ask the following question:
❓ How to accelerate the development and deployment of decentralized data collaboration?
We notice that many abstractions and building blocks are missing, so the same effort is repeated each time a different decentralized solution is built.
Observing this gap, we analyze existing decentralized solutions, summarize the missing pieces, and design CoLink, a new programming abstraction for decentralized data collaboration.
Specifically, we come up with the following design goals: