“Data is more valuable than oil,” said Andrew Yang, a candidate in the 2020 United States presidential election. Driven by the rapid development of the Internet over the past two decades, the importance of data has gradually become widely accepted. Yet news of user data leaks and abuse is all too common, so how users can use their data and how data sovereignty can be guaranteed have become increasingly important and urgent questions.
“Privacy computing” is the most promising direction that blockchain and cryptography are exploring in the field of data. Today PlatON, the most important project in this field, released its mainnet.
To understand the importance of “privacy computing” and PlatON, we must start by clarifying basic concepts such as data ownership and data value.
This article was first published in 2019 and introduces the basic knowledge needed to understand the business model of privacy computing.
Original title: “Andrew Yang, the US presidential candidate of Chinese descent, says data is more valuable than oil, but how can that be achieved?”
Written by: Li Hua
Acknowledgements: Sun Lilin, founder of PlatON, and Sheng Chao, researcher of secure multi-party computing
Even Andrew Yang, a Democratic candidate of Chinese descent in the 2020 U.S. presidential election, has said that “data is more valuable than oil,” which shows how deeply rooted this idea has become.
However, although “data is the oil of the digital age” and “data ownership should be in our own hands” sound very attractive, it is actually difficult to say clearly how we should put them into practice.
The Economist published a cover article as early as 2017 claiming that data would replace oil as the most valuable resource of the current era. But to this day, ordinary people, who hold sovereignty over this “data oil,” still cannot benefit from this precious resource.
On the contrary, these data also bring serious privacy leakage problems to their owners.
Why is there such a huge gap between the vision and the reality? How can data ownership and data value be realized? This article explores these questions starting from existing practice, hoping to clarify a few threads and contribute a little to building a framework for thinking about the issue.
We can’t sell data
I believe each of us has received sales calls. Most people’s personal data has already been bought and sold, from simple items such as phone numbers to some consumption information, and these data may be waiting somewhere to be sold again right now.
Data can indeed be sold for money, and the money falls into the pockets of the institutions that have obtained our data.
This phenomenon easily leads to a misunderstanding: that we can realize the value of data by selling data. In other words, once we hold data sovereignty with the help of legal provisions and technical means, we can sell these data to whoever needs them, capture the value of the data, and sell our “oil” for money.
But this is wrong: we cannot buy or sell data. Before explaining why, it is necessary to distinguish data ownership from the right to use data.
For the vast majority of assets in the world, buying and selling means the transfer of asset ownership: one party gets ownership, the other loses ownership. But buying and selling data does not transfer the ownership of the data. You sell the data, but the ownership of the data still belongs to you.
Therefore, transactions involving data are actually transactions in data usage rights, not data ownership. But because data can be copied indefinitely, once we sell the data we cannot control how the buyer will use it or whether the buyer will sell it on. To be more precise, we have “lost” the data to some extent, even though we still own it.
Illegal data transactions buy and sell the data directly because they do not care about the rights and interests of data owners. But once we truly own our data, buying and selling the data itself is not how we can realize its value.
So how can we trade the right to use data without losing the data? The answer is to trade not the data itself but only the results of computations on the data. In other words, the buyer can have computations run over the data and receive the result it needs, but cannot obtain the original data.
This is the first and perhaps the most important thing to understand when discussing data ownership and data value: we cannot realize the value of data by selling data; we can only realize it by selling the results computed from data.
In other words, we need to separate the ownership of the data from the right to use the data, and only trade the right to use the data.
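The idea of trading results rather than data can be pictured with a minimal Python sketch. The `DataVault` class, its method names, and the minimum-size rule are all hypothetical and invented here for illustration. The sketch shows only the access pattern (the buyer names an approved computation and receives a number); it is not itself a privacy-computing protocol, since the owner still computes on the data in the clear.

```python
from statistics import mean
from typing import Callable, Dict, List, Sequence


class DataVault:
    """Hypothetical holder of raw records that sells results, never rows."""

    def __init__(self, records: Sequence[float], min_group_size: int = 5) -> None:
        self._records: List[float] = list(records)  # raw data stays with the owner
        self._min_group_size = min_group_size
        # Allow-list of aggregate computations the owner has agreed to sell.
        self._queries: Dict[str, Callable[[List[float]], float]] = {
            "mean": mean,
            "sum": sum,
            "count": len,
        }

    def run(self, query_name: str) -> float:
        """Run an approved computation and return only its result."""
        if query_name not in self._queries:
            raise PermissionError(f"query {query_name!r} is not for sale")
        if len(self._records) < self._min_group_size:
            # With very few records, an aggregate could expose an individual.
            raise PermissionError("too few records to release an aggregate")
        return float(self._queries[query_name](self._records))


vault = DataVault([52.0, 61.5, 47.0, 58.5, 66.0])
average = vault.run("mean")  # the buyer receives one number, not the records
```

Here the buyer pays for `average` and never holds a copy of the five records, so nothing can be resold; a request for anything outside the allow-list is refused.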
Privacy computing is not just for user privacy issues
How can we sell only the results computed from data? The answer is: through privacy computing.
Privacy computing means computing on data without exposing the original data, in such a way that the result can be verified. It includes several research directions, such as fully homomorphic encryption and secure multi-party computation. Many specialist technical articles describe how these work, and readers who want to learn more can consult them.
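As a rough illustration of secure multi-party computation, the sketch below uses additive secret sharing, one of its simplest building blocks, to compute a sum without any party revealing its own input. It is a single-process simulation, not a real protocol: the function names and the modulus are assumptions chosen for this example, parties are assumed to follow the rules honestly, inputs are assumed to be non-negative integers whose total is smaller than the modulus, and result verification is left out.

```python
import secrets
from typing import List

MODULUS = 2**61 - 1  # public modulus; any sufficiently large value would do


def split_into_shares(value: int, n_parties: int) -> List[int]:
    """Split `value` into n shares that sum to it modulo MODULUS."""
    # The first n-1 shares are uniformly random, so they carry no information.
    shares = [secrets.randbelow(MODULUS) for _ in range(n_parties - 1)]
    # The last share is chosen so that all shares add up to the value.
    shares.append((value - sum(shares)) % MODULUS)
    return shares


def secure_sum(private_inputs: List[int]) -> int:
    """Simulate n parties jointly computing the sum of their inputs."""
    n = len(private_inputs)
    # Party i splits its input and hands share j to party j.
    all_shares = [split_into_shares(x, n) for x in private_inputs]
    # Party j adds the shares it received and publishes only that partial sum.
    partial_sums = [
        sum(all_shares[i][j] for i in range(n)) % MODULUS for j in range(n)
    ]
    # The published partial sums combine into the total and nothing else.
    return sum(partial_sums) % MODULUS


total = secure_sum([120, 75, 300])  # 495
```

Each party only ever sees random-looking shares of the others’ inputs, yet the final total is exact. This is the sense in which the result of a computation can be delivered without the original data changing hands.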
Here there is a second ambiguity to clear up: privacy computing is not only a service for protecting user privacy; it is also the basis for trading data usage rights, that is, the basis for realizing the value of data.
This clarification is needed because “privacy computing” is easily understood as just another privacy protection technology, with the emphasis on “privacy,” when in fact the emphasis of “privacy computing” is on “computing.”
In the blockchain industry, since privacy computing is often used as a method to enhance user privacy in cryptocurrency transactions and on the blockchain, it is easier for people to understand privacy computing as a service for user privacy. This understanding is not wrong, but it limits privacy computing to a small area.
Perhaps it is clearer from another angle. We can split the data issue into a user privacy issue and a data value issue. The user privacy issue is about ensuring that raw data relating to the user is not leaked and the user’s privacy is not exposed. We can regard this as data privacy protection within a specific scope.
At this stage, privacy computing is one optional method of protecting privacy.
Once data privacy is secured, if the user or company chooses to leave the data where it is and do nothing with it, the story ends there. But if the user or company wants to go further and obtain the value of the data, the data must be used, and the matter enters the next stage. Now various methods are needed to ensure that the data is not leaked at any point in the life cycle of its use. We can regard this as full-scope data privacy protection.
At this stage, privacy computing is no longer an optional method but a necessary one, because the way to realize the value of data is to sell the results computed from it, using the data without exposing the original. Only privacy computing can achieve this.
If data is compared to oil, then privacy computing is the first step of refining. It is the basis on which we convert “crude oil” into various products while protecting user privacy.
Not all data has similar value
Not all data have similar value, and not all data can achieve data value. This may be another place where we need to be clear when discussing the value of data.
Only when we understand the complexity and diversity of data can we apply different legal provisions and technical methods to the problems that arise in different situations.
This article makes a simple division of data into categories from the perspective of application, and then describes the data value of each category. The classification proposed here is not necessarily comprehensive or accurate; it is only meant to establish a basic framework for discussion.
We can divide the data into three main categories:
- The first category is identity data;
- The second category is behavioral data;
- The third category is productivity value data.
The first category, identity data, is used for registration and identity verification on the Internet and in the real world: ID numbers, phone numbers, account information, and so on. This type of information has the greatest value to illegal industries, and once leaked it poses a serious security risk to users. For the legitimate data industry, however, this type of information has no computational value; no meaningful results can be computed from it.
Therefore, for this type of data there is no need to consider how to realize data value through privacy computing.
The second category is behavioral data, which includes users’ browsing traces and consumption data on the Internet, as well as data on how users habitually use products. Computation over these data can produce a personal profile of each user, which is then used to target advertisements, recommend content, provide services, and even promote opinions.
Behavioral data has two major kinds of value. One is advertising value, and we all know that advertising feeds the entire Internet industry. The other is that it helps products understand users and provide them with better personalized services.
The data ownership issues that currently attract wide attention and discussion around the world mainly concern this type of data. For a long time, the various rights over this type of data were unclear, and people did not care. We did not realize the seriousness of the problem until the results computed from these data were used more and more to influence or control us.
One landmark event was the Facebook data scandal of 2018. In this incident, a data analytics company called Cambridge Analytica obtained data on more than 50 million Facebook users. By running computations over the data, it picked out politically wavering targets and served them precisely matched political advertising, affecting the U.S. election and the Brexit referendum in the United Kingdom.
The good news is that we seem to be taking back ownership of this type of data. The General Data Protection Regulation (GDPR) promulgated by the European Union stipulates that the individual who generates the data is the data subject, and he has the right to request the erasure of his personal data, as well as the right to object and request the cessation of the processing of his personal data.
The bad news is that we have not taken back the right to use the data. As mentioned above, the value of data rests on transactions in data usage rights, so we are still far from being able to use this type of data to realize data value that belongs to users. The difficulty is twofold.
On the one hand, even though the GDPR has been called the strictest data protection regulation in history, it only requires companies to tell users, before using their data, which data will be used and what will be done with it. That is, it only restrains companies from abusing data; it does not restrict them from using it.
On the other hand, because this type of data can be used to help products understand users, if companies use data on the grounds of improving the user experience, which is what they do now, it seems hard to refuse. It seems hard for users to sacrifice their user experience by demanding that companies have no right to use any behavioral data. It seems harder still to expect companies to voluntarily separate the two uses of this data and hand over part of the advertising value.
Does this mean that companies can carry on processing data as before? Not really. The separation of data ownership and usage rights described above turns out to exist only on paper. Although companies hold only the right to use the data, they “obtain” and use the original data itself, which means the data can still be abused and security problems can still arise.
And because public privacy awareness is awakening and countries are enacting data protection laws (which place security responsibility on the companies that use data), companies may face user backlash and huge fines once problems occur. That is why companies such as Google and Apple are now doing a great deal of research in the field of privacy computing.
Take Google as an example. Its “Federated Learning” runs the machine learning model on each device. When the parameters from users are aggregated and sent to the cloud, privacy computing is achieved through privacy-preserving aggregation algorithms and system engineering.
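The general shape of federated learning can be sketched in a few lines of Python. This is a toy federated-averaging loop written for this article, not Google’s implementation: the one-parameter model `y = w * x`, the function names, the learning rate, and the sample data are all assumptions, and the privacy-preserving aggregation step mentioned above is omitted, so the server here still sees each device’s individual weight.

```python
from typing import List, Tuple

Dataset = List[Tuple[float, float]]  # (x, y) pairs that never leave the device


def local_update(w: float, data: Dataset, lr: float = 0.01, epochs: int = 5) -> float:
    """Train y = w * x on one device, starting from the shared weight."""
    for _ in range(epochs):
        # Gradient of the mean squared error with respect to w.
        grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
        w -= lr * grad
    return w


def federated_round(w_global: float, clients: List[Dataset]) -> float:
    """One round: devices train locally, the server averages their weights."""
    updates = [(local_update(w_global, data), len(data)) for data in clients]
    total_samples = sum(n for _, n in updates)
    # Average weighted by sample count; only weights reach the server.
    return sum(w * n for w, n in updates) / total_samples


# Two hypothetical devices whose private data both follow y = 2x.
clients = [
    [(1.0, 2.0), (2.0, 4.0)],
    [(3.0, 6.0), (4.0, 8.0), (5.0, 10.0)],
]
w = 0.0
for _ in range(50):
    w = federated_round(w, clients)
```

After the rounds finish, `w` is very close to 2.0, the slope shared by both devices’ data, even though the server only ever handled model weights and never an `(x, y)` pair.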
But it must be pointed out again that when companies separate data ownership from usage rights through privacy computing, the purpose is not to let users trade data usage rights. Companies want to reduce the risks of using data, avoid privacy leaks, and, having met compliance requirements, go on using users’ data for free.
Users therefore have a long way to go before they obtain the value of this type of data. The biggest difficulty lies in awareness. Only when we have a strong sense of data ownership and usage rights can we push governments to introduce stricter data protection regulations, or push a new Internet architecture to overturn the current centralized server model.
“Productivity Value Data” is the most valuable
Having covered “identity data” and “behavioral data,” we now turn to the third type, which this article calls “productivity value data.”
One major use of this type of data is machine learning and AI training; another is data analysis in support of scientific research, product design, decision-making, and so on. Used appropriately, this type of data can drive society to develop in a more efficient and friendly direction. It is a kind of productive force.
The third type of data has the widest range and the largest amount of data. It can come from humans, such as personal medical data and financial data, personal product usage habits data, etc.; it can also come from IoT devices, such as atmospheric condition data collected by sensors, autonomous driving data, and so on.
Some of its sources are the same as those of the second type, namely users of Internet products, but the collected data are processed differently and for different purposes: the second type of data is taken from users and used on those same users, whereas the third type is used across data subjects after being collected. From the perspective of the data itself, a given piece of data can be both second-type and third-type data.
The third type of data has the greatest data value. At the same time, they may also enter the trading market of data use rights first to realize data value.
With the second type of data, Internet companies hold the right to use the data and use it themselves, so no data transaction is needed. In the application scenarios of productivity value data, by contrast, there are parties that do not hold the right to use the data but want to use it. From this perspective, we can think of the third type as the set of all data that can be capitalized.
Medical data is a good example of how the third type of data can be used. With a large amount of medical data behind them, research institutions or pharmaceutical manufacturers could study diseases and develop new drugs better and faster. However, because of user privacy concerns and their own interests, the medical institutions that hold the data will not provide it to other institutions.
If we separate the ownership and usage rights of data through privacy computing, we can establish a trading market for data usage rights, and the data of different medical institutions, research institutions, and pharmaceutical manufacturers can be connected on this platform, which is popularly described as breaking down data islands. These institutions can then trade data usage rights, and they can also share data for joint disease research.
If we want to train an AI capable of diagnosing diseases, we likewise need to break down data islands in this way, so as to give the AI more, and more comprehensive, data.
What needs to be repeated is that at this stage, even if the transaction and value of data are realized, because the legal and use boundaries of data use rights are not clear, it is still difficult for us as individuals to get back all the value of data.
Data ownership and use is one of the most important issues of this era. Yuval Noah Harari, the historian and author of “Sapiens: A Brief History of Humankind,” takes this view: “If we want to prevent the concentration of all wealth and power in the hands of a small elite, the key is to regulate the ownership of data.”
Because data itself is complex and diverse, rather than hoping that public opinion, legislation, and technology will solve the problem as a whole, it may be faster and more effective to define and solve problems starting from small points that have clear boundaries and can be described precisely. We can classify and analyze the different data categories in more detail, or discuss the classification of data under different standards, and then discuss data privacy, data ownership, and the realization of data value on that basis.
Re-understanding “data is oil”
Data is often compared to oil.
Although cuneiform records describe humans collecting natural oil on the shores of the Dead Sea, the history of the modern petroleum industry only truly began after Abraham Gesner invented a method of extracting kerosene from coal in 1846 and Ignacy Łukasiewicz and Jan Zeh distilled refined kerosene from crude oil in 1853.
But this was just the beginning. As fuel for kerosene lamps, oil was nothing special. Only when it was used in internal combustion engines did its enormous potential burst forth, making it the most important resource in the world.
The similarity between data and oil is that data alone is not enough. Only when the “refining technology” of data is realized can the era of data industry be opened.
The difference between data and oil is that for oil the refineries came first and the demand from internal combustion engines came later, whereas for data there is already huge demand for use but no mature technology or infrastructure to support it.
This may be a good thing. The road is long, but we know the direction.
Reference materials:
1. “Federated Learning: Collaborative Machine Learning without Centralized Training Data”
2. “Helping organizations do more without collecting more data”