Original title: "Hardcore | How to use machine learning to identify the risks of cryptocurrency projects?"
Written by: Pengtai Xu
Translation: Sherrie
Cryptocurrency is a medium of exchange (another form of payment) that exists in the digital world and relies on cryptography to keep transactions secure. The technology behind cryptocurrency lets users send money directly to others without going through a third party such as a bank. To make these transactions, users only need to set up a digital wallet; they do not have to provide personal details such as ID numbers or credit scores, so they remain pseudonymous.
For ordinary cryptocurrency users, this anonymity offers reassurance that their personal information and transaction data will not be stolen by hackers. However, the same anonymity is easily abused by criminals for illegal activities such as money laundering and terrorist financing. Such activity has caused huge losses to both blockchain wallet users and cryptocurrency entities. Although regulatory bodies such as the Financial Action Task Force (FATF) have introduced standardized guidelines for supervising these entities, the large number of cryptocurrency entities and the volume of transactions that occur every day make monitoring the cryptocurrency space a challenging task.
Solution
There is therefore growing interest in using open source information, such as news sites and social media platforms, to identify possible security breaches or illegal activities. In cooperation with Lynx Analytics, we (a student team from the National University of Singapore) developed an automated tool that scrapes open source information, predicts a risk score for each news article, and flags risky articles. This tool will be integrated into Cylynx, a platform developed by Lynx Analytics to help regulators monitor blockchain activity using a variety of information sources.
Data acquisition of open source information
We identified three types of open source data that can provide valuable information for detecting suspicious activity in the cryptocurrency field:
- Traditional news sites, such as Google News, which report major hacking incidents.
- Cryptocurrency-specific news sites, such as Cryptonews and Cointelegraph, which are more likely to report on small entities and minor security incidents.
- Social media sites, such as Twitter and Reddit, where cryptocurrency holders may post about a hack before the news is officially announced.
We retrieve the content of articles and social media posts and then build a sentiment analysis model on it. This model assigns a probability of risky activity to the entities mentioned in each text.
Sentiment analysis model
We tried four different natural language processing tools for sentiment analysis: VADER, Word2Vec, fastText and BERT models. We evaluated these models on selected key metrics (recall, accuracy, and F1). The RoBERTa model (a variant of BERT) performed best and was selected as the final model.
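As a rough illustration of how such a comparison could be scored, the sketch below computes recall, accuracy and F1 for binary risk labels. The function name, the 0/1 label convention and the sample labels are assumptions made for this example; they are not taken from the project.

```python
def classification_metrics(y_true, y_pred):
    """Return recall, precision, accuracy and F1 for binary labels (1 = risky)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # risky, flagged
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # safe, flagged
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # risky, missed
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # safe, not flagged

    recall = tp / (tp + fn) if (tp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"recall": recall, "precision": precision,
            "accuracy": accuracy, "f1": f1}


# Made-up labels for six texts, for illustration only.
metrics = classification_metrics([1, 1, 1, 0, 0, 0], [1, 1, 0, 1, 0, 0])
print(metrics)
```

Recall is the metric that matters most when missing a real incident is costlier than raising a false alarm.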
The RoBERTa model processes the text of news articles (titles and excerpts) or social media posts and assigns a risk score to each text. Because every text was tagged with an entity during data collection, this gives us a risk indicator for the corresponding cryptocurrency entity. At a later stage, we combine the risk scores of multiple texts into an overall risk score for the entity.
We initially set up RoBERTa, a neural network model, as a sentiment analysis model, then mapped its last layer to our labeled risk scores to adapt it to the risk scoring context. To help the model generalize to future text data, we applied several text preprocessing steps: replacing entities, removing URLs and replacing hashes. We then used this best-performing model to score risk.
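The three preprocessing steps could look roughly like the sketch below. It is only an illustration: the placeholder tokens `<ENTITY>` and `<HASH>`, the regular expressions, and the reading of "hashes" as long hexadecimal strings (such as transaction hashes) are all assumptions, not details given by the project.

```python
import re

# Assumed patterns; the project's actual rules are not described in the article.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
HASH_RE = re.compile(r"\b(?:0x)?[0-9a-fA-F]{32,}\b")  # long hex strings


def preprocess(text, entity_names):
    """Remove URLs, mask hex hashes, and mask known entity names."""
    text = URL_RE.sub("", text)
    text = HASH_RE.sub("<HASH>", text)
    # Replace longer names first so that a short name does not split a long one.
    for name in sorted(entity_names, key=len, reverse=True):
        pattern = rf"\b{re.escape(name)}\b"
        text = re.sub(pattern, "<ENTITY>", text, flags=re.IGNORECASE)
    # Collapse the whitespace left behind by removed URLs.
    return re.sub(r"\s+", " ", text).strip()


sample = "Binance hacked! tx 0x" + "ab" * 32 + " see https://example.com/x"
print(preprocess(sample, ["Binance"]))
```

Masking entity names in this way keeps the model from learning that a particular name is itself a risk signal, which is one plausible reason for the step.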
Risk score
At this point, every article has a source (news/reddit/twitter), a risk probability and a count, which is the number of times the article has been shared or reposted. To convert these risk probabilities into a single risk score for a cryptocurrency entity, we first scale each article's probability to a range of 0 to 100, then compute a weighted average for each source, using each article's count as its weight. The weighted average gives greater importance to articles with higher counts, because the number of shares is likely to indicate the relevance or importance of the article.
After calculating the risk score of each source, we take a weighted sum of the per-source scores to obtain a comprehensive score.
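The per-source step can be sketched as follows. The function name and the `(probability, count)` tuple layout are assumptions for illustration, and the handling of a source whose counts are all zero (returning 0) is a choice made for this example rather than something the article specifies.

```python
def source_risk_score(articles):
    """Count-weighted average risk score (0 to 100) for one source.

    `articles` is a list of (risk_probability, count) pairs, where
    risk_probability is in [0, 1] and count is the number of shares or reposts.
    """
    total_count = sum(count for _, count in articles)
    if total_count == 0:
        return 0.0  # assumption: no shares at all means no signal
    # Scale each probability to 0..100, then weight it by the share count.
    weighted = sum(100.0 * prob * count for prob, count in articles)
    return weighted / total_count


# Made-up example: a widely shared risky article dominates the score.
print(source_risk_score([(0.9, 30), (0.2, 10)]))
```

With these made-up inputs, the article shared 30 times pulls the score much closer to 90 than to 20, which is the intended effect of weighting by count.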
Traditional news sources are given higher weight because these sources are more likely to report major security breaches (as opposed to hacking incidents by individual users).
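A minimal sketch of combining per-source scores is shown below. The weight values are purely hypothetical placeholders chosen so that news outweighs the social sources; the weights actually used are not stated here. Renormalizing over the sources that are present is also an assumption of this example.

```python
# Hypothetical weights: news is weighted highest, as described in the text.
SOURCE_WEIGHTS = {"news": 0.5, "reddit": 0.25, "twitter": 0.25}


def combined_risk_score(source_scores, weights=SOURCE_WEIGHTS):
    """Weighted sum of per-source scores (each 0 to 100) for one entity.

    Sources missing from `source_scores` are skipped and the remaining
    weights are renormalized so the result stays on the 0 to 100 scale.
    """
    used = {src: w for src, w in weights.items() if src in source_scores}
    total_weight = sum(used.values())
    if total_weight == 0:
        return 0.0
    return sum(source_scores[src] * w for src, w in used.items()) / total_weight


print(combined_risk_score({"news": 80.0, "reddit": 40.0, "twitter": 20.0}))
```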
Effectiveness of the solution
We tested our solution on a list of 174 cryptocurrency entities over the period from January 1, 2020 to October 30, 2020, and compared the results with known hacking cases during that period. Our risk scoring method performed quite well, identifying 32 out of 37 known hacking cases (about 86%). We also analyzed the effectiveness of our solution for a single entity. The figure below shows Binance's risk score from January 1, 2020 to October 30, 2020. The dashed red lines represent known hacking cases. From the graph we observe that our solution reported an increase in risk score for 4 of the 5 known hacks. There are also several peaks that do not coincide with known hacking cases. However, this is not a major problem: for our model, it is more important to identify as many hacks as possible and minimize the number of missed ones, even at the cost of some false alarms.
Interesting findings
During risk scoring, we noticed that larger entities tend to have a higher proportion of false positives than smaller entities. Large entities are discussed more, so they attract more negative posts and false rumors, which makes their scores less accurate.
Another trend worth highlighting is that there are usually several distinct peaks around a hack. This is due to the different response times of different data sources. The social media sites Twitter and Reddit are usually the first to peak when a high-risk event occurs, because users post the anomalies they observe, such as an entity's website going down without prior notice. Traditional news coverage generally appears only after the official announcement.
Limitations
We found that our solution has two potential limitations. The first is the need to constantly maintain the scrapers. Website designs may change over time, and the scrapers for these websites must be updated to ensure that the relevant information can still be retrieved for risk scoring.
The second limitation is that it is difficult to verify that an article has been tagged with the correct cryptocurrency entity. For example, an article reporting suspicious Bancor activity may also mention Binance in connection with an unrelated incident. Our solution would incorrectly tag the article with both entities and flag Binance as risky, even though it is not the main subject of the text. However, this is not a major limitation, because we only use the headlines and excerpts of news articles for risk scoring, and these usually contain only the key information of the article.
Conclusion
Our project allows regulators to easily mine open source information and better identify risk events in the cryptocurrency field. We provide a language model that analyzes articles and predicts risk scores, as well as methods to aggregate these scores based on entity and source information. These methods are woven into an automated pipeline that can run end-to-end. Integrating the project into the Cylynx platform will complement its existing functions and provide great help for regulators to identify high-risk cryptocurrency entities.