As cryptocurrencies, particularly Bitcoin, have grown in popularity, so has Bitcoin mining. While legitimate mining operations are critical for blockchain validation and security, a disturbing trend has emerged: malicious actors exploiting cloud computing resources for illegitimate mining. This not only wastes expensive processing resources but also poses serious security threats to both cloud service providers and their clients. Effective threat detection and response are hampered by the cost and complexity of siloed tools that neither scale nor provide capabilities for advanced threat detection.

In this blog, we will look at how a data lakehouse can be leveraged to combat Bitcoin mining abuse. Organizations can use the lakehouse to analyze petabytes of data and apply advanced analytics to reduce their cyber risk and operational costs. With Databricks, organizations can counter malicious activity in their cyber operations because the Lakehouse Platform can handle large amounts of data, support complex data processing tasks (including advanced analytics capabilities such as artificial intelligence and machine learning), and scale cost-effectively. The Databricks Lakehouse Platform is a hidden gem for cybersecurity that unifies data, analytics and AI in a single platform.

Our use case centers on the Databricks Community Edition (CE), a free version of Databricks that gives users access to a micro-cluster, cluster manager, and notebook environment for educational and training purposes only.

Eliminating Bitcoin Mining Abuse on Community Edition

Bitcoin mining is a process that uses computing resources to validate transactions and add them to the Bitcoin blockchain. Malicious actors often mine Bitcoin as a way to generate income, and they do so using stolen computing resources. The free compute power offered by Databricks Community Edition is attractive to Bitcoin miners and other abusive users [1].

A user who has access to free or low-cost compute resources through Databricks or another cloud provider may be able to mine Bitcoin more profitably than if they had to purchase their own hardware. Bots and human farms have signed up in bulk, diverting CE resources to fraudulent activity and leaving legitimate users unable to use CE. This has caused service disruptions, degraded usability, and increased operational costs.

Data Driven Approach to Combat Abuse using Lakehouse

We use the Lakehouse Platform to reduce abuse associated with Bitcoin mining. The Databricks Lakehouse Platform is a unified data platform that allows organizations to store and manage structured and unstructured data. By leveraging the power of the lakehouse, organizations can more effectively detect and prevent abuse.

When users work in CE, data about their Databricks workspace usage, such as notebook creation, job scheduling, and cluster usage, is captured and stored as logs in structured, semi-structured, and unstructured forms. These logs are then analyzed to detect threats.

To combat CE abuse, we’ve adopted a data-driven approach. Our data team developed a system built on the Lakehouse to compute features from the log data that various downstream machine learning models use to detect abuse. This is all done on Databricks!

Databricks is committed to protecting the privacy and security of the personal information collected and processed as part of the CE service.

Identify abuse patterns using Machine Learning

Our team used machine learning models, trained on the lakehouse, to learn specific abusive activities and behavior patterns. The system uses trained supervised learning models to identify patterns of abuse in user activity data. For example, learning patterns in the domain names used when signing up for a CE account can help identify the domain names commonly used by abusers.

We developed a supervised learning system that classifies domain names based on features extracted from each domain. We collected a corpus of domains over a few months and labeled each domain as “malicious” or “benign”, depending on whether abuse activity was detected from it. Certain domains, like “gmail.com”, are used for both abuse and genuine activity; such domains are labeled as “average”. Figure 1 below shows the training data for a few domain names.

Figure 1: Domain features and labels of few domain names used for training
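
As a rough illustration of the feature-extraction step, the sketch below derives a few lexical features from a domain name and pairs them with a label. The specific features (length, digit ratio, character entropy, and so on), the function names, and the sample `.example` domains are assumptions made for illustration; the actual features shown in Figure 1 are not reproduced here.

```python
import math
from collections import Counter


def shannon_entropy(text):
    """Shannon entropy (bits per character) of the characters in text."""
    if not text:
        return 0.0
    total = len(text)
    counts = Counter(text)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def domain_features(domain):
    """Return a dict of simple lexical features for one domain name.

    These features are hypothetical examples, not the production feature set.
    """
    domain = domain.strip().lower()
    # name: everything before the last dot; tld: the last label
    name, _, tld = domain.rpartition(".")
    digits = sum(ch.isdigit() for ch in name)
    return {
        "length": len(domain),
        "num_labels": domain.count(".") + 1,
        "digit_ratio": digits / len(name) if name else 0.0,
        "has_hyphen": "-" in name,
        "entropy": shannon_entropy(name),
        "tld": tld,
    }


# Hypothetical labeled corpus using the three labels described above.
labeled_domains = [
    ("gmail.com", "average"),
    ("x9k2q7-mail.example", "malicious"),
    ("university.example", "benign"),
]

# One training row per domain: features plus the label column.
training_rows = [
    {**domain_features(domain), "label": label}
    for domain, label in labeled_domains
]
```

Rows in this shape can be handed to any supervised classifier; the categorical `tld` column would need encoding first.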

Using MLflow for Model Management

A classifier is trained using these domain features. We use MLflow for model management because it lets us track each experiment's parameters, metrics, and artifacts, and it integrates with a wide range of machine learning tools such as scikit-learn. As we vary the classifier's hyperparameters, we track each configuration as a separate experiment in MLflow. Evaluation metrics such as precision, recall, and false positive rate are recorded for each experiment. MLflow’s API can be used to compare the evaluation metrics of different experiments. We can filter and sort the experiments on specific evaluation metrics to identify the best-performing models. The best model can be registered in MLflow's model registry for future use and deployed in production.
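
To make the comparison step concrete, here is a minimal sketch in plain Python that computes precision, recall, and false positive rate from confusion-matrix counts and then picks the best run under a false-positive-rate cap. It does not call MLflow; the `RunResult` class stands in for the metrics MLflow would record per run, and the run IDs, hyperparameters, counts, and selection rule are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class RunResult:
    """Stand-in for one tracked run: hyperparameters plus confusion counts."""
    run_id: str
    params: dict
    tp: int  # abusive domains correctly flagged
    fp: int  # benign domains wrongly flagged
    tn: int  # benign domains correctly passed
    fn: int  # abusive domains missed

    @property
    def precision(self):
        flagged = self.tp + self.fp
        return self.tp / flagged if flagged else 0.0

    @property
    def recall(self):
        abusive = self.tp + self.fn
        return self.tp / abusive if abusive else 0.0

    @property
    def false_positive_rate(self):
        benign = self.fp + self.tn
        return self.fp / benign if benign else 0.0


def best_run(runs, max_fpr):
    """Filter runs by a false-positive-rate cap, then sort by recall.

    Ties on recall are broken by precision. Returns None if no run
    satisfies the cap.
    """
    eligible = [r for r in runs if r.false_positive_rate <= max_fpr]
    if not eligible:
        return None
    return max(eligible, key=lambda r: (r.recall, r.precision))


# Hypothetical runs with different hyperparameters.
runs = [
    RunResult("run-a", {"max_depth": 3}, tp=80, fp=5, tn=895, fn=20),
    RunResult("run-b", {"max_depth": 6}, tp=90, fp=30, tn=870, fn=10),
    RunResult("run-c", {"max_depth": 4}, tp=85, fp=8, tn=892, fn=15),
]

selected = best_run(runs, max_fpr=0.01)
```

Capping the false positive rate before ranking on recall reflects the priority of not blocking legitimate users; a different ordering of metrics may suit other deployments.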

This system is deployed in real time on the Lakehouse Platform to quickly identify abusive users. Real-time monitoring and detection help us stop abusive activity before it consumes our computing resources. To do this, each new domain seen during the sign-up process is analyzed using the domain classification model registered in the MLflow model registry. If a domain is deemed abusive, it is blocked from future sign-ups.
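
The sign-up check described above might look roughly like the following sketch. The `SignupGate` class and the `toy_classifier` function are hypothetical; in particular, `toy_classifier` is only a placeholder for whatever model is loaded from the registry, and the `.example` domains are made up.

```python
class SignupGate:
    """Blocks sign-ups from domains that a classifier labels as malicious."""

    def __init__(self, classify):
        # classify: callable that maps a domain string to a label string
        self._classify = classify
        self._blocked = set()

    def allow(self, email):
        """Return True if the sign-up may proceed, False if it is blocked."""
        domain = email.rsplit("@", 1)[-1].strip().lower()
        if domain in self._blocked:
            # Already known to be abusive: no need to score it again.
            return False
        if self._classify(domain) == "malicious":
            # Remember the domain so future sign-ups are rejected directly.
            self._blocked.add(domain)
            return False
        return True


def toy_classifier(domain):
    """Placeholder for the registered model; uses a hard-coded set."""
    known_bad = {"mining-farm.example"}
    return "malicious" if domain in known_bad else "benign"


gate = SignupGate(toy_classifier)
first_attempt = gate.allow("alice@mining-farm.example")
second_attempt = gate.allow("bob@Mining-Farm.example")
legit_attempt = gate.allow("carol@university.example")
```

Keeping the blocklist separate from the model means a blocked domain stays blocked even if a later model version would score it differently, which may or may not be the desired behavior.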

Figure 2 below shows the end-to-end workflow of the domain classification model.

Fig 2: Domain classification using MLflow

Using an Ensemble Approach to Detect Abuse

In addition to blocking suspicious domains at sign-up, the system uses an ensemble of techniques to detect Bitcoin mining activity at each stage of the user journey. Behavioral features are generated from the data to summarize user activity. By analyzing these features, our team can identify suspicious activity associated with Bitcoin mining, such as high CPU usage or unusual network activity. The system applies an anomaly detection algorithm to the behavioral features to find the outliers that correspond to abusive users. An irregularity in a user's compute usage, for example, could suggest Bitcoin mining activity.
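
The post does not name the anomaly detection algorithm, so the sketch below swaps in one simple, well-known option: the modified z-score based on the median and the median absolute deviation (MAD). The feature (average CPU utilization per user), the user IDs, and the 3.5 threshold are assumptions for illustration.

```python
from statistics import median


def modified_z_scores(values):
    """Modified z-score of each value, using the median and the MAD.

    Returns all zeros when the MAD is zero (more than half of the values
    are identical), since the score is undefined in that case.
    """
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        return [0.0 for _ in values]
    # 0.6745 rescales the MAD to match the standard deviation
    # for normally distributed data.
    return [0.6745 * (v - med) / mad for v in values]


def flag_high_usage(cpu_by_user, threshold=3.5):
    """Return users whose CPU usage is anomalously high (one-sided test)."""
    users = list(cpu_by_user)
    scores = modified_z_scores([cpu_by_user[u] for u in users])
    return [u for u, score in zip(users, scores) if score > threshold]


# Hypothetical average CPU utilization (percent) per user.
cpu_by_user = {
    "u1": 12.0,
    "u2": 15.0,
    "u3": 9.0,
    "u4": 14.0,
    "u5": 11.0,
    "u6": 97.0,
}

flagged = flag_high_usage(cpu_by_user)
```

The median and MAD are used instead of the mean and standard deviation because a few heavy miners would otherwise inflate the baseline they are being compared against.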

According to BTC.com, a Bitcoin mining pool distribution website, the top five mining pools control over 60% of the total Bitcoin network hashrate. These pools consist of numerous individual miners, some with multiple accounts, who collaborate to increase their chances of mining blocks and earning rewards. Detecting such clusters of mining activity is therefore crucial to protecting compute resources from malicious actors. Clustering is an unsupervised learning technique that groups similar objects together, and the system uses clustering algorithms to group similar patterns of user behavior. Each cluster is then evaluated to determine whether it is indicative of abuse, and this evaluation is automated so that abusive clusters are detected without manual review.
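
The clustering algorithm is not specified in this post. As one possible illustration, the sketch below groups users whose behavioral feature vectors lie within a distance threshold of each other (single-linkage grouping via union-find). The two features, the user IDs, and the threshold are hypothetical, and the features are assumed to be scaled to comparable ranges already.

```python
import math


def cluster_users(features, max_distance):
    """Group users whose feature vectors are within max_distance.

    Two users end up in the same cluster if a chain of pairwise-close
    users connects them. Returns clusters sorted largest first.
    """
    users = list(features)
    parent = {u: u for u in users}

    def find(u):
        # Follow parent links to the cluster root, compressing the path.
        while parent[u] != u:
            parent[u] = parent[parent[u]]
            u = parent[u]
        return u

    for i, a in enumerate(users):
        for b in users[i + 1:]:
            if math.dist(features[a], features[b]) <= max_distance:
                parent[find(a)] = find(b)

    clusters = {}
    for u in users:
        clusters.setdefault(find(u), []).append(u)
    return sorted(
        (sorted(members) for members in clusters.values()),
        key=lambda members: (-len(members), members),
    )


# Hypothetical features: (average CPU fraction, scaled notebook activity).
features = {
    "u1": (0.95, 0.10),
    "u2": (0.97, 0.12),
    "u3": (0.96, 0.08),
    "u4": (0.20, 0.90),
    "u5": (0.25, 0.85),
    "u6": (0.60, 0.40),
}

clusters = cluster_users(features, max_distance=0.1)
```

The pairwise comparison is quadratic in the number of users, so this form only suits small samples. Deciding which clusters are abusive is a separate step that this sketch leaves out.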

Model Performance Monitoring using Lakehouse

To monitor the data and identify trends and patterns associated with abuse activity, the system uses Databricks SQL to create visualizations. For example, visualizing the total cost or compute used in real time helps us spot sudden spikes that correspond to abuse-related activity. Our dashboards bring together several kinds of visualizations, such as time series plots, network traffic visualizations, and heat maps.

Figure 3: Time series plot of cluster uptime each day

False positives are expensive because they distract from real abuse activity, which then goes undetected. When a Databricks workspace is considered abusive, we cancel it to prevent further abuse. If a workspace is wrongly canceled, it can disrupt tasks and lead to unhappy users. To keep the false positive rate low, the system uses MLflow to compare and select the best-performing machine learning model stored in the Lakehouse. Comparing different models and tuning hyperparameters with MLflow helps improve model accuracy and reduce false positives. The system's false positive rate is very low, and it has achieved a sustained decrease in CE cost.

Abuse patterns evolve over time. The machine learning models are retrained as new data becomes available, with MLflow tracking each retrained version. This keeps the models up to date with the latest data and patterns of abuse.

The benefits of using Databricks Lakehouse to reduce Bitcoin mining are:

  • Scalability: Databricks can handle large volumes of data, making it possible to detect abuse activity across a large number of users.
  • Efficiency: Databricks can process data quickly, allowing organizations to detect real-time abuse activity.
  • Adaptability: Databricks can adapt to changes in user behavior, making it possible to detect new types of abuse activity.
  • Accuracy: Databricks helps fine-tune models and achieve a low false positive rate, leading to more accurate detection of abuse activity.

Conclusion

In this blog, you have learned how organizations can use the Databricks Lakehouse Platform to analyze vast amounts of data, apply advanced analytics, and implement machine learning models to detect and prevent malicious activity effectively. By unifying data, analytics, and AI in a single platform, Databricks offers a seamless solution to tackle cybersecurity challenges head-on.

Don't miss out on the opportunity to fortify your defense against abuse and secure your cloud computing resources. Embrace the potential of the Lakehouse Platform and join the community dedicated to protecting data privacy and security. Together, we can create a safer digital environment for everyone.

References:
[1] Joshua A. Kroll, Ian C. Davey, and Edward W. Felten, "The Economics of Bitcoin Mining, or Bitcoin in the Presence of Adversaries," Princeton University.
