Engineering blog

According to Sophos, 46% of all malware now uses Transport Layer Security (TLS) to conceal its communication channels, a figure that has doubled in the last year alone. Malware such as LockBit ransomware and the AgentTesla and Bladabindi remote access tools (RATs) has recently been observed using TLS for PowerShell-based droppers, for retrieving code from Pastebin, and more.

In this blog, we will walk through how security teams can ingest x509 certificates (found in the TLS handshake) from AWS S3 storage into Delta Lake, enrich them, and apply threat hunting techniques to them.

TLS is the de facto standard for securing web applications, and forms part of the overall trust hierarchy within a public key infrastructure (PKI) solution.

During the initial connection from a client to server, the TLS protocol performs a two-phase handshake, whereby the web server proves its identity to the client by way of information held in the x509 certificate. Following this, both parties agree on a number of algorithms, then generate and exchange symmetric keys, which are subsequently used to transmit encrypted data.
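As an aside, you can grab a server's leaf certificate off the wire yourself with Python's standard `ssl` module; this is a minimal sketch, and the host name passed in is whatever you want to inspect:

```python
import socket
import ssl

def fetch_cert_der(host: str, port: int = 443) -> bytes:
    """Complete a TLS handshake and return the server's leaf certificate (DER bytes)."""
    ctx = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=5) as sock:
        with ctx.wrap_socket(sock, server_hostname=host) as tls:
            # binary_form=True returns the raw DER encoding, ready for fingerprinting
            return tls.getpeercert(binary_form=True)

# The DER bytes can be converted to PEM for manual inspection:
# pem = ssl.DER_cert_to_PEM_cert(fetch_cert_der("example.com"))
```

Because the certificate is sent before encryption begins, no decryption capability is needed to collect it.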

For all practical purposes, x509 certificates are unique and can be identified using hashes (commonly SHA1, SHA256 and MD5) called fingerprints. The nature of hashing makes fingerprints great threat indicators, and they are commonly used in threat intelligence feeds to represent objects. Although the information within a certificate is used to establish cryptographic key material (agreement, exchange, creation, etc.), the certificate itself is encoded but not encrypted, and can therefore be read.
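Since a fingerprint is just a hash over the certificate's DER bytes, computing one is a one-liner per algorithm; the byte string below is a stand-in for real certificate bytes:

```python
import hashlib

def cert_fingerprints(der_bytes: bytes) -> dict:
    """Compute the fingerprints commonly shared in threat intel feeds."""
    return {
        "md5": hashlib.md5(der_bytes).hexdigest(),
        "sha1": hashlib.sha1(der_bytes).hexdigest(),
        "sha256": hashlib.sha256(der_bytes).hexdigest(),
    }

# Any change to the certificate bytes produces entirely different fingerprints,
# which is what makes them reliable, compact indicators.
fingerprints = cert_fingerprints(b"stand-in certificate bytes")
```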

Capturing, storing and analyzing network traffic is a challenging task. However, landing it in cheap cloud object storage, processing it at scale with Databricks and keeping only the interesting bits could be a valuable workflow for security analysts and threat hunters. If we can identify suspicious connections, we have an opportunity to create indicators of compromise (IOCs) and have our SIEM and other security tools help prevent further malicious activity downstream.

About the data sets

We are using x509 data collected from a network scan, and alongside it, we will use the Cisco Umbrella top 1 million list and the SSL blacklist produced by abuse.ch as lookups.

One of the best places within an enterprise network to get hold of certificate data is off the wire using packet capture techniques. Zeek, TCPDump and Wireshark are all good examples.

If you are not aware of the cyber threat hunting resource SSL blacklist (SSLBL), it is run by abuse.ch with the goal of detecting malicious SSL connections. The Cisco Umbrella top 1 million list contains the most popular DNS lookups on the planet, as seen by Cisco. We will use it to demonstrate filtering and lookup techniques. If you or your hunt team want to follow along with the data, you can import the accompanying notebook.
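To make the lookup idea concrete, here is a stdlib-only sketch of indexing a blacklist-style CSV (listing date, SHA1 fingerprint, listing reason) by fingerprint. The column names and rows are illustrative, not the real feed:

```python
import csv
import io

# Illustrative rows only; the real feed is much larger and the reasons differ.
sample = (
    "Listingdate,SHA1,Listingreason\n"
    "2021-01-01 00:00:00,aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa,ExampleC2\n"
    "2021-01-02 00:00:00,bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb,ExampleRAT\n"
)

def load_blacklist(text: str) -> dict:
    """Index blacklist rows by SHA1 fingerprint for O(1) lookups."""
    reader = csv.DictReader(io.StringIO(text))
    return {row["SHA1"]: row["Listingreason"] for row in reader}

blacklist = load_blacklist(sample)
# A hit means an observed certificate fingerprint is known-bad:
# blacklist.get(observed_sha1) -> listing reason, or None for a miss
```

At scale, the same correlation is better expressed as a join between Delta tables, which is what the rest of this walkthrough does.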



Ingesting the data sets

For simplicity, if you are following along at home, we will use Delta Lake's batch capability to ingest the data from an AWS S3 bucket into a bronze table, then refine and enrich it into a silver table (the medallion architecture). However, in real-world applications you can upgrade your experience using Structured Streaming!

We’ll focus on the blacklist and umbrella files first, followed by x509 certificate data.
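The snippets below rely on two helpers, `read_batch` and `create_batch_writer`, which come from the accompanying notebook and aren't shown in this post. A minimal sketch of what they might look like (an assumption, not the notebook's actual code):

```python
def read_batch(spark, path, format="csv", schema=None):
    """Read a batch source from object storage into a dataframe."""
    reader = spark.read.format(format).option("header", "true")
    if schema is not None:
        reader = reader.schema(schema)
    return reader.load(path)

def create_batch_writer(spark, df, mode="overwrite", format="delta"):
    """Return a Delta batch writer configured for the given dataframe."""
    return df.write.format(format).mode(mode).option("overwriteSchema", "true")
```

Wrapping the reader/writer boilerplate like this keeps the bronze and silver steps down to one line each.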


# Alexa-Top1m
rawTop1mDF = read_batch(spark, top1m_file, format='csv', schema=alexa_schema)



# Write to Bronze Table
alexaTop1mBronzeWriter = create_batch_writer(spark=spark, df=rawTop1mDF, mode='overwrite')
alexaTop1mBronzeWriter.saveAsTable(databaseName + ".alexaTop1m_bronze")



# Make Transformations to Top1m
bronzeTop1mDF = spark.table(databaseName + ".alexaTop1m_bronze")
bronzeTop1mDF = bronzeTop1mDF.filter(~bronzeTop1mDF.alexa_top_host.rlike('localhost')).drop("RecordNumber")



# Write to Silver Table
alexaTop1mSilverWriter = create_batch_writer(spark=spark, df=bronzeTop1mDF, mode='overwrite')
alexaTop1mSilverWriter.saveAsTable(databaseName + ".alexaTop1m_silver")


The code snippets above read the CSV file from an S3 bucket, write it unaltered to the bronze table, then read the bronze Delta table, apply transformations and write the result to the silver table. That's the top 1 million data ready for use!

Resulting dataframe

Next, we follow the same format for the SSL blacklist data:


# SSLBlacklist
rawBlackListDF = read_batch(spark, blacklist_file, format='csv')
# Rename the feed's commented header column to a clean name
# (assumes the abuse.ch CSV header '# Listingdate')
rawBlackListDF = rawBlackListDF.withColumnRenamed("# Listingdate", "Listingdate")

# Write to Bronze Table
sslBlBronzeWriter = create_batch_writer(spark=spark, df=rawBlackListDF, mode='overwrite')
sslBlBronzeWriter.saveAsTable(databaseName + ".sslBlacklist_bronze")



# Make Transformations to the SSLBlacklist
bronzeBlackListDF = spark.table(databaseName + ".sslBlackList_bronze")
bronzeBlackListDF = bronzeBlackListDF.select(*(col(x).alias('sslbl_' + x) for x in bronzeBlackListDF.columns))



# Write to Silver Table
BlackListSilverWriter = create_batch_writer(spark=spark, df=bronzeBlackListDF, mode='overwrite')
BlackListSilverWriter.saveAsTable(databaseName + ".sslBlackList_silver")


The process above is the same as for the top 1 million file. Our transformation simply prefixes every column with 'sslbl_' so the data is easily identified later.

Resulting SSL blacklist dataframe
Next, we ingest the x509 certificate data using exactly the same methodology. Here's how that dataframe looks after ingestion and transformation into the silver table.

Resulting x509 certificate dataframe

X509 certificates are complex and expose many fields. Some of the most interesting for our initial purposes are:

  • subject, issuer, common_name, valid to/from fields, dest_ip, dest_port, rdns

Analyze the data

Looking for certificates of interest can be done in many ways. We’ll begin by looking for distinct values in the issuer field.

Example of how Databricks cyber threat hunting solution can be used to identify potential threats and vulnerabilities within TLS certificates.

If you are new to PySpark, it is a Python API for Apache Spark. The search above makes use of collect_set, countDistinct, agg, and groupBy.

Our hypothesis is that when certificates are temporary, self-signed or otherwise not used for genuine purposes, the issuer field tends to contain limited detail. Let's create a search on the length of that field.

Sample search that looks at the length of the issuer field, which is part of the cyber threat hunting techniques used by the Databricks solution.

withColumn adds a new column after evaluating the given expression.

The top entry has the shortest length and has unique subject and issuer fields. This is a good candidate for some OSINT!

Sample Google search displaying a number of hits to a TLS certificate believed to be used on malicious websites.

A Google search shows a number of hits suggesting this certificate is, or has been, used on malicious websites. This hash is a great candidate to pivot from and explore further in our network.

Let's now use our SSL blacklist table to correlate with known malicious hashes.


# SSLBlacklist
# silverX509DF is assumed to have been enriched (joined) with the SSL blacklist silver table
isSSLBlackListedDF = silverX509DF.filter(silverX509DF.sslbl_SHA1 != 'null').select(
    "sslbl_Listingreason", "common_name", "country", "dest_ip", "rdns", "issuer",
    "sha1_fingerprint", "not_valid_before", "not_valid_after"
)


Sample search result displaying the movements of adversary infrastructure over time, used as part of the Databricks cyber threat hunting methodology.

This search raises some interesting findings. The top four entries show hits against the command and control infrastructure of a number of different malware families. We also see the same SHA1 fingerprint used for ransomware command and control across different IP addresses and DNS names. There could be a number of reasons for this, but the observation is that adversary infrastructure moves around over time. First-seen/last-seen analysis should be performed using the threat data's listing date, and techniques such as passive DNS lookups should be used to deepen understanding and gain more situational awareness. New information discovered here should also be used to pivot back into an organisation to look for any other signs of communication with these hosts or IP addresses.

Finally, a great technique for hunting command and control communication is to use a Shannon entropy calculation to look for randomized strings.


import math

from pyspark.sql.functions import col, length, udf
from pyspark.sql.types import DoubleType

def entropy(string):
    "Calculates the Shannon entropy of a string"
    try:
        # get probability of chars in string
        prob = [float(string.count(c)) / len(string) for c in dict.fromkeys(list(string))]

        # calculate the entropy
        entropy = -sum([p * math.log(p) / math.log(2.0) for p in prob])
    except Exception:
        entropy = -1.0

    return entropy

# Use a numeric return type so ordering by entropy_score sorts numerically
entropy_udf = udf(entropy, DoubleType())

entropyDF = (silverX509DF
    .where(length(col("subject")) < 15)
    .select("common_name", entropy_udf(col("common_name")).alias("entropy_score"))
    .where(col("entropy_score") > 1.5)
    .orderBy(col("entropy_score").desc()))



Sample search result identifying potentially malicious fingerprints, generated as part of Databricks’ cyber threat hunting solution.

As the saying goes, ‘the internet is a bad place’ and ‘math is a bad mistress’! Our initial search included all certificates, which produced a lot of noise due to the nature of the fields' content. Experimenting further, we learned from our earlier search and focused only on certificates with a subject length of less than fifteen characters, surfacing only the highest-entropy entries of that data set. The resulting nine entries can be googled manually, or further automation could be applied. The top entry in this scenario is of interest, as it appears to be used as part of the CobaltStrike exploit kit.
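To build intuition for why 1.5 is a workable cut-off, here is the same Shannon entropy calculation applied outside Spark to a few illustrative strings (the names below are made up, not real common names):

```python
import math

def shannon_entropy(s: str) -> float:
    """Bits of entropy per character over the string's character distribution."""
    probs = [s.count(c) / len(s) for c in set(s)]
    return -sum(p * math.log(p, 2) for p in probs)

# A repeated character carries no information; a random-looking CN scores high.
scores = {name: round(shannon_entropy(name), 2)
          for name in ["aaaa", "google", "x9k2qv7w1z"]}
```

Dictionary-word hostnames sit in the low single digits, while machine-generated names approach the maximum for their length, so a threshold plus a subject-length filter prunes most benign traffic.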

Sample Google search of a suspicious fingerprint, demonstrating how Databricks’ cyber threat hunting techniques can be applied to real-world situations.

Further work

This walkthrough has demonstrated some techniques we can use to identify suspicious or malicious traffic using simple unique properties of x509 certificates. Further exploration using machine learning techniques may also provide benefits.


Analyzing certificates for unusual properties, or against known threat data can identify infrastructure known to host malicious software. It can be used as an initial pivot point to gather further information that can be used to search for signs of compromise. However, since the certificate identifies a host and not the content it serves, it cannot provide high confidence alerts alone.

Before being eligible for operationalization in a security operations centre (SOC), the initial indicators need further triage. Data from other internal and external sources, such as firewalls, passive DNS, VirusTotal, WHOIS and process creation events from endpoints, should be used.

Let us know at [email protected] how you think processing TLS/x509 data either in an enterprise or more passively on the internet can be used to track adversaries and their infrastructure. If you are not already a Databricks customer, feel free to spin up a community edition.

Download the notebook.
