Calculating Entropy of Queries (DNS - Part 3)

Hi!

Calculating the entropy of DNS queries is an advanced cybersecurity technique for detecting malicious activity that may go undetected by traditional filtering methods.

Here are the main benefits of this analysis:

  1. Data Exfiltration Detection

Data exfiltration via DNS (DNS tunneling) works by splitting the stolen information into chunks and embedding them in the subdomains of DNS queries.

Normal query: “google.com” or “amazon.ca” (low entropy).

Suspicious query: “VjI1bWxtbHpkR2x2Ym1VdWJXRnpZMnBvYm1sdWRBPT0.pretty-malicious-domain-name.com.” (high entropy).

These encoded strings (Base64 or Hex) contain a wide variety of characters without linguistic structure, which increases the entropy index.
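As a rough illustration of the gap between the two kinds of query, the sketch below computes character-level Shannon entropy in base R for the normal domains and for the first label of the suspicious query. The helper name `char_entropy` is an assumption made for this example. It assumes a non-empty input string.

```r
# Character-level Shannon entropy (base 2) of a single non-empty string.
char_entropy <- function(s) {
  chars <- strsplit(s, "")[[1]]
  p <- table(chars) / length(chars)   # relative frequency of each character
  -sum(p * log2(p))
}

normal  <- c("google.com", "amazon.ca")
# First label of the suspicious query quoted above
encoded <- "VjI1bWxtbHpkR2x2Ym1VdWJXRnpZMnBvYm1sdWRBPT0"

entropies <- sapply(c(normal, encoded), char_entropy)
print(round(entropies, 2))
```

The encoded label should come out well above the two ordinary domains, which is the signal this kind of analysis relies on.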

  2. Identifying Domain Generation Algorithms (DGA)

Malware often uses DGAs to generate thousands of random domain names to contact its command and control (C2) server.

A “human-readable” domain is typically composed of recognizable words or syllables.

A “DGA” domain like z8f2l9m4q1p.net has much higher entropy because the distribution of its characters is virtually random. By monitoring entropy, you can flag or block these connections before the domains are even listed in threat intelligence feeds.
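To make the comparison concrete, here is a small base R sketch that puts a readable domain next to the DGA-style example. It also shows the upper bound log2(n) that a string of n characters can reach, which happens only when every character is distinct. The names `char_entropy`, `max_entropy` and `ratio` are assumptions made for this illustration.

```r
# Character-level Shannon entropy (base 2) of a single non-empty string.
char_entropy <- function(s) {
  chars <- strsplit(s, "")[[1]]
  p <- table(chars) / length(chars)
  -sum(p * log2(p))
}

domains <- c("google.com", "z8f2l9m4q1p.net")

result <- data.frame(
  domain      = domains,
  entropy     = sapply(domains, char_entropy, USE.NAMES = FALSE),
  max_entropy = log2(nchar(domains))   # reached only if all characters differ
)
# Share of the theoretical maximum actually used by each domain
result$ratio <- result$entropy / result$max_entropy
print(result)
```

Because short names have a low ceiling, comparing the entropy with that ceiling can help when a DGA domain is too short to reach a high absolute value.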

  3. Detecting Beacons and C2

Some malicious implants use DNS to send signals (“beacons”) to their server. While the frequency may be low to avoid detection, the complexity of the subdomain used to transmit the infected machine’s state will often reveal abnormal entropy compared to standard user browsing traffic.
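One way to look for this is to summarise entropy per source host rather than per query. The base R sketch below does that on a small, entirely hypothetical sample: the addresses, the queries and the names `char_entropy`, `queries` and `per_host` are invented for the illustration and do not come from a real capture.

```r
# Character-level Shannon entropy (base 2) of a single non-empty string.
char_entropy <- function(s) {
  chars <- strsplit(s, "")[[1]]
  p <- table(chars) / length(chars)
  -sum(p * log2(p))
}

# Hypothetical sample: one host browsing normally, one sending beacon-like queries.
queries <- data.frame(
  src   = c("192.0.2.10", "192.0.2.10", "192.0.2.20", "192.0.2.20"),
  query = c("www.google.com", "amazon.ca",
            "a9x3kq7z2m.example.com", "t4w8b1y6vd.example.com"),
  stringsAsFactors = FALSE
)
queries$entropy <- sapply(queries$query, char_entropy, USE.NAMES = FALSE)

# Mean entropy and number of queries per source host.
per_host <- aggregate(entropy ~ src, data = queries, FUN = mean)
per_host$n_queries <- as.vector(table(queries$src)[per_host$src])
print(per_host)
```

A host with few queries but a noticeably higher mean entropy than its neighbours would be a candidate for a closer look.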

  4. Attack Surface Analysis

In a complex network environment, entropy allows traffic to be segmented:

Low entropy: Legitimate traffic, known domains, CDN services.

High entropy: Requires investigation. This could be tunneling, cryptocurrency mining, or simply very verbose software telemetry services.

What are we talking about when we talk about entropy?

A good source of information is the following page: Entropy calculation formula (Shannon)
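For reference, the quantity computed by the R function F_calc_entropy in this article is the Shannon entropy in base 2, where p_i is the relative frequency of the i-th distinct character in the query and n is the number of distinct characters. The worked example for “google.com” is added here as an illustration.

```latex
H = -\sum_{i=1}^{n} p_i \log_2 p_i

% Worked example: "google.com" has 10 characters:
% 'o' appears 3 times, 'g' 2 times, and 'l', 'e', '.', 'c', 'm' once each.
H = -\left( 0.3 \log_2 0.3 + 0.2 \log_2 0.2 + 5 \times 0.1 \log_2 0.1 \right)
  \approx 0.521 + 0.464 + 1.661
  \approx 2.65 \text{ bits}
```

The more evenly the characters are spread, the closer H gets to its maximum for that string length.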

Now, let’s create a small program in R that will read all the entries (DNS queries) present in Zeek’s “dns.log” file and display the results (entropy).

library(tidyverse)

F_calc_entropy <- function(input_string) {
  if (is.na(input_string) || input_string == "") return(0)
  
  # We split 'input_string'
  chars <- strsplit(input_string, "")[[1]]
  # Calculating the frequencies of each character
  p <- table(chars) / length(chars)
  # Apply Shannon's formula (base 2)
  -sum(p * log2(p))
}

# Reading 'dns.log'. Lines beginning with '#' (Zeek header and footer) are skipped,
# and Zeek's default unset-field marker '-' is read as NA.
dns_data <- read_delim("dns.log", delim = "\t", comment = "#", na = c("", "NA", "-"),
            col_names = c("ts", "uid", "id.orig_h", "id.orig_p", "id.resp_h", 
                        "id.resp_p", "proto", "trans_id", "rtt", "query", 
                        "qclass", "qclass_name", "qtype", "qtype_name", 
                        "rcode", "rcode_name", "AA", "TC", "RD", "RA", 
                        "Z", "answers", "TTLs", "rejected"))


# Calculating entropy for each query
print("Working on dns_analysis")
dns_analysis <- dns_data %>%
  filter(!is.na(query)) %>%
  # vapply returns exactly one unnamed numeric value per query
  mutate(entropy = vapply(query, F_calc_entropy, numeric(1), USE.NAMES = FALSE)) %>%
  select(ts, id.orig_h, query, entropy)

print(dns_analysis)

Example results:

            ts id.orig_h       query                                     entropy
         <dbl> <chr>           <chr>                                       <dbl>
 1 1775502025. 162.212.157.188 pfcloud.io                                   3.12
 2 1775502025. 162.212.157.188 ns2.pfcloud.io                               3.52
 3 1775502027. 162.212.157.188 145.1.184.180.in-addr.arpa                   3.46
 4 1775502031. 162.212.157.188 cpserver.net                                 3.02
 5 1775502031. 162.212.157.188 ns02.cpserver.net                            3.34
 6 1775502037. 162.212.157.188 ns2-05.azure-dns.net                         3.68
 7 1775502042. 162.212.157.188 ns6.pptechnology.cc                          3.58
 8 1775502048. 162.212.157.188 ns1.kapelan.de                               3.24
 9 1775503742. 162.212.157.188 VjI1bWxtbHpkR2x2Ym1VdWJXRnpZMnBvYm1sdWRB…    5.05
10 1775503933. 162.212.157.188 www.google.com                               2.84

Row 9 shows the suspicious domain name mentioned above, with an entropy greater than 5.

One possible way to interpret the results:

| Entropy level | Interpretation | Probable actions |
|---|---|---|
| 0.0 - 3.5 | White zone: standard domain names (google.com, etc.). | Nothing special, usual requests. |
| 3.5 - 4.2 | Grey zone: CDNs, complex hostnames, telemetry tools. | Increased monitoring is necessary. |
| 4.2 - 5.0 | Yellow zone: Base32/64 encoding, DGA (domain generation algorithms). | An alert must be raised and an investigation conducted. |
| > 5.0 | Red zone: suspicion of DNS tunneling and/or data exfiltration. | A thorough check is needed to potentially block the connection. The incident response process must be activated. |
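These zones can be turned into a label with base R's `cut()`. The sketch below is one possible mapping: the function name `classify_entropy` and the choice to put a value that falls exactly on a boundary into the lower zone are assumptions, since the ranges above do not say which side the boundaries belong to.

```r
# Map entropy values to the zones listed above.
# Assumption: boundary values (3.5, 4.2, 5.0) fall into the lower zone.
classify_entropy <- function(entropy) {
  cut(entropy,
      breaks = c(0, 3.5, 4.2, 5.0, Inf),
      labels = c("white", "grey", "yellow", "red"),
      include.lowest = TRUE,   # keeps an entropy of exactly 0 in the white zone
      right = TRUE)            # intervals are closed on the right
}

# Sample values: three taken from the example results, plus one in the yellow range
sample_entropy <- c(2.84, 3.68, 4.63, 5.05)
zones <- classify_entropy(sample_entropy)
print(data.frame(entropy = sample_entropy, zone = zones))
```

Applied to the dns_analysis table, the same function could fill a new column, for example with mutate(zone = classify_entropy(entropy)).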

In a future article, we will focus on visualizing these concepts.

Cheers.