Visualizing DNS Query Entropy via a Scatterplot Graph (DNS - Part 4)
Hi!
I recently discussed the use of entropy in relation to DNS queries and the benefits of calculating it.
Today I will present a way to visualize these queries.
First, I need a script that will build a reduced version of dns.log and aggregate the result into a file that will represent the day.
- This is simpler than working with dns.log files that are rotated hourly.
- Please note that this file needs to be deleted at midnight.
/usr/bin/awk -F'\t' '/^[^#]/ {print $1 "," $3 "," $10}' dns.log >> zeek_dns_reduced_output.csv
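To see what the reduced file looks like, here is a minimal sketch that runs the same awk filter on a tiny, made-up dns.log. The two records, the truncated '#fields' header and the temporary directory are assumptions for illustration only. The column positions ($1 = ts, $3 = id.orig_h, $10 = query) follow Zeek's default tab-separated dns.log layout.

```shell
# Hypothetical sample data: a truncated header line and two invented records
workdir=$(mktemp -d)
log="$workdir/dns.log"
out="$workdir/zeek_dns_reduced_output.csv"

# Header lines start with '#' and must be ignored by the filter
printf '#fields\tts\tuid\tid.orig_h\n' > "$log"

# Ten tab-separated fields: ts, uid, id.orig_h, id.orig_p, id.resp_h,
# id.resp_p, proto, trans_id, rtt, query
printf '%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\n' \
  1700000000.123456 Cabc123 192.168.1.10 53124 192.168.1.1 53 udp 4242 0.001 example.com >> "$log"
printf '%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\n' \
  1700000060.654321 Cdef456 192.168.1.11 53125 192.168.1.1 53 udp 4243 0.002 a1b2c3d4e5f6.example.net >> "$log"

# Same filter as above: skip '#' lines, keep ts, id.orig_h and query
awk -F'\t' '/^[^#]/ {print $1 "," $3 "," $10}' "$log" >> "$out"

cat "$out"
# 1700000000.123456,192.168.1.10,example.com
# 1700000060.654321,192.168.1.11,a1b2c3d4e5f6.example.net
```

Because the command appends with '>>', each hourly run adds to the same daily file. That is why the file has to be removed at midnight, for example from a scheduled job of your choice.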
Now the R script that will analyze the data.
#!/usr/local/bin/Rscript
# We need this one, one of my favorites
library(tidyverse)
# Function that computes the Shannon entropy (in bits) of the string passed as a parameter
F_calc_entropy <- function(input_string) {
  # Missing or empty input: return 0
  if (is.na(input_string) || input_string == "") return(0)
  # Split 'input_string' into individual characters
  chars <- strsplit(input_string, "")[[1]]
  # Relative frequency of each character
  p <- table(chars) / length(chars)
  # Shannon's formula
  -sum(p * log2(p))
}
# Main function: read the data, compute the entropy and draw the graph
F_analyse_dns_data <- function() {
  # Read the reduced version of dns.log (lines beginning with '#' are skipped)
  dns_data <- read_delim("zeek_dns_reduced_output.csv", delim = ",", comment = "#",
                         col_names = c("ts", "id.orig_h", "query"))
  # Drop missing, reverse-lookup (in-addr) and empty queries,
  # then apply 'F_calc_entropy()' to each remaining query
  dns_analysis <- dns_data %>%
    filter(!is.na(query)) %>%
    filter(!grepl("in-addr|\\(empty\\)", query, ignore.case = TRUE)) %>%
    mutate(
      # Zeek timestamp conversion ('ts' is in epoch seconds; tz = "" means the local time zone)
      datetime = as.POSIXct(ts, origin = "1970-01-01", tz = ""),
      # Minutes elapsed since midnight
      minutes_since_minuit = (as.numeric(format(datetime, "%H")) * 60) +
        (as.numeric(format(datetime, "%M"))),
      # vapply() guarantees a plain numeric vector, one value per query
      entropy = vapply(query, F_calc_entropy, numeric(1), USE.NAMES = FALSE)) %>%
    select(ts, minutes_since_minuit, id.orig_h, query, entropy)
  # Create the graph
  mygraph <- ggplot(dns_analysis, aes(x = minutes_since_minuit, y = entropy)) +
    # Conditional coloring: red if entropy > 4, else blue
    geom_point(aes(color = entropy > 4.0), alpha = 0.5) +
    # Named vectors keep colors and labels correct even if only one class is present
    scale_color_manual(values = c("FALSE" = "steelblue", "TRUE" = "firebrick"),
                       name = "Alert",
                       labels = c("FALSE" = "Normal (<= 4)", "TRUE" = "Suspect (> 4)")) +
    geom_hline(yintercept = 4.0, linetype = "dashed", color = "black", linewidth = 0.8) +
    theme_minimal() +
    labs(
      title = "Visualizing DNS query entropy via a scatterplot graph",
      x = "Minutes since midnight",
      y = "Entropy (bits)"
    )
  # Write the graph to a JPG image
  jpeg("analyse_dns_entropie-scatter.jpg", width = 1200, height = 800, res = 120, quality = 90)
  print(mygraph)
  # Close the graphics device
  dev.off()
}
# calling the main function
F_analyse_dns_data()
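As a reminder of what F_calc_entropy() computes and how to read the threshold of 4 bits, here is a short worked example. The string "aab" is only an illustration, and the symbols p_c (relative frequency of character c) and k (number of distinct characters) are notation introduced here, not names from the script.

```latex
% Shannon entropy of a string s, with p_c the relative frequency of character c
H(s) = -\sum_{c} p_c \log_2 p_c

% Worked example for s = "aab": p_a = 2/3, p_b = 1/3
H(\text{aab}) = -\left( \tfrac{2}{3}\log_2\tfrac{2}{3} + \tfrac{1}{3}\log_2\tfrac{1}{3} \right)
              \approx 0.918 \text{ bits}

% Upper bound for a string with k distinct characters
H(s) \le \log_2 k
\quad\Longrightarrow\quad
H(s) > 4 \text{ requires } k > 16
```

In other words, a query can only land above the dashed line if it contains at least 17 distinct characters. Dots count as characters here, since the script splits the whole query string.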
And now the graph that represents all of this.

Cheers.