R - read.csv() or readLines()?

Let’s compare the execution speed of two functions.

To make some treemaps, I use data from the following link: https://github.com/ccodwg/CovidTimelineCanada/blob/main/data/hr/cases_hr.csv

To read this data in R, I tried two approaches: the read.csv() function and the readLines() function. Let's see which one is faster.

Using read.csv()

#!/usr/local/bin/Rscript

df <- read.csv("cases_hr.csv", header = TRUE)

# walk the data frame row by row, pulling out each field
for (i in 1:nrow(df)) {
    province <- df[i,1]
    health_region <- df[i,2]
    date_report <- df[i,3]
    cases <- df[i,4]
    cumulative_cases <- df[i,5]
}
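As an aside, much of this loop's cost likely comes from indexing the data frame one row at a time; column access in R is vectorized, so the same fields can be pulled out without a loop. A minimal sketch on a toy data frame (the values here are made up, not the real COVID data):

```r
# Toy data frame standing in for cases_hr.csv (made-up values)
df <- data.frame(
  province         = c("ON", "QC"),
  health_region    = c("Toronto", "Montreal"),
  date_report      = c("2020-03-01", "2020-03-01"),
  cases            = c(3, 5),
  cumulative_cases = c(10, 12)
)

# One vectorized extraction per column, instead of nrow(df) row lookups
province <- df[[1]]
cases    <- df[[4]]
```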

Using readLines()

#!/usr/local/bin/Rscript


conn <- file("cases_hr.csv",open="r")
lines <- readLines(conn)
close(conn)

# start at 2 to skip the header row
for (i in 2:length(lines)) {
    lline <- unlist(strsplit(lines[i], ","))
    province <- lline[1]
    health_region <- lline[2]
    date_report <- lline[3]
    cases <- lline[4]
    cumulative_cases <- lline[5]
}
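Note that strsplit() returns a list (one element per input string), which is why the loop wraps it in unlist(). A quick self-contained illustration with a made-up line in the same shape as cases_hr.csv:

```r
# A made-up CSV line with the same five fields as cases_hr.csv
line <- "Ontario,Toronto,2020-03-01,3,10"

# strsplit() returns a list; unlist() flattens it to a character vector
parts <- unlist(strsplit(line, ","))

parts[1]        # "Ontario"
length(parts)   # 5
```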

How many lines does the data file contain?

$ cat cases_hr.csv | wc -l                                                                                                                       
   94433

Now let’s compare execution times:

$ time test_read_csv.R                                                                                                                           
    0m17.68s real     0m16.58s user     0m00.72s system

$ time test_readlines.R                                                                                                                          
    0m10.67s real     0m06.70s user     0m03.69s system
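The same kind of comparison can also be run from inside R with system.time(), without the shell. A self-contained sketch on a synthetic file (the file and row count are made up, so the numbers will not match the timings above):

```r
# Build a small synthetic CSV in a temporary file
tmp <- tempfile(fileext = ".csv")
write.csv(data.frame(x = seq_len(1e5), y = rep("a", 1e5)),
          tmp, row.names = FALSE)

# Time the read.csv() approach
t_csv <- system.time(df <- read.csv(tmp, header = TRUE))

# Time the readLines() approach
t_lines <- system.time({
  conn <- file(tmp, open = "r")
  lines <- readLines(conn)
  close(conn)
})

cat("read.csv elapsed: ", t_csv["elapsed"], "\n")
cat("readLines elapsed:", t_lines["elapsed"], "\n")
unlink(tmp)
```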

In my case, at least, the readLines() function turns out to be faster. With other data sets, however, this may not hold.