R - read.csv() or readLines()?
Let’s compare the execution speed of two functions.
To make some treemaps, I use information from the following link: https://github.com/ccodwg/CovidTimelineCanada/blob/main/data/hr/cases_hr.csv
To read this data in R, I tried two approaches: the read.csv() function and the readLines() function. Let's see which one is faster.
Using read.csv()
#!/usr/local/bin/Rscript
# Read the whole file into a data frame at once, then walk it row by row.
df <- read.csv("cases_hr.csv", header = TRUE)
for (i in 1:nrow(df)) {
  province         <- df[i, 1]
  health_region    <- df[i, 2]
  date_report      <- df[i, 3]
  cases            <- df[i, 4]
  cumulative_cases <- df[i, 5]
}
Using readLines()
#!/usr/local/bin/Rscript
# Read the file as raw text lines, then split each line on commas.
conn <- file("cases_hr.csv", open = "r")
lines <- readLines(conn)
close(conn)
# Start at 2 to skip the header line.
for (i in 2:length(lines)) {
  lline <- unlist(strsplit(lines[i], ","))
  province         <- lline[1]
  health_region    <- lline[2]
  date_report      <- lline[3]
  cases            <- lline[4]
  cumulative_cases <- lline[5]
}
What is the data file made up of?
$ cat cases_hr.csv | wc -l
94433
Now let’s compare execution times
$ time test_read_csv.R
0m17.68s real 0m16.58s user 0m00.72s system
$ time test_readlines.R
0m10.67s real 0m06.70s user 0m03.69s system
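The same comparison can also be done from inside R with system.time(), which avoids shell startup overhead. A minimal sketch, run here on a small synthetic file with the same five columns (the real cases_hr.csv layout is an assumption):

```r
#!/usr/local/bin/Rscript
# Build a small synthetic CSV mimicking the cases_hr.csv column layout.
tmp <- tempfile(fileext = ".csv")
header <- "province,health_region,date_report,cases,cumulative_cases"
rows <- paste("ON", "Toronto", "2020-03-01", 1:10000, cumsum(1:10000), sep = ",")
writeLines(c(header, rows), tmp)

# Time the data-frame approach.
t_csv <- system.time(df <- read.csv(tmp, header = TRUE))

# Time the raw-lines approach (drop the header, split on commas).
t_lines <- system.time({
  lines <- readLines(tmp)
  parts <- strsplit(lines[-1], ",", fixed = TRUE)
})

cat("read.csv:", t_csv["elapsed"], "s  readLines:", t_lines["elapsed"], "s\n")
unlink(tmp)
```

On such a small file both finish almost instantly; the gap only becomes visible at the ~94,000-line scale of the real data.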
In my case, the readLines() function clearly turns out to be faster. Part of the difference is that read.csv() does extra work: it converts each column to an appropriate type, while readLines() only returns character strings. With other data, or when typed columns are needed, the outcome may differ.
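If read.csv() is still preferred for its typed columns, declaring the types up front with colClasses lets it skip type guessing, which often narrows the gap on large files. A sketch on a synthetic one-row file, assuming the five columns of cases_hr.csv in the order used above:

```r
# Assumption: cases_hr.csv has exactly these five columns, in this order.
tmp <- tempfile(fileext = ".csv")
writeLines(c("province,health_region,date_report,cases,cumulative_cases",
             "ON,Toronto,2020-03-01,12,345"), tmp)

# colClasses tells read.csv() the type of each column, so it does not
# have to guess by scanning the values.
df <- read.csv(tmp, header = TRUE,
               colClasses = c("character", "character", "character",
                              "integer", "integer"))
str(df)
unlink(tmp)
```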