Gawk - Dealing with nested commas
A few posts ago, I was echoing the possible presence of nested commas in CSV file fields.
Reminder: because of this, for those who program in R, it makes sense to use the R function read.csv() instead of the function readlines().
But that’s not the topic of the day. As much as I love the R language, as much as I love the ease of using basic Unix tools.
Let’s take an excerpt from the monkeypox data file ( https://ourworldindata.org/monkeypox ).
$ cat file.txt
N37,confirmed,London,London,England,GBR,,,,2022-05-18,,,,,,,,,,N,,,,,,,https://www.gov.uk/government/news/monkeypox-cases-confirmed-in-england-latest-updates,,,,,,,2022-05-18,,2022-05-18
N38,confirmed,South East,South East,England,GBR,,,,2022-05-18,,,,,,,,,,N,,,,,,,https://www.gov.uk/government/news/monkeypox-cases-confirmed-in-england-latest-updates,,,,,,,2022-05-18,,2022-05-18
N39,confirmed,Quebec,"Clinique l'Actuel, Montreal",Canada,CAN,30-59,male,,2022-05-23,"oral and genital ulcers, fever",,,Y,,,,,,,,,,,,,https://ici.radio-canada.ca/nouvelle/1884547/orthopoxvirose-simienne-monkeypox-maladie-sante-publique-quebec-canada,https://montreal.citynews.ca/2022/05/19/montreal-public-health-to-provide-update-on-monkeypox/,,,,,,2022-05-18,,2022-05-24
Let’s use the standard “awk” binary to find the country codes. We know it’s the sixth field.
$ cat file.txt | awk -F, '{print $6}'
GBR
GBR
Canada
It’s easy to see that something isn’t working.
The word “Canada” is not a country code in the sense of “IBAN” ( I use this: “https://www.iban.com/country-codes" ).
Luckily the folks at the GNU Foundation have come up with a solution which can handle this sort of thing.
In this case “GNU Awk” (Gawk) allows the use of the “FPAT” variable which offers a solution for cases like this.
Reference: ( https://www.gnu.org/software/gawk/manual/html_node/Splitting-By-Content.html ).
$ cat file.txt | gawk 'BEGIN { FPAT = "([^,]+)|(\"[^\"]+\")" } { print $6 }'
GBR
GBR
CAN
It’s immediately better.
Regards