Gawk - Dealing with nested commas

A few posts ago, I was echoing the possible presence of nested commas in CSV file fields.

Reminder: because of this, for those who program in R, it makes sense to use the R function read.csv() instead of the function readlines().

But that’s not the topic of the day. As much as I love the R language, as much as I love the ease of using basic Unix tools.

Let’s take an excerpt from the monkeypox data file ( https://ourworldindata.org/monkeypox ).

$ cat file.txt
N37,confirmed,London,London,England,GBR,,,,2022-05-18,,,,,,,,,,N,,,,,,,https://www.gov.uk/government/news/monkeypox-cases-confirmed-in-england-latest-updates,,,,,,,2022-05-18,,2022-05-18
N38,confirmed,South East,South East,England,GBR,,,,2022-05-18,,,,,,,,,,N,,,,,,,https://www.gov.uk/government/news/monkeypox-cases-confirmed-in-england-latest-updates,,,,,,,2022-05-18,,2022-05-18
N39,confirmed,Quebec,"Clinique l'Actuel, Montreal",Canada,CAN,30-59,male,,2022-05-23,"oral and genital ulcers, fever",,,Y,,,,,,,,,,,,,https://ici.radio-canada.ca/nouvelle/1884547/orthopoxvirose-simienne-monkeypox-maladie-sante-publique-quebec-canada,https://montreal.citynews.ca/2022/05/19/montreal-public-health-to-provide-update-on-monkeypox/,,,,,,2022-05-18,,2022-05-24

Let’s use the standard “awk” binary to find the country codes. We know it’s the sixth field.

$ cat file.txt | awk -F, '{print $6}' 
GBR
GBR
Canada

It’s easy to see that something isn’t working.

The word “Canada” is not a country code in the sense of “IBAN” ( I use this: “https://www.iban.com/country-codes" ).

Luckily the folks at the GNU Foundation have come up with a solution which can handle this sort of thing.

In this case “GNU Awk” (Gawk) allows the use of the “FPAT” variable which offers a solution for cases like this.

Reference: ( https://www.gnu.org/software/gawk/manual/html_node/Splitting-By-Content.html ).

$ cat file.txt | gawk 'BEGIN {  FPAT = "([^,]+)|(\"[^\"]+\")" } { print $6 }'
GBR
GBR
CAN

It’s immediately better.

Regards