CSV is not a standard | Chris Warrick

CSV is not a standard. What does that really mean for anyone using that format? The file’s recipient may be unable to read it the way you intended. Separators, decimal marks, escaping and encodings are all problems — and Excel does them all pretty badly.

So first, some people might claim that RFC 4180 is the CSV standard. Those people also have not read the document they’re referring to. It states:

This memo provides information for the Internet community. It does not specify an Internet standard of any kind.

The problem with this is the fact that a .csv file does not mean much. There are a few problems. The first question is,

What is the field separator? Is it a comma or a semicolon?

Hey, wait a minute, doesn’t the file format/extension stand for comma-separated values? Yes, it does. But that does not matter in the slightest. You see, Microsoft Excel — which most people will use to read/write their CSV files — makes this decision based on the user locale settings. If the OS is set to a locale where the comma is the decimal mark (eg. most of Europe), the list separator is set to ; instead of , — and Excel uses that.

Of course, there’s also the TSV data format — those are tab-separated values. And some people might name their TSV files .csv.

To read files saved in a different locale, or with a different separator, Excel users need to change the file extension to .txt, or go to Data → Get External Data → From Text (documentation) and use the import wizard. You can’t double-click on files.

On a side note, Apple Numbers guesses the format — one of the few things it gets right. LibreOffice always asks the user to pick import settings, but by default it uses tab AND comma AND semicolon for CSV files, which brings its own host of problems.

Here’s a quick test:

What does foo;bar,baz;quux mean? What about foo,bar;baz,quux?

LibreOffice assumes it’s (Chinese) UTF-16 text, but after telling it the real encoding, both files contain 4 columns.
Microsoft Excel says one of the files contains 3 columns and the other contains 2 columns (which is which depends on locale)
Apple Numbers says the first file contains 3 columns and the other contains 2 columns if set to English, and both files contain 3 columns if set to Polish.

But let’s get back to gotchas:

What is the decimal mark? Is it a dot or a comma?

That’s a direct consequence of the previous question. However, one can’t simply assume comma/dot and semicolon/comma, because users might do crazy stuff.

What is used to escape rows containing the field separator? Quotes? Backslashes? What is used to escape the escape character?

Excel, for example, puts some things in "quotes". If a literal quote character appears in the spreadsheet, it’s represented as "", and the entire cell is quoted as well. But there might be programs that use backslashes for escapes, or even bad code that does not consider the need of escaping like this, with tragic results.

There’s still one more thing to cover: encodings. You see, even though the TSV format effectively solves the issues I named before, both CSV and TSV suffer from one problem:

Which encoding to use when reading this file?

I already mentioned that LibreOffice believed my sample file was UTF-16, containing Chinese text — in reality, this file was UTF-8 (or ASCII).

What does Microsoft Excel do then? It looks like it follows System locale for non-Unicode programs. While there is an encoding option hidden in the Save dialog, it does not seem to affect the output. So what does that mean? You can’t expect a CSV file that contains characters outside of your system locale — or outside of ASCII if you’re working with people around the world — to look right. Unless you’re on Excel 2016 and Office 365 — if you have the October 2016 update, you can read and write UTF-8 files. But if you’re using an older version of Excel, or you’re using a non-Office 365 license, tough luck.

So, to reiterate: CSV can mean a lot of things. And you can’t trust it to work well most of the time, unless you’re dealing with people in one country, all using the same locale settings and software. Which is pretty unlikely. TSV can work around most of the problems, but encodings are still troublesome.