<?xml version="1.0" encoding="utf-8"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/"><channel><title>Posts about Excel</title><link>https://chriswarrick.com/</link><atom:link href="https://chriswarrick.com/blog/tags/excel.xml" rel="self" type="application/rss+xml" /><description>A rarely updated blog, mostly about programming.</description><lastBuildDate>Fri, 07 Apr 2017 18:00:00 GMT</lastBuildDate><generator>https://github.com/Kwpolska/YetAnotherBlogGenerator</generator><item><title>CSV is not a standard</title><dc:creator>Chris Warrick</dc:creator><link>https://chriswarrick.com/blog/2017/04/07/csv-is-not-a-standard/</link><pubDate>Fri, 07 Apr 2017 18:00:00 GMT</pubDate><guid>https://chriswarrick.com/blog/2017/04/07/csv-is-not-a-standard/</guid><description>
CSV is not a standard. What does that really mean for anyone using that format?
The file’s recipient may be unable to read it the way you intended. Separators,
decimal marks, escaping and encodings are all problems — and Excel does them
all pretty badly.
</description><content:encoded><![CDATA[
<p>CSV is not a standard. What does that really mean for anyone using that format?
The file’s recipient may be unable to read it the way you intended. Separators,
decimal marks, escaping and encodings are all problems — and Excel does them
all pretty badly.</p>



<p>First, some people might claim that <a class="reference external" href="http://www.ietf.org/rfc/rfc4180.txt">RFC 4180</a> is the CSV standard. Those
people have not read the document they’re referring to. It states:</p>
<blockquote>
<p>This memo provides information for the Internet community.  It does
not specify an Internet standard of any kind.</p>
</blockquote>
<p>The trouble is that a <code class="docutils literal">.csv</code> extension tells you very little about what is
actually inside the file. There are a few problems. The first question is,</p>
<blockquote>
<p>What is the field separator? Is it a comma or a semicolon?</p>
</blockquote>
<p>Hey, wait a minute, doesn’t the file format/extension stand for
<em>comma-separated values</em>? Yes, it does. But that does not matter in the
slightest. You see, Microsoft Excel — which most people will use to read/write
their CSV files — makes this decision based on the user locale settings. If the
OS is set to a locale where the comma is the <a class="reference external" href="https://en.wikipedia.org/wiki/Decimal_mark#Hindu.E2.80.93Arabic_numeral_system">decimal mark</a> (e.g. most of
Europe), the list separator is set to <code class="docutils literal">;</code> instead of <code class="docutils literal">,</code> — and Excel uses
that.</p>
<p>Of course, there’s also the TSV data format — those are tab-separated values.
And some people might name their TSV files <code class="docutils literal">.csv</code>.</p>
<p>To read files saved in a different locale, or with a different separator, Excel
users need to change the file extension to <code class="docutils literal">.txt</code>, or go to Data → Get
External Data → From Text <a class="reference external" href="https://support.office.com/en-us/article/Text-Import-Wizard-c5b02af6-fda1-4440-899f-f78bafe41857">(documentation)</a> and use the import wizard.
Simply double-clicking the file does not work in that case.</p>
<p>On a side note, Apple Numbers guesses the format — one of the few things it
gets right. LibreOffice always asks the user to pick import settings, but by
default it uses tab AND comma AND semicolon for CSV files, which brings its own
host of problems.</p>
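<p>Guessing the separator can be sketched in a few lines of Python with the standard library’s <code class="docutils literal">csv.Sniffer</code>. This is only an illustration of the idea: the <code class="docutils literal">guess_delimiter</code> helper, its candidate list and the sample data are my assumptions, and nothing here claims to match what Numbers or any other program really does.</p>

```python
import csv


def guess_delimiter(sample, candidates=";,\t"):
    """Guess the field separator of a CSV sample (hypothetical helper)."""
    try:
        # Sniffer looks for a candidate character that occurs
        # the same number of times on every line of the sample.
        return csv.Sniffer().sniff(sample, delimiters=candidates).delimiter
    except csv.Error:
        # The heuristic could not decide; fall back to a comma (an assumption).
        return ","


# Made-up sample in the "European" style: semicolon separator, decimal comma.
semicolon_sample = "name;price;qty\nfoo;1,50;3\nbar;2,25;4\n"
comma_sample = "a,b,c\n1,2,3\n"

print(repr(guess_delimiter(semicolon_sample)))
print(repr(guess_delimiter(comma_sample)))
```

<p>A guess like this is a heuristic and can be wrong on short or irregular files, so it is safer to let the user confirm or override the result.</p>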
<p>Here’s a quick test:</p>
<blockquote>
<p>What does <code class="docutils literal">foo;bar,baz;quux</code> mean? What about <code class="docutils literal">foo,bar;baz,quux</code>?</p>
</blockquote>
<ul class="simple">
<li><p>LibreOffice assumes it’s (Chinese) UTF-16 text, but after telling it the real encoding, both
files contain <strong>4 columns</strong>.</p></li>
<li><p>Microsoft Excel says one of the files contains <strong>3 columns</strong> and the other contains <strong>2 columns</strong>
(which is which depends on the locale).</p></li>
<li><p>Apple Numbers says the first file contains <strong>3 columns</strong> and the other
contains <strong>2 columns</strong> if set to English, and both files contain <strong>3
columns</strong> if set to Polish.</p></li>
</ul>
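<p>The column counts above can be reproduced with Python’s <code class="docutils literal">csv</code> module by choosing the separator explicitly. The helper names are hypothetical, and the “any separator” variant is only my approximation of the LibreOffice default described earlier.</p>

```python
import csv
import re


def split_with(line, delimiter):
    """Parse one line with a single, explicitly chosen delimiter."""
    return next(csv.reader([line], delimiter=delimiter))


def split_with_any(line):
    """Treat tab, comma and semicolon all as separators (hypothetical helper)."""
    return re.split(r"[;,\t]", line)


first = "foo;bar,baz;quux"
second = "foo,bar;baz,quux"

for line in (first, second):
    print(line)
    print("  semicolon:", len(split_with(line, ";")), "columns")
    print("  comma:    ", len(split_with(line, ",")), "columns")
    print("  any:      ", len(split_with_any(line)), "columns")
```

<p>The same bytes give 3, 2 or 4 columns depending only on which separator the reader assumes.</p>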
<p>But let’s get back to gotchas:</p>
<blockquote>
<p>What is the decimal mark? Is it a dot or a comma?</p>
</blockquote>
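<p>That’s a direct consequence of the previous question. However, one can’t simply
assume that the field separator and decimal mark always come in the pairs
<code class="docutils literal">comma/dot</code> and <code class="docutils literal">semicolon/comma</code>, because users might do crazy
stuff, such as changing one of the two settings but not the other.</p>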
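<p>A minimal sketch of why the decimal mark has to be known in advance: the same text parses to two very different numbers. The <code class="docutils literal">parse_number</code> helper and its parameters are hypothetical, and the sample value is made up.</p>

```python
from decimal import Decimal


def parse_number(text, decimal_mark=".", group_mark=""):
    """Parse a number whose decimal and grouping marks are given explicitly.

    Hypothetical helper: the caller must already know both marks.
    """
    cleaned = text.strip()
    if group_mark:
        # Drop thousands separators before converting.
        cleaned = cleaned.replace(group_mark, "")
    # Decimal only understands a dot as the decimal mark.
    return Decimal(cleaned.replace(decimal_mark, "."))


# Read with a decimal comma: a bit more than one.
as_decimal_comma = parse_number("1,234", decimal_mark=",")
# Read with a decimal dot and a comma as thousands separator: over a thousand.
as_grouping_comma = parse_number("1,234", decimal_mark=".", group_mark=",")

print(as_decimal_comma, as_grouping_comma)
```

<p>Nothing in the file itself says which reading is right, so the convention has to be agreed on outside of the file.</p>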
<blockquote>
<p>What is used to escape fields containing the field separator? Quotes?
Backslashes? What is used to escape the escape character?</p>
</blockquote>
<p>Excel, for example, puts some things in <code class="docutils literal">&quot;quotes&quot;</code>. If a literal quote
character appears in the spreadsheet, it’s represented as <code class="docutils literal">&quot;&quot;</code>, and
the entire cell is quoted as well. But there might be programs that use
backslashes for escapes, or even bad code that does not escape anything at
all, with tragic results.</p>
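<p>Python’s <code class="docutils literal">csv</code> module follows the same quote-doubling convention by default, which makes it easy to see what goes wrong when code splits on the separator without handling quotes. The <code class="docutils literal">to_csv_line</code> helper and the sample row are made up for illustration.</p>

```python
import csv
import io


def to_csv_line(row, delimiter=","):
    """Serialize one row using the csv module's default quoting rules."""
    buffer = io.StringIO()
    csv.writer(buffer, delimiter=delimiter).writerow(row)
    return buffer.getvalue()


row = ['say "hi"', "a,b", "plain"]
line = to_csv_line(row)
print(line, end="")

# A quote-aware reader recovers the original three fields.
parsed = next(csv.reader(io.StringIO(line)))

# Naive splitting ignores the quotes and cuts the second field in half.
naive = line.rstrip("\r\n").split(",")

print(parsed)
print(naive)
```

<p>The quote-aware reader returns the original row, while the naive split produces four broken fragments, which is the kind of tragic result that unescaped or mis-parsed data leads to.</p>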
<p>There’s still one more thing to cover: encodings. You see, even though the TSV
format effectively solves the issues I named before, both CSV and TSV suffer
from one problem:</p>
<blockquote>
<p>Which encoding to use when reading this file?</p>
</blockquote>
<p>I already mentioned that LibreOffice believed my sample file was UTF-16,
containing Chinese text — in reality, this file was UTF-8 (or ASCII).</p>
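<p>Here is a rough illustration of how plain ASCII can turn into “Chinese” text. It decodes the sample line as little-endian UTF-16, which is my assumption about the kind of misdetection involved, not a description of LibreOffice’s actual logic.</p>

```python
raw = "foo;bar,baz;quux".encode("ascii")

# Correct reading: every byte is one character.
as_utf8 = raw.decode("utf-8")

# Wrong reading: every pair of bytes becomes one 16-bit code unit.
as_utf16 = raw.decode("utf-16-le")

print(len(as_utf8), "characters when read as UTF-8")
print(len(as_utf16), "characters when read as UTF-16")

# Show the resulting code points; most of them land in CJK blocks.
print([hex(ord(ch)) for ch in as_utf16])
```

<p>Two ASCII bytes such as <code class="docutils literal">f</code> and <code class="docutils literal">o</code> combine into a single code point in the CJK range, so the separators disappear entirely until the right encoding is selected.</p>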
<p>What does Microsoft Excel do then? It looks like it follows <em>System locale for
non-Unicode programs</em>. While there is an encoding option hidden in the Save
dialog, it does not seem to affect the output. So what does that mean? You
can’t expect a CSV file that contains characters outside of your system locale
— or outside of ASCII if you’re working with people around the world — to look
right. Unless you’re on <a class="reference external" href="https://answers.microsoft.com/en-us/msoffice/forum/msoffice_install-mso_win10/announcing-october-feature-update-for-office-2016/927eea90-eea3-479a-a78a-45f7612460e1">Excel 2016</a> and Office 365 — if you have the October
2016 update, you can read and write UTF-8 files. But if you’re using an older
version of Excel, or you’re using a non-Office 365 license, tough luck.</p>
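<p>The effect of reading a UTF-8 file with a legacy system code page can be sketched as follows. Windows-1252 is just one example code page picked for illustration. The BOM variant at the end is a commonly used hint for marking a file as UTF-8; whether a given Excel version honors it is an assumption to verify, not something stated above.</p>

```python
text = "café"

# What a UTF-8 aware program writes to disk.
utf8_bytes = text.encode("utf-8")

# What a program assuming the Windows-1252 code page shows for those bytes.
misread = utf8_bytes.decode("cp1252")

# The same text with a UTF-8 byte order mark prepended.
with_bom = text.encode("utf-8-sig")

print(misread)
print(with_bom[:3])
print(with_bom.decode("utf-8-sig"))
```

<p>Pure ASCII survives this mismatch, which is why the problem often goes unnoticed until the first accented character shows up.</p>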
<p>So, to reiterate: CSV can mean a lot of things. And you can’t trust it to work
well most of the time, unless you’re dealing with people in one country, all
using the same locale settings and software. Which is pretty unlikely. TSV
can work around most of the problems, but encodings are still troublesome.</p>
]]></content:encoded><category>Programming</category><category>CSV</category><category>Excel</category><category>Microsoft</category><category>Microsoft Office</category></item></channel></rss>