Microsoft Word’s formatting garbage, quantified

Anyone who’s spent any amount of time working on the web dreads it: content delivered in Microsoft Word format. Word adds tons of formatting garbage that results in bloated files and messes up the presentation when content gets brought into HTML.

When Microsoft released Office 2007, they touted switching to an XML-based document format for all of the apps. But all XML is not created equal.

Case in point: I am currently working on a project that is going to involve receiving content for a number of web pages in a tabular form, either in Word or Excel format. A spreadsheet, essentially (if not technically), with each page represented by a row, and its text content in a cell. I will be writing a PHP script to parse the spreadsheet data and generate a set of HTML files with the content loaded in them.

I’m currently trying to determine if Word or Excel would be the better format to receive the content in, which involves opening up .xlsx and .docx files in BBEdit and looking at the raw data stored within them. I’ve managed to identify the embedded XML files in each that hold the actual content. These files store the same actual text content, but their XML schemas vary based on the needs of Word and Excel.

So… how do they match up? The XML file I pulled out of Excel is 14 KB. The one from Word is 202 KB. For the mathematically inclined amongst you, that’s a little more than 14 times larger. Yes… another (perhaps more hyperbolic) way you could say it is that the Word document is exponentially larger.

That’s just ridiculous.

What makes up the difference? Well, the Excel file’s XML is nothing but basic tags. There are no attributes on any of the tags, as far as I can tell. It’s pure semantic structure. The Word XML, on the other hand, is almost nothing but attributes. And there’s nothing smart about them either. Most of them are assigning fonts to the text. The same font names, over and over and over again throughout the file.

That’s… beyond ridiculous.

When is a CSV not a CSV? When you’re downloading it in Safari

Here’s another post that’s basically a cry for help. I did find this forum thread on the topic, but not a solution.

The problem: when I download a CSV file in Safari, for some inexplicable reason, Safari appends a .xls (Microsoft Excel) extension to the filename.

Never mind that I don’t use Excel… I use Apple’s own spreadsheet software, Numbers, from the iWork suite. Never mind that I don’t even have Excel installed on my Mac. Why, why on Earth, would Safari append a .xls extension on a CSV file? It’s not an Excel file; it’s a CSV. Different format. Sure, Excel can open it. But, you know what? Numbers doesn’t open it properly when it has that stupid extension on it.

Take the exact same file, remove the .xls extension (leaving the .csv extension), and Numbers opens it just fine. Leave it the way Safari has it, and it’s a mess.

This is not the only annoyance I have with Safari’s handling of downloads. I also hate how it automatically expands “safe” files, placing the original .zip or .dmg file in the Trash. I don’t want to delete those files! But if I turn this option off, it also doesn’t open the files I want it to open automatically, like Amazon MP3 downloads.

But hands down, this CSV bug — yes, that’s right, I called it a bug — is my biggest source of frustration. Sure, it’s easy enough to remove the extension. But it shouldn’t be there in the first place!