Microsoft Word’s formatting garbage, quantified

Anyone who’s spent any amount of time working on the web dreads it: content delivered in Microsoft Word format. Word adds tons of formatting garbage that results in bloated files and messes up the presentation when content gets brought into HTML.

When Microsoft released Office 2007, they touted switching to an XML-based document format for all of the apps. But all XML is not created equal.

Case in point: I am currently working on a project that is going to involve receiving content for a number of web pages in a tabular form, either in Word or Excel format. A spreadsheet, essentially (if not technically), with each page represented by a row, and its text content in a cell. I will be writing a PHP script to parse the spreadsheet data and generate a set of HTML files with the content loaded in them.

I’m currently trying to determine if Word or Excel would be the better format to receive the content in, which involves opening up .xlsx and .docx files in BBEdit and looking at the raw data stored within them. I’ve managed to identify the embedded XML files in each that hold the actual content. These files store the same actual text content, but their XML schemas vary based on the needs of Word and Excel.

So… how do they match up? The XML file I pulled out of Excel is 14 KB. The one from Word is 202 KB. For the mathematically inclined amongst you, that’s a little more than 14 times larger. Yes… another (perhaps more hyperbolic) way you could say it is that the Word document is exponentially larger.

That’s just ridiculous.

What makes up the difference? Well, the Excel file’s XML is nothing but basic tags. There are no attributes on any of the tags, as far as I can tell. It’s pure semantic structure. The Word XML, on the other hand, is almost nothing but attributes. And there’s nothing smart about them either. Most of them are assigning fonts to the text. The same font names, over and over and over again throughout the file.

That’s… beyond ridiculous.