I am currently in the midst of developing a website for a client (great client, by the way), featuring a WYSIWYG text editor (TinyMCE) within its CMS. At some point along the way, I looked at the source code of some of the HTML that the client had inserted into the database via TinyMCE. With dismay, I noticed the likes of the following amongst the “actual” content:
<p><!--[if gte mso 9]><xml> <w:WordDocument> <w:View>Normal</w:View> <w:Zoom>0</w:Zoom> <w:PunctuationKerning /> <w:ValidateAgainstSchemas /> <w:SaveIfXMLInvalid>false</w:SaveIfXMLInvalid> <w:IgnoreMixedContent>false</w:IgnoreMixedContent> <w:AlwaysShowPlaceholderText>false</w:AlwaysShowPlaceholderText> <w:Compatibility> <w:BreakWrappedTables /> <w:SnapToGridInCell /> <w:WrapTextWithPunct /> <w:UseAsianBreakRules /> <w:DontGrowAutofit /> </w:Compatibility> <w:BrowserLevel>MicrosoftInternetExplorer4</w:BrowserLevel> </w:WordDocument> </xml><![endif]--><!--[if gte mso 9]><xml> <w:LatentStyles DefLockedState="false" LatentStyleCount="156"> </w:LatentStyles> </xml><![endif]--> <!--[if gte mso 10]> <mce:style><! /* Style Definitions */ table.MsoNormalTable {mso-style-name:"Table Normal"; mso-tstyle-rowband-size:0; mso-tstyle-colband-size:0; mso-style-noshow:yes; mso-style-parent:""; mso-padding-alt:0in 5.4pt 0in 5.4pt; mso-para-margin:0in; mso-para-margin-bottom:.0001pt; mso-pagination:widow-orphan; font-size:10.0pt; font-family:"Times New Roman"; mso-ansi-language:#0400; mso-fareast-language:#0400; mso-bidi-language:#0400;} --> <!--[endif]--></p>
Ugh. For those keeping score at home, that would be more than a kilobyte of… well… absolutely nothing on a web page. Oh, except some browsers will turn it into a couple of extra, unwanted blank lines thanks to the paragraph tags wrapping it all.
The sad thing about all of this is that I am no longer even at a point where I look at that and say “What the hell is this?” I know exactly what the hell it is. It’s code generated by Microsoft Office for some inscrutable, presumably nefarious purpose. It seems to have no effect whatsoever on the presentation of the content on the page in any known browser or application that I’ve bothered to investigate. But whenever you copy-and-paste content out of a Microsoft Office application like Word and into a web-based text editor, or if you use Word’s “Save as Web Page” feature, the resulting HTML consists of significantly more of this bloat than of the content itself.
Aghast — or, at least, I would be aghast if I weren’t so numb to all of this now, after more than a decade of confronting it — I began crafting an email to the client, in my ongoing quest to reduce dependence upon Microsoft applications, one user at a time. But then I decided the client didn’t need my admonitions (although I don’t rule out the possibility that he’s reading this); the whole world does.
I was all set to copy and paste that draft email into this post. Unfortunately, although I had copied it to the clipboard, I didn’t paste it before going off and copy-pasting the Microsoft garbage code above, and I also didn’t save it as a draft in Mail. Once again Microsoft seems to have the upper hand. Balllllmerrrrrrr!!!!!
The upshot of all of it, though, was simply that Microsoft Word generates copious quantities of garbage HTML and includes it in what gets put into the clipboard when you copy-paste content from a Word document into other applications (such as a WYSIWYG text editor box in a CMS). Most of the time there’s no visible effect from this garbage code (other than the fact that it increases page load times slightly by virtue of being more data to download), but there’s no way to know for sure that it’s not going to break a page in some browser, either now or at some point in the future when browsers adhere more strictly to XHTML DOCTYPE specifications. Plus, it’s just pointless garbage Microsoft is making me look at when I view source. I object to it on principle.
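For what it's worth, here is a rough sketch of how a CMS might scrub this stuff out on the server before it ever reaches the database. To be clear, this is not what my project does and not anything TinyMCE provides; the function name `strip_word_markup` and both patterns are my own assumptions, based only on the shape of the sample above (conditional comments that open with `[if ... mso ...]` and close with `[endif]`). A regular expression is a blunt instrument for HTML, so a real sanitizer would probably be the safer choice.

```python
import re

# Assumed shape of the Word junk, taken from the sample above:
# "<!--[if gte mso 9]> ... <![endif]-->" or "... <!--[endif]-->".
MSO_CONDITIONAL = re.compile(
    r"<!--\[if [^\]]*\]>.*?<!(?:--)?\[endif\]-->",
    re.DOTALL | re.IGNORECASE,
)

# Paragraphs that contain nothing but whitespace or &nbsp; once the junk is gone.
EMPTY_PARAGRAPH = re.compile(r"<p>(?:\s|&nbsp;)*</p>", re.IGNORECASE)


def strip_word_markup(html: str) -> str:
    """Remove Office conditional comments and the empty paragraphs they leave behind."""
    cleaned = MSO_CONDITIONAL.sub("", html)
    cleaned = EMPTY_PARAGRAPH.sub("", cleaned)
    return cleaned.strip()


if __name__ == "__main__":
    # Hypothetical pasted content: a shortened version of the sample plus real text.
    pasted = (
        "<p><!--[if gte mso 9]><xml> <w:WordDocument> <w:View>Normal</w:View> "
        "</w:WordDocument> </xml><![endif]--></p>"
        "<p>Hello, world.</p>"
    )
    cleaned = strip_word_markup(pasted)
    print(cleaned)
    print(len(pasted) - len(cleaned), "characters of nothing removed")
```

Removing the now-empty paragraph as well takes care of the unwanted blank lines that some browsers render around the junk.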
Then again, who am I to challenge the Goliath of software?