Microsoft Word’s formatting garbage, quantified

Anyone who’s spent any amount of time working on the web dreads it: content delivered in Microsoft Word format. Word adds tons of formatting garbage that results in bloated files and messes up the presentation when content gets brought into HTML.

When Microsoft released Office 2007, they touted switching to an XML-based document format for all of the apps. But not all XML is created equal.

Case in point: I am currently working on a project that is going to involve receiving content for a number of web pages in a tabular form, either in Word or Excel format. A spreadsheet, essentially (if not technically), with each page represented by a row, and its text content in a cell. I will be writing a PHP script to parse the spreadsheet data and generate a set of HTML files with the content loaded in them.

I’m currently trying to determine if Word or Excel would be the better format to receive the content in, which involves opening up .xlsx and .docx files in BBEdit and looking at the raw data stored within them. I’ve managed to identify the embedded XML files in each that hold the actual content. These files store the same actual text content, but their XML schemas vary based on the needs of Word and Excel.
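If you’d rather not dig through the raw data in a text editor, the same poking around can be sketched in a few lines of Python (not the PHP script I’ll eventually write, just a quick way to look). It leans on the fact that .docx and .xlsx files are ordinary ZIP archives. The function names are made up for illustration, and “the largest XML part is the content” is an assumption that happens to hold for simple files, not a rule.

```python
import zipfile


def xml_part_sizes(source):
    """Return {member name: uncompressed size in bytes} for each XML part.

    `source` is a path or a binary file object for a .docx or .xlsx file,
    both of which are ordinary ZIP archives.
    """
    with zipfile.ZipFile(source) as archive:
        return {
            info.filename: info.file_size
            for info in archive.infolist()
            if info.filename.endswith(".xml")
        }


def largest_part(source):
    """Return (name, size) of the biggest XML part (assumed to be the content)."""
    sizes = xml_part_sizes(source)
    name = max(sizes, key=sizes.get)
    return name, sizes[name]
```

Calling `largest_part("pages.docx")` and `largest_part("pages.xlsx")` (hypothetical file names) gives the two numbers to compare. In a typical Word file the body text sits in `word/document.xml`; in Excel the cell text usually lives in `xl/sharedStrings.xml`.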

So… how do they match up? The XML file I pulled out of Excel is 14 KB. The one from Word is 202 KB. For the mathematically inclined amongst you, that’s a little more than 14 times larger. Yes… another (perhaps more hyperbolic) way you could say it is that the Word document is exponentially larger.

That’s just ridiculous.

What makes up the difference? Well, the Excel file’s XML is nothing but basic tags. There are no attributes on any of the tags, as far as I can tell. It’s pure semantic structure. The Word XML, on the other hand, is almost nothing but attributes. And there’s nothing smart about them either. Most of them are assigning fonts to the text. The same font names, over and over and over again throughout the file.

That’s… beyond ridiculous.
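To put a number on “almost nothing but attributes,” here’s a rough Python sketch that counts elements, attributes, and repeated attribute values in a chunk of XML. The function name is my own invention, and the approach is an assumption about what’s worth measuring, not a formal analysis of either schema.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def attribute_stats(xml_text):
    """Count elements, attributes, and repeated attribute values in XML text.

    Returns (element count, attribute count, Counter of attribute values).
    Namespace declarations (xmlns:...) are not counted as attributes.
    """
    root = ET.fromstring(xml_text)
    elements = 0
    attributes = 0
    values = Counter()
    for element in root.iter():
        elements += 1
        attributes += len(element.attrib)
        values.update(element.attrib.values())
    return elements, attributes, values
```

Feed it the XML pulled out of each file and look at `values.most_common(5)`: if the same font name tops the list with a count in the hundreds, that’s the bloat.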

OK, I’m a geek, but this was totally cool! (In a geeky way…)

Ever since I read Word Freak, an exposé on the world of competitive Scrabble by Wall Street Journal sports reporter Stefan Fatsis [wow, such a convoluted sentence, simply to avoid having to write “Fatsis’s”], I’ve been obsessed with improving my Scrabble game. (Excuse me, my SCRABBLE® Brand Crossword Game… er… game.)

For a couple of nights, SLP was into it (why does that sound dirty?), but she just couldn’t match my endurance (again… why?). So I had to resort to playing against the computer. I’ve been playing tournament style to boost my (unofficial) rating, playing mostly against the 1220-rated “Veteran” (whom I beat about 2/3 of the time) and the 1400-rated “Smart” (to whom I lose about 2/3 of the time).

Tonight my rating finally topped the 1300 mark (against Smart, no less), and I celebrated by playing one more game against Veteran. And therein came my greatest moment.

Veteran had set up the triple word column on the right edge with LAVE, and my draw that turn included the Q, the Z, and both blanks! I stared at the board for a moment before realizing I had a most unusual play (if it was actually a word). And so it was that I laid down QUIZZER through the E in LAVE, with the Q on the triple word and the (real) Z on the double letter, using the blanks for the U and the other Z, giving me 99 points. If only I’d had an S on my rack, I could have hit the other triple word as well, for a triple-triple-bingo, worth 356 points! (That’s among the highest possible scores for a single word, even though, as a fairly pedestrian word, it doesn’t carry as much cachet among Scrabblers as words like QUIXOTIC or MEZQUITE.)
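For the skeptics, here’s the arithmetic as a small Python sketch. It uses the standard English tile values, treats blanks as zero, and assumes the play forms no other cross-words; the tuple layout and function name are just my own shorthand.

```python
# Standard English-language Scrabble tile values for the letters used here.
TILE_VALUES = {"Q": 10, "U": 1, "I": 1, "Z": 10, "E": 1, "R": 1, "S": 1}


def score_play(tiles, bingo=False):
    """Score one word.

    `tiles` is a list of (letter, is_blank, letter_multiplier, word_multiplier)
    tuples, one per square of the word. Blanks score zero. Tiles that were
    already on the board get multipliers of 1.
    """
    total = 0
    word_multiplier = 1
    for letter, is_blank, letter_mult, word_mult in tiles:
        value = 0 if is_blank else TILE_VALUES[letter]
        total += value * letter_mult
        word_multiplier *= word_mult
    # The 50-point bonus applies only when all seven rack tiles are played.
    return total * word_multiplier + (50 if bingo else 0)


QUIZZER = [
    ("Q", False, 1, 3),  # real Q on the triple word
    ("U", True, 1, 1),   # blank
    ("I", False, 1, 1),
    ("Z", False, 2, 1),  # real Z on the double letter
    ("Z", True, 1, 1),   # blank
    ("E", False, 1, 1),  # the E already on the board in LAVE
    ("R", False, 1, 1),
]
# The S I didn't have, landing on the second triple word.
QUIZZERS = QUIZZER + [("S", False, 1, 3)]

print(score_play(QUIZZER))               # 99
print(score_play(QUIZZERS, bingo=True))  # 356
```

QUIZZER only used six tiles from my rack, so no bingo bonus: 33 points tripled is 99. With the S it would have been 34 points times nine, plus 50 for the bingo.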

I still ended up winning the game with several other high-scoring moves, including a ballsy (if anything Scrabble-related is ever even remotely “ballsy”) multi-turn set-up that allowed me to play EXITS on a triple word for 49 points, after having already milked the X for all it was worth. I almost screwed it up though. I had played LURID early on, to which I later added the double XI. Already I was planning EXITS, but I was missing the T. So to try to build it up even more, I played RIDE while waiting for the T, but… d’oh! I really should have just played RID, because I needed that E! Naturally the T landed on my rack in the next turn, but we were getting close to the end of the game, and I didn’t have an E! Without counting, I assumed they were all on the board already, but I got one in my second-to-last draw, and EXITS appeared! On the final turn, I was left with AINRT on my rack, which fit nicely in the same area to turn ER into TRAINER for the victory! 447 may be my highest single game score ever.

Yeah, I’m a geek. But I represent! (Saying that makes me even more of a geek, doesn’t it?)

Wow, looking back at that screenshot, I’m even more impressed with myself (if that’s possible). Tournament play uses a clock, just like chess, with each player limited to 25 minutes total (going over the time limit carries a steep penalty at the end of the game). When I first started playing computer Scrabble a couple months back, I’d usually use up almost all of my 25 minutes, but in this game my clock read 17:34 at the end, meaning I had only used 7 minutes and 26 seconds for the entire game! Of course, as usual, Veteran only used 15 seconds. One time I think the computer only used four seconds for the whole game. I think the computer needs a handicap on the tournament clock: the player gets 25 minutes and the computer gets 25 seconds. Yeah, that sounds fair.

I’ll stop now. If I go on, I may just have to beat myself up.

Commander Mark!

Wow, here’s a blast from the past. And perhaps a disturbing look into the bizarre and meandering train of thought my brain often follows.

Sometimes I will get a word or name stuck in my head, the way many people (myself included) often get songs stuck in their heads. For instance, once a few years ago I had the name Frau Farbissina on a mental audio loop.

Anyway, today the word in my head was “foreshortening.” And whenever I think of that word, I think of the place where I first heard it… from a crazy PBS drawing show host named Commander Mark!

Back in the day, I watched his original show, The Secret City.

Catalog of Annoying Grammatical and Spelling Errors

Originally posted July 12, 2006

First off, let me acknowledge that my English ain’t perfect. (Get it?) That said, it’s pretty damn good. And when I make grammatical errors, it’s usually on purpose and I’m aware that I’m using something incorrectly. In those cases, I’m only doing it because I don’t really care and it’s not something I’d label as an egregious mistake. (I don’t make spelling errors, period! Well, OK… maybe once.) I will also acknowledge that English is not a fixed language, and that the rules of its use are arbitrary and subject to permutation. (And, of course, I’m sure anyone outside the U.S. who’s reading this will find the title itself to be unacceptable. Too bad! I’m an American! I get to be an arrogant jerk at least once in a while!)

With all of those qualifiers and disclaimers out of the way, let’s get down to business. There’s a difference, though, between novel usage (and I’ll even let 1337 pass in that context) and just plain boneheaded errors, and the latter is what I’m dealing with here. This page will be updated periodically as I encounter (or remember) errors of speech or (more commonly) writing that I simply find intolerable. (Split infinitives and dangling participles are OK. And so is beginning a sentence with a conjunction.)

These errors fall into three distinct categories: spelling, word usage and grammar. (We’ll skip the matter of whether to use a comma before “and” in a list, as I’m at a point of transition on this matter personally.) OK, maybe they’re not really so distinct. But that’s how I’m slicin’ ’em up anyway.

Spelling

I will not bother to offer the correct spellings of these words. Look ’em up!

  • comming
  • definately

Word Usage

This is a bit of a nebulous category, as sometimes it’s hard to tell whether what you’re dealing with is a spelling error or a grammatical error. In fact it’s a mixture of both, because often it involves spelling a word incorrectly, but in a way that happens to be another legitimate word; it’s just one that’s incorrect for the context.

Apostrophes in plurals
Granted, I am a super-genius, but I got this rule back when I first learned it in elementary school. Is it really that hard to tell the difference between a plural and a possessive? Apparently so. Of course, we also have the confounding situation of “its”/“it’s,” where the possessive does not contain an apostrophe. But then again, the one with the apostrophe is a contraction, not a plural.
“Alot” vs. “a lot”
Have too many things to count? That sounds like “a lot” to me. You’d better “allot” plenty of time for the task.
“Everyday” vs. “every day”
This is an “everyday” mistake. In fact, I encounter it almost “every day.” Since that one’s still a bit opaque, I’ll suspend the witticisms. “Everyday” is an adjective. You can’t do something “everyday,” unless “everyday” is describing the something and not when you’re doing it.
“Formally” vs. “formerly”
Both are legitimate and useful words. However, despite similar (or, depending on your accent, identical) pronunciation, they mean two completely different things. Yet I am amazed at how often I see “formally” used in cases where the intended meaning is clearly “formerly.” I have yet to see the reverse mistake.
“Of” in place of “have”
I know you generally don’t really get into dissecting parts of speech until junior high school, long after most Americans have completely tuned out, but just think for a minute. “Have” is a verb. “Of” is a preposition. You can “think of” something, but you can’t “must of” or “could of” something!
“To” instead of “too”
*Sigh* Do we really even need to get into this? I’ll admit this is an easy one to slip into, as I often do it myself if I’m not paying enough attention. (Therefore, if even I am susceptible to the error, it must be more excusable.) Just remember, there are “too” many o’s in “too.” (Yes, there also happen to be “two” o’s in “too” whereas there is only one in “two.” But… ah, forget it.)

Grammar

I don’t have any for this section yet, but I’m sure I’ll think of some any minute now…

More to come! Be sure to use the comment form to suggest any I’ve forgotten!