Download those PDFs!

This post is strictly for web developers/server administrators. The rest of you can resume your daily activities, confident that nothing that was even remotely relevant to you transpired here.

PDFs. Web browsers. Both are a daily, or at least frequent, part of the lives of most computer users. But not all web browsers are created equal when it comes to the matter of handling PDFs. Some browsers (say, the ones developed by commercial OS makers) take the approach of trying to manage everything for the user. They include PDF readers that are embedded right into the browser, and PDFs load almost like another web page. Other browsers (most notably Firefox) treat PDFs as downloadable files, and when the user clicks a link to one, the file gets downloaded to their hard drive to be opened in a “helper app” — usually Adobe Reader.

Most website owners prefer the latter approach, and I suspect most users do as well. PDFs in general are not much like web pages, and it does not seem logical that they should be viewed within a web browser. Generally when people are accessing a PDF, it’s because they want to download the document to their computer to be used offline or to print. It is illogical for these actions to take place in a web browser. Sure, savvy users can right-click (or on Mac, Ctrl-click) and select “Save linked file as…” or some such nonsense from the contextual menu. But a lot of Windows users don’t even know their mouse has a right button, a lot of Mac users have no idea that you can press keys and click the mouse button simultaneously to perform special actions, and a lot of both would be confused by this entire process.

So we come to the matter of web developers (such as myself) trying to find ways to force the web browser to download the file instead of loading it directly. A trick I have used often is to link not to the file directly, but to a special PHP script that reads the file into the server’s memory, changes various aspects of the response (such as the MIME type), and then streams the content out to the browser. This is especially useful when you want to restrict access to files, say only to logged-in users, or only to users who have entered a special passcode. The PHP script is perfect for that, because it allows you to execute some code before sending the file to the browser. It even lets you store your files in a directory on the server that web browsers cannot access directly, ensuring (more or less… this article isn’t about hacking) the security of your files.
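For illustration, here is a minimal sketch of that wrapper idea. It is written in Python rather than PHP, and it is not the actual script described above. The directory path, the function name, and the `user_is_logged_in` flag are all assumptions made up for the example. It runs the access check first, reads the file from a directory outside the web root, and returns the headers that tell the browser to download rather than display.

```python
import os

# Hypothetical directory outside the web root (an assumption, not from the post).
PROTECTED_DIR = "/var/private/pdfs"


def build_download_response(filename, user_is_logged_in, base_dir=PROTECTED_DIR):
    """Return (status, headers, body) for a forced file download."""
    # Run the access check before touching the file at all.
    if not user_is_logged_in:
        return "403 Forbidden", [("Content-Type", "text/plain")], b"Forbidden"

    # Keep only the final path component so "../" cannot escape base_dir.
    safe_name = os.path.basename(filename)
    path = os.path.join(base_dir, safe_name)
    if not safe_name or not os.path.isfile(path):
        return "404 Not Found", [("Content-Type", "text/plain")], b"Not found"

    # Read the whole file into memory, as the post describes.
    with open(path, "rb") as fh:
        body = fh.read()

    # Simplified: a real script should also escape quotes in the file name.
    headers = [
        ("Content-Type", "application/octet-stream"),
        ("Content-Disposition", 'attachment; filename="%s"' % safe_name),
        ("Content-Length", str(len(body))),
    ]
    return "200 OK", headers, body
```

A web framework or CGI script would then write the status, headers, and body out to the client. Reading the whole file into memory keeps the sketch short; for large files you would probably want to stream it in chunks instead.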

But what if your files aren’t in a protected area? What if you don’t want to bother with the mucky-muck of the PHP wrapper — you just want to link directly to the (browser-accessible) file, but you still want to force the download? Well, if you’re using Apache, you’re in luck. I found this great explanation of a small block of code you need to add to your httpd.conf file to achieve the same effect.

Ultimately, what you want to do is change the MIME type of the response. Browsers that are inclined to load PDFs internally perform this magic by seeking out files that are sent with the application/pdf MIME type. But there’s a very handy, “generic” MIME type for binary files, which all browsers treat as files to be downloaded rather than displayed directly: application/octet-stream. It may sound like a group of 8 singers standing in a small river, but it really just means… a generic binary file.

Here’s the complete block of code to put into your httpd.conf file, or into the appropriate virtual host configuration, however it’s stored on your particular server. I placed the code just within the virtual host configuration for the client in question, so the change applies only to their site, and not to any others that are also running on the server:

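# Match any file whose name ends in .pdf, case-insensitively
<FilesMatch "\.(?i:pdf)$">
    # Send the generic binary MIME type instead of application/pdf
    ForceType application/octet-stream
    # Tell the browser to save the file (the Header directive requires mod_headers)
    Header set Content-Disposition attachment
</FilesMatch>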

If you’re not the server admin, you should also be able to put this in an .htaccess file in your site’s root directory, provided the server’s AllowOverride setting permits FileInfo directives, but I haven’t actually tested that to see if it works.
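One way to check whether the configuration took effect, wherever you put it, is to request a PDF and look at the response headers. The sketch below is an illustration only: the function names are made up, and it assumes the server answers HEAD requests with the same headers it would send for a normal GET.

```python
import urllib.request


def fetch_headers(url):
    """Send a HEAD request and return the response headers as a dict."""
    request = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(request, timeout=10) as response:
        return dict(response.headers.items())


def forces_download(headers):
    """Return True if the headers match what the FilesMatch block above sets."""
    # HTTP header names are case-insensitive, so normalize them first.
    normalized = {name.lower(): value for name, value in headers.items()}
    # Ignore any parameters after a semicolon (charset, filename, and so on).
    content_type = normalized.get("content-type", "").split(";")[0].strip().lower()
    disposition = normalized.get("content-disposition", "").split(";")[0].strip().lower()
    return content_type == "application/octet-stream" and disposition == "attachment"
```

You would call it as `forces_download(fetch_headers(url))`, where `url` points at a PDF on your own site. If it returns False, the directives are probably not being applied to that file.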

Barack Obama: the open-source candidate?

Now we’re speaking close to my heart. Granted, I’m a freeloader in the open source world. I have yet to contribute a single line of code to an open source project. (OK, I guess that’s not entirely true: I did write a WordPress plugin. Sweet. I’m in the club! Sort of.) But I have wholly embraced open source software in my work. PHP FTW, baby! (Uh yeah… whatever.)

These days the only thing I’m a more enthusiastic and outspoken proponent of than open source software is Barack Obama. So I’m surprised it took me so long to research what he’s running his website on.

Linux PWS/1.3.28

*Whew* Glad to see it’s Linux. But what the heck is PWS? I was at a loss. Then I found this blog talking about the very same issue. And suddenly it made sense why I didn’t recognize the acronym. I never would have considered Microsoft’s Personal Web Server to be the web server of choice running on a Linux server. I am still scratching my head at it. The whole VM thing seems the only logical explanation, except that there’s no logic to explain it. At least it’s not so transparently ass-backwards as John McCain’s configuration:

Linux Microsoft-IIS/6.0

And the inexplicable:

Linux ECAcc (lhr)

Interestingly, though, a Google search for “ECAcc (lhr)” reveals a link to a Digg post entitled John McCain Owns VoteForTheMILF.com. Stay classy, San Diego.

Let’s be clear: I think the idea of running a web site under Windows in a virtual machine on a Linux box is the most incomprehensible, mind-bogglingly stupid arrangement you’d ever bother with. I’d have to guess that the sites were developed to run in a Windows environment. Then, when it came time to deal with practical server and network capacity issues, load balancing and whatnot, some sysadmin made the (probably prudent) decision to load balance on Linux boxes instead of Windows. But since the site was tied to some feeble Windows technology, they couldn’t just move it over to Linux wholesale.

But let’s take this a step further. Back in late spring I received an email from Barack Obama’s IT director soliciting applicants for web developer positions with the campaign. Even though the job was in Boston, I figured it would be insane not to apply, so I submitted my resume. (I never heard back, for what it’s worth.) And it’s from this that I happen to know that the campaign was specifically seeking PHP developers. Rock on.

With that in mind, the whole Windows-on-Linux-through-VM arrangement made even less sense. Why would they develop the site in PHP, run it on a Windows server (definitely not the optimal arrangement for a PHP-based app, though it certainly will work), and then VM that Windows environment on a Linux box, instead of just gearing the PHP app for a Linux server in the first place? And that’s when I remembered that just earlier in the day I had been looking at taxcut.barackobama.com. Of course! Separate third-level domains are all over Obama’s site. Let’s check the configuration on that domain. Now that’s much better:

Linux Apache/1.3.34 (Debian) mod_gzip/1.3.26.1a AuthMySQL/4.3.9-2 PHP/5.2.0-8+etch10

And I think it explains a lot. Campaigns start off small. Obama had to register barackobama.com and put something up there ages ago, long before he was the Democratic nominee and the hugely successful fundraiser he became along the way. So that original site, www.barackobama.com, was probably developed on a Windows box in someone’s proverbial basement, probably when he was running for the U.S. Senate or maybe even the Illinois Senate. But as the campaign has grown, its websites (plural) have grown as well, and in a decidedly open-source direction. There’s some good stuff in there. Debian (which could mean Ubuntu, too… I haven’t checked the signature on Ubuntu’s Apache package to see if it’s split from its Debian roots), PHP (and a reasonably up-to-date version at that), MySQL, etc.
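If you want to pick one of these signature strings apart programmatically, something like the following would do it. This is a sketch under assumptions: it treats everything after the leading OS name as an HTTP Server response header made of name/version tokens and parenthesized comments, and it assumes the OS name ("Linux", "Windows Server 2008") comes from the lookup tool rather than the header, so strip it off first. The function name is made up for the example.

```python
import re


def parse_server_header(value):
    """Split a Server header into (name, version) products and comments."""
    products = []
    comments = []
    # Match either a parenthesized comment or a run of non-space characters.
    for comment, token in re.findall(r"\(([^)]*)\)|(\S+)", value):
        if token:
            # "Apache/1.3.34" -> ("Apache", "1.3.34"); a bare name has no version.
            name, _, version = token.partition("/")
            products.append((name, version or None))
        elif comment:
            comments.append(comment)
    return products, comments
```

Feeding it the Apache line above gives four products (Apache, mod_gzip, AuthMySQL, PHP) with their versions, plus the single comment "Debian".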

It’s fun to do this sort of research, as long as you don’t mind being distracted along the way, because there are plenty of weird sources of distraction.

Aside from the aforementioned MILF site (classy), and the somewhat interesting fact that searching on “PWS/1.3.28” brings back as its first result a reference to Obama’s hosting, I discovered that for some reason the page title on John McCain’s official store is “Independent Online Stores.” OK. No one looks at title bars. And even fewer web developers look at <title> tags. I know that from experience. But of course that’s just a transitional landing page, announcing that McCain wares are not actually sold by the campaign, but by independent, for-profit companies, and buying these items doesn’t translate into money going back into the campaign. Huh. I can’t quite wrap my brain around that, but I’m a lifelong, union-loving Democrat, so I guess I wasn’t meant to. The only thing that comes to mind is that maybe it has to be that way, legally, now that he’s accepted public campaign financing. Anyway, the first McCain store link I found, which as they state is apparently an independent operation not affiliated with the campaign, is, not surprisingly, running:

Windows Server 2008 Microsoft-IIS/7.0

I also found that the company that hosts some of Obama’s pages also hosts a site for the American Model Yachting Association. Really? Model yachting? That exists?