Getting Google to remove fake hack URLs from its indexes for your site

As a web developer/systems admin, dealing with a hacked site is one of the most annoying parts of the job. Partly that’s on principle… you just shouldn’t have to waste your time on it. But also because it can just be incredibly frustrating to track down and squash every vector of attack.

Google adds another layer of frustration when they start labeling your site with a “This site may be hacked” warning.

A lot of times, this is happening because the hack invented new URLs under your domain that Google indexed, and for various reasons, Google may not remove these pages from its index after it crawls your thoroughly-cleaned site, even though those URLs are no longer there and are not in your sitemap.xml file. This issue may be exacerbated by the way your site handles redirecting users when they request a non-existent URL. Be sure your site is returning a 404 error in those cases… but even a 404 error may not be enough to deter Google from keeping a URL indexed, because the 404 might be temporary.

410 Gone

Enter the 410 Gone HTTP status. It differs from 404 in one key way. 404 says, “What you’re looking for isn’t here.” 410 says, “What you’re looking for isn’t here and never will be again, so stop trying!”

A quick way to find (some of) the pages on your site that Google has indexed is to head on over to Google (uh, yeah, like you need me to provide a link) and just do an empty site search, like this:

site:blog.room34.com

Look for anything that doesn’t belong. And if you find some things, make note of their URLs.

A better way of doing this is using Google Search Console. If you run a website, you really need to set yourself up on Google Search Console. Just go do it now. I’ll wait.

OK, welcome back.

Google Search Console lets you see URLs that it has indexed. It also provides helpful notifications, so if Google finds your site has been hacked, it will let you know, and even provide you with (some of) the affected URLs.

Now, look for patterns in those URLs.

Why look for patterns? To make the next step easier. You’re going to edit your site’s .htaccess file (assuming you’re using Apache, anyway… sorry I’m not 1337 enough for nginx), and set up rewrite rules to return a 410 status for these nasty, nasty URLs. And you don’t want to create a rule for every URL if you can avoid it.
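If the list of URLs is long, one rough way to spot a pattern is to tally the first query-string key of each URL. Here is a minimal Python sketch of that idea. The URLs are made-up placeholders on example.com, and the function names are hypothetical, not part of Google Search Console or any other tool mentioned here.

```python
from collections import Counter
from urllib.parse import urlsplit


def first_query_key(url):
    """Return the first key in the URL's query string, or '' if there is none."""
    query = urlsplit(url).query
    first_pair = query.split("&", 1)[0]
    return first_pair.split("=", 1)[0]


def tally_first_keys(urls):
    """Count how often each first query-string key appears."""
    return Counter(first_query_key(url) for url in urls)


# Hypothetical sample data; paste in the URLs you actually collected.
URLS = [
    "https://example.com/?1a3=cheap-watches",
    "https://example.com/?SPID=981",
    "https://example.com/?SPID=17",
    "https://example.com/about/",
    "https://example.com/?0ff=pills",
]

for key, count in tally_first_keys(URLS).most_common():
    print(f"{key or '(no query string)'}: {count}")
```

Keys that show up again and again, or that share an obvious shape, are good candidates for a single rewrite rule.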

When I had to deal with this recently, the pattern I noticed was that the affected URLs all had a query string, and each query string started with a key that was one of two things: either a 3-digit hexadecimal number, or the string SPID. With that observation in hand, I was able to construct the following code to insert into the .htaccess file:

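# Force remove hacked URLs from Google
# Query string starts with a key that is a 3-digit (lowercase) hex number
RewriteCond %{QUERY_STRING} ^([0-9a-f]{3})=
RewriteRule (.*) - [L,R=410]
# Query string starts with the key SPID
RewriteCond %{QUERY_STRING} ^SPID=
RewriteRule (.*) - [L,R=410]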

Astute observers (such as me, right now, looking back on my own handiwork from two months ago) may notice that these could possibly be combined into one. I think that’s true, but I also seem to recall that regular expressions work a bit differently in this context than I am accustomed to, so I kept it simple by… um… keeping it more complicated.

The first RewriteCond matches any query string that begins with a key consisting of a 3-digit hex number. The second matches any query string that begins with a key of SPID. Either way, the response is a 410 Gone status, and no content.
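Before deploying patterns like these, it can help to try them against a few query strings offline. The sketch below does that in Python, on the assumption that Python's re module treats these simple patterns the same way Apache's regex engine does. The sample query strings are invented, and COMBINED is a hypothetical single-pattern version using alternation, not something taken from the rules above.

```python
import re

# The two patterns used in the .htaccess rules.
HEX_KEY = re.compile(r"^([0-9a-f]{3})=")
SPID_KEY = re.compile(r"^SPID=")

# Hypothetical combined pattern: either kind of key, via alternation.
COMBINED = re.compile(r"^([0-9a-f]{3}|SPID)=")


def is_hack_query(query_string):
    """True if either of the two original patterns matches."""
    return bool(HEX_KEY.search(query_string) or SPID_KEY.search(query_string))


# Made-up sample query strings, not taken from a real site.
SAMPLES = ["1a3=foo", "SPID=42", "p=123", "page=2&1a3=foo", "1a3x=foo", "add=1", ""]

for qs in SAMPLES:
    separate = is_hack_query(qs)
    combined = bool(COMBINED.search(qs))
    print(f"{qs!r:20} separate={separate} combined={combined}")
```

One thing this makes visible: a legitimate three-letter key made only of hex digits, such as add or bad, also matches the first pattern, so check that your site does not use query strings like that before relying on the rule. If the combined pattern agrees with the two separate ones on your samples, a single RewriteCond using it may be worth trying.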

Make that change, then try to cajole Google into recrawling your site. (In my case it took multiple requests over several days before they actually recrawled, even though they’re “supposed” to do it every 48-72 hours.)
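It may be worth confirming that the server really answers with a 410 before asking for a recrawl. The following is a minimal sketch using only Python's standard library. The function name fetch_status and its opener parameter are hypothetical conveniences (the parameter exists only so the function can be exercised without a network), and the URLs to check are whatever you pass on the command line.

```python
import sys
import urllib.error
import urllib.request


def fetch_status(url, opener=urllib.request.urlopen):
    """Return the HTTP status code the server sends for url.

    urlopen follows redirects, so this reports the status of the final
    response. It raises HTTPError for 4xx/5xx responses, which is caught
    here so that a 410 comes back as a plain number.
    """
    request = urllib.request.Request(url, method="GET")
    try:
        with opener(request) as response:
            return response.status
    except urllib.error.HTTPError as error:
        return error.code


if __name__ == "__main__":
    # Usage: python check_gone.py URL [URL ...]
    for url in sys.argv[1:]:
        status = fetch_status(url)
        verdict = "gone, as intended" if status == 410 else "not a 410"
        print(f"{status} ({verdict}): {url}")
```

Run it against a handful of the hacked URLs and against a few legitimate pages, to make sure the rules catch the former and leave the latter alone.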

Good luck!

Light, pollution, memory

Light pollution

I remember the first time I ever observed light pollution. I didn’t know what it was, and I’m not sure it even had a name back then.

It was 1993. I was in college, and I was home for Easter. In fact it was early Easter morning. My uncle was staying with us, in my room, which was in the process of becoming the guest room. He always stayed in my room when he stayed with us. Eventually I would stay in that room, as a guest room, not my room, once I was no longer a resident of the house, but a guest.

At the time, though, I was not yet a guest, though no longer quite a resident. Nonetheless, he was visiting, so he got my room and I was relegated to the couch in the family room. The family room, which had been added on in 1987, when I was 13, had two skylights. One was directly above the couch, so when I was lying on the couch I could look directly up at the sky.

When I was growing up, cities, at least the small town in which I grew up (which I always thought of as a city, despite its modest population of 26,210, which was no longer the population but had been in the 1970 census; the city could not yet bring itself to acknowledge the loss of over 10% of its population in the subsequent decades, so the old figure still appeared on the signs as you drove into town), had not yet switched over to sodium-based street lights. However, this particular small town/city had made the switch in the brief time since I had gone off to college in an even smaller town, one small enough that even I could make no pretense as to its being a “city.”

I awoke in the middle of the night. Technically, the early morning, Easter morning. It was overcast, and as I now know well, in a city illuminated by sodium streetlights on an overcast night, it is never truly dark, never truly nighttime. Instead, the best you get is an eerie orange twilight, which is what I observed for the first time in my life, that early Easter morning in 1993, 20 years ago.

It was perhaps 2 AM, and as I awoke, then arose, and walked to the kitchen to get a better view, I beheld the city aglow in an unnatural orange luminescence, and… well… it freaked the shit out of me. I had never seen anything like it, and I didn’t understand what could be causing it. Being Easter morning, and being highly impressionable, especially to my own half-lucid, half-dreamlike fantasies, I was sure Armageddon, or… something… was nigh.

Of course, it was not. And eventually I made the connection between the reference to sodium lights I’d heard on Sting’s The Soul Cages album and the eerie orange light, which has since become commonplace in my mostly urban adult life, where I am usually far too busy or distracted or just simply tired to bother to look up into the sky at night and think the kinds of existential, philosophical, cosmic, spiritual, infinite thoughts I used to dwell on so much between the ages of 5 and 22.

But tonight, for a brief moment, I lingered at my back door in south Minneapolis, with a glass of scotch in one hand and my iPhone in the other. On that late night/early Easter morning 20 years ago, I’m not sure which of the two would have seemed more out-of-place in my hands. Surely both would be just as out-of-place as apocalyptic paranoia in my 2013 brain. But still, the connection to that moment half a lifetime ago was there, and I was transported back to a place where I can stare into the sky at night, silently, and wonder.

Gettin’ Tumblrized

I asked SLP to help me find a way to automatically cross-post from WordPress to Tumblr, and she found this helpful article describing the available options, and recommending the WordPress plugin Tumblrize. I’m giving it a try with this post!

Update: Well… it doesn’t seem to work. Let’s keep trying…

Fun with CSS in WordPress

I just had an email exchange with an old friend and fellow web developer (and WordPress user) regarding some techniques for CSS trickery on home pages in WordPress themes. Up until this week, I had been running a version of my theme that just featured brief excerpts of articles on the home page. I was doing this by brute force in PHP, truncating the post text with the substr() function, and then cleaning things up using the strip_tags() function, which removes all HTML tags from a string.

It got the job done, but as he and I were discussing, it wasn’t pretty: it stripped out the “dangerous” stuff — that is, unclosed HTML tags (cut off during truncation) that would have screwed up the formatting of the rest of the page. But it also stripped out desirable styling (bold, italics, links) and paragraph breaks.

The ideal situation would be to have a way to show just the first two paragraphs of each post, retaining all of their original formatting. Of course, WordPress has a feature to handle this: if you put a <!--more--> comment tag in your post, your page template can truncate the post at that point, with a link to the single-post page to display the rest of the content. But I’ve never liked having to put that <!--more--> into my posts. I want a completely automated solution.

And then it hit me… this could be done with CSS. It took a little trial and error, but I came up with the following:

#content .entry p,
#content .entry h3
{
    display: none;
}

#content .entry h2:first-child,
#content .entry h2:first-child + p,
#content .entry h2:first-child + p + p
{
    display: block;
}

A few things to note:

  • This assumes that your entire loop is wrapped inside <div id="content">...</div>. You may need to come up with a specific ID to use just for this block in your index page, and be sure not to use that in your single-post page, or your posts will never appear in their entirety.
  • This also assumes that you’re using the WordPress convention of wrapping your posts in a pair of <div> tags with the attributes class="post" and class="entry" (though technically, class="entry" is the only one that matters here).
  • Your post title should be in an <h2> tag, immediately following <div class="entry">.
  • The first definition may need to be extended to include other HTML tags you want to hide on your index page. In this example, it’s only hiding content that is inside <p> or <h3> tags.
  • If you want to hide every HTML tag except the ones you explicitly specify, you could change the first block to #content .entry *, but keep in mind that will also remove styling like bold and italics, and it will remove links. Probably not what you want.

The specifics may vary depending upon how your WordPress theme is set up; I just know that with the way mine is set up, which pretty closely follows standard convention, this CSS worked to get the index page to list the posts and only show the first two paragraphs of each. (It also retained the images that I embed at the start of each post, as well as any embedded video from YouTube or Vimeo, since, at least the way I insert them, those are not wrapped in <p> tags.)

Note that all of the HTML content for each post is still loaded by the browser — we’re just using CSS to tell the browser not to show it on the page. This is not going to help with performance; it’s strictly aesthetic.

Two videos featuring Gov. Tim Pawlenty

Our governor measures his words with a Vernier caliper while dissembling Rush Limbaugh’s hope that Obama fails — as subtly hinted at by a vague, ambiguously titled article on his website (Limbaugh: I Hope Obama Fails) — on the Rachel Maddow Show:

…And is slightly less politic (though no more factually accurate) when criticizing Minneapolis Mayor R.T. Rybak in front of a less public audience in Rochester last month, as shown in this video of Mayor Rybak’s rebuttal:

On a positive note, I can confidently say that I’d rather have this Republican governor than a certain former Democratic governor in a certain other state who tried to sell a certain Senate seat vacated by a certain current President of the United States. Or any of the four (maybe five) other Republican governors I can name off the top of my head.

Don’t believe me? From west to east, Sarah Palin, Arnold Schwarzenegger, Bobby Jindal, Charlie Crist. Is Sonny Perdue still in Georgia? I ought to know him; I voted against him. I know there are some more, but in the words of the inimitable Donald Rumsfeld, they’re “known unknowns.”