Getting Google to remove fake hack URLs from its indexes for your site

As a web developer/systems admin, dealing with a hacked site is one of the most annoying parts of the job. Partly that’s on principle… you just shouldn’t have to waste your time on it. But also because it can just be incredibly frustrating to track down and squash every vector of attack.

Google adds another layer of frustration when they start labeling your site with a “This site may be hacked” warning.

A lot of times, this is happening because the hack invented new URLs under your domain that Google indexed, and for various reasons, Google may not remove these pages from its index after it crawls your thoroughly-cleaned site, even though those URLs are no longer there and are not in your sitemap.xml file. This issue may be exacerbated by the way your site handles redirecting users when they request a non-existent URL. Be sure your site is returning a 404 error in those cases… but even a 404 error may not be enough to deter Google from keeping a URL indexed, because the 404 might be temporary.

410 Gone

Enter the 410 Gone HTTP status. It differs from 404 in one key way. 404 says, “What you’re looking for isn’t here.” 410 says, “What you’re looking for isn’t here and never will be again, so stop trying!”

Or, to put it another way…

A quick way to find (some of) the pages on your site that Google has indexed is to head on over to Google (uh, yeah, like you need me to provide a link) and just do an empty site search, like this:

site:blog.room34.com

Look for anything that doesn’t belong. And if you find some things, make note of their URLs.

A better way of doing this is using Google Search Console. If you run a website, you really need to set yourself up on Google Search Console. Just go do it now. I’ll wait.

OK, welcome back.

Google Search Console lets you see URLs that it has indexed. It also provides helpful notifications, so if Google finds your site has been hacked, it will let you know, and even provide you with (some of) the affected URLs.

Now, look for patterns in those URLs.

Why look for patterns? To make the next step easier. You’re going to edit your site’s .htaccess file (assuming you’re using Apache, anyway… sorry I’m not 1337 enough for nginx), and set up rewrite rules to return a 410 status for these nasty, nasty URLs. And you don’t want to create a rule for every URL if you can avoid it.

When I had to deal with this recently, the pattern I noticed was that the affected URLs all had a query string, and each query string started with a key that was one of two things: either a 3-digit hexadecimal number, or the string SPID. With that observation in hand, I was able to construct the following code to insert into the .htaccess file:

# Force remove hacked URLs from Google
RewriteCond %{QUERY_STRING} ^([0-9a-f]{3})=
RewriteRule (.*) – [L,R=410]
RewriteCond %{QUERY_STRING} ^SPID=
RewriteRule (.*) – [L,R=410]

Astute observers (such as me, right now, looking back on my own handiwork from two months ago) may notice that these could possibly be combined into one. I think that’s true, but I also seem to recall that regular expressions work a bit differently in this context than I am accustomed to, so I kept it simple by… um… keeping it more complicated.

The first RewriteCond matches any query string that begins with a key consisting of a 3-digit hex number. The second matches any query string that begins with a key of SPID. Either way, the response is a 410 Gone status, and no content.

Make that change, then try to cajole Google into recrawling your site. (In my case it took multiple requests over several days before they actually recrawled, even though they’re “supposed” to do it every 48-72 hours.)

Good luck!

Fun with CSS in WordPress

I just had an email exchange with an old friend and fellow web developer (and WordPress user) regarding some techniques for CSS trickery on home pages in WordPress themes. Up until this week, I had been running a version of my theme that just featured brief excerpts of articles on the home page. I was doing this by brute force in PHP, truncating the post text with the substr() function, and then cleaning things up using the strip_tags() function, which removes all HTML tags from a string.

It got the job done, but as he and I were discussing, it wasn’t pretty: it stripped out the “dangerous” stuff — that is, unclosed HTML tags (cut off during truncation) that would have screwed up the formatting of the rest of the page. But it also stripped out desirable styling (bold, italics, links) and paragraph breaks.

The ideal situation would be to have a way to show just the first two paragraphs of each post, retaining all of their original formatting. Of course, WordPress has a feature to handle this: if you put a <!--more--> comment tag in your post, your page template can truncate the post at that point, with a link to the single-post page to display the rest of the content. But I’ve never liked having to put that <!--more--> into my posts. I want a completely automated solution.

And then it hit me… this could be done with CSS. It took a little trial and error, but I came up with the following:

#content .entry p,
#content .entry h3
{
display: none;
}

#content .entry h2:first-child,
#content .entry h2:first-child + p,
#content .entry h2:first-child + p + p
{
display: block;
}

A few things to note:

  • This assumes that your entire loop is wrapped inside <div id="content">...</div>. You may need to come up with a specific ID to use just for this block in your index page, and be sure not to use that in your single-post page, or your posts will never appear in their entirety.
  • This also assumes that you’re using the WordPress convention of wrapping your posts in a pair of <div> tags with the attributes class="post" and class="entry" (though technically, class="entry" is the only one that matters here).
  • Your post title should be in an <h2> tag, immediately following <div class="entry">.
  • The first definition may need to be extended to include other HTML tags you want to hide on your index page. In this example, it’s only hiding content that is inside <p> or <h3> tags.
  • If you want to hide every HTML tag except the ones you explicitly specify, you could change the first block to #content .entry *, but keep in mind that will also remove styling like bold and italics, and it will remove links. Probably not what you want.

The specifics may vary depending upon how your WordPress theme is set up; I just know that with the way mine is set up, which pretty closely follows standard convention, this CSS worked to get the index page to list the posts and only show the first two paragraphs of each. (It also retained the images that I embed at the start of each post, and also retains any embedded video from YouTube or Vimeo, since — at least the way I insert them — those are not wrapped in <p> tags.

Note that all of the HTML content for each post is still loaded by the browser — we’re just using CSS to tell the browser not to show it on the page. This is not going to help with performance; it’s strictly aesthetic.