As a web developer/systems admin, dealing with a hacked site is one of the most annoying parts of the job. Partly that’s on principle… you just shouldn’t have to waste your time on it. But it’s also because tracking down and squashing every vector of attack can be incredibly frustrating.
Google adds another layer of frustration when they start labeling your site with a “This site may be hacked” warning.
A lot of times this happens because the hack created new URLs under your domain that Google indexed, and for various reasons Google may not remove those pages from its index after it crawls your thoroughly cleaned site, even though the URLs no longer exist and aren’t in your sitemap.xml file. The problem can be exacerbated by how your site handles requests for non-existent URLs. Make sure your site returns a 404 error in those cases rather than redirecting… but even a 404 may not be enough to deter Google from keeping a URL indexed, because for all Google knows, the 404 might be temporary.
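As a side note, if you serve a custom error page, make sure it doesn’t quietly change the status code. In Apache, ErrorDocument with a local path preserves the 404, while a full URL makes Apache issue a redirect instead. A minimal sketch for your .htaccess, assuming a hypothetical /404.html at your web root:
# Serve a custom page but still return a true 404 status
ErrorDocument 404 /404.html
# Avoid this: a full URL makes Apache respond with a redirect, not a 404
# ErrorDocument 404 http://example.com/not-found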
410 Gone
Enter the 410 Gone HTTP status. It differs from 404 in one key way. 404 says, “What you’re looking for isn’t here.” 410 says, “What you’re looking for isn’t here and never will be again, so stop trying!”
A quick way to find (some of) the pages on your site that Google has indexed is to head on over to Google (uh, yeah, like you need me to provide a link) and just do an empty site search, like this:
site:blog.room34.com
Look for anything that doesn’t belong. And if you find some things, make note of their URLs.
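If you already suspect a pattern, you can narrow the search with Google’s inurl: operator, like this (using a spammy query-string key you’ll see me deal with below):
site:blog.room34.com inurl:SPID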
A better way of doing this is using Google Search Console. If you run a website, you really need to set yourself up on Google Search Console. Just go do it now. I’ll wait.
OK, welcome back.
Google Search Console lets you see which URLs Google has indexed. It also provides helpful notifications, so if Google finds your site has been hacked, it will let you know, and even provide you with (some of) the affected URLs.
Now, look for patterns in those URLs.
Why look for patterns? To make the next step easier. You’re going to edit your site’s .htaccess file (assuming you’re using Apache, anyway… sorry, I’m not 1337 enough for nginx) and set up rewrite rules to return a 410 status for these nasty, nasty URLs. And you don’t want to create a separate rule for every URL if you can avoid it.
When I had to deal with this recently, the pattern I noticed was that the affected URLs all had a query string, and each query string started with a key that was one of two things: either a 3-digit hexadecimal number or the string SPID. With that observation in hand, I was able to construct the following code to insert into the .htaccess file:
# Force remove hacked URLs from Google
# Case 1: query string starts with a 3-digit hex key, e.g. ?a3f=...
RewriteCond %{QUERY_STRING} ^([0-9a-f]{3})=
RewriteRule (.*) - [L,R=410]
# Case 2: query string starts with an SPID key, e.g. ?SPID=...
RewriteCond %{QUERY_STRING} ^SPID=
RewriteRule (.*) - [L,R=410]
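Once that’s in place, you can spot-check a rule from the command line with something like curl -I 'https://blog.room34.com/?SPID=123' (a made-up example URL matching the pattern) and confirm the response line reads 410 Gone.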
Astute observers (such as me, right now, looking back on my own handiwork from two months ago) may notice that these could possibly be combined into one. I think that’s true, but I also seem to recall that regular expressions work a bit differently in this context than I am accustomed to, so I kept it simple by… um… keeping it more complicated.
The first RewriteCond matches any query string that begins with a key consisting of a 3-digit hex number. The second matches any query string that begins with a key of SPID. Either way, the response is a 410 Gone status with no content.
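For the record, here’s what the combined version would presumably look like, using regex alternation in the RewriteCond pattern. Consider it an untested sketch:
# Untested combined version: key is either a 3-digit hex number or SPID
RewriteCond %{QUERY_STRING} ^([0-9a-f]{3}|SPID)=
RewriteRule (.*) - [L,R=410]
Apache also offers a dedicated G flag (writing [L,G] instead of [L,R=410]) as shorthand for a 410 Gone response.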
Make that change, then try to cajole Google into recrawling your site; Search Console lets you request a recrawl. (In my case it took multiple requests over several days before they actually recrawled, even though they’re “supposed” to do it every 48-72 hours.)
Good luck!