Getting Google to remove fake hack URLs from its indexes for your site

As a web developer/systems admin, dealing with a hacked site is one of the most annoying parts of the job. Partly that’s on principle… you just shouldn’t have to waste your time on it. But also because it can just be incredibly frustrating to track down and squash every vector of attack.

Google adds another layer of frustration when they start labeling your site with a “This site may be hacked” warning.

A lot of times, this is happening because the hack invented new URLs under your domain that Google indexed, and for various reasons, Google may not remove these pages from its index after it crawls your thoroughly-cleaned site, even though those URLs are no longer there and are not in your sitemap.xml file. This issue may be exacerbated by the way your site handles redirecting users when they request a non-existent URL. Be sure your site is returning a 404 error in those cases… but even a 404 error may not be enough to deter Google from keeping a URL indexed, because the 404 might be temporary.

410 Gone

Enter the 410 Gone HTTP status. It differs from 404 in one key way. 404 says, “What you’re looking for isn’t here.” 410 says, “What you’re looking for isn’t here and never will be again, so stop trying!”

A quick way to find (some of) the pages on your site that Google has indexed is to head on over to Google (uh, yeah, like you need me to provide a link) and just do an empty site search, like this:

site:blog.room34.com

Look for anything that doesn’t belong. And if you find some things, make note of their URLs.

A better way of doing this is using Google Search Console. If you run a website, you really need to set yourself up on Google Search Console. Just go do it now. I’ll wait.

OK, welcome back.

Google Search Console lets you see URLs that it has indexed. It also provides helpful notifications, so if Google finds your site has been hacked, it will let you know, and even provide you with (some of) the affected URLs.

Now, look for patterns in those URLs.

Why look for patterns? To make the next step easier. You’re going to edit your site’s .htaccess file (assuming you’re using Apache, anyway… sorry I’m not 1337 enough for nginx), and set up rewrite rules to return a 410 status for these nasty, nasty URLs. And you don’t want to create a rule for every URL if you can avoid it.

When I had to deal with this recently, the pattern I noticed was that the affected URLs all had a query string, and each query string started with a key that was one of two things: either a 3-digit hexadecimal number, or the string SPID. With that observation in hand, I was able to construct the following code to insert into the .htaccess file:

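# Force remove hacked URLs from Google
# Query string starts with a key that is a 3-digit hex number
RewriteCond %{QUERY_STRING} ^([0-9a-f]{3})=
RewriteRule (.*) - [L,R=410]
# Query string starts with the key SPID
RewriteCond %{QUERY_STRING} ^SPID=
RewriteRule (.*) - [L,R=410]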

Astute observers (such as me, right now, looking back on my own handiwork from two months ago) may notice that these could possibly be combined into one. I think that’s true, but I also seem to recall that regular expressions work a bit differently in this context than I am accustomed to, so I kept it simple by… um… keeping it more complicated.

The first RewriteCond matches any query string that begins with a key consisting of a 3-digit hex number. The second matches any query string that begins with a key of SPID. Either way, the response is a 410 Gone status, and no content.
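On the question of combining the two conditions: mod_rewrite patterns are PCRE, the same regex flavor PHP's preg_* functions use, so one low-risk way to try a combined pattern before touching .htaccess is to test it in PHP. This is only a sketch. The function name is_hacked_query_string and the sample query strings are made up for illustration, and the combined pattern is an assumption about how the two conditions could merge, not something that was run on the original server.

```php
<?php
// Hypothetical helper: true when the query string starts with a key that is
// either a 3-digit lowercase hex number or the literal string SPID.
function is_hacked_query_string(string $query): bool {
    // Assumed combination of the two RewriteCond patterns shown above.
    return preg_match('/^(?:[0-9a-f]{3}|SPID)=/', $query) === 1;
}

// Made-up sample query strings: the first two should be flagged.
foreach (['a3f=123', 'SPID=9', 'page=2', 'a3f9=1', ''] as $query) {
    echo "'$query' => ", is_hacked_query_string($query) ? '410' : 'ok', "\n";
}
```

If the pattern behaves as expected, the single-condition version in .htaccess would presumably be RewriteCond %{QUERY_STRING} ^([0-9a-f]{3}|SPID)= followed by one RewriteRule. Joining the two original conditions with the [OR] flag is another option.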

Make that change, then try to cajole Google into recrawling your site. (In my case it took multiple requests over several days before they actually recrawled, even though they’re “supposed” to do it every 48-72 hours.)
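Before asking for a recrawl, it may be worth confirming that one of the hacked URLs really does come back as 410. A rough sketch in PHP follows. The helper name status_code_from_line is hypothetical, and the URL in the usage comment is a placeholder, not one of the real affected URLs.

```php
<?php
// Hypothetical helper: pull the numeric status code out of an HTTP status
// line such as "HTTP/1.1 410 Gone". Returns null if the line doesn't match.
function status_code_from_line(string $statusLine) {
    if (preg_match('#^HTTP/\d+(?:\.\d+)?\s+(\d{3})\b#', $statusLine, $m)) {
        return (int) $m[1];
    }
    return null;
}

// Usage (placeholder URL, needs network access):
// $headers = get_headers('https://example.com/?a3f=123');
// if ($headers !== false) {
//     echo status_code_from_line($headers[0]), "\n";
// }
```

get_headers() returns the status line as element 0 of its result, so anything other than 410 there suggests the rewrite rules aren't matching that URL.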

Good luck!

When switching servers breaks code: a WordPress mystery

Earlier this week I launched a brand new WordPress site for a long-time client. Break out the champagne! But of course it’s never that simple, is it?

The client’s live server is a newly configured VPS running Ubuntu 16.04 LTS and PHP 7.0; meanwhile, our staging server is still chugging away on Ubuntu 14.04 LTS and PHP 5.5. So, clearly, a difference there. But I was pleased to find that, for the most part, the site functions perfectly on the new server.

But then the client discovered a problem: on one page, content from a custom post type query wasn’t displaying.

Here’s a short version of the pertinent code:

$people = new WP_Query(array(
  'order' => 'ASC',
  'orderby' => 'menu_order',
  'posts_per_page' => -1,
  'post_type' => 'person',
));

if ($people->have_posts()) {
  while ($people->have_posts()) {
    $people->the_post();
    ?>
    <article>

      <header><h2><?php the_title(); ?></h2></header>
      <div><?php the_content(); ?></div>

    </article>
    <?php
  }
}

Strangely, the_title() was working fine, but the_content() wasn’t. It had been — still is, in fact — working on our staging server, all other things within the WordPress context being equal. (Identical, up-to-date versions of the theme files and all plugins, and WP itself.) And the client confirmed that the content was present in WP admin.

I found, confusingly, that get_the_content() works, even though the_content() doesn’t. But of course you don’t get all of the proper formatting (like paragraph breaks) without some WP filters that the_content() applies, so I tried this:

<?php echo apply_filters('the_content', get_the_content()); ?>

Still didn’t work. After a bit more research I was reminded that the pertinent function that filter runs is wpautop(), so I just called that directly:

<?php echo wpautop(get_the_content()); ?>

Now I have the content displaying nicely, but this is clumsy and I really do not get what might be different. I know the new server is running PHP 7.0 and our staging server is running PHP 5.5… but I’m struggling to understand what kind of changes in PHP could cause this specific problem.

Since get_the_content() works, and the_content() doesn’t, the problem has to lie in something that’s happening with the filters on the_content(). Why? Because the_content() calls get_the_content() right up front. In fact, there’s not a lot to the_content() at all. This function lives in wp-includes/post-template.php (beginning at line 230 in WP 4.6). Here it is in its entirety (reformatted slightly for presentation here):

function the_content( $more_link_text = null, $strip_teaser = false) {
  $content = get_the_content( $more_link_text, $strip_teaser );

  /**
  * Filters the post content.
  *
  * @since 0.71
  *
  * @param string $content Content of the current post.
  */
  $content = apply_filters( 'the_content', $content );
  $content = str_replace( ']]>', ']]&gt;', $content );
  echo $content;
}

As you can see, it’s really just 4 lines of actual code. It calls get_the_content() to retrieve the content, applies filters, does an obscure string replacement (which I think I understand but is not really pertinent here), and then echoes the results out to the page.

It’s pretty clear to me that the problem has to lie in one (or more) of the filters in the 'the_content' stack. I have to admit that even after years of working with it, I only have a rather nebulous understanding of how hooks work, so I’m not even sure where to begin dissecting the filter stack here to pinpoint the source of the trouble.
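One place to begin is simply listing what is hooked to 'the_content'. WordPress keeps registered callbacks in the global $wp_filter, keyed by hook name and then by priority. The sketch below flattens one hook into readable lines. The function name describe_hook_callbacks is made up, and the code assumes the WP 4.6 layout (a plain array) while also accepting the WP_Hook object with a public callbacks property that later versions use.

```php
<?php
// Hypothetical helper: list the callbacks attached to one hook, in priority
// order. Pass in $GLOBALS['wp_filter']['the_content'] from inside WordPress.
function describe_hook_callbacks($hook): array {
    // WP 4.6: array keyed by priority. Later versions: object with ->callbacks.
    $byPriority = is_object($hook) ? $hook->callbacks : $hook;
    ksort($byPriority);

    $lines = [];
    foreach ($byPriority as $priority => $callbacks) {
        foreach ($callbacks as $callback) {
            $fn = $callback['function'];
            if (is_string($fn)) {
                $name = $fn;                          // plain function name
            } elseif (is_array($fn)) {
                $class = is_object($fn[0]) ? get_class($fn[0]) : $fn[0];
                $name = $class . '::' . $fn[1];       // class or object method
            } else {
                $name = 'closure/object';             // anonymous callback
            }
            $lines[] = $priority . ': ' . $name;
        }
    }
    return $lines;
}
```

Dropped temporarily into the template, something like echo '<pre>' . esc_html(implode("\n", describe_hook_callbacks($GLOBALS['wp_filter']['the_content']))) . '</pre>'; would show every filter that runs, which makes it easier to spot one that came from the theme rather than from WordPress itself.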

Whenever I know something works in one place and doesn’t work in another place, the first course of action in troubleshooting is to try to identify all of the differences between the two environments. Obviously we have some big differences here as I noted at the top of this post. But I am going to assume that the problem does not lie at the OS layer. Most likely it’s either a difference between PHP 5.5 and 7.0, or, even more likely, a difference between the PHP configurations on the two servers… specifically, modules that are or are not active. See my previous post on The Hierarchy of Coding Errors for my rationale here. Also keep in mind that I personally was responsible for installing LAMP on the server and configuring PHP, and it’s pretty obvious that we’re looking at the sysadmin equivalent of #1 or #2 in that list.

The next step, were I to care to pursue it much further (and if I didn’t have 200 other more important things to do, now that I have the problem “fixed”), would be to run phpinfo() on both servers and identify all of the differences.
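Eyeballing two phpinfo() pages side by side is tedious, so here is a rough sketch of a more mechanical comparison: capture the PHP version and loaded extensions on each server, then diff the two lists. The function names php_environment_snapshot and diff_snapshots are hypothetical, and this only covers extensions, not every ini setting phpinfo() reports.

```php
<?php
// Run on each server to capture a comparable snapshot of the PHP setup.
function php_environment_snapshot(): array {
    $extensions = get_loaded_extensions();
    sort($extensions);
    return ['version' => PHP_VERSION, 'extensions' => $extensions];
}

// Report extensions that appear in one snapshot but not the other.
function diff_snapshots(array $a, array $b): array {
    return [
        'only_in_a' => array_values(array_diff($a['extensions'], $b['extensions'])),
        'only_in_b' => array_values(array_diff($b['extensions'], $a['extensions'])),
    ];
}

// Print this server's snapshot as JSON so it can be saved and compared later.
echo json_encode(php_environment_snapshot()), "\n";
```

Saving the JSON output from staging and from live, then feeding both decoded arrays to diff_snapshots(), would give a short list of modules to investigate.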

That’s one possible path, at least. Another thing to consider is that the_content() actually is working just fine in other parts of the site, so maybe it would be worth digging into that WordPress filter stack first.

At this point, because as I said I have a few other more important things to work on, I will probably leave the mystery unresolved here. But I’d welcome any ideas from readers as to an explanation for all of this.


Update! I just couldn’t leave well enough alone, so a few minutes after I published this post, with the client’s permission, I restored the old version of the template, turned on WP_DEBUG and installed the Debug Bar plugin. Jackpot! Debug Bar returned the following error message when I was calling the_content(), but not when I had my “fixed” code in place:

[Screenshot: Debug Bar error message, captured 2016-09-01 at 9:24 AM]

Well, how about that? As it turns out, the problem is due to a filter I myself had added, using a previously written function. (That’s #3 on the hierarchy list.) Combine that with deprecated functionality that was removed in PHP 7.0, and problem solved. And I even figured out why the problem is only occurring on this page and not site-wide… because my filter only runs if there’s an email address link in the content.
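For anyone wanting to repeat the debugging step from the update, WP_DEBUG is a constant set in wp-config.php. The sketch below also sets WP_DEBUG_LOG and WP_DEBUG_DISPLAY, which the update doesn't mention. They are an assumption about a sensible setup on a live client site, where errors are better written to a log than shown to visitors.

```php
<?php
// Typically placed in wp-config.php, above the "stop editing" comment.
define('WP_DEBUG', true);          // report PHP notices, warnings and errors
define('WP_DEBUG_LOG', true);      // also write them to wp-content/debug.log
define('WP_DEBUG_DISPLAY', false); // keep them out of the rendered page
```

With these in place, errors raised inside a 'the_content' filter should appear in the log (and in Debug Bar, if installed) instead of failing silently.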

Pocket Sysadmin

I’m leaving on a jet plane, don’t know when I’ll be back again.

Actually, that’s not true: I will be back on Sunday. But the point is, I’m going on a trip… and more importantly, I’m not taking my laptop. Gasp! Can it be true? I’ll answer that for you: yes, it’s true.

OK, enough with this absurdly bad writing. On to the matter at hand: traveling sans laptop. Since I began freelancing in mid-2008, it’s been a given that I would always have my laptop with me. Not because I am a workaholic (not true) or because I’m an Internet junkie (true), but primarily because I needed to have a way to monitor and troubleshoot web server performance in case any of my clients had technical problems while I was gone.

But for this trip, I’ve decided I want to make it a real vacation. I want to remove the temptation to work when I really shouldn’t. I need a break. But I still need to be accessible if an emergency arises, and I need to be able to do something about it. Accessibility has been a non-issue ever since I got a cell phone: just call me. Or, since I got an iPhone: email, text, or call me (preferably in that order). But heretofore, the best I could do with a phone/iPhone was become aware of a problem. I still needed a full-fledged computer to actually do anything productive.

Over the past couple of weeks I’ve been preparing my iPhone to become an all-in-one tool for managing my business on the road. That meant setting up all of the diagnostic and troubleshooting tools I could, to ensure that I can adequately monitor the performance of the web servers I’m responsible for, and fix any problems that come up. Here are the tools I’m using to make that possible:

iStat

iStat is a very polished little utility for monitoring system performance. Its main feature is that it provides detailed stats on your iPhone itself: battery capacity, memory usage, storage available, IP address, uptime, running processes, and MAC address. But what’s really cool about it is that there are remote monitoring tools that allow you to monitor your Mac from your iPhone, or, more importantly for me, you can monitor a Linux, BSD or Solaris server. It requires a fair bit of command-line mucky-muck to set up (including compiling from source), but in a matter of minutes I had multiple Linux servers set up, sending their performance data to my iPhone wherever I go. With iStat, I can be proactive in monitoring server performance.

TouchTerm SSH

So, great. iStat lets me see how my servers are doing. But what if there’s a problem? That’s where TouchTerm SSH comes in. When I work with servers from my computer, the main application I use is Terminal. The SSH protocol allows me to connect securely to my servers with a command-line interface, where I can do anything I need to do: check running processes, modify configurations, restart services, etc. TouchTerm SSH is a fully-functional SSH terminal on the iPhone. With it, anything that I can do via SSH from my computer, I can now do with the iPhone. I just installed it today, so I haven’t completely put it through its paces, but I am sure that before long this will be one of the most indispensable apps I have installed on my iPhone. Even more than Ramp Champ.

Slicehost

This one’s a bit more specialized, obviously, but since Slicehost’s iPhone app was one of the main selling points for me to go with them in the first place, it’s worth acknowledging.

Slicehost is a VPS hosting company based in St. Louis. They offer great service at unbeatable prices. Running a VPS is not for the faint of heart, but if you’re not afraid of taking full responsibility for your own server, Slicehost is the way to go. Their simple web-based admin interface makes it a snap to set up your own server with one of any number of Linux distros (Ubuntu, Debian, Gentoo, CentOS, Fedora, Arch or Red Hat), and once it’s running, to monitor its performance and reboot if necessary.

The iPhone app’s functionality is pretty limited, but it has the one critical function I need: if the slice becomes unresponsive, you can reboot it. Sure, you can do that from their mobile website too, but it’s always fun to have another app on your home screen.