Getting Google to remove fake hack URLs from its indexes for your site

As a web developer/systems admin, dealing with a hacked site is one of the most annoying parts of the job. Partly that’s on principle… you just shouldn’t have to waste your time on it. But also because it can just be incredibly frustrating to track down and squash every vector of attack.

Google adds another layer of frustration when they start labeling your site with a “This site may be hacked” warning.

A lot of times, this happens because the hack invented new URLs under your domain, and Google indexed them. For various reasons, Google may not remove these pages from its index after it recrawls your thoroughly cleaned site, even though those URLs are no longer there and are not in your sitemap.xml file. The way your site handles requests for non-existent URLs can make this worse: if it redirects users somewhere else instead of returning an error, Google gets no clear signal that the page is gone. Be sure your site is returning a 404 error in those cases… but even a 404 error may not be enough to deter Google from keeping a URL indexed, because the 404 might be temporary.

410 Gone

Enter the 410 Gone HTTP status. It differs from 404 in one key way. 404 says, “What you’re looking for isn’t here.” 410 says, “What you’re looking for isn’t here and never will be again, so stop trying!”


A quick way to find (some of) the pages on your site that Google has indexed is to head on over to Google (uh, yeah, like you need me to provide a link) and just do an empty site search, like this:

site:blog.room34.com

Look for anything that doesn’t belong. And if you find some things, make note of their URLs.

A better way of doing this is using Google Search Console. If you run a website, you really need to set yourself up on Google Search Console. Just go do it now. I’ll wait.

OK, welcome back.

Google Search Console lets you see URLs that it has indexed. It also provides helpful notifications, so if Google finds your site has been hacked, it will let you know, and even provide you with (some of) the affected URLs.

Now, look for patterns in those URLs.

Why look for patterns? To make the next step easier. You’re going to edit your site’s .htaccess file (assuming you’re using Apache, anyway… sorry I’m not 1337 enough for nginx), and set up rewrite rules to return a 410 status for these nasty, nasty URLs. And you don’t want to create a rule for every URL if you can avoid it.
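
For contrast, here is roughly what the one-rule-per-URL approach looks like, using the Redirect directive from Apache’s mod_alias. The paths are made up for illustration. Note also that Redirect only matches the URL path, so it is no use for URLs that differ only in their query string:

```apache
# Hypothetical paths, for illustration only.
# "Redirect gone" makes Apache answer 410 for the given URL path
# (and anything beneath it).
Redirect gone /cheap-watches.html
Redirect gone /wp-content/uploads/spam/
```

A few of these are manageable; a few hundred are not, which is why a pattern is worth hunting for.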

When I had to deal with this recently, the pattern I noticed was that the affected URLs all had a query string, and each query string started with a key that was one of two things: either a 3-digit hexadecimal number, or the string SPID. With that observation in hand, I was able to construct the following code to insert into the .htaccess file:

# Force remove hacked URLs from Google
RewriteCond %{QUERY_STRING} ^([0-9a-f]{3})=
RewriteRule (.*) - [L,R=410]
RewriteCond %{QUERY_STRING} ^SPID=
RewriteRule (.*) - [L,R=410]

Astute observers (such as me, right now, looking back on my own handiwork from two months ago) may notice that these could possibly be combined into one. I think that’s true, but I also seem to recall that regular expressions work a bit differently in this context than I am accustomed to, so I kept it simple by… um… keeping it more complicated.

The first RewriteCond matches any query string that begins with a key consisting of a 3-digit hex number. The second matches any query string that begins with a key of SPID. Either way, the response is a 410 Gone status, and none of your site’s content is served.
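
If you do want to try combining them, there are two ways that should work with standard mod_rewrite, though I’d treat this as an untested sketch rather than a drop-in replacement. The first chains the two conditions with the [OR] flag (conditions are ANDed by default). The second folds both keys into one regular expression. Both use the [G] flag, which is mod_rewrite’s shorthand for a 410 Gone response:

```apache
# Option 1: keep both conditions, joined with [OR]
RewriteCond %{QUERY_STRING} ^[0-9a-f]{3}= [OR]
RewriteCond %{QUERY_STRING} ^SPID=
RewriteRule ^ - [G]

# Option 2: a single condition using alternation
RewriteCond %{QUERY_STRING} ^([0-9a-f]{3}|SPID)=
RewriteRule ^ - [G]
```

Use one option or the other, not both. Either one assumes RewriteEngine On already appears earlier in the file.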

Make that change, then try to cajole Google into recrawling your site. (In my case it took multiple requests over several days before they actually recrawled, even though they’re “supposed” to do it every 48-72 hours.)

Good luck!

Reflections on the frustratingly user-hostile motivations behind Google’s unified user accounts

“If it’s free, you’re not the customer, you’re the product.”

–Everyone on the Internet

As I’ve written about several times on this blog, my 11-year-old son did an informal internship with us at Room 34 this summer. Part of the process of getting him set up as a part of the business was giving him his own email address.

We use Gmail (as part of Google Apps for Business) for our email. As such, creating an account for him on our email domain essentially created a Google user account for him, because Google has, of course, bundled all of their services together under a single login: Gmail, YouTube, Google+ (which no one uses), etc. Sounds convenient, right? Sure, but…

A couple of weeks ago, unbeknownst to me (go ahead and judge my parenting now), my son discovered that with his mail login he was able to log into YouTube as well. We have made it clear to him in the past that (legally) you have to be 13 to get a YouTube account, and that we had no intention of helping him circumvent that. But, kids being kids, he tried to take advantage of this back door he had discovered.

Problem is, YouTube asked for his birthdate. And he gave it. His real birthdate.

Nope! said YouTube, and his account was suspended. But not just his YouTube account. His entire Google account. Suddenly we found he couldn’t log into his email. I went into our Google Apps for Business account to manage the domain, and I discovered, to my supreme annoyance and frustration, that when a user account is “suspended” it really is suspended — it’s in a strange state of semi-existence. It can’t be used, but it also can’t be deleted, even by a domain administrator. So now his email address — his email address on my business domain name, not “gmail.com” — is entirely untouchable.

It’s no surprise that we are Google’s product. A customer is a person or company who pays for products or services rendered. Google’s advertisers are their customers, and our attention is the product they are selling.

As a result, Google collects enormous amounts of data about its users. It tracks as much of our activity across all realms of the Internet as possible. That’s why we are a valuable product to their customers — the advertisers. The more information Google collects about us, the more valuable we become as targets for advertising. And all of that data collection is why Google is required to comply with the federal law regarding collection of information about people under the age of 13 on the Internet. Therefore, my 11-year-old son not only can’t have a YouTube account, but he can’t have an email address that is connected to Gmail, because a Google account is a Google account, period.

On a basic level this is a major inconvenience to me and to my son for our purposes of getting him experience working on the Internet. But on a much deeper level, it is more profoundly disturbing for its privacy implications.

As a web developer, I work often with Google Analytics. I help our clients set it up on their websites. I even use it on my own sites (including this one). It’s great to see where your traffic is coming from, which parts of your site are or aren’t getting traffic, which devices and browsers your visitors are using, etc.

But remember, Google isn’t just collecting that data for your benefit. They’re collecting all of that and much more for their own purposes, far beyond what they even make available to site owners on Google Analytics.

Google has created a scenario through Gmail and YouTube (and, I suppose, Google+) where a large percentage of Internet users are logged into Google at all times, with cookies stored in their browsers. Combine that with Google Analytics being installed on a large percentage of websites around the world, and Google knows that you are visiting all of those sites. You may not be providing the sites you’re visiting with any of your information, and they can’t read Google’s cookies themselves, but they’re pulling in a little piece of Google on every page load, and that piece of Google can read the Google cookies on your computer, identifying not just a computer with your same OS and web browser, connecting from your specific IP address, but you, the logged-in Google user.

What are they doing with that information? And what might someone else do with that information?

I do not like this, not one bit. And yet I still happily use these Google services. And you probably do too.

The Outside Scoop: Thoughts on Android Wear and a possible iWatch

The big news in tech today is Google’s announcement of Android Wear, a version of their Android OS specifically optimized for “wearables” like watches.

The tech media is erupting with ridiculously titled blog posts that refer to this as Google’s “answer” to the iWatch, a product that Apple has not announced, nor even acknowledged working on.

Surprisingly, for the first time I actually found one of these wearables mildly interesting: the Moto 360. But I am still skeptical of wearables in general, smart watches in particular, and especially the idea that Apple is working on one. But I’ve learned from my past mistakes, when I was convinced Apple was working on neither a smartphone in late 2006 nor a tablet in late 2009. So, in my world at least, my adamant belief that Apple is not developing a watch should probably be my biggest clue that they are.

So where is Apple’s “iWatch”? Aren’t all of these competitors eating Apple’s lunch (before it’s even cooked)? Perhaps. But consider this:

Remember the original iPod. It came into a market that already existed (but sucked), and delivered a radically superior user experience, and was a huge hit. Remember the iPhone. Once again, it came into a market that already existed (but sucked) and totally revolutionized it.

The thing is… a smart watch market doesn’t really exist (or didn’t when rumors of an “iWatch” first started to circulate). It almost seems like Apple got the wheels of the rumor mill turning deliberately, to goad their competition into creating the market, thinking they were beating Apple to the punch but in fact creating the exact environment of suck Apple needs to release a product into.

Oversharing and paranoia

Oversharing is an inherent part of social media. Just ask anyone who’s made the mistake of clicking a Socialcam link on Facebook.

But oversharing takes different forms, and the most potentially dangerous type is one many people don’t even realize exists: the copious logging of your online activities by the social networking sites you’re logged into. Thanks to their “deep integration” with other websites, you may be “sharing” your browsing habits with Facebook, Twitter and Google even when you’re not on their sites.

Have you ever been on a site and noticed a little corner of the site looks like it’s been invaded by Facebook? That sickly blue, the font, the little profile pictures of your friends who’ve liked or commented on the page you’re currently viewing?

How did that get there? It’s because the site is integrating with Facebook, and through the magic of cookies, Facebook’s servers can tell that it’s you looking at the page and deliver content customized to your profile. Maybe you like that, but I find it a little creepy. Twitter and Google do it too, even if it’s not as obvious.

Google may be the most insidious, with so many of its tools now consolidated under a single login. If you use Gmail, and you keep your account logged in, every Google search you do is logged. Ostensibly this is to help deliver “personalized” results. More crassly, it is used to put “targeted” ads in front of your eyeballs. But that data is being collected, and regardless of what Google says their privacy policy is now, the data is there, and could stay there for a long time. Someday Google might change their policies or sell that data or the government might subpoena it or just come in and take it.

What’s worse, Google Analytics is everywhere. Heck, even paranoid old me uses it. Google says Analytics isn’t tied in with your Google account, and maybe it’s not… yet. But why assume it will always be that way?

Fortunately, there’s something very simple you can do to combat all of this data collection. It’s the online equivalent of a tinfoil hat, except it actually works. Log out. And just to be safe, clear your cookies.

I’m trying something out right now that takes all of this even a step further. It all hinges on the fact that in all three of these cases — Facebook, Twitter and Gmail — the web interface is probably the least usable, least satisfying way to experience these services. I’ve never really been a user of Gmail’s web interface; I’ve always preferred using the Mac’s built-in Mail application. But now I’m also strictly using the Twitter app on my Mac. (I already use Tweetbot on my iPhone.) And I have made the decision not to use Facebook on my computer at all. I already hated the Facebook web experience anyway, so why bother with it? Now I am only going to check it using the Facebook iPhone app.

The real reason Android is (and has always been) in trouble

Over on Daring Fireball, John Gruber links to a Business Insider piece by Jay Yarow, called “Android Is Suddenly in a Lot of Trouble.”

Gruber responds:

It’s not that Android is suddenly in a lot of trouble — it’s that a lot of people are suddenly realizing that Android has been in trouble all along.

Exactly. But he doesn’t go on to mention why it’s been in trouble all along (though as I recall, he has in the past). I’ve seen plenty of reports, like this one from comScore, showing that iPhones use WiFi networks significantly more than Android phones in the U.S. and U.K. This is one way of measuring the qualitative differences in how people use iPhones compared to how they use Android phones. You could also talk about app revenue, for instance.

All of these measurements and analyses point to one clear conclusion, especially when one considers how people end up walking out of a store with either an iPhone or an Android phone. Carriers are pushing Android because they can control the experience more. They’re giving away Android phones as stock upgrade models when customers’ contracts come up. People who don’t even care about owning a “smartphone” are bringing home Android phones because that’s just what the sales rep at the store recommended.

Android is in trouble because a lot of its users (the majority? the vast majority?) are just using it as a phone. It’s a commodity. A lot of the people buying it don’t really know or care what it is, and will never actively use its full potential. It’s just a phone. It may be capable of much more, but if it’s not being used for more, what difference does that make?

People who go into a store wanting to purchase a smartphone predominantly choose the iPhone. Not all of them, of course. Tech-savvy people do choose other smartphone platforms, including Android, especially those who want to tinker with the system. But the rest take whatever they are told to buy by their carriers’ sales reps.

This is the biggest reason Android tablets haven’t taken off, and it’s been discussed too. There’s a built-in market for the apathetic purchase of an Android smartphone. But no one (well, I hope) is walking into a cellular carrier’s store and saying “I want a tablet. What tablet do you recommend?” People who want a tablet don’t just want a tablet; overwhelmingly they want an iPad. Most people who don’t want an iPad don’t want a tablet at all. (Almost) everybody needs a phone.

The problem for the carriers, and the reason they’ve been promoting Android, has typically been that Apple retains too much control (from the carriers’ perspective) over the iPhone. That’s not likely to change, but with Windows Phone, suddenly the carriers have other options. Microsoft is definitely keeping a tighter rein on Windows Phone than Google does with Android, but with Windows Phone, the carriers still have options they don’t get with the iPhone. (Not that this lack of control has prevented them from selling millions of the things.)

If Verizon is serious about pushing Windows Phone (along with the fact that they still sell huge numbers of iPhones), then we’ll soon begin to see just how Android was, as Gruber says, in trouble all along. The success it has achieved to date was largely dependent upon carriers pushing it on unsuspecting or indifferent customers. If they stop doing that…