Migrating primary keys in a CakePHP site’s database from GUIDs to integers

If you are a CakePHP developer, and find yourself in a situation where your database is using GUIDs but you want to switch to using integer-based primary keys instead, this post will describe how I dealt with that exact situation. (Actually, the post will do that even if you’re not a CakePHP developer; it just probably won’t be of any interest to you.)

Some unnecessary background details you can skip if you’re in a hurry

I may be the only CakePHP developer silly enough ever to have used GUIDs (those crazy pseudo-unique 36-character hexadecimal strings) instead of plain auto-increment integers as primary keys. Maybe not. In any case, it only took me one CakePHP project to realize the folly of the approach and I quickly abandoned GUIDs on my second CakePHP site.

Now the time has come, nearly four years later, to finally update that first CakePHP site to the latest version of cms34. There will surely be numerous challenges involved in the upgrade, given the extensive evolution the system has undergone in the intervening years. But without a doubt the biggest, or at least the first, of those challenges is getting the data converted from using GUIDs to integer-based primary keys.

The reason I used GUIDs in the first place was that I have a few multi-purpose cross-reference tables in the structure, and I wanted to just be able to refer to a single ID field as a foreign key, without having to keep track of both the table/model and an integer ID. Kind of silly, in retrospect, especially since I quickly realized how important it was to be able to easily identify which table the foreign key pointed to. At the time I was singularly focused on trying to keep the database as stripped-down and basic as possible, to a self-defeating degree.

But my early CakePHP learning curve is a bit beside the point here. The goal of this post is to document the process I am undertaking to migrate the site, for the benefit of anyone else who finds themselves in a similar situation. (Hopefully, for you, the idiot who used GUIDs was not yourself.)

Step 1: Preliminary set-up

First, a qualifier: the way I am doing this will result in a lot of gaps in the sequence of integers used in the individual tables after the migration. If the actual values of your integer primary keys matter to you, you may not want to take this approach. I can’t imagine a reason why the values would really matter, though.

The idea in this step is, in essence, to convert each GUID in the system into a unique integer that can be used later in the update process. There’s no reason why the integers have to be unique across tables any longer; it’s just that having a definitive reference, in a single table, of how each GUID maps to an integer will make some of the later steps in the process easier. If you feel compelled to avoid skipping numbers in individual tables, you could add an extra field called table_id and increment table-specific IDs in that field. But that would require extra code in the next step, and it just seems like an extra complication without any real benefit.

Do not do any of this on a live site. Set up a duplicate copy of your site somewhere else, and do all of this conversion work there. Also discourage users from updating the existing site while you’re doing the migration, if possible. This conversion can be set up as a repeatable process, however, so you could always do some test runs, and then once you have everything working reliably, get a fresh dump of the live database again, minimizing downtime.

In your copy database, create all of the tables from your old database and load in the data. Also set up the empty tables for the new schema (which hopefully doesn’t have any overlapping table names; if it does, add a prefix to the old tables’ names). Once you have both the old and new tables in your copy database, and the old tables loaded with data, you’re ready to proceed.
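
If you do need to prefix the old tables, something like this will do it. This is just a hypothetical sketch, assuming a MySQL connection like the one set up in the next step; adjust the table list to match your own schema.

// Hypothetical example: prefix the old tables so their names don't collide
// with the new schema's tables. Adjust the list to match your own tables.
$old_tables = array('pages', 'users', 'attachments');
foreach ($old_tables as $table) {
    mysql_query('RENAME TABLE `' . $table . '` TO `old_' . $table . '`');
    if (mysql_error()) { echo '<p>' . mysql_error() . '</p>'; }
}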

Step 2: Build and populate a GUID map table

Assuming you’re using MySQL, this will take care of it:

CREATE TABLE IF NOT EXISTS `guid_map` (
`id` int(11) unsigned NOT NULL AUTO_INCREMENT,
`table` varchar(255) NOT NULL,
`guid` char(36) NOT NULL,
PRIMARY KEY (`id`)
) ENGINE=MyISAM DEFAULT CHARSET=latin1 AUTO_INCREMENT=1;

If you have adequate permissions to create the table from within a PHP script, you may as well start building that script now, because you’ll be adding to it in the next step anyway:

<?php
// Connect to database (replace values as appropriate)
$db_name = 'database'; // Set to your database name
$dbconn = mysql_connect('localhost', 'username', 'password', true);
mysql_select_db($db_name, $dbconn);

// Build guid_map table
mysql_query("CREATE TABLE IF NOT EXISTS `guid_map` (
    `id` int(11) unsigned NOT NULL AUTO_INCREMENT,
    `table` varchar(255) NOT NULL,
    `guid` char(36) NOT NULL,
    PRIMARY KEY (`id`)
    ) ENGINE=MyISAM DEFAULT CHARSET=latin1 AUTO_INCREMENT=1"
);

Next, you’ll want to build a PHP script that will crank through all of your old tables, grab all of the GUIDs, and write them into this new table. Here’s a snippet of PHP to get the job done. You’ll need to decide how you want to run it; it assumes a database connection is already established, so you could just drop it in as an action in one of your controllers or you could build a stand-alone PHP script and add the necessary database connection code at the top.

Note: This works with direct SQL queries, not CakePHP’s model methods. That’s just how I did it. Also, when I build PHP utilities like this, I run them in a browser and like to dump status information as basic HTML. If you prefer to run this at the command line, or don’t care about a verbose dump of the details, you could modify/remove any lines here that contain HTML.

// Clear current GUID map data (makes this process repeatable)
mysql_query('TRUNCATE TABLE guid_map');
if (mysql_error()) { echo '<p>' . mysql_error() . '</p>'; exit; }
echo '<p>GUID map cleared.</p>';

// Get tables
$old_prefix = 'old_'; // Set to prefix for old table names
$rs = mysql_query('SHOW TABLES');
$tables = array();
if (!empty($rs)) {
    while ($row = mysql_fetch_array($rs)) {
        if (strpos($row[0],$old_prefix) === 0 && $row[0] != 'guid_map') {
            $tables[] = $row[0];
        }
    }
}

// Loop through tables, retrieving all GUIDs
foreach ((array)$tables as $table) {
    $sql = 'SELECT id FROM `' . $table . '`';
    $rs = mysql_query($sql);
    
    // Loop through rows, adding to GUID map
    $error_count = 0;
    $mapped_count = 0;
    if (mysql_num_rows($rs) > 0) {
        echo '<hr /><h3>Processing ' . mysql_num_rows($rs) . ' rows in table: ' . $table . '</h3>';
        while ($row = mysql_fetch_assoc($rs)) {
            $sql = 'INSERT INTO `guid_map` (`table`, `guid`) VALUES (\'' . mysql_real_escape_string($table) . '\', \'' . mysql_real_escape_string($row['id']) . '\');';
            mysql_query($sql);
            if (mysql_error()) { echo '<p>' . mysql_error() . '</p>'; $error_count++; }
            else { $mapped_count++; }
        }
        echo '<p>Mapped ' . $mapped_count . ' rows with ' . $error_count . ' error(s).</p>';
    }
    else {
        echo '<p>No data rows found in table: ' . $table . '</p>';
    }
}

Once you’ve run this script, your guid_map table will contain a row for every row of every old table in your database, and you now have an integer ID you can use for each of them. What’s extra cool about it is that it’s repeatable. Run it as many times as you want. Get a new dump of data from the old site and run it again. Fun!
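
If you want a quick sanity check, a query like this (optional, and just one way to do it) will show how many GUIDs got mapped per table:

// Optional sanity check: how many GUIDs were mapped per table
$rs = mysql_query('SELECT `table`, COUNT(*) AS `mapped` FROM `guid_map` GROUP BY `table`');
while ($row = mysql_fetch_assoc($rs)) {
    echo '<p>' . $row['table'] . ': ' . $row['mapped'] . ' GUIDs mapped</p>';
}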

The next step is where things get a lot more complicated, and a lot more customized to your specific data structures. I’ll be using one of my actual data tables as an example, but bear in mind that your system is unlikely to match this schema exactly. This is just a demonstration to work from in building your own script.
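
To give a sense of the general pattern before we get into the specifics, here is a minimal sketch that translates a hypothetical old_articles table, with a GUID primary key and a GUID user_id foreign key, into a new articles table. The table and column names are made up; substitute your own.

// Generic illustration only -- table and column names here are hypothetical.
// For each old row, translate its GUID primary key and GUID foreign keys to
// integers via guid_map, then insert the row into the new table.
$rs = mysql_query('SELECT * FROM `old_articles`');
while ($row = mysql_fetch_assoc($rs)) {
    // GUIDs are unique across all of the old tables, so looking up by guid alone works
    $pk = mysql_fetch_assoc(mysql_query("SELECT id FROM `guid_map` WHERE `guid` = '" . mysql_real_escape_string($row['id']) . "'"));
    $fk = mysql_fetch_assoc(mysql_query("SELECT id FROM `guid_map` WHERE `guid` = '" . mysql_real_escape_string($row['user_id']) . "'"));

    mysql_query("INSERT INTO `articles` (`id`, `user_id`, `title`) VALUES ("
        . (int)$pk['id'] . ", "
        . ($fk ? (int)$fk['id'] : 'NULL') . ", '"
        . mysql_real_escape_string($row['title']) . "')");
    if (mysql_error()) { echo '<p>' . mysql_error() . '</p>'; }
}

Real tables will have more columns and more foreign keys to translate, but the guid_map lookup is the heart of it.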

Microsoft Word’s formatting garbage, quantified

Anyone who’s spent any amount of time working on the web dreads it: content delivered in Microsoft Word format. Word adds tons of formatting garbage that results in bloated files and messes up the presentation when content gets brought into HTML.

When Microsoft released Office 2007, they touted switching to an XML-based document format for all of the apps. But all XML is not created equal.

Case in point: I am currently working on a project that is going to involve receiving content for a number of web pages in a tabular form, either in Word or Excel format. A spreadsheet, essentially (if not technically), with each page represented by a row, and its text content in a cell. I will be writing a PHP script to parse the spreadsheet data and generate a set of HTML files with the content loaded in them.

I’m currently trying to determine if Word or Excel would be the better format to receive the content in, which involves opening up .xlsx and .docx files in BBEdit and looking at the raw data stored within them. I’ve managed to identify the embedded XML files in each that hold the actual content. These files store the same actual text content, but their XML schemas vary based on the needs of Word and Excel.
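
If you want to poke at the raw XML programmatically rather than in a text editor, here is a minimal PHP sketch. Both formats are just ZIP archives, and the paths below are the standard OOXML locations for the main content (not necessarily the exact files I compared); the file names are placeholders.

<?php
// Minimal sketch: .docx and .xlsx files are ZIP archives, so you can pull out
// their content XML directly. The paths are the standard OOXML locations.
$zip = new ZipArchive();

if ($zip->open('sample.docx') === true) {
    $xml = $zip->getFromName('word/document.xml'); // Word's main document body
    echo 'Word XML: ' . strlen($xml) . " bytes\n";
    $zip->close();
}

if ($zip->open('sample.xlsx') === true) {
    // Excel keeps cell text in sharedStrings.xml; the sheet layout lives in
    // xl/worksheets/sheet1.xml
    $xml = $zip->getFromName('xl/sharedStrings.xml');
    echo 'Excel XML: ' . strlen($xml) . " bytes\n";
    $zip->close();
}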

So… how do they match up? The XML file I pulled out of Excel is 14 KB. The one from Word is 202 KB. For the mathematically inclined amongst you, that’s a little more than 14 times as large. Yes… another (perhaps more hyperbolic) way you could say it is that the Word document is exponentially larger.

That’s just ridiculous.

What makes up the difference? Well, the Excel file’s XML is nothing but basic tags. There are no attributes on any of the tags, as far as I can tell. It’s pure semantic structure. The Word XML, on the other hand, is almost nothing but attributes. And there’s nothing smart about them either. Most of them are assigning fonts to the text. The same font names, over and over and over again throughout the file.

That’s… beyond ridiculous.

The (highly biased) case against CSS preprocessors

Everybody who’s anybody is using CSS preprocessors!

Or so it feels. I’m an old-school vanilla CSS curmudgeon, and the more I’ve dipped my toes into working with CSS preprocessors (specifically, SCSS with Compass), the less I like them.

As I see it, there are three main problems with vanilla CSS:

No variables. Honestly this is probably the only problem I really have with CSS. I’d like to be able to set variables for things like colors that I use throughout a site. To a lesser extent, I see the benefit of “mixins” — reusable chunks of CSS.

Redundant code. It really depends on how you conceptualize your CSS structures, but it is very easy to fall into a habit of writing the same CSS code over and over again, resulting in bloated, hard-to-maintain files. While I am guilty of this just like anyone else, I find that if you format your CSS code properly, you can combine properties efficiently to avoid redundancy without needing any external solutions.

Lack of programmatic logic. Here I’m thinking about things like conditional statements, and also math operations. This is probably as much of a strength as a weakness. CSS is a stylesheet, not a program. It’s a set of rules to be applied to formatting a document. There’s nothing programmatic about it. But still, as CSS selectors become more complicated and convoluted, it is clear that in some cases light programming logic would be helpful.

The real question is, do CSS preprocessors actually solve these problems? Or, more specifically, do they solve them without introducing new problems that are at least as bad as the ones they’re trying to fix?

For me the answer has been, and continues to be, no, they don’t. But I’m trying to get a more tangible explanation for why that is, rather than the simple gut feeling that’s been driving me away from using them up to this point.

What are other people saying?

My first stop in trying to answer this question of why I dislike CSS preprocessors was Google. I wanted to see what other people were saying, pro and con. Here are some interesting blog posts I found, going back a few years to the early (or at least earlier) days of CSS preprocessors:

The problem with CSS pre-processors
This article by Miller Medeiros was the first one I came across a few months ago when I initially pondered this question, and at the time it was all I needed to satisfy myself that I was not crazy for wanting to avoid CSS preprocessors.

So, assuming preprocessors do solve the problems with vanilla CSS, what are the problems they introduce? And how bad are they, really?

I need to get specific to my own situation for a minute here. I have a former coworker who is a firm believer in SCSS, and now that he’s gone, I’m left to maintain and extend the code he was writing. This experience casts my aversion to CSS preprocessors into stark relief.

Maintenance can be a challenge

The most obvious issue with using a preprocessor is that the output CSS is not exactly easy to hand edit, and worse, you shouldn’t hand edit it, because your changes don’t end up back in the original SCSS (or whatever format you’re using) files. If someone else goes back in later and edits the original SCSS and generates new CSS, your changes will get lost.

But let’s set that matter aside for a minute. In a broader sense, one of the key challenges for any web developer is to build code that is easy to maintain. Not just for you to maintain, but for whoever comes after you to maintain. Face it, you’re not going to be working on this project forever. At some point, someone else is either going to need to take it over, or throw away what you did and start fresh.

If you’ve ever jumped into an existing project midstream, or been handed the task of maintaining something someone else built — especially if it’s something built by someone who is no longer around to answer your questions — you know that it’s rarely an ideal situation. Even if the previous developer left copious documentation, it can take hours of picking apart their code to really get a firm understanding of how it all works. This is true with plain CSS too, but at least with plain CSS it’s a lot less work to track down a particular piece of troublesome code.

It’s non-standard

What happens when certain features of your favorite preprocessor get rolled into a future version of vanilla CSS? What happens if similar (but incompatible) features of a different preprocessor become the standard? In short, what if everything you’re doing right now becomes obsolete? How long are you going to hang on to doing things your old way with your no-longer-relevant preprocessor, before you have to scrap it and start over, or at least rewrite big chunks of your code to fit the new way everyone is doing things?

Call me a curmudgeon, but having been a professional web developer for over 18 years I’ve seen a lot of technologies and design trends come and go. I’ve always been skeptical of anything non-standard. I never used VBScript or ActiveX, I never embraced Flash, and in general I’ve done everything I could to both champion and adhere to open standards as much as possible throughout my career.

Suffice it to say, resisting CSS preprocessors is just in my blood. They just don’t feel right to me. I’d rather do without features I want, if they’re not part of the standard, than resort to a non-standard workaround to make them happen… especially if it looks like there’s a reasonable chance they’ll be added to the standard at some point in the not-too-distant future.

On web browsers ignoring autocomplete="off" for passwords (and, on devbros ignoring the messiness of reality)

A couple of frustrating things have occupied much of my afternoon: unexpected web browser behavior, and annoying devbro behavior.

The unexpected web browser behavior was a stubborn, persistent refusal to honor the autocomplete="off" attribute for passwords. The annoying devbro behavior was a stubborn, persistent refusal to honor the legitimacy of a question. In this thrilling post, I’ll be looking at both.

tl;dr

If you’re just trying to find a simple way to get browsers to honor autocomplete="off" on password fields, here you go: THERE ISN’T ONE. Browsers ignore it for a legitimate reason, but I feel they’re overzealous in their implementation.

However… if you’re trying to do what I was trying to do — basically using a password field as a write-only input — it’s never going to work, for a completely different reason. Why? View source and get back to me.

The only way to really create a write-only input is to not pass the value into the form at all, and have some other mechanism on the back end to detect whether or not the user entered a new value.
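
Here is a minimal sketch of that pattern in PHP; the save_setting() function is hypothetical, so substitute whatever your app actually uses to persist settings:

<?php
// Minimal sketch of a write-only field (save_setting() is a hypothetical helper).
// When displaying the form: never echo the stored secret back into the markup.
echo '<input type="password" name="client_secret" value="" autocomplete="off" />';

// When saving: only overwrite the stored secret if the user actually typed a new one.
if (isset($_POST['client_secret']) && $_POST['client_secret'] !== '') {
    save_setting('client_secret', $_POST['client_secret']);
}

An empty submission then just means “leave the existing value alone,” which is exactly what you want from a write-only field.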

I was working on an admin page for a site I built; specifically, a page for editing some low-level configuration settings on the site.

One setting is the “client secret” for authenticating with an external API. I realized having that visible on-screen (even if it’s only on pages that require an administrator-level login) was a bit of a security issue, so I thought I’d quickly solve it by changing the input from type="text" to type="password".

Well, not so fast. Because while it worked, it also triggered Safari’s autocomplete feature, replacing the value in that field with my password for the site, and, maddeningly, replacing the value in the field before it with my username.

Then I remembered an attribute I almost never use: autocomplete="off". I figured I’d just stick that on those fields and, boom, problem solved. Only then did I realize that Safari (and, as it turns out, every other modern browser) completely ignores autocomplete="off" on passwords.

I set off for StackOverflow in search of an answer to my conundrum, and as is often the case, I stumbled upon another maddening thing: a devbro offering a “You’re doing it wrong” answer. Seriously… I know nothing about the author of this post, so I hesitate to use that epithet, but he literally said this:

It is so wrong that browsers are intentionally ignoring anyone who tries to do it. Those people should stop doing the wrong thing.™

This was at the end of a 600+ word answer that demonstrated classic devbro reasoning:

  1. Rejecting plausibly legitimate use cases as invalid.
  2. Denying obvious design flaws in the existing solution.
  3. Blaming the user for “doing the wrong thing.™”

Back to the specific matter of browsers ignoring autocomplete="off", there are some real problems here:

The autocomplete attribute is standard HTML that browser developers have deliberately chosen to ignore. I understand that browsers are constantly evolving, especially in response to unscrupulous web developers who find ways of exploiting features, and there are definitely ways this feature could be exploited. But!

Browsers are absurdly overzealous in their rejection of autocomplete="off". I was in a situation where the password field was in the middle of a form with 9 text inputs and 5 sets of radio buttons. How often does a login form have all of that? And, none of the fields had “username” or “password” anywhere near them… not in the name or id attributes, nor in their associated <label> tags. AND, there were already prefilled values in the form that got overwritten by autocomplete. AND, Safari is even programmed to aggressively fight any efforts at a workaround, such as extra password fields hidden with CSS.

It was the last bit that really irked me, because my experimentation showed that Safari is programmed to detect very specific CSS properties being applied to other fields in the form to determine whether or not there’s a second password field (i.e. that this is a form for resetting your password). If it detects any of various ways of making the field invisible, like display: none; or visibility: hidden; or position: absolute; left: -9999px;, then it will still autocomplete the password. The only CSS I was able to fool it with was opacity: 0; but I suspect that will change soon too.

It’s fine to argue that web developers shouldn’t use autocomplete="off" on password fields, and I even agree, to an extent, for login forms. But if browsers are going to ignore it, and have intricate means of detecting efforts at circumventing that, then they should also recognize that there are legitimate uses for <input type="password"> outside of login forms, and in those cases, honor autocomplete="off".

Here are some suggestions:

Don’t ignore autocomplete="off" if the form contains more than three or four visible input fields. Most login forms only contain two fields. I’ve seen some with more, so I’m being generous here. But how often does a login form contain ten or more fields? Come on.

Look for additional context before autocompleting login fields. Yes, developers can find ways to circumvent this, but if the only clue that this is a login form is the presence of a single password field, then maybe it’s not a login form. Look at the name and id attributes, the <label> tag, and the form’s action attribute. If it really, really doesn’t look like a login form, especially if the developer has also added autocomplete="off", maybe give them the benefit of the doubt and settle down with the autocomplete. At least, maybe in these cases ask the user if they want to autocomplete instead of just doing it. I mean, in some cases the form may be long enough that the autocompleted fields are off screen and the user didn’t even notice they turned yellow.

Don’t replace prefilled values with autocomplete values. I would sincerely appreciate if someone could explain to me the logic behind this one. If a form already has values in the fields, then autocomplete should never overwrite those values. Maybe it’s because I mostly work with HTML forms that are really data editing screens, rather than submission forms, so my perspective is skewed. But the fact that web-based data editing screens are a common thing makes this a valid use case. If a form loads with most or all of the values prefilled, a browser should not presume that it’s OK to autocomplete any of the values.

Check for an indication that the user is already logged in. I suppose there’s a murky security issue here somewhere, but if the browser could detect a cookie with a session ID, or a “log out” link on the page, it would be a way of indicating that the user might already be logged in, and so autocompleting passwords wouldn’t make sense.

Implement support for autocomplete="new-password". Apparently Chrome supports this already, and admittedly it reeks of workaround, but at least it’s something.

With all of these suggestions, I am not saying browsers shouldn’t autocomplete password fields… I’m just suggesting that they look for signs that a password field is not part of a login form, and in those cases, honor autocomplete="off" on password inputs.

The day Facebook performed seppuku

I don’t have much to say about all of this, other than that I would probably, yes, be posting this on Facebook if it were affecting literally anything else in my known realm of existence.

Today Facebook killed itself. But its undead corpse will surely rise again.

The problem is some kind of colossal DNS snafu, which has, for all intents and purposes, temporarily caused facebook.com to cease to exist.

Ah… the air somehow smells fresher today. The water tastes better. The sun shines brighter.

But I know it won’t last.

Anyway… today’s the day it happened. Here’s some more in-depth information from Ars Technica which hopefully will not disappear down the Memory Hole anytime soon.

Update: This Cloudflare blog post probably provides the definitive explanation of what happened.