Debugging a segfault in Drupal

– by our senior web developer Chris Thornett

There are few things more intimidating, when coding in a scripting language like PHP, than a Segmentation Fault. That’s memory allocation and system logic and all the headache that scripting languages are meant to spare you from. It’s an error message with no stacktrace, or even line number, because it comes from several layers below the PHP engine. How is a web developer meant to sort it out? Especially on a mammoth codebase like Drupal?

We encountered a segfault recently while working on a Drupal site. Most of us work on linux platforms, but one of our developers uses MAMP. The error cropped up after he upgraded to PHP 5.5.9. All of the site’s admin pages went whitescreen. I checked the apache error log, hoping to find a helpful php error and stacktrace, but instead all I got was:

[sourcecode language=”plain”][notice] child pid 10723 exit signal Segmentation Fault (11)[/sourcecode]

Do not pass Go, do not collect £200.

Normally, I’d try reverting the code back a few versions to see if that solved the problem, and let [customcode type=”shell”]git[/customcode] and [customcode type=”shell”]diff[/customcode] do the work for me. But with a PHP version change, the problem might have been added months ago and only cropped up now. I’d have tried rolling back to the previous PHP version, but frankly I don’t get Macs, and MAMP doesn’t seem to make it easy to roll back to a specific version. I did re-install MAMP, but that didn’t help. I played with php.ini, increasing memory limit, turning off extensions, to no avail.

The curious thing was that other sites – other Drupal sites – running on the same machine worked fine. Even the front-end of this site worked great, it was just the back-end admin pages where we were crashing and burning. I had to roll up my sleeves and dive into the code. Except: this developer doesn’t use an IDE (he’s a CSS/graphics guy primarily) and I’ve no idea which IDE will work smoothly with a PHP debugger. (Normally I use Netbeans with xdebug, which does ok most of the time.) Since I’ve got the Mac-fear, and I’m not sure there isn’t a horrible problem with the .so extensions, I’m not going to tackle that problem today.

So, debugging without a debugger, old-school. Well, given that we don’t mess with Drupal Core, the problem is likely to be in sites/modules or sites/themes. Fortunately, it’s easy enough to turn these off, or even just move/delete the files, and add them back in one by one, to see which one is causing the problem. Except, even with modules and themes removed, the Segfault is still crashing the site. What…?

Ok, time for the blunt approach. Let’s add some debugging code into Core. Simpler is better. Let’s go with this snippet:

[php]echo ‘Got here’; system.exit();[/php]

Crude, but effective. A poor-man’s breakpoint. I start by inserting it in Drupal’s [customcode type=”shell”]index.php[/customcode] file, just before the first function call [customcode type=”functions”]drupal_bootstrap(DRUPAL_BOOTSTRAP_FULL);[/customcode]

Quick visit to mysite.dev/admin, and I get exactly the output you’d expect:

[plain]Got here[/plain]

Move the code to after that function call, before [customcode type=”functions”]menu_execute_active_handler();[/customcode]

Whitescreen. Segfault in the error log. So the error must be happening somewhere within the bootstrap process.
Carefully undoing my changes in index.php, I move onto [customcode]includes/bootstrap.inc[/customcode], where [customcode type=”functions”]drupal_bootstrap()[/customcode] is defined, and follow a similar procedure. Moving on to [customcode type=”functions”]_drupal_bootstrap_full()[/customcode], and then [customcode type=”functions”]module_invoke_all(‘init’);[/customcode]

Ah, this is where it gets tricky. Seriously, these ‘multiplexer’ functions that Drupal (and WordPress, amongst others) use for loose coupling really make it difficult to follow code and analyse problems. At least when you’re working with object oriented code and dependency injection, you can get your IDE to come up with reasonable suggestions for what line of code might be executing next. But that’s a rant for another post.

Anyway, the options here seem to be to either search the codebase for any mention (in the comments, no less) of [customcode type=”functions”]hook_init[/customcode], or to step through the code and analyse variable state. Which I can’t do on the mac, but at least I can do so on a parallel installation on my Ubuntu box. Which gives me 30+ hook functions to analyse.

Eventually I narrow it down to [customcode type=”functions”]update_hook()[/customcode] in the core [customcode]update[/customcode] module.

That makes sense, it’s a module which only runs when viewing admin pages. It calls [customcode type=”functions”]update_requirements()[/customcode], which calls [customcode type=”functions”]update_calculate_project_data()[/customcode], which calls [customcode type=”functions”]update_get_projects()[/customcode], which calls [customcode type=”functions”]system_rebuild_module_data()[/customcode]. Over to the [customcode]system[/customcode] core module. Here [customcode type=”functions”]system_rebuild_module_data()[/customcode] builds a list of all the modules, and runs a bunch of code on each one. And somewhere in that big [customcode type=”keyword”]for[/customcode] loop, it’s crashing.

Ok, so I need a different tool to analyse a loop. There are over a hundred modules in Drupal’s directory structure, even when only looking at core. So instead of a poor-man’s breakpoint, let’s use a poor-man’s watch.

[php]echo $key . ‘<br/>’;
if ($i++ > 10) system.exit();[/php]

([customcode type=”variable”]$key[/customcode] is the loop variable. [customcode type=”variable”]$i[/customcode] is a fresh non-conflicting variable)

Vary that limit number up and down, with binary search, until we find the highest value that doesn’t crash. That should give me the name of the module that’s causing the problem.

Oh. Turns out it wasn’t a module after all, but a Profile. Profiles work much the same way as modules, and the system module is treating it like a module here. But that’s why I couldn’t trace the problem by hiding away all my custom modules. D’oh! Should have thought of that at the beginning and saved myself a few hours of headache. Remove the profile, and the site works fine.

So what was the problem with the profile? Drilling down into the function a bit deeper, [customcode type=”functions”]_system_rebuild_module_data()[/customcode] calls [customcode type=”functions”]drupal_parse_info_file()[/customcode], which calls [customcode type=”functions”]drupal_parse_info_file()[/customcode]. This interprets the [customcode].info[/customcode] file that each module, theme and profile needs to have in Drupal.

It uses a fairly complex Regular Expression to do this. Aha, Regular Expressions – a language-within-a-language, with its own private stack and finite state machine processing power. I can well imagine that the right combination of context-free grammar and arbitrary input string could cause an infinite loop or infinite recursion, leading to memory problems and maybe a Segfault.

So what input are we giving this routine? I look at the [customcode]profile.info[/customcode], and it looks pretty normal, except near the bottom, I find:

[plain]
variables[robotstxt] = ‘#
# robots.txt
#
User-agent: *
Disallow: /includes/
Disallow: /misc/

[/plain]

(many more lines excluded for brevity)

Aha. Does the .info format allow multi-line values? Does it permit the full syntactical language of a robots.txt file? Possibly not. Anyway, removing these lines fixes the problem.

Looks like one of our developers found a clever way to tackle an SEO problem, but didn’t realise the problems it caused for certain versions of the PHP regex engine. Still, that’s easily fixed, there are plenty of different ways of achieving the same thing in Drupal.

So that little problem took several hours to solve, but at least I got to see another part of Drupal’s underbelly that I’ve not looked at before, and that’s always fascinating. There may be better ways of analysing this kind of site failure, but this got there in the end.

Addendum:
With a bit more experimentation, I discovered that it was not the set of characters that mattered, but the length of the string, specifically the length of the value between the quotes in that [customcode]variables[robotstxt] = [/customcode] line.
I put together a little scriptlet to demonstrate the bug, and submitted it to drupal.org. You can follow it at https://drupal.org/node/2220609