Troubleshooting a SIGKILL

17 Jul 2017 in ConfigServer, SIGKILL

Recently I was called to investigate a problem where a PHP script stopped working after a random amount of time. It was a long-running script that talked to MySQL, and it was called both from the web server (Apache via suPHP) and from the command line.

To give you some context, the script was badly written, continuing the tradition of how bad PHP can be as a programming language - though of course this is mainly the programmer's fault:

ignore_user_abort(true);    
// the end of time is coming

set_time_limit(0);          
// who cares if the server is fucked up

error_reporting(E_ALL ^ E_DEPRECATED ^ E_NOTICE ^ E_WARNING);
// we don't care about warnings - of course *WE* introduced them

ini_set('display_errors', 'On'); 
// yes, display errors on the production server

mysql_query("SET NAMES = 'greek'");
mysql_query("set character_set_connection=greek");
mysql_query("set character_set_client=greek");
mysql_query('set character set greek');
mysql_query('set character_set_results greek');
mysql_query('SET wait_timeout=28800;');
// because we don't exactly know which one is working,
// just try out all possible permutations

function my_file_get_contents($url)
{
    $filename = $url;
    $handle = fopen($filename, "r");
    $contents = fread($handle, filesize($filename));
    fclose($handle);
    return $contents;
}
// is this code really that old? file_get_contents
// has been in PHP core since 4.3.0

The Linux server was set up (and managed) with cPanel by some hosting company, whose operators did things like this (excerpt from .bash_history):

wall Is any one working on the server
wall <some username>
chmod 777 /on/some/publicly/available/file-from-webserver

Add on top of that the absence of any kind of manual about installed packages, customizations, checklists, policies, security, etc., and things get interesting pretty quickly.

Back to the problem: I ran the script and found nothing about a possible error in the Apache/PHP error logs (of course I had changed error_reporting to E_ALL), so I switched to the command line to find out what was happening:

php the-offending-script
<some output>
Killed

Killed? WTF!

OK, let's look at the kernel messages: did the machine run out of memory? Did the OOM killer kick in? Unfortunately no, everything seemed fine. Is MySQL low on connections? I increased the limit, but that did not help. I watched the memory footprint via memory_get_usage(): nothing suspicious. The next step was to try strace, but again it gave no usable hint.
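For reference, the OOM check can be sketched as a small filter over the kernel log. The helper name oom_lines is hypothetical, and the patterns are only the phrases the Linux OOM killer commonly logs; exact wording varies between kernel versions, so treat this as a rough first pass rather than a definitive test.

```shell
# oom_lines is a hypothetical helper: it keeps only the lines on stdin
# that look like OOM-killer activity (case-insensitive match).
oom_lines() {
    grep -i -E 'out of memory|oom-killer|killed process'
}

# Usage sketch (may need root on some systems):
#   dmesg | oom_lines
# No output means no matching kernel message, which was the case here.
```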

The script was killed with an exit code of 137 (that is, 128 + 9), which means it received the SIGKILL signal. So I raised the user limits or, more precisely, disabled the limits the cPanel software had introduced. Still the script was killed at random points.
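The 128 + 9 arithmetic can be sketched in a few lines of shell. By shell convention, an exit status above 128 usually means "terminated by signal (status - 128)", although a program may also return such a value on its own. The helper name decode_status is hypothetical.

```shell
# decode_status is a hypothetical helper name, not something from the script.
decode_status() {
    status=$1
    if [ "$status" -gt 128 ]; then
        signum=$((status - 128))
        # kill -l <number> prints the signal name without the SIG prefix
        echo "killed by signal $signum (SIG$(kill -l "$signum"))"
    else
        echo "exited on its own with status $status"
    fi
}

decode_status 137   # prints: killed by signal 9 (SIGKILL)
```

In practice you would call it right after the command, e.g. php the-offending-script; decode_status $?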

Confused, I tail'd all of /var/log/*.log, ran the script, and voilà:

Jul 17 XX:XX:XX host lfd[31295]: *User Processing* PID:24845 Kill:1 User:XXX RSS:457(MB) EXE:/usr/local/bin/php CMD:php the-offending-script

This is from a file called /var/log/lfd.log, and it turned out it's part of the ConfigServer package. It kills a process when it exceeds a memory limit, a time limit, or a per-user process count limit - in this case it was the memory limit.
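If you need to find such entries quickly, a line like the one above can be picked apart with sed. This is only a sketch built from that single log line: the helper name lfd_kills is hypothetical, and it assumes that Kill:1 marks the entries where lfd actually killed the process and that the field order is always the same.

```shell
# lfd_kills is a hypothetical helper: it reads lfd log lines on stdin and
# prints pid, user and resident memory for "User Processing" kill entries.
lfd_kills() {
    sed -n 's/.*\*User Processing\* PID:\([0-9]*\) Kill:1 User:\([^ ]*\) RSS:\([0-9]*\)(MB).*/pid=\1 user=\2 rss_mb=\3/p'
}

# The (anonymized) line from /var/log/lfd.log shown above
line='Jul 17 XX:XX:XX host lfd[31295]: *User Processing* PID:24845 Kill:1 User:XXX RSS:457(MB) EXE:/usr/local/bin/php CMD:php the-offending-script'
printf '%s\n' "$line" | lfd_kills   # prints: pid=24845 user=XXX rss_mb=457
```

Against the real file this would be lfd_kills < /var/log/lfd.log.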

The fun part is that the comments about "Process Tracking" in the configuration file at /etc/csf/csf.conf include a warning against enabling this:

Warning: We don't recommend enabling this option unless absolutely necessary as it can cause unexpected problems when processes are suddenly terminated. It can also lead to system processes being terminated which could cause stability issues. It is much better to leave this option disabled and to investigate each case as it is reported when the triggers above are breached

So I just set PT_USERKILL to "0", restarted the lfd daemon via /etc/init.d/lfd restart, and the problem was solved!
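To double-check the value before and after the change, something like the following can read a setting back from the file. It is a sketch that assumes settings are written as NAME = "value", one per line; the helper name csf_setting is hypothetical.

```shell
# csf_setting is a hypothetical helper.
#   $1 = setting name, $2 = config file (defaults to the path mentioned above)
# Assumes lines of the form NAME = "value".
csf_setting() {
    sed -n 's/^'"$1"'[[:space:]]*=[[:space:]]*"\([^"]*\)".*/\1/p' "${2:-/etc/csf/csf.conf}"
}

# Usage sketch:
#   csf_setting PT_USERKILL
# which should print 0 once the setting has been changed as described.
```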

PS: I forgot to tell you that no email address was set in the CSF configuration to receive these warnings. How awesome is that?