Tuesday, August 27, 2013

callbacks, what I said was

Preamble

It seems some explanation is in order. Usually I wouldn't care but, since there's a nagging feeling of obligation to represent Node.js and Mozilla, it needs to be said. The context of my last post/rant was about developers wanting to replace the callback style in Node core with something "better" like generators, and how doing so would have dramatic performance implications.

Also, this blog has long been a place for my rants (many were removed when I switched blog hosts for lack of current relevance), and until my last post it had only seen a few hundred visitors over the course of a year. So at 4am I let off some steam, published and hit the sack. Woke up 4 hours later to pee and saw several thousand people had viewed my blog. In my still sleep-deprived state I tried to do some damage control, but whatever. Now I'll try to explain more in depth why the callback system is better for my use case.

First though, a note on who I am and what to expect from this blog. It was a conscious decision to advocate performance through hyperbole. The use of satire like the "Right Way" et al. is intentionally provocative, as a means to help developers remember what they've read here.

Of Callbacks and Closures

Since many readers didn't attempt to read between the lines, I'll try to make this as clear as possible. Declaring functions inside closures comes at a cost, and that cost is very expensive in hot code paths. Most of what I do lives in hot code paths.
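To make that concrete before the benchmarks, here's a minimal sketch of the two patterns side by side (the handler names here are made up for illustration, not taken from core):

```javascript
// Pattern this post argues against: a brand new callback function
// object is created inside the closure for every connection.
function handleConnectionSlow(socket) {
  socket.onread = function(nread, buffer) {
    // new function object allocated per connection
  };
}

// Pattern this post argues for: declare the callback once at module
// scope and reuse the same function object for every connection.
function onRead(nread, buffer) {
  // `this` is the socket, so no closure over `socket` is needed
}

function handleConnectionFast(socket) {
  socket.onread = onRead;
}
```

In the slow version every connection pays for a fresh function object; in the fast version every connection shares one.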

As an example I've created a minimal HTTP server that sits directly on top of the TCP wrapper and just delivers "hello world!" In the first example everything has been placed inside the onconnection callback.

var TCP = process.binding('tcp_wrap').TCP;
var util = require('util');
var headers = 'HTTP/1.1 200 OK\r\n' +
              'Server: TCPTest\r\n' +
              'Content-Type: text/plain; charset=latin-1\r\n' +
              'Content-Length: 12\r\n\r\n' +
              'hello world!';
var data = require('buffer').SlowBuffer(headers.length).fill(headers);

function fail(err, syscall) {
  throw util._errnoException(err, syscall);
}

var server = new TCP();
var err = server.bind('127.0.0.1', 8012);
if (err)
  fail(err, 'bind');

err = server.listen(511);
if (err)
  fail(err, 'listen');

// Wrap every callback up within a closure.
server.onconnection = function onConnection(err, client) {
  if (err)
    fail(err, 'connect');

  client.onread = function onRead(nread, buffer) {
    var writeReq = {
      oncomplete: function afterWrite(err, handle, req) {
        if (err)
          fail(err, 'write');
      }
    };
    if (nread >= 0)
      client.writeBuffer(writeReq, data);
    client.close();
  };

  client.readStart();
};

Side note: all of this is running off latest master.

Let's give this thing a run shall we? The following is the median of a dozen runs or so:

$ ./wrk -c 60 -t 6 -d 10 'http://127.0.0.1:8012/'
Running 10s test @ http://127.0.0.1:8012/
  6 threads and 60 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     1.48ms    2.20ms  20.12ms   98.25%
    Req/Sec     8.44k     1.73k   11.00k    81.71%
  475297 requests in 10.00s, 50.31MB read
  Socket errors: connect 0, read 475291, write 0, timeout 0
Requests/sec:  47538.02
Transfer/sec:  5.03MB

48k req/sec. Not bad. Here we're intentionally not using keep-alive, because my measurements are geared more towards services where that wouldn't be of much help. But let's make a slight change to our code hierarchy:

// So there's no confusion: this simply replaces the bottom of the
// previous script, not the entire thing.
var writeReq = { oncomplete: afterWrite };

function afterWrite(err, handle, req) {
  if (err)
    fail(err, 'write');
}

function onRead(nread, buffer) {
  if (nread >= 0)
    this.writeBuffer(writeReq, data);
  this.close();
}

server.onconnection = function onConnection(err, client) {
  if (err)
    fail(err, 'connect');

  client.onread = onRead;
  client.readStart();
};

Now running the same test:

$ ./wrk -c 60 -t 6 -d 10 'http://127.0.0.1:8012/'
Running 10s test @ http://127.0.0.1:8012/
  6 threads and 60 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     1.11ms    1.26ms  19.75ms   98.96%
    Req/Sec    10.00k     1.75k   13.00k    69.00%
  566254 requests in 10.00s, 59.94MB read
  Socket errors: connect 0, read 566248, write 0, timeout 0
Requests/sec:  56637.30
Transfer/sec:  6.00MB

Up to 57k req/sec. That's a 19% performance gain just for some stupid callbacks wrapped in closures. Now you might be saying something about how there are still so many layers of abstraction and functionality to go before we have a fully running HTTP server. You're right, and go ahead and imagine what would happen if at each layer we allowed something like this to slide. Oh wait, you don't have to.

How about we take a look at the same basic thing using the HTTP module:

var http = require('http');
http.createServer(function(req, res) {
  res.writeHead(200, {
    'Content-Type': 'text/plain',
    'Content-Length': 12
  });
  res.end('hello world!');
}).listen(8012);

And the median output:

$ ./wrk -c 60 -t 6 -d 10 'http://127.0.0.1:8012/'
Running 10s test @ http://127.0.0.1:8012/
  6 threads and 60 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     3.37ms  505.73us  14.44ms   96.11%
    Req/Sec     3.19k   243.38     3.78k    89.44%
  178711 requests in 10.00s, 23.52MB read
Requests/sec:  17874.50
Transfer/sec:  2.35MB

Hm, wow. So, same basic thing accomplished. Except it's 216% slower: the raw TCP version pushes more than three times the requests per second.

There are a lot of layers that need to go between the basic TCP connection and the fully working HTTP server, but each 20% sacrifice adds up to a lot. Refraining from creating functions within closures is an easy performance win, and doesn't add much, if any, complexity to the project.
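If you want to check the effect on your own build, a rough micro-benchmark harness along these lines will do. This is entirely illustrative; the exact numbers depend on your V8 version and what it decides to optimize:

```javascript
// Compare reusing one hoisted callback against allocating a fresh
// closure per iteration. This only sketches the measurement approach;
// results will vary between V8 versions.
function hoistedCb(n) {
  return n | 0;
}

function runHoisted(iters) {
  var acc = 0;
  for (var i = 0; i < iters; i++)
    acc += hoistedCb(i);
  return acc;
}

function runClosures(iters) {
  var acc = 0;
  for (var i = 0; i < iters; i++) {
    var cb = function(n) { return n | 0; };  // fresh allocation each pass
    acc += cb(i);
  }
  return acc;
}

function time(fn, iters) {
  var t0 = process.hrtime();
  fn(iters);
  var dt = process.hrtime(t0);
  return dt[0] * 1e9 + dt[1];  // elapsed nanoseconds
}

console.log('hoisted :', time(runHoisted, 1e7), 'ns');
console.log('closures:', time(runClosures, 1e7), 'ns');
```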

Know your Callback Mechanics

As a final clarification for my previous post: the benchmarks were specifically created to exaggerate and accentuate the performance penalties developers pay for declaring functions within closures, especially when they don't understand the mechanics of how the callback is stored. Event emitters are more lenient because they store a reference to the callback, but process.nextTick is much less forgiving. Let's take a look back at a revised example of the prime generator:

var SB = require('buffer').SlowBuffer;
var max = 3e4;
(function runTick() {
  process.nextTick(function genPrimes() {
    if (--max < 0)
      return;
    var primes = [];
    var len = (max >>> 3) + 1;
    var sieve = new SB(len);
    sieve.fill(0xff, 0, len);
    var cntr, x, j;
    for (cntr = 0, x = 2; x <= max; ++x) {
      if (sieve[x >>> 3] & (1 << (x & 7))) {
        primes[cntr++] = x;
        for (j = 2 * x; j <= max; j += x)
          sieve[j >>> 3] &= ~(1 << (j & 7));
      }
    }
    runTick();
  });
}());

// $ /usr/bin/time node primegen0.js
// 19.53user 0.00system 0:19.58elapsed 99%CPU (0avgtext+0avgdata 51040maxresident)k
// 0inputs+0outputs (0major+13104minor)pagefaults 0swaps

Thanks to Chris Dickinson for the bit tweaking optimizations!

And taking into account that nextTick will be handed a newly created function object every time, let's swap it out for a much smaller closure:

var SB = require('buffer').SlowBuffer;

function genPrimes(max) {
  var primes = [];
  var len = (max >>> 3) + 1;
  var sieve = new SB(len);
  sieve.fill(0xff, 0, len);
  var cntr, x, j;
  for (cntr = 0, x = 2; x <= max; ++x) {
    if (sieve[x >>> 3] & (1 << (x & 7))) {
      primes[cntr++] = x;
      for (j = 2 * x; j <= max; j += x)
        sieve[j >>> 3] &= ~(1 << (j & 7));
    }
  }
}

var max = 3e4;
function runTick() {
  process.nextTick(function() {
    if (--max < 0)
      return;
    genPrimes(max);
    runTick();
  });
}
runTick();

// $ /usr/bin/time node primegen1.js
// 4.59user 0.04system 0:04.64elapsed 99%CPU (0avgtext+0avgdata 50540maxresident)k
// 0inputs+0outputs (0major+12977minor)pagefaults 0swaps

Awesome. Around 320% faster (19.53s down to 4.59s) just by using a simple callback wrapping trick. While this may seem outlandish to some users out there, it's not. Delaying possibly heavy processing until later in the event loop via process.nextTick, setImmediate or setTimeout is not uncommon, and each case can suffer from the same performance penalty.
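The same fix applies when deferring with setImmediate or setTimeout. A minimal sketch (the function names here are illustrative, not from core):

```javascript
// Hoist both the deferred work and the scheduling callback into named,
// module-level functions so each deferral reuses the same function
// objects instead of allocating fresh closures.
function heavyWork(n) {
  var sum = 0;
  for (var i = 0; i < n; i++)
    sum += i;
  return sum;
}

var pending = 3;

function deferWork() {
  if (--pending < 0)
    return;
  heavyWork(1e6);
  setImmediate(deferWork);  // same function object every time
}

deferWork();
```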

In Conclusion

These use cases are specific to performance concerns within Node core, and they're published so anyone can learn from what I've spent time researching. In my talk at NodeConf.eu I'll be addressing more core-specific performance implementations. Afterwards the slides will be posted publicly for your enjoyment.

If you feel the urge to bring up how this makes debugging more difficult, please refrain. For the purposes of this post that's not my concern, and even though I hate domains with a passion, I'm already working on a patch that'll allow users to get that beloved long stack trace, or whatever they're looking for, to finally satisfy the loud few who continually ask for it.

Happy performance hunting, and remember: there's never enough.
