Tuesday, August 27, 2013

callbacks, what I said was

Preamble

It seems some explanation is in order. Usually I wouldn't care, but since there's a nagging feeling of obligation to represent Node.js and Mozilla, it needs to be said. The context of my last post/rant was developers wanting to replace the callback style in Node core with something "better", like generators, and how doing so would have dramatic performance implications.

Also, this blog has long been a place for my rants (many were removed when I switched blog hosts for lack of current relevance), and until my last post it had only seen a few hundred visitors over the course of a year. So at 4am I let off some steam, published and hit the sack. Woke up 4 hours later to pee and saw several thousand people had viewed my blog. In my still sleep-deprived state I tried to do some damage control, but whatever. Now I'll try to explain in more depth why the callback system is better for my use case.

First though, a note on who I am and what to expect from this blog. It was a conscious decision to advocate performance through hyperbole. The use of satire like the "Right Way" et al. is intentionally provocative, as a means to aid developer recollection of what they've read here.

Of Callbacks and Closures

Since many readers didn't attempt to read between the lines, I'll try to make this as clear as possible. Function declarations in closures come at a cost, and that cost is very noticeable in hot code paths. Most of what I do lives in hot code paths.

As an example I've created a minimal HTTP server that sits directly on top of the TCP wrapper and just delivers "hello world!". In the first example every callback has been declared within the enclosing closure.

var TCP = process.binding('tcp_wrap').TCP;
var util = require('util');
var headers = 'HTTP/1.1 200 OK\r\n' +
              'Server: TCPTest\r\n' +
              'Content-Type: text/plain; charset=latin-1\r\n' +
              'Content-Length: 12\r\n\r\n' +
              'hello world!';
var data = require('buffer').SlowBuffer(headers.length).fill(headers);

function fail(err, syscall) {
  throw util._errnoException(err, syscall);
}

var server = new TCP();
var err = server.bind('127.0.0.1', 8012);
if (err)
  fail(err, 'bind');

err = server.listen(511);
if (err)
  fail(err, 'listen');

// Wrap every callback up within a closure.
server.onconnection = function onConnection(err, client) {
  if (err)
    fail(err, 'connect');

  client.onread = function onRead(nread, buffer) {
    var writeReq = {
      oncomplete: function afterWrite(err, handle, req) {
        if (err)
          fail(err, 'write');
      }
    };
    if (nread >= 0)
      client.writeBuffer(writeReq, data);
    client.close();
  };

  client.readStart();
};

Side note: all this is running on latest master.

Let's give this thing a run shall we? The following is the median of a dozen runs or so:

$ ./wrk -c 60 -t 6 -d 10 'http://127.0.0.1:8012/'
Running 10s test @ http://127.0.0.1:8012/
  6 threads and 60 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     1.48ms    2.20ms  20.12ms   98.25%
    Req/Sec     8.44k     1.73k   11.00k    81.71%
  475297 requests in 10.00s, 50.31MB read
  Socket errors: connect 0, read 475291, write 0, timeout 0
Requests/sec:  47538.02
Transfer/sec:  5.03MB

48k req/sec. Not bad. Here we're intentionally not using keep-alive because my measurements were geared more towards services where that wouldn't be of much help. But let's make a slight change to our code hierarchy:

// So there's no confusion: this simply replaces the bottom of the
// previous script, not the entire thing.
var writeReq = { oncomplete: afterWrite };

function afterWrite(err, handle, req) {
  if (err)
    fail(err, 'write');
}

function onRead(nread, buffer) {
  if (nread >= 0)
    this.writeBuffer(writeReq, data);
  this.close();
}

server.onconnection = function onConnection(err, client) {
  if (err)
    fail(err, 'connect');

  client.onread = onRead;
  client.readStart();
};

Now running the same test:

$ ./wrk -c 60 -t 6 -d 10 'http://127.0.0.1:8012/'
Running 10s test @ http://127.0.0.1:8012/
  6 threads and 60 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     1.11ms    1.26ms  19.75ms   98.96%
    Req/Sec    10.00k     1.75k   13.00k    69.00%
  566254 requests in 10.00s, 59.94MB read
  Socket errors: connect 0, read 566248, write 0, timeout 0
Requests/sec:  56637.30
Transfer/sec:  6.00MB

Up to 57k req/sec. That's a 19% performance gain just from not wrapping some stupid callbacks in closures. Now you might be saying something about how there are still so many layers of abstraction and functionality to go before we have a fully running HTTP server. You're right, so go ahead and imagine what would happen if at each layer we allowed something like this to slide. Oh wait, you don't have to.

How about we take a look at the same basic thing using the HTTP module:

var http = require('http');
http.createServer(function(req, res) {
  res.writeHead(200, {
    'Content-Type': 'text/plain',
    'Content-Length': 12
  });
  res.end('hello world!');
}).listen(8012);

And the median output:

$ ./wrk -c 60 -t 6 -d 10 'http://127.0.0.1:8012/'
Running 10s test @ http://127.0.0.1:8012/
  6 threads and 60 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     3.37ms  505.73us  14.44ms   96.11%
    Req/Sec     3.19k   243.38     3.78k    89.44%
  178711 requests in 10.00s, 23.52MB read
Requests/sec:  17874.50
Transfer/sec:  2.35MB

Hm, wow. The same basic thing accomplished, except it's 216% slower.

There are a lot of layers that need to go between the basic TCP connection and the fully working HTTP server, but each 20% sacrifice adds up to a lot. Refraining from creating functions within closures is an easy performance win, and doesn't add much, if any, complexity to the project.
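The second example works partly because the TCP wrap invokes onread with the client as its receiver, so a hoisted function can still reach per-connection state through `this` instead of a closure. A minimal sketch of that pattern (the names here are made up for illustration):

```javascript
// A hoisted callback can still reach per-object state through `this`,
// so no closure over the connection is required.
function onData(chunk) {
  this.received += chunk.length;  // per-connection state lives on the object
}

function Connection() {
  this.received = 0;
  this.ondata = onData;           // every connection shares one function
}

var a = new Connection();
var b = new Connection();
a.ondata('hello');
b.ondata('hi');
```

Every connection points at the same function object, so nothing new is allocated per connection.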

Know your Callback Mechanics

As a final clarification for my previous post: the benchmarks were specifically created to exaggerate and amplify the performance penalties developers incur by declaring functions within closures, especially when they don't understand the mechanics of how the callback is stored. Event emitters are more lenient because they store a reference to the callback, but process.nextTick is much less forgiving. Let's take a look back at a revised example of the prime generator:

var SB = require('buffer').SlowBuffer;
var max = 3e4;
(function runTick() {
  process.nextTick(function genPrimes() {
    if (--max < 0)
      return;
    var primes = [];
    var len = (max >>> 3) + 1;
    var sieve = new SB(len);
    sieve.fill(0xff, 0, len);
    var cntr, x, j;
    for (cntr = 0, x = 2; x <= max; ++x) {
      if (sieve[x >>> 3] & (1 << (x & 7))) {
        primes[cntr++] = x;
        for (j = 2 * x; j <= max; j += x)
          sieve[j >>> 3] &= ~(1 << (j & 7));
      }
    }
    runTick();
  });
}());

// $ /usr/bin/time node primegen0.js
// 19.53user 0.00system 0:19.58elapsed 99%CPU (0avgtext+0avgdata 51040maxresident)k
// 0inputs+0outputs (0major+13104minor)pagefaults 0swaps

Thanks to Chris Dickinson for the bit tweaking optimizations!

Now take into account that nextTick will need to recompile the function every time, and swap it out for a much smaller closure:

var SB = require('buffer').SlowBuffer;

function genPrimes(max) {
  if (--max < 0)
    return;
  var primes = [];
  var len = (max >>> 3) + 1;
  var sieve = new SB(len);
  sieve.fill(0xff, 0, len);
  var cntr, x, j;
  for (cntr = 0, x = 2; x <= max; ++x) {
    if (sieve[x >>> 3] & (1 << (x & 7))) {
      primes[cntr++] = x;
      for (j = 2 * x; j <= max; j += x)
        sieve[j >>> 3] &= ~(1 << (j & 7));
    }
  }
}

var max = 3e4;
function runTick() {
  process.nextTick(function() {
    if (--max < 0)
      return;
    genPrimes(max);
    runTick();
  });
}
runTick();

// $ /usr/bin/time node primegen1.js
// 4.59user 0.04system 0:04.64elapsed 99%CPU (0avgtext+0avgdata 50540maxresident)k
// 0inputs+0outputs (0major+12977minor)pagefaults 0swaps

Awesome. 320% faster by using a simple callback wrapping trick. While this may seem outlandish to some users out there, it's not. Delaying potentially heavy processing until later in the event loop via process.nextTick, setImmediate or setTimeout is not uncommon, and each case can suffer from the same performance penalty.

In Conclusion

These use cases are specific to performance concerns within Node core, and are published so anyone can learn from what I've spent time researching. At my talk at NodeConf.eu I'll be addressing more core-specific performance implementations. Afterwards the slides will be posted publicly for your enjoyment.

If you feel the urge to bring up how this makes debugging more difficult, please refrain. For the purposes of this post that's not my concern, and even though I hate domains with a passion, I'm already working on a patch that'll allow users to gain that beloved long stack trace, or whatever they're looking for, to finally quiet the loud few who continually ask for it.

Happy performance hunting, and remember: there's never enough.

Thursday, August 22, 2013

long live the callbacks

Update: Either before or after you read this, also read my follow up.

Been thinking about what to write that won't be made useless within the next month due to upcoming API changes. So today, instead of giving you something useful I'm going to contribute to the so called "callback hell" flame war.

Honestly, I've never understood why people hate callbacks so much. Oh wait, I know. It's because they're Doing it Wrong. Then once they're 6+ indentations deep the realization comes that the code is hard to read. So they blame the callback! Poor little callback. People spit on your very existence because they don't get to know you, and instead marry themselves to the idea of chainability. I can understand that. Did web development with jQuery for years. I prided myself on bending those chains to my will. Then one day I asked, what was I gaining?

So let's get into the first issue I have: declaring functions within closures. Don't do it. Don't even think about it (ok, there are a few cases where it's necessary, but apply sparingly). It makes code difficult to read, and if you give even the tiniest pigeon's poop about performance you'll heed this advice. Just to be sure you understand, let's look at an example:

function Points(x0, y0, x1, y1) {
  this.distance = function distance() {
    var x = x1 - x0;
    var y = y1 - y0;
    return Math.sqrt(x * x + y * y);
  };
}


var iter = 1e6;
var rand = Math.random;
for (var i = 0; i < iter; i++) {
  var p = new Points(rand(), rand(), rand(), rand());
  for (var j = 0; j < 1e3; j++)
    p.distance();
}

This doesn't look so bad, right? WRONG! You've saved yourself needing to assign some variables, but at what cost? Well...

$ /usr/bin/time node points.js 
33.47user 0.09system 0:33.65elapsed 99%CPU (0avgtext+0avgdata 49584maxresident)k
0inputs+0outputs (0major+34036minor)pagefaults 0swaps

Ok, so it took 33 seconds and used right around 50MB memory. Let's make a slight adjustment to the implementation:

function Points(x0, y0, x1, y1) {
  this._points = {
    x0: x0,
    y0: y0,
    x1: x1,
    y1: y1
  };
}

Points.prototype.distance = function distance() {
  var p = this._points;
  var x = p.x1 - p.x0;
  var y = p.y1 - p.y0;
  return Math.sqrt(x * x + y * y);
};

And how did it do?

$ /usr/bin/time node points.js 
1.21user 0.01system 0:01.23elapsed 99%CPU (0avgtext+0avgdata 14224maxresident)k
0inputs+0outputs (0major+3902minor)pagefaults 0swaps

Um. Well... Ok. I actually had to double check my code because I wasn't expecting the difference to be this dramatic. Now it runs in under 2 seconds and only uses 15MB of memory. We're not going to get into an in-depth look at why this is happening. That's for another day, but let me reiterate the point: DON'T DECLARE FUNCTIONS WITHIN CLOSURES! (in performance critical paths)

What does that have to do with callbacks? Simple: don't nest your callback functions. There have been a lot of articles written in the last couple of months about the awesomeness of generators and how horrible callbacks are. Though if you take a look at the benchmarks you'll notice the callbacks are usually fairly nested.

Let's take a look at a very contrived example just to get the point across:

var SB = require('buffer').SlowBuffer;

function runner(cb, arg) {
  process.nextTick(function() {
    cb(arg);
  });
}


var iter = 2e4;

for (var i = 0; i < iter; i++) {
  runner(function genPrimes(max) {
    var primes = [];
    var len = ((max / 8) >>> 0) + 1;
    var sieve = new SB(len);
    sieve.fill(0xff, 0, len);
    var cntr, x, j;
    for (cntr = 0, x = 2; x <= max; x++) {
      if (sieve[(x / 8) >>> 0] & (1 << (x % 8))) {
        primes[cntr++] = x;
        for (j = 2 * x; j <= max; j += x) {
          sieve[(j / 8) >>> 0] &= ~(1 << (j % 8));
        }
      }
    }
    return primes;
  }, i);
}

Side note: I challenge anyone to come up with a faster prime generator in JavaScript.

This style of passing the callback directly into the function is used all over the place, and while it looks innocent enough it can be the death of any hope for performance.

$ /usr/bin/time node genprimes.js 
18.84user 0.02system 0:18.91elapsed 99%CPU (0avgtext+0avgdata 34132maxresident)k
0inputs+0outputs (0major+8896minor)pagefaults 0swaps

18 seconds. Not bad, I guess. All we did was declare genPrimes in the location it's being passed. But let's make the minor adjustment of moving it to just below runner() and see what we get:

$ /usr/bin/time node genprimes.js 
2.48user 0.01system 0:02.50elapsed 99%CPU (0avgtext+0avgdata 30352maxresident)k
0inputs+0outputs (0major+7958minor)pagefaults 0swaps

Awesome. Execution time down to under 3 seconds, and all we had to do was flatten our code a little. So this solves two problems. First, we've gained a massive amount of performance. Second, we're not a dozen indentations deep in our callback structure.

As far as performance goes, I think the argument is empirically pretty simple. Any of these other, more complicated ways of structuring asynchronous callbacks has to use at least this at the core of its execution model, plus whatever overhead the library itself adds. So there's no possible way any other method of managing your callbacks could be faster, and as we've demonstrated, the difference may not be trivial.
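As a sketch of that claim, here's a promise-based wrapper (using today's built-in Promise, with made-up function names) over a plain callback API. The callback hasn't gone anywhere; the promise layer just adds an allocation and another scheduling hop on top of it.

```javascript
// A plain callback API.
function readLater(cb) {
  process.nextTick(function() {
    cb(null, 42);
  });
}

// A promise layer built on top of it: the callback is still at the core.
function readLaterPromise() {
  return new Promise(function(resolve, reject) {
    readLater(function(err, value) {
      if (err) reject(err);
      else resolve(value);
    });
  });
}

readLaterPromise().then(function(value) {
  console.log(value);  // 42
});
```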

In conclusion, suck it up and use callbacks. They're easy to understand and maintain, and you won't be left wondering how much extra your fancy-schmancy way of doing things is costing you. Also, if anyone reading this article plans on doing additional performance analysis of basic callbacks vs whatever else, make sure it's done correctly. Because I will find them, and then I'll publicly mock them.

UPDATE: There seem to be two things people are bringing up:

First, do I understand what I'm measuring? Yes. The point of the benchmarks is to show the difference between what's common practice (creating functions within closures to access variables, and declaring the function where it's being passed) and doing it the Right Way. It has everything to do with how the function is declared. That's the point. My assumption was that people would only think about declaring a function within a closure to access the variables within it. I'll have a follow-up post explaining why these two things affect performance so severely.

Second, is this post really about callbacks? Yes. I realize it may be hard to see, but the point was that callbacks Done Right are easy to understand, just as easy to read as other implementations (i.e. you don't experience indentation hell), and always faster.

Considering I finished this at 4am, and have only gotten 3 hours of sleep before writing this update, I might find this entire post absurd tomorrow. But it's unlikely.

UPDATE2: I forgot that Buffer#fill() only returns the instance on master, so I updated the example to work appropriately on previous versions of Node.