Friday, February 6, 2015

AsyncWrap Tutorial - An Introduction

Some recent changes have removed an API people were looking forward to, and added another that isn't documented but does expose the underlying hooks, so anyone can implement their own tracing or profiling API.

Because there is no official documentation on the new internal API, I've written this short tutorial on how to use AsyncWrap for fun and profit. I'd encourage everyone willing to try it out to give feedback on what more they'd like to see. Instead of exposing a complex tracing API, I'd rather give you the ability to create your own.

Please note that these features are experimental and may change. While I don't expect any existing APIs to break in the future, there are definitely features that will be added. Also note that the functionality covered here only works with Node v0.12.1 and io.js after v1.1.1.

Prepping the World

The first thing we have to do is tell AsyncWrap which init, before and after callbacks we want fired. Here's the boilerplate to get everything set up:

// Grab the JS hooks to AsyncWrap.
var async_wrap = process.binding('async_wrap');

// Setup needed variables and callbacks.
var kCallInitHook = 0;
var asyncHooksObject = {};

function asyncInit() {
  // This property must be set to an Object.
  this._asyncQueue = {};
}

function asyncBefore() { }

function asyncAfter() { }

// Set all the necessary callbacks.
async_wrap.setupHooks(asyncHooksObject, asyncInit, asyncBefore, asyncAfter);

If you're like me you will have taken the above and run with it before reading on, probably adding some logging to the callbacks to see when they get called. You'd have also noticed that nothing was happening.

There's actually one more important line that needs to be added, but I left it out because you need to understand how to use it before attempting to log from any of the async callbacks. Here's a continuation of the script:

// Tell AsyncWrap we want callbacks to be called.
asyncHooksObject[kCallInitHook] = 1;

This tells AsyncWrap that we want to start listening for asynchronous events. And it can just as easily be turned off by setting it equal to zero.

Now, do not try to log information from any of the callbacks just yet, unless you want the application to abort and core dump all over the place. Before logging, AsyncWrap should be told to stop listening for anything; otherwise it may get caught in an infinite loop. Here's a partial re-implementation of the script:

function asyncInit() {
  this._asyncQueue = {};

  // Cache the current setting.
  var was_listening = asyncHooksObject[kCallInitHook];
  // Disable listening for any calls to AsyncWrap.
  asyncHooksObject[kCallInitHook] = 0;

  // Safely make other calls.
  console.log('in asyncInit');

  // Restore the previous setting.
  asyncHooksObject[kCallInitHook] = was_listening;
}

So why is it important to cache the previous state of kCallInitHook? Because if a handle or request was created while AsyncWrap was active, the callbacks will always run for it. In other words, kCallInitHook only affects whether the init callback is called, and if init did run then the before/after callbacks must also run.
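
Here's a minimal sketch of that gating in action, assuming the setupHooks() boilerplate above has already run and something is listening on port 8000:

var net = require('net');

// Created while the flag is off: init never fires for this handle,
// so its before/after callbacks never fire either.
asyncHooksObject[kCallInitHook] = 0;
var untraced = net.connect(8000);

// Created while the flag is on: init fires for this handle.
asyncHooksObject[kCallInitHook] = 1;
var traced = net.connect(8000);

// Turning the flag back off does not stop traced's before/after
// callbacks from running, because its init callback already ran.
asyncHooksObject[kCallInitHook] = 0;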

Propagating Information

So, we have this powerful API at our fingertips, but how do we leverage it? Let's start with a simple example to see the order in which things are triggered:

var async_wrap = process.binding('async_wrap');
var uid = 0;

var kCallInitHook = 0;
var asyncHooksObject = {};

function asyncInit() {
  this._asyncQueue = { uid: ++uid };
  // Internal mechanism to write directly to stderr.
  process._rawDebug('init    ' + this._asyncQueue.uid);
}

function asyncBefore() {
  process._rawDebug('before: ' + this._asyncQueue.uid);
}

function asyncAfter() {
  process._rawDebug('after:  ' + this._asyncQueue.uid);
  process.exit();
}

async_wrap.setupHooks(asyncHooksObject, asyncInit, asyncBefore, asyncAfter);
asyncHooksObject[kCallInitHook] = 1;

var net = require('net');
var s = net.createServer();

s.on('connection', function onConnection(c) {
  c.end('bye\n');
});

s.listen(8000, function() {
  // Don't want to trace the req making the connection, only the
  // req received by the server.
  asyncHooksObject[kCallInitHook] = 0;

  // Create a connection for logging, then exit app.
  net.connect(8000, function() { });
});

// Output:
// init    1
// before: 1
// init    2
// after:  1

The important point to pay attention to is that the init call for 2 was made in between the before and after calls for 1. This allows us to set up and tear down state for any newly created async requests, and it will come in handy in a moment when we build a very simple long stack trace script.

AsyncWrap in Action

The best way I can teach is by example, so let's write a simple script that creates long stack traces for requests.

var async_wrap = process.binding('async_wrap');

var kCallInitHook = 0;
var asyncHooksObject = {};
// Keep track of our current stack.
var currentStack = [];

function asyncInit() {
  // Generate a new stack for this call site.
  var stack = {};
  Error.captureStackTrace(stack, asyncInit);

  // Add a copy of the current stack plus new stack to propagate.
  var newStack = currentStack.slice(0);
  newStack.push(stack);
  this._asyncQueue = { stack: stack, stack_arr: newStack };
}

function asyncBefore() {
  // Add this stack to the current stack.
  currentStack.push(this._asyncQueue.stack);
}

function asyncAfter() {
  // Remove this stack.
  currentStack.pop();
}

async_wrap.setupHooks(asyncHooksObject, asyncInit, asyncBefore, asyncAfter);
asyncHooksObject[kCallInitHook] = 1;


var net = require('net');
var s = net.createServer();

s.on('connection', function onConnection(c) {
  c.end('bye\n', function() {
    // Print long stack and exit app.
    var stack_arr = this._handle._asyncQueue.stack_arr;
    for (var i = stack_arr.length - 1; i >= 0; i--)
      console.log(i + '\n' + stack_arr[i].stack.substr(6));
    process.exit();
  });
});

s.listen(8000, function() {
  asyncHooksObject[kCallInitHook] = 0;
  net.connect(8000, function() { });
});

The output of this is rather interesting:

1

0
    at createTCP (net.js:32:10)
    at exports._createServerHandle (net.js:1077:14)
    at Server._listen2 (net.js:1121:14)
    at listen (net.js:1178:10)
    at Server.listen (net.js:1265:5)
    at Object.<anonymous> (/tmp/async-wrap-post.js:46:3)
    at Module._compile (module.js:446:26)
    at Object.Module._extensions..js (module.js:464:10)
    at Module.load (module.js:341:32)
    at Function.Module._load (module.js:296:12)

The space below 1 is empty, meaning the incoming connection had no stack trace. This makes sense because it was created from the native layer, with no specific call from JS. So, at the very least, we'll be able to find the reason the connection was made in the first place.

Future Stuff

There are several key features that will be included in the future.

First is having proper support for timers. Currently timers are stored in a linked list in JavaScript, so C++ isn't notified of every timer created.
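
Here's a rough sketch of what that means in practice, assuming the logging hooks from the first example are active. This is my reading of the current timer implementation, so treat it as illustrative:

// Timers with the same timeout share one underlying handle, so at
// most one init call shows up for the pair; the second setTimeout
// never reaches C++ on its own.
setTimeout(function a() {}, 100);
setTimeout(function b() {}, 100);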

Second is error tracing. This was a feature included in async listener, but was removed for expediency of getting the patch landed. With this you'll be able to be notified of any errors that happen within the stack of an asynchronous event you are watching. The error will bubble from the origin point through normal handling mechanisms, and will allow the error to be captured and handled just before the 'uncaughtException' event.

Third is subsystem filters. Instead of being notified for all events, you'll be able to filter which subsystems trigger the init callback (e.g. TCP, PIPE).

Fourth is the ability to add hooks from the native layer (e.g. C++) so as to further reduce overhead of collecting metrics.

I look forward to working with the community to make Node's internal tracing and profiling hooks as robust as possible. Thanks for your help.

Friday, August 22, 2014

Intercepting the HTTP Data Stream

Word of warning: This post only works with Node v0.11+, and a few things here are not official API but based on implementation details. Even so, I figured this would be a worthwhile post for some.

Callback in the Middle Attack

What we'll be doing is hijacking the data stream between being received by the TCP socket and being processed by the HTTP parser. This allows manipulation of the data with no module the wiser. Let's get started:

var http = require('http');

// First start by adding a property to the Socket prototype that
// will be used to store the old data event callback(s).
require('net').Socket.prototype._data_fn_ = undefined;

var server = http.createServer().listen(8000);

server.on('connection', onConnection);

function onConnection(conn) {
  // The use of ._events is a complete implementation detail of
  // the EventEmitter.
  conn._data_fn_ = conn._events.data;

  // Now replace the data event with our own. This is to ensure
  // the custom data handler always runs first.
  conn._events.data = interceptData;
}

function interceptData(chunk) {
  // This is where the magic will happen, and be explained later.

  // After the magic, call the data event callback(s). For ease of
  // the example we'll assume there was only one data event callback.
  this._data_fn_(chunk);
}

server.on('request', onRequest);

function onRequest(req, res) {
  res.end('bye\n');
}

Running the above should produce a normally running HTTP server, though now there's an entry point into the data stream. The function interceptData() is the first place data is received by the user, even before it reaches the HTTP parser. Take note: trying to do this on the readable event will fail. This is an implementation detail; there is an ever so slight difference between data pushed immediately to the user via the data event and data buffered to be read by the user later.

Take special notice of the fact that we're storing the old 'data' event callback and running it directly after the data stream manipulation. The reason for this is twofold. First, it is the surest way to guarantee interceptData() runs first. Second, in the case of an error the full stack trace is available, whereas simply adding another event listener would hide the fact that the data stream has been manipulated.
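
For contrast, here's the naive approach, which would not work. Listeners run in registration order, so the parser would receive the original chunk before our handler ever sees it:

// Appended after http's own data listener, so it runs too late to
// manipulate what the parser sees.
conn.on('data', interceptData);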

Exploiting the Data Stream

When data is received, the amount of data read is accumulated on the socket's bytesRead property. If we change the length of the data then the HTTP parser may stall or botch the parsing, having expected more or less data. Luckily getting past this is trivial, and once the countermeasure is in place we can begin to manipulate the data however we want.

function interceptData(chunk) {
  // Roll back how much data was actually read.
  this.bytesRead -= chunk.length;

  // Make a change to the data stream.
  chunk = manipulateData(chunk);

  // Now add back the actual length propagated.
  this.bytesRead += chunk.length;

  this._data_fn_(chunk);
}

An important point to keep in mind is that this example oversimplifies a key issue: all the data may not arrive in a single TCP packet. That means it will be up to you to buffer data accordingly until all the necessary data has arrived.
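
Here's a rough sketch of what that buffering might look like for the header case, as a revised version of interceptData() from above. The _pending_ property is my own addition, nothing official:

function interceptData(chunk) {
  // Roll back the counter for every chunk we swallow.
  this.bytesRead -= chunk.length;

  // Accumulate chunks on the socket until the blank line that
  // terminates the headers has arrived.
  this._pending_ = this._pending_ === undefined ?
      chunk : Buffer.concat([this._pending_, chunk]);
  if (this._pending_.toString('binary').indexOf('\r\n\r\n') === -1)
    return;

  var manipulated = manipulateData(this._pending_);
  this._pending_ = undefined;

  // Add back the length actually propagated, then release the data.
  this.bytesRead += manipulated.length;
  this._data_fn_(manipulated);
}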

For a quick example, let's remove the header customHeader:

function manipulateData(chunk) {
  var arr = chunk.toString().split(/\r\n/);
  var customHeader;
  for (var i = 0; i < arr.length; i++)
    if (arr[i].indexOf('customHeader:') === 0)
      customHeader = arr.splice(i, 1);
  // Return a new Buffer that doesn't contain customHeader.
  return new Buffer(arr.join('\r\n'));
}

Fairly straightforward really. The basic template here should allow for any type of data manipulation to the http module. Though keep in mind, I do not recommend this technique. I feel the need to say that since I'm a core maintainer and should never publicly advocate using anything other than the "official" API. More important is that multiple modules don't attempt to do this in the same application; if for no other reason, the implications of each one stepping on the other's toes could lead to unpredictable results that are very difficult to track down. But, in light of those two things, it could be very helpful in cases like wanting to pre-parse headers before the http module has a chance.

I haven't thought of an appropriate module-type API to encapsulate this functionality. If you have any ideas, please feel free to share in the comments and maybe I'll make one (unofficial of course ;-). There would also be a way to make this backwards compatible with v0.10, though that would require a substantially deeper hack. But that's one thing I love about code: anything is possible as long as you know how.
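
To kick off that discussion, here's one hypothetical shape such a module could take. The interceptHttp() name and its fn argument are made up for illustration; this just packages the connection hook from above:

// Hypothetical wrapper around the technique in this post; not an
// existing API. fn receives each chunk and returns the replacement.
function interceptHttp(server, fn) {
  server.on('connection', function(conn) {
    conn._data_fn_ = conn._events.data;
    conn._events.data = function(chunk) {
      conn.bytesRead -= chunk.length;
      chunk = fn(chunk);
      conn.bytesRead += chunk.length;
      conn._data_fn_(chunk);
    };
  });
  return server;
}

// Usage:
// interceptHttp(http.createServer(onRequest), manipulateData).listen(8000);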

Thursday, June 5, 2014

From Mozilla to NodeSource

Today I have officially handed in my resignation to Mozilla and will be starting work at NodeSource on June 16.

The decision to move was difficult. Dan Shaw can attest to that by the fact he'd taken every opportunity these last six months to bring me over. During that time I'd told him straight out that it would never happen. Working at Mozilla was about as good as any job could be. I mean, come on, Mozilla paid me to work on Node.js as a core maintainer without any of the corporate oversight that could have required me to focus my efforts on the company's needs over my own desires as a maintainer to improve Node. Doesn't get much better than that for someone who enjoys developing open source software. Because of Mozilla I've been able to dive into many of the gritty aspects of core code and make them better just for the sake of having better code.

What eventually changed my mind? First, I have to give Dan credit for meeting my many requirements. Even to the point that he personally called my old manager Mark Mayo and discussed how I could continue to be closely affiliated with Mozilla after the transition. I realize that with how open Mozilla is it wouldn't be a problem for me to continue contributing, but there is some difference. Like being able to walk a few desks down and speak with the Rust devs vs. chatting on IRC.

After feeling confident in being able to continue contributing back to the community just as openly as I have, and continue my work as core maintainer the same as usual, the decision to make the move felt more appropriate.

At NodeSource I'll have greater opportunity to work with companies that are using Node in the wild. These new interactions will give additional insight on how to focus my efforts as a core maintainer. My mantra is "make all the things fast", but with limited development hours it helps to know what to prioritize. Similarly I'm hopeful it will spark more creative madness to take Node to new levels of performance awesomeness.

NodeSource has coalesced an awesome group of talent. They'll definitely help me help the community, and to be honest, the community will get more out of me than before. Simply because I have additional assistance to assemble my mess of thoughts and code into something legible. So, you can thank NodeSource for future blog posts on how to make your Node apps faster.

So, here I jump from one awesome company to another. Never found myself a sentimental person. Irrational, rageful and sometimes impertinent sure. But it's always been about the code. Though in this case I'll make an exception (or maybe it's the Yo-Yo Ma instrumental piece I'm listening to while writing this) and say this is both a sad and exciting time in my career.

Wednesday, February 5, 2014

Node.js, ES6 and me

On January 31, 2014 my second son was born. That evening, while my wife and newborn were getting some sleep, I decided to jump on my phone and try to respond to the never ending mound of issues that seem to pile up. One of which was about enabling the --harmony option by default. Unfortunately one of my posts was misinterpreted by some. I will gladly take the blame for being simultaneously overbroad in my description of features accepted into Node core, and nonspecific about what core means.

Though bear in mind, I never thought my comment would be taken so far as to lead some to believe Node would actually fork v8, or some such, to prevent ES6 features from being available in Node. Especially since we had already done for one feature what this PR suggested we do for several others, and since just prior to that paragraph I had discussed my conditions for allowing --harmony to be enabled by default; it would have been a contradiction to then say it would never be allowed.

So first I'd like to clarify that I improperly referred to the entire ES6 spec. In that post I was specifically referring to iterators, generators and promises. Second I'd like to state for the record that I've never had the intention of preventing any Node developer from using new features coming to JavaScript, or from preventing these features from being exposed to the user if enabled in v8 by default. Last clarification is that the reference to "core" meant core code.

My Node Future

Some of you are probably thinking, "Why wouldn't he want to save us from callback hell and integrate a generator/promise based API?" The reasons are simple and pragmatic. First, the key thing Node does (integrating JavaScript with the v8 C++ bindings/libuv) can only be done via callbacks. Second, I want to remove abstraction and allow users to interface their code as close to the action as possible.

The community has proven two things. First is that none of us agree on how an API should be written. Second is that there are very talented developers that will push Node to the limits. My hopeful goal for Node is to introduce a standardized minimal API that sits very close to these C++ bindings. From there the community can create whatever interface they want with as close to zero unnecessary overhead as possible.

I want Node to be tremendously good at one thing. Being the kernel of a web platform for app developers who wish to write in JavaScript. That means writing an API that allows minimal abstraction and as close to zero overhead as possible. So why use the callback style? Because currently it's the only way to interface with the v8 C++ bindings. When the v8 team introduces an API for generators or promises I will give it honest consideration.

Moving at Node Speed

Some days I like to write little apps that sit directly on top of the internal C++ API (Max, promise I haven't forgotten about the router app). Every time I'm reminded of how much performance is left on the table due to abstraction.

Imagine you want to saturate a 100Gb pipe with TCP traffic. At 64KB a packet, that means you'll be sending at least 204,800 packets a second, which also means something needs to be executed at least 204,800 times per second to send that data. Small amounts of extra abstraction start to make a big difference in performance at this level.

The v8 team has done some amazing stuff to improve performance in key areas that are important to Node (I doubt they had Node in mind when making those changes, but I'll take it anyways). Calls between C++ and JavaScript, for example, have become extremely cheap (we're talking double digit nanosecond cheap). I'm working on new performance patterns that can hopefully be integrated down the road, allowing techniques like using JavaScript to chain a series of C++ callbacks together (useful for high performance transforms).

There is still so much more to be done, but if we're really going to make Node fast and keep it lightweight then it must be done using a specific subset of the JavaScript specification. Just how far can we push Node? Let me just say you ain't seen nothing yet.

Tuesday, August 27, 2013

callbacks, what I said was

Preamble

It seems some explanation is in order. Usually I wouldn't care but, since there's a nagging feeling of obligation to represent Node.js and Mozilla, it needs to be said. The context of my last post/rant was developers wanting to replace the callback style in Node core with something "better" like generators, and how doing so would have dramatic performance implications.

Also, this blog has long been a place for my rants (many were removed when I switched blog hosts for lack of current relevance), and until my last post had only seen a few hundred visitors over the course of a year. So at 4am I let off some steam, published and hit the sack. Woke up 4 hours later to pee and saw several thousand people had viewed my blog. In my still sleep deprived state I tried to do some damage control, but whatever. Now I'll try to explain more in depth why the callback system is better for my use case.

First though, a note on who I am and what to expect from this blog. It was a conscious decision to advocate performance through hyperbole. The use of satire like the "Right Way" et al. is intentionally provocative as a means to aid developer recollection of what they've read here.

Of Callbacks and Closures

Since many readers didn't attempt to read between the lines, I'll try to make this as clear as possible. Function declarations in closures come at a cost, and this is very expensive in hot code paths. Most of what I do lives in hot code paths.

As an example I've created a minimal HTTP server that sits directly on top of the TCP wrapper and just delivers "hello world!". In the first example everything has been placed inside the onConnection callback.

var TCP = process.binding('tcp_wrap').TCP;
var util = require('util');
var headers = 'HTTP/1.1 200 OK\r\n' +
              'Server: TCPTest\r\n' +
              'Content-Type: text/plain; charset=latin-1\r\n' +
              'Content-Length: 12\r\n\r\n' +
              'hello world!';
var data = require('buffer').SlowBuffer(headers.length).fill(headers);

function fail(err, syscall) {
  throw util._errnoException(err, syscall);
}

var server = new TCP();
var err = server.bind('127.0.0.1', 8012);
if (err)
  fail(err, 'bind');

err = server.listen(511);
if (err)
  fail(err, 'listen');

// Wrap every callback up within a closure.
server.onconnection = function onConnection(err, client) {
  if (err)
    fail(err, 'connect');

  client.onread = function onRead(nread, buffer) {
    var writeReq = {
      oncomplete: function afterWrite(err, handle, req) {
        if (err)
          fail(err, 'write');
      }
    };
    if (nread >= 0)
      client.writeBuffer(writeReq, data);
    client.close();
  };

  client.readStart();
};

Side note: all this is running off latest master.

Let's give this thing a run, shall we? The following is the median of a dozen runs or so:

$ ./wrk -c 60 -t 6 -d 10 'http://127.0.0.1:8012/'
Running 10s test @ http://127.0.0.1:8012/
  6 threads and 60 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     1.48ms    2.20ms  20.12ms   98.25%
    Req/Sec     8.44k     1.73k   11.00k    81.71%
  475297 requests in 10.00s, 50.31MB read
  Socket errors: connect 0, read 475291, write 0, timeout 0
Requests/sec:  47538.02
Transfer/sec:  5.03MB

48k req/sec. Not bad. Here we're intentionally not using keep-alive because my measurements were geared more towards services where that would not be of much help. But let's make a slight change to our code hierarchy:

// So there's no confusion: this simply replaces the bottom of the
// previous script, not the entire thing.
var writeReq = { oncomplete: afterWrite };

function afterWrite(err, handle, req) {
  if (err)
    fail(err, 'write');
}

function onRead(nread, buffer) {
  if (nread >= 0)
    this.writeBuffer(writeReq, data);
  this.close();
}

server.onconnection = function onConnection(err, client) {
  if (err)
    fail(err, 'connect');

  client.onread = onRead;
  client.readStart();
};

Now running the same test:

$ ./wrk -c 60 -t 6 -d 10 'http://127.0.0.1:8012/'
Running 10s test @ http://127.0.0.1:8012/
  6 threads and 60 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     1.11ms    1.26ms  19.75ms   98.96%
    Req/Sec    10.00k     1.75k   13.00k    69.00%
  566254 requests in 10.00s, 59.94MB read
  Socket errors: connect 0, read 566248, write 0, timeout 0
Requests/sec:  56637.30
Transfer/sec:  6.00MB

Up to 57k req/sec. That's a 19% performance gain just from not wrapping some stupid callbacks in closures. Now you might be saying something about how there are still so many layers of abstraction and functionality to go before we have a fully running HTTP server. You're right, so go ahead and imagine what would happen if at each layer we allowed something like this to slide. Oh wait, you don't have to.

How about we take a look at the same basic thing using the HTTP module:

var http = require('http');
http.createServer(function(req, res) {
  res.writeHead(200, {
    'Content-Type': 'text/plain',
    'Content-Length': 12
  });
  res.end('hello world!');
}).listen(8012);

And the median output:

$ ./wrk -c 60 -t 6 -d 10 'http://127.0.0.1:8012/'
Running 10s test @ http://127.0.0.1:8012/
  6 threads and 60 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     3.37ms  505.73us  14.44ms   96.11%
    Req/Sec     3.19k   243.38     3.78k    89.44%
  178711 requests in 10.00s, 23.52MB read
Requests/sec:  17874.50
Transfer/sec:  2.35MB

Hm, wow. So, the same basic thing accomplished, except it's 216% slower.

There are a lot of layers that need to go between the basic TCP connection and the fully working HTTP server, but each 20% sacrifice adds up to a lot. Refraining from creating functions within closures is an easy performance win, and doesn't add much, if any, complexity to the project.

Know your Callback Mechanics

As a final clarification for my previous post: the benchmarks were specifically created to exacerbate and augment the performance penalties developers pay for declaring functions within closures, especially when they don't understand the mechanics of how the callback is stored. Event emitters are more lenient because they store a reference to the callback, but process.nextTick is much less forgiving. Let's take a look back at a revised example of the prime generator:

var SB = require('buffer').SlowBuffer;
var max = 3e4;
(function runTick() {
  process.nextTick(function genPrimes() {
    if (--max < 0)
      return;
    var primes = [];
    // Pack one bit per candidate number and start with all bits set.
    var len = (max >>> 3) + 1;
    var sieve = new SB(len);
    sieve.fill(0xff, 0, len);
    var cntr, x, j;
    for (cntr = 0, x = 2; x <= max; ++x) {
      // If x's bit is still set, x is prime.
      if (sieve[x >>> 3] & (1 << (x & 7))) {
        primes[cntr++] = x;
        // Clear the bit of every multiple of x.
        for(j = 2 * x; j <= max; j += x)
          sieve[j >>> 3] &= ~(1 << (j & 7));
      }
    }
    runTick();
  });
}());

// $ /usr/bin/time node primegen0.js
// 19.53user 0.00system 0:19.58elapsed 99%CPU (0avgtext+0avgdata 51040maxresident)k
// 0inputs+0outputs (0major+13104minor)pagefaults 0swaps

Thanks to Chris Dickinson for the bit tweaking optimizations!

Taking into account that nextTick will need to recompile the function every time, let's swap it out with a much smaller closure:

var SB = require('buffer').SlowBuffer;

function genPrimes(max) {
  if (--max < 0)
    return;
  var primes = [];
  var len = (max >>> 3) + 1;
  var sieve = new SB(len);
  sieve.fill(0xff, 0, len);
  var cntr, x, j;
  for (cntr = 0, x = 2; x <= max; ++x) {
    if (sieve[x >>> 3] & (1 << (x & 7))) {
      primes[cntr++] = x
      for(j = 2 * x; j <= max; j += x)
        sieve[j >>> 3] &= ~(1 << (j & 7))
    }
  }
}

var max = 3e4;
function runTick() {
  process.nextTick(function() {
    if (--max < 0)
      return;
    genPrimes(max);
    runTick();
  });
}
runTick();

// $ /usr/bin/time node primegen1.js
// 4.59user 0.04system 0:04.64elapsed 99%CPU (0avgtext+0avgdata 50540maxresident)k
// 0inputs+0outputs (0major+12977minor)pagefaults 0swaps

Awesome. 320% faster just by hoisting the heavy lifting out of the constantly recreated closure. While this may seem outlandish to some users out there, it's not. Delaying potentially heavy processing until later in the event loop via process.nextTick, setImmediate or setTimeout is not uncommon, and each case can suffer from the same performance penalty.
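
The same fix applies to each of them. Here's a minimal sketch of the pattern with setImmediate, where heavyLifting() stands in for whatever work you're deferring:

// Recreated: a new function object is created on every call.
function runLaterSlow() {
  setImmediate(function() { heavyLifting(); });
}

// Hoisted: one function object, reused for every call.
function onImmediate() {
  heavyLifting();
}
function runLaterFast() {
  setImmediate(onImmediate);
}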

In Conclusion

These use cases are specific to performance concerns within Node core, and are published so anyone can learn from what I've spent time researching. At my talk at NodeConf.eu I'll be addressing more core-specific performance implementations. Afterwards the slides will be posted publicly for your enjoyment.

If you feel the urge to bring up how this makes debugging more difficult, please refrain. For the purposes of this post that's not my concern, and even though I hate domains with a passion I'm already working on a patch that'll allow users to gain that beloved long stack trace, or whatever they're looking for, to finally quiet the loud few that continually ask for it.

Happy performance hunting, and remember: there's never enough.