Tuesday, July 23, 2013

measuring node performance - part 1

This is the first of some unknown number of blog posts about tracking your Node application's performance. Today we're going to start off with a few simple flags. Though before we begin I'd just like to make sure everyone is familiar with the --v8-options flag. There's all sorts of magic in there waiting to be found. Go ahead, take a quick look. I'll be waiting here for you... Now a lot of these won't be much use until we combine them with other tools, but we'll get to those in future posts. Right now we're going to focus on a few that can give us a quick peek into how our code is doing.
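
If you've never actually looked, it's one command away (pipe it through less or grep, the list is long):

node --v8-options | less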

Let's start off with this tiny snippet of code:

function sumItUp(n) {
  for (var i = n + 1, m = n * 2; i <= m; i += 2)
    n += i;
  return n;
}

function runTest(max) {
  for (var i = 0; ++i < max;)
    var tmp = sumItUp(i);
  return tmp;
}

runTest(53509);
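
For the commands below I'll assume this is saved as deopt.js (my name, use whatever you like):

node --trace-deopt --code-comments deopt.js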

Run that. See anything? What do you mean you don't see anything? Oh, yeah. Here, change 53509 to 53510 and run it again. Now see anything? Good. It should look something like this:

[deoptimizing (DEOPT eager): begin 0x1dfe78f8b109 runTest @7, FP to SP delta: 64]
            ;;; jump table entry 3: deoptimization bailout 7.
  translating runTest => node=40, height=24
    0x7fff7f678860: [top + 64] <- 0x5e3f1c5fe69 ; rdi 0x5e3f1c5fe69 <JS Global Object>
    0x7fff7f678858: [top + 56] <- 53510 ; rcx (smi)
    0x7fff7f678850: [top + 48] <- 0x2f3759173318 ; caller's pc
    0x7fff7f678848: [top + 40] <- 0x7fff7f678888 ; caller's fp
    0x7fff7f678840: [top + 32] <- 0x1dfe78f8b089; context
    0x7fff7f678838: [top + 24] <- 0x1dfe78f8b109; function
    0x7fff7f678830: [top + 16] <- 53509 ; r9 (smi)
    0x7fff7f678828: [top + 8] <- 0x5e3f1c04121 <undefined> ; literal
    0x7fff7f678820: [top + 0] <- 0x1dfe78f8b0c1 ; rdx 0x1dfe78f8b0c1 <JS Function sumItUp (SharedFunctionInfo 0x2911dc22c219)>
  translating sumItUp => node=38, height=16
    0x7fff7f678818: [top + 56] <- 0x5e3f1c5fe69 ; r8 0x5e3f1c5fe69 <JS Global Object>
    0x7fff7f678810: [top + 48] <- 2147409811 ; r11 (smi)
    0x7fff7f678808: [top + 40] <- 0x2f3759173526 ; caller's pc
    0x7fff7f678800: [top + 32] <- 0x7fff7f678848 ; caller's fp
    0x7fff7f6787f8: [top + 24] <- 0x1dfe78f8b089; context
    0x7fff7f6787f0: [top + 16] <- 0x1dfe78f8b0c1; function
    0x7fff7f6787e8: [top + 8] <- 107018 ; rax (smi)
    0x7fff7f6787e0: [top + 0] <- 107018 ; rbx (smi)
[deoptimizing (eager): end 0x1dfe78f8b0c1 runTest @7 => node=38, pc=0x2f37591736df, state=NO_REGISTERS, alignment=no padding, took 0.160 ms]
[removing optimized code for: runTest]

Wow. This looks all cryptic and scary, but there's a lot we can learn. What you're seeing is v8 translating the optimized stack frames back into unoptimized frames, slot by slot. The innermost call sits at the bottom, so let's start there.

First let's pick out the values we understand: 107018 ; rax (smi) is generated by the loop in sumItUp (remember, 53509 * 2 == 107018) and 2147409811 ; r11 (smi) is the return value.

What's that magical smi, you ask? Well I'm glad you did. It stands for "small integer", and while all numbers in JS are technically doubles, one optimization technique v8 uses is detecting when only smis are in use. On my x64 machine a smi is a full 32 bit signed integer, so the largest smi would be 2^31 - 1... 2147483647.

Working our way up, notice 53510 ; rcx (smi) and then the deoptimization. Grab the return value with a simple console.log(runTest(53510)); and the output should be 2147516829. Applying our newfound knowledge of smis: 2147483647 - 2147516829 == -33182, so the result landed 33182 past the largest smi. Oops, overflow. Sometimes this may not be avoidable, but at least it's good to know why code is deoptimizing.
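
If you want to check the math in code, tack this onto the end of the original snippet (SMI_MAX is just my own name for it):

var SMI_MAX = Math.pow(2, 31) - 1;     // 2147483647, largest smi on x64
console.log(runTest(53510) - SMI_MAX); // 33182, past the smi range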

Another useful one is --trace-opt. Using that with our working code it'll output:

[marking 0x1dc7f8f8b011 <JS Function sumItUp (SharedFunctionInfo 0x6815302c219)> for recompilation, reason: small function, ICs with typeinfo: 5/5 (100%)]
[optimizing 0x1dc7f8f8b011 <JS Function sumItUp (SharedFunctionInfo 0x6815302c219)> - took 0.025, 0.114, 0.038 ms]
[marking 0x1dc7f8f8b059 <JS Function runTest (SharedFunctionInfo 0x6815302c2a9)> for recompilation, reason: small function, ICs with typeinfo: 2/2 (100%)]
[optimizing 0x1dc7f8f8b059 <JS Function runTest (SharedFunctionInfo 0x6815302c2a9)> - took 0.044, 0.109, 0.021 ms]

Nice to know and all, but where I find it really advantageous is hunting down problems like the following:

function retSumFn() {
  function sumItUp(n) {
    for (var i = n + 1, m = n * 2; i <= m; i += 2)
      n += i;
    return n;
  }
  return sumItUp;
}

function runTest(max) {
  for (var i = 0; ++i < max;)
    var tmp = retSumFn()(i);
  return tmp;
}

runTest(1e3);

Run this and prepare yourself for a wall of text. See, every call to retSumFn() returns a brand new function, and v8 needs to optimize every one of them. So keep functions at the top level. If your function needs something, pass it in. Here's a very simple modification that greatly reduces the stress on your application:

function sumItUp(n) {
  for (var i = n + 1, m = n * 2; i <= m; i += 2)
    n += i;
  return n;
}

function retSumFn() { return sumItUp; }

function runTest(max) {
  for (var i = 0; ++i < max;)
    var tmp = retSumFn()(i);
  return tmp;
}

runTest(53509);

And finally, the mother of all optimizations: --trace-inlining. Here code isn't just optimized; the resulting assembly is inlined directly into the function making the call. For this to happen the function body must be small. Under 600 characters, actually. Variable types also cannot change. So if you're expecting a number, go ahead and coerce it on entry (e.g. +'0xff' == 255). Also limit the types of arguments that are passed.
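
Here's a tiny sketch of what I mean by coercing on entry (addPrices is just a made-up example):

// coerce up front so the body only ever sees numbers
function addPrices(a, b) {
  a = +a; // e.g. +'0xff' == 255
  b = +b;
  return a + b;
}

addPrices('0xff', 1); // 256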

Let's throw --trace-inlining onto our original code and see the output. It's actually pretty basic: Inlined sumItUp called from runTest. This is great. It means v8 placed the assembly generated from sumItUp directly into runTest. Now we have reached a new height in optimization!

Welcome to the basics of optimization. Here are a couple links with more in-depth material:

Understanding v8
Optimizing for v8

As always, if I screwed up please let me know. I don't want to propagate bad information, but I guess since no one reads this anyway that won't be a problem :)

Wednesday, July 10, 2013

node with threads

Put this together while reviewing someone else's code. It's a simple example of how to use uv_queue_work. I made the example work with Node's v0.10 branch, as master has upgraded to v8 3.20 which introduced non-trivial API changes. Also, this code is academic and non-optimal. I promise your code won't be faster implementing this as-is.

Starting at the bottom of the file, we'll begin by binding a native method to a JavaScript function.

Handle<Value> Run(const Arguments& args) {
  HandleScope scope;
  assert(args[0]->IsFunction());

  js_work* work = new js_work;
  work->req.data = work;
  work->callback = Persistent<Function>::New(args[0].As<Function>());

  // pretty simple, right?
  uv_queue_work(uv_default_loop(), &work->req, run_work, run_callback);
  return Undefined();
}

void Init(Handle<Object> target) {
  HandleScope scope;
  target->Set(String::New("run"),
      FunctionTemplate::New(Run)->GetFunction());
}

NODE_MODULE(basics, Init)

Pretty standard setup for binding a native method to the exports object. In Run() the first thing that happens is making sure the passed argument is a function; it will be invoked from run_callback when the work is complete. Next we set up a small struct containing the necessary information that will propagate through the call. Finally we make the call to uv_queue_work. This will run run_work asynchronously in the thread pool, then call run_callback back on the main thread.

Now let's start defining some of these mystic variables. Starting from the top:

#include <v8.h>
#include <node.h>
#include <node_buffer.h>
#include <assert.h>

#define SIZE 8

using namespace v8;

struct js_work {
  uv_work_t req;
  Persistent<Function> callback;
  char* data;
  size_t len;
};

Referring back to the first section you'll see work->req.data = work;. The uv_work_t has a data field where we can store a void pointer. So by creating this circular reference we'll be able to get back to our js_work struct from the bare uv_work_t* later on.

void run_work(uv_work_t* req) {
  js_work* work = static_cast<js_work*>(req->data);
  char* data = new char[SIZE];
  for (int i = 0; i < SIZE; i++)
    data[i] = 97;
  work->data = data;
  work->len = SIZE;
}

This snippet clarifies why it's useful to store a void pointer to our js_work struct. For this simple example we're just allocating a small chunk of memory and filling it with char code 97 (i.e. 'a'). Then we're able to assign it back to the struct we created earlier. Keep in mind run_work executes on a thread pool thread, so it must never touch v8.

Awesome. We've created a thread and had it execute some work. Time to wrap it all up:

void run_callback(uv_work_t* req, int status) {
  HandleScope scope;

  js_work* work = static_cast<js_work*>(req->data);
  char* data = work->data;
  node::Buffer* buf = node::Buffer::New(data, work->len);

  Handle<Value> argv[1] = { buf->handle_ };

  // proper way to reenter the js world
  node::MakeCallback(Context::GetCurrent()->Global(),
                     work->callback,
                     1,
                     argv);

  // properly cleanup, or death by millions of tiny leaks
  work->callback.Dispose();
  work->callback.Clear();
  // unfortunately in v0.10 Buffer::New(char*, size_t) makes a copy
  // and we don't have the Buffer::Use() api yet
  delete[] data;
  delete work;
}

Note: This is a sketch, not implementation-ready. In the case that node::MakeCallback() throws, data wouldn't be freed.

To start we unravel our data and make a Buffer out of it. In v0.10 we're still stuck with the old way of doing things, so it's not too pretty, but it'll get the job done. Afterwards we create an array of Values that'll be passed to the callback.

Now finally, time to call our callback! MakeCallback is the proper way to reenter js-land from an asynchronous native method. We pass the global object (any object would do, actually), the callback function, then the argument count and array. Last thing necessary is some proper cleanup. Don't forget to wash behind the ears!

To complete the whole thing, here's the script I used to test:

var basics = require('./build/Release/basics');

function runMe(buf) {
  console.log(buf.toString());
}

basics.run(runMe);

console.log('called');

// output:
// called
// aaaaaaaa
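
If you want to build it yourself, here's roughly the binding.gyp I'd pair with it (assuming the native code above is saved as basics.cc):

{
  'targets': [{
    'target_name': 'basics',
    'sources': [ 'basics.cc' ]
  }]
}

Then node-gyp configure build drops the addon at ./build/Release/basics.node, which is exactly what the require() above resolves.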

Simple tutorial done. Though be warned, today was the first time I actually did this. So trolls may be lurking.

Tuesday, July 9, 2013

child process multiple file descriptors

Not sure why I never put these together, but a recent post on a GitHub issue from Ben demonstrated how simple it is to use file descriptors to pipe data between parent/child processes:

var child_process = require('child_process');

// null means to use the default setting (pipe) for that fd
var options = { stdio: [null, null, null, 'pipe'] };
var args = [ /* ... */ ];
var proc = child_process.spawn(cmd, args, options);
var pipe = proc.stdio[3];
pipe.write(Buffer('awesome'));

// now set up your child process like so
var net = require('net');

var pipe = new net.Socket({ fd: 3 });
pipe.on('data', function(buf) {
  // do whatever
});

Great. Now we can pipe data back and forth without colliding with the output of standard I/O.
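
And to see it end to end, here's a minimal runnable sketch (parent.js and child.js are my assumed filenames):

// parent.js
var child_process = require('child_process');

var proc = child_process.spawn(process.execPath, ['child.js'], {
  stdio: ['inherit', 'inherit', 'inherit', 'pipe']
});

var pipe = proc.stdio[3];
pipe.write(Buffer('awesome'));
pipe.end(); // close our side so both processes can exit cleanly

// child.js
var net = require('net');

var pipe = new net.Socket({ fd: 3 });
pipe.on('data', function(buf) {
  console.log('parent says: ' + buf.toString());
});

Run node parent.js and the child prints whatever came down fd 3.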