Friday, August 22, 2014

Intercepting the HTTP Data Stream

Word of warning: This post only works with Node v0.11 and later, and a few things here are not official API but rely on implementation details. Even so, I figured this would be a worthwhile post for some.

Callback in the Middle Attack

What we'll be doing is hijacking the data stream after it is received by the TCP socket and before it is processed by the HTTP parser. This allows manipulation of the data with no module the wiser. Let's get started:

var http = require('http');

// First start by adding a property to the Socket prototype that
// will be used to store the old data event callback(s).
require('net').Socket.prototype._data_fn_ = undefined;

var server = http.createServer().listen(8000);

server.on('connection', onConnection);

function onConnection(conn) {
  // The use of ._events is a complete implementation detail of
  // the EventEmitter.
  conn._data_fn_ = conn._events.data;

  // Now replace the data event with our own. This is to ensure
  // the custom data handler always runs first.
  conn._events.data = interceptData;
}

function interceptData(chunk) {
  // This is where the magic will happen, and be explained later.

  // After the magic, call the data event callback(s). For ease of
  // the example we'll assume there was only one data event callback.
  this._data_fn_(chunk);
}

server.on('request', onRequest);

function onRequest(req, res) {
  res.end('bye\n');
}

Running the above should produce a normally running HTTP server, but now there's an entry point into the data stream. The function interceptData() is the first place the user receives data, even before it reaches the HTTP parser. Take note that trying to do this on the 'readable' event will fail. This comes down to an implementation detail: pushing data to the user immediately via the 'data' event works ever so slightly differently from buffering the data for the user to read.

Take special notice of the fact that we're storing the old 'data' event callback and running it directly after the data stream manipulation. The reason for this is twofold. First, it is the surest way to guarantee interceptData() runs first. Second, in the case of an error the full stack trace is available, whereas simply adding another listener would hide the fact that the data stream has been manipulated.
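The examples in this post assume a single 'data' listener. As a further implementation detail, `_events.data` holds a plain function when there is one listener and an array of functions when there are several. Here is a minimal sketch of a helper that copes with both shapes. The name `callStored` is hypothetical (not part of Node), and the demonstration uses a plain EventEmitter instead of a real socket:

```javascript
var EventEmitter = require('events').EventEmitter;

// Hypothetical helper: invoke whatever was saved from `_events.data`.
// Assumes the saved value is either one function or an array of them.
function callStored(ctx, stored, chunk) {
  if (typeof stored === 'function')
    return stored.call(ctx, chunk);
  if (Array.isArray(stored)) {
    // Copy first so a listener removing itself can't disturb the loop.
    var list = stored.slice();
    for (var i = 0; i < list.length; i++)
      list[i].call(ctx, chunk);
  }
}

// Demonstration with two 'data' listeners on a plain EventEmitter.
var emitter = new EventEmitter();
var seen = [];
emitter.on('data', function(c) { seen.push('a:' + c); });
emitter.on('data', function(c) { seen.push('b:' + c); });

// Save the existing listeners, then put the interceptor in their place.
var stored = emitter._events.data;
emitter._events.data = function(chunk) {
  seen.push('intercept');
  callStored(this, stored, chunk);
};

emitter.emit('data', 'x');
```

The same caveat applies as everywhere else in this post: this leans on how `_events` happens to be laid out, so treat it as a sketch and not something to rely on.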

Exploiting the Data Stream

When data is received, the amount of data read is accumulated on the bytesRead property. If we change the length of the data then the HTTP parser may stall or botch the parsing, having expected more or less data. Luckily getting past this is trivial, and once the countermeasure is in place we can begin to manipulate the data however we want.

function interceptData(chunk) {
  // Roll back how much data was actually read.
  this.bytesRead -= chunk.length;

  // Make a change to the data stream.
  chunk = manipulateData(chunk);

  // Now add back the actual length propagated.
  this.bytesRead += chunk.length;

  this._data_fn_(chunk);
}

An important point to keep in mind is that this example oversimplifies a key issue: all the data may not arrive in a single TCP packet. It will be up to you to buffer data accordingly until all the necessary data has arrived.
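To make the buffering point concrete, here is a rough sketch that holds on to chunks until the blank line ending the header block ("\r\n\r\n") has arrived. The names `findHeaderEnd` and `HeaderCollector` are hypothetical. It only handles the first header block on a connection, and it leaves out the bytesRead bookkeeping and keep-alive handling. `Buffer.from` is used so the sketch runs on current Node releases; on older runtimes `new Buffer(...)` would take its place.

```javascript
// Hypothetical helper: return the index just past the first "\r\n\r\n"
// in a Buffer, or -1 if the header block is not complete yet.
function findHeaderEnd(buf) {
  for (var i = 0; i + 3 < buf.length; i++) {
    if (buf[i] === 13 && buf[i + 1] === 10 &&
        buf[i + 2] === 13 && buf[i + 3] === 10)
      return i + 4;
  }
  return -1;
}

// Hypothetical collector: accumulates chunks across 'data' events.
function HeaderCollector() {
  this.chunks = [];
}

// Returns one concatenated Buffer once the headers are complete,
// otherwise null to signal that more data is needed.
HeaderCollector.prototype.push = function(chunk) {
  this.chunks.push(chunk);
  var all = Buffer.concat(this.chunks);
  if (findHeaderEnd(all) === -1)
    return null;
  this.chunks = [];
  return all;
};

// Demonstration: a request split across two chunks.
var collector = new HeaderCollector();
var first = collector.push(Buffer.from('GET / HTTP/1.1\r\nHost: exa'));
var second = collector.push(Buffer.from('mple.com\r\n\r\n'));
```

In interceptData() the idea would be to return early while push() yields null, and only manipulate and forward the data once a complete Buffer comes back.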

For a quick example, let's remove the header customHeader:

function manipulateData(chunk) {
  var arr = chunk.toString().split(/\r\n/);
  var customHeader;
  // Iterate backwards so splice() doesn't cause the next line to be skipped.
  for (var i = arr.length - 1; i >= 0; i--) {
    if (arr[i].indexOf('customHeader:') === 0)
      customHeader = arr.splice(i, 1)[0];
  }
  // Return new Buffer that doesn't contain customHeader.
  return new Buffer(arr.join('\r\n'));
}
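One thing manipulateData() glosses over is that it converts the whole chunk to a string, which can mangle a binary request body that arrives in the same chunk. Below is a sketch of a variant that only touches the header block and passes the remaining bytes through untouched. The name `removeHeader` is hypothetical, the match is case-insensitive because HTTP header names are, and `Buffer.from` is used so it runs on current Node releases (on older runtimes `new Buffer(...)` would take its place).

```javascript
// Hypothetical variant of manipulateData(): strips every header line
// whose name matches `name`, leaving any body bytes as they were.
function removeHeader(chunk, name) {
  // Locate the "\r\n\r\n" that terminates the header block.
  var end = -1;
  for (var i = 0; i + 3 < chunk.length; i++) {
    if (chunk[i] === 13 && chunk[i + 1] === 10 &&
        chunk[i + 2] === 13 && chunk[i + 3] === 10) {
      end = i;
      break;
    }
  }
  // Headers incomplete in this chunk: leave it alone.
  if (end === -1)
    return chunk;

  var lines = chunk.slice(0, end).toString().split('\r\n');
  var prefix = name.toLowerCase() + ':';
  var kept = lines.filter(function(line, idx) {
    // Index 0 is the request line; always keep it.
    return idx === 0 || line.toLowerCase().indexOf(prefix) !== 0;
  });

  // Re-attach the terminator and body exactly as received.
  return Buffer.concat([Buffer.from(kept.join('\r\n')), chunk.slice(end)]);
}

// Demonstration: a request with a small binary body in the same chunk.
var input = Buffer.concat([
  Buffer.from('POST / HTTP/1.1\r\ncustomHeader: 1\r\nHost: a\r\n\r\n'),
  Buffer.from([0xff, 0x00, 0xfe])
]);
var output = removeHeader(input, 'customHeader');
```

Since the body is passed through unchanged, a Content-Length header stays valid; only the bytesRead adjustment shown earlier is needed.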

Fairly straightforward really. The basic template here should allow for any type of manipulation of the data going into the http module. Keep in mind, though, that I do not recommend this technique. I feel the need to say that since I'm a core maintainer and should never publicly advocate using anything other than the "official" API. More importantly, multiple modules should not attempt to do this in the same application. If for no other reason, each one stepping on the other's toes could lead to unpredictable results that are very difficult to track down. Despite those two caveats, it could be very helpful in cases like wanting to pre-parse headers before the http module has a chance.

I haven't thought of an appropriate module-type API to encapsulate this functionality. If you have any ideas, please feel free to share in the comments and maybe I'll make one (unofficial of course ;-). There is also a way to make this backwards compatible with v0.10, though that would require a substantially deeper hack. But that's one thing I love about code: anything is possible as long as you know how.