Kazuho's Weblog: December 2014

Friday, December 26, 2014

[Ann] Initial release of H2O, and why HTTPD performance will matter in 2015

Happy Holidays!

Today I am delighted to announce the first release of H2O, version 0.9.0; this is a christmas gift from me.

github.com/h2o/h2o

H2O is an optimized HTTP server with support for HTTP/1.x and the upcoming HTTP/2; it can be used either as a standalone server or a library.

Built around PicoHTTPParser (a very efficient HTTP/1 parser), H2O outperforms Nginx by a considerable margin. It also excels in HTTP/2 performance.

Why do we need a new HTTP server? The answer is because its performance does matter in the coming years.

It is expected that the number of files being served by the HTTP server will dramatically increase as we transit from HTTP/1 to HTTP/2.

This is because current techniques used to decrease the number of asset files (e.g. CSS sprites and CSS concatenation) becomes a drag in page rendering speed in HTTP/2. Such techniques were beneficial in HTTP/1 since the protocol had difficulty in utilizing all the available bandwidth. But in HTTP/2 the issue is fixed, and the overhead of transmitting all the images / CSS styles used by the website at once while only some of them is needed to render a specific page, becomes a bad idea. Instead, switching back to sending small asset files for every required element consisting the webpage being request becomes an ideal approach.

Having an efficient HTTP/1 server is also a good thing, as we large-scale adopt the idea of Microservices; it increases the number of HTTP requests transmitted within the datacenter.

As shown in the benchmark charts, H2O is designed with these facts in mind, making it (as we believe) an ideal choice of HTTP server of the future.

With this first release, H2O is concentrates on serving static files / working as a reverse proxy at high performance.

Together with the contributors I will continue to optimize / add more features to the server, and hopefully reach a stable release (version 1.0.0) when HTTP/2 becomes standardized in the coming months.

Stay tuned.

PS. It is also great that the tools developed by H2O is causing other effects; not only have we raised the bar on HTTP/2 server performance (nghttp2 (a de-facto reference implementation of HTTP/2) has become much faster in recent months), the performance race of HTTP/1 parser has once again become (Performance improvement and benchmark by indutny · Pull Request #200 · joyent/http-parser, Improving PicoHTTPParser further with AVX2), @imasahiro is working on merging qrintf (a preprocessor that speeds up the sprinf(3) family by a magnitude developed as a subproduct of H2O) to Clang. Using H2O as a footstep, I am looking forward to bringing in new approaches for running / maintaining websites next year.

Monday, December 22, 2014

URL パーサにおける IPv6 対応

プログラマにとって IPv6 対応といえば、C言語のような低レベルな言語の話であって、ホスト名を文字列で扱うスクリプト言語には関係ないと考えがちです。ですが、実際には、文字列の取り扱いにおいても対応が必要になるところがあります。

その代表例が URL のパーサです。多くのエンジニアがイメージする URL の書式は「protocol://host:port/path#fragment」です。この書式を見れば、「:」と「/」を用いてトークンを分割するのが自然であり、RFC を参照せずに記述された URL パーサは、host の中に「:」が含まれないことを前提としていることがあります。あるいは、/[0-9A-Za-z_-\.]+/ のような正規表現を使って、ホスト名または IPv4 アドレスをチェックしている場合も多いのではないでしょうか。

ところが、IPv6 アドレスの文字列表現には「:」が含まれるため、IPv6 のアドレスを含む URL については、

http://[::FFFF:129.144.52.38]:80/index.html

のように、「[]」で囲むこと、と規定されています^注1。つまり、「http://」のあとの最初の「:」をポート番号の始まりだと解釈する URL パーサは、IPv6 アドレスを正しく取り扱えないのです。

今後、IPv6 を利用する機会が増えるにともない、このような IPv6 アドレスを含んだ URL を取り扱えないパーサによるバグに直面する可能性も出てくるでしょう。

年末の大掃除がてら、自分が使っているURLパーサの挙動を確認・修正するのも良いかもしれませんね^注2。

注1: 参照 RFC 3986
注2: なにを言いたかったというと、H2Oの URL パーサで IPv6 対応するのを忘れていたってことですね！！！！！　というわけでこの記事は H2O Advent Calendar 2014 の一部です。てへぺろ

Friday, December 19, 2014

Memory Management in H2O

This blogpost (as part of the H2O Advent Calendar 2014) provides a high-level overview of the memory management functions in H2O that can be categorized into four groups.

h2o_mem_alloc, h2o_mem_realloc

They are wrappers of malloc(3) / realloc(3), that calls abort(3) if memory allocation fails. The returned chunks should be freed by calling free(3).

h2o_mem_init_pool, h2o_mem_clear_pool, h2o_mem_alloc_pool

The functions create, clear, and allocate from a memory pool. The term memory pool has several meanings, but in case of H2O the term has been borrowed from Apache; it refers to a memory allocator that frees all associated chunks at once when the destructor (h2o_mem_clear_pool) is being called.

The primary use-case of the functions is to allocate memory that relates to a HTTP request. The request object h2o_req_t has a memory pool associated to it; small chunks of memory that need to be allocated while handling a request should be obtained by calling h2o_mem_alloc_pool instead of h2o_mem_alloc, since the former is generally faster than the latter.

h2o_mem_alloc_shared, h2o_mem_link_shared, h2o_mem_addref_shared, h2o_mem_release_shared

They are the functions to handle ref-counted chunks of memory. Eeach shared chunk has its own dispose callback that gets called when the reference counter reaches zero. A chunk can be optionally associated to a memory pool, so that the reference counter gets decremented when the pool gets flushed.

The functions are used for handling things like headers transferred via HTTP/2, or to for associating a resource that needs a custom dispose callback to a HTTP request through the use of the memory pool.

h2o_buffer_init, h2o_buffer_dispose, h2o_buffer_reserve, h2o_buffer_consume, h2o_buffer_link_to_pool

The functions provide access to buffer, that can hold any length of octets. They internally use malloc(3) / realloc(3) for handling short buffers, and switch to using temporary-file-backed mmap(2) when the length of the buffer reaches a predefined threshold (default: 32MB). A buffer can also be associated to memory pool by calling the h2o_buffer_link_to_pool function.

The primary use-case of the buffer is to store incoming HTTP requests and POST contents (as it can be used to hold huge chunks on 64-bit systems since it switches to temporary-file-backed memory as described).

h2o_vector_reserve

The function reserves given number of slots for H2O_VECTOR which is a variable length array of an arbitrary type of data. Either h2o_mem_realloc or the memory pool can be used as the underlying memory allocator (in the former case, the allocated memory should be manually freed by the caller). The structure is initialized by zero-filling it.

The vector is used everywhere, from storing a list of HTTP headers to a list of configuration directives.

For details, please refer to their doc-comment and the definitions in include/h2o/memory.h and lib/memory.c.

Tuesday, December 16, 2014

GitHub で submodule ではなく subtree を使うべき理由

GitHub には、タグを打つとソースパッケージを自動的にリリースするという機能があります。スクリプト言語においては、それぞれの言語について一般的なパッケージ管理システム^注1があるため、この機能を使うことが少ないかと思いますが、デファクトのパッケージ管理システムが存在しないC等の言語で書かれたプログラムや、単独で動作する管理用のスクリプトを GitHub で開発・配布する際には、本機能はとても便利なものです。

しかし、この機能は git-archive コマンドのラッパーとして実装されているため、サブモジュールのファイルが含まれないという問題を抱えています。この点は GitHub の人たちも認識しているものの、今のところ GitHub で独自に対応するということは考えていないようです^注2。

私がこの問題を知ることになったのは、picojson の issue で指摘を受けたからです。picojson については問題が「テストが動かない」という程度なので後回しにしても良かったのですが、H2O についても同様の問題が発生することが目に見えていました。

そこでどうするか、irc で相談、実験した結果、サブモジュールのかわりにサブツリーを使えば、参照先のファイルについても git-archive の結果に含めることが可能であることがわかり、picojson についてはサブツリーへの移行を完了しました。

ツールの仕様に引っ張られてやり方を変えるという、ある意味しょうもない話なのですが、H2O についても今後リリースまでにサブツリーへの切り替えを行おうと考えています。

※本記事も H2O Advent Calendar 2014 の一部です。

注1: たとえば Perl については CPAN、JavaScript については NPM が存在する
注2: 参照: » Github zip doesn’t include Submodules Academic Technology Group Developers Blog のコメント

Monday, December 15, 2014

PicoHTTPParser now has a chunked-encoding decoder

Today I have added phr_decode_chunked - a function for decoding chunked-encoded input - to picohttpparser.

As suggested in the doc-comment of the function (shown below), the function is designed to decode the data in-place. In other words, it is not copy-less.

/* the function rewrites the buffer given as (buf, bufsz) removing the chunked-
 * encoding headers. When the function returns without an error, bufsz is
 * updated to the length of the decoded data available. Applications should
 * repeatedly call the function while it returns -2 (incomplete) every time
 * supplying newly arrived data. If the end of the chunked-encoded data is
 * found, the function returns a non-negative number indicating the number of
 * octets left undecoded at the tail of the supplied buffer. Returns -1 on
 * error.
 */
ssize_t phr_decode_chunked(struct phr_chunked_decoder *decoder, char *buf,
                           size_t *bufsz);

It is intentionally designed as such.

Consider a input like the following. The example is more than 2MB long even though it contains only 2 bytes of data. The input is conformant to the HTTP/1.1 specification since it does not define the maximum length of the chunked extensions, requires every conforming implementation to ignore unknown extensions.

1 very-very-veery long extension that lasts ...(snip) 1MB
a
1 very-very-veery long extension that lasts ...(snip) 1MB
a

To handle such input without getting the memory exhausted, a decoder should either a) only preserve the decoded data (requires a copy), or b) limit the size of the chunked-encoded data.

B might have been easier to implement, but such a feature might be difficult to administer. So I decided to take the route a, and for simplicity implemented the decoder to always adjust the position of the data in-place.

Always calling memmove for adjusting the position might induce some overhead, but I assume it to be negligible for two reasons: both the source and destination would exist in the CPU cache / the overhead of unaligned memory access is small on recent Intel CPU.

For ease-of-use, I have added examples to the README.

Saturday, December 13, 2014

C言語で可変長引数をとる関数を、型安全に書く方法

C言語の可変長引数は、型安全でない（まちがった型の引数を渡してもコンパイルエラーにならない）とされています。これは言語仕様の理解としては正しいのですが、特定の型の引数を任意の個数とる関数に限っては、マクロを使うことで型安全性を確保することができます。

任意の個数のdoubleを引数にとり、その和を返す関数「sumf」を例にあげて説明します。

C言語の可変長引数機構を使ってsumfを定義すると、以下のようになります。

#include <math.h>
#include <stdarg.h>
#include <stdio.h>

static double sumf(double nfirst, ...)
{
  double r = 0, n;
  va_list args;

  va_start(args, nfirst);
  for (n = nfirst; ! isnan(n); n = va_arg(args, double))
    r += n;
  va_end(args);

  return r;
}

int main(int argc, char **argv)
{
  printf("%f\n", sumf(NAN)); /* => 0 */
  printf("%f\n", sumf(1., NAN)); /* => 1 */
  printf("%f\n", sumf(1., 2.5, 3., NAN)); /* => 6.5 */
  return 0;
}

が、この定義には「NANを終端に使っているがために、NANを引数として渡すことができない（＝終端を表す値が必要になる）」「型安全でない」という２点の問題があります。後者については、たとえば、sumf(1, 1, NAN)のように、うっかりdouble型以外の引数を渡してしまってもコンパイルエラーにならず、ただ結果がおかしくなったりコアダンプしたりすることになります^注1。

では、どのようにsumfを定義すれば良いのでしょう。答えを書いてしまうと、こんな感じです。

#include <stdio.h>

#define sumf(...)                                       \
  _sumf(                                                \
    (double[]){ __VA_ARGS__ },                          \
    sizeof((double[]){ __VA_ARGS__ }) / sizeof(double)  \
  )

static double _sumf(double *list, size_t count)
{
  double r = 0;
  size_t i;

  for (i = 0; i != count; ++i)
    r += list[i];

  return r;
}

int main(int argc, char **argv)
{
  printf("%f\n", sumf()); /* => 0 */^注2
  printf("%f\n", sumf(1.)); /* => 1 */
  printf("%f\n", sumf(1., 2.5, 3)); /* => 6.5 */
  return 0;
}

この定義では、可変長の引数群をマクロを用いてインラインで配列として初期化し、かつ、その要素数をsizeof演算子を用いて計算しています。そのため、C言語標準の可変長引数機構を使った場合の問題はいずれも発生しません。要素数が_sumf関数に引数countとして渡されるため、終端を表す特殊な値は必要になりませんし、また、実引数はdouble型の配列として呼出側で構築されるため、誤った型の引数を渡してしまうとコンパイルエラーになります。あるいは、たとえばint型の値を渡してしまった場合は、コンパイラによってdouble型に昇格することになるからです。

私たちが開発しているHTTPサーバ「H2O」では、この手法を用いて、型安全な文字列結合関数h2o_concatを定義、使用しています。

以上、H2Oで使っているC言語の小ネタ紹介でした。

※この記事はH2O Advent Calendar 2014の一部です。

注1: 手元の環境だと、sumf(1, 1, NAN)の結果は1となります
注2: 可変長マクロに対して０個の引数を渡すのはC99の規格には違反しますが、GCCやClangは問題なく処理します

Monday, December 8, 2014

64bit時代のバッファ処理

プログラミングの「常識」は時代とともに変化します。そのひとつが、サーバプログラムにおけるバッファ処理です。

1990年代後半から2010年頃までは、メモリ空間の大きさ（32bitすなわち4GB^注1）を超える大きさのファイルを扱う時代でした。このため、httpdなどのサーバプログラムにおいても、入出力データをいったんテンポラリファイルとしてバッファリングする必要がありました。ですが、ファイルI/Oはメモリアクセスと比べると低速です。このため、小さなサイズのデータについてはメモリアクセスする一方で、大きなサイズのデータについてはファイルI/Oを用いる、という煩雑なコードを書く必要がありました。

しかし、2014年も暮れとなる今、サーバサイドにおいては64bit環境のみを考えれば良い時代に入りつつあります。

もちろん、64bit環境といったところで、64bit空間の全てをユーザプロセスが使えるわけではありません。現行のx86-64 CPUがサポートする論理アドレス空間は48bit（256TB相当）であり、多くのOSではその上位アドレス半分がカーネル空間に割り当てられ、残った128TBがユーザプロセスで使用可能な空間となっています^注2。

128TBものメモリ空間があれば、全てのテンポラリデータを「メモリ空間にマッピングして」使用することができます。

実際に、H2Oのバッファ処理では、64MBを超えるものについてはテンポラリファイルをftruncate/mmapすることで領域を確保^注3し、これに対してメモリアクセスを行うようになっています。詳しくはh2o_buffer_reserve関数の定義をご覧ください。

バッファを利用する側のコードにおいては、サイズの大小に関係なく単なるメモリブロックとしてアクセスできるため、コードが単純かつ高速になるのです。

※本記事はH2O Advent Calendar 2014の一部です。

注1: 多くのOSにおいては、カーネルとメモリ空間を分け合う必要があるため、実際にユーザプロセスで使用できるメモリ空間は1GB〜3GB程度でした
注2: 参照: x86-64 - Wikipedia, the free encyclopedia - Operating system compatibility and characteristics
注3: malloc(3)で確保可能なメモリの総量である物理メモリ+スワップ領域の和は必ずしも十分に大きくないため、テンポラリファイルを生成してftruncateすることでディスク上に必要な領域を確保しています

Q. 条件分岐や算術演算を使わずに、max(a,b) を計算するプログラムを書けますか？

「if文（条件分岐）を使わず、max(a, b) を計算別解 | 津田の開発な日記」に関連した話です。リンク先のブログ記事では、条件分岐を使わずにmax(a,b)を実装する方法が議論されています。

では、更に条件を厳しくして、「条件分岐も算術演算も使わずに」max(a,b)を実装することはできるでしょうか？

も
ち
ろ
ん
可
能
で
す

回
答
例
は
以
下
に
あ
り
ま
す

なぜ、「もちろん」なのか。CPUは、ANDやOR、NOTのようなデジタルな論理回路から構成されています。であれば、当然、ビット演算（ビットシフトと&, |, ^）を使って、max(a, b)を実装することも可能なわけです。こんな感じ。

#include <stdio.h>

#define BIT(n, pos) (((n) >> (pos)) & 1)

static int mymax(int a, int b)
{
  int islt = BIT(a, 31) & (BIT(b, 31) ^ 1);
  int iseq = BIT(a, 31) ^ BIT(b, 31) ^ 1;

#define CHECK_BIT(pos) do { \
  islt |= iseq & (BIT(a, pos) ^ 1) & BIT(b, pos); \
  iseq &= BIT(a, pos) ^ BIT(b, pos) ^ 1; \
} while (0)

  CHECK_BIT(30); CHECK_BIT(29); CHECK_BIT(28);
  CHECK_BIT(27); CHECK_BIT(26); CHECK_BIT(25); CHECK_BIT(24);
  CHECK_BIT(23); CHECK_BIT(22); CHECK_BIT(21); CHECK_BIT(20);
  CHECK_BIT(19); CHECK_BIT(18); CHECK_BIT(17); CHECK_BIT(16);
  CHECK_BIT(15); CHECK_BIT(14); CHECK_BIT(13); CHECK_BIT(12);
  CHECK_BIT(11); CHECK_BIT(10); CHECK_BIT(9); CHECK_BIT(8);
  CHECK_BIT(7); CHECK_BIT(6); CHECK_BIT(5); CHECK_BIT(4);
  CHECK_BIT(3); CHECK_BIT(2); CHECK_BIT(1); CHECK_BIT(0);

#undef CHECK_BIT

  /* extend flag to 32-bit mask */
  islt <<= 31;
  islt >>= 31;

  return (a & (islt ^ 0xffffffff)) | (b & islt);
}

int main(int argc, char **argv)
{
  int a, b;

  if (argc != 3) {
    fprintf(stderr, "Usage: %s a b\n", argv[0]);
    return 1;
  }
  if (sscanf(argv[1], "%d", &a) != 1) {
    fprintf(stderr, "%s is not a number\n", argv[1]);
    return 1;
  }
  if (sscanf(argv[2], "%d", &b) != 1) {
    fprintf(stderr, "%s is not a number\n", argv[2]);
    return 0;
  }

  printf("max(%d,%d) is %d\n", a, b, mymax(a, b));

  return 0;
}

なぜHTTPSはHTTPより速いのか

先週、httpvshttps.com というウェブサイトが公開されました。このウェブサイトでは、HTTP と HTTPS を用いてアクセスした場合のウェブページのダウンロード完了までにかかる時間の比較ができるのですが、多くの環境で HTTPS の方が HTTP よりも高速なことに驚きの声が上がっていました。

HTTP が TCP 上で平文を送受信するのに対し、HTTPS は TCP 上で TLS (SSL) という暗号化技術を用いて通信を行います。ならば、TLS のオーバーヘッドのぶん HTTPS のほうが遅いはずだ、という予測に反する結果になったのですから、驚くのも無理はありません。

実は、この結果にはからくりがありました。

Google Chrome、Mozilla Firefox、最近のSafari^注1は、Google が開発した通信プロトコル「SPDY」に対応しており、HTTPS の通信においてはHTTP/1.1 ではなく SPDY を TLS 上の通信プロトコルとして利用します。

HTTP/1.1 は20年近く前に標準化された技術であり、今日では様々な限界が指摘されるようになってきています。そのうち最大の問題が、ウェブブラウザ上において、現在ダウンロード中のファイルのダウンロードが完了するまで次のファイルのダウンロード要求を送信できないという問題です。クライアントとサーバ間を通信が往復するには、数十ミリ秒から数百ミリ秒という時間がかかります（この時間は、ラウンドトリップタイム (RTT) と呼ばれます）。

ウェブブラウザが、第一のファイルのダウンロードを完了した直後に第二のファイルのダウンロード要求をサーバに送信したとしても、その中身が届き始めるのは、このラウンドトリップタイムが経過した後になります。その間、サーバからクライアント方向の通信回路は無駄に遊んでいることになります。

httpvshttps.com のような、小さな画像ファイルを多数ダウンロードするウェブページでは（これは多くのウェブサイトに見られる一般的な特徴です）、この無駄が占める割合が大きくなるのです（ウェブブラウザの動作としては、リクエストを送信後、RTTが経過するまでじっと待ち、その後一瞬でファイル受信が完了、直後に次のリクエストを送信し、RTT が経過するまでじっと待つ…を繰り返す形になります）。

一方、SPDY はこれら HTTP/1.1 が抱える問題を改善しようとする Google の試みとして開発が進んできたプロトコルであり、多数のファイルを並行してダウンロードすることが可能となっています。HTTP/1.1 のように、ファイルの到着を待つだけの時間は発生しません。

この違いが、暗号化のオーバーヘッドがあるにも関わらず、HTTPS のウェブページの方が HTTP より速くダウンロードが完了することの原因なのです。

「OSS界のちょっと気になる話（7）：次世代HTTP 2.0はSPDYの取り組みがベースに - ＠IT」より

■この傾向は今後も続くのか

SPDY は Google が開発した実証実験的な性格の強いプロトコルでした。そこから得られた成果をもとに、現在、HTTP/2 という次世代の HTTP プロトコルの標準化が進められており、はやければあと数週間で標準化される見込みとなっています。

HTTP/2 プロトコルの標準化案自体は HTTP でのアクセスにおいても HTTPS でのアクセスにおいても利用可能なものとなっています。ですが、多くのウェブブラウザはHTTPS でのアクセスにおいてのみ、HTTP/2 を有効化する公算が高くなっています^注2。

このため、HTTPS の方が HTTP よりも高速である、という状況は今後とも続くと考えられます。

■HTTPS のサーバ負荷はどうなのか

HTTPS に対する事業者の不安としては、ダウンロード速度に加えてもう一点、サーバ負荷の問題をあげることができます。暗号化を導入することで、サーバの負荷が増大するのではないか。

筆者らが開発している、高速なHTTPサーバであるH2Oでは…（続く）

本記事は HTTP2 Advent Calendar 2014 および H2O Advent Calendar 2014 の一部です。筆者体調不良により遅延＆書きかけの状態ですみませんすみません。

注1: YosemiteまたはiOS 8上で動作するものに限る
注2: 参照: (HTTP/2標準化ワークグループ座長のブログ記事, https://wiki.mozilla.org/Networking/http2

Monday, December 1, 2014

Improving Parser Performance using SSE Instructions (in case of PicoHTTPParser)

PicoHTTPParser is a tiny but very fast HTTP parser library known for its use by many Perl applications. In the slides I used last month, I suggested it could be even faster if SIMD instructions were used. Now the feature is available thanks to @herumi.

The library now uses the PCMPESTRI instruction which is part of SSE 4.2, running 68% to 90% faster.

Benchmark	clang -O3	clang -O3 -msse4.2	Improvement
bench.c	2,181,025	4,140,787	+90%
fukamachi.c	3,213,440	5,400,064	+68%

PCMPxSTRx is a SIMD instruction that can be used for parsing text. In _SIDD_CMP_RANGES mode, it checks at most 16 bytes at once, if each byte is within given set of ranges. Herumi and I have created a wrapper function for the instruction named findchar_fast that iterates though every 16 bytes of a given buffer to find the first occurrence of a byte within a set of given ranges.

And the function is merged neatly into the parser; the code below at first uses the SIMD function to look for a control character (the match condition is defined as ranges1), then falls back to the non-SIMD code to handling up to 15 remaining characters (that cannot be processed by the SIMD function due to out-of-bounds access).

#ifdef __SSE4_2__
  static const char ranges1[] =
    "\0\010"
    /* allow HT */
    "\012\037"
    /* allow SP and up to but not including DEL */
    "\177\177"
    /* allow chars w. MSB set */
    ;
  int found;
  buf = findchar_fast(buf, buf_end,
                      ranges1, sizeof(ranges1) - 1,
                      &found);
  if (found)
    goto FOUND_CTL;
#endif
  /* code that handles the input byte-by-byte */

To summarize, PCMPxSTRx is an excellent instruction for performance that can be cleanly integrated into existing parsers (tokenizers). Hopefully we will see performance improvements in various parsers in the future through the use of the instruction.

This blog post has been written as part of the H2O Advent Calendar.