Wednesday, November 5, 2014

[memo] Installing nghttp2 on OS X

Following the steps below should install nghttp2, together with its useful apps for testing, into /usr/local/http2-15 on your OS X machine.

-- openssl
% wget --no-check-certificate https://www.openssl.org/source/openssl-1.0.2-beta3.tar.gz
% tar xzf openssl-1.0.2-beta3.tar.gz
% cd openssl-1.0.2-beta3
% KERNEL_BITS=64 ./config shared enable-ec_nistp_64_gcc_128 --prefix=/usr/local/http2-15
% make
% make install

-- libevent
% wget https://github.com/downloads/libevent/libevent/libevent-2.0.21-stable.tar.gz
% tar xzf libevent-2.0.21-stable.tar.gz
% cd libevent-2.0.21-stable
% CFLAGS=-I/usr/local/http2-15/include CXXFLAGS=-I/usr/local/http2-15/include LDFLAGS=-L/usr/local/http2-15/lib ./configure --prefix=/usr/local/http2-15
% make
% make install

-- zlib-1.2.8
% wget http://zlib.net/zlib-1.2.8.tar.gz
% tar xzf zlib-1.2.8.tar.gz
% cd zlib-1.2.8
% ./configure --prefix=/usr/local/http2-15
% make
% make install

-- spdylay
% git clone https://github.com/tatsuhiro-t/spdylay.git
% cd spdylay
% (cd m4 && wget http://keithr.org/eric/src/whoisserver-nightly/m4/am_path_xml2.m4)
% autoreconf -i
% PKG_CONFIG_PATH=/usr/local/http2-15/lib/pkgconfig ./configure --prefix=/usr/local/http2-15
% make
% make install

-- nghttp2
% git clone https://github.com/tatsuhiro-t/nghttp2.git
% cd nghttp2
% (cd m4 && wget http://keithr.org/eric/src/whoisserver-nightly/m4/am_path_xml2.m4)
% autoreconf -i
% PKG_CONFIG_PATH=/usr/local/http2-15/lib/pkgconfig ./configure --prefix=/usr/local/http2-15 --enable-app --disable-threads
% make
% make install

note: I have gathered this from my command history. Please forgive me if any of them are wrong.
EDIT: added libevent as pointed out by @tatsuhiro-t.

Tuesday, November 4, 2014

The internals of H2O (or how to write a fast server)

Yesterday, I had the opportunity to speak at the HTTP2Conference held in Tokyo.

Until now I have been rather silent about H2O, since I think it is still too premature for use in the wild. But there is no need to worry about such things at a technology conference, so I happily talked about the motives behind H2O and its implementation.

IMO the talk should be interesting not only to people working on HTTP server implementations but also to those interested in programming in general, since it covers common performance pitfalls (or misconceptions, as I see them) found in many server implementations.

Below are the slides and the recorded video of my presentation.

Enjoy!

PS. Please forgive me for the provocative attitude in the questions raised in the last part of the slides. They were intentionally written as such in order to spark discussion at the hall.



my talk starts at 3h15m (in Japanese)


Friday, October 10, 2014

[memo] How to rsync with root privileges

When migrating a server or the like, have you ever wanted to rsync configuration files that only root can access, while preserving their permissions?

In such cases, combining sudo's -A option with rsync's --rsync-path option seems to work well.

First, place a script on the remote server that prints the password to standard output (do not forget to restrict the file's permissions).
% cat > bin/askpass
#! /bin/sh
echo "{{my_password}}"
% chmod 700 bin/askpass
% 
Then, when running rsync, use the --rsync-path option so that rsync on the remote server is invoked via sudo -A.
% sudo rsync -avz -e ssh \
    --rsync-path='SUDO_ASKPASS=/home/remote-user/bin/askpass sudo -A rsync' \
    remote-user@remote-host:remote-dir local-dir

It's that simple!

Like NOPASSWD, the sudo -A option raises security concerns in that it requires placing a file containing the password, but I found it convenient as long as it is used in the right situations.

Tuesday, October 7, 2014

Making the HTTP server run 20% faster with qrintf-gcc, a sprintf-optimizing preprocessor / compiler

tl;dr

This is the initial release announcement of qrintf, a preprocessor that optimizes calls to sprintf and snprintf (qrintf-gcc is its compiler frontend). C programs calling these functions may run faster simply by using the preprocessor / compiler in place of the standard compiler toolchain.

github.com/kazuho/qrintf

Background

sprintf (and snprintf) is a great function for converting data to strings. The downside is that it is slow. Recently the functions have been among the bottlenecks of H2O, a high-performance HTTP server / library with support for HTTP/1.x and HTTP/2.

The function is slow not because stringification in general is a slow operation. It is slow mainly due to its design.

The function takes a format string (e.g. "my name is %s") and parses it at runtime, every time it is called. That design must have been a good choice when the C programming language was designed: until the 1990s, every byte of memory was a precious resource, and a memory-conscious approach (e.g. a state machine for formatting strings) was the logical choice. But nowadays memory is hundreds of thousands of times cheaper than it used to be, while the relative cost of running a state machine to build formatted strings has become large, since it causes lots of branch mispredictions that stall the CPU pipeline.

How it works

This understanding led me to write qrintf, which works as a wrapper around the C preprocessor. It precompiles invocations of sprintf (and snprintf) that use a constant format string (which is mostly the case) into optimized forms.

The example below illustrates the conversion performed by the qrintf preprocessor. As can be seen, the call to snprintf is precompiled into a series of function calls specialized for each portion of the format string (and the C compiler may inline-expand the specialized calls). In the case of this example, the optimized code runs more than 10 times faster than the original form.
// source code
snprintf(
    buf,
    sizeof(buf),
    "%u.%u.%u.%u",
    (addr >> 24) & 0xff,
    (addr >> 16) & 0xff,
    (addr >> 8) & 0xff,
    addr & 0xff);

// after preprocessed by qrintf
_qrintf_chk_finalize(
    _qrintf_chk_u(
        _qrintf_chk_c(
            _qrintf_chk_u(
                _qrintf_chk_c(
                    _qrintf_chk_u(
                        _qrintf_chk_c(
                            _qrintf_chk_u(
                                _qrintf_chk_init(buf, sizeof(buf)),
                                (addr >> 24) & 0xff),
                            '.'),
                        (addr >> 16) & 0xff),
                    '.'),
                (addr >> 8) & 0xff),
            '.'),
        addr & 0xff));
$ gcc -Wall -O2 examples/ipv4addr.c && time ./a.out 1234567890
result: 73.150.2.210

real 0m2.475s
user 0m2.470s
sys 0m0.002s
$ qrintf-gcc -Wall -O2 examples/ipv4addr.c && time ./a.out 1234567890
result: 73.150.2.210

real 0m0.215s
user 0m0.211s
sys 0m0.002s

Performance impact on H2O

H2O uses sprintf in three places: building the HTTP status line, building the ETag, and building the access log.

By using qrintf-gcc in place of gcc, the performance of H2O jumps by 20% in a benchmark that serves tiny files with access logging enabled (from 82,900 reqs/sec. to 99,200 reqs/sec.).

The fact not only makes me happy (as the developer of qrintf and H2O) but also shows that optimizing sprintf at compile time does have benefits for real-world programs.

So if you are interested in qrintf or H2O, please give them a try. Thank you for reading.

Thursday, October 2, 2014

I wrote qrintf, a preprocessor that makes sprintf up to 10x faster or more

I have recently been writing an HTTP server called H2O, and profiling showed that sprintf was eating a fair amount of time, which left me unsatisfied. Indeed, sprintf is a Swiss-army-knife of a function for formatting numbers and strings, so it is commonly used (and tends to consume CPU time) in many programs, not just HTTP servers.

So if sprintf were optimized, all sorts of programs might run faster. With that in mind, I wrote qrintf.

qrintf works as a wrapper around the C preprocessor: it analyzes the format strings of the sprintf calls contained in the source code and rewrites each call into code specialized for its format, thereby speeding sprintf up.

For example, it automatically rewrites a snippet like the following, which stringifies an IPv4 address,
sprintf(
    buf,
    "%d.%d.%d.%d",
    (addr >> 24) & 0xff,
    (addr >> 16) & 0xff,
    (addr >> 8) & 0xff,
    addr & 0xff);
into the form below at compile time; by eliminating the runtime cost of parsing the format string, it achieves a substantial speedup.
((int)(_qrintf_d(
    _qrintf_c(
        _qrintf_d(
            _qrintf_c(
                _qrintf_d(
                    _qrintf_c(
                        _qrintf_d(
                            _qrintf_init(buf),
                            (addr >> 24) & 0xff),
                        '.'),
                    (addr >> 16) & 0xff),
                '.'),
            (addr >> 8) & 0xff),
        '.'),
    addr & 0xff)
).off);

In the case of the code above, I have confirmed that using qrintf makes the stringification more than 10 times faster, as shown below.
$ gcc -O2 examples/ipv4addr.c
$ time ./a.out 1234567890
result: 73.150.2.210

real    0m2.602s
user    0m2.598s
sys 0m0.003s
$ ./qrintf-gcc -O2 examples/ipv4addr.c
$ time ./a.out 1234567890
result: 73.150.2.210

real    0m0.196s
user    0m0.192s
sys 0m0.003s

sprintf gets faster without any changes to the code that calls it. Happy days!

Monday, September 8, 2014

The reasons I stopped using libuv for H2O

Libuv is a great cross-platform library that abstracts various types of I/O by using callbacks.

So when I started writing H2O - a high-performance HTTP server / library implementation with support for HTTP/1, HTTP/2 and WebSocket - using libuv seemed like a good idea. But recently I stopped using it, for several reasons. This blog post explains them.

■No Support for TLS

Although libuv provides a unified interface for various types of streams, it does not provide access to a TLS stream through that interface, nor does it provide a way for application developers to implement their own types of streams.

This has been a disappointment to me. Initially I had created a tiny library called uvwslay that binds libuv streams to wslay - a WebSocket protocol implementation. But since libuv does not provide support for TLS, it was impossible to support SSL within the tiny library.

To support the wss protocol (i.e. WebSocket over TLS) I needed to reimplement the WebSocket binding: stop using libuv as the stream abstraction layer, and switch to an abstraction layer defined atop libuv that hides the differences between libuv streams and the I/O API provided by the SSL library in use (in my case, OpenSSL).

It would be great if libuv could provide direct support for TLS in the future (or a way for applications to implement their own types of libuv streams), so that code calling the libuv API directly can support various types of streams.

■Memory Usage is not Optimal

Libuv provides a callback-based API for writes, and I love the idea: it hides the complexity of implementing non-blocking I/O operations. The problem is that libuv does not call the completion callback right after a write succeeds; instead, the callbacks are called only after all pending I/Os have been performed. This means that if you have 1,000 connections each sending 64KB of data, your code first allocates a 64KB buffer 1,000 times (64MB in total) and only then frees all of those memory chunks, even when the network operations do not block. IMO libuv would do better to call the completion callback immediately after the application returns control to the library following uv_write, so that the memory allocation pattern could be a repetition of malloc-then-free, 1,000 times over.

■No Support for Delayed Tasks

Libuv does not provide an interface to schedule actions to run just before the I/O polling method is called.

Such an interface is necessary if you need to aggregate I/O operations and send the result to a single client. In the case of HTTP/2, a number of HTTP requests need to be multiplexed over a single TCP (or TLS) connection.

Lacking support for delayed tasks, you need to use uv_timer with a 0-second timeout to implement them.

This works fine if the intention is to aggregate read operations, but not if the intention is to aggregate the results of write operations, since in libuv the write completion callbacks are called after the timers.

■Synchronous File I/O is Slow

Synchronous file I/O seems to be slow compared to calling the POSIX API directly.

■Network I/O has Some Overhead

After I found out the issues mentioned above, I started to wonder how much overhead libuv imposes for network I/O.

So I stripped the libuv-dependent parts off the stream abstraction layer of H2O (which hides the difference between TCP and TLS, as described above) and replaced them with direct calls to POSIX APIs and various polling methods (i.e. select, epoll, kqueue).

The chart below shows the benchmark results taken on my development environment, running Ubuntu 14.04 on VMware. H2O became 14% to 29% faster by not depending on libuv.


■The Decision

Considering all these facts, I have decided to stop using libuv in H2O and use the direct binding that I wrote and used for the benchmark, at least until the performance penalty vanishes and TLS support gets integrated.

OTOH I will continue to use libuv for my other projects, and to recommend it to others, since I still think it does an excellent job of hiding the messy details of platform-specific asynchronous APIs with adequate performance.

PS. Please do not blame me for not submitting these issues as pull requests to libuv. I might do so in the future, but such a task is too heavy for me right now. I am writing this post in the hope that it will help people (including myself) understand more about libuv, and as an answer to @saghul.

Tuesday, July 15, 2014

On issuing server certificates for smartphone apps that talk to in-house servers (revisiting SSL pinning and private CAs)

■Background




■Can third-party CAs be trusted?



■Is a private CA problematic?




■Risk assessment: in-house CA vs. third-party CA




■The pros and cons of SSL pinning




■Two kinds of SSL pinning



■So, what should we do?





■Addendum: why merely buying a certificate from a reputable CA is not enough