Kazuho's Weblog: C++


Thursday, June 18, 2020

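Optimizing an AES-GCM implementation for QUIC (2/2)

As described in the first half, OpenSSL's AEAD ciphers are built on the assumption that they process long AEAD blocks. While they reach the theoretical upper bound of throughput when encrypting the plaintext, the pre-processing, the post-processing, and the call overhead are not particularly well optimized. This reflects the fact that, until now, the main use of AEAD ciphers has been TLS, a protocol that (typically) uses long AEAD blocks.

In QUIC, by contrast, each UDP packet has to be encrypted as an independent, short AEAD block, which gives us the following opportunity for a speedup:

If the AEAD processing is merged into a single function, and the pre-processing and post-processing run in parallel with the pipelined and stitched encryption, it is possible to build an AES-GCM implementation that delivers throughput close to the theoretical value even when the AEAD blocks are short (quoted from the first half)

One option is to implement a function that satisfies this condition and then improve its speed by eliminating bottlenecks one after another. However, that kind of patch-as-you-go programming style often causes rework through repeated changes, and tends to leave code that is not necessarily optimal in the final product.

Is there a more efficient way to design it?

■Design policy of "fusion", an AES-GCM implementation for QUIC

Fortunately, for AES-GCM we know that the bottleneck on 9th-generation Core CPUs is AES-NI, and that the theoretical upper bound of its throughput is 64 bytes per 40 clocks. As also mentioned earlier, an AES-GCM implementation that uses stitching runs AES-NI at full speed during encryption while computing the GCM hash on the other execution units.

If so, wouldn't it be possible to build an AES-GCM implementation that approaches the theoretical upper bound by running AES-NI all the time and fitting everything else, including the AEAD pre-processing and post-processing, into the gaps?

Based on this idea, I decided to create "fusion", an AES-GCM library with the following characteristics:

Run AES-NI in units of 6*16 bytes for as long as possible
In the meantime, compute the GCM hash over an arbitrary length, including the AAD (i.e. the pre-processing)
Write it in C rather than assembly, so that the complex design stays maintainable
Perform the GCM hash precomputation across the entire AEAD block, thereby reducing the cost of reduction
Overlap the AES operation required for packet header encryption (packet number encryption)

Let's visualize the typical data flow of AES-GCM encryption. The first figure shows the classic approach (as in OpenSSL), which focuses on the encryption part. The second figure shows fusion's approach. The horizontal axis is time, and operations stacked vertically run concurrently (are stitched). You can see that fusion stitches more of the work together.

Below is the hot loop of encryption in fusion.c. gfmul_onestep is an inline function that performs the GCM hash multiplication for one block. You can see that, while the AES computation for six blocks (bits0 through bits5) is in progress, gfmul_onestep is called the number of times specified by gdata_cnt.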

#define AESECB6_UPDATE(i) \
    do { \
        __m128i k = ctx->ecb.keys[i]; \
        bits0 = _mm_aesenc_si128(bits0, k); \
        bits1 = _mm_aesenc_si128(bits1, k); \
        bits2 = _mm_aesenc_si128(bits2, k); \
        bits3 = _mm_aesenc_si128(bits3, k); \
        bits4 = _mm_aesenc_si128(bits4, bits4keys[i]); \
        bits5 = _mm_aesenc_si128(bits5, k); \
    } while (0)
#define AESECB6_FINAL(i) \
    do { \
        __m128i k = ctx->ecb.keys[i]; \
        bits0 = _mm_aesenclast_si128(bits0, k); \
        bits1 = _mm_aesenclast_si128(bits1, k); \
        bits2 = _mm_aesenclast_si128(bits2, k); \
        bits3 = _mm_aesenclast_si128(bits3, k); \
        bits4 = _mm_aesenclast_si128(bits4, bits4keys[i]); \
        bits5 = _mm_aesenclast_si128(bits5, k); \
    } while (0)

    /* run AES and multiplication in parallel */
    size_t i;
    for (i = 2; i < gdata_cnt + 2; ++i) {
        AESECB6_UPDATE(i);
        gfmul_onestep(&gstate, _mm_loadu_si128(gdata++),
                      --ghash_precompute);
    }
    for (; i < ctx->ecb.rounds; ++i)
        AESECB6_UPDATE(i);
    AESECB6_FINAL(i);

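Careful readers of the code may have noticed that only the computation of bits4 uses a different key. This is the trick for overlapping the AES computation for packet header encryption.

■Packet header encryption

Encryption of the packet header (packet number) is a feature found in new-generation transport protocols such as QUIC and DTLS 1.3. Encrypting the packet header is expected to make it harder for eavesdroppers to infer the contents of the communication, and to prevent ossification, in which middleboxes (routers) come to assume specific traffic patterns and thereby make it hard to improve the transport protocol.

Why overlap the AES computation for packet header encryption? Because the AES computation is done six blocks at a time, unused slots appear unless the remainder of the packet length divided by 96 is between 65 and 80. The aim is to hide the cost of packet header encryption by using one of those spare slots for its AES operation.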
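
The slot arithmetic above can be sketched in C as follows. This is a minimal sketch, not code from fusion: the function names are hypothetical, and it assumes that an AEAD block needs one AES block per 16 bytes of payload plus one extra block for masking the GCM tag, an assumption that is consistent with the 65-to-80 range mentioned above.

```c
#include <stddef.h>

#define AES_BLOCK_SIZE 16
#define BLOCKS_PER_BATCH 6 /* AES-NI is run in units of 6*16 bytes */

/* Number of AES blocks needed for one AEAD block: the CTR keystream blocks
 * plus one block for masking the GCM tag (assumption, see the lead-in). */
static size_t aes_blocks_needed(size_t payload_len)
{
    return (payload_len + AES_BLOCK_SIZE - 1) / AES_BLOCK_SIZE + 1;
}

/* Number of slots left unused in the last batch of six blocks. */
static size_t spare_slots(size_t payload_len)
{
    size_t used_in_last_batch = aes_blocks_needed(payload_len) % BLOCKS_PER_BATCH;
    return used_in_last_batch == 0 ? 0 : BLOCKS_PER_BATCH - used_in_last_batch;
}
```

Under these assumptions a 1440-byte payload (1440 = 15 * 96) leaves five spare slots in its last batch, while payloads whose length modulo 96 is 65 to 80 leave none.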

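The data flow with packet header encryption overlapped is shown below.

■Benchmark

Now let's look at the benchmark results.

The blue bars show the throughput of OpenSSL's AES-GCM excluding pre-processing and post-processing, and the red bars show the total throughput including both. Yellow is fusion's total throughput, and green is the value when the operation required for packet header encryption is overlapped.

First, let's look at the numbers for a recent Intel CPU, the Core i5 9400.

With an AEAD block size of 16KB, both OpenSSL's throughput excluding pre- and post-processing and fusion's throughput reach the theoretical upper bound of 6.4GB/s (the slight deviations come from the precision of CPU clock control). OpenSSL's throughput including pre- and post-processing is slightly lower at 6.2GB/s, but this can also be read as saying that, for TLS, the overhead of not optimizing the pre- and post-processing is about 3% or less.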
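
As a rough check of these figures, the bound of 64 bytes per 40 clocks can be converted into bytes per second. The sketch below uses hypothetical function names, and the 4.0GHz core clock used in the usage note is an assumption (the clock is not stated here).

```c
/* Upper bound of AES-NI throughput in bytes per second for a given clock. */
static double aesni_bound_bytes_per_sec(double clock_hz)
{
    const double bytes_per_batch = 64.0;  /* figure quoted in the text */
    const double clocks_per_batch = 40.0; /* figure quoted in the text */
    return bytes_per_batch / clocks_per_batch * clock_hz;
}

/* Measured throughput as a fraction of the bound. */
static double fraction_of_bound(double measured_bytes_per_sec, double bound)
{
    return measured_bytes_per_sec / bound;
}
```

With an assumed clock of 4.0GHz the bound comes out at about 6.4GB/s, and a measured 4.4GB/s is roughly 69% of it.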

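With an AEAD block size of 1440 bytes, on the other hand, the difference is striking. OpenSSL's total throughput drops to 4.4GB/s, about 70% of the theoretical value, whereas fusion delivers more than 90% of the theoretical value. You can also see that the overhead of packet header encryption is 1% or less.

Turning to AMD Ryzen, fusion wins not only with a 1440-byte AEAD block size but also with 16KB. This is presumably because AES-NI throughput on Ryzen is high relative to PCLMUL, so the bottleneck has moved to the GCM hash computation, of which PCLMUL is the representative part. fusion performs the precomputation across the entire expected AEAD block and thereby reduces the number of reductions in the GCM hash computation, which would explain the gap even at a 16KB block size.

■Discussion

Once UDP processing in the kernel and the network card is optimized, the difference in crypto cost becomes the issue, and QUIC ends up using more CPU than TLS. I have shown that this problem can be improved substantially by preparing an AES-GCM implementation optimized for encrypted transports such as QUIC. This article does not go into the details of using fusion as the crypto library for QUIC, but in an environment with GSO hardware offload for both TCP and UDP, we measured that QUIC comes out ahead with a 9KB packet size, and that even with a 1.5KB packet size the overhead of QUIC is only about 5% more than TLS (see: h2o/quicly PR #359).

In addition, I have shown that

the cost of packet header encryption is not at a level that needs particular concern (at least on the sending side)
compared with using assembly, using C makes it possible to develop a crypto library with a more sophisticated design while retaining best-case throughput

The AES-GCM implementation developed here, "fusion", was merged yesterday into picotls, the TLS stack we maintain, and is now available for use. I hope that fusion, or implementation techniques like it, will make communication on the Internet cheaper and more secure.

Finally, in developing fusion I received advice from Mitsunari (@herumi) and help with benchmarking from Yoshida (@syohex). I would like to take this opportunity to thank them.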

Thursday, April 16, 2020

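Counting the number of elements of an array safely in C

There is an idiom in C for counting the number of elements of an array,

sizeof(array) / sizeof(array[0])

but once the array name gets long, for example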

sizeof(var.that_has_an_array.as_a.member) /
    sizeof(var.that_has_an_array.as_a.member[0])

things get chaotic.

So some vendors provide a macro like

#define _countof(array) (sizeof(array) / sizeof(array[0]))

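but the more this can be used without thinking, the greater the concern that someone passes a pointer rather than an array as the argument, gets the size calculation wrong, and ends up with strange behavior.

So I asked on Twitter:

Is there a way in C to tell whether a value is a pointer or an array (gcc/clang extensions are fine)? The intent is, for a macro like countof(array), to guarantee at build time that what is passed as the argument array is an array and not a pointer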

cf. https://t.co/izurmOdiTl
— Kazuho Oku (@kazuho) April 14, 2020

and mattn replied:

With gcc, this macro looks like it will work.

# define IS_ARRAY(arg) __builtin_choose_expr(__builtin_types_compatible_p(typeof(arg[0]) [], typeof(arg)), 1, 0)
— mattn (@mattn_jp) April 14, 2020

Following that advice,

#define PTLS_BUILD_ASSERT(cond) \
    ((void)sizeof(char[2 * !!(!__builtin_constant_p(cond) || (cond)) - 1]))

#define PTLS_ASSERT_IS_ARRAY(a) \
    PTLS_BUILD_ASSERT( \
        __builtin_types_compatible_p(__typeof__(a[0])[], __typeof__(a)))

#define PTLS_ELEMENTSOF(x) \
    (PTLS_ASSERT_IS_ARRAY(x), sizeof(x) / sizeof((x)[0]))

I adopted it in roughly this form. With this, the verbose example above can be written as

PTLS_ELEMENTSOF(var.that_has_an_array.as_a.member)

Handy.
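
A self-contained sketch of the same idea is shown below, for GCC and Clang only. The DEMO_ names, the struct, and the variable are hypothetical and exist only for illustration; the build-time assertion is simplified compared with the macro above.

```c
#include <stddef.h>

/* Fails to compile (negative array size) when cond is false. */
#define DEMO_BUILD_ASSERT(cond) ((void)sizeof(char[2 * !!(cond) - 1]))

/* GCC/Clang builtin: 1 when `a` has array type, 0 when it is a pointer. */
#define DEMO_IS_ARRAY(a) \
    __builtin_types_compatible_p(__typeof__((a)[0])[], __typeof__(a))

/* Element count that refuses to build when given a pointer. */
#define DEMO_ELEMENTSOF(a) \
    (DEMO_BUILD_ASSERT(DEMO_IS_ARRAY(a)), sizeof(a) / sizeof((a)[0]))

/* Hypothetical nested structure with a long member path. */
struct demo_config {
    struct {
        int ports[5];
    } listen;
};

static struct demo_config demo_config;

static size_t count_ports(void)
{
    /* Replacing the argument with a pointer such as `int *p` would make
     * DEMO_BUILD_ASSERT produce a negative array size and stop the build. */
    return DEMO_ELEMENTSOF(demo_config.listen.ports);
}
```

Here count_ports() evaluates to 5 without the array name having to be written twice.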

For the complete change, see add PTLS_ELEMENTSOF for counting the number of elements in an array by kazuho · Pull Request #301 · h2o/picotls.

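PS. As for how to check whether something is an array or a pointer, yugui also suggested the compiler-independent solution below, but unfortunately it seemed difficult to make it work at compile time. If you need a runtime check, this approach may be the better one.

Wouldn't it work to take the address with &, cast it to intptr_t, and compare it with the original value?
I feel there was a more portable way, though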
— Yuki Yugui Sonoda (@yugui) April 14, 2020
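
The runtime check suggested in the tweet can be sketched as follows. The macro and function names are hypothetical. The check is a heuristic: it relies on an array and its first element sharing an address, and it would misreport a pointer that happens to point at itself.

```c
#include <stdint.h>

/* For an array, &x and the decayed x are the same address; for a pointer
 * variable, &x is the location where the pointer itself is stored. */
#define DEMO_LOOKS_LIKE_ARRAY(x) ((intptr_t)&(x) == (intptr_t)(x))

static int check_array_case(void)
{
    int values[4] = {1, 2, 3, 4};
    return DEMO_LOOKS_LIKE_ARRAY(values);
}

static int check_pointer_case(void)
{
    int values[4] = {1, 2, 3, 4};
    int *p = values;
    return DEMO_LOOKS_LIKE_ARRAY(p);
}
```

check_array_case() yields 1 and check_pointer_case() yields 0, but only once the program runs, which is why this does not help a build-time assertion.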

Monday, July 22, 2019

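I hated pthread_once so much that I reimplemented it

I dislike pthread_once. The reason is that, as shown below, it brings in file-level global variables and global functions, and the place where the value is used tends to end up far from the initialization code, which hurts readability.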

static volatile BIO_METHOD *biom = NULL;

static void init_biom(void)
{
    biom = BIO_meth_new(BIO_TYPE_FD, "h2o_socket");
    BIO_meth_set_write(biom, write_bio);
    BIO_meth_set_read(biom, read_bio);
    BIO_meth_set_puts(biom, puts_bio);
    BIO_meth_set_ctrl(biom, ctrl_bio);
}

static void setup_connection(...)
{
    (much omitted)

    // initialize the BIO
    static pthread_once_t init_biom_once = PTHREAD_ONCE_INIT;
    pthread_once(&init_biom_once, init_biom);
    BIO *bio = BIO_new(biom);
    ...
}

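On the other hand, if you try to avoid the hassle of pthread_once, you end up writing double-checked locking by hand. But writing double-checked locking correctly is hard (see: LCK10-J. Do not use incorrect forms of the double-checked locking idiom), I do in fact get it wrong, and trying to write it without mistakes every single time is stressful.

So I took the plunge and used a macro to implement the "once" that I really wanted.

It is used like this.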

static void setup_connection(...)
{
    (much omitted)

    // initialize the BIO
    static volatile BIO_METHOD *biom = NULL;
    H2O_MULTITHREAD_ONCE({
        biom = BIO_meth_new(BIO_TYPE_FD, "h2o_socket");
        BIO_meth_set_write(biom, write_bio);
        BIO_meth_set_read(biom, read_bio);
        BIO_meth_set_puts(biom, puts_bio);
        BIO_meth_set_ctrl(biom, ctrl_bio);
    });
    BIO *bio = BIO_new(biom);
    ...
}

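The global variable and the callback function are gone, and the initialization code now sits right next to the code that uses it, which improves readability.

The actual code is at https://github.com/h2o/h2o/pull/2086, so please take a look. It should implement double-checked locking correctly.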
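
For readers who want to see the shape of such a macro, here is a minimal sketch of double-checked locking using C11 atomics and a pthread mutex. It is not the code from the pull request: DEMO_ONCE, init_count, and get_table_size are hypothetical names, and the memory orderings are one reasonable choice rather than the only one.

```c
#include <pthread.h>
#include <stdatomic.h>

/* Runs `block` exactly once, even when reached from several threads. */
#define DEMO_ONCE(block)                                                   \
    do {                                                                   \
        static atomic_int done = 0;                                        \
        static pthread_mutex_t mutex = PTHREAD_MUTEX_INITIALIZER;          \
        /* first check, without the lock; pairs with the release store */  \
        if (!atomic_load_explicit(&done, memory_order_acquire)) {          \
            pthread_mutex_lock(&mutex);                                    \
            /* second check, now protected by the mutex */                 \
            if (!atomic_load_explicit(&done, memory_order_relaxed)) {      \
                do {                                                       \
                    block                                                  \
                } while (0);                                               \
                atomic_store_explicit(&done, 1, memory_order_release);     \
            }                                                              \
            pthread_mutex_unlock(&mutex);                                  \
        }                                                                  \
    } while (0)

static int init_count = 0; /* counts how often the initializer ran */

static int get_table_size(void)
{
    static int table_size;
    DEMO_ONCE({
        table_size = 42;
        ++init_count;
    });
    return table_size;
}
```

Calling get_table_size() repeatedly returns 42 each time while init_count stays at 1. Note that a block containing a top-level comma would be split into several macro arguments, so such blocks need extra parentheses or a variadic macro.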

One last remark: passing a block to a C macro is super handy.

Sunday, October 30, 2016

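Allocating a huge file-backed buffer with mmap

Keeping small buffers in memory, while creating a temporary file and accessing it through file I/O for buffers too large to fit in memory, has long been the general-purpose approach to implementing buffers.

But switching the access method depending on the amount of data stored in the buffer is tedious, and abstracting it away adds needless overhead.

Fortunately, these days we only need to think about 64-bit CPUs with a large address space. So, for "reading" files, simply mmap-ing everything to avoid the hassle is becoming the common approach (e.g. lld, the LLVM linker).

The same can be done for variable-length buffers backed by a temporary file, and h2o actually implements this. For details, see the implementation of the h2o_buffer_reserve function, but the rough procedure is as follows:

▪️Creating or resizing the buffer:

Create a temporary file and delete it immediately (creation only; call mkstemp and unlink, then keep using the file descriptor)
Set the size (posix_fallocate or ftruncate)
munmap the previously mapped region (resize only)
mmap the whole file

▪️Releasing the buffer:

munmap and close

On Windows, deleting the file may have to be done at release time.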
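
The steps above can be sketched as follows on a POSIX system. This is an illustration only, not h2o's h2o_buffer_reserve: struct filebuf, the function names, and the /tmp path template are all hypothetical, and ftruncate is used where posix_fallocate would also do.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

struct filebuf {
    int fd;      /* -1 until the temporary file has been created */
    char *addr;  /* NULL while nothing is mapped */
    size_t size; /* size of the current mapping in bytes */
};

/* Creates the buffer on first use, otherwise resizes it. Returns 0 on success. */
static int filebuf_reserve(struct filebuf *buf, size_t new_size)
{
    void *addr;

    if (buf->fd == -1) {
        /* create the temporary file and delete it right away */
        char path[] = "/tmp/filebuf.XXXXXX"; /* assumed location */
        if ((buf->fd = mkstemp(path)) == -1)
            return -1;
        unlink(path);
    }
    /* set the size of the backing file */
    if (ftruncate(buf->fd, (off_t)new_size) != 0)
        return -1;
    /* drop the old mapping, if any, then map the whole file */
    if (buf->addr != NULL) {
        munmap(buf->addr, buf->size);
        buf->addr = NULL;
        buf->size = 0;
    }
    addr = mmap(NULL, new_size, PROT_READ | PROT_WRITE, MAP_SHARED, buf->fd, 0);
    if (addr == MAP_FAILED)
        return -1;
    buf->addr = addr;
    buf->size = new_size;
    return 0;
}

/* Releases the mapping and the file descriptor. */
static void filebuf_dispose(struct filebuf *buf)
{
    if (buf->addr != NULL)
        munmap(buf->addr, buf->size);
    if (buf->fd != -1)
        close(buf->fd);
    buf->fd = -1;
    buf->addr = NULL;
    buf->size = 0;
}
```

A buffer starts out as `{-1, NULL, 0}`. Because the mapping is MAP_SHARED, data written before a resize is still there after the file has been remapped, although the address may change.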

Wednesday, November 11, 2015

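Making synchronous calls asynchronous in mruby (or, accessing the network from H2O's mruby handler)

■Background

Since version 1.5, H2O lets you write handlers in mruby, following the Rack interface.

The purpose of this feature is to make web server configuration more concise and easier to extend by using a programming language instead of complex configuration files that rely on things like regular-expression rewrites (it is not to make it possible to run web applications inside the web server, as Apache's mod_ruby or mod_perl do).

That said, real-world web server configurations sometimes need routing based on the result of querying an external database or the like.

How, then, should a database query be implemented inside a handler that is written against the Rack interface, which adopts a synchronous model, yet runs on an event-driven web server like H2O? If the query is synchronous, the web server stops processing for its duration, so we want to hand control back to the web server at the moment the query function is called in Ruby.

With that in mind, I tweeted

That's right. Given how mruby is structured, making things asynchronous at the point of a C library call looks difficult > "when an access to redis occurs in mruby code, h2o cannot release that thread, so the benefit of the event loop is not obtained" https://t.co/suTAf8PatM
— Kazuho Oku (@kazuho) November 10, 2015

and the conversation went as follows:

@kazuho Hmm, what kind of improvement would let you "get the benefit"? I am willing to improve it.
— Yukihiro Matsumoto (@yukihiro_matz) November 10, 2015

@yukihiro_matz @kazuho Is this about something like making Fibers manipulable from C?
— MATSUMOTO, Ryosuke (@matsumotory) November 10, 2015

@matsumotory @kazuho yield and resume can already be done. What else is needed? Creating one has the problem that C functions cannot be re-entered, but it is possible if we impose the restriction that, like resume, it can only be called via return
— Yukihiro Matsumoto (@yukihiro_matz) November 10, 2015

After thinking about it some more, I started to feel that

@matsumotory @yukihiro_matz I'm thinking that if the rack handler is called from inside a fiber, we might be able to handle this without touching mruby
— Kazuho Oku (@kazuho) November 10, 2015

so I decided to start by writing a PoC.

■A PoC that uses Fiber to make synchronous calls asynchronous

It looks roughly like the following. The Rack handler itself is placed inside a Fiber, and control is returned to the main loop (which would actually be written in C) by calling Fiber.yield for the handler's input and output, and at the moment the function to be made asynchronous (DB#query here) is called.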

# DB class that calls yield
class DB
  def query
    return Fiber.yield ["db#query"]
  end
end

# the application, written as an ordinary Rack handler
app = lambda {|env|
  p "received request to #{env["PATH_INFO"]}"
  [200, {}, ["hello " + DB.new.query]]
}

# fiber that runs the app
runner = Fiber.new {
  req = Fiber.yield
  while 1
    resp = app.call(req)
    req = Fiber.yield ["response", resp]
  end
}
runner.resume

# the app to be written in C
msg = {"PATH_INFO"=> "/abc"} # set request obj
while 1    
  status = runner.resume(msg)
  if status[0] == "response"
    resp = status[1]
    break
  elsif status[0] == "db#query"
    # is a database query, return the result
    msg = ""
  else
    raise "unexpected status:#{status[0]}"
  end
end
p "response:" + resp[2].join("")

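So it can be done if we want to. However, this approach has two limitations.

It can only be called from inside a fiber - is that acceptable?
Inside a fiber, Ruby code that was called via C code cannot call Fiber.yield

Neither is a big problem, but I note them here (the latter seems to be regarded as not a major issue in the case of mruby; see: twitter.com/yukihiro_matz/status/664276538574049280).

■How to implement protocol bindings

We now know this looks feasible, but being possible is not necessarily the same as being a good approach. How should protocol bindings be written in the first place? I will divide the options into two broad categories and list the pros and cons.

Write a wrapper around a C library
- The C library must support an asynchronous model
- Support is needed for each event loop (libuv, libev, ...)
- No need to implement the protocol
Write the binding in Ruby
- The protocol must be implemented
- It can be written in Ruby!
- If each backend (libuv, libev, ngx_mruby, h2o, ...) provides the same Ruby API (a subset of TCPSocket should do), no per-event-loop support is needed
- Might be slower than C…

Personally I prefer writing the bindings in Ruby. As for the concern about speed, speaking as someone who has promoted HTTP implementations built on Perl IO, I believe the cost of a scripting language's I/O layer is in most cases not a problem for programs that communicate over the network. If anything becomes a problem it is the parser for the transferred data, and that can be handled well enough by turning only that part into native code. Perl programmers familiar with Plack or Furl will readily accept this, and I expect the same holds for (m)ruby.

■Summary

This has become long, but as for how to launch, from H2O (or event-driven programs in general), an mruby script that calls a synchronously written network client, I have come to think it is best to

make the synchronously written application asynchronous with a Fiber-based wrapper
have the host program provide, through Fiber, a synchronous socket API compatible with TCPSocket
provide protocol bindings in Ruby (or by combining Ruby's TCPSocket with a protocol parser written in C)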

Friday, November 6, 2015

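Compressing a sorted sequence of integers

A compressed sorted sequence of integers is a general-purpose data structure used in all sorts of places, for example the inverted index of a search engine. For search engines speed matters, so various data structures such as PForDelta have been studied.

Meanwhile, H2O has "cache-aware server push", a mechanism for server-pushing js and css that are not in the browser cache, and to determine what is cached, a Bloom filter has to be included in every HTTP request.

When you try to compress the Bloom filter, since a Bloom filter can be represented as a sorted sequence of integers, it comes down to compressing that sequence.

Speed matters when used in search engines and the like, but when the data is carried in an HTTP request, space efficiency matters more. So the choice is Golomb coding (or rather Rice coding, a special case of it), whose space efficiency is close to the theoretical limit.

What I built for that is github.com/kazuho/golombset.

This week I tweaked the codec a little and added a command-line interface so it can be tried out easily.

So you can git clone it, run make, and use it as follows.

Encoding the sorted integer sequence (100, 155, 931)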

% (echo 100; echo 155; echo 931) | ./golombset --encode | od -t x1
0000000    41  90  6d  c0  ff
0000005

Encoding and then decoding the same sequence

% (echo 100; echo 155; echo 931) | ./golombset --encode | ./golombset --decode
100
155
931

You can see that the sorted integer sequence containing the three values (100,155,931) has been encoded into five bytes.
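
To illustrate the idea of Rice-coding the gaps of a sorted sequence, here is a minimal C sketch. It is not golombset's codec and does not reproduce its bit layout: the function names are hypothetical, the Rice parameter k is passed in by the caller, and the decoder is told the element count instead of detecting the end of the stream.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

struct bitbuf {
    uint8_t *bytes;
    size_t capacity; /* in bytes */
    size_t bitpos;   /* index of the next bit to write or read */
};

static int put_bit(struct bitbuf *b, int bit)
{
    if (b->bitpos / 8 >= b->capacity)
        return -1;
    if (bit)
        b->bytes[b->bitpos / 8] |= (uint8_t)(0x80 >> (b->bitpos % 8));
    ++b->bitpos;
    return 0;
}

static int get_bit(struct bitbuf *b)
{
    int bit;
    if (b->bitpos / 8 >= b->capacity)
        return -1;
    bit = (b->bytes[b->bitpos / 8] >> (7 - b->bitpos % 8)) & 1;
    ++b->bitpos;
    return bit;
}

/* Encodes the gaps between consecutive values (k must be below 32).
 * Returns the number of bytes used, or 0 on overflow or when count is 0. */
static size_t rice_encode(const uint32_t *values, size_t count, unsigned k,
                          uint8_t *out, size_t out_size)
{
    struct bitbuf b = {out, out_size, 0};
    uint32_t prev = 0;

    memset(out, 0, out_size);
    for (size_t i = 0; i < count; ++i) {
        uint32_t gap = values[i] - prev;
        prev = values[i];
        /* quotient in unary: (gap >> k) one-bits followed by a zero-bit */
        for (uint32_t q = gap >> k; q != 0; --q)
            if (put_bit(&b, 1) != 0)
                return 0;
        if (put_bit(&b, 0) != 0)
            return 0;
        /* remainder: the low k bits of the gap, most significant first */
        for (unsigned bit = k; bit-- > 0;)
            if (put_bit(&b, (gap >> bit) & 1) != 0)
                return 0;
    }
    return (b.bitpos + 7) / 8;
}

/* Decodes `count` values; returns 0 on success, -1 on truncated input. */
static int rice_decode(const uint8_t *in, size_t in_size, unsigned k,
                       uint32_t *values, size_t count)
{
    struct bitbuf b = {(uint8_t *)in, in_size, 0};
    uint32_t prev = 0;

    for (size_t i = 0; i < count; ++i) {
        uint32_t q = 0, r = 0;
        int bit;
        while ((bit = get_bit(&b)) == 1)
            ++q;
        if (bit < 0)
            return -1;
        for (unsigned n = 0; n < k; ++n) {
            if ((bit = get_bit(&b)) < 0)
                return -1;
            r = (r << 1) | (uint32_t)bit;
        }
        prev += (q << k) | r;
        values[i] = prev;
    }
    return 0;
}
```

In this layout, with k = 6, the gaps 100, 55 and 776 take 8, 7 and 19 bits, so the three values occupy 34 bits.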

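As a somewhat more practical example, consider putting 100 elements into a Bloom filter whose false-positive rate is 1/100 and encoding it. If we build such a filter from suitably random values and encode it, the result turns out to be 102 bytes.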

% perl -MList::MoreUtils=uniq -e 'my @a = (); while (@a < 100) { @a = uniq sort { $a <=> $b } (@a, int rand(10000)); } print "$_\n" for @a' | ./golombset --encode | wc -c
     102

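In other words, supposing there are 100 files that critically affect browser rendering, such as CSS and JavaScript, the overhead of attaching to an HTTP request a Bloom filter that tells whether they are in the web browser's cache is about 100 bytes. Moreover, when the request is repeated two or more times, HPACK compression kicks in.

That is the story of how H2O's cache-aware server push came to be implemented, on the grounds that this overhead makes it practical, together with an introduction of the library built along with it.

See you next time.

References: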
ImperialViolet - Smaller than Bloom filters
Golomb-coded sets: smaller than Bloom filters (giovanni.bajo.it)

Thursday, October 8, 2015

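Having posted a careless tweet, we ended up having to make ruby faster!

Put the other way round, if the internal implementation of Ruby's string type became a rope, you would get the same speed whether or not you freeze, and it would all end happily with "so we never needed freeze after all", wouldn't it? #雑な感想
— Kazuho Oku (@kazuho) October 6, 2015

When I tweeted this, someone from inside the language implementation replied

@kazuho It's not about manipulating strings but about string identity, so I think it's not quite related
— _ko1 (@_ko1) October 6, 2015

I went "oh, sorry…" and ended up reading the Ruby VM code.

And, well, there were a few things that bothered me, so I ended up working on them.

1. Optimizing the hot path of object creation

Reading the code that was said to be the hot spot, the problem seemed to be that the functions run on object creation are fairly deep. The path that normally executes is short, so adjusting the code so that the compiler emits optimal code for that path should make it faster!!!

So I wrote concept code, submitted it, and it got merged. Yay!!!

* gc.c (newobj_of): divide fast path and slow path
But then it turned out to be hardly any faster. How sad… so I requested some fine-grained fixes.

optimize performance of `rb_str_resurrect` by kazuho · Pull Request #1050 · ruby/ruby
Applying this PR seems to make the microbenchmark in question about 3% to 5% faster.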

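2. Stop sorting heap pages

Tweaking the normally executed side did not raise the speed as much as hoped, so I suspected the cause might be on the slow side, and what caught my eye was the code that sorts heap pages. Current Ruby allocates the "heap pages" that store objects in 16KB units, and keeps a list of them sorted by address so that they can be looked up in reverse. Building this list is expensive.

So I rewrote it as a hash table and confirmed a 2-3% speedup in a benchmark using rdoc, which I reported.

.@_ko1 I was curious, so I roughly converted the heap table into a hash, and it cuts rdoc's run time by 2-3% (i.e. makes it faster) https://t.co/CJhierZZN3
— Kazuho Oku (@kazuho) October 7, 2015

However, in today's Ruby (unlike the article I read long ago) heap pages are apparently aligned to 16KB, so this is a case where a bitmap should be used rather than a hash.

3. Optimizing sweep

Looking at the profiler output after rewriting the heap page sorting, I noticed that the GC sweep was heavy. Reading the code, it looked improvable in terms of both the number of branches and the call depth, so I did a rough pass and again cut run time by about 2-5%. I reported this one as a PR.

optimize gc sweep by kazuho · Pull Request #1049 · ruby/ruby

Combining these three, it looks like the run time of a real application such as rdoc shrinks by more than 5% on my machine! ^Note 1  Satisfied with that, this is my progress over the past two days!!!!!!! I feel like various other things are stalled, but sorrすみああおえtぬさおえうh

From now on I will refrain from careless tweets.

Note 1: If there are no bugs!!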

Wednesday, September 30, 2015

H2O version 1.5.0 released

Today, I am happy to announce the release of H2O version 1.5.0.

Notable improvements from 1.4 series are as follows:

On-the-fly gzip support

This was a feature requested by many people, and I would like to thank Justin Zhu for doing the hard work!

mruby-based scripting

Server-side scripting using mruby is now considered production level.
And now that our API is based on Rack, it should be easy for Ruby programmers to use and learn, thanks to Rack's excellent design and documentation.

For this part, my thank you goes to Ryosuke Matsumoto, Masayoshi Takahashi, Masaki TAGAWA.

cache-aware server push

Server push is an important aspect of HTTP/2; however, it has generally been believed to be hard to use, since web applications do not know what has already been cached on the client side.

With the help of Ilya Grigorik and the Japanese HTTP/2 community, we have essentially solved the issue by introducing cache-aware server push; the server is now capable of tracking what the web browser has in its cache and determining whether or not a resource should be pushed!

We plan to improve the feature in the upcoming releases so that the Web can be even faster!

isolation of private keys

H2O now implements privilege isolation for handling RSA private key operations so that SSL private keys would not leak in case of vulnerabilities such as Heartbleed.

In the upcoming days I will post several blogposts explaining the notable changes. Stay tuned.

Thursday, September 24, 2015

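Neverbleed - isolating RSA private key operations into a separate process

Splitting a program into processes by function and running them under separate privileges, so as to contain the impact of any vulnerability, is a design pattern often seen in programs above a certain size.

qmail is renowned as a mail delivery daemon designed that way, and OpenSSH likewise separates the authentication process from the communication process so that root privileges are not taken over even if there is a bug in the code that handles communication with the outside (see: Privilege Separated OpenSSH).

OpenSSL, on the other hand, implements no such privilege separation. The leak of server private keys during Heartbleed can be attributed to handling private keys and all other communication within the same memory space.

If it doesn't exist, why not build it myself… and so I did. That is Neverbleed.

Neverbleed uses Engine, OpenSSL's extension interface, to isolate operations that use the RSA private key into a dedicated process. The dedicated process is started when OpenSSL is initialized, and loading the private key and all related operations are performed in that process, so even if the server process that uses OpenSSL has a vulnerability, the private key does not leak.

Because it uses OpenSSL's extension interface, no changes to OpenSSL are needed, and the changes required to the server program are minimal. Also, the overhead of communicating with the dedicated process is very small compared with an RSA private key operation, so it is not a problem.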
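
The process layout can be illustrated with a toy sketch: a child process holds a secret and answers requests over a socket pair, so the secret-using operation never runs in the server process. This is not Neverbleed's code and does not touch OpenSSL; the function names are hypothetical, and an XOR with a constant stands in for the RSA private key operation.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdint.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Runs in the dedicated process: answers requests until the peer closes. */
static void key_process_loop(int sock)
{
    /* Toy stand-in for a private key. A real key process would load the key
     * from storage here, after the fork, so the server never holds it. */
    const uint32_t secret = 0x5ec2e7u;
    uint32_t input, output;

    while (read(sock, &input, sizeof(input)) == (ssize_t)sizeof(input)) {
        output = input ^ secret; /* stand-in for the private key operation */
        if (write(sock, &output, sizeof(output)) != (ssize_t)sizeof(output))
            break;
    }
}

/* Forks the key process; returns the socket used to talk to it, or -1. */
static int start_key_process(void)
{
    int fds[2];
    pid_t pid;

    if (socketpair(AF_UNIX, SOCK_STREAM, 0, fds) != 0)
        return -1;
    if ((pid = fork()) == -1) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }
    if (pid == 0) {
        /* child: keep only its end of the socket pair */
        close(fds[0]);
        key_process_loop(fds[1]);
        _exit(0);
    }
    close(fds[1]);
    return fds[0];
}

/* Called by the server: sends the input and waits for the result. */
static int remote_key_op(int sock, uint32_t input, uint32_t *output)
{
    if (write(sock, &input, sizeof(input)) != (ssize_t)sizeof(input))
        return -1;
    if (read(sock, output, sizeof(*output)) != (ssize_t)sizeof(*output))
        return -1;
    return 0;
}
```

The server calls start_key_process() once at startup and remote_key_op() for each operation; closing the socket makes the key process exit.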

It is working well, so Neverbleed is scheduled to be included in H2O version 1.5, to be released this month.

See: http://www.citi.umich.edu/u/provos/ssh/privsep.html

Wednesday, May 27, 2015

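Solving problem 5 of "five problems that disqualify you as a programmer if you can't solve them within an hour" in C

Via "Solving problem 5 of 'five problems a software engineer must be able to solve within an hour' in Java8" and "Solving problem 5 of 'five problems a software engineer must be able to solve within an hour' in Perl6".

The problem is as follows.

Write a program that outputs all possibilities to put "+" or "-" or nothing between the numbers 1, 2, …, 9 (in this order) such that the result is 100. For example, 1 + 2 + 34 - 5 + 67 - 8 + 9 = 100

I think it is a very nice problem, and since the answers above used eval-like techniques, I wondered what it would look like without that kind of cheating, so I wrote it in C.

It took about 30 minutes until it produced the correct answers. Looks like I just about pass as a programmer.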

#include <stdio.h>

#define MAX_POS 9
#define EXPECTED 100

static char buf[32];

static void doit(int pos, int sum, char *p, int sign)
{
    int i, n, s;

    *p++ = sign == 1 ? '+' : '-';
    for (i = pos, n = 0; i <= MAX_POS; ++i, n *= 10) {
        *p++ = '0' + i;
        n += i;
        s = sum + sign * n;
        if (i == MAX_POS) {
            if (s == EXPECTED) {
                *p = '\0';
                printf("%s = %d\n", buf + 1, s);
            }
        } else {
            doit(i + 1, s, p, 1);
            doit(i + 1, s, p, -1);
        }
    }
}   

int main(void)
{
    doit(1, 0, buf, 1);
    return 0;
}

Thursday, May 21, 2015

How to properly spawn an external command in C (or not use posix_spawn)

When spawning an external command, as a programmer, you would definitely want to determine if you have succeeded in doing so.

Unfortunately, posix_spawn (and posix_spawnp) does not provide such a feature. To be accurate, there is no guaranteed way to synchronously determine whether the function has succeeded in spawning the command.

In case of Linux, the function returns zero (i.e. success) even if the external command does not exist.

The documentation suggests that whether the function succeeded in spawning the command should be determined asynchronously, by checking the exit status obtained from waitpid. But such an approach (which waits for the termination of the sub-process) cannot be used if your intention is to spawn an external command that is going to run continuously.

Recently I faced this issue while working on H2O, and came up with a solution: a function that spawns an external command and synchronously returns an error if it fails to do so.

What follows is the core logic I implemented. It is fairly simple; it uses the traditional approach of spawning an external command: fork and execvp. At the same time it uses a pipe with the close-on-exec flag (FD_CLOEXEC) set to detect the success of execvp (the pipe gets closed), and the same pipe is used for returning errno in case the syscall fails.

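#define _GNU_SOURCE /* for pipe2 */
#include <errno.h>
#include <fcntl.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

pid_t safe_spawnp(const char *cmd, char **argv)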
{
    int pipefds[2] = {-1, -1}, errnum;
    pid_t pid;
    ssize_t rret;

    /* create pipe, used for sending error codes */
    if (pipe2(pipefds, O_CLOEXEC) != 0)
        goto Error;

    /* fork */
    if ((pid = fork()) == -1)
        goto Error;

    if (pid == 0) {
        /* in child process */
        execvp(cmd, argv);
        errnum = errno;
        write(pipefds[1], &errnum, sizeof(errnum));
        _exit(127);
    }

    /* parent process */
    close(pipefds[1]);
    pipefds[1] = -1;
    errnum = 0;
    while ((rret = read(pipefds[0], &errnum, sizeof(errnum))) == -1
           && errno == EINTR)
        ;
    if (rret != 0) {
        /* spawn failed */
        while (waitpid(pid, NULL, 0) != pid)
            ;
        pid = -1;
        errno = errnum;
        goto Error;
    }

    /* spawn succeeded */
    close(pipefds[0]);
    return pid;

Error:
    errnum = errno;
    if (pipefds[0] != -1)
        close(pipefds[0]);
    if (pipefds[1] != -1)
        close(pipefds[1]);
    errno = errnum;
    return -1;
}

The actual implementation used in H2O does more; it has a feature to remap the file descriptors so that the caller can communicate with the spawned command via pipes. You can find the implementation here.

I am not sure if this kind of workaround is also needed for other languages, but I am afraid it might be the case.

Anyways I wrote this blogpost as a memo for myself and hopefully others. Happy hacking!

Monday, May 11, 2015

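qrintf version 0.9.2 released, now with Clang support and even faster

For the first time in a while I did some work on qrintf and released version 0.9.2.

As you may know, qrintf is an optimizing filter for sprintf (and snprintf) that works as a wrapper around the C compiler, using the same mechanism as ccache and distcc.

With qrintf, calls to sprintf and snprintf that format integers and strings become up to 10 times faster, and httpd software such as H2O is known to become about 20% faster.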
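
To give a feel for why replacing a format-string call can pay off, here is a hand-written equivalent of sprintf(buf, "%u.%u.%u.%u", ...) for an IPv4 address. It is only an illustration of the kind of specialized code such an optimization aims for, not output generated by qrintf, and all function names are hypothetical.

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Writes a value below 256 in decimal and returns the new end of the text. */
static char *put_octet(char *p, unsigned v)
{
    if (v >= 100)
        *p++ = (char)('0' + v / 100);
    if (v >= 10)
        *p++ = (char)('0' + v / 10 % 10);
    *p++ = (char)('0' + v % 10);
    return p;
}

/* Formats addr as dotted decimal without parsing a format string at run
 * time. buf must hold at least 16 bytes. Returns the string length. */
static size_t format_ipv4(char *buf, uint32_t addr)
{
    char *p = buf;
    for (int shift = 24; shift >= 0; shift -= 8) {
        p = put_octet(p, (addr >> shift) & 0xff);
        *p++ = shift != 0 ? '.' : '\0';
    }
    return (size_t)(p - buf - 1);
}

/* Cross-checks the hand-written version against sprintf. */
static int matches_sprintf(uint32_t addr)
{
    char fast[16], ref[16];
    format_ipv4(fast, addr);
    sprintf(ref, "%u.%u.%u.%u", (unsigned)(addr >> 24), (unsigned)(addr >> 16 & 0xff),
            (unsigned)(addr >> 8 & 0xff), (unsigned)(addr & 0xff));
    return strcmp(fast, ref) == 0;
}
```

For the address 1234567890 both versions produce 73.150.2.210.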

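This release includes the following improvements made since 0.9.1.

Faster number conversion (by @imasahiro) #9 #10
Provisional support for Clang, and the accompanying changes to the command-line interface #13 #16
Suppression of the automatic #include with the -DQRINTF_NO_AUTO_INCLUDE option #14
Support for %.*s #7

The example run below also shows that, with both GCC and Clang, using qrintf achieves a speedup of more than 10x in a benchmark that converts an IPv4 address to a string.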

$ qrintf --version
v0.9.2
$ gcc -O2 examples/ipv4addr.c && time ./a.out 1234567890
result: 73.150.2.210

real 0m2.512s
user 0m2.506s
sys 0m0.003s
$ qrintf gcc -O2 examples/ipv4addr.c && time ./a.out 1234567890
result: 73.150.2.210

real 0m0.173s
user 0m0.170s
sys 0m0.002s
$ clang -O2 examples/ipv4addr.c && time ./a.out 1234567890
result: 73.150.2.210

real 0m2.487s
user 0m2.479s
sys 0m0.004s
$ qrintf clang -O2 examples/ipv4addr.c && time ./a.out 1234567890
result: 73.150.2.210

real 0m0.220s
user 0m0.214s
sys 0m0.002s

Going forward, I would like to bring the benefits of qrintf to more users by bundling it with H2O. I suspect the same could be done in other software projects.

Well then, have fun!

Friday, March 6, 2015

H2O version 1.1.0 released with bug fixes and enhancements in proxy etc.

This is a release announcement of H2O version 1.1.0, the optimized HTTP server with support for HTTP/1 and HTTP/2.

In 1.1.0 we have gone through a major refactor of the proxy implementation, and added three notable features to the H2O standalone server.

Support for x-reproxy-url header #197

With help from @lestrrat, H2O now recognizes the x-reproxy-url header sent by upstream servers, and if found, substitutes the response with the response obtained from the URL specified by the header.

When the feature is activated (by adding the line reproxy: on to the configuration file), any HTTP URL can be served by using the x-reproxy-url header.

There is no support for x-reproxy-file header. Instead, if the authority of the URL matches one of the host entries of the configuration file, then the reproxied request will be handled internally by the handlers of H2O.

Load distribution among upstream servers #208

To resolve the address of the upstream server, H2O calls getaddrinfo each time it needs to connect to the servers. As of version 1.1.0 the call is executed asynchronously (by using dedicated threads for the task), and if multiple entries are returned by the name resolver (e.g. DNS, /etc/hosts, etc.) then H2O connects to one of them selected at random.
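
The selection step can be sketched as below. This is a simplified illustration with a hypothetical function name, not H2O's code: it calls getaddrinfo synchronously on the calling thread (H2O, as described above, runs it on dedicated threads) and uses rand() for the random choice.

```c
#define _POSIX_C_SOURCE 200809L
#include <netdb.h>
#include <stdlib.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>

/* Resolves host:port and copies one randomly chosen address into *out.
 * Returns 0 on success, -1 if resolution fails or yields no entries. */
static int pick_random_address(const char *host, const char *port,
                               struct sockaddr_storage *out, socklen_t *outlen)
{
    struct addrinfo hints, *res, *ai;
    size_t count = 0, index;

    memset(&hints, 0, sizeof(hints));
    hints.ai_socktype = SOCK_STREAM;
    if (getaddrinfo(host, port, &hints, &res) != 0)
        return -1;

    /* count the entries returned by the resolver */
    for (ai = res; ai != NULL; ai = ai->ai_next)
        ++count;
    if (count == 0) {
        freeaddrinfo(res);
        return -1;
    }

    /* walk to a randomly selected entry and copy it out */
    index = (size_t)rand() % count;
    for (ai = res; index != 0; --index)
        ai = ai->ai_next;
    memcpy(out, ai->ai_addr, ai->ai_addrlen);
    *outlen = ai->ai_addrlen;

    freeaddrinfo(res);
    return 0;
}
```

Because the choice is made anew for every connection, adding or removing a record in the resolver changes where new connections go without touching the server.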

The change neatly provides a basis for load balancing. By adjusting the name resolver, you can add or remove an upstream server at any time, or change the selection weight between the servers by using a sophisticated DNS server like MyDNS.

This is only a first step. We plan to polish up the feature for better load balancing.

Directives for tweaking the response headers #204

The following directives have been introduced for mangling the response headers, modeled after those provided by the Apache HTTP server: header.add, header.append, header.merge, header.set, header.setifempty, header.unset.

For more information, a full list of changes with references can be found in the version 1.1.0 changelog; list of configuration directives can be found by running h2o --help.

Thursday, February 19, 2015

H2O, the new HTTP server goes version 1.0.0 as HTTP/2 gets finalized

I am happy to announce the release of H2O version 1.0.0 on the same day HTTP/2 gets finalized. The momentum for HTTP/2 is building up fast.

According to mnot’s blog: HTTP/2 is Done posted today,

The IESG has formally approved the HTTP/2 and HPACK specifications, and they’re on their way to the RFC Editor, where they’ll soon be assigned RFC numbers, go through some editorial processes, and be published.

Web browser developers have already implemented the protocol. Mozilla Firefox is already providing support for the HTTP/2 draft. Google has announced that they would turn on support for HTTP/2 on Chrome within weeks. Internet Explorer 11 on Windows 10 Technical Preview also speaks HTTP/2.

Considering the facts, it seemed that we'd better freeze the configuration directives of H2O now, so that people could rely on the software for serving HTTP/2 requests (note: the library API should still be considered unstable).

Features provided by H2O version 1.0.0 include the following; please refer to the README and `--help` for more information.

support for HTTP/1.x and HTTP/2
static file serving and reverse proxy
HTTP/2 server-push
excellent performance outperforming Nginx
graceful restart and self-upgrade via Server::Starter

Started last summer, H2O is still a very young project. We would never have advanced this fast without so much help from the community (this is especially clear for HTTP/2 support; see H2O issue #133 for an example). I would like to express my gratitude for their advice and suggestions.

We plan to continue improving H2O rapidly. The primary focus is on performance, ease of use, and flexible (even autonomous) reconfiguration that suits the cloud era.

Today, HTTP is facing challengers. With the rise of smartphone apps, it is no longer the only protocol that can be used. But wouldn't it be better if we could all continue using a single, well-known protocol a.k.a. HTTP?

Our goal is by providing an excellent implementation, to keep the protocol as the primary choice of the developers, and furthermore, to expand the adoption of HTTP even more than before.

Stay tuned!

Tuesday, February 10, 2015

[Ann] H2O version 0.9.2 released incl. support for HTTP2 server-push, state-of-art prioritization of streams

I am glad to announce the release of H2O version 0.9.2.

This is the third release of H2O, including a number of changes that can be found in the Changes file. I am happy to mention that some of the changes were contributed by people other than me; in fact, five people have committed improvements to H2O since the last release, and their names can be found at the top of the README.

Among the changes introduced in version 0.9.2 are improvements to the HTTP/2 protocol implementation.

HTTP/2 Server Push

As of version 0.9.2, H2O automatically pushes content using HTTP/2 server push when suggested by the upstream server using the link: rel=preload header. By using the feature, web applications can push the resources required for rendering web pages on the client side, which results in a faster perceived response time from the end user's viewpoint. In other words, web application developers are encouraged to list the files that block rendering using the link: rel=preload header for optimum rendering speed.

Below is an example of a response sent by a web application running behind H2O. H2O recognizes the link header and starts to push the contents of /assets/main.css even before the client recognizes that the CSS is a blocker for rendering the webpage.

HTTP/1.1 200 OK
Content-Type: text/html; charset=utf-8
Link: </assets/main.css>; rel=preload
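
A minimal C sketch of how a server might pick the push target out of such a header value is shown below. It is not H2O's parser: the function name is hypothetical, and the matching is deliberately naive (it only handles a single `<url>` followed by an unquoted rel=preload parameter).

```c
#include <stddef.h>
#include <string.h>

/* Parses a Link header value such as "</assets/main.css>; rel=preload".
 * On success copies the URL into out and returns 1; otherwise returns 0. */
static int extract_preload_target(const char *value, char *out, size_t out_size)
{
    const char *start, *end;
    size_t len;

    /* the target URL is enclosed in angle brackets */
    if (value[0] != '<' || (end = strchr(value, '>')) == NULL)
        return 0;
    start = value + 1;
    len = (size_t)(end - start);

    /* naive check for the rel=preload parameter after the URL */
    if (strstr(end + 1, "rel=preload") == NULL)
        return 0;

    if (len >= out_size)
        return 0;
    memcpy(out, start, len);
    out[len] = '\0';
    return 1;
}
```

For the response above this yields /assets/main.css, which the server can start pushing before the client has parsed the HTML.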

It is also worth noting that H2O is not alone in providing support for HTTP/2 server push. Following the discussion on an H2O issue discussing the topic, nghttp2 has added support for server-push using the same header as well. What's good here is that the developers are working together on HTTP/2 to provide a logical and a vendor-neutral way of providing access to the new technology; I am so happy to be part of the effort.

Improved Scheduler for HTTP/2 Streams

The HTTP/2 specification defines a somewhat complex logic for prioritizing the streams. In H2O version 0.9.2 we have polished up the scheduler implementation used for the purpose. H2O is now not only highly compliant with the specification, implementing all of its aspects, but also excels in performance, as the internal code paths are guaranteed to be O(1).

An implementation of the prioritization logic that fully conforms to the specification is essential for an HTTP/2 server, since web browsers will be tuned against the specification. We may see unnecessary delays in page rendering if any errors (or missing parts) exist within the server program. I am glad that H2O is unlikely to fall into such problems now that we have a complete implementation.

I am pleased that all the changes have been made within three weeks of the last release. My thanks go to the contributors and the people who have given us advice on improving the product.

Please stay tuned for the next release!

Friday, December 26, 2014

[Ann] Initial release of H2O, and why HTTPD performance will matter in 2015

Happy Holidays!

Today I am delighted to announce the first release of H2O, version 0.9.0; this is a Christmas gift from me.

github.com/h2o/h2o

H2O is an optimized HTTP server with support for HTTP/1.x and the upcoming HTTP/2; it can be used either as a standalone server or a library.

Built around PicoHTTPParser (a very efficient HTTP/1 parser), H2O outperforms Nginx by a considerable margin. It also excels in HTTP/2 performance.

Why do we need a new HTTP server? The answer is that its performance will matter in the coming years.

It is expected that the number of files being served by the HTTP server will dramatically increase as we transition from HTTP/1 to HTTP/2.

This is because the current techniques used to decrease the number of asset files (e.g. CSS sprites and CSS concatenation) become a drag on page rendering speed in HTTP/2. Such techniques were beneficial in HTTP/1, since the protocol had difficulty utilizing all the available bandwidth. But in HTTP/2 that issue is fixed, and transmitting all the images and CSS styles used by the website at once, while only some of them are needed to render a specific page, becomes a bad idea. Instead, switching back to sending a small asset file for every required element of the requested webpage becomes the ideal approach.

Having an efficient HTTP/1 server is also a good thing as the idea of microservices is adopted at large scale; it increases the number of HTTP requests transmitted within the datacenter.

As shown in the benchmark charts, H2O is designed with these facts in mind, making it (as we believe) an ideal choice of HTTP server of the future.

With this first release, H2O concentrates on serving static files / working as a reverse proxy at high performance.

Together with the contributors I will continue to optimize / add more features to the server, and hopefully reach a stable release (version 1.0.0) when HTTP/2 becomes standardized in the coming months.

Stay tuned.

PS. It is also great that the tools developed as part of H2O are having wider effects. Not only have we raised the bar on HTTP/2 server performance (nghttp2, a de-facto reference implementation of HTTP/2, has become much faster in recent months), but the performance race among HTTP/1 parsers has once again become active (Performance improvement and benchmark by indutny · Pull Request #200 · joyent/http-parser, Improving PicoHTTPParser further with AVX2), and @imasahiro is working on merging qrintf (a preprocessor that speeds up the sprintf(3) family by an order of magnitude, developed as a by-product of H2O) into Clang. Using H2O as a stepping stone, I am looking forward to bringing in new approaches for running / maintaining websites next year.

Friday, December 19, 2014

Memory Management in H2O

This blogpost (as part of the H2O Advent Calendar 2014) provides a high-level overview of the memory management functions in H2O that can be categorized into four groups.

h2o_mem_alloc, h2o_mem_realloc

They are wrappers of malloc(3) / realloc(3) that call abort(3) if memory allocation fails. The returned chunks should be freed by calling free(3).
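The wrapper pattern described above can be sketched as follows. This is only an illustration under assumptions, not H2O's source: the names alloc_or_abort and realloc_or_abort are hypothetical, and the sketch assumes callers always request a non-zero size.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical wrapper in the spirit of h2o_mem_alloc: it never returns NULL.
 * Assumes sz > 0, since malloc(0) is allowed to return NULL. */
void *alloc_or_abort(size_t sz)
{
    void *p = malloc(sz);
    if (p == NULL) {
        perror("malloc");
        abort();
    }
    return p;
}

/* Hypothetical wrapper in the spirit of h2o_mem_realloc (same assumption). */
void *realloc_or_abort(void *oldp, size_t sz)
{
    void *p = realloc(oldp, sz);
    if (p == NULL) {
        perror("realloc");
        abort();
    }
    return p;
}
```

Because failure never reaches the caller, call sites need no NULL checks; the chunk is still released with plain free(3).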

h2o_mem_init_pool, h2o_mem_clear_pool, h2o_mem_alloc_pool

The functions create, clear, and allocate from a memory pool. The term memory pool has several meanings, but in the case of H2O the term has been borrowed from Apache; it refers to a memory allocator that frees all associated chunks at once when the destructor (h2o_mem_clear_pool) is called.

The primary use-case of the functions is to allocate memory that relates to an HTTP request. The request object h2o_req_t has a memory pool associated with it; small chunks of memory that need to be allocated while handling a request should be obtained by calling h2o_mem_alloc_pool instead of h2o_mem_alloc, since the former is generally faster than the latter.
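To illustrate why a pool can be faster and how everything is freed at once, here is a minimal sketch of a bump-pointer pool. It is an assumption-laden illustration, not H2O's implementation: struct pool, pool_init, pool_alloc, pool_clear and the 4096-byte block size are all hypothetical, and requests larger than one block are simply rejected.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define POOL_BLOCK_SIZE 4096 /* hypothetical block size */

struct pool_block {
    struct pool_block *next;
    size_t used; /* bytes already handed out from data[] */
    _Alignas(max_align_t) char data[POOL_BLOCK_SIZE];
};

struct pool {
    struct pool_block *blocks; /* newest block first */
};

void pool_init(struct pool *pool)
{
    pool->blocks = NULL;
}

/* Allocation is just a pointer bump inside the newest block. */
void *pool_alloc(struct pool *pool, size_t sz)
{
    size_t align = _Alignof(max_align_t);

    if (sz > POOL_BLOCK_SIZE)
        return NULL; /* large requests are out of scope for this sketch */
    sz = (sz + align - 1) / align * align; /* keep later pointers aligned */

    if (pool->blocks == NULL || POOL_BLOCK_SIZE - pool->blocks->used < sz) {
        struct pool_block *b = malloc(sizeof(*b));
        if (b == NULL)
            abort();
        b->next = pool->blocks;
        b->used = 0;
        pool->blocks = b;
    }
    void *p = pool->blocks->data + pool->blocks->used;
    pool->blocks->used += sz;
    return p;
}

/* Frees every chunk obtained from the pool in one pass. */
void pool_clear(struct pool *pool)
{
    while (pool->blocks != NULL) {
        struct pool_block *next = pool->blocks->next;
        free(pool->blocks);
        pool->blocks = next;
    }
}
```

There is no per-chunk free function in this design; that is what makes allocation cheap and ties the lifetime of every chunk to the lifetime of the pool (in H2O's case, the request).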

h2o_mem_alloc_shared, h2o_mem_link_shared, h2o_mem_addref_shared, h2o_mem_release_shared

They are the functions to handle ref-counted chunks of memory. Each shared chunk has its own dispose callback that gets called when the reference counter reaches zero. A chunk can optionally be associated with a memory pool, so that the reference counter gets decremented when the pool gets flushed.

The functions are used for handling things like headers transferred via HTTP/2, or for associating a resource that needs a custom dispose callback with an HTTP request through the use of the memory pool.
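The idea of a ref-counted chunk with a dispose callback can be sketched as below. This is a simplified illustration and not H2O's code: shared_alloc, shared_addref, shared_release, count_dispose and dispose_calls are hypothetical names, and the optional link to a memory pool is left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical header stored in front of every shared chunk. */
struct shared_entry {
    size_t refcnt;
    void (*dispose)(void *payload); /* may be NULL */
    max_align_t payload[];          /* flexible array; keeps payload aligned */
};

static struct shared_entry *entry_of(void *payload)
{
    return (struct shared_entry *)((char *)payload -
                                   offsetof(struct shared_entry, payload));
}

/* Returns a chunk of sz bytes whose reference counter starts at 1. */
void *shared_alloc(size_t sz, void (*dispose)(void *))
{
    struct shared_entry *e = malloc(sizeof(*e) + sz);
    if (e == NULL)
        abort();
    e->refcnt = 1;
    e->dispose = dispose;
    return e->payload;
}

void shared_addref(void *payload)
{
    ++entry_of(payload)->refcnt;
}

/* Returns 1 if this call dropped the last reference and freed the chunk. */
int shared_release(void *payload)
{
    struct shared_entry *e = entry_of(payload);
    if (--e->refcnt != 0)
        return 0;
    if (e->dispose != NULL)
        e->dispose(payload);
    free(e);
    return 1;
}

/* Example callback that only counts how often it has been invoked. */
int dispose_calls = 0;
void count_dispose(void *payload)
{
    (void)payload;
    ++dispose_calls;
}
```

Linking such a chunk to a pool would then amount to having the pool call the release function once when it is cleared.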

h2o_buffer_init, h2o_buffer_dispose, h2o_buffer_reserve, h2o_buffer_consume, h2o_buffer_link_to_pool

The functions provide access to a buffer that can hold an arbitrary number of octets. They internally use malloc(3) / realloc(3) for handling short buffers, and switch to using temporary-file-backed mmap(2) when the length of the buffer reaches a predefined threshold (default: 32MB). A buffer can also be associated with a memory pool by calling the h2o_buffer_link_to_pool function.

The primary use-case of the buffer is to store incoming HTTP requests and POST contents (as it can be used to hold huge chunks on 64-bit systems since it switches to temporary-file-backed memory as described).
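The reserve / consume usage pattern of such a buffer can be sketched as follows. This is a deliberately reduced illustration, not H2O's implementation: struct buffer and the buffer_* functions are hypothetical, only the realloc(3) path is shown, and the switch to temporary-file-backed mmap(2) is omitted.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical growable octet buffer. */
struct buffer {
    char *bytes;     /* start of the stored octets */
    size_t size;     /* number of octets currently stored */
    size_t capacity; /* number of octets allocated */
};

void buffer_init(struct buffer *b)
{
    b->bytes = NULL;
    b->size = 0;
    b->capacity = 0;
}

/* Guarantees room for min_guarantee more octets and returns where to write
 * them; the caller bumps b->size after writing. */
char *buffer_reserve(struct buffer *b, size_t min_guarantee)
{
    if (b->capacity - b->size < min_guarantee) {
        size_t newcap = b->capacity != 0 ? b->capacity : 64;
        while (newcap - b->size < min_guarantee)
            newcap *= 2; /* overflow handling omitted in this sketch */
        char *p = realloc(b->bytes, newcap);
        if (p == NULL)
            abort();
        b->bytes = p;
        b->capacity = newcap;
    }
    return b->bytes + b->size;
}

/* Drops the first n octets, e.g. after a request has been parsed. */
void buffer_consume(struct buffer *b, size_t n)
{
    assert(n <= b->size);
    if (n == 0)
        return;
    memmove(b->bytes, b->bytes + n, b->size - n);
    b->size -= n;
}

void buffer_dispose(struct buffer *b)
{
    free(b->bytes);
    buffer_init(b);
}
```

A reader would typically call buffer_reserve, read(2) into the returned pointer, add the number of octets read to size, and call buffer_consume once the data has been processed.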

h2o_vector_reserve

The function reserves the given number of slots for H2O_VECTOR, which is a variable-length array of an arbitrary data type. Either h2o_mem_realloc or the memory pool can be used as the underlying memory allocator (in the former case, the allocated memory should be manually freed by the caller). The structure is initialized by zero-filling it.

The vector is used everywhere, from storing a list of HTTP headers to a list of configuration directives.
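A vector that is valid when zero-filled and grows on demand can be sketched as below. This is an illustration only: struct int_vector and int_vector_reserve are hypothetical, the element type is fixed to int, and only the realloc(3)-backed variant is shown (H2O_VECTOR itself is generic and can also allocate from a pool).

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical typed vector; a zero-filled struct is a valid empty vector. */
struct int_vector {
    int *entries;
    size_t size;
    size_t capacity;
};

/* Makes sure at least new_capacity slots are available. */
void int_vector_reserve(struct int_vector *v, size_t new_capacity)
{
    if (new_capacity <= v->capacity)
        return;
    size_t cap = v->capacity != 0 ? v->capacity : 4;
    while (cap < new_capacity)
        cap *= 2; /* grow geometrically so that appends stay cheap */
    int *p = realloc(v->entries, cap * sizeof(*p));
    if (p == NULL)
        abort();
    v->entries = p;
    v->capacity = cap;
}
```

Appending is then a reserve of size + 1 followed by a store into entries[size++]; with this realloc-backed variant the caller frees entries when done.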

For details, please refer to the doc-comments and the definitions in include/h2o/memory.h and lib/memory.c.

Tuesday, December 16, 2014

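Why you should use subtrees instead of submodules on GitHub

GitHub has a feature that automatically publishes a source package whenever you push a tag. For scripting languages, each language has a commonly used package management system^Note 1, so this feature probably sees little use there. However, it is very handy when you use GitHub to develop and distribute a program written in a language such as C that has no de facto package management system, or a standalone administration script.

However, because this feature is implemented as a wrapper around the git-archive command, it has a problem: files belonging to submodules are not included. The people at GitHub are aware of this, but for now they do not seem to be planning to address it on their own^Note 2.

I learned about this problem when it was pointed out in an issue on picojson. For picojson the consequence is merely that the tests do not run, so I could have put it off, but it was obvious that the same problem would arise for H2O.

After discussing what to do on IRC and running some experiments, we found that using a subtree instead of a submodule makes it possible to include the referenced files in the output of git-archive, and I have completed the migration to subtrees for picojson.

Changing the way we work because of a tool's behavior is, in a sense, a rather silly story, but I plan to switch H2O to subtrees as well before its release.

(This article is also part of the H2O Advent Calendar 2014.)

Note 1: For example, CPAN for Perl and NPM for JavaScript
Note 2: See the comments on "» Github zip doesn’t include Submodules" on the Academic Technology Group Developers Blog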

Monday, December 15, 2014

PicoHTTPParser now has a chunked-encoding decoder

Today I have added phr_decode_chunked - a function for decoding chunked-encoded input - to picohttpparser.

As suggested in the doc-comment of the function (shown below), the function is designed to decode the data in-place. In other words, it is not copy-less.

/* the function rewrites the buffer given as (buf, bufsz) removing the chunked-
 * encoding headers. When the function returns without an error, bufsz is
 * updated to the length of the decoded data available. Applications should
 * repeatedly call the function while it returns -2 (incomplete) every time
 * supplying newly arrived data. If the end of the chunked-encoded data is
 * found, the function returns a non-negative number indicating the number of
 * octets left undecoded at the tail of the supplied buffer. Returns -1 on
 * error.
 */
ssize_t phr_decode_chunked(struct phr_chunked_decoder *decoder, char *buf,
                           size_t *bufsz);

It is intentionally designed as such.

Consider an input like the following. The example is more than 2MB long even though it contains only 2 bytes of data. The input conforms to the HTTP/1.1 specification, since the specification does not define a maximum length for chunk extensions and requires every conforming implementation to ignore unknown extensions.

1 very-very-veery long extension that lasts ...(snip) 1MB
a
1 very-very-veery long extension that lasts ...(snip) 1MB
a

To handle such input without exhausting memory, a decoder should either a) preserve only the decoded data (which requires a copy), or b) limit the size of the chunked-encoded data.

Option b might have been easier to implement, but such a limit might be difficult to administer. So I decided to take route a, and for simplicity implemented the decoder to always adjust the position of the data in-place.

Always calling memmove to adjust the position might induce some overhead, but I assume it to be negligible for two reasons: both the source and the destination would be in the CPU cache, and the overhead of unaligned memory access is small on recent Intel CPUs.
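The in-place approach can be illustrated with a much simplified decoder. Note that this is not phr_decode_chunked: decode_chunked_inplace is a hypothetical function that assumes the whole chunked body is already in the buffer, ignores trailers, and is lenient about line endings. It only shows how decoded octets are moved toward the head of the buffer with memmove while extensions of any length are skipped.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

static int hex_value(int c)
{
    if ('0' <= c && c <= '9')
        return c - '0';
    if ('a' <= c && c <= 'f')
        return c - 'a' + 10;
    if ('A' <= c && c <= 'F')
        return c - 'A' + 10;
    return -1;
}

/* Hypothetical, simplified decoder. Returns 0 and stores the decoded length
 * in *bufsz on success, or -1 if the input is malformed or truncated. */
int decode_chunked_inplace(char *buf, size_t *bufsz)
{
    size_t src = 0, dst = 0, end = *bufsz;

    for (;;) {
        size_t chunk_len = 0, digits = 0;
        int v;

        /* parse the hexadecimal chunk size */
        while (src < end && (v = hex_value((unsigned char)buf[src])) >= 0) {
            chunk_len = chunk_len * 16 + (size_t)v;
            ++src;
            ++digits;
        }
        if (digits == 0 || digits > 8)
            return -1;
        /* skip chunk extensions, however long, up to the end of the line */
        while (src < end && buf[src] != '\n')
            ++src;
        if (src == end)
            return -1;
        ++src;
        if (chunk_len == 0)
            break; /* last chunk; trailers are ignored in this sketch */
        if (end - src < chunk_len)
            return -1;
        /* move the chunk data next to the data decoded so far */
        memmove(buf + dst, buf + src, chunk_len);
        dst += chunk_len;
        src += chunk_len;
        if (end - src < 2 || buf[src] != '\r' || buf[src + 1] != '\n')
            return -1;
        src += 2;
    }
    *bufsz = dst;
    return 0;
}
```

Since the write position never overtakes the read position, the decoded data always fits in the space the encoded data occupied, which is why no second buffer is needed.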

For ease-of-use, I have added examples to the README.

Saturday, December 13, 2014

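How to write type-safe variadic functions in C

Variable-length argument lists in C are said not to be type-safe (passing an argument of the wrong type does not cause a compile error). That is a correct understanding of the language specification, but for functions that take an arbitrary number of arguments of one specific type, type safety can be secured by using a macro.

I will explain using as an example a function "sumf" that takes an arbitrary number of doubles and returns their sum.

Defining sumf with C's variadic argument mechanism looks like this.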

#include <math.h>
#include <stdarg.h>
#include <stdio.h>

static double sumf(double nfirst, ...)
{
  double r = 0, n;
  va_list args;

  va_start(args, nfirst);
  for (n = nfirst; ! isnan(n); n = va_arg(args, double))
    r += n;
  va_end(args);

  return r;
}

int main(int argc, char **argv)
{
  printf("%f\n", sumf(NAN)); /* => 0 */
  printf("%f\n", sumf(1., NAN)); /* => 1 */
  printf("%f\n", sumf(1., 2.5, 3., NAN)); /* => 6.5 */
  return 0;
}

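However, this definition has two problems: because NAN is used as the terminator, NAN cannot be passed as an argument (in other words, a value representing the terminator is required), and it is not type-safe. As for the latter, if you carelessly pass an argument that is not a double, as in sumf(1, 1, NAN), no compile error occurs; the result is simply wrong, or the program dumps core^Note 1.

So how should sumf be defined? To give away the answer, like this.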

#include <stdio.h>

#define sumf(...)                                       \
  _sumf(                                                \
    (double[]){ __VA_ARGS__ },                          \
    sizeof((double[]){ __VA_ARGS__ }) / sizeof(double)  \
  )

static double _sumf(double *list, size_t count)
{
  double r = 0;
  size_t i;

  for (i = 0; i != count; ++i)
    r += list[i];

  return r;
}

int main(int argc, char **argv)
{
  printf("%f\n", sumf()); /* => 0 (see Note 2) */
  printf("%f\n", sumf(1.)); /* => 1 */
  printf("%f\n", sumf(1., 2.5, 3)); /* => 6.5 */
  return 0;
}

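In this definition, the macro initializes the variable-length list of arguments inline as an array, and computes the number of elements with the sizeof operator. As a result, neither of the problems of C's standard variadic argument mechanism occurs. Because the number of elements is passed to the _sumf function as the argument count, no special terminator value is needed. And because the actual arguments are built into an array of double on the caller's side, passing an argument of the wrong type either causes a compile error or, for example when an int value is passed, is converted to double by the compiler.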

In H2O, the HTTP server we are developing, we use this technique to define and use h2o_concat, a type-safe string concatenation function.
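As a sketch of how the same technique carries over to string concatenation, consider the following. It is not the actual definition of h2o_concat: the macro concat and the function concat_list are hypothetical names, and the result is allocated with malloc(3) rather than from a memory pool.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: joins count C strings into a newly allocated string. */
char *concat_list(const char *const *list, size_t count)
{
    size_t total = 0, i;

    for (i = 0; i != count; ++i)
        total += strlen(list[i]);

    char *out = malloc(total + 1);
    if (out == NULL)
        abort();

    char *p = out;
    for (i = 0; i != count; ++i) {
        size_t len = strlen(list[i]);
        memcpy(p, list[i], len);
        p += len;
    }
    *p = '\0';
    return out;
}

/* The arguments are collected into an array of const char *, and the element
 * count is derived with sizeof, exactly as in the sumf macro. */
#define concat(...)                                                    \
  concat_list(                                                         \
    (const char *const[]){ __VA_ARGS__ },                              \
    sizeof((const char *const[]){ __VA_ARGS__ }) / sizeof(const char *) \
  )
```

With this definition concat("foo", "-", "bar") yields "foo-bar", whereas an argument that is not a string pointer, such as concat("foo", 1), is diagnosed by the compiler (as a warning or an error, depending on the compiler and its options).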

That concludes this short introduction to a C trick used in H2O.

(This article is part of the H2O Advent Calendar 2014.)

Note 1: In my environment, sumf(1, 1, NAN) returns 1
Note 2: Passing zero arguments to a variadic macro violates the C99 standard, but GCC and Clang handle it without problems