Kazuho's Weblog: October 2015

Thursday, October 22, 2015

Performance improvements with HTTP/2 push and server-driven prioritization

tl;dr

HTTP/2 push only marginally improves web-site performance (even when it does). But it might provide better user experience over mobile networks with TCP middleboxes.

Introduction

Push is an interesting feature of HTTP/2.

By using push, HTTP servers can start sending certain asset files that block rendering (e.g. CSS and script files) before the web browser issues requests for such assets. I have heard (or spoken myself of) anticipations that by doing so we might be able to cut down the response time of the Web.

CASPER (cache-aware server pusher)

The biggest barrier in using HTTP/2 push has so far been considered cache validation.

For the server to start pushing asset files, it needs to be sure that the client does not already have the asset cached. You would never want to push a asset file that is already been cached by the client - doing so not only will waste the bandwidth but also cause negative effect on response time. But how can a server determine the cache state of the client without spending an RTT asking to the client?

That's were CASPER comes in.

CASPER (abbreviation for cache-aware server pusher) is a function introduced in H2O version 1.5, that tracks the cache state of the web browser using a single Cookie. The cookie contains a fingerprint of all the high-priority asset files being cached by the browser compressed using Golomb-compressed sets.

The cookie is associated with every request sent by the client¹. So when the server receives a request to HTML, it can immediately determine whether or not the browser is in possession of the blocking assets (CSS and script files) required to render the requested HTML. And it can push the only files that are known not to be cached.

With CASPER, it has now become practical to use HTTP/2 push for serving asset files.

Using HTTP/2 push in H2O

This week, I have started using CASPER on h2o.examp1e.net - the tiny official site of H2O.

The configuration looks like below. The mruby handler initiates push of JavaScript and CSS files if the request is against the top page or one of the HTML documents, and then (by using 399 status code) delegates the request to the next handler (defined by file.dir directive) that actually returns a static file. http2-casper directive is used to turn CASPER on, so that the server will discard push attempts initiated by the mruby handler for assets that are likely cached. http2-reprioritize-blocking-assets is a performance tuning option that raises the priority of blocking assets to highest for web browsers that do not.

"h2o.examp1e.net:443":
    paths:
      "/":
        mruby.handler: |
          lambda do |env|
            push_paths = []
            if /(\/|\.html)$/.match(env["PATH_INFO"])
              push_paths << "/search/jquery-1.9.1.min.js"
              push_paths << "/search/oktavia-jquery-ui.js"
              push_paths << "/search/oktavia-english-search.js"
              push_paths << "/assets/style.css"
              push_paths << "/assets/searchstyle.css"
            end
            return [
              399,
              push_paths.empty? ?
                {} :
                {"link" => push_paths.map{|p| "<#{p}>; rel=preload"}.join("\n")},
              []
            ]
          end
        file.dir: /path/to/doc-root
    http2-casper: ON
    http2-reprioritize-blocking-assets: ON

The benchmark

With the setting above (with http2-reprioritize-blocking-assets both OFF and ON), I have measured unload, first-paint, and load timings² using Google Chrome 46, and the numbers are as follows. The RTT between the server and the client was ~25 milliseconds. The results were mostly reproducible between multiple attempts.

First, let's look at the first two rows that have push turned off. It is evident that reprio:on³ starts rendering the response 50 milliseconds earlier (2 RTT). This is because unless the option is turned on, the priority tree created by Chrome instructs the web browser to interleave responses containing CSS / JavaScript with those containing image files.

Next, let's compare the first two rows (push:off) with the latter two (push:off). It is interesting that unload timings have moved towards right. This is because when push is turned on, the contents of the asset files are sent before the contents of the HMTL. Since web browsers unload the previous page when it receives the first octets of the HTML file, using HTTP/2 push actually increases the time spent until the previous page is unloaded.

The fact will have both positive and negative effects to user experience; the positive side is that time user sees a blank screen decreases substantially (the red section - time spent after unload before first-paint). The negative side is that users would need to wait longer until he/she knows that the server has responded (by the browser starting to render the next page).

It is not surprising that turning on push only somewhat improves the first-paint timing compared to both being turned off; the server is capable of sending more CSS and JavaScript before it receives request for image files, and start interleaving the responses with them.

On the other hand, it might be surprising that using push together with reprioritization did not cause any differences. The reason is simple; in this scenario, transferring the necessary assets and the <head> section of the HTML (in total about 320KB) required about 10 round trips (including overhead required by TCP, TLS, HTTP/2). With this much roundtrips, the merit of push can hardly be observed; considering the fact that push is technique to eliminate one round trip necessary for the browser to issue requests for the blocking assets⁴.

Conclusion

The benchmark reinforces the claims made by some that HTTP2 push will only have marginal effect on web performance⁵. The results have been consistent with expectations that using push will only optimize the web performance by 1 RTT at maximum, and it would be hard to observe the difference considering the effect of TCP slow start and how many roundtrips are usually required to render a web page.

This means to the users of H2O (with reprioritization turned on by default) that they can expect near-best performance without using push.

On the other hand, we may still need to look at networks having TCP proxies. As discussed in Why TCP optimisation has become more important than content optimization (devcentral.f5.com) some mobile carriers do seem to have such middlebox installed.

Existence of such device is generally a good thing since it not only reduces packet retransmits but also improves TCP bandwidth during the slow-start phase. But the downside is that their existence usually increase application-level RTT, since they expand the amount of data in-flight, which has a negative impact on the responsiveness of HTTP/2. HTTP/2 push will be a good optimization under such network conditions.

Notes:
1. the fingerprint contained in the Cookie header is efficiently compressed by HPACK
2. wpbench was used to collect the numbers; first-paint was calculated as max(head-parsed, css-loaded); in this benchmark, DOMContentLoaded timing was indifferent to first-paint
3. starting from H2O version 1.5, http2-reprioritize-blocking-assets option is turned on by default
4. at 10 RTT it is unlikely that we have hit the maximum network bandwidth, and that means that packets will be received by the browser in batch every RTT
5. there are use cases for HTTP/2 push other than pushing asset files

Thursday, October 8, 2015

雑なツイートをしてしまったばかりにrubyを高速化するはめになった俺たちは！

逆に言うと、Rubyの文字列型の内部実装がropeになれば、freezeしてもしなくても変わらない速度が出るようになって、結局freezeする必要なんてなかったんやーで丸く収まるんじゃないの？と思いました #雑な感想
— Kazuho Oku (@kazuho) October 6, 2015

とツイートしたところ、処理系の中の人から

@kazuho 文字列を弄る話じゃなくて、文字列の identity の話なので、ちょっと関係ないかなぁ、と
— _ko1 (@_ko1) October 6, 2015

みたいなツッコミをもらって、うっすみません…ってなってRuby VMのコードを読むことになったわけです。

で、まあ、いくつか気になる点があったので手をつけてしまいました。

1. オブジェクト生成のホットパスの最適化

ホットスポットだとされていたところのコードを読んでると、オブジェクト生成の際に走る関数が割と深いのが問題っぽかった。通常実行されるパスは短いから、それにあわせて最適なコードがはかれるようにコードを調整すれば速くなるはず！！！

とコンセプトコードを書いて投げたら取り込まれた。やったね！！！

* gc.c (newobj_of): divide fast path and slow path
と思ったら、ほとんど速くなってないっぽい。これは悲しい…ということで、細かな修正を依頼。

optimize performance of `rb_str_resurrect` by kazuho · Pull Request #1050 · ruby/ruby
このPR適用すると、問題のマイクロベンチマークが3%〜5%くらい速くなるっぽい。

2. ヒープページのソートをやめる

通常実行される側をいじっても期待ほど速度が上がらなかったので、これは遅い側に原因があるかも…って見ていて気になったのが、ヒープページをソートする処理。現状のRubyは、オブジェクトを格納する「ヒープページ」を16KB単位で確保するんだけど、これを逆参照できるように、アドレスでソートした一覧を持ってる。この構築コストがでかい。

で、これをヒープに書き直してみたところ、rdocを使ったベンチマークで2〜3%の高速化が確認できたので報告。

.@_ko1 気になったので、heap tableを雑にハッシュ化してみましたが、rdocで2〜3%実行時間が減ります（高速化します）ね https://t.co/CJhierZZN3
— Kazuho Oku (@kazuho) October 7, 2015

ただ、今日のRubyは、（昔読んだ記事とは異なり）ヒープページが16KB単位でアラインされているということなので、ヒープを使うよりもビットマップを使うべき案件。

3. スイープの最適化

ヒープページのソートを書き直したあとでプロファイラの出力を眺めていたら、GCのスイープ処理が重たいことに気づいた。コードを読んだところ、分岐回数と呼出深度の両面で改善が望めそうだったので、ざっくりやったところ、やはり2〜5%程度実行時間の短縮ができた。ので、これはPRとして報告。

optimize gc sweep by kazuho · Pull Request #1049 · ruby/ruby

この３つを組み合わせると、rdocみたいな実アプリケーションの実行時間が、手元で5%以上縮みそう！^注1　ってことで満足したのがここ二日間の進捗です！！！！！！！　なんかいろいろ滞っているような気がしますがすみああおえtぬさおえうh

これからは雑なツイートを慎みたいと思います。

注1: バグがなければ！！

Tuesday, October 6, 2015

[メモ] OS XのホストからVMにnfsでファイル共有

普段OS X上で作業しつつ、開発ディレクトリをOS X上のVMで動いているLinuxやFreeBSDからもアクセスできるようにしてあると、互換性検証がはかどる。

どう設定するか、備忘録をかねてメモ。

1. ゲスト側で通常使用するアカウントのuser-id,group-idを、OS Xのそれに揃える

2. OS Xの/etc/exportsと/etc/nfs.confを以下のように、TCP経由でVMの仮想ネットワークにだけファイルを公開するよう設定

/etc/exports:
/shared-dir -mapall=user:group -network netaddr -mask netmask

/etc/nfs.conf:
nfs.server.udp=0
nfs.server.tcp=1

3. ゲスト側の/etc/fstabにマウント情報を設定

192.168.140.1:/shared-dir /shared-dir nfs rw,noatime 1 0

これで、ホストでも、どのVMでも、/shared-dirの中身が同じになる。

注. VMware Fusion 10の場合は、環境設定から、新しいネットワークを作成し、「NATを使用する」を外し、そのネットワークを使う必要がある。そうしない限り、ホストに見えるTCPのソースアドレスがローカルネットワークのものにならない。

できるだけ単純に構成しようとすると、こんな感じになるかな、と。設定に問題があるようだったら教えてください。

Thursday, October 1, 2015

ウェブページの描画 (first-paint) までの時間を測定するツールを作った件、もしくはHTTP2時代のパフォーマンスチューニングの話

ウェブページの表示までにかかる時間をいかに短くするかってのは、儲かるウェブサイトを構築する上で避けて通れない、とても重要な要素です。

少し古いデータとしては、たとえば、ウェブページの表示が500ミリ秒遅くなると広告売上が1.2%低下するというBingの例なんかも知られているわけです。

「ウェブページの表示までにかかる時間」と言った場合、実際には以下のようないくつかのメトリックがあります。

イベント	意味
unload	現在のページからの離脱。離脱後、first-paintまでは真っ白な表示になります
first-paint	ウェブページの初回描画（HTMLの後半や画像は存在しない可能性があります）
DOMContentLoaded	ウェブページのレイアウト完了
load (onload)	画像等を含む全データの表示完了

ウェブサイトチューニングにおいては、

ユーザができるだけ早くウェブページを閲覧し始めることができるよう、first-paintの値を小さくすることを第一の目標^注1
全データができるだけ早く揃うよう、loadの値を小さくすることを第二の目標

とすることが一般的かと思います。

ですが、残念なことに、first-paintまでの時間をAPIを用いて取得できるウェブブラウザは一部に限られています（参照:「Webページ遷移時間のパフォーマンス「First Paint」を計測する方法」）。また、測定にあたって、運用中のウェブページに変更を加えたくない、ということもあったりします。

wpbenchは、これら２点の問題を解消するベンチマークツールです。

github.com/kazuho/wpbench

wpbenchは、たった１枚のHTMLファイルです。このHTMLファイルを自分のウェブサイトに追加するだけで、そのサイト上の任意のページの表示までにかかる時間が測定可能になります。また、body.clientHeightをポリングするという力技により、FirefoxやChromeのようなfirst-paint取得のためのAPIを提供しないウェブブラウザにおいても、その値を取得、表示することができます。

HTTP/1.1を使っている場合、これらパフォーマンスメトリックは、専らネットワーク状況とウェブブラウザのみに依存して決まる値でした。しかし、HTTP/2では、ウェブブラウザとウェブサーバが恊働して、アセットの転送順序を制御することができるようになった結果、使用するウェブサーバによってパフォーマンスが大幅に変わる、という状況が産まれはじめています。

たとえば、HTTP2サーバ「H2O」のベンチマークページを見ていただくと、ウェブサーバによってfirst-paintまでの時間が倍近く異なるケースがあることがお分かりいただけるかと思います。

既にパフォーマンスチューニング済のウェブサイトを管理している方々も、HTTPS、あるいはHTTP/2への対応にあたり、いま一度、ウェブサイトの表示速度を確認されることをおすすめいたします。

注1: ブラウザによってはfirst-paintのタイミングを取得できないため、DOMContentLoadedを使うこともありますが、DOMContentLoadedには<body>末尾に配置する統計系のスクリプトの読み込みにかかる時間等が加算される点等、注意が必要になります。また、画像なしに閲覧が不可能なサイトについては、Above the foldの値をチューニングする必要があります