- the need to write assembly for each architecture
- the use of inline assembly hinders certain compiler optimizations (such as register allocation)
But thanks to LLVM, it is now possible to write processor-independent assembly functions in LLVM bitcode, and to use its link-time optimizer to get those functions expanded inline into C/C++ code.
I have created an example that shows how to implement fast integer arithmetic (with overflow detection) in C++, which is available at github.com/kazuho/add_with_overflow.
It uses a bitcode-level intrinsic called "llvm.sadd.with.overflow.i32" (that gets inlined) to implement integer addition with overflow check.
With the example, the source code

    if (! add_with_overflow(&ret, x, y))

gets compiled into

    addl 8(%rsp), %esi
    jno LBB1_4

As can be seen, the generated code is highly optimized. Not only does it use the JNO instruction, but the source operand of ADDL is also read directly from the stack (which should be faster than loading it into a register first, since the value is never referred to again). This kind of optimization has not been possible with GCC's inline assembly (which requires the arguments to be loaded into registers).
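For readers who want to experiment without setting up a bitcode build, the same contract can be approximated in plain C++. The block below is only a sketch under assumptions, not the code from the repository: the name add_with_overflow_sketch, its exact signature, and the "return true on success" convention are guesses made from the call site shown above, and it swaps the hand-written bitcode for the `__builtin_add_overflow` builtin offered by recent GCC and Clang. Clang generally lowers that builtin, for `int` operands, to the same `llvm.sadd.with.overflow.i32` intrinsic, so the generated code should be comparable, though I have not verified that it is identical.

```cpp
#include <cassert>
#include <climits>

// Hypothetical stand-in for the repository's add_with_overflow().
// Assumed contract (inferred from the call site, not confirmed):
//   returns true and stores x + y in *result when the sum fits in an int,
//   returns false and leaves *result untouched when the sum overflows.
static inline bool add_with_overflow_sketch(int* result, int x, int y) {
    assert(result != nullptr);
    int sum;
    // GCC/Clang builtin: returns true if the addition overflowed.
    if (__builtin_add_overflow(x, y, &sum)) {
        return false;
    }
    *result = sum;
    return true;
}

// Example caller mirroring the `if (! add_with_overflow(&ret, x, y))` pattern:
// yields `fallback` when the addition would overflow.
static inline int add_or_default(int x, int y, int fallback) {
    int ret;
    if (!add_with_overflow_sketch(&ret, x, y)) {
        return fallback;
    }
    return ret;
}
```

With this sketch, add_or_default(INT_MAX, 1, -1) takes the overflow branch and yields the fallback value, while add_or_default(40, 2, -1) yields the ordinary sum. The design trade-off is that the builtin is compiler-specific, whereas the bitcode approach described in this post keeps the function itself independent of the C++ front end.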
Since the output (after inline expansion) is a .s file (processor-dependent assembly), it is possible to link the optimized code using other linkers as well.
Note: this work is based on "Fast integer overflow detection" by Xi Wang, and I would like to thank the author for his excellent work.
Note 2: Since the bitcode instructions might change in the future, it might be a good idea to keep the functions written in bitcode as short as possible.