Today my plan was to implement use of ternary flint operations, i.e. addmul and submul. However, I must confess I severely underestimated the complexity of this. To get an idea of what is going on, suppose we want to evaluate the expression a + b*c into a temporary. The idea is that, if a is itself an expression template, and assuming for simplicity that both b and c are ordinary fmpzs, we first evaluate a into the temporary (henceforth tmp) and then call addmul(tmp, b, c).
The devil hides in "for simplicity". Namely, if b and/or c are allowed to be expression templates as well, things get difficult. The optimal order of evaluation depends on the number of temporaries required for evaluating each of a, b, and c (henceforth t1, t2 and t3). Moreover, the total number of temporaries required depends not only on the maximum of t1, t2 and t3, but also on the number of coincidences among these. Additionally, if b or c is an immediate, then evaluation still has to proceed differently, because in this case we cannot determine where the evaluation ends up (because we don't evaluate anything at all). A final complication arises because even for purposes of evaluating a, b, and c, they are not on equal footing. Namely, we need a to end up in tmp, whereas b and c need to end up in newly allocated temporaries.
I now have a first implementation of addmul on github. It takes up about 550 lines of code. The code takes care of all the above difficulties; adding some missing bits (handling a*b + c, and submul) should not significantly increase the code size. Neither should making this a bit more generic (since it is currently tied to fmpz). I am making one simplifying assumption, though: that the expression we are evaluating is homogeneous. By this I mean that it only involves one data type (fmpz in this case). This way instead of merging temporaries, we just need to count them.
Tomorrow I will finish up this code. Missing features I have indicated above; beyond that we need a lot more testing. I am already testing all the different scenarios alluded to in the second paragraph. Additionally, I will need to test the new functionality, and some more complex expressions. But most importantly, we need more codegen tests. This expression template evaluation code is highly complex, and involves a lot of "nominally runtime" code which has to be optimized away; I have to verify that.
One interesting observation is that this code also seems to stress test the compiler quite a bit. Compiling t-mpz.cpp using gcc requires about 2GB of memory, and takes on the order of 2 minutes on my machine (intel Core 2 Duo P8700 @ 2.53GHz, 4 GB main memory -- most of the time is probably spent allocating memory). In comparison, clang requires about 400MB and 10 seconds.
Wowsers, we should have thought about ternary expressions when we thought about temporary allocation. Of course they are very useful to have, especially if you want the greatest speed possible. But they are quite tricky for an expressions template.
ReplyDeleteI'm amazed that gcc requires 2 minutes to compile this, and so much memory. I'd almost file that under "bug".
Well I did think about it in my prototype, which is why I was confident it could be done. I just hadn't thought through how difficult it would actually be to do it in this generality.
ReplyDeleteActually I think my comparison of compilers has been a bit faulty, since I ran g++ with optimizations enabled and clang++ without. Not asking gcc to optimize reduces the resident set to about 1.2GB and compile time to 40 seconds. Enabling optimizations in clang actually decreases its resident set, but increases compile time to 45 seconds. So gcc is still quite a bit worse, but not by quite as much as I initially claimed.
ReplyDeleteClang certainly make a lot of noise about how much faster their compiler is, so this isn't unexpected I guess.
ReplyDelete