Today my plan was to implement use of ternary flint operations, i.e. addmul and submul. However, I must confess I severely underestimated the complexity of this. To get an idea of what is going on, suppose we want to evaluate the expression
a + b*c into a temporary. The idea is that, if
a is itself an expression template, and assuming for simplicity that both
b and
c are ordinary
fmpzs, we first evaluate
a into the temporary (henceforth
tmp) and then call
addmul(tmp, b, c).
The devil hides in "for simplicity". Namely, if
b and/or
c are allowed to be expression templates as well, things get difficult. The optimal order of evaluation depends on the number of temporaries required for evaluating each of a, b, and c (henceforth
t1,
t2 and
t3). Moreover, the total number of temporaries required depends not only on the maximum of t1, t2 and t3, but also on the number of coincidences among these. Additionally, if b or c
is an immediate, then evaluation still has to proceed differently, because in this case we cannot determine where the evaluation ends up (because we don't evaluate anything at all). A final complication arises because even for purposes of evaluating
a,
b, and
c, they are not on equal footing. Namely, we need a to end up in
tmp, whereas
b and
c need to end up in newly allocated temporaries.
I now have a first implementation of addmul on github. It takes up about 550 lines of code. The code takes care of all the above difficulties; adding some missing bits (handling
a*b + c, and submul) should not significantly increase the code size. Neither should making this a bit more generic (since it is currently tied to fmpz). I am making one simplifying assumption, though: that the expression we are evaluating is
homogeneous. By this I mean that it only involves one data type (fmpz in this case). This way instead of merging temporaries, we just need to count them.
Tomorrow I will finish up this code. Missing features I have indicated above; beyond that we need a lot more testing. I am already testing all the different scenarios alluded to in the second paragraph. Additionally, I will need to test the new functionality, and some more complex expressions. But most importantly, we need more codegen tests. This expression template evaluation code is highly complex, and involves a lot of "nominally runtime" code which has to be optimized away; I have to verify that.
One interesting observation is that this code also seems to stress test the compiler quite a bit. Compiling
t-mpz.cpp using gcc requires about 2GB of memory, and takes on the order of 2 minutes on my machine (intel Core 2 Duo P8700 @ 2.53GHz, 4 GB main memory -- most of the time is probably spent allocating memory). In comparison, clang requires about 400MB and 10 seconds.