If you recall the last blog post about asm, we had a test expression as follows
out = ((a + b) + (c + d)) + ((a + c) + (b + d));
which "compiled into the following C-equivalent"
fmpz_t tmp1, tmp2, tmp3, tmp4, tmp5, tmp6;
fmpz_init(tmp1);
fmpz_init(tmp2);
fmpz_init(tmp3);
fmpz_init(tmp4);
fmpz_init(tmp5);
fmpz_init(tmp6);
fmpz_add(tmp1, a._data(), b._data());
fmpz_add(tmp2, c._data(), d._data());
fmpz_add(tmp3, tmp1, tmp2);
fmpz_add(tmp4, a._data(), c._data());
fmpz_add(tmp5, b._data(), d._data());
fmpz_add(tmp6, tmp1, tmp2);
fmpz_add(out._data(), tmp3, tmp6);
fmpz_clear(tmp1);
fmpz_clear(tmp2);
fmpz_clear(tmp3);
fmpz_clear(tmp4);
fmpz_clear(tmp5);
fmpz_clear(tmp6);
fmpz_init(tmp1);
fmpz_init(tmp2);
fmpz_init(tmp3);
fmpz_init(tmp4);
fmpz_init(tmp5);
fmpz_init(tmp6);
fmpz_add(tmp1, a._data(), b._data());
fmpz_add(tmp2, c._data(), d._data());
fmpz_add(tmp3, tmp1, tmp2);
fmpz_add(tmp4, a._data(), c._data());
fmpz_add(tmp5, b._data(), d._data());
fmpz_add(tmp6, tmp1, tmp2);
fmpz_add(out._data(), tmp3, tmp6);
fmpz_clear(tmp1);
fmpz_clear(tmp2);
fmpz_clear(tmp3);
fmpz_clear(tmp4);
fmpz_clear(tmp5);
fmpz_clear(tmp6);
Obviously this uses more temporaries than necessary. But at the time of the last post, I had not yet implemented tuple merging, so this would have been hard to improve. But of course new we do have tuple merging, and so since today I want to start implement mpz "in some completeness", it seemed reasonable to start by employing the tuple merging code to get efficient generic binary operations. And so I did. The resulting C-equivalent is now the following:
fmpz_t tmp1, tmp2, tmp3;
fmpz_init(tmp1);
fmpz_init(tmp2);
fmpz_init(tmp3);
fmpz_add(tmp1, a._data(), b._data());
fmpz_add(tmp2, c._data(), d._data());
fmpz_add(tmp3, tmp1, tmp2);
fmpz_add(tmp1, a._data(), c._data());
fmpz_add(tmp2, b._data(), d._data());
fmpz_add(tmp1, tmp1, tmp2);
fmpz_add(out._data(), tmp1, tmp3);
fmpz_clear(tmp1);
fmpz_clear(tmp2);
fmpz_clear(tmp3);
fmpz_init(tmp1);
fmpz_init(tmp2);
fmpz_init(tmp3);
fmpz_add(tmp1, a._data(), b._data());
fmpz_add(tmp2, c._data(), d._data());
fmpz_add(tmp3, tmp1, tmp2);
fmpz_add(tmp1, a._data(), c._data());
fmpz_add(tmp2, b._data(), d._data());
fmpz_add(tmp1, tmp1, tmp2);
fmpz_add(out._data(), tmp1, tmp3);
fmpz_clear(tmp1);
fmpz_clear(tmp2);
fmpz_clear(tmp3);
As far as I can tell this is optimal (except for moving brackets using associativity). If not please let me know.
Of course there is a more efficient way of evaluating the above expression. In C++, it would be (for example)
out = (a + (((b + (c + (a + b))) + c) + d));
and the C-equivalent would then be
fmpz_t tmp;
fmpz_init(tmp);
fmpz_add(tmp, a._data(), b._data());
fmpz_add(tmp, c._data(), tmp);
fmpz_add(tmp, b._data(), tmp);
fmpz_add(tmp, tmp, c._data());
fmpz_add(tmp, tmp, d._data());
fmpz_add(out._data(), a._data(), tmp);
fmpz_clear(tmp);
fmpz_init(tmp);
fmpz_add(tmp, a._data(), b._data());
fmpz_add(tmp, c._data(), tmp);
fmpz_add(tmp, b._data(), tmp);
fmpz_add(tmp, tmp, c._data());
fmpz_add(tmp, tmp, d._data());
fmpz_add(out._data(), a._data(), tmp);
fmpz_clear(tmp);
(The funny way of bracketing is just to test my code; the point is that you always add one new term to a temporary result, so never need to allocate more temporaries. In fact out = a + b + c + a + b + c + d achieves a similar effect.)
In fact here something really nice happens. Recall that the asm code generated for the expression out = ((a + b) + (c + d)) + ((a + c) + (b + d)); and the one for the hand-coded C-equivalent are not actually the same -- they use the same number of call statements, but somewhat different stack frames. The automatically generated code is also about 10% longer.
On the other hand, for the expression out = (a + (((b + (c + (a + b))) + c) + d)), the asm generated in both cases is in fact exactly equivalent (just using different register allocations and jump target names).
An interesting question to investigate once the main wrapper is done is the following: should we try to automatically use associativity to convert the first expression into the third (i.e. remove brackets)? My inkling is no, since it seems like if the user goes through the effort of bracketing, she probably wants us to evaluate things in the order of brackets; presumably because she knows something the library cannot know.