Tuesday 25 June 2013

Another look at asm

This morning I implemented temporary-efficient generic binary operations.

If you recall the last blog post about asm, we had a test expression as follows

    out = ((a + b) + (c + d)) + ((a + c) + (b + d));

which "compiled into the following C-equivalent"

    fmpz_t tmp1, tmp2, tmp3, tmp4, tmp5, tmp6;
    fmpz_init(tmp1);
    fmpz_init(tmp2);
    fmpz_init(tmp3);
    fmpz_init(tmp4);
    fmpz_init(tmp5);
    fmpz_init(tmp6);

    fmpz_add(tmp1, a._data(), b._data());
    fmpz_add(tmp2, c._data(), d._data());
    fmpz_add(tmp3, tmp1, tmp2);
    fmpz_add(tmp4, a._data(), c._data());
    fmpz_add(tmp5, b._data(), d._data());
    fmpz_add(tmp6, tmp1, tmp2);
    fmpz_add(out._data(), tmp3, tmp6);

    fmpz_clear(tmp1);
    fmpz_clear(tmp2);
    fmpz_clear(tmp3);
    fmpz_clear(tmp4);
    fmpz_clear(tmp5);
    fmpz_clear(tmp6);

Obviously this uses more temporaries than necessary. But at the time of the last post, I had not yet implemented tuple merging, so this would have been hard to improve. But of course new we do have tuple merging, and so since today I want to start implement mpz "in some completeness", it seemed reasonable to start by employing the tuple merging code to get efficient generic binary operations. And so I did. The resulting C-equivalent is now the following:

    fmpz_t tmp1, tmp2, tmp3;
    fmpz_init(tmp1);
    fmpz_init(tmp2);
    fmpz_init(tmp3);

    fmpz_add(tmp1, a._data(), b._data());
    fmpz_add(tmp2, c._data(), d._data());
    fmpz_add(tmp3, tmp1, tmp2);
    fmpz_add(tmp1, a._data(), c._data());
    fmpz_add(tmp2, b._data(), d._data());
    fmpz_add(tmp1, tmp1, tmp2);
    fmpz_add(out._data(), tmp1, tmp3);

    fmpz_clear(tmp1);
    fmpz_clear(tmp2);
    fmpz_clear(tmp3);

As far as I can tell this is optimal (except for moving brackets using associativity). If not please let me know.

Of course there is a more efficient way of evaluating the above expression. In C++, it would be (for example)

    out = (a + (((b + (c + (a + b))) + c) + d));

and the C-equivalent would then be

    fmpz_t tmp;
    fmpz_init(tmp);

    fmpz_add(tmp, a._data(), b._data());
    fmpz_add(tmp, c._data(), tmp);
    fmpz_add(tmp, b._data(), tmp);
    fmpz_add(tmp, tmp, c._data());
    fmpz_add(tmp, tmp, d._data());
    fmpz_add(out._data(), a._data(), tmp);

    fmpz_clear(tmp);

(The funny way of bracketing is just to test my code; the point is that you always add one new term to a temporary result, so never need to allocate more temporaries. In fact out = a + b + c + a + b + c + d achieves a similar effect.)

In fact here something really nice happens. Recall that the asm code generated for the expression out = ((a + b) + (c + d)) + ((a + c) + (b + d)); and the one for the hand-coded C-equivalent are not actually the same -- they use the same number of call statements, but somewhat different stack frames. The automatically generated code is also about 10% longer.

On the other hand, for the expression out = (a + (((b + (c + (a + b))) + c) + d)), the asm generated in both cases is in fact exactly equivalent (just using different register allocations and jump target names).


An interesting question to investigate once the main wrapper is done is the following: should we try to automatically use associativity to convert the first expression into the third (i.e. remove brackets)? My inkling is no, since it seems like if the user goes through the effort of bracketing, she probably wants us to evaluate things in the order of brackets; presumably because she knows something the library cannot know.

5 comments:

  1. In your first parenthesised expression, d appears twice. But in the other expressions you give, d appears only once. So there might be a typo there somewhere, unless I am missing something blindingly obvious.

    I agree that if the user went to the trouble of doing bracketing that the order of operations should be respected.

    Not doing this could lead to unexpected behaviour for the user, for example if they introduce side effects into their functions for debugging purposes, etc.

    ReplyDelete
  2. You are quite right, the expressions are not equivalent. They way my tests are written, there is no need for them to even be similar.

    This would have improved the narrative of my post, though.

    ReplyDelete
    Replies
    1. No problem.

      It looks like it is coming along well.

      Tomorrow I will start diving into your code a bit deeper since you now have some non-trivial stuff up on github.

      Am I right that you've now removed the two simple examples you implemented yesterday and replaced them with the mpz class implementation in prototype?

      We probably should be discussing in flint-devel, but it is probably better if I wait until your weekly update on Thursday (?) to write in more detail.

      Everything looks good for now anyway. Excellent in fact.

      Delete
  3. Yes the weekly update will come on Thursday.

    The simple examples have not gone, they are still in t-expression.cpp. The mpz code is currently still being developed in prototype.h (to be moved somewhere more appropriate when I introduce contexts), and tested in t-mpz.cpp (and t-codegen.cpp).

    I'll explain my doings and plans in more detail on Thursday.

    ReplyDelete