Tuesday, 2 July 2013

Ternary expressions, take 2

I spent the morning finishing the ternary operation support for fmpz. Well, there is some testing still to implement, but things look good. As anticipated, the required additional code is not much - the whole ternary expression code is now 600 LOC. (On the other hand the whole wrapper library is 1800 LOC so far, so this is a quite significant amount.)

I also extended the code in t-mpz.cpp to take care of all the new cases. By then the memory usage and compile time in gcc had become ridiculous (resident set > 4GB). This wasn't particularly unexpected, since the relevant test code size tripled, and apparently gcc does not seem to discard certain data structures, yielding to linear growth. I improved the situation a bit by wrapping the test lines into inner structs, apparently that tricks gcc into giving up a bit of its data, and reduced the resident set size to about 1.5GB again (and compile time to about 1 minute).

Just now I did what I was most wary about - I tested the code generation. Consider the following function:

void test1(mpz& out, const mpz& a, const mpz& b, const mpz& c, const mpz& d)
{
    out = a + ((a+b) + ((c+a) + (a+d))*d);
}

This is a typical case where a ternary operation can be used (it is of the form out = a + (x + y*z), and the term on the right goes into a temporary, so we can use addmul). It is not the most complicated possible case (note that z in the above expression corresponds to d, an immediate), but still reasonably representative (note that evaluating x requires one temporary, y two, and z none).

The most efficient hand-coded version of this I could come up with is the following:

void test2(mpz& out, const mpz& a, const mpz& b, const mpz& c, const mpz& d)
{
    fmpz_t tmp1, tmp2;
    fmpz_init(tmp1);fmpz_init(tmp2);

    fmpz_add(tmp1, c._data(), a._data());
    fmpz_add(tmp2, a._data(), d._data());
    fmpz_add(tmp1, tmp1, tmp2);
    fmpz_add(tmp2, a._data(), b._data());
    fmpz_addmul(tmp2, tmp1, d._data());
    fmpz_add(out._data(), a._data(), tmp2);

    fmpz_clear(tmp1);fmpz_clear(tmp2);
}

The first function compiles to this, whereas the second one compiles to this. The results are quite similar to my earlier investigations. On the negative side, the first function is compiled to asm which
  • is 10% longer
  • uses 10% more storage instructions
  • uses 1 more register
On the positive side, both functions use the same number of call statements, conditional and unconditional jumps.

3 comments:

  1. Well done!

    Are you automatically trying to replace x + y*z with an addmul if the result goes into a temporary? That's really nice, but certainly a difficult optimisation which I hadn't expected.

    When you said ternary operators, I thought you were referring to things like x += y*z where the user had effectively specified an addmul using the combined assignment and addition operator +=.

    ReplyDelete
  2. Yes, that is what I am doing.

    The += and -= things are also implemented (but that's a bit easier, since there are only two things to evaluate).

    By the way, I'm not quite sure why, but gcc actually produces *better* code for += than when I write it by hand. Namely the code generated for expression templates is 10% shorter and uses one register less than the hand-written one. There must be some curious heuristics going on.

    ReplyDelete
  3. I learned last night that it is standard practice in C++ to write the += operator for a class (e.g. fmpz) first, then effectively do:

    const fmpz fmpz::operator+(const zz_t & b)
    {
    fmpz r = *this;
    r += b;

    return r;
    }

    So perhaps this may explain the heuristic, if people do this as standard practice.

    ReplyDelete