Tuesday 25 June 2013

Another look at asm

This morning I implemented temporary-efficient generic binary operations.

If you recall the last blog post about asm, we had a test expression as follows

    out = ((a + b) + (c + d)) + ((a + c) + (b + d));

which "compiled into the following C-equivalent"

    fmpz_t tmp1, tmp2, tmp3, tmp4, tmp5, tmp6;
    fmpz_init(tmp1);
    fmpz_init(tmp2);
    fmpz_init(tmp3);
    fmpz_init(tmp4);
    fmpz_init(tmp5);
    fmpz_init(tmp6);

    fmpz_add(tmp1, a._data(), b._data());
    fmpz_add(tmp2, c._data(), d._data());
    fmpz_add(tmp3, tmp1, tmp2);
    fmpz_add(tmp4, a._data(), c._data());
    fmpz_add(tmp5, b._data(), d._data());
    fmpz_add(tmp6, tmp1, tmp2);
    fmpz_add(out._data(), tmp3, tmp6);

    fmpz_clear(tmp1);
    fmpz_clear(tmp2);
    fmpz_clear(tmp3);
    fmpz_clear(tmp4);
    fmpz_clear(tmp5);
    fmpz_clear(tmp6);

Obviously this uses more temporaries than necessary. But at the time of the last post, I had not yet implemented tuple merging, so this would have been hard to improve. But of course new we do have tuple merging, and so since today I want to start implement mpz "in some completeness", it seemed reasonable to start by employing the tuple merging code to get efficient generic binary operations. And so I did. The resulting C-equivalent is now the following:

    fmpz_t tmp1, tmp2, tmp3;
    fmpz_init(tmp1);
    fmpz_init(tmp2);
    fmpz_init(tmp3);

    fmpz_add(tmp1, a._data(), b._data());
    fmpz_add(tmp2, c._data(), d._data());
    fmpz_add(tmp3, tmp1, tmp2);
    fmpz_add(tmp1, a._data(), c._data());
    fmpz_add(tmp2, b._data(), d._data());
    fmpz_add(tmp1, tmp1, tmp2);
    fmpz_add(out._data(), tmp1, tmp3);

    fmpz_clear(tmp1);
    fmpz_clear(tmp2);
    fmpz_clear(tmp3);

As far as I can tell this is optimal (except for moving brackets using associativity). If not please let me know.

Of course there is a more efficient way of evaluating the above expression. In C++, it would be (for example)

    out = (a + (((b + (c + (a + b))) + c) + d));

and the C-equivalent would then be

    fmpz_t tmp;
    fmpz_init(tmp);

    fmpz_add(tmp, a._data(), b._data());
    fmpz_add(tmp, c._data(), tmp);
    fmpz_add(tmp, b._data(), tmp);
    fmpz_add(tmp, tmp, c._data());
    fmpz_add(tmp, tmp, d._data());
    fmpz_add(out._data(), a._data(), tmp);

    fmpz_clear(tmp);

(The funny way of bracketing is just to test my code; the point is that you always add one new term to a temporary result, so never need to allocate more temporaries. In fact out = a + b + c + a + b + c + d achieves a similar effect.)

In fact here something really nice happens. Recall that the asm code generated for the expression out = ((a + b) + (c + d)) + ((a + c) + (b + d)); and the one for the hand-coded C-equivalent are not actually the same -- they use the same number of call statements, but somewhat different stack frames. The automatically generated code is also about 10% longer.

On the other hand, for the expression out = (a + (((b + (c + (a + b))) + c) + d)), the asm generated in both cases is in fact exactly equivalent (just using different register allocations and jump target names).


An interesting question to investigate once the main wrapper is done is the following: should we try to automatically use associativity to convert the first expression into the third (i.e. remove brackets)? My inkling is no, since it seems like if the user goes through the effort of bracketing, she probably wants us to evaluate things in the order of brackets; presumably because she knows something the library cannot know.

Monday 24 June 2013

Into the deep

I had another productive day; the code size is increasing at a rapid pace. Today I implemented arbitrary addition of myint, and also a new test class mylong. This very similar to myint, the main point being to test that we can do mixed addition (i.e. adding a myint and a mylong and getting a mylong).

I am now reasonably convinced that the basics are working, and tomorrow I will start implement the mpz class is some completeness (excluding contexts for now, but with addmul etc).

With the increased complexity I am also hitting an expected problem: the C++ template system really was not designed to be (ab)used in this way. The main ramification of this are atrociously long and hard to understand error messages. I have been remedying this problem to some extent by using clang for debugging compiler errors -- clang's diagnostics are far superior to gcc's.

Sunday 23 June 2013

I just happen to enjoy working on Sundays

... said nobody ever. But I will graduate on Friday, and most likely won't have any time to get significant work done that day. So I figured I could just start my work week one day early, and take off Friday.

After I managed to make myself crawl out of bed on a Sunday morning at 8, I actually had a very productive day. I moved the expression class (together with its surrounding helpers) from prototype.h into its own file. In case you are not intimately aware of the code structure of the wrapper, this is the class doing the "heavy lifting". It is the main expression template class, from which all concrete classes derive.

Naturally enough, I also started a test file for the expression class. This implements a "dummy" concrete derived class myint, just wrapping an ordinary int. It seemed useful to isolate the entire work needed to define a derived class in the test file, instead of e.g. using the mpz wrapper for basic testing.

At some point during the day I thought I needed priorities for the evaluation traits, so I implemented these as well. Turns out I don't need them immediately, but it probably can't hurt to have them around anyway.

The test class myint just needs a few more lines to implement addition, and then it will be on par with the mpz class of prototype.h. It does have printing, assignment, initialization and equality comparison; and of course all of these features are tested extensively.

Tomorrow I will continue working in parallel on the three strands I already worked on today: (1) extend expression, and (2) test it by (3) extending myint.

Thursday 20 June 2013

A detour into assembly

So as I mentioned last time, today I decided to take a look at the assembly code my prototype is compiled into, to verify that the compiler does the necessary optimizations. The example I looked at is

    out = ((a + b) + (c + d)) + ((a + c) + (b + d))

Currently, this should be turned into the C equivalent of

    fmpz_t tmp1, tmp2, tmp3, tmp4, tmp5, tmp6;
    fmpz_init(tmp1);
    fmpz_init(tmp2);
    fmpz_init(tmp3);
    fmpz_init(tmp4);
    fmpz_init(tmp5);
    fmpz_init(tmp6);

    fmpz_add(tmp1, a._data(), b._data());
    fmpz_add(tmp2, c._data(), d._data());
    fmpz_add(tmp3, tmp1, tmp2);
    fmpz_add(tmp4, a._data(), c._data());
    fmpz_add(tmp5, b._data(), d._data());
    fmpz_add(tmp6, tmp1, tmp2);
    fmpz_add(out._data(), tmp3, tmp6);

    fmpz_clear(tmp1);
    fmpz_clear(tmp2);
    fmpz_clear(tmp3);
    fmpz_clear(tmp4);
    fmpz_clear(tmp5);
    fmpz_clear(tmp6);

(This is of course a rather inefficient way of doing it, but that's besides the point. The question is whether the compiler really manages to turn the first line into the second snippet.)

When compiling without exception support (which clutters things up with stack unwinding code and makes the comparison harder), on my machine the first snippet is turned into this. For comparison, the handwritten code turns into this. The key differences are as follows:

  • the hand-written code is 10% shorter
  • the hand-written code uses one register less
On the other hand, both use exactly the same number of call statements. This is good, because it indicates that everything is inlined as expected.

I'm a bit wary about the increase in register usage and code length, but I think I will press on for now.

Wednesday 19 June 2013

Metaprogramming day

So it's day three, and today I kept extending the prototype. By now it is possible to add fmpz's, in good cases. Things are not completely optimized yet, and both code formatting and layout (and comments) still have a long way to go, but this feels like progress. The metaprogramming style I employ very heavily relies on the compiler being able to optimize it, and I feel like I should verify that. Thus tomorrow I'll extend the code to the extent that we can add arbitrary fmpz's, and then I'll start analysing (and testing) the generated code. This should be interesting.

Tuesday 18 June 2013

Some code - YAY.

So, this is day two. I decided to forgo further detailed design work and actually start writing some code. My plan now is to write -- you guessed it -- another prototype. But this time for real; I'm only calling it a prototype because that allows me to stuff all code into one big messy file. I'll split it up soon.

So today I first upgraded the build system a bit to be able to build and execute C++ tests. After that I began writing the expression template prototype. It lives (unsurprisingly) in cxx/prototype.h; right now there is not much functionality at all (instantiation, destruction, and printing of fmpz). This will change tomorrow.

One question regarding the build system: is there a good reason why we have no real dependency tracking? In particular, currently every C file depends on *every* header, causing everything to be recompiled if only one header is changed. This is a bit of a pain for a header-only library ;). I have currently changed the build system so as to track headers in the module directory separately (whence only the cxx tests will be recompiled after I change cxx headers, not the entirety of FLINT), but it does not seem much harder to track the headers properly in the first place.

Monday 17 June 2013

And so it all begins

Today is the first day of GSOC, and dutifully I got up at 8am to start working at 9. I have spent all of the day outlining my design in a bit more detail. Since I will write a message to the list about this, I shall not go into more detail here.

Instead, let me explain how I intend to document my progress. Namely, I will write a blog post every day that I work, explaining very shortly what it is that I did. This is not intended to be a complete summary at all, it might just be a fun thing I noticed, or a "milestone" in my implementation, or whatever. I want to keep the amount of time I invest in these posts very short, say less than 10min (which is why I am already typing like a madman ...).

Additionally, every Friday I will send an email to the list, documenting in more detail my actual progress throughout the week.