Tuesday 3 September 2013

qadic

Well, I said I would, and so I did: I implemented, documented and tested qadicxx today. It took some small changes to the padicxx_ctx-related interfaces, but nothing major.

This only leaves printing, which I will start tackling tomorrow.

Thursday 29 August 2013

padicxx again

I spent most of the morning thinking about the changes to padicxx related to my discussion with Sebastian and Bill. I think that I have now come up with a reasonable way to implement the design we agreed upon. I changed padicxx accordingly, and also wrote enough of padic_polyxx to see that this class will also work out.

Hence not a particularly productive day code-wise, but I expect to be able to use all of today's thinking to finish all the ?adic* classes tomorrow.

Wednesday 28 August 2013

Another productive day

Today I did the following:

  • implement Hensel lifting code for fmpzxx
  • fix a const correctness problem in fmpz_poly_mat
  • implement prod for fmpz_poly_matxx
  • implement solve_fflu_precomp for fmpz_poly_matxx and nmod_poly_matxx
  • implement various four-argument functions for nmod_polyxx
  • implement bit_pack and bit_unpack for fmpzxx
This means that the only coding-related tasks of priority (H) or (M) left on my TODO list are ustream and ?adic*. Yay!

Tuesday 27 August 2013

Still more work to do

Today I finished the construction and randtest interfaces, added row reduction and LU decomposition interfaces to all matrix classes, and added radix conversion functionality for fmpz_mod_poly.

Monday 26 August 2013

We're getting there

I'm in the process of working through my todo list of things which need to be done before the project can be finished. This is going to take a while, but seeing the list shrink is encouraging. Today I added conversion and construction interfaces to most classes, added static versions of randtest to classes which didn't have them yet, and started adding set_zero and set_one functions. None of this is hard, but it all takes time ...

Thursday 22 August 2013

work work work

Now that most classes are finished to some extent, it is hard to keep track of the work I'm doing, since it is all scattered about. Today I
  • fixed make install
  • implemented the ustream prototype (cf. mailing list)
  • removed the n % ctx syntax to construct an nmodxx 
  • added a static nmodxx::reduce function, replacing the member functions of fmpz_polyxx and fmpq_polyxx
  • implemented fourary and fiveary expression templates
  • implemented the CRT function for fmpzxx

Wednesday 21 August 2013

Factorisation code

I haven't written a blog post in a while, since I have been active on the list. Consequently there has been a lot of progress since the last post, which I am not going to recount in detail. Instead, let me just focus on today's work: I implemented wrappers for the factorization functionality in flint (recall that this is one of the (H)igh priority tasks that need finishing). Hence there are new classes fmpz_factorxx, nmod_poly_factorxx, fmpz_poly_factorxx and fmpz_mod_poly_factorxx, together with associated methods/functions. As far as I can tell this covers all the factoring functionality. This was quite a marathon, but I implemented, tested and documented all of it today.

Wednesday 14 August 2013

More progress, with an end almost visible.

Today was a good day. I finished nmod_matxx, documented it, and added a codegen test. Then I started and finished nmod_poly_matxx and documented it. Finally I started fmpz_mod_poly. After this is done tomorrow (hopefully), we will nominally have initial versions of wrapper classes for every flint type. (Nominally, because qadic is missing (but it is not yet documented, either), and because the nmod_vecxx/fmpz_vecxx classes are very spartan - but then again in flint they only have underscore methods, so...)

Tuesday 13 August 2013

nmod_matxx and some

This morning I started out by writing codegen tests for nmod_polyxx. The results are in line with my expectations: the C++ code is generally a bit longer than handwritten C, but not much. Specifically

void test_nmod_polyxx_1(nmod_polyxx& to, const nmod_polyxx& p1,
            const nmod_polyxx& p2)
{
    to = (p1*p1) + (p2*p2);
}

void test_nmod_polyxx_3(nmod_polyxx& to, const nmod_polyxx& p1,
            const nmod_polyxx& p2)
{
    to = ((p1*p1) + (p2*p2)) + ((p1*p2) + (p2*p1));
}

Both come out about 20% longer than the C equivalents, and use a few more stack spill slots (but precisely the same FLINT function calls). Recall that for fmpzxx and similar classes the code usually comes out only about 10% longer; I attribute the difference to the fact that nmod_polyxx instantiation requires an additional argument (the modulus) which has to be kept track of.
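
For comparison, a hand-written C analogue of the first test might look as follows. This is my own reconstruction (not the code from the codegen test), and it assumes that to does not alias p1 or p2:

    #include "nmod_poly.h"

    /* to = p1*p1 + p2*p2, assuming to does not alias p1 or p2 */
    void test_nmod_polyxx_1_c(nmod_poly_t to, const nmod_poly_t p1,
                const nmod_poly_t p2)
    {
        nmod_poly_t tmp;
        nmod_poly_init(tmp, p1->mod.n);
        nmod_poly_mul(to, p1, p1);
        nmod_poly_mul(tmp, p2, p2);
        nmod_poly_add(to, to, tmp);
        nmod_poly_clear(tmp);
    }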


After that, I wrote a new helper file flintxx/matrix.h encapsulating some code common to (probably) all matrix classes, updated fmpz_matxx to use it, and then started nmod_matxx (using this new helper). I'm about 75% of the way through the nmod_matxx functions, and will finish and document them tomorrow.

Monday 12 August 2013

nmod_polyxx

Today I finished nmod_polyxx (with the usual reservations). Along the way I implemented nmod_vecxx (with very limited functionality) and reference types for nmodxx.

I also extended create_doc.c to allow opening subsections using lines of pluses instead of stars; with the new version of the program, which uses a leg grammar, this took about five minutes. I then documented nmod_polyxx, using the same subsection structure as the nmod_poly_t documentation.

The only thing left to do with nmod_polyxx for now is writing codegen tests; this I will do first thing tomorrow.

Thursday 8 August 2013

fmpz_matxx again

First of all, yesterday was an unproductive day. I got up feeling slightly sick, and spent about two hours writing documentation. Then I gave up for the day and went back to bed.

Fortunately, all seems to be fine again today. I finished yesterday's documentation task; now finally all documentation is up to speed again with respect to the recent library changes. After that, I finished the fmpz_matxx class, at least to the extent that other classes are finished, and documented it.

Tuesday 6 August 2013

Lots and lots of documenting

As announced yesterday, this morning I finished adding member functions to all the existing wrapper classes. After that, I began the task of bringing my documentation up to date with respect to the changes of the last week(s). This is again a fairly laborious task; I estimate I am about halfway through.

Monday 5 August 2013

Refactoring

Today I began implementing some more improvements to the existing wrapper classes suggested by Bill / John Cremona. First I added "complex component access" functionality to all classes where it makes sense. The easiest example is fmpqxx. The following all work:

fmpqxx a(3, 4u);
a.num() = 5;
std::cout << (a+a).num() << '\n';

In particular, depending on the object type, num() will return a reference to the numerator, a constant reference, or an expression template.

After that, I began adding member functions for all global functions. So one can e.g. do a.divrem(b) instead of divrem(a, b). As is to be expected, this process is quite laborious (simply because there are a lot of functions in flint), and I will have to continue it tomorrow. Then I will also update the documentation.

Thursday 1 August 2013

some more honest work

This morning I implemented getters for ltuples. You can now do things like

fmpzxx a;
int b = 0;
ltupleref(a, b).get<0>() == a;

Depending on the situation, the get() call may evaluate a lazy tuple (if the return type cannot be made lazy), or yield a lazy expression, or yield a reference without needing to evaluate (if the tuple itself is immediate).


After that I began work on allowing lazy functions with more than two arguments. This requires writing generic code to evaluate n possibly lazy expressions. Things can get as complicated as one wants, but for the sake of (a) simplicity and (b) compilation performance, and (c) because of my distrust of the optimizer, I have now decided to do the following: if there are precisely two non-immediate subexpressions, do the clever logic we already have; otherwise, evaluate things strictly in order.

This way no tests are broken, and the minimal number of temporaries is used if there are at most two non-immediate subexpressions. If there are more than two, then the code might be (slightly) non-optimal. I assume this should not be a particularly common case.
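
Stripped of all the flintxx machinery, the decision boils down to a compile-time count of the non-immediate operands. A toy version of such a count (purely illustrative, nothing to do with the actual flintxx metaprogram) could look like this:

    // Count the non-immediate (lazy) operands at compile time and dispatch on
    // whether the count is exactly two.
    struct immediate { };   // stands in for e.g. an fmpzxx instance
    struct lazy { };        // stands in for an expression template

    template<class T> struct is_lazy { static const int val = 0; };
    template<> struct is_lazy<lazy> { static const int val = 1; };

    template<class A, class B, class C>
    struct count_lazy
    {
        static const int val =
            is_lazy<A>::val + is_lazy<B>::val + is_lazy<C>::val;
    };

    // count_lazy<immediate, lazy, lazy>::val == 2  -> use the clever logic
    // count_lazy<lazy, lazy, lazy>::val      == 3  -> evaluate strictly in order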

Wednesday 31 July 2013

Slow progress

This morning I implemented the new matrix expression evaluation code which operates without temporary merging. This was done without particular difficulty, and works as expected. I did not implement the full class fmpz_matxx for now, though. Instead, I will brush up the classes we have so far with respect to the things which I have learned since I started:

  • lazy functions with more than two arguments
  • lazy functions with more than one return value
  • more member functions, and in general the changes in naming conventions discussed on the list
  • generic coefficient access
  • static assert

The one thing I have worked on today is lazy functions with more than one return value, implemented using a lazy tuple class. This is a bit tricky (as expected), but I think I have a working design and implementation now. It allows writing things such as

fmpzxx a;
int b;
ltupleref(a, b) = some_lazy_function(17);
tassert(some_lazy_function(17) == ltuple(a, b));

Monday 29 July 2013

A bit of a detour

This morning Bill suggested that create_doc.c should use a parser generator like peg/leg. Since I had criticized the code somewhat harshly before, I felt I should undertake the trivial exercise of switching it over to an automatically generated parser.

Long story short, while this really wasn't hard, it was also not completely straightforward (just a bunch of "soft" difficulties related to writing a working grammar and stuff like that). I spent much more time on this than I probably should have, but there is a working version in my gsoc-doc branch now.

After that I thought about the matrix temporary allocation jumble. I think Bill's solution (allocating temporaries of sufficient size to fit every intermediate result, and using cheap resizing before each assignment) will work. I started the fmpz_matxx class, but haven't really gotten anything done yet; everything is ready to go tomorrow, though.
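
To make the idea concrete, here is a toy picture of such a temporary (purely illustrative, neither FLINT nor flintxx code): one buffer sized for the largest intermediate result, with a cheap logical resize that allocates nothing.

    #include <cstddef>
    #include <vector>

    // Toy sketch of the temporary strategy: allocate once, big enough for every
    // intermediate result; "resizing" only changes the advertised dimensions.
    struct toy_mat_tmp
    {
        std::size_t max_rows, max_cols;   // physical capacity
        std::size_t rows, cols;           // current logical size
        std::vector<long> data;

        toy_mat_tmp(std::size_t mr, std::size_t mc)
            : max_rows(mr), max_cols(mc), rows(mr), cols(mc), data(mr * mc) { }

        // cheap resize: no allocation, no copying
        void resize(std::size_t r, std::size_t c) { rows = r; cols = c; }

        long& entry(std::size_t i, std::size_t j) { return data[i * max_cols + j]; }
    };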

Thursday 25 July 2013

More polys

I "finished" fmpz_polyxx (except for newton bases, hensel lifting, input, modular reduction, and addmul), and then started fmpq_polyxx. This is simply a laborious process, not much to tell.

Wednesday 24 July 2013

polys polys polys

I spent all day wrapping fmpz_poly. And I'm not even close to finished. I have wrapped the functions from all sections up to and including 13.25 (evaluation), and also 13.28 (composition).

The most interesting thing I implemented today (on other days this would hardly be worth mentioning) is to make the following work:

    fmpz_polyxx p, q;
    p = "3  1 0 1";
    q = "3  0 0 1";

    tassert(p(fmpzxx(2)) == 5);
    tassert(p(q).to_string() == "5  1 0 0 0 1");

You see, the operator() is overloaded to do either composition or evaluation, just like in real life. Isn't this exciting? ...

Tuesday 23 July 2013

More easy stuff

Today I kept working on writing wrappers. I began by extending/fixing create_doc.c (the program we use to parse documentation files) to allow functions with empty argument lists, and with a const qualifier on the right-hand side. While at it I fixed all the parsing exceptions in the flint sources. This revealed that I had forgotten a few functions for fmpzxx; I have now added all of them except those pertaining to bit packing or Chinese remaindering.

After that I started the fmpz_polyxx class; nothing much there yet.

Monday 22 July 2013

Things are taking shape.

I "finished" fmpqxx today, and also padicxx except for documentation.

Don't expect more exciting updates any time soon. There are quite a number of data types in FLINT...

Friday 19 July 2013

Some comments on error codes and their improvements

I spent the morning (well, quite a bit more than that by now) looking through the error message reports I generated yesterday, and seeing how to improve them. I think I have made some progress without being too intrusive. For now the work is in my branch gsoc-errors. One interesting thing to note is that this actually made many error messages longer (I'll comment on this below).

Here is a rundown of what typical error messages look like. I'll concentrate on gcc, since I assume this is the primary target. One typical problem is trying to construct an object from the wrong arguments. Here is an excerpt from the error message report:

TEST_FMPZXX_REF_INIT_WRONG_1
{
    const fmpzxx a;
    fmpzxx_ref ar(a);
}
Compiler error output is 37 line(s), 8535 characters
------START COMPILER ERROR OUTPUT-------
t-compiler-errors.cc: In function ‘int main()’:
t-compiler-errors.cc:71:24: error: no matching function for call to ‘flint::fmpzxx_expression<flint::operations::immediate, flint::flint_classes::ref_data<flint::fmpzxx_expression<flint::operations::immediate, flint::detail::fmpz_data>, long int> >::fmpzxx_expression(const fmpzxx&)’
t-compiler-errors.cc:71:24: note: candidates are:
In file included from t-compiler-errors.cc:31:0:
/home/ness/src/flint/fmpzxx.h:51:5: note: template<class T, class U> flint::fmpzxx_expression::fmpzxx_expression(const T&, const U&, typename flint::mp::enable_if<flint::fmpzxx_expression<Operation, Data>::enablec2<T, U> >::type*)
.....
/home/ness/src/flint/fmpzxx.h:43:7: note:   no known conversion for argument 1 from ‘const fmpzxx {aka const flint::fmpzxx_expression<flint::operations::immediate, flint::detail::fmpz_data>}’ to ‘const flint::fmpzxx_expression<flint::operations::immediate, flint::flint_classes::ref_data<flint::fmpzxx_expression<flint::operations::immediate, flint::detail::fmpz_data>, long int> >&’
------END COMPILER ERROR OUTPUT---------

The "...." denotes a lot more similar messages I have skipped. This is one of the error messages which got longer, but much more descriptive. Basically what is happening is that none of the constructors is applicable (as it states in the second line!), and then the compiler lists all five constructors (which is not particularly helpful in our case, but I suppose usually a good strategy). In the old error message one of the constructors was invoked, which then failed way down in the implementation details.

One thing to observe is that printing out all the type names in full is really distracting. Maybe I should write a sed script to shorten them.



An example of a similar nature is trying to assign to something for which there simply is no rule to do the assignment (e.g. to a constant reference). Here is an excerpt:

TEST_FMPZXX_SRCREF_ASSIGN
{
    fmpzxx a;
    fmpzxx_srcref b(a);
    b = a;
}
Compiler error output is 7 line(s), 3014 characters
------START COMPILER ERROR OUTPUT-------
In file included from /home/ness/src/flint/fmpzxx.h:32:0,
                 from t-compiler-errors.cc:31:
/home/ness/src/flint/flintxx/expression.h: In instantiation .......
/home/ness/src/flint/fmpzxx.h:50:5:   required from ..........
t-compiler-errors.cc:85:13:   required from here
/home/ness/src/flint/flintxx/expression.h:255:1: error: invalid application of ‘sizeof’ to incomplete type ‘flint::detail::STATIC_ASSERT_FAILED<false>’
/home/ness/src/flint/flintxx/expression.h:256:9: error: ‘doit’ is not a member of ‘rule_t {aka flint::rules::assignment<flint::fmpzxx_expression<flint::operations::immediate, flint::flint_classes::srcref_data<flint::fmpzxx_expression<flint::operations::immediate, flint::detail::fmpz_data>, flint::fmpzxx_expression<flint::operations::immediate, flint::flint_classes::ref_data<flint::fmpzxx_expression<flint::operations::immediate, flint::detail::fmpz_data>, long int> >, long int> >, flint::fmpzxx_expression<flint::operations::immediate, flint::detail::fmpz_data>, void>}’
------END COMPILER ERROR OUTPUT---------

This time the "........." denote really really really long lines I chopped off. We see a simulated STATIC_ASSERT here (and this line is the only thing new compared with yesterday's results). It's not as great as one might like, but at least the STATIC_ASSERT assures us that the following errors are most likely garbage. There is also a message associated with the STATIC_ASSERT, which can be found by looking in the source. Alternatively, when compiling in C++11 mode, the error message is displayed directly:

/home/ness/src/flint/flintxx/expression.h:254:9: error: static assertion failed: no rule to assign expression T to derived_t (after evaluating)
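
For the record, the general shape of this C++03 trick is roughly the following (a minimal sketch; the actual flintxx code differs in the details):

    namespace detail {
    template<bool> struct STATIC_ASSERT_FAILED;         // only declared
    template<> struct STATIC_ASSERT_FAILED<true> { };   // defined only for true
    }

    // Inside a function, sizeof forces instantiation: if cond is false the type
    // is incomplete, and the compiler prints its (descriptive) name.
    #define TOY_STATIC_ASSERT(cond) \
        do { (void) sizeof(detail::STATIC_ASSERT_FAILED<(cond)>); } while (0)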


As a final example, consider trying to do unimplemented arithmetic operations. Here is an excerpt:

TEST_FMPZXX_ARITH_WRONG_DEEP
{
    fmpzxx a;
    newtype n;
    a + (a*a + (a / n) + a)*a;
}
Compiler error output is 10 line(s), 1387 characters
------START COMPILER ERROR OUTPUT-------
t-compiler-errors.cc: In function ‘int main()’:
t-compiler-errors.cc:99:25: error: no match for ‘operator/’ in ‘a / n’
t-compiler-errors.cc:99:25: note: candidate is:
In file included from /home/ness/src/flint/fmpzxx.h:32:0,
                 from t-compiler-errors.cc:31:
/home/ness/src/flint/flintxx/expression.h:490:1: note: template<class Expr1, class Expr2> typename flint::detail::binary_op_helper<Expr1, flint::operations::divided_by, Expr2>::enable::type flint::operator/(const Expr1&, const Expr2&)
/home/ness/src/flint/flintxx/expression.h:490:1: note:   template argument deduction/substitution failed:
/home/ness/src/flint/flintxx/expression.h: In substitution of ‘template<class Expr1, class Expr2> typename flint::detail::binary_op_helper<Expr1, flint::operations::divided_by, Expr2>::enable::type flint::operator/(const Expr1&, const Expr2&) [with Expr1 = flint::fmpzxx_expression<flint::operations::immediate, flint::detail::fmpz_data>; Expr2 = newtype]’:
t-compiler-errors.cc:99:25:   required from here
/home/ness/src/flint/flintxx/expression.h:490:1: error: no type named ‘type’ in ‘flint::detail::binary_op_helper<flint::fmpzxx_expression<flint::operations::immediate, flint::detail::fmpz_data>, flint::operations::divided_by, newtype>::enable {aka struct flint::mp::enable_if<flint::traits::is_implemented<flint::rules::UNIMPLEMENTED>, int>}’
------END COMPILER ERROR OUTPUT---------

This is unchanged compared to yesterday, because it is actually pretty good.

Thursday 18 July 2013

Documentation and similar non-code

This morning I went on documenting, this time how to extend flintxx. I felt that the easiest way to get the basics across (to someone who already knows how to use flintxx, which admittedly the small amount of documentation I wrote yesterday is probably not going to achieve) would be a heavily commented toy example. So I wrote an example file fooxx.cpp, which is an amalgamation of a fake C interface, part of a C++ wrapper for it, and an example program. Then I added some explanatory remarks and a general overview to the appendix of flint-manual.tex. This effort is far from complete, but it should give anyone who is eager to try out flintxx an idea of how to proceed.

Then I gave some further thought to static_assert. I didn't really reach any conclusions, but I decided that at the very least we need a method to run the compiler against a set of test cases (which are known to fail) and record the output. I wrote a small (probably not terribly representative) set of test cases, and a script to do the execution. Here are the results for g++-4.7 and clang++-3.4. I'll not comment on them for now.

Wednesday 17 July 2013

Completing fmpzxx

During the first half of today I completed the fmpzxx.h wrapper header. This means I added wrapper functions for all (documented) fmpz functions. There are still a few shortcomings (marked TODO in the code), but fmpzxx is now basically feature complete.

The rest of the day I documented the whole thing.

ps: There was no blog post yesterday because I could not log into blogger.

Monday 15 July 2013

Some boring refactoring

I spent today refactoring fmpzxx.h and fmpqxx.h. In particular, any code in fmpzxx.h which did not pertain directly to the fmpz wrapper class was abstracted and moved to a new helper file flint_classes.h. (This is not entirely true; I still have to complete this process for ternary expression evaluation.) This made fmpzxx.h quite a bit cleaner (imho) and shorter (but at the expense of creating flint_classes.h, obviously). Then I applied the same changes to fmpqxx.h, the point of course being that it can now reuse flint_classes.h.

I will continue the same process tomorrow.

Thursday 11 July 2013

Some honest progress

Today was a fun day, with no grave problems. I wrote a wrapper for fmpq, and extended the wrapper for padic I started yesterday. A few minor things came up which I will talk about in my email tomorrow, and a few bugs were uncovered, but all in all things went smoothly. The two wrappers are in a state similar to fmpzxx: all arithmetic and some functions implemented, some details still to do (recall that this is still part of my "case study").

Wednesday 10 July 2013

Finally, reference types

This morning I coded up another implementation of reference types ("view classes"). And all tests (including codegen) still pass. Yay.

So what has been going on the last two days? Let's first recap the problem statement. In addition to the class fmpzxx, we also want classes fmpzxx_ref and fmpzxx_cref, which we would like to behave essentially like fmpzxx& and const fmpzxx&, except that it should be possible to instantiate them from the underlying C type fmpz*. (NB: you might think that one can use const fmpzxx_ref in place of fmpzxx_cref. But to do so, one has to disable copying of the reference types (i.e. always pass them by reference), and that is not desirable in the applications we have in mind.)
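
To make the goal concrete, the rough shape of such a view class is the following (a toy sketch, not the actual flintxx implementation): it stores only a pointer to the underlying C object, is cheap to copy, and can be created directly from an fmpz* with no owning fmpzxx around.

    #include "fmpz.h"

    // Toy sketch of a view class (not the real fmpzxx_ref).
    class toy_fmpz_ref
    {
        fmpz* inner;
    public:
        explicit toy_fmpz_ref(fmpz* f) : inner(f) { }
        toy_fmpz_ref& operator=(long n) { fmpz_set_si(inner, n); return *this; }
        fmpz* _data() const { return inner; }
    };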

The main problem we have with this is that we do not want to have to write rules for all combinations of argument types (reference, const-reference, non-reference). The natural idea for dealing with this is to replace all uses of (say) const fmpzxx& by fmpzxx_cref, and somehow have the core library deal with this (e.g. using implicit conversion constructors and translation types, as outlined in my Friday email). This takes some work, but is doable, and the result is fairly nice (imho). I have saved the work in my gsoc-views branch.

Unfortunately, this messes with the optimizer. Apparently, the increased number of temporary objects and indirections pushes the compiler beyond its limits. The result is code like this - which is hilariously bad, and also completely unacceptable. This sort of thing - misestimated register pressure causing the introduction of unnecessary spill slots and then blocking scalar replacement of aggregates - seems to be observable across all versions of gcc, and also clang.

This means two things. First, I had to abandon Monday's work and start afresh: we need a system in which reference-objects and non-references can be mixed, without converting references to non-reference objects (such as const fmpzxx&) into reference-objects. This can be done (using a kind of duck typing approach), and is in fact what I did this morning. I don't think it's quite as pretty, but it is still good enough. Second, at some later point in time I should revisit some of the design decisions of the core library which rely on the optimizer. (I am fairly confident that this will be doable without having to change the concrete classes.)

Tuesday 9 July 2013

Still no luck

This morning I finished the transition which enables view classes for fmpzxx. Unfortunately, as it turns out, this pushes the compiler too far. The optimizer produces correct but incredibly crappy code. (Interestingly enough, the optimizers in various versions of gcc all seem to optimize some functions well, and fail hilariously at others. No version can do all my tests, but every test can be done by some version. What sort of rubbish is that??) After some quick discussion on the gcc-help mailing list, it seems like I may have placed too much confidence in the capabilities of compiler optimizers.

I have now come up with a third strategy, which should allow the compiler to keep producing good code as long as no "views" are involved, while still retaining all the features when views are involved.

Hopefully I can explain details tomorrow, after implementing the new idea. Fingers crossed...

Monday 8 July 2013

Quick update

Since I didn't send out any emails today, here is a quick update. I have been working on implementing "view classes", aka references to an fmpz without an underlying instance of fmpzxx. I said a few things about this in my email on Friday; it turns out a few more things should be said. Anyway, I didn't manage to complete what I wanted to do today; I will post a fuller update tomorrow.

Tuesday 2 July 2013

Ternary expressions, take 2

I spent the morning finishing the ternary operation support for fmpz. Well, there is some testing still to implement, but things look good. As anticipated, the required additional code is not much - the whole ternary expression code is now 600 LOC. (On the other hand, the whole wrapper library is 1800 LOC so far, so this is quite a significant amount.)

I also extended the code in t-mpz.cpp to take care of all the new cases. By then the memory usage and compile time in gcc had become ridiculous (resident set > 4GB). This wasn't particularly unexpected, since the relevant test code size tripled, and gcc does not seem to discard certain data structures, leading to linear growth. I improved the situation a bit by wrapping the test lines into inner structs; apparently that tricks gcc into giving up a bit of its data, reducing the resident set size to about 1.5GB again (and compile time to about one minute).
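
The trick looks roughly like this (a sketch of the shape only, of course; the real test file is much bigger):

    // Each batch of test lines goes into a local struct with a static member
    // function, which seems to let gcc release some internal data per batch.
    void test_ternary_batches()
    {
        struct batch1 { static void run()
        {
            // ... a batch of tassert lines ...
        } };
        struct batch2 { static void run()
        {
            // ... another batch ...
        } };
        batch1::run();
        batch2::run();
    }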

Just now I did what I was most wary about - I tested the code generation. Consider the following function:

void test1(mpz& out, const mpz& a, const mpz& b, const mpz& c, const mpz& d)
{
    out = a + ((a+b) + ((c+a) + (a+d))*d);
}

This is a typical case where a ternary operation can be used (it is of the form out = a + (x + y*z), and the term on the right goes into a temporary, so we can use addmul). It is not the most complicated possible case (note that z in the above expression corresponds to d, an immediate), but still reasonably representative (note that evaluating x requires one temporary, y two, and z none).

The most efficient hand-coded version of this I could come up with is the following:

void test2(mpz& out, const mpz& a, const mpz& b, const mpz& c, const mpz& d)
{
    fmpz_t tmp1, tmp2;
    fmpz_init(tmp1);fmpz_init(tmp2);

    fmpz_add(tmp1, c._data(), a._data());
    fmpz_add(tmp2, a._data(), d._data());
    fmpz_add(tmp1, tmp1, tmp2);
    fmpz_add(tmp2, a._data(), b._data());
    fmpz_addmul(tmp2, tmp1, d._data());
    fmpz_add(out._data(), a._data(), tmp2);

    fmpz_clear(tmp1);fmpz_clear(tmp2);
}

The first function compiles to this, whereas the second one compiles to this. The results are quite similar to my earlier investigations. On the negative side, the first function is compiled to asm which
  • is 10% longer
  • uses 10% more storage instructions
  • uses 1 more register
On the positive side, both functions use the same number of call statements, conditional and unconditional jumps.

Monday 1 July 2013

ternary operations, take1

Today my plan was to implement use of ternary flint operations, i.e. addmul and submul. However, I must confess I severely underestimated the complexity of this. To get an idea of what is going on, suppose we want to evaluate the expression a + b*c into a temporary. The idea is that, if a is itself an expression template, and assuming for simplicity that both b and c are ordinary fmpzs, we first evaluate a into the temporary (henceforth tmp) and then call addmul(tmp, b, c).
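
In FLINT terms, the intended expansion for this simple case is something like the following sketch (using the notation of the earlier codegen posts; evaluate_a_into is a hypothetical stand-in for the evaluation of the subexpression a, not a real function):

    fmpz_t tmp;
    fmpz_init(tmp);
    evaluate_a_into(tmp);                      /* hypothetical: evaluate a into tmp */
    fmpz_addmul(tmp, b._data(), c._data());    /* tmp += b*c, so tmp == a + b*c */
    /* ... hand tmp over to whatever wanted the temporary ... */
    fmpz_clear(tmp);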

The devil hides in "for simplicity". Namely, if b and/or c are allowed to be expression templates as well, things get difficult. The optimal order of evaluation depends on the number of temporaries required for evaluating each of a, b, and c (henceforth t1, t2 and t3). Moreover, the total number of temporaries required depends not only on the maximum of t1, t2 and t3, but also on the number of coincidences among these. Additionally, if b or c is an immediate, then evaluation still has to proceed differently, because in this case we cannot determine where the evaluation ends up (because we don't evaluate anything at all). A final complication arises because even for purposes of evaluating a, b, and c, they are not on equal footing. Namely, we need a to end up in tmp, whereas b and c need to end up in newly allocated temporaries.

I now have a first implementation of addmul on github. It takes up about 550 lines of code. The code takes care of all the above difficulties; adding some missing bits (handling a*b + c, and submul) should not significantly increase the code size. Neither should making this a bit more generic (since it is currently tied to fmpz). I am making one simplifying assumption, though: that the expression we are evaluating is homogeneous. By this I mean that it only involves one data type (fmpz in this case). This way instead of merging temporaries, we just need to count them.

Tomorrow I will finish up this code. Missing features I have indicated above; beyond that we need a lot more testing. I am already testing all the different scenarios alluded to in the second paragraph. Additionally, I will need to test the new functionality, and some more complex expressions. But most importantly, we need more codegen tests. This expression template evaluation code is highly complex, and involves a lot of "nominally runtime" code which has to be optimized away; I have to verify that.


One interesting observation is that this code also seems to stress test the compiler quite a bit. Compiling t-mpz.cpp using gcc requires about 2GB of memory, and takes on the order of 2 minutes on my machine (Intel Core 2 Duo P8700 @ 2.53GHz, 4 GB main memory -- most of the time is probably spent allocating memory). In comparison, clang requires about 400MB and 10 seconds.

Tuesday 25 June 2013

Another look at asm

This morning I implemented temporary-efficient generic binary operations.

If you recall the last blog post about asm, we had a test expression as follows

    out = ((a + b) + (c + d)) + ((a + c) + (b + d));

which "compiled into the following C-equivalent"

    fmpz_t tmp1, tmp2, tmp3, tmp4, tmp5, tmp6;
    fmpz_init(tmp1);
    fmpz_init(tmp2);
    fmpz_init(tmp3);
    fmpz_init(tmp4);
    fmpz_init(tmp5);
    fmpz_init(tmp6);

    fmpz_add(tmp1, a._data(), b._data());
    fmpz_add(tmp2, c._data(), d._data());
    fmpz_add(tmp3, tmp1, tmp2);
    fmpz_add(tmp4, a._data(), c._data());
    fmpz_add(tmp5, b._data(), d._data());
    fmpz_add(tmp6, tmp4, tmp5);
    fmpz_add(out._data(), tmp3, tmp6);

    fmpz_clear(tmp1);
    fmpz_clear(tmp2);
    fmpz_clear(tmp3);
    fmpz_clear(tmp4);
    fmpz_clear(tmp5);
    fmpz_clear(tmp6);

Obviously this uses more temporaries than necessary. But at the time of the last post, I had not yet implemented tuple merging, so this would have been hard to improve. Of course we now do have tuple merging, and since today I wanted to start implementing mpz "in some completeness", it seemed reasonable to begin by employing the tuple merging code to get efficient generic binary operations. And so I did. The resulting C-equivalent is now the following:

    fmpz_t tmp1, tmp2, tmp3;
    fmpz_init(tmp1);
    fmpz_init(tmp2);
    fmpz_init(tmp3);

    fmpz_add(tmp1, a._data(), b._data());
    fmpz_add(tmp2, c._data(), d._data());
    fmpz_add(tmp3, tmp1, tmp2);
    fmpz_add(tmp1, a._data(), c._data());
    fmpz_add(tmp2, b._data(), d._data());
    fmpz_add(tmp1, tmp1, tmp2);
    fmpz_add(out._data(), tmp1, tmp3);

    fmpz_clear(tmp1);
    fmpz_clear(tmp2);
    fmpz_clear(tmp3);

As far as I can tell this is optimal (except for moving brackets using associativity). If not please let me know.

Of course there is a more efficient way of evaluating the above expression. In C++, it would be (for example)

    out = (a + (((b + (c + (a + b))) + c) + d));

and the C-equivalent would then be

    fmpz_t tmp;
    fmpz_init(tmp);

    fmpz_add(tmp, a._data(), b._data());
    fmpz_add(tmp, c._data(), tmp);
    fmpz_add(tmp, b._data(), tmp);
    fmpz_add(tmp, tmp, c._data());
    fmpz_add(tmp, tmp, d._data());
    fmpz_add(out._data(), a._data(), tmp);

    fmpz_clear(tmp);

(The funny way of bracketing is just to test my code; the point is that you always add one new term to a temporary result, so you never need to allocate more temporaries. In fact out = a + b + c + a + b + c + d achieves a similar effect.)

In fact here something really nice happens. Recall that the asm code generated for the expression out = ((a + b) + (c + d)) + ((a + c) + (b + d)); and the one for the hand-coded C-equivalent are not actually the same -- they use the same number of call statements, but somewhat different stack frames. The automatically generated code is also about 10% longer.

On the other hand, for the expression out = (a + (((b + (c + (a + b))) + c) + d)), the asm generated in both cases is in fact exactly equivalent (just using different register allocations and jump target names).


An interesting question to investigate once the main wrapper is done is the following: should we try to automatically use associativity to convert the first expression into the third (i.e. remove brackets)? My inkling is no, since it seems like if the user goes through the effort of bracketing, she probably wants us to evaluate things in the order of brackets; presumably because she knows something the library cannot know.

Monday 24 June 2013

Into the deep

I had another productive day; the code size is increasing at a rapid pace. Today I implemented arbitrary addition of myint, and also a new test class mylong. This is very similar to myint; the main point is to test that we can do mixed addition (i.e. adding a myint and a mylong and getting a mylong).
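
Stripped of the expression template machinery, the rule being tested is just the usual promotion idea, roughly like this toy (illustrative only, not the actual test code):

    // Mixed addition promotes to the "larger" type: myint + mylong -> mylong.
    struct myint  { int  v; explicit myint(int x) : v(x) { } };
    struct mylong { long v; explicit mylong(long x) : v(x) { } };

    inline mylong operator+(const myint& a, const mylong& b) { return mylong(a.v + b.v); }
    inline mylong operator+(const mylong& a, const myint& b) { return mylong(a.v + b.v); }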

I am now reasonably convinced that the basics are working, and tomorrow I will start implementing the mpz class in some completeness (excluding contexts for now, but with addmul etc).

With the increased complexity I am also hitting an expected problem: the C++ template system really was not designed to be (ab)used in this way. The main ramification of this is atrociously long and hard-to-understand error messages. I have been remedying this problem to some extent by using clang for debugging compiler errors -- clang's diagnostics are far superior to gcc's.

Sunday 23 June 2013

I just happen to enjoy working on Sundays

... said nobody ever. But I will graduate on Friday, and most likely won't have any time to get significant work done that day. So I figured I could just start my work week one day early, and take off Friday.

After I managed to make myself crawl out of bed on a Sunday morning at 8, I actually had a very productive day. I moved the expression class (together with its surrounding helpers) from prototype.h into its own file. In case you are not intimately aware of the code structure of the wrapper, this is the class doing the "heavy lifting". It is the main expression template class, from which all concrete classes derive.

Naturally enough, I also started a test file for the expression class. This implements a "dummy" concrete derived class myint, just wrapping an ordinary int. It seemed useful to isolate the entire work needed to define a derived class in the test file, instead of e.g. using the mpz wrapper for basic testing.

At some point during the day I thought I needed priorities for the evaluation traits, so I implemented these as well. Turns out I don't need them immediately, but it probably can't hurt to have them around anyway.

The test class myint just needs a few more lines to implement addition, and then it will be on par with the mpz class of prototype.h. It does have printing, assignment, initialization and equality comparison; and of course all of these features are tested extensively.

Tomorrow I will continue working in parallel on the three strands I already worked on today: (1) extend expression, and (2) test it by (3) extending myint.

Thursday 20 June 2013

A detour into assembly

So as I mentioned last time, today I decided to take a look at the assembly code my prototype is compiled into, to verify that the compiler does the necessary optimizations. The example I looked at is

    out = ((a + b) + (c + d)) + ((a + c) + (b + d))

Currently, this should be turned into the C equivalent of

    fmpz_t tmp1, tmp2, tmp3, tmp4, tmp5, tmp6;
    fmpz_init(tmp1);
    fmpz_init(tmp2);
    fmpz_init(tmp3);
    fmpz_init(tmp4);
    fmpz_init(tmp5);
    fmpz_init(tmp6);

    fmpz_add(tmp1, a._data(), b._data());
    fmpz_add(tmp2, c._data(), d._data());
    fmpz_add(tmp3, tmp1, tmp2);
    fmpz_add(tmp4, a._data(), c._data());
    fmpz_add(tmp5, b._data(), d._data());
    fmpz_add(tmp6, tmp4, tmp5);
    fmpz_add(out._data(), tmp3, tmp6);

    fmpz_clear(tmp1);
    fmpz_clear(tmp2);
    fmpz_clear(tmp3);
    fmpz_clear(tmp4);
    fmpz_clear(tmp5);
    fmpz_clear(tmp6);

(This is of course a rather inefficient way of doing it, but that's beside the point. The question is whether the compiler really manages to turn the first line into the second snippet.)

When compiling without exception support (which clutters things up with stack unwinding code and makes the comparison harder), on my machine the first snippet is turned into this. For comparison, the handwritten code turns into this. The key differences are as follows:

  • the hand-written code is 10% shorter
  • the hand-written code uses one register less
On the other hand, both use exactly the same number of call statements. This is good, because it indicates that everything is inlined as expected.

I'm a bit wary about the increase in register usage and code length, but I think I will press on for now.

Wednesday 19 June 2013

Metaprogramming day

So it's day three, and today I kept extending the prototype. By now it is possible to add fmpz's, in good cases. Things are not completely optimized yet, and both code formatting and layout (and comments) still have a long way to go, but this feels like progress. The metaprogramming style I employ relies very heavily on the compiler being able to optimize everything away, and I feel I should verify that it does. Thus tomorrow I'll extend the code to the extent that we can add arbitrary fmpz's, and then I'll start analysing (and testing) the generated code. This should be interesting.

Tuesday 18 June 2013

Some code - YAY.

So, this is day two. I decided to forgo further detailed design work and actually start writing some code. My plan now is to write -- you guessed it -- another prototype. But this time for real; I'm only calling it a prototype because that allows me to stuff all code into one big messy file. I'll split it up soon.

So today I first upgraded the build system a bit to be able to build and execute C++ tests. After that I began writing the expression template prototype. It lives (unsurprisingly) in cxx/prototype.h; right now there is not much functionality at all (instantiation, destruction, and printing of fmpz). This will change tomorrow.

One question regarding the build system: is there a good reason why we have no real dependency tracking? In particular, currently every C file depends on *every* header, causing everything to be recompiled if only one header is changed. This is a bit of a pain for a header-only library ;). I have currently changed the build system so as to track headers in the module directory separately (whence only the cxx tests will be recompiled after I change cxx headers, not the entirety of FLINT), but it does not seem much harder to track the headers properly in the first place.

Monday 17 June 2013

And so it all begins

Today is the first day of GSOC, and dutifully I got up at 8am to start working at 9. I have spent all of the day outlining my design in a bit more detail. Since I will write a message to the list about this, I shall not go into more detail here.

Instead, let me explain how I intend to document my progress. Namely, I will write a blog post every day that I work, explaining very shortly what it is that I did. This is not intended to be a complete summary at all, it might just be a fun thing I noticed, or a "milestone" in my implementation, or whatever. I want to keep the amount of time I invest in these posts very short, say less than 10min (which is why I am already typing like a madman ...).

Additionally, every Friday I will send an email to the list, documenting in more detail my actual progress throughout the week.

Wednesday 8 May 2013

Introduction

Hi everyone,

my name is Tom Bachmann and I'm applying for the Google Summer of Code programme to the FLINT project. In case anyone isn't aware, FLINT is a fast library for number theoretic computations, written in C. That is to say, it implements arbitrary-precision data types, and arithmetic and higher operations on them. For example, it provides arbitrary precision integers (so no matter how many times you multiply by two, you never get an overflow; although the required memory will increase linearly). Of course you can add and multiply integers, etc. More interestingly, you can also factorize them. Basically, anything you might want to ask about particular elements of the ring of integers, FLINT should be able to do. Similar support exists for the rational numbers, finite fields, and polynomial and matrix rings over these.

Also a few words about Summer of Code: the Google Summer of Code (henceforth GSOC for short) is an annual programme sponsored by Google, where students obtain three months of funding to write code for a free software (aka "open source") project. This is pretty awesome both for students (who can spend the summer doing something interesting and worthwhile, without having to worry about money) and for the free software projects, which tend to see a huge amount of activity over the summer in this way.

And this is precisely what I want to do. In order to be accepted to GSOC, students apply to a mentoring organization. In my case, I'm applying to the lmonade platform, which is an umbrella organization coordinating and packaging various open source projects related to scientific computations. One such project is FLINT, whence the circle closes.

So what is it that I want to do in particular? Well, as stated at the very beginning of this post, FLINT is written in C. That's a very reasonable choice. It is meant to be as fast as at all possible, so a compiled language is necessary, and then (in my experience) the only reasonable choices are C and C++ (well, let's not overlook Fortran, but then let's try to forget about it as fast as possible). The advantage of C++ over C is greater expressiveness, and in particular object-oriented syntax and semantics. The great advantage of C is "being the lowest common denominator". Essentially any programming language that exists is able to call C functions. Since an arithmetic library is meant to be highly reusable, this is a pretty strong argument in favour of C. Others are that not everyone likes C++ as much as I do (as far as I can tell, this is mainly because C++ is much more complex, and hence takes a lot more effort to understand and apply properly), and that certainly more people know C than C++.

Nonetheless, now that FLINT is an established library, it seems like a good idea to implement various language wrappers, to allow it to be used natively in other programming languages. The most obvious candidates for first language wrappers are python (because it is used a lot by the scientific community) and C++ (for all the reasons mentioned above). The project I am applying for, then, is the construction of a C++ wrapper library for FLINT.

This is an interesting challenge. Indeed, the wrapper library should achieve performance as close as possible to the C library, and also to any hypothetical native C++ implementation. This means, in particular, that arithmetic operations should compile as efficiently as possible into C calls. For example, an expression like a + b + c + d could naively be evaluated as tmp1 = a + b; tmp2 = c + d; res = tmp1 + tmp2. However, it would be more efficient to write this as res = a + b; res += c; res += d; yielding fewer temporaries. In general, unless one is very careful, innocuous-looking expressions will call operator overloads, which create many inefficient temporaries. This has been known for a long time, and there is a reasonably straightforward solution: expression templates. Essentially, an expression of the form a + b + c + d does not actually immediately cause any computation at all. Instead, only upon assignment res = (the above) is the computation executed. Until then, the arithmetic operations are encoded in the type of the expression. So a + b + c + d may return an object of a type such as Add<mpz, Add<mpz, Add<mpz, mpz> > > where the template class Add<A, B> simply stores a pair of objects. (Ab)Using the type system like this may seem like an abomination at first, but is actually very powerful, and has become common practice in the C++ community. (In fact, the template system can be used to implement a Turing machine at compile time. One might argue this is taking it a step too far...) The point is that by delaying execution like this, we can determine the most efficient way of evaluating an expression once it is known completely. We can then use an arbitrarily complicated strategy to decide in what order to execute operations, how many temporaries to allocate, etc. This construction has two important properties:
  1. All the extra logic is carried out at compile time.
  2. This requires nothing beyond a C++ compiler.
We can see here the power of C++: it allows us to essentially write domain-specific languages without leaving C++, which are transformed at compile time into any C++ (or C) code we want.
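
To give a flavour of the mechanism, here is a minimal toy version (far simpler than anything the actual wrapper will need, and not FLINT-related at all): operator+ only records its operands in the type, and the actual addition happens when the result is assigned.

    #include <iostream>

    // A record of a pending addition; evaluation is deferred until assignment.
    template<class L, class R>
    struct Add
    {
        const L& l; const R& r;
        Add(const L& l_, const R& r_) : l(l_), r(r_) { }
        long eval() const { return l.eval() + r.eval(); }
    };

    // Stand-in for a wrapped number type (the real thing would wrap fmpz).
    struct toyint
    {
        long v;
        explicit toyint(long v_ = 0) : v(v_) { }
        long eval() const { return v; }

        template<class L, class R>
        toyint& operator=(const Add<L, R>& e) { v = e.eval(); return *this; }
    };

    inline Add<toyint, toyint> operator+(const toyint& a, const toyint& b)
    { return Add<toyint, toyint>(a, b); }

    template<class L, class R>
    Add<Add<L, R>, toyint> operator+(const Add<L, R>& a, const toyint& b)
    { return Add<Add<L, R>, toyint>(a, b); }

    int main()
    {
        toyint a(1), b(2), c(3), d(4), res;
        res = a + b + c + d;          // builds Add<Add<Add<toyint, toyint>, toyint>, toyint>
        std::cout << res.v << '\n';   // prints 10
    }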

All this (designing an expression template domain-specific language to carry out arithmetic operations most efficiently) is quite fun in and of itself. But there is another interesting dimension to the problem. Namely, FLINT developers want to write highly specialised algorithms in C, not deal with a language wrapper (be it for C++ or anything else). As such, the wrapper must be very easy to extend, and doing so should require as little knowledge of C++ as possible. For example, FLINT supports many rings (integers, rationals, ...), and will support more in the future. Adding a new data type to the wrapper should be as painless as possible.


This is a high-level summary of how I understand my application. I have written out all the gory details (or at least, more details than here) in my proposal on Melange. The application phase is already closed; now it just remains to sit back and wait.