
Delta cancellation and amalgamate spectral and coffee modes #142

Merged
merged 15 commits into from
Aug 3, 2017

Conversation

miklos1
Member

@miklos1 miklos1 commented Aug 2, 2017

Main news:

  • Spectral mode now incorporates the simplified COFFEE algorithm (#98), and becomes the default mode. (Resolves #116?)

Some details:

  • A minor change to the COFFEE algorithm (see commit c67721a), which otherwise moves to GEM. It is now shared between the coffee mode and the spectral mode.
  • To avoid a performance hit, gem.optimise.delta_elimination now injects ComponentTensor nodes whose removal is now the caller's responsibility.
  • Spectral mode is mostly rewritten.
  • Added test cases for underintegration.
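The delta cancellation in the title refers to collapsing sums against Kronecker deltas. A minimal plain-Python sketch of the identity (hypothetical helpers, not the GEM API): summing Delta(i, j) * A[j] over j collapses to A[i], so the whole summation loop disappears.

```python
# Illustration of Kronecker delta cancellation (toy code, not the GEM API):
# sum over j of delta(i, j) * A[j] collapses to A[i].
def naive(A, i):
    """Evaluate sum_j delta(i, j) * A[j] by brute force."""
    return sum((1 if i == j else 0) * A[j] for j in range(len(A)))

def cancelled(A, i):
    """The delta-cancelled form: the summation disappears entirely."""
    return A[i]

A = [3.0, 5.0, 7.0]
assert all(naive(A, i) == cancelled(A, i) for i in range(len(A)))
```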

try:
    from firedrake import Citations
    Citations().register("Luporini2016")
except ImportError:
    pass
Member Author

So this module is only imported from tsfc.coffee and tsfc.spectral, which are in turn only imported on demand, when the particular mode is used.
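The on-demand import pattern described here can be sketched generically (using `json` as a stand-in for tsfc.coffee / tsfc.spectral, so the sketch is self-contained):

```python
import importlib

def load_mode(module_name):
    # Import a representation module only when its mode is actually
    # requested, so optional dependencies are never touched otherwise.
    return importlib.import_module(module_name)

mod = load_mode("json")   # stand-in for "tsfc.coffee" or "tsfc.spectral"
assert mod.dumps({"a": 1}) == '{"a": 1}'
```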

Contributor

@wence- wence- left a comment

Some very unscientific testing with gusto suggests that the first order discretisation gets about a 10% speed boost with "spectral" as the default mode.

active_monomials = [m for m in monomials if m.atomics]
optimal_atomics = find_optimal_atomics(active_monomials, linear_indices)
result += factorise_atomics(active_monomials, optimal_atomics, linear_indices)
return result
Contributor

This is just code-movement.

Member Author

The very end has actually changed (c67721a), but otherwise yes, it is just code movement.

    return expression
elif isinstance(expression, Delta):
    mapper = MemoizerArg(filtered_replace_indices)
    return mapper(expression, ((from_, to_),))
Contributor

OK, so the delta must have from_ as a free index, so replace with to_, and at the same time, do a little simplification.

Member Author

Yes, because if we allow Delta nodes to be wrapped in Indexed(ComponentTensor(..., ...), ...), then isinstance(f, Delta) will fail in the next round of the fixed-point iteration that looks for delta cancellation opportunities. See the test case.
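A minimal sketch (toy stand-in classes, not the actual GEM node types) of why the wrapping defeats the isinstance check:

```python
# Toy stand-ins for the GEM node classes, just to show the isinstance issue.
class Delta:
    pass

class ComponentTensor:
    def __init__(self, child):
        self.child = child

class Indexed:
    def __init__(self, child):
        self.child = child

f = Indexed(ComponentTensor(Delta()))
# The wrapped Delta no longer matches, so the next round of the fixed-point
# iteration would miss the cancellation opportunity:
assert not isinstance(f, Delta)
# Only after removing the wrappers does the Delta become visible again:
assert isinstance(f.child.child, Delta)
```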

@@ -44,14 +32,86 @@ def Integrals(expressions, quadrature_multiindex, argument_multiindices, paramet

:returns: list of integral representations
"""
# Rewrite: a / b => a * (1 / b)
expressions = replace_division(expressions)
Contributor

This continues to feel like cargo-culting to me. Maybe @FabioLuporini can comment, but is it really the case that compilers do a better job when given:

A = B * (1/C);

?


As far as I remember, yes, when 'b' is a constant.


the b I'm referring to above is the one in the code, in a / b

Contributor

Ah, I see.

Contributor

OK, I see. A little bit of experimentation, using GCC 7.1:

#define VEC 4

void kernel(double * __restrict__ A, double *  C, double *  B, double D)
{
    for (int j = 0; j < 128; j++) {
        for (int k = 0; k < VEC; k++) {
            A[j*VEC + k] += C[j*VEC + k]*B[j*VEC + k] / D;
        }
    }
}

If I compile with -O3 -march=broadwell I get:

kernel(double*, double*, double*, double):
  vbroadcastsd ymm0, xmm0
  xor eax, eax
.L2:
  vmovupd ymm1, YMMWORD PTR [rsi+rax]
  vmulpd ymm1, ymm1, YMMWORD PTR [rdx+rax]
  vdivpd ymm1, ymm1, ymm0
  vaddpd ymm1, ymm1, YMMWORD PTR [rdi+rax]
  vmovupd YMMWORD PTR [rdi+rax], ymm1
  add rax, 32
  cmp rax, 4096
  jne .L2
  vzeroupper
  ret

So D is broadcast across the vector lanes outside the loop, but the division still happens inside the loop.

If I add -ffast-math (meaning that the optimisations don't have to maintain IEEE compliance, which, amongst other things, allows the compiler to make the substitution a/b => a*(1/b)), then I get:

kernel(double*, double*, double*, double):
  vmovsd xmm1, QWORD PTR .LC0[rip]
  xor eax, eax
  vdivsd xmm1, xmm1, xmm0
  vbroadcastsd ymm1, xmm1
.L2:
  vmovupd ymm0, YMMWORD PTR [rsi+rax]
  vmulpd ymm0, ymm0, YMMWORD PTR [rdx+rax]
  vfmadd213pd ymm0, ymm1, YMMWORD PTR [rdi+rax]
  vmovupd YMMWORD PTR [rdi+rax], ymm0
  add rax, 32
  cmp rax, 4096
  jne .L2
  vzeroupper
  ret
.LC0:
  .long 0
  .long 1072693248

So the division is performed outside the loop, and inside the loop it is turned into a multiplication.

If I manually replace / D => * (1/D), then I get:

kernel(double*, double*, double*, double):
  vmovsd xmm1, QWORD PTR .LC0[rip]
  xor eax, eax
  vdivsd xmm1, xmm1, xmm0
  vbroadcastsd ymm1, xmm1
.L2:
  vmovupd ymm0, YMMWORD PTR [rsi+rax]
  vmulpd ymm0, ymm0, YMMWORD PTR [rdx+rax]
  vfmadd213pd ymm0, ymm1, YMMWORD PTR [rdi+rax]
  vmovupd YMMWORD PTR [rdi+rax], ymm0
  add rax, 32
  cmp rax, 4096
  jne .L2
  vzeroupper
  ret
.LC0:
  .long 0
  .long 1072693248

Which is as if I'd said -ffast-math.

We currently don't put -ffast-math in the compiler flags, which I think is a reasonable position.

Member Author

As far as I remember from @tj-sun's porting of the old COFFEE code, the logic was something like this: apply this rewriting if b is a number or a Symbol. A Symbol could mean either a constant or a variable reference. Generally, the denominator was assigned to a temporary if it was used more than once. That is, instead of dividing n times by the same number, the substitution makes 1 division and n multiplications. The most common use case is the calculation of the inverse Jacobian, where the determinant is the denominator.
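The trade-off described above can be sketched directly, with values chosen so that both forms are bit-exact (a power-of-two denominator; in general they can differ in the last ulp, which is why compilers only do this rewrite under -ffast-math):

```python
# n divisions by the same denominator...
def divide_each(values, det):
    return [v / det for v in values]

# ...versus 1 division plus n multiplications.
def scale_each(values, det):
    inv = 1.0 / det
    return [v * inv for v in values]

# With a power-of-two denominator the two forms agree exactly.
vals = [4.0, 6.0, 10.0]
assert divide_each(vals, 2.0) == scale_each(vals, 2.0) == [2.0, 3.0, 5.0]
```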

Contributor

FWIW, clang won't generate FMAs unless you say either -ffp-contract=fast or -ffast-math.


expressions = [index_sum(e, quadrature_multiindex) for e in expressions]
argument_indices = tuple(chain(*argument_multiindices))
return [Integral(e, quadrature_multiindex, argument_indices) for e in expressions]
Contributor

OK, so now the representation is no longer just gem.

for monomial in monomial_sum:
    var, s, a, r = delta_elimination(variable, *monomial)
    narrow_variables.setdefault(var)
    delta_simplified[var].add(s, a, r)
Contributor

Because there is no OrderedDefaultDict.
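The workaround can be sketched like this (simplified values; the real code stores MonomialSum objects):

```python
from collections import OrderedDict, defaultdict

# Emulating an "OrderedDefaultDict": an OrderedDict remembers first-seen
# key order (its values stay None), while a plain defaultdict holds the data.
order = OrderedDict()
data = defaultdict(list)

for key in ["u", "w", "u", "v"]:
    order.setdefault(key)        # records the key on first sight only
    data[key].append(len(data[key]))

assert list(order) == ["u", "w", "v"]
assert data["u"] == [0, 1]
```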

tsfc/spectral.py Outdated
def prune(factors):
    result = remove_componenttensors(factors[:-1])
    result.append(factors[-1])
    return result
Contributor

Is it obvious why you don't want to remove component tensors from the last factor?

Member Author

During argument factorisation, the "atomics" are subexpressions of FInAT-provided tabulations, which means they are small, so this index substitution is cheap. Moreover, MonomialSum collapses monomials with the same set of sum indices and atomics. Layers of Indexed(ComponentTensor(..., ...), ...) would cause atomics that are actually the same to look different, and thus would hinder the shortening of the MonomialSum.

The "rest" part, which is the last factor if you look at the ordering above, is the opposite of what is described above. It can be arbitrarily big, making remove_componenttensors costly, and there is no benefit to early removal of those ComponentTensors.

Perhaps I should write some of this into a comment?

Contributor

Sounds like a good idea.

tail_atomics = tuple(a for a in atomics
if set(tail_indices) & set(a.free_indices))
head_indices = tuple(i for i in sum_indices if i not in tail_ordering)
head_atomics = tuple(a for a in atomics if a not in tail_atomics)
Contributor

Again because OrderedSet is not a thing.
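The order-preserving filtering above sidesteps the missing OrderedSet; a self-contained sketch of the same pattern (placeholder atomic names, purely illustrative):

```python
# Preserving encounter order while doing set-like membership tests:
atomics = ("a1", "a2", "a3", "a4")
tail = {"a2", "a4"}  # membership set; iteration order is NOT relied upon
tail_atomics = tuple(a for a in atomics if a in tail)
head_atomics = tuple(a for a in atomics if a not in tail_atomics)

# A plain set would not guarantee this ordering.
assert tail_atomics == ("a2", "a4")
assert head_atomics == ("a1", "a3")
```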

monomial_sum = MonomialSum()
for (sum_indices, atomics), monosum in sub_monosums:
    new_rest = sum_factorise(variable, tail_ordering[1:], monosum)
    monomial_sum.add(sum_indices, atomics, new_rest)
Contributor

OK, so this is more "loop invariant code motion".
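The hoisting amounts to the usual sum-factorisation identity; a sketch with concrete numbers:

```python
# sum_i sum_j A[i] * B[j]  ==  (sum_i A[i]) * (sum_j B[j])
A = [1.0, 2.0, 3.0]
B = [4.0, 5.0]

naive = sum(A[i] * B[j] for i in range(len(A)) for j in range(len(B)))
factored = sum(A) * sum(B)  # the inner-loop-invariant factor is hoisted

assert naive == factored == 54.0
```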

@wence-
Contributor

wence- commented Aug 3, 2017

Spectral mode now incorporates @tj-sun's simplified COFFEE algorithm (#98), and becomes the default mode. (Resolves #116?)

I think so, given what is currently "COFFEE" in tsfc.

@miklos1 miklos1 merged commit a783f99 into master Aug 3, 2017
miklos1 added a commit that referenced this pull request Aug 3, 2017
* underintegration:
  add a short comment
  fix previous bug
  expose delta_elimination bug in test case
  change default mode: coffee -> spectral
  add some more comments
  rename argument indices to linear indices
  move COFFEE algorithm from TSFC to GEM
  test underintegration tricks
  fix Python 2 flake8
  fix failing test case
  rewrite spectral mode
  tolerate and skip monomials with no atomics
  delay index substitution in delta_elimination
  disconnect coffee_mode.Integrals from spectral.Integrals
  optimise away Delta in UFL arguments
@miklos1 miklos1 deleted the underintegration branch August 3, 2017 10:33