
Delta cancellation and amalgamate spectral and coffee modes #142

Merged
merged 15 commits into from
Aug 3, 2017

Conversation

miklos1
Member

@miklos1 miklos1 commented Aug 2, 2017

Main news:

  • Spectral mode now incorporates the simplified COFFEE algorithm (#98), and becomes the default mode. (Resolves #116?)

Some details:

  • A minor change to the COFFEE algorithm (see commit c67721a), which otherwise moves to GEM. It is now shared between the coffee mode and the spectral mode.
  • To avoid a performance hit, gem.optimise.delta_elimination now injects ComponentTensor nodes whose removal is now the caller's responsibility.
  • Spectral mode is mostly rewritten.
  • Added test cases for underintegration.
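The delta cancellation in the title refers to collapsing sums against Kronecker deltas. A minimal plain-Python sketch of the identity (hypothetical helpers, not the GEM API): summing Delta(i, j) * A[j] over j collapses to A[i], so the whole summation loop disappears.

```python
# Illustration of Kronecker delta cancellation (toy code, not the GEM API):
# sum over j of delta(i, j) * A[j] collapses to A[i].
def naive(A, i):
    """Evaluate sum_j delta(i, j) * A[j] by brute force."""
    return sum((1 if i == j else 0) * A[j] for j in range(len(A)))

def cancelled(A, i):
    """The delta-cancelled form: the summation disappears entirely."""
    return A[i]

A = [3.0, 5.0, 7.0]
assert all(naive(A, i) == cancelled(A, i) for i in range(len(A)))
```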

try:
    from firedrake import Citations
    Citations().register("Luporini2016")
except ImportError:
    pass
Member Author

So this module is only imported from tsfc.coffee and tsfc.spectral, which are in turn only imported on demand, when the particular mode is used.
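The on-demand import pattern described here can be sketched generically (using `json` as a stand-in for tsfc.coffee / tsfc.spectral, so the sketch is self-contained):

```python
import importlib

def load_mode(module_name):
    # Import a representation module only when its mode is actually
    # requested, so optional dependencies are never touched otherwise.
    return importlib.import_module(module_name)

mod = load_mode("json")   # stand-in for "tsfc.coffee" or "tsfc.spectral"
assert mod.dumps({"a": 1}) == '{"a": 1}'
```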

Contributor

@wence- wence- left a comment

Some very unscientific testing with gusto suggests that the first order discretisation gets about a 10% speed boost with "spectral" as the default mode.

active_monomials = [m for m in monomials if m.atomics]
optimal_atomics = find_optimal_atomics(active_monomials, linear_indices)
result += factorise_atomics(active_monomials, optimal_atomics, linear_indices)
return result
Contributor

This is just code-movement.

Member Author

The very end has actually changed (c67721a), but otherwise yes, it is just code movement.

    return expression
elif isinstance(expression, Delta):
    mapper = MemoizerArg(filtered_replace_indices)
    return mapper(expression, ((from_, to_),))
Contributor

OK, so the delta must have from_ as a free index, so replace with to_, and at the same time, do a little simplification.

Member Author

Yes, because if we allow Delta nodes to be wrapped in Indexed(ComponentTensor(..., ...), ...), then isinstance(f, Delta) will fail in the next round of the fixed-point iteration that looks for delta cancellation opportunities. See the test case.
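A minimal sketch (toy stand-in classes, not the actual GEM node types) of why the wrapping defeats the isinstance check:

```python
# Toy stand-ins for the GEM node classes, just to show the isinstance issue.
class Delta:
    pass

class ComponentTensor:
    def __init__(self, child):
        self.child = child

class Indexed:
    def __init__(self, child):
        self.child = child

f = Indexed(ComponentTensor(Delta()))
# The wrapped Delta no longer matches, so the next round of the fixed-point
# iteration would miss the cancellation opportunity:
assert not isinstance(f, Delta)
# Only after removing the wrappers does the Delta become visible again:
assert isinstance(f.child.child, Delta)
```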

@@ -44,14 +32,86 @@ def Integrals(expressions, quadrature_multiindex, argument_multiindices, paramet

:returns: list of integral representations
"""
# Rewrite: a / b => a * (1 / b)
expressions = replace_division(expressions)
Contributor

This continues to feel like cargo-culting to me. Maybe @FabioLuporini can comment, but is it really the case that compilers do a better job when given:

A = B * (1/C);

?


As far as I remember, yes, when 'b' is a constant.


the b I'm referring to above is the one in the code, in a / b

Contributor

Ah, I see.

Contributor

OK, I see. A little bit of experimentation, using GCC 7.1:

#define VEC 4

void kernel(double * __restrict__ A, double *  C, double *  B, double D)
{
    for (int j = 0; j < 128; j++) {
        for (int k = 0; k < VEC; k++) {
            A[j*VEC + k] += C[j*VEC + k]*B[j*VEC + k] / D;
        }
    }
}

If I compile with -O3 -march=broadwell I get:

kernel(double*, double*, double*, double):
  vbroadcastsd ymm0, xmm0
  xor eax, eax
.L2:
  vmovupd ymm1, YMMWORD PTR [rsi+rax]
  vmulpd ymm1, ymm1, YMMWORD PTR [rdx+rax]
  vdivpd ymm1, ymm1, ymm0
  vaddpd ymm1, ymm1, YMMWORD PTR [rdi+rax]
  vmovupd YMMWORD PTR [rdi+rax], ymm1
  add rax, 32
  cmp rax, 4096
  jne .L2
  vzeroupper
  ret

So D is broadcast across the vector lanes outside the loop, but the division still happens inside the loop.

If I add -ffast-math (meaning that the optimisations don't have to maintain IEEE compliance, which, amongst other things, allows the compiler to make the substitution a/b => a*(1/b)), then I get:

kernel(double*, double*, double*, double):
  vmovsd xmm1, QWORD PTR .LC0[rip]
  xor eax, eax
  vdivsd xmm1, xmm1, xmm0
  vbroadcastsd ymm1, xmm1
.L2:
  vmovupd ymm0, YMMWORD PTR [rsi+rax]
  vmulpd ymm0, ymm0, YMMWORD PTR [rdx+rax]
  vfmadd213pd ymm0, ymm1, YMMWORD PTR [rdi+rax]
  vmovupd YMMWORD PTR [rdi+rax], ymm0
  add rax, 32
  cmp rax, 4096
  jne .L2
  vzeroupper
  ret
.LC0:
  .long 0
  .long 1072693248

So the division is performed outside the loop, and inside the loop it is turned into a multiplication.

If I manually replace / D => * (1/D), then I get:

kernel(double*, double*, double*, double):
  vmovsd xmm1, QWORD PTR .LC0[rip]
  xor eax, eax
  vdivsd xmm1, xmm1, xmm0
  vbroadcastsd ymm1, xmm1
.L2:
  vmovupd ymm0, YMMWORD PTR [rsi+rax]
  vmulpd ymm0, ymm0, YMMWORD PTR [rdx+rax]
  vfmadd213pd ymm0, ymm1, YMMWORD PTR [rdi+rax]
  vmovupd YMMWORD PTR [rdi+rax], ymm0
  add rax, 32
  cmp rax, 4096
  jne .L2
  vzeroupper
  ret
.LC0:
  .long 0
  .long 1072693248

Which is as if I'd said -ffast-math.

We currently don't put -ffast-math in the compiler flags, which I think is a reasonable position.

Member Author

As far as I remember from @tj-sun's porting of the old COFFEE code, the logic was something like this: apply this rewriting if b is a number or a Symbol. A Symbol could mean either a constant or a variable reference. Generally, the denominator was assigned to a temporary if it was used more than once. That is, instead of dividing n times by the same number, the substitution makes 1 division and n multiplications. The most common use case is the calculation of the inverse Jacobian, where the determinant is the denominator.
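The trade-off described above can be sketched directly, with values chosen so that both forms are bit-exact (a power-of-two denominator; in general they can differ in the last ulp, which is why compilers only do this rewrite under -ffast-math):

```python
# n divisions by the same denominator...
def divide_each(values, det):
    return [v / det for v in values]

# ...versus 1 division plus n multiplications.
def scale_each(values, det):
    inv = 1.0 / det
    return [v * inv for v in values]

# With a power-of-two denominator the two forms agree exactly.
vals = [4.0, 6.0, 10.0]
assert divide_each(vals, 2.0) == scale_each(vals, 2.0) == [2.0, 3.0, 5.0]
```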

Contributor

FWIW, clang won't generate FMAs unless you say either -ffp-contract=fast or -ffast-math.


expressions = [index_sum(e, quadrature_multiindex) for e in expressions]
argument_indices = tuple(chain(*argument_multiindices))
return [Integral(e, quadrature_multiindex, argument_indices) for e in expressions]
Contributor

OK, so now the representation is no longer just gem.

for monomial in monomial_sum:
    var, s, a, r = delta_elimination(variable, *monomial)
    narrow_variables.setdefault(var)
    delta_simplified[var].add(s, a, r)
Contributor

Because there is no OrderedDefaultDict.
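The workaround can be sketched like this (simplified values; the real code stores MonomialSum objects):

```python
from collections import OrderedDict, defaultdict

# Emulating an "OrderedDefaultDict": an OrderedDict remembers first-seen
# key order (its values stay None), while a plain defaultdict holds the data.
order = OrderedDict()
data = defaultdict(list)

for key in ["u", "w", "u", "v"]:
    order.setdefault(key)        # records the key on first sight only
    data[key].append(len(data[key]))

assert list(order) == ["u", "w", "v"]
assert data["u"] == [0, 1]
```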

tsfc/spectral.py Outdated
def prune(factors):
    result = remove_componenttensors(factors[:-1])
    result.append(factors[-1])
    return result
Contributor

Is it obvious why you don't want to remove component tensors from the last factor?

Member Author

During argument factorisation, the "atomics" are subexpressions of FInAT-provided tabulations, which means they are small, so this index substitution is cheap. Moreover, MonomialSum collapses monomials with the same set of sum indices and atomics. Layers of Indexed(ComponentTensor(..., ...), ...) would cause atomics that are actually the same to look different, and thus would hinder the shortening of the MonomialSum.

The "rest" part, which is the last factor if you look at the ordering above, is the opposite of what is described above. It can be arbitrarily big, making remove_componenttensors costly, and there is no benefit to early removal of those ComponentTensors.

Perhaps I should write some of this into a comment?

Contributor

Sounds like a good idea.

tail_atomics = tuple(a for a in atomics
if set(tail_indices) & set(a.free_indices))
head_indices = tuple(i for i in sum_indices if i not in tail_ordering)
head_atomics = tuple(a for a in atomics if a not in tail_atomics)
Contributor

Again because OrderedSet is not a thing.
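The order-preserving filtering above sidesteps the missing OrderedSet; a self-contained sketch of the same pattern (placeholder atomic names, purely illustrative):

```python
# Preserving encounter order while doing set-like membership tests:
atomics = ("a1", "a2", "a3", "a4")
tail = {"a2", "a4"}  # membership set; iteration order is NOT relied upon
tail_atomics = tuple(a for a in atomics if a in tail)
head_atomics = tuple(a for a in atomics if a not in tail_atomics)

# A plain set would not guarantee this ordering.
assert tail_atomics == ("a2", "a4")
assert head_atomics == ("a1", "a3")
```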

monomial_sum = MonomialSum()
for (sum_indices, atomics), monosum in sub_monosums:
    new_rest = sum_factorise(variable, tail_ordering[1:], monosum)
    monomial_sum.add(sum_indices, atomics, new_rest)
Contributor

OK, so this is more "loop invariant code motion".
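The hoisting amounts to the usual sum-factorisation identity; a sketch with concrete numbers:

```python
# sum_i sum_j A[i] * B[j]  ==  (sum_i A[i]) * (sum_j B[j])
A = [1.0, 2.0, 3.0]
B = [4.0, 5.0]

naive = sum(A[i] * B[j] for i in range(len(A)) for j in range(len(B)))
factored = sum(A) * sum(B)  # the inner-loop-invariant factor is hoisted

assert naive == factored == 54.0
```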

@wence-
Contributor

wence- commented Aug 3, 2017

Spectral mode now incorporates @tj-sun's simplified COFFEE algorithm (#98), and becomes the default mode. (Resolves #116?)

I think so, given what is currently "COFFEE" in tsfc.

@miklos1 miklos1 merged commit a783f99 into master Aug 3, 2017
miklos1 added a commit that referenced this pull request Aug 3, 2017
* underintegration:
  add a short comment
  fix previous bug
  expose delta_elimination bug in test case
  change default mode: coffee -> spectral
  add some more comments
  rename argument indices to linear indices
  move COFFEE algorithm from TSFC to GEM
  test underintegration tricks
  fix Python 2 flake8
  fix failing test case
  rewrite spectral mode
  tolerate and skip monomials with no atomics
  delay index substitution in delta_elimination
  disconnect coffee_mode.Integrals from spectral.Integrals
  optimise away Delta in UFL arguments
@miklos1 miklos1 deleted the underintegration branch August 3, 2017 10:33