Delta cancellation and amalgamate spectral and coffee modes #142
Conversation
Mass matrix and mass action in O(p^d) operations. Laplace operator in O(p^(d+2)) operations, cf. O(p^(2d+1)) required by mere sum factorisation.
Two modes share this: 'coffee' and 'spectral'.
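To make the complexity claims concrete, here is an illustrative sketch (plain NumPy, not TSFC code) of sum factorisation for a tensor-product operator in 3D: applying the operator dimension by dimension costs O(p^(d+1)) per element instead of the naive O(p^(2d)).

```python
import numpy as np

# Hypothetical 1D operator A and coefficient tensor u; names are for
# illustration only and do not correspond to TSFC internals.
p = 4
A = np.random.rand(p, p)
u = np.random.rand(p, p, p)

# Naive application of A (x) A (x) A: O(p^6) = O(p^(2d)) for d = 3.
naive = np.einsum('ai,bj,ck,ijk->abc', A, A, A, u)

# Sum-factorised: three O(p^4) = O(p^(d+1)) sweeps, one per dimension.
t1 = np.einsum('ai,ijk->ajk', A, u)
t2 = np.einsum('bj,ajk->abk', A, t1)
fact = np.einsum('ck,abk->abc', A, t2)

assert np.allclose(naive, fact)
```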
try:
    from firedrake import Citations
    Citations().register("Luporini2016")
except ImportError:
    pass
So this module is only imported from tsfc.coffee and tsfc.spectral, which are in turn only imported on demand, when the particular mode is used.
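The on-demand import mechanism described here can be sketched with the stdlib (the module names below are placeholders, not the real tsfc layout):

```python
import importlib

def get_mode(name):
    # In the real code this would be e.g. "tsfc.spectral"; here we
    # demonstrate the mechanism with a stdlib module instead.
    return importlib.import_module(name)

# The module (and anything it imports, such as citation registration)
# is only loaded when the mode is actually requested.
mode = get_mode("json")
assert mode.loads("[1, 2]") == [1, 2]
```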
Some very unscientific testing with gusto suggests that the first order discretisation gets about a 10% speed boost with "spectral" as the default mode.
active_monomials = [m for m in monomials if m.atomics]
optimal_atomics = find_optimal_atomics(active_monomials, linear_indices)
result += factorise_atomics(active_monomials, optimal_atomics, linear_indices)
return result
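A minimal sketch of what factorising atomics achieves (hypothetical helper, not the real gem API): pick the atomic shared by the most monomials and factor it out, rewriting a*x + a*y as a*(x + y).

```python
from collections import Counter

def find_optimal_atomic(monomials):
    # Pick the atomic that appears in the most monomials; factoring it
    # out saves one multiplication per monomial it appears in.
    counts = Counter(a for m in monomials for a in m)
    atomic, _ = counts.most_common(1)[0]
    return atomic

# Monomials represented as tuples of atomic names, for illustration.
monomials = [('a', 'x'), ('a', 'y'), ('b', 'z')]
assert find_optimal_atomic(monomials) == 'a'
```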
This is just code-movement.
This very end has actually changed (c67721a), but otherwise yes, it is just code movement.
        return expression
    elif isinstance(expression, Delta):
        mapper = MemoizerArg(filtered_replace_indices)
        return mapper(expression, ((from_, to_),))
OK, so the delta must have from_ as a free index, so replace with to_, and at the same time, do a little simplification.
Yes, because if we allow Delta nodes to be wrapped in Indexed(ComponentTensor(..., ...), ...), then isinstance(f, Delta) will fail in the next round of the fixed-point iteration looking for delta cancellation opportunities. See the test case.
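The delta-cancellation rewrite under discussion, sketched with hypothetical classes (not the real gem API): inside a sum over j, a factor Delta(i, j) can be dropped if j is substituted with i everywhere else, i.e. sum_j Delta(i, j) * f[j] becomes f[i].

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Delta:
    i: str
    j: str

def eliminate_delta(sum_index, factors):
    """If some factor is Delta(i, sum_index), drop it and substitute
    sum_index -> i in the remaining (name, index) placeholder factors."""
    for f in factors:
        if isinstance(f, Delta) and f.j == sum_index:
            rest = [g for g in factors if g is not f]
            return [(name, f.i if idx == sum_index else idx)
                    for name, idx in rest]
    return None  # no delta to cancel

# sum_j Delta(i, j) * f[j]  ->  f[i]
assert eliminate_delta('j', [Delta('i', 'j'), ('f', 'j')]) == [('f', 'i')]
```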
@@ -44,14 +32,86 @@ def Integrals(expressions, quadrature_multiindex, argument_multiindices, parameters):

    :returns: list of integral representations
    """
    # Rewrite: a / b => a * (1 / b)
    expressions = replace_division(expressions)
This continues to feel like cargo-culting to me. Maybe @FabioLuporini can comment, but is it really the case that compilers do a better job when given A = B * (1/C)?
As far as I remember, yes, when 'b' is a constant.
The b I'm referring to above is the one in the code, in a / b.
Ah, I see.
OK, I see. A little bit of experimentation, using GCC 7.1:
#define VEC 4
void kernel(double * __restrict__ A, double * C, double * B, double D)
{
for (int j = 0; j < 128; j++) {
for (int k = 0; k < VEC; k++) {
A[j*VEC + k] += C[j*VEC + k]*B[j*VEC + k] / D;
}
}
}
If I compile with -O3 -march=broadwell I get:
kernel(double*, double*, double*, double):
vbroadcastsd ymm0, xmm0
xor eax, eax
.L2:
vmovupd ymm1, YMMWORD PTR [rsi+rax]
vmulpd ymm1, ymm1, YMMWORD PTR [rdx+rax]
vdivpd ymm1, ymm1, ymm0
vaddpd ymm1, ymm1, YMMWORD PTR [rdi+rax]
vmovupd YMMWORD PTR [rdi+rax], ymm1
add rax, 32
cmp rax, 4096
jne .L2
vzeroupper
ret
So D is broadcast across the vector lanes outside the loop, but the division still happens inside the loop.
If I add -ffast-math (meaning that the optimisations don't have to maintain IEEE compliance, which, amongst other things, allows the compiler to make the substitution a/b => a*(1/b)), then I get:
kernel(double*, double*, double*, double):
vmovsd xmm1, QWORD PTR .LC0[rip]
xor eax, eax
vdivsd xmm1, xmm1, xmm0
vbroadcastsd ymm1, xmm1
.L2:
vmovupd ymm0, YMMWORD PTR [rsi+rax]
vmulpd ymm0, ymm0, YMMWORD PTR [rdx+rax]
vfmadd213pd ymm0, ymm1, YMMWORD PTR [rdi+rax]
vmovupd YMMWORD PTR [rdi+rax], ymm0
add rax, 32
cmp rax, 4096
jne .L2
vzeroupper
ret
.LC0:
.long 0
.long 1072693248
So we perform the division out of the loop, and inside it's turned into a multiplication.
If I manually replace / D => * (1/D), then I get:
kernel(double*, double*, double*, double):
vmovsd xmm1, QWORD PTR .LC0[rip]
xor eax, eax
vdivsd xmm1, xmm1, xmm0
vbroadcastsd ymm1, xmm1
.L2:
vmovupd ymm0, YMMWORD PTR [rsi+rax]
vmulpd ymm0, ymm0, YMMWORD PTR [rdx+rax]
vfmadd213pd ymm0, ymm1, YMMWORD PTR [rdi+rax]
vmovupd YMMWORD PTR [rdi+rax], ymm0
add rax, 32
cmp rax, 4096
jne .L2
vzeroupper
ret
.LC0:
.long 0
.long 1072693248
Which is as if I'd said -ffast-math. We currently don't put -ffast-math in the compiler flags, which I think is a reasonable position.
As far as I remember from @tj-sun's port of the old COFFEE code, the logic was something like this: apply this rewriting if b is a number or a Symbol. A Symbol could mean either a constant or a variable reference. Generally the denominator was assigned to a temporary if it was used more than once. That is, instead of dividing n times by the same number, the substitution makes 1 division and n multiplications. The most common use case is the calculation of the inverse Jacobian, where the determinant is the denominator.
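The "1 division and n multiplications" trade-off described above, as a minimal sketch (the function names are made up for illustration):

```python
def apply_naive(values, det):
    # n divisions by the same denominator
    return [v / det for v in values]

def apply_rewritten(values, det):
    inv = 1.0 / det              # one division, hoisted out
    return [v * inv for v in values]   # n multiplications

# With a power-of-two determinant the results agree exactly; in general
# they agree only up to rounding, which is why compilers need -ffast-math
# (or an explicit source-level rewrite) to do this.
vals = [1.0, 2.0, 4.0]
assert apply_naive(vals, 2.0) == apply_rewritten(vals, 2.0)
```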
FWIW, clang won't generate FMAs unless you say either -ffp-contract=fast or -ffast-math.
expressions = [index_sum(e, quadrature_multiindex) for e in expressions]
argument_indices = tuple(chain(*argument_multiindices))
return [Integral(e, quadrature_multiindex, argument_indices) for e in expressions]
OK, so now the representation is no longer just gem.
for monomial in monomial_sum:
    var, s, a, r = delta_elimination(variable, *monomial)
    narrow_variables.setdefault(var)
    delta_simplified[var].add(s, a, r)
Because there is no OrderedDefaultDict.
tsfc/spectral.py
def prune(factors):
    result = remove_componenttensors(factors[:-1])
    result.append(factors[-1])
    return result
Is it obvious why you don't want to remove component tensors from the last factor?
During argument factorisation, the "atomics" are subexpressions of FInAT-provided tabulations, which means they are small, so this index substitution is cheap. Moreover, MonomialSum collapses monomials with the same set of sum indices and atomics. Layers of Indexed(ComponentTensor(..., ...), ...) would make atomics that are actually the same look different, and thus would hinder the shortening of the MonomialSum.
The rest part, which is the last factor if you look at the ordering above, is the opposite of what is described above. It can be arbitrarily big, making remove_componenttensors costly, and there are no benefits to early removal of those ComponentTensors.
Perhaps I should write some of this into a comment?
Sounds like a good idea.
tail_atomics = tuple(a for a in atomics
                     if set(tail_indices) & set(a.free_indices))
head_indices = tuple(i for i in sum_indices if i not in tail_ordering)
head_atomics = tuple(a for a in atomics if a not in tail_atomics)
Again because OrderedSet is not a thing.
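The tuple comprehensions above emulate an ordered set: membership tests against an unordered collection while keeping the original ordering. A minimal sketch of the same idea using dict.fromkeys (insertion-ordered in Python 3.7+):

```python
# Order-preserving deduplication without an OrderedSet type.
items = ['a', 'b', 'a', 'c', 'b']
ordered_unique = list(dict.fromkeys(items))
assert ordered_unique == ['a', 'b', 'c']

# Ordered "set difference", as in the head_atomics line above:
exclude = {'b'}
kept = tuple(x for x in ordered_unique if x not in exclude)
assert kept == ('a', 'c')
```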
monomial_sum = MonomialSum()
for (sum_indices, atomics), monosum in sub_monosums:
    new_rest = sum_factorise(variable, tail_ordering[1:], monosum)
    monomial_sum.add(sum_indices, atomics, new_rest)
OK, so this is more "loop invariant code motion".
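Loop-invariant code motion at the expression level, in its simplest form: a factor that does not depend on the summation index can be hoisted out of the sum, i.e. sum_i a*b_i = a * sum_i b_i. A trivial sketch:

```python
b = [1.0, 2.0, 3.0]
a = 0.5

# a multiplied inside the "loop": n multiplications
inside = sum(a * bi for bi in b)

# a hoisted out: one multiplication
hoisted = a * sum(b)

assert inside == hoisted
```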
* underintegration: add a short comment
* fix previous bug
* expose delta_elimination bug in test case
* change default mode: coffee -> spectral
* add some more comments
* rename argument indices to linear indices
* move COFFEE algorithm from TSFC to GEM
* test underintegration tricks
* fix Python 2 flake8
* fix failing test case
* rewrite spectral mode
* tolerate and skip monomials with no atomics
* delay index substitution in delta_elimination
* disconnect coffee_mode.Integrals from spectral.Integrals
* optimise away Delta in UFL arguments
Main news:
Some details:
gem.optimise.delta_elimination now injects ComponentTensor nodes, whose removal is the caller's responsibility.