Implement an attention operator using basic PyTorch functions, matching the behavior of torch.nn.MultiheadAttention.
Implement the attention operator in CUDA.
Implement the flash attention operator using basic PyTorch functions (an emulation, for understanding the algorithm).
Implement the flash attention operator in CUDA.
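The first task (plain-PyTorch attention) could start from a sketch like the one below: scaled dot-product attention built only from basic tensor ops, checked against PyTorch's reference kernel. The function name `attention` and the optional boolean `mask` argument are illustrative choices, not part of any existing API.

```python
import math
import torch

def attention(q, k, v, mask=None):
    # Scaled dot-product attention: softmax(Q K^T / sqrt(d)) V,
    # using only basic PyTorch ops. `mask` (if given) is a boolean
    # tensor; True positions are excluded from attention.
    d = q.size(-1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d)
    if mask is not None:
        scores = scores.masked_fill(mask, float("-inf"))
    weights = torch.softmax(scores, dim=-1)
    return weights @ v
```

One way to validate such a sketch is to compare it against `torch.nn.functional.scaled_dot_product_attention` (available in PyTorch 2.0+) on random inputs.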
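For the flash attention emulation task, a minimal PyTorch sketch of the core idea is shown below: process K/V in blocks and maintain a running row-max and row-sum (the online-softmax rescaling trick), so the full attention matrix is never materialized. The function name and the `block_size` parameter are assumptions for illustration; a real CUDA version would additionally tile the query dimension and keep blocks in shared memory.

```python
import math
import torch

def flash_attention_emulated(q, k, v, block_size=2):
    # Emulates the FlashAttention loop structure in plain PyTorch:
    # iterate over K/V blocks, tracking a running max (row_max) and
    # normalizer (row_sum) per query row, rescaling partial outputs
    # whenever a new block raises the running max.
    scale = 1.0 / math.sqrt(q.size(-1))
    n = k.size(-2)
    out = torch.zeros_like(q)
    row_max = torch.full(q.shape[:-1], float("-inf"))
    row_sum = torch.zeros(q.shape[:-1])
    for start in range(0, n, block_size):
        kb = k[..., start:start + block_size, :]
        vb = v[..., start:start + block_size, :]
        scores = q @ kb.transpose(-2, -1) * scale      # (..., Lq, B)
        new_max = torch.maximum(row_max, scores.max(dim=-1).values)
        correction = torch.exp(row_max - new_max)      # rescale old stats
        p = torch.exp(scores - new_max.unsqueeze(-1))  # block probabilities
        row_sum = row_sum * correction + p.sum(dim=-1)
        out = out * correction.unsqueeze(-1) + p @ vb
        row_max = new_max
    return out / row_sum.unsqueeze(-1)
```

Because the rescaling is exact, the result should match ordinary softmax attention up to floating-point tolerance regardless of `block_size`.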