This project implements a minimal matrix multiplication (GEMM) kernel and explores several CPU optimization techniques (each sketched below):
- Tiling
- Loop flipping
- OpenMP for parallel processing
- BLAS (Basic Linear Algebra Subprograms)
- Efficient matrix multiplication with progressively stronger optimization strategies
- Implementations designed for performance on modern CPUs
- OpenMP integration for multi-threading
- BLAS for high-performance matrix operations
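The first two variants might look roughly like the sketch below. This is a minimal illustration, assuming square N×N row-major single-precision matrices; the function names are illustrative, not the actual implementation.

```c
// Naive GEMM: C = A * B for square n x n row-major float matrices.
// The innermost loop reads B[k*n + j] with k varying, i.e. it walks a
// column of B with stride n, which has poor spatial locality.
void gemm_naive(const float *A, const float *B, float *C, int n) {
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++) {
            float acc = 0.0f;
            for (int k = 0; k < n; k++)
                acc += A[i * n + k] * B[k * n + j];
            C[i * n + j] = acc;
        }
}

// Loop flipping: swap the j and k loops (i-k-j order). The innermost loop
// now hoists A[i*n + k] into a scalar and streams through B and C with
// unit stride, which is friendly to the prefetcher and the vectorizer.
void gemm_ikj(const float *A, const float *B, float *C, int n) {
    for (int i = 0; i < n; i++) {
        for (int j = 0; j < n; j++)
            C[i * n + j] = 0.0f;
        for (int k = 0; k < n; k++) {
            float a = A[i * n + k];
            for (int j = 0; j < n; j++)
                C[i * n + j] += a * B[k * n + j];
        }
    }
}
```

Turning the inner accesses of B and C into unit-stride streams is the main reason loop flipping accounts for most of the ~13x speedup reported below.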
On an Apple M3 Max I get the following timings:
| Implementation | ΔT (µs) | Speedup |
|---|---|---|
| Naive GEMM | 937,902 | 1.00x |
| Loop flipping | 70,094 | 13.38x |
| Tiling | 71,804 | 13.06x |
| OpenMP (parallelized) | 10,752 | 87.23x |
| BLAS (Apple Accelerate framework) | 1,087 | 862.84x |
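For reference, the tiled and OpenMP variants measured above could be sketched as follows, under the same assumptions as before. `TILE` is a hypothetical block size that would need tuning for the target cache, and the OpenMP version needs the appropriate compiler flag (typically `-fopenmp`; with Apple clang this usually also requires the libomp package).

```c
#include <string.h>

#define TILE 64  /* assumed block size; tune so three TILE x TILE blocks fit in cache */

// Tiled (blocked) GEMM: work on TILE x TILE sub-blocks so the data touched
// by each block stays resident in cache while it is being reused.
void gemm_tiled(const float *A, const float *B, float *C, int n) {
    memset(C, 0, (size_t)n * n * sizeof(float));
    for (int ii = 0; ii < n; ii += TILE)
        for (int kk = 0; kk < n; kk += TILE)
            for (int jj = 0; jj < n; jj += TILE)
                for (int i = ii; i < ii + TILE && i < n; i++)
                    for (int k = kk; k < kk + TILE && k < n; k++) {
                        float a = A[i * n + k];
                        for (int j = jj; j < jj + TILE && j < n; j++)
                            C[i * n + j] += a * B[k * n + j];
                    }
}

// OpenMP: parallelize over rows of C. Each thread owns a disjoint set of
// output rows, so no synchronization is needed inside the loop nest.
void gemm_omp(const float *A, const float *B, float *C, int n) {
    #pragma omp parallel for schedule(static)
    for (int i = 0; i < n; i++) {
        for (int j = 0; j < n; j++)
            C[i * n + j] = 0.0f;
        for (int k = 0; k < n; k++) {
            float a = A[i * n + k];
            for (int j = 0; j < n; j++)
                C[i * n + j] += a * B[k * n + j];
        }
    }
}
```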
Note that the Accelerate framework has access to special hardware instructions that ordinary compiled code cannot use, so I treat it as the speed-of-light reference.
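The BLAS variant boils down to a single `cblas_sgemm` call; on macOS the Accelerate framework provides the CBLAS interface. A sketch, again assuming row-major single-precision matrices:

```c
// Build on macOS with: clang -O3 -framework Accelerate gemm_blas.c
#include <Accelerate/Accelerate.h>

void gemm_blas(const float *A, const float *B, float *C, int n) {
    // C = 1.0 * A * B + 0.0 * C, all matrices row-major n x n.
    cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                n, n, n,
                1.0f, A, n,
                B, n,
                0.0f, C, n);
}
```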