GSoC 2021 Ideas Page
PyData/Sparse is a software project that provides sparse arrays for the PyData ecosystem, conforming to the NumPy API. That's a lot to digest, so let's break it down:
A sparse array is one that has a lot of zeros in it. In this package, we can also treat other arrays as sparse: ones in which a single non-zero value is repeated many times.
We don't have infinite memory or computational power, so it's important to make the best use of what we have. If we "skip over" the zeros when doing computations, things get a lot faster. In practice, this also means keeping track of where the zeros are, which adds some overhead of its own.
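To make the idea concrete, here is a minimal pure-Python sketch of storing only the "interesting" entries of a 2-D array and summing it without visiting the rest. It is an illustration only, not how PyData/Sparse is implemented; the helper names `to_sparse` and `sparse_sum` and the dictionary layout are assumptions made up for this example.

```python
def to_sparse(dense, fill_value=0):
    """Keep only the entries that differ from fill_value, keyed by (row, col)."""
    return {
        (i, j): value
        for i, row in enumerate(dense)
        for j, value in enumerate(row)
        if value != fill_value
    }


def sparse_sum(stored, shape, fill_value=0):
    """Sum every element while touching only the stored entries."""
    n_total = shape[0] * shape[1]
    n_fill = n_total - len(stored)  # entries we never had to store
    return sum(stored.values()) + n_fill * fill_value


# A mostly-zero array: only two of the twelve entries are stored.
dense = [
    [0, 0, 3, 0],
    [0, 0, 0, 0],
    [7, 0, 0, 0],
]
stored = to_sparse(dense)
total = sparse_sum(stored, shape=(3, 4))

# The same trick with a repeated non-zero value (here 1) as the fill value.
ones_mostly = [[1, 1, 1], [1, 9, 1]]
stored_ones = to_sparse(ones_mostly, fill_value=1)
total_ones = sparse_sum(stored_ones, shape=(2, 3), fill_value=1)
```

The coordinate keys are the bookkeeping overhead mentioned above: each stored value costs extra memory for its position, so the approach only pays off when most entries equal the fill value.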
Conforming to the NumPy API means you can use it mostly as you would use NumPy. In fact, if you try it, you'll find that some of the familiar functions, like `np.max`, `np.exp`, etc., work on arrays provided by this project.
A lot of people use this package, actually. Sparse arrays are important in physics and simulations, as well as electron microscopy. If you look at the public dependents, you'll even find some COVID-19 research done with this package.
Look at our contributing page! There are a lot of great instructions there. Our source code is hosted here.
Currently, we mainly use Numba, a just-in-time compiler that speeds up numerical Python code. However, we are considering other approaches, such as leveraging research by the TACO team to make things faster. For the curious reader, here's a PhD thesis from the pioneer of the topic. Most of our ideas are in that direction.
Our Gitter Channel is the best place to get in touch, or to ask if something should go someplace else. We also have an issue tracker for the more experienced among you!
We have a contributing page that we'll link to as the go-to source for how to get started. If you get stuck, just see above on how to contact us!
Usually, your GSoC application has to be a true "game plan" of what you'd like to achieve. It has to be hashed out in enough detail that we are reasonably sure you can make it to the very end. We'd like to remind you that the title of the sub-org, in this case "PyData/Sparse", must be in the title of your application. We'd also like to point you to Google's own instructions for writing GSoC proposals.
- LLVM Back-end for the Tensor Algebra Compiler (TACO)
  - Description: The TACO project does some JIT compilation in an ad-hoc manner by writing out `*.c` files, compiling them, and dynamically linking them into the executable. We would like to have a back-end for TACO that produces LLVM IR using the LLVM C++ API, and also compiles it in-memory.
  - Skills: LLVM IR, LLVM C++ API
  - Difficulty Level: Hard
  - Related Readings/Links:
    - The research paper that moved to the current method of code generation.
    - Some partial work on the back-end so far.
    - A pull request with an old LLVM version.
  - Potential mentors: Guilherme Leobas (@guilhermeleobas), Hameer Abbasi (@hameerabbasi)
- Completion of Python Bindings for the TACO compiler
  - Description: The TACO project has partial Python bindings, but these are missing tests and API coverage. We'd like to add some tests and more API coverage to the Python bindings.
  - Skills: C++/`pybind11` knowledge
  - Difficulty Level: Medium
  - Related Readings/Links:
    - The documentation for TACO.
    - The documentation for `pybind11`.
    - The source you'll have to change.
  - Potential mentors: Dale Tovar (@daletovar), Hameer Abbasi (@hameerabbasi)
- Creating a `conda-forge` package for TACO
  - Description: The TACO project has no `conda-forge` package. We'd like to have one so we can depend on it in PyData/Sparse.
  - Skills: CMake/`conda` packaging knowledge
  - Difficulty Level: Medium
  - Related Readings/Links:
    - The relevant documentation for `conda-forge`.
    - The main `CMakeLists.txt` for the project.
  - Potential mentors: John Lee (@leej3), Hameer Abbasi (@hameerabbasi)
- CSR/CSC format support and performance
  - Description: The `sparse` project has the `GCXS` format, which is a generalization of `CSR`/`CSC`. Ideally, we'd like to special-case it for `CSR`/`CSC`, as well as have better performance for certain operations.
  - Skills: Python knowledge, data structures and algorithms
  - Difficulty Level: Easy
  - Related Readings/Links:
    - The documentation for `GCXS`.
    - The source tree for `GCXS`.
  - Potential mentors: Dale Tovar (@daletovar), Hameer Abbasi (@hameerabbasi)
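For applicants unfamiliar with the CSR layout mentioned in the last idea, the sketch below shows the general technique in plain Python: a 2-D array is stored as three flat lists (the non-zero values, their column indices, and one pointer per row marking where that row starts). This is a textbook illustration under our own naming, not the `GCXS` implementation in `sparse`; the functions `dense_to_csr` and `csr_matvec` are hypothetical helpers written for this page.

```python
def dense_to_csr(dense):
    """Convert a list-of-lists matrix into CSR's (data, indices, indptr)."""
    data, indices, indptr = [], [], [0]
    for row in dense:
        for j, value in enumerate(row):
            if value != 0:
                data.append(value)   # the non-zero value itself
                indices.append(j)    # its column index
        indptr.append(len(data))     # where the next row begins in `data`
    return data, indices, indptr


def csr_matvec(data, indices, indptr, vector):
    """Multiply a CSR matrix by a dense vector, visiting only stored entries."""
    result = []
    for i in range(len(indptr) - 1):
        start, stop = indptr[i], indptr[i + 1]
        result.append(
            sum(data[k] * vector[indices[k]] for k in range(start, stop))
        )
    return result


dense = [
    [1, 0, 2],
    [0, 0, 0],
    [0, 3, 0],
]
data, indices, indptr = dense_to_csr(dense)
y = csr_matvec(data, indices, indptr, [1, 10, 100])
```

CSC is the same idea with the roles of rows and columns swapped. An empty row costs only one repeated pointer in `indptr`, which is why row-wise operations such as matrix-vector products are a natural fit for this layout.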