
# openEO Geotrellis: how it works

## Data Cubes

The main openEO concept implemented by this library is the *raster data cube*: a multi-dimensional array of raster data.

This library does not support arbitrary multidimensional cubes, but focuses on the types of cubes most commonly
encountered in earth observation. The most complex type we can represent is the 4-dimensional space-time-bands cube.

This cube is mapped to a Geotrellis [`MultibandTileLayerRDD`](https://geotrellis.github.io/scaladocs/latest/geotrellis/spark/index.html#MultibandTileLayerRDD[K]=org.apache.spark.rdd.RDD[(K,geotrellis.raster.MultibandTile)]withgeotrellis.layer.Metadata[geotrellis.layer.TileLayerMetadata[K]]).

For cubes with a temporal dimension, a `SpaceTimeKey` is used, while purely spatial cubes use a `SpatialKey`.

Geotrellis documents this class and its associated metadata [here](https://geotrellis.readthedocs.io/en/latest/guide/core-concepts.html#layouts-and-tile-layers).
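
Expressed as Scala types, the two cube variants boil down to the following (a minimal sketch; the alias names are illustrative, not part of the codebase):

```scala
import geotrellis.layer.{Metadata, SpaceTimeKey, SpatialKey, TileLayerMetadata}
import geotrellis.raster.MultibandTile
import org.apache.spark.rdd.RDD

object CubeTypes {
  // A space-time-bands cube: distributed (key, tile) pairs plus layer
  // metadata describing the CRS, tile layout, extent and key bounds.
  type SpaceTimeCube =
    RDD[(SpaceTimeKey, MultibandTile)] with Metadata[TileLayerMetadata[SpaceTimeKey]]

  // A purely spatial cube drops the temporal part of the key.
  type SpatialCube =
    RDD[(SpatialKey, MultibandTile)] with Metadata[TileLayerMetadata[SpatialKey]]
}
```

Note that the bands dimension is not part of the key: all bands for a given key are stored together in one `MultibandTile`.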

To understand how these cubes work and behave, and how to manipulate them correctly, knowledge of the [Spark RDD concept](https://spark.apache.org/docs/latest/rdd-programming-guide.html)
is essential. RDDs are distributed datasets that (in production) live on multiple machines in a cluster, allowing openEO
to effectively scale out computations.

### Data cube loading: `load_collection` & `load_stac`

Loading large EO datasets into a Spark RDD is complex, mainly due to the variety of data formats, but also because
IO is often a performance and stability bottleneck. The optimal loading strategy can also differ depending on the
characteristics of the underlying storage system, which is why we support multiple code paths.

The main class for loading data from POSIX or object storage systems is the `FileLayerProvider`.

This `load_collection` implementation supports a number of features:

- Resampling at load time, improving both performance and correctness.
- Loading of GDAL-supported formats.
- Loading of GeoTIFF using a native implementation.
- Applying masks at load time, reducing the amount of data to read.
- Applying polygonal filtering at load time.
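
For a flavour of the underlying Geotrellis machinery, here is a minimal, hypothetical sketch of resampling at read time with the RasterSource API (the URL is a placeholder, and the real `FileLayerProvider` code path is considerably more involved):

```scala
import geotrellis.proj4.WebMercator
import geotrellis.raster.geotiff.GeoTiffRasterSource

// A RasterSource is lazy: reprojection and resampling are only applied
// to the pixels that are actually read, not to the whole file.
val source = GeoTiffRasterSource("https://example.com/scene.tif") // placeholder URL
val warped = source.reproject(WebMercator)

// Reads (and resamples) only the requested window.
val raster = warped.read(warped.extent)
```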


### Data cube organization: partitioners

As mentioned, Spark distributes data cubes over a cluster, in groups of data called `partitions`. A partition needs
to fit into memory for processing, so its size matters, and Spark cannot know the optimal partitioning for any given problem by itself.

The optimal partitioning scheme also depends on the processes that are used: for time series analysis, it is best
to group multiple observations of the same location into the same partition. Other processes, like `resample_spatial`,
need information from neighbouring tiles, so a grouping per observation works better.
As a rule of thumb, it is up to each process to check the partitioner and change it if needed.

Whenever the partitioner changes, the data is shuffled, which is a costly operation. This is why the code often tries
to cleverly avoid this where possible.
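
A minimal sketch of that rule of thumb (the helper name is hypothetical, not the actual code in this repository):

```scala
import geotrellis.layer.SpaceTimeKey
import geotrellis.raster.MultibandTile
import org.apache.spark.Partitioner
import org.apache.spark.rdd.RDD

// Hypothetical helper: only repartition (and thus shuffle) when the
// cube's current partitioner differs from what the process needs.
def ensurePartitioner(
    cube: RDD[(SpaceTimeKey, MultibandTile)],
    wanted: Partitioner): RDD[(SpaceTimeKey, MultibandTile)] =
  if (cube.partitioner.contains(wanted)) cube // already grouped correctly: no shuffle
  else cube.partitionBy(wanted)               // shuffle once into the desired grouping
```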

#### Sparse partitioners

Data cubes can be sparse because openEO supports operations on polygons that are not necessarily spatially adjacent,
for example when `aggregate_spatial` or `filter_spatial` is applied with a vector cube.

In this case, regular Geotrellis spatial partitioners tend to create too many partitions, because they consider the
full bounding box instead of the exact geometry locations. The same can occur for data that is irregular in the temporal
dimension.

The effect of having too many partitions is a huge number of Spark tasks that do no actual work, yet still consume
resources because they are scheduled anyway. This becomes especially noticeable when the actual work to be done is small.

Sparse partitioners avoid this problem by determining all of the `SpaceTimeKey`s up front. The list of keys is also stored in
the partitioner itself, allowing certain operations to be implemented more efficiently.
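
A simplified, hypothetical sketch of the idea (the actual sparse partitioners in this repository are more elaborate):

```scala
import geotrellis.layer.SpaceTimeKey
import org.apache.spark.Partitioner

// Hypothetical sketch: build partitions from the exact set of keys,
// instead of from the full space-time bounding box.
class SparseKeyPartitioner(keys: Array[SpaceTimeKey], keysPerPartition: Int)
    extends Partitioner {

  // The keys are kept inside the partitioner, so some operations can be
  // answered without touching the distributed data itself.
  private val index: Map[SpaceTimeKey, Int] =
    keys.sortBy(k => (k.instant, k.row, k.col))
      .zipWithIndex
      .map { case (k, i) => k -> i / keysPerPartition }
      .toMap

  override def numPartitions: Int =
    math.max(1, (keys.length + keysPerPartition - 1) / keysPerPartition)

  override def getPartition(key: Any): Int = key match {
    case k: SpaceTimeKey => index.getOrElse(k, 0)
    case _               => 0
  }
}
```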

## openEO processes

This library implements a large part of the openEO process specification, mainly by using Geotrellis functionality.

Depending on the type of process, the implementations can be found in two main locations: `OpenEOProcesses` and `OpenEOProcessScriptBuilder`.

`OpenEOProcesses` implements the Scala part of predefined processes that operate on full data cubes. This Scala code is usually
invoked by a Python counterpart in [`GeopysparkDataCube`](https://github.com/Open-EO/openeo-geopyspark-driver/blob/master/openeogeotrellis/geopysparkdatacube.py).
Some of this Scala code may be used by multiple openEO process implementations. For instance, openEO `reduce_dimension` and `apply_dimension` can in some cases
use the same code.
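
As an illustration of what such a cube-level implementation can look like (a hypothetical sketch, not the actual `OpenEOProcesses` code), a temporal reduce regroups the RDD by spatial key and combines all observations per tile:

```scala
import geotrellis.layer.{SpaceTimeKey, SpatialKey}
import geotrellis.raster.MultibandTile
import org.apache.spark.rdd.RDD

// Hypothetical sketch of reduce_dimension over the temporal dimension:
// drop the time component of the key, then combine observations per tile.
def reduceTemporal(
    cube: RDD[(SpaceTimeKey, MultibandTile)],
    reducer: (MultibandTile, MultibandTile) => MultibandTile): RDD[(SpatialKey, MultibandTile)] =
  cube
    .map { case (key, tile) => (key.spatialKey, tile) } // forget the time dimension
    .reduceByKey(reducer)                               // e.g. cell-wise max for a composite
```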

`OpenEOProcessScriptBuilder` supports openEO processes that operate on arrays of numbers, which are often encountered in
openEO `child processes`, as explained [here](https://api.openeo.org/#section/Processes/Process-Graphs). This part of the
work is not distributed over Spark, so it operates on chunks of data that fit in memory. It does use Geotrellis, which
generally has well-performing implementations for most basic processes.
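
For example, a child process that applies `absolute` to every value of an array could, for one in-memory chunk, reduce to a plain Geotrellis tile operation (a hypothetical sketch, not the actual builder output):

```scala
import geotrellis.raster.{DoubleArrayTile, Tile}

// Hypothetical sketch: a tiny child process graph compiled down to a
// Tile => Tile function, backed by Geotrellis' cell-wise operations.
val absProcess: Tile => Tile = tile => tile.mapDouble(math.abs)

// Applied to one in-memory chunk; no Spark involved at this level.
val chunk: Tile = DoubleArrayTile(Array(-1.0, 2.0, -3.0, 4.0), 2, 2)
val result: Tile = absProcess(chunk)
```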



