Cloud-Native Geospatial Data Definition #18

Open
jedsundwall opened this issue Nov 9, 2023 · 24 comments

Comments

@jedsundwall
Member

jedsundwall commented Nov 9, 2023

As the concept of "cloud-native geospatial" gains traction, it's increasingly important to define what it means. In fact, I believe that maintaining a useful definition of the term is one of our primary functions as the Cloud-Native Geospatial Foundation.

It turns out that it's hard to define!

We surveyed our community earlier this year to get their take on it. We got 56 responses. Half of respondents came from public sector orgs (government, nonprofit, and academic) and half came from commercial organizations. 80% of respondents are from North America, Europe, Australia, or New Zealand. 13% of respondents are from Asia, 5% from South America, and 2% from Africa. Less than 10% of respondents consider themselves "beginner" level users of cloud-native geospatial solutions.

These were the most commonly used adjectives in the responses to the question "What does the term “cloud-native geospatial” mean to you?":

  • scalable
  • optimized
  • efficient
  • large
  • remote
  • data-adjacent
  • seamless
  • fast
  • standardized
  • simple
  • parallelized
  • consumable
  • on-demand
  • preprocessed
  • accessible
  • interoperable
  • easy
  • quick

Just listing the adjectives is interesting because it highlights the benefits that people are seeking, but it doesn't do a good job of describing the features of cloud-native formats. The full responses to the question are a bit more illuminating and you can see them in this gist.

I ended up boiling everything down to this very simple definition:

Cloud-native formats allow people to build applications on top of data using simple HTTP APIs.

Implicit in this definition is that cloud-native formats allow people to build good or reasonably performant applications.
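To make the definition concrete, here's a minimal sketch of the kind of access it implies (the URL is a placeholder): reading a small window from a COG on object storage with rasterio, which turns the windowed read into plain HTTP range requests under the hood.

```python
import rasterio
from rasterio.windows import Window

# Hypothetical COG on object storage; GDAL/rasterio fetch only the header
# and the internal tiles covering the requested window, via HTTP range requests.
url = "https://example-bucket.s3.amazonaws.com/scene.tif"

with rasterio.open(url) as src:
    block = src.read(1, window=Window(0, 0, 512, 512))  # top-left 512x512 pixels

print(block.shape)  # (512, 512)
```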

Last night, at our first Seattle Zarr Meetup, @jiayuasu gave a presentation in which he provided a more comprehensive set of characteristics:

  • Efficient storage
    • High compression ratios
  • Scalability
    • Multiple data chunks for parallel processing
    • Each computer/process can take any chunk it chooses (random access)
  • Integrity constraints and schema evolution
    • All new data must pass an integrity check (i.e., a type check)
    • Adding/removing a column should not rewrite the entire dataset
    • Natively support geometry / geography / raster type data
  • Metadata integration
    • Geo-statistics: bounding box, CRS, …
    • Store metadata alongside the data for advanced operations such as filter pushdown (see the sketch after this list)
  • Open protocol
    • Easy data exchange / sharing
    • Anyone can implement their own reader / writer to r/w data in this format
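To make the filter-pushdown point concrete, here is a rough sketch of reading a subset of a remote GeoParquet file with GeoPandas (the URL is hypothetical, and this assumes GeoPandas ≥ 1.0 and a file written with bounding-box metadata): the bbox filter lets the reader skip row groups whose extents can't match, so only part of the file is transferred.

```python
import geopandas as gpd

# Hypothetical GeoParquet file on object storage. With a bbox filter, the
# reader can use the per-row-group statistics / bbox covering column stored
# in the file's metadata to skip row groups outside the query area, so only
# a subset of the file is actually downloaded.
url = "https://example-bucket.s3.amazonaws.com/buildings.parquet"
seattle = gpd.read_parquet(url, bbox=(-122.46, 47.48, -122.22, 47.74))
print(len(seattle))
```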

Note that the bolded points are what would make a merely cloud-optimized format a cloud-optimized geo format.

I don't disagree with this list. It's a good set of best practices for scalable formats that will take advantage of object storage, but they don't apply to some things that we'd consider cloud-native. STAC certainly doesn't have all of these characteristics.

I think Jia's final point and its sub-points drive the point home:

  • Open protocol
    • Easy data exchange / sharing
    • Anyone can implement their own reader / writer to r/w data in this format

This takes me back to the simple definition: "Cloud-native formats allow people to build applications on top of data using simple HTTP APIs."

HTTP might be too prescriptive here. Maybe "open protocol" is enough, but I think HTTP is enough for us to bet on at this point. Or maybe we say using generic RESTful APIs?

One other thing to add here, which is inspired by "Anyone can implement their own reader / writer to r/w data in this format": the Cloud-Native Geospatial Foundation will only be able to support projects that are developed using open source principles. Specs must be available under an open license and must be open to contributions from anyone. We don't pick winners, but we do aim to keep track of implementations of formats, which is one way we can measure adoption and the practical benefit of different standards.

Having said far too much here, I'm open to suggestions on how we should define "cloud-native geospatial" data formats. Discuss!

@PostholerCom

PostholerCom commented Nov 9, 2023

Nice! A definition is sorely needed. I think the answer is very simple, with our personal biases and use cases getting in the way of it. Here's my answer.

If a processor is between the end client and the data source, it's client/server, not cloud native.

That processor can be a server/service. Any definition that allows a processor between end points opens Pandora's box. A PostgreSQL/PostGIS database with a 'simple API endpoint' is now cloud native. A 'simple API endpoint' that interacts with a dozen AWS services, two Snowflake databases, and DuckDB is now cloud native.

Pushing the subject even further, as far as I can tell, only two authentic formats exist that can genuinely be called cloud native: COG and FGB. Neither requires an intermediate server/service. COG should be obvious. FGB is the only vector format that can return feature-level data using range requests. PMTiles can't. I've yet to see GeoParquet return any significant amount of data over a network in the browser in a timely manner without a processor in the middle.
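As a rough illustration of the FGB pattern (hypothetical URL, and assuming GDAL's FlatGeobuf driver with HTTP range-request support via /vsicurl/), a spatial filter only pulls the header, the packed index, and the matching features over the network:

```python
import fiona

# Hypothetical FlatGeobuf file on object storage. GDAL reads the header and
# packed spatial index first, then range-requests only the features that
# intersect the bounding box -- no intermediate server required.
url = "/vsicurl/https://example-bucket.s3.amazonaws.com/trails.fgb"

with fiona.open(url) as src:
    features = list(src.filter(bbox=(-123.0, 45.0, -122.0, 46.0)))

print(len(features))
```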

To reiterate, if there's a processor between the data and the client, then anything can be called cloud native.

There's a reason why cloud native data is so, so special.

@m-mohr
Contributor

m-mohr commented Nov 9, 2023

As with ARD, we may never find a definition that everyone accepts. I think profiles should be defined, something like:

  • Browser-optimized
  • Cloud Processing-optimized
  • ...

Maybe it's simpler to define those, and then cloud-nativeness could be evaluated based on the profiles. You are only fully cloud-native if you fulfill them all, but there are different levels, and they give you a better indication of what a format can actually do.

So for example, COG seems to be both browser-optimized and processing-optimized, but GeoParquet and Zarr, to me, are processing-optimized but not browser-optimized (yet). (The definition of the profiles is TBD; I'm just assuming something here as an example.)

@jedsundwall
Member Author

Thank you @m-mohr and @PostholerCom. I agree with the comparison to ARD and I'm sympathetic to Postholer's view here. I've often wondered if "web-optimized" would be a better term for this because I think that the pure source→client use case is what we're striving for. But, there's clearly a need for other "big" use cases as indicated by "large", "parallelized", and "scalable" showing up in the survey responses.

@m-mohr I like the idea of using profiles to classify the different use cases that fit under the cloud-native umbrella. Still thinking…

@kylebarron
Member

kylebarron commented Nov 9, 2023

Almost every format has some version of a "chunk size" determined when writing, and the choice of the chunk size determines where on the continuum your dataset lies between browser-optimized and processing-optimized.

Take Zarr: most existing applications of Zarr are for server-side compute, and so most Zarr datasets are saved with a large chunk size. But there's nothing in the Zarr format that makes Zarr harder to use from a browser. If anything, Zarr as a format is considerably easier than COG to implement support for from a browser because the metadata is JSON-readable and all chunks in the entire dataset are aligned.

Similarly with COG. Most applications of COG have had a smaller chunk size, which makes the files amenable to being read directly from a browser. But you could easily create a COG with a block size of, say, 8192x8192, which would then be further on the "processing-optimized" end of the spectrum.

On the vector side, FlatGeobuf effectively has a hard-coded chunk size of 1 that can't be changed, which puts it further toward the "browser-optimized" end of the spectrum. GeoParquet has a variable chunk size and thus can be optimized toward either end of the spectrum.

To take this to the limit, you could make a Zarr dataset or COG file with a chunk size of 1, where each chunk is a single pixel, or a GeoParquet file with a row group size of 1. The tradeoff is the size of your metadata relative to the actual data.

The point is that there's a huge onus on the writer to choose how the data will be used, but that's separate from whether the format is a cloud-native data format. All of these are cloud-native formats, even though they can be saved in a way that is hard to use directly from a browser.
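As a small sketch of how that writer-side choice looks in practice (Zarr v2 API assumed, with a stand-in array), the same data can be written with large chunks that suit bulk server-side processing or with small chunks that suit many tiny browser reads, at the cost of a higher metadata-to-data ratio:

```python
import numpy as np
import zarr

data = np.random.rand(8192, 8192).astype("float32")  # stand-in array

# Large chunks: fewer, bigger requests -- leans toward "processing-optimized".
zarr.save_array("large_chunks.zarr", data, chunks=(4096, 4096))

# Small chunks: many tiny requests -- leans toward "browser-optimized",
# but the metadata grows relative to the data.
zarr.save_array("small_chunks.zarr", data, chunks=(256, 256))
```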

@m-mohr
Contributor

m-mohr commented Nov 9, 2023

So I guess the profiles should define which chunk sizes are acceptable.

@hobu

hobu commented Nov 9, 2023

If a processor is between the end client and the data source, it's client/server, not cloud native.

Very much this.

Data-is-a-service (via HTTP). Data at rest, organized in such a way that clients can control their own access to the content over HTTP without downloading the entire thing.

Which then raises the question: A Service for What Purpose? Visualization? A certain summary and analysis? Both (hard)? How clients expect to access the content drives that organization. Too much flexibility means it is suitable for everything and nothing at once. Too much constraint and it's only good for a few specific things. Therein lies the art...

Pushing the subject even further, as far as I can tell, only 2 authentic formats exist that can genuinely be called cloud native. That would be COG and FGB

COPC is the same idea as COG, but for point clouds. It also has the opt-in backward-compatibility nature of COG that was key to its industry bootstrap.

@vincentsarago

So I guess the profiles should define which chunk sizes are acceptable.

The chunk size depends on the application and the data type! We've been asked so many times to define the best chunk size for a COG, and my answer (and the most common one) is ANY.

Cloud Native != Server Native != Browser Native

When it was first released, the COG format was mostly used for server-side applications because client technologies weren't performant enough to handle that much data. That has changed a bit, but 90% of COG usage still goes through a server.

IMO, cloud native describes a file format, not how to use it. If people want to access GeoParquet or whatever format directly from a browser, and the format enables it... it's cloud native. I fear that trying to put more specification on something that started as "just chuck your data" is going to turn into a 🕳️

@jhamman

jhamman commented Nov 9, 2023

Here's the definition of cloud-optimized we put in our ARCO paper last year:

... When we refer to “cloud-optimized” data it is this third variable, format, which we are most concerned with. Cloud-optimized data formats are unique insofar as they support direct access to data subsets without the computational overhead of opening and navigating through a massive data object simply to retrieve a small subset of bytes within it. Implementations of this functionality vary according to the specific cloud-optimized format: some formats include a metadata header which maps byte-ranges within a single large data object, while others opt to split a large object up into many small blocks stored in an organized hierarchical structure. Regardless of the specific implementation, the end result is an interface whereby algorithms can efficiently access data subsets. Efficient access to data subsets is especially impactful in the context of cloud object storage, where simultaneous read/write of arbitrary numbers of data subsets does not decrease the throughput to any individual subset. As such, parallel I/O dramatically increases cumulative throughput.
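A minimal sketch of what that looks like in practice, using xarray against a hypothetical public Zarr store (fsspec/aiohttp assumed; the variable and coordinate names are made up): only the chunks intersecting the selection are fetched, and independent subsets can be read in parallel.

```python
import xarray as xr

# Hypothetical public Zarr store; only the chunks overlapping the selected
# lat/lon box are downloaded from object storage.
ds = xr.open_zarr("https://example-bucket.s3.amazonaws.com/sst.zarr", consolidated=True)
subset = ds["sst"].sel(lat=slice(40, 50), lon=slice(-130, -120)).load()
print(subset.shape)
```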

@m-mohr
Contributor

m-mohr commented Nov 9, 2023

@vincentsarago Yes, and the profiles define the application. So maybe that helps to at least give some guidance... ARD and Cloud-Native are already 🕳️ and we try to fill that a bit, right? ;-)

@PostholerCom

In the context of "cloud native data", when I see the words "large, parallelized and scalable" all I hear is network speed. Network speed at the end user will likely be the bottleneck, regardless of how well distributed your cloud data is. This also makes processing millions of features in the browser untenable, as the parquet crowd is finding out.

You could probably even quantify the cloud-nativeness of a particular data format, something like: total bytes requested / seconds to transfer = cloud nativeness index.
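A back-of-the-envelope sketch of that index (hypothetical URL; the byte range and the single measurement are illustrative only):

```python
import time
import requests

# Hypothetical asset: time a typical partial read and compute
# total bytes requested / seconds to transfer.
url = "https://example.com/data/roads.fgb"
start = time.time()
resp = requests.get(url, headers={"Range": "bytes=0-1048575"})  # first 1 MiB
elapsed = time.time() - start

index = len(resp.content) / elapsed  # "cloud-nativeness index" in bytes/second
print(f"{index:,.0f} bytes/s")
```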

@vincentsarago

Again, I feel the focus on browser applications is not right, when 90% of cloud-native data applications are server-side processing!

total bytes requested / seconds to transfer = cloud nativeness index.

That makes sense only for frontend applications, and even then it depends on the data... a large-scale raster won't be cloud native because if you want to load it (at full resolution) you'll have to download a lot of data... which your browser won't be able to handle anyway!

@m-mohr
Contributor

m-mohr commented Nov 9, 2023

@vincentsarago Where are the "90%" coming from? Any evidence? For me personally it's the opposite. Also, usage can't be high if people don't optimize for the use case. If Zarr, GeoParquet, etc. were cloud-native, I assume there would be much more usage.

Also, the point "we should not include it just because it's not technologically there yet" feels rather backwards. Shouldn't we then work on actually improving the situation? I think that browsers are important for cloud-nativeness.

because if you want to load it (at full resolution) you'll have to download a lot of data

You usually don't look at the full extent in full resolution, but rather at a subset, so I don't buy this argument.

@vincentsarago

Where are the "90%" coming from? Any evidence?

Yeah, sorry, I do not have numbers.

I think that browsers are important for cloud-nativeness.

Browsers are important for visualization or if you want to tell a story, but they're not the most important thing for scientists trying to fight climate change using TBs of Zarr data!

@jedsundwall
Member Author

🍿

I think it's clear that there are at least two very distinct use cases that can be enabled by a cloud-native approach. One doesn't preclude the other. I like the language @jhamman shared. The magic comes from making subsets of data or metadata efficiently accessible. This can happen via range requests or through thoughtful chunking. It can enable browser-based interfaces in some instances and it can enable large scale analysis by distributed compute resources in others.

@geospatial-jeff

If a processor is between the end client and the data source, it's client/server, not cloud native.

Very much this.

An alternate opinion is that restricting "cloud native" to things that don't involve a server is too narrow in scope. If I deploy a transparent HTTP pass-through cache in front of an S3 bucket of COGs to reduce latency for end users, does that mean I'm no longer cloud-native? This is a very common pattern in distributed / cloud-based systems, with implementations ranging from CDNs like CloudFront / Cloudflare, to open source projects like Varnish, to more complex service meshes which are used prominently in k8s.

It also ignores the fact that S3 is a server. There is no such thing as a client-only implementation here. There is always a server that is facilitating the requests from the client. The distinction is who is hosting those servers, which I don't think is relevant at all in determining if something is "cloud native". I could build my own implementation of the S3 API, deploy that to k8s, and serve my data through it and it's equivalent from a system-design perspective to using S3. The difference is I'm hosting it. There's obviously no reason to ever do this but it's a good thought experiment.

I also think that client != browser. Servers can be clients too; it's a matter of perspective. The ability to use HTTP range requests to stream parts of a file from one server to another is just as valuable to the community as the ability of these data formats to support the "serverless" use cases focused on efficiently streaming data into the browser. It would be a shame if the community split into two; the two use cases are more similar than different.

@m-mohr
Contributor

m-mohr commented Nov 9, 2023

@vincentsarago Yes, and every scientist first needs to discover the data before they start using it to fight climate change, and in that process a visualization may help them find the right tool. Otherwise they may not even start to fight...

@PostholerCom

PostholerCom commented Nov 9, 2023

It also ignores the fact that S3 is a server. There is no such thing as a client-only implementation here.

Some software layer and processing has to be used to move bits on/off the hardware. This is defined by the cloud storage provider and we have no control over what is used. Should source.coop be a cloud storage provider to accommodate every possible use case?

Servers can be a client too

By definition, servers serve. If it's not a source/endpoint, it's an intermediate point in a potentially long line of intermediate points.

S3 is the source. The end user, browser/app/whatever is the end point. There is an origin and a destination. Clearly, this isn't difficult to grasp.

CDN performance gains come from repeated requests for the same data. A bounding box defined by a client will rarely have the same bounds twice. Tile requests, for example, may recur frequently; a browser viewport bounding box, not so much. A CDN may actually degrade performance with vector data.

@geospatial-jeff

geospatial-jeff commented Nov 9, 2023

I think it's also worth considering the question of "what is cloud-native geospatial data?" from the context of "how did this used to work, and why is it different now?". I'm referring to the era of desktop GIS software, which started in the 1980s and continues to this day. Historically, GIS software has been packaged in a box and installed on individual computers to run a certain analysis, view some data, etc. Larger organizations might have a centralized server (e.g. ArcGIS Server) that is used for centralized processing and storage.

Data formats almost always adhere to form-fits-function. The data formats that were commonly used in desktop GIS software (e.g. strip-aligned GeoTIFF) work really well in an environment with a single server where data is stored next to your compute. In fact, the original TIFF spec (released in 1986) proposed encoding images in strips because that made the most sense at the time, given the high popularity of scan-line-based devices like printers, fax machines, scanners, copiers, and even computer displays.

The cloud provides a very different model of compute than what was previously available to the geospatial industry. Desktop GIS software relies on vertical scaling of the underlying hardware and high data locality for performance, while cloud-based environments rely on horizontal scaling and economies of scale (due to pay-as-you-go pricing models) to provide high throughput while minimizing cost. This makes the cloud particularly good at performing embarrassingly horizontal tasks, but the geospatial community was stuck on a bunch of legacy data formats that aren't optimized for this purpose.

I argue that cloud-native geospatial data formats were not designed with any specific use case in mind, but instead to align the industry with this new, scalable method of computing. In fact, the industry went through a similar transition when switching from analog -> digital!

COG's innovation is (1) organizing the image in a standardized way, (2) storing the data in blocks instead of strips, and (3) accessing the data with HTTP range requests; the combination turns I/O access patterns into an embarrassingly horizontal problem that is more compatible with the underlying compute model used by the cloud. Again, this is independent of use case. All use cases, regardless of how many servers are involved, leverage the embarrassingly parallel nature of these data formats to improve access to data.
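A rough sketch of that access pattern (the URL and byte ranges are hypothetical; in a real reader they would come from the TIFF header/IFDs): because each block is independently addressable, the reads fan out in parallel instead of walking the file sequentially.

```python
from concurrent.futures import ThreadPoolExecutor
import requests

# Hypothetical COG and block byte ranges. Each block can be fetched
# independently, which is what makes the I/O "embarrassingly horizontal".
url = "https://example-bucket.s3.amazonaws.com/scene.tif"
ranges = ["bytes=16384-82943", "bytes=82944-149503", "bytes=149504-216063"]

def fetch(byte_range):
    return requests.get(url, headers={"Range": byte_range}).content

with ThreadPoolExecutor(max_workers=8) as pool:
    blocks = list(pool.map(fetch, ranges))

print([len(b) for b in blocks])
```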

@aluhamaa

aluhamaa commented Nov 9, 2023

It also ignores the fact that S3 is a server. There is no such thing as a client-only implementation here. There is always a server that is facilitating the requests from the client. The distinction is who is hosting those servers, which I don't think is relevant at all in determining if something is "cloud native". I could build my own implementation of the S3 API, deploy that to k8s, and serve my data through it and it's equivalent from a system-design perspective to using S3. The difference is I'm hosting it. There's obviously no reason to ever do this but it's a good thought experiment.

I've been using a particular cloud environment that has an S3-like interface to a particular type of EO data, and the experience is mixed at best. Having the protocol implemented without actual performance and scalability is not enough. So, I'd argue it is "not equivalent".

@geospatial-jeff

geospatial-jeff commented Nov 9, 2023

It also ignores the fact that S3 is a server. There is no such thing as a client-only implementation here. There is always a server that is facilitating the requests from the client. The distinction is who is hosting those servers, which I don't think is relevant at all in determining if something is "cloud native". I could build my own implementation of the S3 API, deploy that to k8s, and serve my data through it and it's equivalent from a system-design perspective to using S3. The difference is I'm hosting it. There's obviously no reason to ever do this but it's a good thought experiment.

I've been using a particular cloud environment that has an S3-like interface to a particular type of EO data, and the experience is mixed at best. Having the protocol implemented without actual performance and scalability is not enough. So, I'd argue it is "not equivalent".

The purpose of this thought experiment is to highlight that the way a file format is served doesn't change whether or not that file format is cloud-native. Is a COG only cloud-native if it lives in a blob store? If I download it to my local machine is it no longer cloud-native? What if I serve it over a shared filesystem like EBS where the data is stored on disk and mounted as a shared volume onto my server through a network protocol? Is that no longer cloud-native?

In practice I agree with you that performance of the server is important, but that's irrelevant to the point I'm trying to make. The cloud-nativeness of a file format depends on how well the internals of that file are structured to be served from the cloud, not the particular implementation or use case.

@aluhamaa

aluhamaa commented Nov 9, 2023

In practice I agree with you that performance of the server is important, but that's irrelevant to the point I'm trying to make. The cloud-nativeness of a file format depends on how well the internals of that file are structured to be served from the cloud, not the particular implementation or use case.

I do not disagree, but in practice the underlying infrastructure can be the decisive factor in whether a cloud-optimized format has real benefits or not. I'm seeing large projects using taxpayers' money create confusion here.
Being really cloud native, IMO, would also require looking at the writing side. In the case of COG, serving the data is really cloud optimized and performs well, but what about the writing part (maybe I'm missing something)? I can see Zarr being cloud native compared to NetCDF4, but the latter can also be efficiently read from S3 in chunks (awkward, but it can be done). The major difference of Zarr over NetCDF4 is parallel writes directly to object storage and fail-safe (sort of) expansion.
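A rough sketch of the parallel-write property I mean (Zarr v2 and s3fs assumed; the bucket and array layout are made up): each worker can write its own chunks directly to object storage without coordinating through a central server.

```python
import numpy as np
import s3fs
import zarr

# Hypothetical bucket; requires credentials with write access.
fs = s3fs.S3FileSystem()
store = s3fs.S3Map("example-bucket/temperature.zarr", s3=fs)

z = zarr.open_array(store, mode="w", shape=(100, 4096, 4096),
                    chunks=(1, 1024, 1024), dtype="float32")

# One worker writes one time step; other workers can write other slices
# concurrently because each chunk maps to its own object in the store.
z[0] = np.zeros((4096, 4096), dtype="float32")
```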

@geospatial-jeff

In the case of COG, serving the data is really cloud optimized and performs well, but what about the writing part (maybe I'm missing something)?

Yes, this is a really good point! And you are correct that COG does not support writes (updates) very well. Updating a single block in the image often requires rewriting the entire image. But that doesn't exclude it from being a cloud-native data format!

Read- vs. write-optimized formats are a classic example of how file formats are use-case driven. Some file formats are designed for optimized reads, some are designed for optimized writes, and very few are optimized for both. It wouldn't be equitable to take a read-optimized file format and evaluate it against a workflow that performs a lot of writes, just like it's not equitable to exclude a certain geospatial data format from being "cloud-native" because it doesn't serve a very narrow and specific use case.

When defining what it means to be a "cloud-native" data format, I think it's more productive to focus on the data formats themselves: what they are good at, what they are bad at, and what the commonalities between them are (as you just did comparing COG/NetCDF4/Zarr!). Then let end users make decisions on which formats best meet their individual needs based on those recommendations.

@jedsundwall
Member Author

jedsundwall commented Nov 9, 2023

@geospatial-jeff your "why is it different now?" discussion helps make a point I discussed at the Zarr meetup that kicked this off, which is that the cloud has changed how software and file formats interact. As you say, formats used to be inextricably linked to shrink-wrapped software written to run within specific environments. In contrast, when we're publishing data on S3, we don't know anything about where it's going to go, which, again, is why @jiayuasu's final point is so important: "Anyone can implement their own reader / writer to r/w data in this format."

Here's a slide from my presentation:
[Screenshot of a slide from the presentation]
(thx to @tmcw for introducing me to the Pace Layering concept.)

The point here is that if we can align around open standards, it should enable a very wide array of interfaces/applications on top of it. I'm of the opinion that we have to be agnostic about use case or implementation. The whole point (as I see it) is that we want to enable flexibility for people to do all sorts of weird things.

  • Old way: hardware requirements informed software development which constrained file formats.
  • Cloud-native way: open standards made available on scalable hardware with generic APIs leads to Cambrian explosion of software, interfaces, and implementations.

@alexgleith

Maybe you could replace "nature" with "OGC Standards" haha! Jokes jokes 😆
