[FEA]: Improve pyarrow integration/IO performance using geoarrow-python #1288
Comments
Hi @paleolimbot! Thanks for submitting this issue - our team has been notified and we'll get back to you as soon as we can!
Thanks for the feature request. @paleolimbot where is the CRS in the example?
It's a property of the (Arrow) type!
from geoarrow.pyarrow import io
tbl = io.read_pyogrio_table("/vsizip/vsicurl/https://github.com/geoarrow/geoarrow-data/releases/download/v0.1.0/ns-water-basin_point.fgb.zip")
tbl["wkb_geometry"].type.crs
#> '{"$schema":"https://proj.org/schemas/v0.7/projjson.schema.json","type":"Projected...
The full serialization of the type is described in the 'extension types' section ( https://github.com/geoarrow/geoarrow/blob/main/extension-types.md ), and you can access it using
Hey @paleolimbot! Thanks for the update. I've been following your geoarrow work for a long while and am pretty excited to integrate it. I wrote a simple wrapper a few months ago before
Yes!
Is this a new feature, an improvement, or a change to existing functionality?
New Feature
How would you describe the priority of this feature request
Medium
Please provide a clear description of problem you would like to solve.
Now that geoarrow-pyarrow ( https://github.com/geoarrow/geoarrow-python ) is available and the GeoArrow specification has an initial 0.1 release, there are potential synergies we may be able to leverage given the common memory layout! Basically, geoarrow-pyarrow implements a pyarrow.DataType subclass for geometry with a type-level place to store the coordinate reference system. It would be very cool if cudf.Series.from_arrow() could handle these (or whatever the best interface is from your end).
I also think it has the potential to significantly speed up IO compared with the current geopandas.read_file() + cuspatial.GeoSeries.from_geopandas() (as a rough estimate from the musings below, assembling linestrings from a large-ish FlatGeobuf was about 20x faster).
Happy to implement anything in geoarrow-c or geoarrow-python that makes this easier! We're slowly working on getting both on conda-forge (they're on pip already).
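For readers unfamiliar with what a "common memory layout" means here, the following is a minimal pure-Python sketch, not the geoarrow or cuspatial implementation. It assumes a simplified linestring layout made of flat coordinate buffers plus an offsets buffer marking where each linestring starts and ends. The helper names to_ragged and from_ragged are hypothetical, and real layouts differ in details such as whether x and y are interleaved or stored separately.

```python
from typing import List, Sequence, Tuple

Coord = Tuple[float, float]


def to_ragged(
    linestrings: Sequence[Sequence[Coord]],
) -> Tuple[List[float], List[float], List[int]]:
    """Flatten linestrings into x and y buffers plus an offsets buffer.

    Hypothetical helper: offsets[i] and offsets[i + 1] bound the
    coordinates of linestring i within the flat buffers.
    """
    xs: List[float] = []
    ys: List[float] = []
    offsets: List[int] = [0]
    for line in linestrings:
        for x, y in line:
            xs.append(x)
            ys.append(y)
        # Record where this linestring ends (and the next one begins).
        offsets.append(len(xs))
    return xs, ys, offsets


def from_ragged(
    xs: Sequence[float], ys: Sequence[float], offsets: Sequence[int]
) -> List[List[Coord]]:
    """Rebuild the list of linestrings from the flat buffers."""
    return [
        list(zip(xs[start:end], ys[start:end]))
        for start, end in zip(offsets[:-1], offsets[1:])
    ]


lines = [
    [(0.0, 0.0), (1.0, 1.0)],
    [(2.0, 2.0), (3.0, 3.0), (4.0, 4.0)],
]
xs, ys, offsets = to_ragged(lines)
restored = from_ragged(xs, ys, offsets)
```

Because both sides describe geometry as a handful of contiguous buffers like these, handing data from one library to the other can in principle avoid building per-feature Python objects, which is where the hoped-for IO speedup would come from.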
Describe any alternatives you have considered
The closest thing that currently provides this functionality is from_geopandas(), with Shapely's to_ragged_array and from_ragged_array also providing similar buffer building/parsing capability.
Additional context
Some musings with a large-ish linestring dataset (with apologies if I'm missing some obvious usage I should be aware of):
There are more example datasets at https://geoarrow.org/data (although I'm sure you have many internally as well).