diff --git a/docs/source/about.md b/docs/source/about.md
index 935f9240..156e7546 100644
--- a/docs/source/about.md
+++ b/docs/source/about.md
@@ -2,34 +2,41 @@
 
 ## How it works
 
-Internally, Pyogrio uses a numpy-oriented approach in Cython to read
+The "standard" mode in Pyogrio uses a numpy-oriented approach in Cython to read
 information about data sources and records from spatial data layers. Geometries
 are extracted from the data layer as Well-Known Binary (WKB) objects and fields
 (attributes) are read into numpy arrays of the appropriate data type. These are
-then converted to GeoPandas `GeoDataFrame`s.
+then converted to a GeoPandas `GeoDataFrame`.
 
-All records are read into memory, which may be problematic for very large data
-sources. You can use `skip_features` / `max_features` to read smaller parts of
-the file at a time.
+When the "Arrow" mode is used, (`use_arrow=True`), Pyogrio uses the Arrow Stream
+interface of GDAL, which reads the data to the
+[Apache Arrow](https://arrow.apache.org/) memory format. After reading the data,
+Pyogrio converts the data to a `GeoDataFrame`. Because this code path is even
+more optimized, also in GDAL, using `use_arrow=True` can give a significant
+performance boost, especially when reading large files.
 
-The entire `GeoDataFrame` is written at once. Incremental writes or appends to
-existing data sources are not supported.
+All records are read into memory in bulk. This is very fast, but can give memory
+issues when reading very large data sources. To solve this, Pyogrio exposes
+several options offered by GDAL to filter the data while being read.
+Some examples are a filter on a `bbox`, use `skip_features` / `max_features`,
+using a `sql` statement, etc. The performance of the filtering depends on the
+file format being read, e.g. the availability of (spatial) indexes, etc.
+
+When writing, the entire `GeoDataFrame` is written at once, but it is possible
+to append data to an existing data source.
 
 ## Comparison to Fiona
 
-[Fiona](https://github.com/Toblerity/Fiona) is a full-featured Python library
-for working with OGR vector data sources. It is **awesome**, has highly-dedicated
-maintainers and contributors, and exposes more functionality than Pyogrio ever will.
-This project would not be possible without Fiona having come first.
-
-Pyogrio uses a bulk-oriented approach for reading and writing
-spatial vector file formats, which enables faster I/O operations. It borrows
-from the internal mechanics and lessons learned of Fiona. It uses a stateless
-approach to reading or writing data; all data are read or written in a single
-pass.
-
-`Fiona` is a general-purpose spatial format I/O library that is used within many
-projects in the Python ecosystem. In contrast, Pyogrio specifically targets
-GeoPandas in order to reduce the number of data transformations currently
-required to read and write data between GeoPandas `GeoDataFrame`s and OGR data
-sources using Fiona (the current default in GeoPandas).
+[Fiona](https://github.com/Toblerity/Fiona) is a full-featured, general-purpose
+Python library for working with OGR vector data sources. It is **awesome**, has
+highly-dedicated maintainers and contributors, and exposes more functionality than
+Pyogrio ever will. Finally it is used in many projects in the Python ecosystem.
+
+In contrast, Pyogrio specifically targets the typical needs of GeoPandas. It uses
+a stateless approach, so all data are read or written in a single pass. This
+bulk-oriented approach enables significantly faster I/O operations, especially
+for larger datasets.
+
+Pyogrio borrows from the internal mechanics and lessons
+learned of Fiona and so this project would not have been possible without Fiona
+having come first.
diff --git a/docs/source/index.md b/docs/source/index.md
index bc008d6f..a8e1217c 100644
--- a/docs/source/index.md
+++ b/docs/source/index.md
@@ -16,9 +16,9 @@ Pyogrio is fast because it uses pre-compiled bindings for GDAL/OGR to read and
 write the data records in bulk. This approach avoids multiple steps of
 converting to and from Python data types within Python, so performance becomes
 primarily limited by the underlying I/O speed of data source drivers in
-GDAL/OGR.
+GDAL/OGR.
 
-We have seen \>5-10x speedups reading files and \>5-20x speedups writing files
+We have seen \>5-100x speedups reading files and \>5-20x speedups writing files
 compared to using row-per-row approaches (e.g. Fiona).
 
 ```{toctree}