docs/source/about.md

## How it works

The "standard" mode in Pyogrio uses a numpy-oriented approach in Cython to read
information about data sources and records from spatial data layers. Geometries
are extracted from the data layer as Well-Known Binary (WKB) objects and fields
(attributes) are read into numpy arrays of the appropriate data type. These are
then converted to a GeoPandas `GeoDataFrame`.
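
For example, reading a data source into a `GeoDataFrame` is a single call. This
is a minimal sketch; the file name is only an illustration:

```python
import pyogrio

# Read all records of the (first) layer into a GeoDataFrame using the
# default numpy-oriented code path.
df = pyogrio.read_dataframe("countries.gpkg")
```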

When the "Arrow" mode is used (`use_arrow=True`), Pyogrio uses the Arrow Stream
interface of GDAL, which reads the data into the
[Apache Arrow](https://arrow.apache.org/) memory format. After reading, Pyogrio
converts the data to a `GeoDataFrame`. Because this code path is more heavily
optimized, both in Pyogrio and in GDAL, using `use_arrow=True` can give a
significant performance boost, especially when reading large files.
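
A minimal sketch of enabling the Arrow mode (the file name is again only an
illustration; this mode requires `pyarrow` to be installed):

```python
import pyogrio

# Read via GDAL's Arrow Stream interface instead of the default code path.
df = pyogrio.read_dataframe("countries.gpkg", use_arrow=True)
```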

All records are read into memory in bulk. This is very fast, but can cause
memory issues when reading very large data sources. To mitigate this, Pyogrio
exposes several options offered by GDAL to filter the data while it is being
read: for example, a filter on a `bbox`, `skip_features` / `max_features`, or a
`sql` statement. The performance of such filtering depends on the file format
being read, e.g. on the availability of a (spatial) index.
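
A few of these options are sketched below; the file name, bounding box values,
and the layer and column names in the SQL statement are illustrative only:

```python
import pyogrio

# Only read features intersecting a bounding box (xmin, ymin, xmax, ymax),
# expressed in the coordinate system of the data source.
df = pyogrio.read_dataframe("countries.gpkg", bbox=(0.0, 40.0, 20.0, 60.0))

# Skip the first 100 features, then read at most 50 of the remaining ones.
df = pyogrio.read_dataframe("countries.gpkg", skip_features=100, max_features=50)

# Let GDAL evaluate a SQL statement against the data source.
df = pyogrio.read_dataframe(
    "countries.gpkg",
    sql="SELECT * FROM countries WHERE pop_est > 10000000",
)
```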

When writing, the entire `GeoDataFrame` is written at once, but it is possible
to append data to an existing data source.
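
A minimal sketch of writing and then appending, assuming `df` and `more_df` are
existing `GeoDataFrame`s with the same schema (note that append support depends
on the GDAL driver used):

```python
import pyogrio

# Write the full GeoDataFrame in a single pass, creating the file.
pyogrio.write_dataframe(df, "output.gpkg")

# Append additional records to the now-existing data source.
pyogrio.write_dataframe(more_df, "output.gpkg", append=True)
```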

## Comparison to Fiona

[Fiona](https://github.com/Toblerity/Fiona) is a full-featured, general-purpose
Python library for working with OGR vector data sources. It is **awesome**, has
highly-dedicated maintainers and contributors, and exposes more functionality
than Pyogrio ever will. Finally, it is used in many projects across the Python
ecosystem.

In contrast, Pyogrio specifically targets the typical needs of GeoPandas. It uses
a stateless approach, so all data are read or written in a single pass. This
bulk-oriented approach enables significantly faster I/O operations, especially
for larger datasets.

Pyogrio borrows from the internal mechanics of, and lessons learned from,
Fiona; this project would not have been possible without Fiona having come
first.
docs/source/index.md

Pyogrio is fast because it uses pre-compiled bindings for GDAL/OGR to read and
write the data records in bulk. This approach avoids multiple steps of
converting to and from Python data types within Python, so performance becomes
primarily limited by the underlying I/O speed of data source drivers in
GDAL/OGR.

We have seen \>5-100x speedups reading files and \>5-20x speedups writing files
compared to using row-by-row approaches (e.g. Fiona).
