Performance optimizations #268
base: master
Conversation
except AttributeError:
    pixel_count = dict(zip([np.asscalar(k) for k in keys],
                           [np.asscalar(c) for c in counts]))
pixel_count = dict(zip(keys.tolist(), counts.tolist()))
This builds pixel_count in C instead of Python, and is consequently significantly faster.

Also, pixel_count should really be a list of tuples. Floats do not make good keys in a hash table.
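For illustration only, a minimal standalone sketch of the two ways of building the dict (using .item() in place of the deprecated np.asscalar); the sample values here are made up, not from the library:

import numpy as np

values = np.array([3, 3, 7, 7, 7, 9], dtype=np.int32)
keys, counts = np.unique(values, return_counts=True)

# Per-element conversion: one Python-level call per array element
slow = dict(zip([k.item() for k in keys], [c.item() for c in counts]))

# tolist() converts the whole array to native Python scalars in one C-level call
fast = dict(zip(keys.tolist(), counts.tolist()))

assert slow == fast == {3: 2, 7: 3, 9: 1}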
Good idea on the pixel_count calcs.

Tuples would work, though I'm hesitant to change that now due to backwards compatibility with existing code. I'd consider it, though. In practice, categorical rasters (the only rasters it would make sense to request pixel counts on) are going to be some integer dtype; a categorical float would be exceedingly rare.
Perhaps adding a note about this stat being intended for categorical rasters in the docstring would be enough.
@@ -205,14 +207,10 @@ def gen_zonal_stats(
        if 'count' in stats:  # special case, zero makes sense here
            feature_stats['count'] = 0
        else:
            valid = masked.compressed()
            keys, counts = np.unique(valid, return_counts=True)
The unique elements, keys, have the really useful property of being sorted. We can take advantage of that for several stats and thus eliminate two O(n) passes (min, max) over the raster data. We can also use it to avoid a sort in computing the median.
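For example, a rough sketch (not necessarily how this PR implements it) of deriving min, max, and the median from the already-sorted keys and counts; the sample array is made up:

import numpy as np

valid = np.array([7.0, 3.0, 3.0, 9.0, 7.0, 3.0])
keys, counts = np.unique(valid, return_counts=True)  # keys come back sorted ascending

vmin = keys[0]    # no separate O(n) pass for the minimum
vmax = keys[-1]   # no separate O(n) pass for the maximum

# Median from cumulative counts, without sorting the raw values again
cum = np.cumsum(counts)
n = cum[-1]
if n % 2:
    median = keys[np.searchsorted(cum, n // 2 + 1)]
else:
    lower = keys[np.searchsorted(cum, n // 2)]
    upper = keys[np.searchsorted(cum, n // 2 + 1)]
    median = (lower + upper) / 2

assert vmin == valid.min() and vmax == valid.max() and median == np.median(valid)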
Nice idea. I like this approach. Two comments:

- We could eliminate the np.unique call when the stats do not require it. I guess most stats do now... but we could avoid it entirely if the stats were only mean, count, or std (a rough sketch follows at the end of this comment).

- This impacts memory usage. Keeping the entire keys and counts arrays in memory for the remainder of the loop might be a bad tradeoff in some cases. Prior to this PR, the unique stat was the only one that called np.unique, and users had to opt in. That stat is intended for categorical rasters - using it on continuous values would not make much sense. Now we're calling np.unique on every invocation and using it to derive other stats - fine in theory. But for continuous rasters, or those with 16+ bits per band, there are not likely to be many duplicates - meaning we're allocating keys and counts arrays that are both roughly the length of the masked array itself (a possible 200% increase in memory footprint). This causes a performance regression for several use cases: code which never had to call np.unique now does, and allocates huge chunks of memory as a result. Multiple O(n) passes have the advantage of not requiring new allocations unless unique is specifically requested.
So while this may speed up some use cases, it may slow down others and will certainly cause additional resource usage. I'll have to do some more testing to quantify the impact and evaluate the tradeoffs.
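A rough sketch of that gating idea, with made-up names rather than rasterstats' real internals, so np.unique is only paid for when a requested stat needs it:

import numpy as np

# Hypothetical grouping of stats that can be derived from keys/counts
UNIQUE_BASED = {"min", "max", "median", "majority", "minority", "unique", "range"}

def summarize(masked, stats):
    valid = masked.compressed()
    keys = counts = None
    if UNIQUE_BASED & set(stats):
        # Allocate keys/counts only when something actually uses them
        keys, counts = np.unique(valid, return_counts=True)

    out = {}
    if "mean" in stats:
        out["mean"] = float(valid.mean())
    if "count" in stats:
        out["count"] = int(valid.size)
    if "std" in stats:
        out["std"] = float(valid.std())
    if "min" in stats:
        out["min"] = float(keys[0])
    if "max" in stats:
        out["max"] = float(keys[-1])
    if "majority" in stats:
        out["majority"] = float(keys[np.argmax(counts)])
    if "unique" in stats:
        out["unique"] = int(keys.size)
    return out

# e.g. summarize(np.ma.masked_invalid([[1.0, 2.0], [np.nan, 4.0]]), ["mean", "count"])
# never touches np.unique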
A quick benchmark using some datasets from your blog @perrygeo (https://www.perrygeo.com/zonal-stats-with-postgis-rasters-part-2.html):

import time
import rasterstats
print(rasterstats)
from rasterstats import zonal_stats

start = time.perf_counter()
stats = zonal_stats(
    "ne_50m_admin_0_countries/ne_50m_admin_0_countries.shp",
    "wc2.1_2.5m_tavg/wc2.1_2.5m_tavg_07.tif",
    stats=["unique", "range", "std", "sum", "mean", "count", "std",
           "min", "max", "median", "percentile_.1", "percentile_.5", "percentile_.9"],
)
print(time.perf_counter() - start)

On my laptop, the latest version of rasterstats runs in 11.847925074999694s. With this PR, that runtime drops to 8.6056675789996s.
Thanks @groutr - excellent work!
I'm concerned about the additional memory usage here. I need to run some more rigorous benchmarks to quantify the impact on all the various edge cases.
Hopefully it's a net win for speed in most cases, and a tolerable increase in memory footprint.
Initial testing

Setup

Results (basic stats)

zonal_stats(polygon, raster, stats=["min", "max", "count"])

master (fastest, lowest memory footprint)
groutr:perf1

So as I suspected, in the case where the user is requesting only basic stats, this PR will increase the memory footprint (in this case +12%) and the run time (+48%). There are a significant number of duplicates in this dataset (since the tif is integer values of elevation, highly spatially autocorrelated), so I'd suspect that datasets with larger integers or floats would have more unique values and the numbers would be worse. But for now, we'll just focus on this dataset...

Results (basic stats + unique)

Let's add unique to invoke np.unique:

zonal_stats(polygon, raster, stats=["min", "max", "count", "unique"])

master: (lowest memory footprint)
groutr:perf1 (fastest)

It's a draw. This PR is faster at the expense of an increased memory footprint.

Results (all stats)

Let's try it with every statistic we can calculate:

zonal_stats(polygon, raster, stats="*")

master:
groutr:perf1 (fastest, lowest memory footprint)

This PR shines ✨

Conclusion

There's no definitive answer, only tradeoffs :-)

This PR's approach is faster, with a lower memory footprint, if all the stats are requested. This PR's approach is slower, with an increased memory footprint, when using a small subset of stats and fewer duplicate pixel values. For this dataset, the number of unique values was 4 orders of magnitude lower than the total count, i.e. plenty of duplicates. We can intuit that the memory footprint and performance of this PR would worsen as the number of unique values approached the total count. If I have time in the future, I'd like to test with a floating point grid of random noise to simulate the worst case scenario.

We know that this PR improves performance for certain cases but introduces performance regressions in other cases; it is not a clear win over the current approach. The question is: on average, do the benefits outweigh the negatives across a wide variety of use cases? I'm going to need to see more benchmark data from real-world datasets to make the decision.
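A quick standalone way to see that worst case without a full zonal_stats benchmark (synthetic arrays only; the sizes and value ranges are made up):

import numpy as np

rng = np.random.default_rng(0)

# Categorical-style integer data: few unique values, tiny keys/counts arrays
cats = rng.integers(0, 50, size=1_000_000)
k, c = np.unique(cats, return_counts=True)
print(k.size, "unique values out of", cats.size)          # ~50

# Continuous random floats: nearly every value is unique, so keys + counts
# together cost roughly twice the memory of the data itself
noise = rng.random(1_000_000)
k, c = np.unique(noise, return_counts=True)
print(k.size, "unique values out of", noise.size)          # ~1,000,000
print("extra bytes:", k.nbytes + c.nbytes, "data bytes:", noise.nbytes)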
Looking for feedback/testing on real-world datasets.
Does this speed things up for you? Does it increase or decrease memory usage? Please let me know in the comments.
@groutr we could also try to make this

@groutr Would you like to split the percentile computation into a separate PR? That seems like a clear win to me and I don't want to hold that part up due to the np.unique issues.

@perrygeo I will need to dig up my real-world test case today. IIRC, the changes in the PR yielded a dramatic performance increase.

Sure!

@perrygeo My real-world test case was flawed. The performance difference was largely accounted for by a change in formats that escaped my notice. However, I've been thinking about this further and wanted to hear your thoughts about the following idea.
A small collection of performance enhancements aimed at reducing the number of times the data is traversed.
Also improved percentile computation by computing all requested percentiles at once instead of one at a time (and avoiding recreating the compressed array each time).
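For illustration, a small standalone sketch of the batched percentile call (not the PR's exact code; the sample values are made up):

import numpy as np

masked = np.ma.masked_less(np.array([3.0, 7.0, -1.0, 9.0, 5.0]), 0)
valid = masked.compressed()   # build the compressed array once

# One at a time: three separate calls over the same data
p10 = np.percentile(valid, 10)
p50 = np.percentile(valid, 50)
p90 = np.percentile(valid, 90)

# All at once: np.percentile accepts a sequence of percentiles
p10b, p50b, p90b = np.percentile(valid, [10, 50, 90])

assert np.allclose([p10, p50, p90], [p10b, p50b, p90b])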