New options: Percent coverage selection and weighting #136
Conversation
…some test coverage).
src/rasterstats/main.py
Outdated
```python
if percent_cover_selection is not None:
    masked = np.ma.MaskedArray(
        fsrc.array,
        mask=(isnodata | ~rv_array | percent_cover > percent_cover_selection))
```
This seems to be wrong - `percent_cover` is a boolean, and shouldn't the inequality go the other way? Right now it excludes values which are higher than the selection criteria. But maybe I am just not reading it correctly...
Right - that should be `rv_array < percent_cover_selection`.
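To make the corrected condition concrete, here is a minimal runnable sketch with toy data; the variable names mirror the diff above, but the standalone arrays and values are hypothetical:

```python
import numpy as np

# Toy stand-ins for the variables in the diff (hypothetical values;
# rv_array here holds float percent-cover values in [0, 1] per cell).
fsrc_array = np.array([[1.0, 2.0], [3.0, 4.0]])
isnodata = np.array([[False, False], [False, True]])
rv_array = np.array([[0.9, 0.4], [0.0, 1.0]])
percent_cover_selection = 0.5

# Corrected mask: drop nodata cells and cells below the coverage threshold.
masked = np.ma.MaskedArray(
    fsrc_array,
    mask=(isnodata | (rv_array < percent_cover_selection)))
print(masked)
# [[1.0 --]
#  [-- --]]
```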
src/rasterstats/main.py
Outdated
```python
else:
    masked = np.ma.MaskedArray(
        fsrc.array,
        mask=(isnodata | ~rv_array))
```
This will raise a TypeError if `rv_array` comes from `rasterize_pctcover_geom` - it won't be boolean, but will instead have dtype float32, raising `TypeError: ufunc 'invert' not supported for the input types`.
Good catch - I think I can switch it to `np.logical_not(rv_array)` instead.
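For illustration, a quick standalone demonstration of the dtype issue (toy array, not the PR's code):

```python
import numpy as np

rv_pct = np.array([[0.0, 0.5], [1.0, 0.25]], dtype=np.float32)

try:
    ~rv_pct  # bitwise invert is only defined for bool/integer dtypes
except TypeError as e:
    print(e)  # ufunc 'invert' not supported for the input types, ...

# np.logical_not treats any nonzero value as True, so it works for floats:
print(np.logical_not(rv_pct))
# [[ True False]
#  [False False]]
```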
src/rasterstats/main.py
Outdated
```python
else:
    rv_array = rasterize_geom(
        geom, shape=fsrc.shape, affine=fsrc.affine,
        all_touched=all_touched)
```
This looks like it is asking for trouble - these shouldn't be labelled the same thing, as they have different meanings and dtypes.
agreed, will change that
…nitions Create a separate var name for the pct cover rasterized array (`rv_pct_array`, float) to distinguish it from the basic `rv_array` (bool). Update the masks for these to use `np.logical_not` to handle non-bool data. Fix a bug in the pct cover selection mask (was using the wrong var and the wrong logical check). Update all `rv_array` references related to pct cover cases to use the new `rv_pct_array`.
Pixels aren't always square
Clearly distinguish between raster mask and weight arrays
Will need to consider how using percent cover as a selection method impacts the nodata stat. May need to be a separate field?
Percent cover into geo master
Concretely, I'd like to explore a different implementation that:
- removes the latitude correction concept (keeps the PR focused)
- keeps code complexity in the main function to a minimum (fewer `if`s)
- removes the breaking change to the `rasterize_geom` function signature
- uses a more flexible function for upsampling (I need to look more closely at this implementation, not sure I understand it fully)
Things like docs and command line interface can come in a later PR.
```python
warnings.warn('Value for `percent_cover_scale` given ({0}) '
              'was converted to int ({1}) but does not '
              'match original value'.format(
                  percent_cover_scale, int(percent_cover_scale)))
```
Can you explain why we need to limit to integers?
The `reshape` performed in the `rebin_sum` function requires ints: https://docs.scipy.org/doc/numpy/reference/generated/numpy.reshape.html
```python
# resize-with-averaging-or-rebin-a-numpy-2d-array/8090605#8090605
def rebin_sum(a, shape, dtype):
    sh = shape[0], a.shape[0] // shape[0], shape[1], a.shape[1] // shape[1]
    return a.reshape(sh).sum(-1, dtype=dtype).sum(1, dtype=dtype)
```
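A usage sketch may help: summing each scale×scale block of the finer rasterization counts the covered subcells, so dividing the result by `scale**2` would give a per-cell coverage estimate. The example below repeats the definition above so it runs standalone; the toy input is hypothetical.

```python
import numpy as np

def rebin_sum(a, shape, dtype):
    sh = shape[0], a.shape[0] // shape[0], shape[1], a.shape[1] // shape[1]
    return a.reshape(sh).sum(-1, dtype=dtype).sum(1, dtype=dtype)

# A 4x4 fine grid aggregated down to 2x2: each output cell is the sum of a
# 2x2 block. The reshape only works when the target shape divides the input
# shape evenly, which is why percent_cover_scale is limited to ints.
fine = np.ones((4, 4), dtype=np.float32)
print(rebin_sum(fine, (2, 2), np.float32))
# [[4. 4.]
#  [4. 4.]]
```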
I haven't dug into this code, but why choose this implementation over other methods of resampling? Specifically, using Rasterio's resampling techniques would give us more control over the resampling methods versus assuming sum: https://mapbox.github.io/rasterio/topics/resampling.html

Also, "rebin" implies categorizing pixel values; I think "upsample" or similar would be a more accurate function name.
I don't think I looked into Rasterio's resampling methods, but I tested out a couple of different implementations (one was a proof of concept you put together for Rasterio, rasterio/rasterio#232; another was a more generalized aggregation scheme I pulled from another project of mine, which had way too much overhead for what was needed here) and this method was a fair bit faster with less code. My main concern with any method here is minimizing the additional time/memory required to run when using this feature.
Did you have a use case in mind that would require using a method other than sum?
I am on board with renaming it to something more accurate; I had just kept it similar to the original SO function I used.
src/rasterstats/main.py
Outdated
```python
            (masked.shape[1] - np.sum(masked.mask, axis=1))))
else:
    feature_stats['mean'] = float(masked.mean())
```
The latitude correction and the percent cover add branching complexity to the code and detract from readability a bit. Let's implement this with fewer `if` statements to test.
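One common way to cut that branching (a hedged sketch, not this PR's code) is to always compute against a weights array and default it to ones:

```python
import numpy as np

# With uniform weights the weighted mean reduces to the plain mean, so the
# weighted and unweighted cases can share a single code path.
masked = np.ma.MaskedArray([1.0, 2.0, 3.0], mask=[False, False, True])

weights = np.ones(masked.shape)          # unweighted case: uniform weights
print(float(np.ma.average(masked, weights=weights)))  # 1.5

weights = np.array([0.25, 1.0, 1.0])     # percent-cover weights
print(float(np.ma.average(masked, weights=weights)))  # 1.8
```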
sounds good
Sorry if this is off-topic/spam, but I've been working on a similar project that I thought might be of interest. It's some C++ code, exposed as an R package, that computes percent coverage exactly (it's a vector-based calculation), but also extremely quickly (it's faster than most cell-center-based implementations). The approach is described here. It seems like it should be possible to wrap it up so that it can be called on numpy arrays.
Hoping to have a chance to get back to this in the next month or two
…ic to separate util functions (could expand this to a class that covers all stat options later on)
Finally had some time to come back to this
@perrygeo - let me know if you still have concerns regarding the `percent_cover_scale` or `rebin_sum` function (or anything else). Also, I would definitely be interested in checking out a wrapper of what @dbaston has put together.
triggering CI rebuild
Hi all - just checking on the status here. This seems to have stalled more than once and I'd like to see if we can help move it forward, if at all possible.
I hate to do the annoying what's-the-status-of-this-PR, but here I am :) Is there some known big caveat before attempting to use this change?
@akrherz I haven't tested it extensively, but I believe that, from a correctness-of-output perspective, you can rely on these changes. You could help by validating the results and posting here. The remaining work concerns mitigating performance impacts, documentation, API changes, etc. before cutting a release. Nothing too significant, but I simply don't have the time to do the work. If you can use this branch instead of an official release, please do, and let us know how it works for you.
@sgoodm, I'm a geospatial noob, but before I saw your approach to solving weighted extracts, …
@davidabek1 - vectorizing for exact intersections will not have any flaws in terms of accuracy, but will be more costly in terms of computation. For a few small geometries, like in that Stack question, the additional computation probably won't matter to you, but if you want exact coverage for many large geometries, the cost of those vector-based intersections over many pixels adds up.
@davidabek1 @sgoodm The approach of https://github.com/isciences/exactextract is equivalent to the Stack Exchange link; the difference is that exactextract does the vector-based intersections in a way that takes advantage of the fact that the raster isn't an arbitrary polygon - it's a grid. Done this way, there is no performance penalty for the exact calculation; in many cases it is faster because of all of the avoided point-in-polygon tests.
Adds code to generate percent coverage estimates for raster cells by rasterizing vector features at a finer resolution than the raster data, then aggregating back to the raster's resolution. An adjustable scale (`percent_cover_scale`, int arg) controls how accurate the estimates are: a scale of 10 means one order of magnitude finer resolution and is generally good enough to get estimates within 10% of actual; a scale of 100 will usually get you well under 1%, but is slower and requires quite a bit more memory.*

Users can use these coverage estimates either to discard cells that do not meet a minimum coverage (`percent_cover_selection`, float arg) or to use the weights to generate (potentially) more accurate measures of the mean, count, and sum stats (`percent_cover_weighting`, bool arg).

*I have some misc memory improvements that I can put into another pull request.
(I can take care of merging this PR with the latitude correction PR I submitted if you want to use both of them.)
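For anyone wanting to try this out, a hedged usage sketch based on the argument names described above (these options exist only on this PR's branch, not in any released version, and the file paths are placeholders):

```python
from rasterstats import zonal_stats

# Weight mean/count/sum by estimated percent cover, with coverage estimated
# at 10x finer resolution than the raster.
stats = zonal_stats(
    "polygons.shp", "data.tif",
    stats=["mean", "count", "sum"],
    percent_cover_weighting=True,
    percent_cover_scale=10)

# Or discard cells covered less than 50% instead of weighting:
stats = zonal_stats(
    "polygons.shp", "data.tif",
    percent_cover_selection=0.5,
    percent_cover_scale=10)
```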