
New options: Percent coverage selection and weighting #136

Open · sgoodm wants to merge 26 commits into perrygeo:master from sgoodm:percent_cover
Conversation

@sgoodm (Contributor) commented Nov 21, 2016

Adds code to generate percent coverage estimates for raster cells by rasterizing vector features at a finer resolution than the raster data, then aggregating back to the raster data's resolution. An adjustable scale (the `percent_cover_scale` int arg) controls how accurate the estimates are: a scale of 10 means one order of magnitude finer resolution and is generally good enough to get estimates within 10% of actual; a scale of 100 will usually get you well under 1%, but is slower and requires quite a bit more memory*.
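For readers unfamiliar with the technique, here is a minimal sketch of the rasterize-finer-then-aggregate idea, assuming rasterio, numpy, and the affine package; the function `pct_cover` and its signature are illustrative, not the PR's actual API:

```python
import numpy as np
from affine import Affine
from rasterio import features

def pct_cover(geom, affine, shape, scale=10):
    # Rasterize the geometry onto a grid `scale` times finer than the target.
    fine_affine = affine * Affine.scale(1.0 / scale)
    fine_shape = (shape[0] * scale, shape[1] * scale)
    fine = features.rasterize(
        [(geom, 1)], out_shape=fine_shape, transform=fine_affine,
        fill=0, dtype='uint8')
    # Aggregate each scale x scale block back to the original resolution;
    # the block mean is the fraction of the coarse cell covered.
    return fine.reshape(shape[0], scale, shape[1], scale).mean(axis=(1, 3))
```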

Users can use these coverage estimates either to discard cells that do not meet a minimum coverage (the `percent_cover_selection` float arg) or to weight cells and generate (potentially) more accurate mean, count, and sum stats (the `percent_cover_weighting` bool arg).

*I have some misc memory improvements that I can put into another pull request.

(I can take care of merging this PR with the latitude correction PR I submitted, if you want to use both of them.)
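As a rough illustration of the weighting option described above, here is a hedged sketch of coverage-weighted count/sum/mean, assuming `values` is a masked array of raster data for one feature and `pct` holds fractional cover per cell (e.g. from the sketch above); this is not the PR's actual implementation:

```python
import numpy as np

def weighted_stats(values, pct):
    # Apply the raster's mask to the coverage weights so masked cells drop out.
    w = np.ma.MaskedArray(pct, mask=values.mask)
    return {
        'count': float(w.sum()),                      # fractional cell count
        'sum': float((values * w).sum()),             # coverage-weighted sum
        'mean': float((values * w).sum() / w.sum()),  # coverage-weighted mean
    }
```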

@coveralls commented Nov 21, 2016

Coverage Status

Coverage decreased (-6.4%) to 91.328% when pulling ba627d7 on sgoodm:percent_cover into 6bdb524 on perrygeo:master.

@coveralls commented Jan 11, 2017

Coverage Status

Coverage decreased (-5.9%) to 91.837% when pulling 6725da1 on sgoodm:percent_cover into 6bdb524 on perrygeo:master.

```python
if percent_cover_selection is not None:
    masked = np.ma.MaskedArray(
        fsrc.array,
        mask=(isnodata | ~rv_array | percent_cover > percent_cover_selection))
```
Reviewer:

This seems to be wrong - `percent_cover` is a boolean, and shouldn't the inequality go the other way? Right now it excludes values that are higher than the selection criterion. But maybe I am just not reading it correctly...

@sgoodm (Contributor, Author) replied:

Right, that should be `rv_array < percent_cover_selection`.

```python
else:
    masked = np.ma.MaskedArray(
        fsrc.array,
        mask=(isnodata | ~rv_array))
```
Reviewer:

This will raise a TypeError if `rv_array` comes from `rasterize_pctcover_geom` - it won't be boolean, but will instead have dtype float32, raising `TypeError: ufunc 'invert' not supported for the input types`.

@sgoodm (Contributor, Author) replied:

Good catch. I think I can switch it to `np.logical_not(rv_array)` instead.
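For illustration, a minimal numpy demo of the failure mode and of the `np.logical_not` fix; the array values here are made up:

```python
import numpy as np

rv_pct = np.array([[0.0, 0.25], [1.0, 0.5]], dtype=np.float32)
# mask = ~rv_pct  # raises TypeError: ufunc 'invert' not supported for float input
mask = np.logical_not(rv_pct)  # works: True where coverage is exactly zero
```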

```python
else:
    rv_array = rasterize_geom(
        geom, shape=fsrc.shape, affine=fsrc.affine,
        all_touched=all_touched)
```
Reviewer:

This looks like it is asking for trouble - these shouldn't be labelled the same thing, as they have different meanings and dtypes.

@sgoodm (Contributor, Author) replied:

Agreed, will change that.

cmutel and others added 3 commits March 3, 2017 14:10

…nitions

Create separate var name for pct cover rasterized array (rv_pct_array, float)
to distinguish from basic rv_array (bool). Update masks for these to use
np.logical_not to handle non-bool data. Fix bug in pct cover selection mask
(was using wrong var and wrong logical check). Updated all rv_array references
related to pct cover cases to use new rv_pct_array.
@coveralls

Coverage Status

Coverage decreased (-6.2%) to 91.575% when pulling a03eb04 on sgoodm:percent_cover into 6bdb524 on perrygeo:master.

@coveralls

Coverage Status

Coverage increased (+0.04%) to 97.802% when pulling 7000632 on sgoodm:percent_cover into 6bdb524 on perrygeo:master.

@coveralls

Coverage Status

Coverage increased (+0.04%) to 97.806% when pulling 85f62a1 on sgoodm:percent_cover into 6bdb524 on perrygeo:master.

@sgoodm (Contributor, Author) commented Mar 24, 2017

Will need to consider how using percent cover as a selection method impacts the nodata stat. It may need to be a separate field?

@coveralls

Coverage Status

Coverage increased (+0.02%) to 97.786% when pulling 644ddc3 on sgoodm:percent_cover into 6bdb524 on perrygeo:master.

@perrygeo mentioned this pull request Mar 26, 2017
@perrygeo (Owner) left a comment

Concretely, I'd like to explore a different implementation that:

  • removes the latitude correction concept (keeps the PR focused)
  • keeps code complexity in the main function to a minimum (fewer ifs)
  • removes the breaking change to the rasterize_geom function signature
  • uses a more flexible function for upsampling (I need to look more closely at this implementation; not sure I understand it fully)

Things like docs and command line interface can come in a later PR.

src/rasterstats/utils.py: 3 outdated review threads (resolved)
```python
warnings.warn('Value for `percent_cover_scale` given ({0}) '
              'was converted to int ({1}) but does not '
              'match original value'.format(
                  percent_cover_scale, int(percent_cover_scale)))
```
@perrygeo (Owner):

Can you explain why we need to limit to integers?

@sgoodm (Contributor, Author) replied:

The reshape performed in the rebin_sum function requires ints: https://docs.scipy.org/doc/numpy/reference/generated/numpy.reshape.html

```python
# resize-with-averaging-or-rebin-a-numpy-2d-array/8090605#8090605
def rebin_sum(a, shape, dtype):
    # Aggregate 2D array `a` down to `shape` by summing each block of cells.
    sh = shape[0], a.shape[0] // shape[0], shape[1], a.shape[1] // shape[1]
    return a.reshape(sh).sum(-1, dtype=dtype).sum(1, dtype=dtype)
```
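For context, a usage sketch showing how the block sum becomes a percent-cover estimate (rebin_sum as defined above; the scale factor and arrays below are made-up stand-ins):

```python
import numpy as np

scale = 10
fine = np.random.rand(30, 40) > 0.5   # stand-in for a fine-resolution rasterization
coarse_shape = (fine.shape[0] // scale, fine.shape[1] // scale)

# Each coarse cell sums a scale x scale block of 0/1 fine cells; dividing by
# scale**2 converts that sum into the fraction of the cell that is covered.
pct = rebin_sum(fine, coarse_shape, 'float32') / (scale ** 2)
```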
@perrygeo (Owner):

I haven't dug into this code, but why choose this implementation over other methods of resampling? Specifically, using Rasterio's resampling techniques would give us more control over the resampling method versus assuming sum: https://mapbox.github.io/rasterio/topics/resampling.html

"rebin" implies categorizing pixel values, I think "upsample" or similar would be a more accurate function name.

@sgoodm (Contributor, Author) replied Feb 5, 2018:

I don't think I looked into Rasterio's resampling methods, but I tested a couple of different implementations (one was a proof of concept you put together for Rasterio, rasterio/rasterio#232; another was a more generalized aggregation scheme I pulled from another project of mine, which had way too much overhead for what was needed here), and this method was a fair bit faster with less code. My main concern with any method here is minimizing the additional time/memory required when using this feature.

Did you have a use case in mind that would require using a method other than sum?

I am on board with renaming to something more accurate, I had just kept it similar to the original function from SO I used.

```python
            (masked.shape[1] - np.sum(masked.mask, axis=1))))
    else:
        feature_stats['mean'] = float(masked.mean())
```
@perrygeo (Owner):

The latitude correction and the percent cover add branching complexity to the code, and it detracts from readability a bit. Let's implement this with fewer if statements to test.

@sgoodm (Contributor, Author) replied:

Sounds good.

@dbaston commented Jun 20, 2018

Sorry if this is off-topic/spam, but I've been working on a similar project that I thought might be of interest. It's some C++ code, exposed as an R package, that computes percent coverage exactly (it's a vector-based calculation) but also extremely quickly (it's faster than most cell-center-based implementations). The approach is described here. It seems like it should be possible to wrap it up so that it can be called on numpy arrays.

@sgoodm (Contributor, Author) commented Aug 10, 2018

Hoping to have a chance to get back to this in the next month or two.

sgoodm added 3 commits August 31, 2018 13:24

…ic to separate util functions (could expand this to a class that covers all stat options later on)
@sgoodm (Contributor, Author) commented Aug 31, 2018

Finally had some time to come back to this:

  • stripped out the "latitude correction" stuff I had accidentally merged into the PR branch
  • kept the `like` arg in the new and existing rasterization functions
  • minimized main-function complexity by moving stat calculations related to percent cover to functions in utils

@perrygeo - let me know if you still have concerns regarding `percent_cover_scale` or the `rebin_sum` function (or anything else).

Also, I would definitely be interested in checking out a wrapper of what @dbaston has put together.

@sgoodm (Contributor, Author) commented Sep 4, 2018

Triggering CI rebuild.

@sgoodm closed this Sep 4, 2018
@sgoodm reopened this Sep 4, 2018
@perrygeo self-assigned this Sep 28, 2018
@jhamman commented Apr 24, 2019

Hi all - just checking on the status here. This seems to have stalled more than once and I'd like to see if we can help move it forward, if at all possible.

@akrherz commented Sep 23, 2019

I hate to do the annoying what's-the-status-of-this-PR thing, but here I am :) Is there some known big caveat to consider before attempting to use this change?

@perrygeo (Owner):

@akrherz I haven't tested it extensively, but I believe that, from a correctness-of-output perspective, you can rely on these changes. You could help by validating the results and posting here.

The remaining work concerns mitigating performance impacts, documentation, API changes, etc., before cutting a release. Nothing too significant, but I simply don't have the time to do the work. If you can use this branch instead of an official release, please do, and let us know how it works for you.

@davidabek1:
@sgoodm, I'm a geospatial noob, but before I saw your approach to solving weighted extraction, I saw @dbaston's exact approach, which differs from yours in that yours produces estimated weights. I took an idea from the second answer to this question: How does QGIS Zonal Statistics handle partially overlapping pixels? It is a simple approach: vectorize each pixel and intersect it with the overlay polygon; dividing the areas gives the fraction of the pixel that is covered, and dividing that by the sum of the fractions gives the weight of that pixel/poly part. I was wondering if this approach has flaws?

@sgoodm (Contributor, Author) commented Dec 23, 2019

@davidabek1 - vectorizing for exact intersections will not have any flaws in terms of accuracy, but it will be more costly in terms of computation. For a few small geometries, like in that Stack Exchange question, the additional computation probably won't matter to you, but if you want exact coverage for many large geometries, the cost of those vector-based intersections over many pixels adds up.
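For concreteness, a minimal sketch of that per-pixel vector intersection approach, assuming shapely and the affine package; the helper name `pixel_weights` is hypothetical:

```python
import numpy as np
from shapely.geometry import box

def pixel_weights(geom, affine, shape):
    # Exact fractional cover: intersect every pixel footprint with the polygon.
    # Accurate, but cost grows with the number of pixels.
    cover = np.zeros(shape)
    for row in range(shape[0]):
        for col in range(shape[1]):
            # Opposite corners of this pixel in map coordinates.
            x0, y0 = affine * (col, row)
            x1, y1 = affine * (col + 1, row + 1)
            pixel = box(min(x0, x1), min(y0, y1), max(x0, x1), max(y0, y1))
            cover[row, col] = pixel.intersection(geom).area / pixel.area
    return cover / cover.sum()  # normalize fractions into weights
```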

@dbaston commented Jan 22, 2020

@davidabek1 @sgoodm The approach of https://github.com/isciences/exactextract is equivalent to the Stack Exchange link; the difference is that exactextract does the vector-based intersections in a way that takes advantage of the fact that the raster isn't an arbitrary polygon - it's a grid. Done this way, there is no performance penalty for the exact calculation; in many cases it is faster because of all the avoided point-in-polygon tests.
