For candidate datasets, determine appropriate steps for cleaning and preparing data. We don't want too many shortcuts here (i.e. blindly aggregating with no justification for this), it would be better to think through the data generating process for each dataset #5
Comments
Hi Folks.
Hi @AdamCSmithCWS, this is a good question that deserves some thought. On the one hand, I'd prefer to add zeros for all missing combinations: we would expect a species to be detectable in another region if it occurred there, so the fact that survey teams didn't record it should count as a "true" zero. I haven't tested this explicitly, but my gut instinct is that these zeros are very useful for estimating variation in species' spatial fields. On the other hand, the additional complexity is enormous, as you have recognised. In the latest model I've tried, where I ended up with 83 species over the 37 BCRs, I have just over 89,000 observations. This model takes ~50 minutes to complete on my brand new i9 processor, so it isn't difficult to imagine that we'll run into models that can't be fitted on personal machines. Perhaps it would be better to use some sort of threshold for which species to include? But I'm happy to try different things.
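To make the zero-filling idea concrete, here is a minimal sketch in plain Python (the real workflow would presumably be in R). It assumes a zero is only added where a region was actually surveyed in a given year; that rule, the function name `zero_fill`, and the species and BCR labels are all hypothetical and only there for illustration.

```python
from itertools import product


def zero_fill(observed, species, surveyed):
    """Return a count for every species in every surveyed (region, year).

    observed: dict mapping (species, region, year) -> recorded count.
    surveyed: set of (region, year) pairs where surveys took place
              (assumption: a missing record is a true zero only there).
    """
    filled = {}
    for sp, (region, year) in product(species, sorted(surveyed)):
        key = (sp, region, year)
        filled[key] = observed.get(key, 0)  # missing combination -> 0
    return filled


# Toy data with made-up species codes and region labels.
observed = {
    ("AMRO", "BCR12", 2019): 14,
    ("AMRO", "BCR13", 2019): 9,
    ("BOBO", "BCR12", 2019): 3,
}
species = ["AMRO", "BOBO"]
surveyed = {("BCR12", 2019), ("BCR13", 2019), ("BCR12", 2020)}

filled = zero_fill(observed, species, surveyed)
n_zeros = sum(1 for count in filled.values() if count == 0)
print(len(filled), n_zeros)
```

With 83 species and 37 BCRs, full zero-filling gives 83 × 37 = 3,071 species-by-region combinations per survey year, which is where the growth in dataset size comes from.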
Yes, I think you're right: the zeros (and yes, I agree, they are true zeros given the BBS field methods) would help estimate the species' spatial fields. If helpful, I can draft up some data-preparation scripts to support these alternatives.
There are two potentially important components of variation in the BBS data that we may want to consider including:
Both of these are known to affect BBS counts. The observer variation is clearly nuisance variation, while the route variation could be considered nuisance (sampling noise reflecting random variation among relatively few routes within a bird conservation region) or biological (variation in abundance within relatively large strata that captures variation in landcover, elevation, and latitude). Both sources of variation will also vary in time, with turnover in the observer pool and variation in which routes get surveyed in a given year. So, if route and observer variation need to be ignored for practical reasons (in favour of aggregated counts across all routes and observers), I think we could argue it's important to demonstrate the phylogenetic and trait-based patterns. But I think it will limit the immediate applicability of these models, at least for many of the species status assessment uses of the BBS.
Thanks @AdamCSmithCWS. Yes, I agree completely, and in fact my early attempts used route-level counts so I could incorporate both of these sources. But as you alluded to, the size of the dataset and the complexity of the model became outright impossible for me to handle. At present my process is to aggregate at the polygon level but to use the number of routes that provided information for that polygon in that year to form an offset. This of course treats all routes equally and ignores observer effects entirely, which is a shame. We can certainly argue our way out of it, though we might not want to. One option that deserves some thought is to chunk the data into well-defined spatial units and fit separate models within each unit, allowing us to use models that do go into point-level detail. This would allow us to use geostatistical spatial models and might make more sense anyway, given that these are the scales where we would expect phylogenetic (and especially functional) relationships to be particularly important. The question is whether we would lose some important information, i.e. if species 1 has a broad range and is declining in areas outside of our defined unit, we'd lose that context when estimating that species' trend within the unit. But maybe that doesn't matter if the units are big enough and we have enough data? There are always endless compromises!
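For anyone following along, the polygon-level aggregation with a route-count offset described above could look roughly like this. It is only a sketch in plain Python, not the actual script: the row layout, the function name `aggregate_with_offset`, and the use of log(number of routes) as the offset (the usual choice for a log-link count model) are assumptions.

```python
import math
from collections import defaultdict


def aggregate_with_offset(route_rows):
    """Sum route-level counts per (species, polygon, year).

    route_rows: iterable of (species, polygon, year, route_id, count).
    The offset uses every route that reported in a polygon-year,
    regardless of which species it recorded (assumption).
    """
    totals = defaultdict(int)
    routes = defaultdict(set)  # (polygon, year) -> route ids seen
    for sp, polygon, year, route_id, count in route_rows:
        totals[(sp, polygon, year)] += count
        routes[(polygon, year)].add(route_id)

    result = {}
    for (sp, polygon, year), total in totals.items():
        n_routes = len(routes[(polygon, year)])
        result[(sp, polygon, year)] = {
            "count": total,
            "n_routes": n_routes,
            "log_offset": math.log(n_routes),  # all routes weighted equally
        }
    return result


# Toy rows with made-up species codes, polygons and route ids.
rows = [
    ("AMRO", "BCR12", 2019, "r1", 10),
    ("AMRO", "BCR12", 2019, "r2", 4),
    ("BOBO", "BCR12", 2019, "r3", 3),
    ("AMRO", "BCR13", 2019, "r4", 9),
]
agg = aggregate_with_offset(rows)
print(agg[("AMRO", "BCR12", 2019)])
```

Because the route ids are discarded after aggregation, any route-level or observer-level effect is unrecoverable from the aggregated table, which is the trade-off discussed in this thread.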
Interesting discussions here! The full BBS CSV file that contains the 50-stop counts includes (what I also agree can be referred to as) "true zeros." One thing I have done in the past was to include the route variation (at least), as I believe that would have ecological implications. So, to (hopefully) reduce the size of this dataset, I have some potential suggestions too (in addition to the stratification plan, which is very good, btw).
If we do all that, we will have one row for each species, route ID, year and count, which should be < 10 million rows! On a last note (regarding computing facilities), I wonder how fast and effective our analysis would be if we decided to run it in the cloud. We could set up an instance on Amazon EC2 with Tensor Core GPUs (e.g., Amazon EC2 P3), which I believe would be much faster, or even use OCI. Of course, we'd still use R and would only have to transfer everything to the cloud for efficiency.
It's probably not the first option for developing the models, but it might be something to consider for later stages. Another potential dataset, although private, is the Swiss breeding bird survey. The data are collected within the MHB (the Swiss common breeding bird survey), which covers 267 sampling sites (1 km² cells) every year. See here for an example where we developed dynamic occupancy models to evaluate range dynamics.
Thanks @guifandos. Yes, I've seen a lot of the work that Marc Kéry has done with the Swiss breeding bird survey. It certainly seems like a suitable dataset to explore. Would you know what procedures we'd need to go through to gain access to the data?
I wonder if the zeros dilemma links back to this issue focussed on the theoretical predictions. I can't think of any real examples of this in birds, but I could imagine that a given species may only exist in a location if a competing species is absent. For instance, in the UK, the only places we see native red squirrels are locations with very low densities of grey squirrels. So the zeros could be very informative across space/phylo. 91 million observations sounds nasty though...