index_netcdf could warn for common fillvalue errors

`index_netcdf` uses `var_range()` [in nchelpers](https://github.com/pacificclimate/nchelpers/blob/master/nchelpers/__init__.py#L1035) to determine the range of a variable.

Sometimes processes output a netCDF file where the `_FillValue` attribute of a variable is not the Official Fill Attribute. This error is unfortunately common, but surprisingly hard to detect. For example, the header of an affected file looks perfectly normal in `ncdump`:

```
(venv) [lzeman@lynx lzeman]$ ncdump -h /storage/data/projects/comp_support/climate_explorer_data_prep/climatological_means/return_periods/all-canada/pr_RP5_annual_maximum_BCCAQv2+ANUSPLIN300_CanESM2_historical+rcp85_r1i1p1_1961-1990.nc 
netcdf pr_RP5_annual_maximum_BCCAQv2+ANUSPLIN300_CanESM2_historical+rcp85_r1i1p1_1961-1990 {
dimensions:
	lon = 1068 ;
	lat = 510 ;
	time = UNLIMITED ; // (1 currently)
	bnds = 2 ;
variables:
	double lon(lon) ;
	double lat(lat) ;
	double time(time) ;
	double time_bnds(time, bnds) ;
	float rp5pr(time, lat, lon) ;
		rp5pr:_FillValue = 1.e+20f ;
		rp5pr:long_name = "5-year annual maximum one day precipitation amount" ;
		rp5pr:standard_name = "rp5pr" ;
		rp5pr:cell_methods = "time: maximum" ;
		rp5pr:units = "mm day-1" ;
		rp5pr:missing_value = 1.e+20f ;

// global attributes:
}
```
Or use ncview to look at the file:
![ncview](https://user-images.githubusercontent.com/4512605/92043654-4bda4e80-ed31-11ea-807b-df6051dc7955.png)

You can even look at this file in Python:
```python
>>> from netCDF4 import Dataset
>>> data = Dataset("pr_RP5_annual_maximum_BCCAQv2+ANUSPLIN300_CanESM2_historical+rcp85_r1i1p1_1961-1990.nc")
>>> data.variables["rp5pr"]
<class 'netCDF4._netCDF4.Variable'>
float32 rp5pr(time, lat, lon)
    _FillValue: 1e+20
    long_name: 5-year annual maximum one day precipitation amount
    standard_name: rp5pr
    cell_methods: time: maximum
    units: mm day-1
    missing_value: 1e+20
unlimited dimensions: time
current shape = (1, 510, 1068)
filling on
>>> data.variables["rp5pr"][:]
masked_array(
  data=[[[--, --, --, ..., --, --, --],
         [--, --, --, ..., --, --, --],
         [--, --, --, ..., --, --, --],
         ...,
         [--, --, --, ..., --, --, --],
         [--, --, --, ..., --, --, --],
         [--, --, --, ..., --, --, --]]],
  mask=[[[ True,  True,  True, ...,  True,  True,  True],
         [ True,  True,  True, ...,  True,  True,  True],
         [ True,  True,  True, ...,  True,  True,  True],
         ...,
         [ True,  True,  True, ...,  True,  True,  True],
         [ True,  True,  True, ...,  True,  True,  True],
         [ True,  True,  True, ...,  True,  True,  True]]],
  fill_value=1e+20,
  dtype=float32)
```
And it all looks reasonable.

However, if you get the variable range using `var_range`, the value of the `_FillValue` attribute is included in the range:

```python
>>> from nchelpers import CFDataset
>>> data = CFDataset("pr_RP5_annual_maximum_BCCAQv2+ANUSPLIN300_CanESM2_historical+rcp85_r1i1p1_1961-1990.nc")
>>> data.var_range("rp5pr")
(7.149994, 1e+20)
```

So when this unfortunately-reasonable-looking file is indexed, the maximum variable value will be 1e+20, which was likely *intended* to be a fill value, judging from its presence in the `_FillValue` attribute.

This type of file error is quite hard to detect in advance, since it does not show up in any of the common netCDF-checking tools. It would be wonderful if `index_netcdf` would print a warning when both of the following are true:
* a variable has a `_FillValue` attribute, and
* the range of the variable, as returned by `var_range`, includes the value of the `_FillValue` attribute as either its minimum or its maximum.

That is a Bad Data Smell and whoever is indexing probably wants to know! Certainly would save me some headaches.
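
The check described above could be sketched roughly like this. This is only an illustration, not existing `index_netcdf` or `nchelpers` code: the function names `range_includes_fill_value` and `warn_if_fill_value_in_range` are hypothetical, and the relative tolerance is an assumption to allow for float32 vs. float64 rounding of values like `1e+20`.

```python
import math
import warnings


def range_includes_fill_value(var_range, fill_value, rel_tol=1e-6):
    """Return True if the min or max of var_range matches fill_value.

    var_range is a (min, max) tuple such as the one var_range() returns.
    fill_value is the variable's _FillValue, or None if it has none.
    rel_tol is an assumed tolerance, not something nchelpers defines.
    """
    if fill_value is None:
        return False
    return any(
        math.isclose(bound, fill_value, rel_tol=rel_tol) for bound in var_range
    )


def warn_if_fill_value_in_range(var_name, var_range, fill_value):
    """Hypothetical helper: warn when a range looks contaminated by fill values."""
    if range_includes_fill_value(var_range, fill_value):
        warnings.warn(
            f"Range {var_range} of variable '{var_name}' includes its "
            f"_FillValue ({fill_value}); the file may contain unmasked fill values."
        )
        return True
    return False


# Values taken from the example file in this issue.
warn_if_fill_value_in_range("rp5pr", (7.149994, 1e20), 1e20)
```

In the indexer, the inputs would presumably come from `data.var_range("rp5pr")` and from something like `getattr(data.variables["rp5pr"], "_FillValue", None)`, so that variables without the attribute are simply skipped.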
