Description
index_netcdf uses var_range() from nchelpers to determine the range of a variable.
Sometimes processes output a netCDF file where the _FillValue attribute of a variable is not the Official Fill Attribute. This error is unfortunately common, but surprisingly hard to detect. For example, you can look at an affected file with ncdump:
(venv) [lzeman@lynx lzeman]$ ncdump -h /storage/data/projects/comp_support/climate_explorer_data_prep/climatological_means/return_periods/all-canada/pr_RP5_annual_maximum_BCCAQv2+ANUSPLIN300_CanESM2_historical+rcp85_r1i1p1_1961-1990.nc
netcdf pr_RP5_annual_maximum_BCCAQv2+ANUSPLIN300_CanESM2_historical+rcp85_r1i1p1_1961-1990 {
dimensions:
lon = 1068 ;
lat = 510 ;
time = UNLIMITED ; // (1 currently)
bnds = 2 ;
variables:
double lon(lon) ;
double lat(lat) ;
double time(time) ;
double time_bnds(time, bnds) ;
float rp5pr(time, lat, lon) ;
rp5pr:_FillValue = 1.e+20f ;
rp5pr:long_name = "5-year annual maximum one day precipitation amount" ;
rp5pr:standard_name = "rp5pr" ;
rp5pr:cell_methods = "time: maximum" ;
rp5pr:units = "mm day-1" ;
rp5pr:missing_value = 1.e+20f ;
// global attributes:
}
Or use ncview to look at the file:

You can even look at this file in Python:
>>> from netCDF4 import Dataset
>>> data = Dataset("pr_RP5_annual_maximum_BCCAQv2+ANUSPLIN300_CanESM2_historical+rcp85_r1i1p1_1961-1990.nc")
>>> data.variables["rp5pr"]
<class 'netCDF4._netCDF4.Variable'>
float32 rp5pr(time, lat, lon)
_FillValue: 1e+20
long_name: 5-year annual maximum one day precipitation amount
standard_name: rp5pr
cell_methods: time: maximum
units: mm day-1
missing_value: 1e+20
unlimited dimensions: time
current shape = (1, 510, 1068)
filling on
>>> data.variables["rp5pr"][:]
masked_array(
data=[[[--, --, --, ..., --, --, --],
[--, --, --, ..., --, --, --],
[--, --, --, ..., --, --, --],
...,
[--, --, --, ..., --, --, --],
[--, --, --, ..., --, --, --],
[--, --, --, ..., --, --, --]]],
mask=[[[ True, True, True, ..., True, True, True],
[ True, True, True, ..., True, True, True],
[ True, True, True, ..., True, True, True],
...,
[ True, True, True, ..., True, True, True],
[ True, True, True, ..., True, True, True],
[ True, True, True, ..., True, True, True]]],
fill_value=1e+20,
dtype=float32)
and all looks reasonable.
However, if you get the variable range using var_range, the value of the _FillValue attribute will be included in the range:
>>> from nchelpers import CFDataset
>>> data = CFDataset("pr_RP5_annual_maximum_BCCAQv2+ANUSPLIN300_CanESM2_historical+rcp85_r1i1p1_1961-1990.nc")
>>> data.var_range("rp5pr")
(7.149994, 1e+20)
So when this unfortunately-reasonable-looking file is indexed, the maximum variable value will be 1e+20, which was likely intended to be a fill value, judging from its presence in the _FillValue attribute.
This type of file error is quite hard to detect in advance, since it does not show up in any of the common netCDF-checking tools. It would be wonderful if index_netcdf printed a warning when both of the following are true:
- a variable has a _FillValue attribute, and
- the range of the variable, as returned by var_range, includes the value of the _FillValue attribute as either its minimum or its maximum.
That is a Bad Data Smell and whoever is indexing probably wants to know! Certainly would save me some headaches.
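A minimal sketch of the check requested above, in case it helps frame the discussion. The function names `range_includes_fill_value` and `warn_if_fill_in_range` are hypothetical and not part of index_netcdf or nchelpers. The comparison uses a relative tolerance because a float32 `1e+20` is not bit-identical to the Python float `1e+20`; the tolerance value is an assumption.

```python
import math
import warnings


def _same_value(a, b, rel_tol):
    """Compare two numbers, treating NaN as equal to NaN."""
    a, b = float(a), float(b)
    if math.isnan(a) or math.isnan(b):
        return math.isnan(a) and math.isnan(b)
    return math.isclose(a, b, rel_tol=rel_tol)


def range_includes_fill_value(var_range, fill_value, rel_tol=1e-6):
    """Return True if the minimum or maximum of var_range matches fill_value.

    var_range is a (min, max) pair, like the tuple var_range returns in the
    session above. fill_value is the variable's _FillValue, or None if the
    variable has no such attribute.
    """
    if fill_value is None:
        return False
    low, high = var_range
    return (_same_value(low, fill_value, rel_tol)
            or _same_value(high, fill_value, rel_tol))


def warn_if_fill_in_range(var_name, var_range, fill_value):
    """Emit a warning when the range looks contaminated by the fill value."""
    suspicious = range_includes_fill_value(var_range, fill_value)
    if suspicious:
        warnings.warn(
            f"Range of variable '{var_name}' is {var_range}, which includes "
            f"its _FillValue ({fill_value}); the file may contain unmasked "
            "fill values."
        )
    return suspicious
```

In the indexer this could presumably be called with `data.var_range("rp5pr")` and `getattr(data.variables["rp5pr"], "_FillValue", None)`; for the file above, where the range is `(7.149994, 1e+20)` and the `_FillValue` is `1e+20`, it would warn.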