index_netcdf could warn for common fillvalue errors #103

@corviday

index_netcdf uses var_range() from nchelpers to determine the range of a variable.

Sometimes processes output a netCDF file where the _FillValue attribute of a variable is not the Official Fill Attribute. This error is unfortunately common, but surprisingly hard to detect. For example, you can look at an affected file with ncdump:

(venv) [lzeman@lynx lzeman]$ ncdump -h /storage/data/projects/comp_support/climate_explorer_data_prep/climatological_means/return_periods/all-canada/pr_RP5_annual_maximum_BCCAQv2+ANUSPLIN300_CanESM2_historical+rcp85_r1i1p1_1961-1990.nc 
netcdf pr_RP5_annual_maximum_BCCAQv2+ANUSPLIN300_CanESM2_historical+rcp85_r1i1p1_1961-1990 {
dimensions:
	lon = 1068 ;
	lat = 510 ;
	time = UNLIMITED ; // (1 currently)
	bnds = 2 ;
variables:
	double lon(lon) ;
	double lat(lat) ;
	double time(time) ;
	double time_bnds(time, bnds) ;
	float rp5pr(time, lat, lon) ;
		rp5pr:_FillValue = 1.e+20f ;
		rp5pr:long_name = "5-year annual maximum one day precipitation amount" ;
		rp5pr:standard_name = "rp5pr" ;
		rp5pr:cell_methods = "time: maximum" ;
		rp5pr:units = "mm day-1" ;
		rp5pr:missing_value = 1.e+20f ;

// global attributes:
}

Or use ncview to look at the file:
[screenshot: ncview]

You can even look at this file in python:

>>> from netCDF4 import Dataset
>>> data = Dataset("pr_RP5_annual_maximum_BCCAQv2+ANUSPLIN300_CanESM2_historical+rcp85_r1i1p1_1961-1990.nc")
>>> data.variables["rp5pr"]
<class 'netCDF4._netCDF4.Variable'>
float32 rp5pr(time, lat, lon)
    _FillValue: 1e+20
    long_name: 5-year annual maximum one day precipitation amount
    standard_name: rp5pr
    cell_methods: time: maximum
    units: mm day-1
    missing_value: 1e+20
unlimited dimensions: time
current shape = (1, 510, 1068)
filling on
>>> data.variables["rp5pr"][:]
masked_array(
  data=[[[--, --, --, ..., --, --, --],
         [--, --, --, ..., --, --, --],
         [--, --, --, ..., --, --, --],
         ...,
         [--, --, --, ..., --, --, --],
         [--, --, --, ..., --, --, --],
         [--, --, --, ..., --, --, --]]],
  mask=[[[ True,  True,  True, ...,  True,  True,  True],
         [ True,  True,  True, ...,  True,  True,  True],
         [ True,  True,  True, ...,  True,  True,  True],
         ...,
         [ True,  True,  True, ...,  True,  True,  True],
         [ True,  True,  True, ...,  True,  True,  True],
         [ True,  True,  True, ...,  True,  True,  True]]],
  fill_value=1e+20,
  dtype=float32)

And it all looks reasonable.

However, if you get the variable range using var_range, the value of the _FillValue attribute will be included in the range:

>>> from nchelpers import CFDataset
>>> data = CFDataset("pr_RP5_annual_maximum_BCCAQv2+ANUSPLIN300_CanESM2_historical+rcp85_r1i1p1_1961-1990.nc")
>>> data.var_range("rp5pr")
(7.149994, 1e+20)

So when this unfortunately-reasonable-looking file is indexed, the maximum variable value will be 1e+20, which was likely intended to be a fill value, judging from its presence in the _FillValue attribute.

This type of file error is quite hard to detect in advance, since it does not show up in any of the common netCDF-checking tools. It would be wonderful if index_netcdf printed a warning when both of the following are true:

  • a variable has a _FillValue attribute, and
  • the range of the variable, as returned by var_range, includes the value of the _FillValue attribute as either its minimum or its maximum.

That is a Bad Data Smell and whoever is indexing probably wants to know! Certainly would save me some headaches.
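
The two conditions above could be checked with something like the sketch below. This is only an illustration, not existing index_netcdf or nchelpers code: the helper names (as_float32, fill_value_at_range_edge, warn_if_fill_in_range) and the rel_tol default are hypothetical. It compares with a tolerance rather than ==, on the assumption that a float32 fill value such as 1.e+20f may not be bit-identical to the Python float 1e+20.

```python
import math
import struct
import warnings


def as_float32(value):
    """Round a Python float to the nearest float32 (how a netCDF 'float' stores it)."""
    return struct.unpack("f", struct.pack("f", value))[0]


def fill_value_at_range_edge(var_min, var_max, fill_value, rel_tol=1e-6):
    """Return "min" or "max" if that end of the range matches fill_value, else None.

    Hypothetical helper. rel_tol is an assumed tolerance that absorbs
    float32/float64 rounding differences.
    """
    if fill_value is None:
        # The variable has no _FillValue attribute, so there is nothing to check.
        return None
    if math.isclose(var_max, fill_value, rel_tol=rel_tol):
        return "max"
    if math.isclose(var_min, fill_value, rel_tol=rel_tol):
        return "min"
    return None


def warn_if_fill_in_range(var_name, var_range, fill_value):
    """Emit a warning if the variable's range includes its fill value; return True if so."""
    edge = fill_value_at_range_edge(var_range[0], var_range[1], fill_value)
    if edge is not None:
        warnings.warn(
            f"{var_name}: {edge} of range {var_range} matches _FillValue "
            f"{fill_value}; the data may contain unmasked fill values"
        )
    return edge is not None


# The range reported in this issue, checked against a float32 fill value of 1e+20.
warn_if_fill_in_range("rp5pr", (7.149994, 1e+20), as_float32(1e+20))
```

In index_netcdf, the inputs would presumably come from data.var_range("rp5pr") and getattr(data.variables["rp5pr"], "_FillValue", None), but where the check belongs in the indexing flow is left open here.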
