| 1 | +# JSON Schema Profile |
| 2 | + |
| 3 | +The goal of JSON Schema Profile is to augment the vocabulary of [JSON Schema](http://json-schema.org/) so that a schema can describe properties of the observed data rather than only its structure. |
| 4 | + |
| 5 | +## Definitions |
| 6 | +### Bloom filter |
| 7 | +A Bloom filter is stored as a string containing a serialized Bloom filter. Currently, this is the Base64-encoded serialization of the specific Bloom filter class used by [JSONoid](https://github.com/dataunitylab/jsonoid-discovery), but we plan to move to a more reusable format. |
| 8 | + |
| 9 | +Bloom filters are useful for checking whether a specific value was observed for a particular property without needing to store every value. |
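
To illustrate the kind of membership check described above, here is a minimal Python sketch of a Bloom filter. It is not the JSONoid implementation and does not read or write JSONoid's serialized format; the class name `BloomFilter`, the bit-array size, and the SHA-256-based hashing are all assumptions chosen for illustration.

```python
import hashlib


class BloomFilter:
    """A tiny Bloom filter over a fixed-size bit array (illustrative only)."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, value):
        # Derive num_hashes bit positions by salting a SHA-256 digest.
        data = repr(value).encode("utf-8")
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(bytes([seed]) + data).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, value):
        for pos in self._positions(value):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, value):
        # False means "definitely not observed"; True means "possibly observed".
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(value)
        )


# Hypothetical values observed for a string property.
observed = BloomFilter()
for v in ["alice", "bob", "carol"]:
    observed.add(v)
```

A value that was added always reports as possibly present. A value that was never added usually reports as absent, but false positives are possible, which is the trade-off for not storing the values themselves.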
| 10 | + |
| 11 | +### Histogram |
| 12 | +property | description |
| 13 | +:-- | :-- |
| 14 | +`bins` | An array of two-element arrays where the first element is the mean of the bin and the second is the number of elements in the bin |
| 15 | +`hasExtremeValues` | A Boolean indicating whether the histogram contains values which cannot be represented in the given bounds. This usually only occurs for extremely large absolute values and is rarely observed in practice |
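
As a rough sketch of how the `bins` array might be consumed, the Python snippet below recovers the total count and an approximate mean from `[bin mean, count]` pairs. The function name `summarize_bins` and the sample data are hypothetical and not part of the profile.

```python
def summarize_bins(bins):
    """Return (total count, approximate mean) for [[bin_mean, count], ...]."""
    total = sum(count for _, count in bins)
    if total == 0:
        return 0, None
    # Weight each bin mean by the number of elements in that bin.
    approx_mean = sum(mean * count for mean, count in bins) / total
    return total, approx_mean


# Hypothetical histogram with three bins.
example_bins = [[1.0, 4], [5.0, 10], [10.0, 6]]
total, approx_mean = summarize_bins(example_bins)
```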
| 16 | + |
| 17 | +### Statistics |
| 18 | +property | description |
| 19 | +:-- | :-- |
| 20 | +`variance` | The variance of all values of this property |
| 21 | +`stdev` | The standard deviation of all values of this property |
| 22 | +`skewness` | The skewness of all values of this property |
| 23 | +`kurtosis` | The kurtosis of all values of this property |
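
For reference, the sketch below computes these four quantities in Python from central moments. It assumes population moments (dividing by the number of values) and non-excess kurtosis; the profile does not state which conventions JSONoid uses, so treat both choices, and the function name `describe`, as assumptions.

```python
import math


def describe(values):
    """Compute variance, stdev, skewness, and kurtosis from central moments."""
    n = len(values)
    mean = sum(values) / n
    # Central moments about the mean (population form, dividing by n).
    m2 = sum((x - mean) ** 2 for x in values) / n
    m3 = sum((x - mean) ** 3 for x in values) / n
    m4 = sum((x - mean) ** 4 for x in values) / n
    # Skewness and kurtosis are undefined when all values are equal (m2 == 0).
    return {
        "variance": m2,
        "stdev": math.sqrt(m2),
        "skewness": m3 / m2 ** 1.5,
        "kurtosis": m4 / m2 ** 2,
    }


# Hypothetical observed values for a numeric property.
stats = describe([2, 4, 4, 4, 5, 5, 7, 9])
```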
| 24 | + |
| 25 | +## Arrays |
| 26 | +property | description |
| 27 | +:-- |:-- |
| 28 | +`lengthHistogram` | A [histogram](#histogram) of array lengths |
| 29 | + |
| 30 | +## Booleans |
| 31 | + |
| 32 | +property | description |
| 33 | +:-- |:-- |
| 34 | +`pctTrue` | Percentage of the Boolean values which are `true` |
| 35 | + |
| 36 | +## Integers |
| 37 | + |
| 38 | +property | description |
| 39 | +:-- | :-- |
| 40 | +`bloomFilter` | A [Bloom filter](#bloom-filter) of integer values |
| 41 | +`distinctValues` | An estimate of the number of distinct values of this property |
| 42 | +`histogram` | A [histogram](#histogram) of integer values |
| 43 | +`statistics` | A set of [statistics](#statistics) of integer values |
| 44 | + |
| 45 | +## Numbers |
| 46 | + |
| 47 | +property | description |
| 48 | +:-- | :-- |
| 49 | +`bloomFilter` | A [Bloom filter](#bloom-filter) of number values |
| 50 | +`distinctValues` | An estimate of the number of distinct values of this property |
| 51 | +`histogram` | A [histogram](#histogram) of number values |
| 52 | +`statistics` | A set of [statistics](#statistics) of number values |
| 53 | + |
| 54 | +## Objects |
| 55 | + |
| 56 | +property | description |
| 57 | +:-- | :-- |
| 58 | +`fieldPresence` | An object that maps each key to the percentage of observed objects in which that key appears |
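
A minimal Python sketch of how such a mapping could be derived from a collection of objects follows. The function name `field_presence` and the sample documents are hypothetical, and the values are expressed as fractions between 0 and 1, which is an assumption since the scale is not specified here.

```python
from collections import Counter


def field_presence(objects):
    """Map each key to the fraction of objects in which it appears."""
    if not objects:
        return {}
    # Each dict contributes each of its keys at most once.
    counts = Counter(key for obj in objects for key in obj)
    total = len(objects)
    return {key: count / total for key, count in counts.items()}


# Hypothetical documents with optional fields.
docs = [
    {"id": 1, "name": "a"},
    {"id": 2},
    {"id": 3, "name": "c"},
    {"id": 4, "email": "x@example.com"},
]
presence = field_presence(docs)
```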
| 59 | + |
| 60 | +## Strings |
| 61 | +property | description |
| 62 | +:-- |:-- |
| 63 | +`bloomFilter` | A [Bloom filter](#bloom-filter) of string values |
| 64 | +`distinctValues` | An estimate of the number of distinct values of this property |
| 65 | +`lengthHistogram` | A [histogram](#histogram) of string lengths |