Long format (lvl 1) - datum groupings #60
Replies: 4 comments 1 reply
It might make sense to move the current read files from returning a generic 'long' format to the specific 'stacked', 'joined', or 'threeTable' formats described above.
This three table format is very centered on geolocation applications (like digital soil mapping) rather than observation-centered applications (like pedotransfer function development). Maybe call it 'geolocation' instead?
ok, assuming I am understanding (big IF)...the issue(s) are:
Which means there are no globally unique IDs; tabular data are primarily not stand alone, they require values from other tables to maintain identity/coherence/context. To further complicate matters, the semantics of Are there any other concepts, qualities, or observations with overloaded semantics? It seems that I'll see if I can work up a suitable example. More on this to come.
Also, the intractability of (2) for local machines could be an opportunity to test CyVerse, or HiPerGator, or similar.
This thread is intended to record discussion and decision points when designing the datum grouping in the long data format (lvl 1), which we use as an intermediate between the original data format (lvl 0) and the curated data product (lvl 2). The long data format currently has two main components: the id locating the datum grouping and the datum description. We will focus on the grouping in this thread.
Data are grouped together and often have a hierarchical relationship with other data. In relational databases this grouping is often represented by row and table relationships. One exception is that multiple rows within a table can also be assigned to groupings (e.g. treatment/control). This leads to our first decision point.
Decision point lvl0:1 I propose treating multi-row groupings as datums rather than ids.
Not included here are the datums that we will discuss further in other threads.
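As a rough illustration of decision point lvl0:1, the sketch below melts a small wide table into long records in two ways: once with the treatment grouping recorded as a datum, and once with it folded into the id. The table, the column names (`plot_id`, `treatment`, `soc_percent`), and the helper `to_long` are all hypothetical and not taken from any dataset discussed here. The example is in Python for illustration only.

```python
# Hypothetical wide table: two rows that belong to different treatment groups.
wide_rows = [
    {"plot_id": "P1", "treatment": "control", "soc_percent": "1.8"},
    {"plot_id": "P2", "treatment": "warmed", "soc_percent": "1.5"},
]


def to_long(rows, id_columns):
    """Melt wide rows into long records of (id columns..., variable, value)."""
    long_rows = []
    for row in rows:
        row_id = {col: row[col] for col in id_columns}
        for variable, value in row.items():
            if variable in id_columns:
                continue  # id columns locate the datum, they are not data
            long_rows.append({**row_id, "variable": variable, "value": value})
    return long_rows


# Proposed option: the treatment grouping is a datum attached to the row id.
as_datum = to_long(wide_rows, id_columns=["plot_id"])

# Alternative: the treatment grouping becomes part of the id.
as_id = to_long(wide_rows, id_columns=["plot_id", "treatment"])

for record in as_datum:
    print(record)
```

With the grouping kept as a datum, the id stays as small as the source table allows, and the grouping can still be recovered by filtering on `variable == "treatment"`.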
ISCN3
ISCN3 consists of four tables: `dataset`, `citation`, `profile`, and `layer`. Note that the ids are generally not unique, and the set of ids and foreign keys must be taken together to generate a unique row identifier. Also note that `dataset_name` is cross-referenced with `dataset_name_sub` and may be crossed with `dataset_name_soc`. `dataset_name_soc` may also refer to ISCN soil organic carbon stock gap filling.
Decision point ISCN:1 We consider `dataset_name_soc` to be a soil organic carbon method and part of a unique row identifier for the `profile`.

    erDiagram
        dataset ||--|{ citation : has
        dataset ||--o{ profile : has
        dataset ||--o{ layer : has
        profile ||--|{ layer : has
        dataset {
            id dataset_name
        }
        citation {
            dataset_id dataset_name
        }
        profile {
            dataset_id dataset_name_sub
            id dataset_name_soc
            id site_name
            id profile_name
        }
        layer {
            dataset_id dataset_name_sub
            profile_id dataset_name_soc
            profile_id site_name
            profile_id profile_name
            id layer_name
        }

Intermediate data Lvl 1
The intermediate data model for this project has oscillated between three structures:
1. Stacked
This unjoined stack option is, in some ways, a non-option. It preserves the original table structures most closely. This moves a lot of the data manipulation work into the curation phase and maintains any data normalization (i.e. avoidance of repeated data) done in the original study.
However, it removes some of the advantage of a more unified data model and makes the lvl1 data much more difficult to work with when creating subsequent data products.
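One possible reading of the stacked option, sketched under assumptions: each source table is melted to long records and the tables are stacked without joining, with the row identifier built from the full set of key columns. The key column names follow the ISCN3 diagram earlier in this thread; the values, the extra columns (`lat`, `layer_top`), and the helpers `make_row` and `stack` are made up for illustration, and Python is used only as a neutral sketching language.

```python
# Key columns per table, following the ISCN3 diagram in this thread.
KEY_COLUMNS = {
    "profile": ("dataset_name_sub", "dataset_name_soc", "site_name",
                "profile_name"),
    "layer": ("dataset_name_sub", "dataset_name_soc", "site_name",
              "profile_name", "layer_name"),
}


def make_row(profile_name, **values):
    """Build a row with hypothetical ISCN3-style key values."""
    return {"dataset_name_sub": "DS_A", "dataset_name_soc": "DS_A",
            "site_name": "S1", "profile_name": profile_name, **values}


# Two profiles that each contain a layer named "L1" (made-up values).
tables = {
    "profile": [make_row("P1", lat="45.0"), make_row("P2", lat="45.1")],
    "layer": [make_row("P1", layer_name="L1", layer_top="0"),
              make_row("P2", layer_name="L1", layer_top="0")],
}


def stack(tables, key_columns):
    """Melt every table to long records and stack them without joining."""
    stacked = []
    for table_name, rows in tables.items():
        keys = key_columns[table_name]
        for row in rows:
            row_id = tuple(row[k] for k in keys)  # composite row identifier
            for variable, value in row.items():
                if variable not in keys:
                    stacked.append({"table": table_name, "row_id": row_id,
                                    "variable": variable, "value": value})
    return stacked


stacked = stack(tables, KEY_COLUMNS)

# layer_name alone collides across profiles; the composite id does not.
layer_names = {r["row_id"][-1] for r in stacked if r["table"] == "layer"}
row_ids = {(r["table"], r["row_id"]) for r in stacked}
print(len(layer_names), len(row_ids))
```

Nothing is repeated across tables here, which is the normalization the stacked option preserves, but any cross-table question (for example, the latitude of a layer) still needs a join during curation.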
2. Joined
The joined stack option is probably the most idealized option. Each datum has a complete association with the ids; for example, a citation associated with multiple layers would be repeated for each of those layers. This makes it fairly easy to work with as a data table using `dplyr::filter` and `unique`, but results in a data object with a very high memory demand. This rapidly becomes intractable for almost all larger survey data on most desktop computers.
3. Three table
The three table option claims that we have some underlying understanding of the datum: it comes from somewhere and has `provenance`, and observations are associated with some geolocation, possibly with a time element, on the `surface` of the Earth, some of which are also associated with a specific `layer` with a depth interval. This places the highest burden on the level 1 coding since all datums need to be associated with these three tables.

    erDiagram
        provenance ||--o{ site : has
        provenance ||--o{ layer : has
        site ||--|{ layer : has
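The memory trade-off between the joined and three table options can be sketched with a toy example. The three tables below mirror the provenance/site/layer relationships in the diagram, but every column name and value is hypothetical, and the cell count is only a crude stand-in for memory use. Python is used for illustration.

```python
# Hypothetical three table data: one citation, one site, three layers.
provenance = {"cite_1": {"citation": "Hypothetical et al. 2020"}}
site = {"site_1": {"provenance_id": "cite_1", "lat": 45.0, "lon": -93.0}}
layer = [
    {"layer_id": "lyr_1", "site_id": "site_1", "provenance_id": "cite_1",
     "top_cm": 0, "bottom_cm": 10},
    {"layer_id": "lyr_2", "site_id": "site_1", "provenance_id": "cite_1",
     "top_cm": 10, "bottom_cm": 30},
    {"layer_id": "lyr_3", "site_id": "site_1", "provenance_id": "cite_1",
     "top_cm": 30, "bottom_cm": 60},
]


def join_layers(provenance, site, layer):
    """Attach the site and provenance values to every layer row."""
    joined = []
    for row in layer:
        site_row = site[row["site_id"]]
        prov_row = provenance[row["provenance_id"]]
        joined.append({**row, **site_row, **prov_row})
    return joined


def count_cells(rows):
    """Crude size measure: number of stored values (lookup keys ignored)."""
    return sum(len(row) for row in rows)


joined = join_layers(provenance, site, layer)

three_table_cells = (count_cells(provenance.values())
                     + count_cells(site.values())
                     + count_cells(layer))
joined_cells = count_cells(joined)
print(three_table_cells, joined_cells)
```

In the joined form the citation and coordinates are stored once per layer, so the overhead grows with the number of layers per site, while the three table form stores them once and pays for it with a lookup at read time.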