Group Discussion Questions (to answer) #8

Open
sheilasaia opened this issue Jun 11, 2019 · 4 comments

Comments

@sheilasaia
Collaborator

sheilasaia commented Jun 11, 2019

  1. What would be part of a report that you would like to see?
  2. How should it be implemented (e.g., boxplot, table, etc.)?
  3. Are there any constraints on these visualizations? What might break your report/visualization?
  4. What are some if-then decisions that need to be considered when producing data reports?
@sheilasaia
Collaborator Author

Group 3 Responses

  1. What would be part of a report that you would like to see?
  • list of variable names and variable types (e.g., character, numerical)
  • total number of records (rows) and variables (columns) (assuming data is "tidy")
  • indication of NA values
  • spatial and temporal plots (distributions, etc.)
  • map spatial and temporal NA values
  • spatial boundary box
  • range (max - min) of variable values
  • distribution of variable values
  • average temporal frequency stats (e.g., number of entries per day on average)
  • frequency of different variable categories (unique values)
  2. How should it be implemented (e.g., boxplot, table, etc.)?
  • spatial NA values - use a heatmap to show presence/absence of data (stronger color for more values)
  • frequency of variable categories - use inspectdf package function inspect_cat()?
  • histograms for variable value distributions
  • temporal NA values - select variable(s) and plot their points over time; axis ticks would have to be evenly spaced in time; multiple variables could be compared by stacking their points on top of each other (to see NA values)
  • spatial variables - show observations as points so the number of observations at a particular location is visible; stronger color means more observations
  • correlation (scatterplot) matrix between continuous variables using the pairs() function
  • table for summary stats (number of records, number of variables, list of variables, range of each variable, number of NA values, etc.)
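The summary-stats table in the last bullet could be sketched roughly as below. The thread points to R tools (inspectdf, pairs()); this sketch uses plain Python only to keep it dependency-free. The helper name `summarize_columns`, the sample rows, and the choice to treat only `None` as missing are assumptions for illustration, not something the group specified.

```python
def summarize_columns(rows):
    """Build a per-column summary for a list of dict records (hypothetical helper)."""
    columns = list(rows[0].keys()) if rows else []
    summary = {"n_records": len(rows), "n_variables": len(columns), "variables": {}}
    for col in columns:
        values = [row.get(col) for row in rows]
        # Assumption: None is the only missing-value marker at this stage.
        present = [v for v in values if v is not None]
        numeric = [
            v for v in present
            if isinstance(v, (int, float)) and not isinstance(v, bool)
        ]
        is_numeric = bool(present) and len(numeric) == len(present)
        info = {
            "type": "numerical" if is_numeric else "character",
            "n_missing": len(values) - len(present),
            "n_unique": len(set(present)),
        }
        if is_numeric:
            info["min"] = min(numeric)
            info["max"] = max(numeric)
            info["range"] = max(numeric) - min(numeric)
        summary["variables"][col] = info
    return summary


# Made-up example records.
rows = [
    {"site": "A", "temp_c": 12.5},
    {"site": "B", "temp_c": None},
    {"site": "A", "temp_c": 15.0},
]
report = summarize_columns(rows)
```

Each entry in `report["variables"]` maps to one row of the proposed table: variable name, type, number of NA values, number of unique values, and range where the variable is numerical.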
  3. Are there any constraints on these visualizations?
  • NA values might be defined differently between datasets AND columns (e.g., 999, missing, NaN, NA)
  • multiple missing value codes
  • is there a flagging column and how to deal with it?
  • date format inconsistencies (e.g., what time zone?)
  • spatial data inconsistencies or unknowns (e.g., where is the longitude column? what coordinate system is being used?)
  • inconsistencies between categories within a column (e.g., female vs Female)
  • how to know if something is a factor or not (or let the user choose how to override?)
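Two of the constraints above (missing-value codes that differ by dataset or column, and category labels that differ only in case) could be handled before any plotting with a cleaning pass along these lines. This is a minimal sketch in plain Python: `DEFAULT_NA_CODES`, `is_missing`, and `clean_column` are hypothetical names, and the default code list is an assumption that a real report would let the user override.

```python
import math

# Assumed default codes; real datasets would supply these per dataset or per column.
DEFAULT_NA_CODES = {"", "na", "nan", "missing", "999", "-999"}


def is_missing(value, na_codes=DEFAULT_NA_CODES):
    """Return True if value matches None, a float NaN, or one of the given codes."""
    if value is None:
        return True
    if isinstance(value, float) and math.isnan(value):
        return True
    # Codes are compared as lower-cased strings, so the integer 999 matches "999".
    # A float such as 999.0 prints as "999.0" and would need its own code.
    return str(value).strip().lower() in na_codes


def clean_column(values, na_codes=DEFAULT_NA_CODES, fold_case=True):
    """Replace missing-value codes with None and optionally lower-case strings."""
    cleaned = []
    for value in values:
        if is_missing(value, na_codes):
            cleaned.append(None)
        elif fold_case and isinstance(value, str):
            cleaned.append(value.strip().lower())
        else:
            cleaned.append(value)
    return cleaned


# Made-up example columns.
sex = clean_column(["female", "Female", "missing", "NA", "male"])
depth = clean_column([3.2, 999, float("nan"), 4.1], na_codes={"999"})
```

Passing `na_codes` per column reflects the point that codes can differ between columns; folding case is optional because some datasets may use case to distinguish categories.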

@clnsmth
Member

clnsmth commented Jun 11, 2019

Group 1 Responses

@clnsmth
Member

clnsmth commented Jun 11, 2019

@jhp7e
Collaborator

jhp7e commented Jun 11, 2019

General Group 2 outline

what
  • general title of dataset, DOI, links, etc. as header
  • units
  • in metadata
  • variable types
  • range - domain - levels
  • min, max, median, mode - R summary
  • number of valid observations
  • correlation matrix for numerical data
  • matrix
  • combined matrix or plot
  • limited by number of pairs
  • correlogram
  • histogram of numerical variables
  • frequency histogram of categorical variables
  • bivariate frequency crosstabulations for categorical variables
  • box plots by category for numerical variables
  • missing - NA
  • how many
  • complete cases
  • graph of where NAs are
  • scatter or line
  • X is date variables
  • check for strings that are really dates
  • bivariate numerical
  • need to be many on each page
  • lattice plots of each numerical variable by each categorical variable
  • location maps - if lat/lon are available
  • hard - issue semantics
  • appendix of R code used for plots

how
  • NA plot
  • visually dense plots
  • correlogram - correlation plot
  • need table of contents
  • try to keep basic information for a variable in one place

if
  • target around 10-20 pages
  • brief summary of overall
  • 2 pages per variable
  • do not do box plots if levels > n
  • correlation matrix limits
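The "if" rules in this outline (no box plots when a categorical variable has more than n levels, and a correlation matrix limited by the number of pairs) could be expressed as a small planning step that decides which plots go into the report. This is only a sketch in plain Python: the thresholds, the `plan_plots` name, and the plot labels are assumptions, since the outline leaves n and the matrix limit unspecified.

```python
from itertools import combinations

MAX_BOXPLOT_LEVELS = 12      # assumed value for the outline's unspecified "n"
MAX_CORRELATION_PAIRS = 45   # assumed cap on variable pairs in one matrix


def plan_plots(column_types, level_counts):
    """Return a list of plot specs, applying the if-then rules from the outline."""
    numeric = [c for c, t in column_types.items() if t == "numerical"]
    categorical = [c for c, t in column_types.items() if t == "categorical"]

    plan = []
    for col in numeric:
        plan.append(("histogram", col))
    for col in categorical:
        plan.append(("frequency_histogram", col))

    # Rule: do not draw box plots if the grouping variable has too many levels.
    for cat in categorical:
        if level_counts[cat] <= MAX_BOXPLOT_LEVELS:
            for num in numeric:
                plan.append(("boxplot", num, cat))

    # Rule: only draw the correlation matrix while the pair count stays readable.
    pairs = list(combinations(numeric, 2))
    if 0 < len(pairs) <= MAX_CORRELATION_PAIRS:
        plan.append(("correlation_matrix", tuple(numeric)))
    return plan


# Made-up example dataset description.
column_types = {
    "temp_c": "numerical",
    "depth_m": "numerical",
    "site": "categorical",
    "species": "categorical",
}
level_counts = {"site": 4, "species": 250}
plan = plan_plots(column_types, level_counts)
```

Here `species` still gets a frequency histogram but no box plots because it exceeds the assumed level limit, while `site` produces one box plot per numerical variable.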
