Is my TidyDensity package a good fit here? #657

Open
2 of 21 tasks
spsanderson opened this issue Sep 11, 2024 · 18 comments

Comments

@spsanderson

spsanderson commented Sep 11, 2024

Submitting Author Name: Steven P. Sanderson II, MPH
Submitting Author Github Handle: @spsanderson
Other Package Authors Github handles: None
Repository: https://github.com/spsanderson/TidyDensity
Submission type: Pre-submission
Language: en


Package: TidyDensity
Title: Functions for Tidy Analysis and Generation of Random Data
Version: 1.5.0.9000
Authors@R: c(
    person("Steven","Sanderson", email = "[email protected]", role = c("aut","cre","cph"),
      comment = c(ORCID = "0009-0006-7661-8247")
      )
    )
Description: 
    To make it easy to generate random numbers based upon the underlying stats 
    distribution functions. All data is returned in a tidy and structured
    format making working with the data simple and straight forward. Given that the
    data is returned in a tidy 'tibble' it lends itself to working with the rest of the
    'tidyverse'.
License: MIT + file LICENSE
Encoding: UTF-8
Roxygen: list(markdown = TRUE)
RoxygenNote: 7.3.1
URL: https://github.com/spsanderson/TidyDensity
BugReports: https://github.com/spsanderson/TidyDensity/issues
Depends:
    R (>= 4.1.0)
Imports: 
    magrittr,
    rlang (>= 0.4.11),
    dplyr,
    ggplot2,
    plotly,
    tidyr,
    purrr,
    actuar,
    methods,
    stats,
    patchwork,
    survival,
    nloptr,
    broom,
    tidyselect,
    data.table,
    stringr
Suggests: 
    rmarkdown,
    knitr,
    EnvStats
VignetteBuilder: knitr

Scope

  • Please indicate which category or categories from our package fit policies or statistical package categories this package falls under. (Please check one or more appropriate boxes below):

    Data Lifecycle Packages

    • data retrieval
    • data extraction
    • data munging
    • data deposition
    • data validation and testing
    • workflow automation
    • version control
    • citation management and bibliometrics
    • scientific software wrappers
    • field and lab reproducibility tools
    • database software bindings
    • geospatial data
    • text analysis

    Statistical Packages

    • Bayesian and Monte Carlo Routines
    • Dimensionality Reduction, Clustering, and Unsupervised Learning
    • Machine Learning
    • Regression and Supervised Learning
    • Exploratory Data Analysis (EDA) and Summary Statistics
    • Spatial Analyses
    • Time Series Analyses
    • Probability Distributions
  • Explain how and why the package falls under these categories (briefly, 1-2 sentences). Please note any areas you are unsure of: I think this package fits here because it is designed to help people generate distributions and to see how distributions change in shape when their parameters change.

  • If submitting a statistical package, have you already incorporated documentation of standards into your code via the srr package? No I have not used srr package.

  • Who is the target audience and what are scientific applications of this package? Anyone that wants to see how distributions change with parameters (students) or anyone that is looking for parameter estimation etc.

  • Are there other R packages that accomplish the same thing? If so, how does yours differ or meet our criteria for best-in-category? There are packages like Distributional, actuar and EnvStats, but to my knowledge nothing out there that has as many parameter estimation tools and nothing that brings back all generators as tidyverse compliant tibbles with r, p, d, and q all set in one function.

  • (If applicable) Does your package comply with our guidance around Ethics, Data Privacy and Human Subjects Research? I don't collect any telemetry so I would say yes.

  • Any other questions or issues we should be aware of?: I don't really know.

@mpadge
Member

mpadge commented Sep 11, 2024

@spsanderson The stats team will discuss your query and get back to you asap. In regard to Probability Distributions standards, our stats dev-guide states:

Unlike most other categories of standards, packages which fit in this category will also generally be expected to fit into at least one other category of statistical software.

This would be the first package that goes against that general principle. That said, it is clearly well-developed, and looks like a significant contribution to probability distributions in general. We just have to decide whether we'll relax that condition for your submission.

@spsanderson
Author

spsanderson commented Sep 11, 2024

Thank you. And if relaxing the standards is not going to happen, then maybe some direction on how to make it fit the standard? As in, maybe it also fits EDA?

@mpadge
Member

mpadge commented Sep 11, 2024

Good point - our criteria for fit with stats submission is very easy: If you think you could comply with half or more of all EDA standards, then it "fits" in that category too. Could you have a quick scan through those standards and report back? Seems to me it should comply. Thanks!

@spsanderson
Author

@mpadge I will take a look now. I think it should fit EDA at face value, as there are plotting, parameter estimation, summary table, and AIC calculation functions.

@spsanderson
Author

@mpadge

Here is an example of why I think it may fit with EDA as well:

Plotting:

library(TidyDensity)
library(dplyr)

# Create a data frame with 100 random numbers
set.seed(123)
df_tbl <- tidy_normal(.n = 100, .num_sims = 5)

# Plot different aspects of the data
# The default plot is a density plot
tidy_autoplot(df_tbl)

image

# A qq plot
tidy_autoplot(df_tbl, .plot_type = "qq")

image

# A quantile plot
tidy_autoplot(df_tbl, .plot_type = "quantile")

image

# A probability plot
tidy_autoplot(df_tbl, .plot_type = "probability")

image

Summary Stats:

> tidy_distribution_summary_tbl(df_tbl)
# A tibble: 1 × 12
  mean_val median_val std_val min_val max_val skewness kurtosis range   iqr variance ci_low ci_high
     <dbl>      <dbl>   <dbl>   <dbl>   <dbl>    <dbl>    <dbl> <dbl> <dbl>    <dbl>  <dbl>   <dbl>
1   0.0346     0.0207   0.973   -2.66    3.24   0.0861     2.95  5.90  1.26    0.946  -1.77    2.00
> tidy_distribution_summary_tbl(df_tbl, sim_number)
# A tibble: 5 × 13
  sim_number mean_val median_val std_val min_val max_val skewness kurtosis range   iqr variance
  <fct>         <dbl>      <dbl>   <dbl>   <dbl>   <dbl>    <dbl>    <dbl> <dbl> <dbl>    <dbl>
1 1            0.0904    0.0618    0.913   -2.31    2.19   0.0605     2.84  4.50  1.19    0.833
2 2           -0.108    -0.226     0.967   -2.05    3.24   0.639      3.66  5.29  1.27    0.935
3 3            0.120     0.0359    0.950   -1.76    2.29   0.326      2.46  4.05  1.29    0.902
4 4           -0.0362   -0.00351   1.04    -2.47    2.57  -0.0312     2.74  5.04  1.42    1.08 
5 5            0.106     0.165     0.989   -2.66    2.40  -0.431      3.31  5.06  1.12    0.979
# ℹ 2 more variables: ci_low <dbl>, ci_high <dbl>

Bootstrapping the data:

> # Bootstrap plotting a single simulation of df_tbl
> tidy_bootstrap(filter(df_tbl, sim_number == 1)[["y"]])
# A tibble: 2,000 × 2
   sim_number bootstrap_samples
   <fct>      <list>           
 1 1          <dbl [80]>       
 2 2          <dbl [80]>       
 3 3          <dbl [80]>       
 4 4          <dbl [80]>       
 5 5          <dbl [80]>       
 6 6          <dbl [80]>       
 7 7          <dbl [80]>       
 8 8          <dbl [80]>       
 9 9          <dbl [80]>       
10 10         <dbl [80]>       
# ℹ 1,990 more rows
# ℹ Use `print(n = ...)` to see more rows
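
For readers unfamiliar with the idea, the resampling behind output like the above can be sketched in base R. This is only an illustration, not the internals of `tidy_bootstrap()`, which are not shown in this thread. It assumes draws with replacement at 80% of the input length (suggested by the `<dbl [80]>` entries for a 100-point input) and 2,000 resamples (suggested by the row count); the variable names are made up for the sketch.

```r
# Base-R bootstrap sketch (assumed behaviour, not TidyDensity's code).
set.seed(123)
y <- rnorm(100)                      # stand-in for one simulation's `y` column

num_sims <- 2000                     # number of bootstrap resamples
prop     <- 0.8                      # assumed share of the data drawn each time
n_draw   <- floor(length(y) * prop)  # 80 values per resample

# One list element per resample, each drawn with replacement
boot_samples <- replicate(
  num_sims,
  sample(y, size = n_draw, replace = TRUE),
  simplify = FALSE
)

# A statistic per resample, then a percentile interval for it
boot_means <- vapply(boot_samples, mean, numeric(1))
boot_ci    <- quantile(boot_means, probs = c(0.025, 0.975))
print(boot_ci)
```

The list of resamples plays the role of the `bootstrap_samples` list-column; any other statistic can be swapped in for `mean`.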

Bootstrap Plots

tidy_bootstrap(filter(df_tbl, sim_number == 1)[["y"]]) |>
  bootstrap_stat_plot(y)

image

tidy_bootstrap(filter(df_tbl, sim_number == 1)[["y"]], .num_sims = 100) |>
  bootstrap_stat_plot(y, .show_groups = TRUE)
Warning message:
Setting '.num_sims' to less than 2000 means that results can be potentially unstable. Consider
setting to 2000 or more. 

image

> tidy_bootstrap(filter(df_tbl, sim_number == 1)[["y"]], .num_sims = 100) |>
+   bootstrap_stat_plot(y, .show_groups = TRUE, .stat = "cmax")
Warning message:
Setting '.num_sims' to less than 2000 means that results can be potentially unstable. Consider
setting to 2000 or more. 

image

Stats about the distribution

> # Stats about the distribution
> # All distributions
> util_normal_stats_tbl(df_tbl) |>
+   glimpse()
Rows: 1
Columns: 17
$ tidy_function     <chr> "tidy_gaussian"
$ function_call     <chr> "Gaussian c(0, 1)"
$ distribution      <chr> "Gaussian"
$ distribution_type <chr> "continuous"
$ points            <dbl> 100
$ simulations       <dbl> 5
$ mean              <dbl> 0
$ median            <dbl> 0.02071715
$ mode              <dbl> 0
$ std_dv            <dbl> 1
$ coeff_var         <dbl> Inf
$ skewness          <dbl> 0
$ kurtosis          <dbl> 3
$ computed_std_skew <dbl> 0.0861201
$ computed_std_kurt <dbl> 2.953595
$ ci_lo             <dbl> -1.774248
$ ci_hi             <dbl> 1.99998
> # A single distribution
> util_normal_stats_tbl(df_tbl |> filter(sim_number == 2)) |>
+   glimpse()
Rows: 1
Columns: 17
$ tidy_function     <chr> "tidy_gaussian"
$ function_call     <chr> "Gaussian c(0, 1)"
$ distribution      <chr> "Gaussian"
$ distribution_type <chr> "continuous"
$ points            <dbl> 100
$ simulations       <dbl> 5
$ mean              <dbl> 0
$ median            <dbl> -0.22583
$ mode              <dbl> 0
$ std_dv            <dbl> 1
$ coeff_var         <dbl> Inf
$ skewness          <dbl> 0
$ kurtosis          <dbl> 3
$ computed_std_skew <dbl> 0.6387938
$ computed_std_kurt <dbl> 3.656954
$ ci_lo             <dbl> -1.610118
$ ci_hi             <dbl> 2.051234

Random walks of data and Viz

# Create and plot a random walk of the data
tidy_random_walk(df_tbl |> filter(sim_number == 1), .initial_value = 100, 
                 .value_type = "cum_prod") |>
  tidy_random_walk_autoplot()

image

tidy_random_walk(df_tbl, .initial_value = 100, .value_type = "cum_sum") |>
  tidy_random_walk_autoplot()

image
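
As a rough guide to what the two `.value_type` options plotted above could mean, here is a base-R sketch. It assumes that "cum_sum" adds the running total of the draws to the initial value and that "cum_prod" compounds the draws as period returns; the thread does not show how `tidy_random_walk()` actually computes either, so treat both formulas as assumptions.

```r
# Base-R random walk sketch (assumed definitions, not TidyDensity's code).
set.seed(123)
y <- rnorm(100, mean = 0, sd = 0.01)  # small steps so the compounded path stays readable

initial_value <- 100

# Additive walk: start value plus the running sum of the steps
walk_sum <- initial_value + cumsum(y)

# Multiplicative walk: start value compounded by (1 + step) at each period
walk_prod <- initial_value * cumprod(1 + y)

head(data.frame(step = seq_along(y), y = y, cum_sum = walk_sum, cum_prod = walk_prod))
```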

Getting AIC

> # Getting AIC of distribution
> util_normal_aic(df_tbl |> filter(sim_number == 1) |> pull(y))
[1] 268.5385
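
For context, an AIC for a normal fit can be reproduced by hand in base R as 2k minus twice the maximised log-likelihood, with k = 2 estimated parameters (mean and standard deviation). This sketch is not necessarily how `util_normal_aic()` arrives at its value; it uses the closed-form maximum-likelihood estimates and simulated data of its own.

```r
# Hand-computed AIC for a normal fit (illustrative, independent of TidyDensity).
set.seed(123)
y <- rnorm(100)

# Maximum-likelihood estimates: sample mean and the n-denominator standard deviation
mu_hat    <- mean(y)
sigma_hat <- sqrt(mean((y - mu_hat)^2))

# Maximised log-likelihood and AIC with k = 2 parameters
loglik <- sum(dnorm(y, mean = mu_hat, sd = sigma_hat, log = TRUE))
k      <- 2
aic    <- 2 * k - 2 * loglik
print(aic)

# Cross-check: an intercept-only linear model is the same normal fit
print(AIC(lm(y ~ 1)))
```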

@spsanderson
Author

@mpadge I just wanted to check back in and get your thoughts on whether I should also check off EDA.

Thank you :)

@mpadge
Member

mpadge commented Oct 23, 2024

@spsanderson Yes, I definitely think so. That category is "Exploratory Data Analysis and Summary Statistics", and your package certainly has a strong focus on that latter component. Seems like you'd easily tick off at least half of those standards, so please go ahead and document compliance with those. When you're done you can call @ropensci-review-bot check srr yourself to get a statement of compliance and we'll proceed from there. Thanks!

@spsanderson
Author

@mpadge thank you! I have checked it off and will now call the bot

@spsanderson
Author

@ropensci-review-bot

@ropensci-review-bot
Collaborator

I'm sorry human, I don't understand that. You can see what commands I support by typing:

@ropensci-review-bot help

@spsanderson
Author

@ropensci-review-bot check srr

@mpadge
Member

mpadge commented Oct 23, 2024

Sorry @spsanderson, I'll find time tomorrow to debug why the bot hasn't responded there

@ropensci-review-bot
Collaborator

This is not an 'srr' package

@spsanderson
Author

Hi @mpadge the bot says this is not an srr package, should I do something else? Thanks

@mpadge
Member

mpadge commented Oct 24, 2024

Yes, you should indeed. You'll need to document compliance with the statistical standards. The general procedure is described in https://stats-devguide.ropensci.org, and the specific tools for documenting compliance within your code are part of our srr package. (The bot response above just indicates that you've not yet done that.) Feel free to ask any questions you might have during the process.
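
For orientation: as far as I recall from the srr documentation, the package works through a roxygen2 roclet that is switched on in the `Roxygen` field of DESCRIPTION. A sketch of how that field might look for the DESCRIPTION posted at the top of this thread, assuming the roclet name `srr::srr_stats_roclet` (worth confirming against the current srr reference before relying on it):

```
Roxygen: list(markdown = TRUE, roclets = c("namespace", "rd", "srr::srr_stats_roclet"))
```

With the roclet enabled, the `@srrstats` tags in the code should be picked up whenever the documentation is rebuilt.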

@spsanderson
Author

Hi @mpadge I'm finally getting down to it! Would you be able to tell me if the below is what you're looking for? I'm pretty sure it is.

#' Bootstrap Density Tibble
#'
#' @family Bootstrap
#' @family Augment Function
#'
#' @author Steven P. Sanderson II, MPH
#'
#' @details This function takes as input the output of the `tidy_bootstrap()` or
#' `bootstrap_unnest_tbl()` and returns an augmented tibble that has the following
#' columns added to it: _`x`_, _`y`_, _`dx`_, and _`dy`_.
#'
#' It looks for an attribute that comes from using `tidy_bootstrap()` or
#' `bootstrap_unnest_tbl()` so it will not work unless the data comes from one of
#' those functions.
#'
#' @description Add density information to the output of `tidy_bootstrap()`, and
#' `bootstrap_unnest_tbl()`.
#'
#' @param .data The data that is passed from the `tidy_bootstrap()` or
#' `bootstrap_unnest_tbl()` functions.
#'
#' @examples
#' x <- mtcars$mpg
#'
#' tidy_bootstrap(x) |>
#'   bootstrap_density_augment()
#'
#' tidy_bootstrap(x) |>
#'   bootstrap_unnest_tbl() |>
#'   bootstrap_density_augment()
#'
#' @return
#' A tibble
#'
#' @srrstats {PD1.0}.
#' @export
#'

bootstrap_density_augment <- function(.data) {
  atb <- attributes(.data)

  # Checks
  if (!is.data.frame(.data)) {
    rlang::abort(
      message = "'.data' is expecting a data.frame/tibble. Please supply.",
      use_cli_format = TRUE
    )
  }

  if (!atb$tibble_type %in% c("tidy_bootstrap", "tidy_bootstrap_nested")) {
    rlang::abort(
      message = "Must pass data to this function from either tidy_bootstrap() or
      bootstrap_unnest_tbl().",
      use_cli_format = TRUE
    )
  }

  # Add density data
  if (atb$tibble_type == "tidy_bootstrap_nested") {
    df_tbl <- dplyr::as_tibble(.data) |>
      TidyDensity::bootstrap_unnest_tbl()
  }

  if (atb$tibble_type == "tidy_bootstrap") {
    df_tbl <- dplyr::as_tibble(.data)
  }

  df_tbl <- df_tbl |>
    dplyr::nest_by(sim_number) |>
    dplyr::mutate(dens_tbl = list(
      stats::density(unlist(data),
        n = nrow(data)
      )[c("x", "y")] |>
        purrr::set_names("dx", "dy") |>
        dplyr::as_tibble()
    )) |>
    tidyr::unnest(cols = c(data, dens_tbl)) |>
    dplyr::mutate(x = dplyr::row_number()) |>
    dplyr::ungroup() |>
    dplyr::select(sim_number, x, y, dx, dy, dplyr::everything())

  # Return
  attr(df_tbl, "tibble_type") <- "bootstrap_density"
  attr(df_tbl, "incoming_tibble_type") <- atb$tibble_type
  attr(df_tbl, ".num_sims") <- atb$.num_sims
  attr(df_tbl, "dist_with_params") <- atb$dist_with_params
  attr(df_tbl, "distribution_family_type") <- atb$distribution_family_type

  return(df_tbl)
}
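
A side note on the attribute check in the function above: when `.data` carries no `tibble_type` attribute, `atb$tibble_type` is `NULL`, the `%in%` comparison returns a zero-length logical, and `if ()` stops with "argument is of length zero" instead of the intended message. A minimal base-R sketch of a NULL-safe guard follows. The helper name `check_tibble_type()` is hypothetical and not part of TidyDensity, and `stop()` is used in place of `rlang::abort()` only to keep the sketch dependency-free.

```r
# Hypothetical helper (not part of TidyDensity): validate the `tibble_type`
# attribute without failing on a missing (NULL) value.
check_tibble_type <- function(.data,
                              allowed = c("tidy_bootstrap", "tidy_bootstrap_nested")) {
  if (!is.data.frame(.data)) {
    stop("'.data' is expecting a data.frame/tibble. Please supply.", call. = FALSE)
  }

  tibble_type <- attr(.data, "tibble_type", exact = TRUE)

  # Test for NULL (or an unexpected length) first, so the %in% test
  # only ever sees a single string.
  if (is.null(tibble_type) || length(tibble_type) != 1L || !tibble_type %in% allowed) {
    stop(
      "Must pass data to this function from either tidy_bootstrap() or bootstrap_unnest_tbl().",
      call. = FALSE
    )
  }

  invisible(tibble_type)
}

# A data frame without the attribute now produces the intended message
msg <- tryCatch(
  check_tibble_type(data.frame(y = 1:3)),
  error = function(e) conditionMessage(e)
)
print(msg)
```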

@mpadge
Member

mpadge commented Nov 21, 2024

@spsanderson That's getting there, but the idea is to write a statement of how you comply with each standard. A single sentence generally suffices. The compliance statements are intended to help reviewers instantly see how the code they're about to look at complies with the standard. Have a look at some of our other stats packages for examples.
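
To make the shape of such a statement concrete, here is a toy sketch of a tag followed by one sentence of explanation. The function `density_cols()` is hypothetical, and the sentence after `{PD1.0}` is placeholder wording rather than a reading of what PD1.0 actually requires; it would need to be rewritten against the text of the standard.

```r
# Hypothetical compliance wording below; check the text of PD1.0 before reusing it.

#' Add kernel density columns to a numeric vector (toy example)
#'
#' @param x A numeric vector with at least two values.
#' @return A data.frame with columns `x`, `y`, `dx`, and `dy`.
#'
#' @srrstats {PD1.0} Density values are computed with `stats::density()` using
#'   its default kernel and bandwidth, as described in the Details section.
#' @export
density_cols <- function(x) {
  stopifnot(is.numeric(x), length(x) >= 2)

  # Evaluate the density at as many points as there are observations
  dens <- stats::density(x, n = length(x))

  data.frame(x = seq_along(x), y = x, dx = dens$x, dy = dens$y)
}

res <- density_cols(mtcars$mpg)
head(res)
```

The point is only the pattern: the standard's identifier in braces, then a short sentence a reviewer can check against the code directly beneath it.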

@spsanderson
Author

@mpadge thank you! I'll work on that and then ask you to look again before I proceed with the other ~150+ functions.
