Is my TidyDensity package a good fit here? #657

spsanderson · 2024-09-11T15:33:30Z

Submitting Author Name: Steven P. Sanderson II, MPH
Submitting Author Github Handle: @spsanderson
Other Package Authors Github handles: None
Repository: https://github.com/spsanderson/TidyDensity
Submission type: Pre-submission
Language: en

Package: TidyDensity
Title: Functions for Tidy Analysis and Generation of Random Data
Version: 1.5.0.9000
Authors@R: c(
    person("Steven","Sanderson", email = "[email protected]", role = c("aut","cre","cph"),
      comment = c(ORCID = "0009-0006-7661-8247")
      )
    )
Description: 
    To make it easy to generate random numbers based upon the underlying stats 
    distribution functions. All data is returned in a tidy and structured
    format making working with the data simple and straight forward. Given that the
    data is returned in a tidy 'tibble' it lends itself to working with the rest of the
    'tidyverse'.
License: MIT + file LICENSE
Encoding: UTF-8
Roxygen: list(markdown = TRUE)
RoxygenNote: 7.3.1
URL: https://github.com/spsanderson/TidyDensity
BugReports: https://github.com/spsanderson/TidyDensity/issues
Depends:
    R (>= 4.1.0)
Imports: 
    magrittr,
    rlang (>= 0.4.11),
    dplyr,
    ggplot2,
    plotly,
    tidyr,
    purrr,
    actuar,
    methods,
    stats,
    patchwork,
    survival,
    nloptr,
    broom,
    tidyselect,
    data.table,
    stringr
Suggests: 
    rmarkdown,
    knitr,
    EnvStats
VignetteBuilder: knitr

Scope

Please indicate which category or categories from our package fit policies or statistical package categories this package falls under. (Please check one or more appropriate boxes below):

Data Lifecycle Packages
- data retrieval
- data extraction
- data munging
- data deposition
- data validation and testing
- workflow automation
- version control
- citation management and bibliometrics
- scientific software wrappers
- field and lab reproducibility tools
- database software bindings
- geospatial data
- text analysis
Statistical Packages
- Bayesian and Monte Carlo Routines
- Dimensionality Reduction, Clustering, and Unsupervised Learning
- Machine Learning
- Regression and Supervised Learning
- Exploratory Data Analysis (EDA) and Summary Statistics
- Spatial Analyses
- Time Series Analyses
- Probability Distributions
Explain how and why the package falls under these categories (briefly, 1-2 sentences). Please note any areas you are unsure of: I think this package fits here because it is designed to help people generate distributions and to see how distributions change in shape when their parameters change.
If submitting a statistical package, have you already incorporated documentation of standards into your code via the srr package? No, I have not used the srr package.
Who is the target audience and what are scientific applications of this package? Anyone that wants to see how distributions change with parameters (students) or anyone that is looking for parameter estimation etc.
Are there other R packages that accomplish the same thing? If so, how does yours differ or meet our criteria for best-in-category? There are packages like Distributional, actuar and EnvStats, but to my knowledge nothing out there that has as many parameter estimation tools and nothing that brings back all generators as tidyverse compliant tibbles with r, p, d, and q all set in one function.
(If applicable) Does your package comply with our guidance around Ethics, Data Privacy and Human Subjects Research? I don't collect any telemetry so I would say yes.
Any other questions or issues we should be aware of?: I don't really know.


mpadge · 2024-09-11T16:47:23Z

@spsanderson The stats team will discuss your query and get back to you asap. In regard to Probability Distributions standards, our stats dev-guide states:

Unlike most other categories of standards, packages which fit in this category will also generally be expected to fit into at least one other category of statistical software.

This would be the first package that goes against that general principle. That said, it is clearly well-developed, and looks like a significant contribution to probability distributions in general. We just have to decide whether we'll relax that condition for your submission.

spsanderson · 2024-09-11T16:52:56Z

Thank you. And if relaxing the standards is not going to happen, then maybe some direction on how to make it fit the standard? As in, maybe it also fits EDA?

mpadge · 2024-09-11T17:11:17Z

Good point - our criteria for fit with stats submission is very easy: If you think you could comply with half or more of all EDA standards, then it "fits" in that category too. Could you have a quick scan through those standards and report back? Seems to me it should comply. Thanks!

spsanderson · 2024-09-11T17:41:57Z

@mpadge I will take a look now. I think it should fit EDA at face value as there is plotting, parameter estimation, summary table functions and AIC calculation functions.

spsanderson · 2024-09-11T18:05:22Z

@mpadge

Here is an example of why I think it may fit with EDA as well:

Plotting:

library(TidyDensity)
library(dplyr)

# Create a data frame with 100 random numbers
set.seed(123)
df_tbl <- tidy_normal(.n = 100, .num_sims = 5)

# Plot different aspects of the data
# The default plot is a density plot
tidy_autoplot(df_tbl)

# A qq plot
tidy_autoplot(df_tbl, .plot_type = "qq")

# A quantile plot
tidy_autoplot(df_tbl, .plot_type = "quantile")

# A probability plot
tidy_autoplot(df_tbl, .plot_type = "probability")

Summary Stats:

> tidy_distribution_summary_tbl(df_tbl)
# A tibble: 1 × 12
  mean_val median_val std_val min_val max_val skewness kurtosis range   iqr variance ci_low ci_high
     <dbl>      <dbl>   <dbl>   <dbl>   <dbl>    <dbl>    <dbl> <dbl> <dbl>    <dbl>  <dbl>   <dbl>
1   0.0346     0.0207   0.973   -2.66    3.24   0.0861     2.95  5.90  1.26    0.946  -1.77    2.00
> tidy_distribution_summary_tbl(df_tbl, sim_number)
# A tibble: 5 × 13
  sim_number mean_val median_val std_val min_val max_val skewness kurtosis range   iqr variance
  <fct>         <dbl>      <dbl>   <dbl>   <dbl>   <dbl>    <dbl>    <dbl> <dbl> <dbl>    <dbl>
1 1            0.0904    0.0618    0.913   -2.31    2.19   0.0605     2.84  4.50  1.19    0.833
2 2           -0.108    -0.226     0.967   -2.05    3.24   0.639      3.66  5.29  1.27    0.935
3 3            0.120     0.0359    0.950   -1.76    2.29   0.326      2.46  4.05  1.29    0.902
4 4           -0.0362   -0.00351   1.04    -2.47    2.57  -0.0312     2.74  5.04  1.42    1.08 
5 5            0.106     0.165     0.989   -2.66    2.40  -0.431      3.31  5.06  1.12    0.979
# ℹ 2 more variables: ci_low <dbl>, ci_high <dbl>
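For readers unfamiliar with the columns above, here is a minimal base-R sketch of how such summary statistics can be computed for a single vector. It assumes simple moment-based estimators for skewness and kurtosis (non-excess, so roughly 3 for normal data); TidyDensity's own estimators may differ, and the names `y`, `centered`, and `summary_stats` are purely illustrative.

```r
# Base-R sketch of moment-based summary statistics.
# Assumption: simple moment estimators; not necessarily what TidyDensity uses.
set.seed(123)
y <- rnorm(100)                 # stand-in for one simulation's values

centered <- y - mean(y)
m2 <- mean(centered^2)          # second central moment
m3 <- mean(centered^3)          # third central moment
m4 <- mean(centered^4)          # fourth central moment

summary_stats <- c(
  mean_val   = mean(y),
  median_val = median(y),
  std_val    = sd(y),
  min_val    = min(y),
  max_val    = max(y),
  skewness   = m3 / m2^1.5,     # 0 for a symmetric distribution
  kurtosis   = m4 / m2^2,       # non-excess: about 3 for normal data
  range      = max(y) - min(y),
  iqr        = IQR(y),
  variance   = var(y)
)

print(round(summary_stats, 4))
```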

Bootstrapping the data:

> # Bootstrap plotting a single simulation of df_tbl
> tidy_bootstrap(filter(df_tbl, sim_number == 1)[["y"]])
# A tibble: 2,000 × 2
   sim_number bootstrap_samples
   <fct>      <list>           
 1 1          <dbl [80]>       
 2 2          <dbl [80]>       
 3 3          <dbl [80]>       
 4 4          <dbl [80]>       
 5 5          <dbl [80]>       
 6 6          <dbl [80]>       
 7 7          <dbl [80]>       
 8 8          <dbl [80]>       
 9 9          <dbl [80]>       
10 10         <dbl [80]>       
# ℹ 1,990 more rows
# ℹ Use `print(n = ...)` to see more rows
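The shape of that output (2,000 rows, each holding 80 values drawn from 100 points) can be reproduced with a rough base-R sketch of a nonparametric bootstrap. This is not TidyDensity's implementation; the 0.8 resampling proportion is an assumption inferred from the `<dbl [80]>` entries above, and all variable names are illustrative.

```r
# Base-R sketch of a nonparametric bootstrap; not TidyDensity's implementation.
set.seed(123)
y <- rnorm(100)                       # stand-in for one simulation's values

num_sims    <- 2000                   # number of resamples
proportion  <- 0.8                    # assumed fraction of points per resample
sample_size <- floor(length(y) * proportion)

# Each list element is one resample drawn with replacement
boot_samples <- replicate(
  num_sims,
  sample(y, size = sample_size, replace = TRUE),
  simplify = FALSE
)

# Bootstrap distribution of the mean and a 95% percentile interval
boot_means <- vapply(boot_samples, mean, numeric(1))
boot_ci    <- quantile(boot_means, probs = c(0.025, 0.975))

print(boot_ci)
```

Using fewer resamples makes the percentile interval noisier, which is the concern behind the package's warning when `.num_sims` is set below 2000.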

Bootstrap Plots

tidy_bootstrap(filter(df_tbl, sim_number == 1)[["y"]]) |>
  bootstrap_stat_plot(y)

tidy_bootstrap(filter(df_tbl, sim_number == 1)[["y"]], .num_sims = 100) |>
  bootstrap_stat_plot(y, .show_groups = TRUE)
Warning message:
Setting '.num_sims' to less than 2000 means that results can be potentially unstable. Consider
setting to 2000 or more.

> tidy_bootstrap(filter(df_tbl, sim_number == 1)[["y"]], .num_sims = 100) |>
+   bootstrap_stat_plot(y, .show_groups = TRUE, .stat = "cmax")
Warning message:
Setting '.num_sims' to less than 2000 means that results can be potentially unstable. Consider
setting to 2000 or more.

Stats about the distribution

> # Stats about the distribution
> # All distributions
> util_normal_stats_tbl(df_tbl) |>
+   glimpse()
Rows: 1
Columns: 17
$ tidy_function     <chr> "tidy_gaussian"
$ function_call     <chr> "Gaussian c(0, 1)"
$ distribution      <chr> "Gaussian"
$ distribution_type <chr> "continuous"
$ points            <dbl> 100
$ simulations       <dbl> 5
$ mean              <dbl> 0
$ median            <dbl> 0.02071715
$ mode              <dbl> 0
$ std_dv            <dbl> 1
$ coeff_var         <dbl> Inf
$ skewness          <dbl> 0
$ kurtosis          <dbl> 3
$ computed_std_skew <dbl> 0.0861201
$ computed_std_kurt <dbl> 2.953595
$ ci_lo             <dbl> -1.774248
$ ci_hi             <dbl> 1.99998
> # A single distribution
> util_normal_stats_tbl(df_tbl |> filter(sim_number == 2)) |>
+   glimpse()
Rows: 1
Columns: 17
$ tidy_function     <chr> "tidy_gaussian"
$ function_call     <chr> "Gaussian c(0, 1)"
$ distribution      <chr> "Gaussian"
$ distribution_type <chr> "continuous"
$ points            <dbl> 100
$ simulations       <dbl> 5
$ mean              <dbl> 0
$ median            <dbl> -0.22583
$ mode              <dbl> 0
$ std_dv            <dbl> 1
$ coeff_var         <dbl> Inf
$ skewness          <dbl> 0
$ kurtosis          <dbl> 3
$ computed_std_skew <dbl> 0.6387938
$ computed_std_kurt <dbl> 3.656954
$ ci_lo             <dbl> -1.610118
$ ci_hi             <dbl> 2.051234

Random walks of data and Viz

# Create and plot a random walk of the data
tidy_random_walk(df_tbl |> filter(sim_number == 1), .initial_value = 100, 
                 .value_type = "cum_prod") |>
  tidy_random_walk_autoplot()

tidy_random_walk(df_tbl, .initial_value = 100, .value_type = "cum_sum") |>
  tidy_random_walk_autoplot()

Getting AIC

> # Getting AIC of distribution
> util_normal_aic(df_tbl |> filter(sim_number == 1) |> pull(y))
[1] 268.5385
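As a point of reference, the AIC of a normal fit can be sketched in base R from its definition, AIC = 2k - 2 log L, with k = 2 estimated parameters. This assumes closed-form maximum-likelihood estimates; `util_normal_aic()` may estimate the parameters differently (for example by numerical optimisation), so its result need not match this sketch exactly.

```r
# Base-R sketch of AIC for a normal fit; not util_normal_aic() itself.
set.seed(123)
y <- rnorm(100)                           # stand-in for one simulation's values

# Closed-form maximum-likelihood estimates for a normal distribution
mu_hat    <- mean(y)
sigma_hat <- sqrt(mean((y - mu_hat)^2))   # MLE divides by n, not n - 1

# Log-likelihood at the estimates, then AIC = 2k - 2 * logLik with k = 2
log_lik    <- sum(dnorm(y, mean = mu_hat, sd = sigma_hat, log = TRUE))
k          <- 2
aic_manual <- 2 * k - 2 * log_lik

# Cross-check: an intercept-only linear model is the same normal fit
aic_lm <- AIC(lm(y ~ 1))

print(c(manual = aic_manual, lm = aic_lm))
```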

spsanderson · 2024-10-15T13:28:58Z

@mpadge I just wanted to check back in and get your thoughts on whether I should also check off EDA.

Thank you :)

mpadge · 2024-10-23T08:23:48Z

@spsanderson Yes, I definitely think so. That category is "Exploratory Data Analysis and Summary Statistics", and your package certainly has a strong focus on that latter component. Seems like you'd easily tick off at least half of those standards, so please go ahead and document compliance with those. When you're done you can call @ropensci-review-bot check srr yourself to get a statement of compliance and we'll proceed from there. Thanks!

spsanderson · 2024-10-23T12:26:03Z

@mpadge thank you! I have checked it off and will now call the bot

spsanderson · 2024-10-23T12:26:30Z

@ropensci-review-bot

ropensci-review-bot · 2024-10-23T12:26:32Z

I'm sorry human, I don't understand that. You can see what commands I support by typing:

@ropensci-review-bot help

spsanderson · 2024-10-23T12:27:48Z

@ropensci-review-bot check srr

mpadge · 2024-10-23T16:28:24Z

Sorry @spsanderson, I'll find time tomorrow to debug why the bot hasn't responded there

ropensci-review-bot · 2024-10-24T10:20:29Z

This is not an 'srr' package

spsanderson · 2024-10-24T16:02:27Z

Hi @mpadge the bot says this is not an srr package, should I do something else? Thanks

mpadge · 2024-10-24T20:00:20Z

Yes, you should indeed. You'll need to document compliance with the statistical standards. The general procedure is described in https://stats-devguide.ropensci.org, and the specific tools for documenting compliance within your code are part of our srr package. (The bot response above just indicates that you've not yet done that.) Feel free to ask any questions you might have during the process.

spsanderson · 2024-11-21T01:56:25Z

Hi @mpadge I'm finally getting down to it! Would you be able to tell me if the below is what you're looking for? I'm pretty sure it is.

#' Bootstrap Density Tibble
#'
#' @family Bootstrap
#' @family Augment Function
#'
#' @author Steven P. Sanderson II, MPH
#'
#' @details This function takes as input the output of the `tidy_bootstrap()` or
#' `bootstrap_unnest_tbl()` and returns an augmented tibble that has the following
#' columns added to it: _`x`_, _`y`_, _`dx`_, and _`dy`_.
#'
#' It looks for an attribute that comes from using `tidy_bootstrap()` or
#' `bootstrap_unnest_tbl()` so it will not work unless the data comes from one of
#' those functions.
#'
#' @description Add density information to the output of `tidy_bootstrap()`, and
#' `bootstrap_unnest_tbl()`.
#'
#' @param .data The data that is passed from the `tidy_bootstrap()` or
#' `bootstrap_unnest_tbl()` functions.
#'
#' @examples
#' x <- mtcars$mpg
#'
#' tidy_bootstrap(x) |>
#'   bootstrap_density_augment()
#'
#' tidy_bootstrap(x) |>
#'   bootstrap_unnest_tbl() |>
#'   bootstrap_density_augment()
#'
#' @return
#' A tibble
#'
#' @srrstats {PD1.0}.
#' @export
#'

bootstrap_density_augment <- function(.data) {
  atb <- attributes(.data)

  # Checks
  if (!is.data.frame(.data)) {
    rlang::abort(
      message = "'.data' is expecting a data.frame/tibble. Please supply.",
      use_cli_format = TRUE
    )
  }

  if (!atb$tibble_type %in% c("tidy_bootstrap", "tidy_bootstrap_nested")) {
    rlang::abort(
      message = "Must pass data to this function from either tidy_bootstrap() or
      bootstrap_unnest_tbl().",
      use_cli_format = TRUE
    )
  }

  # Add density data
  if (atb$tibble_type == "tidy_bootstrap_nested") {
    df_tbl <- dplyr::as_tibble(.data) |>
      TidyDensity::bootstrap_unnest_tbl()
  }

  if (atb$tibble_type == "tidy_bootstrap") {
    df_tbl <- dplyr::as_tibble(.data)
  }

  df_tbl <- df_tbl |>
    dplyr::nest_by(sim_number) |>
    dplyr::mutate(dens_tbl = list(
      stats::density(unlist(data),
        n = nrow(data)
      )[c("x", "y")] |>
        purrr::set_names("dx", "dy") |>
        dplyr::as_tibble()
    )) |>
    tidyr::unnest(cols = c(data, dens_tbl)) |>
    dplyr::mutate(x = dplyr::row_number()) |>
    dplyr::ungroup() |>
    dplyr::select(sim_number, x, y, dx, dy, dplyr::everything())

  # Return
  attr(df_tbl, "tibble_type") <- "bootstrap_density"
  attr(df_tbl, "incoming_tibble_type") <- atb$tibble_type
  attr(df_tbl, ".num_sims") <- atb$.num_sims
  attr(df_tbl, "dist_with_params") <- atb$dist_with_params
  attr(df_tbl, "distribution_family_type") <- atb$distribution_family_type

  return(df_tbl)
}

mpadge · 2024-11-21T08:31:19Z

@spsanderson That's getting there, but the idea is to write a statement of how you comply with each standard. A single sentence generally suffices. The compliance statements are intended to help reviewers instantly see how the code they're about to look at complies with the standard. Have a look at some of our other stats packages for examples.
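A minimal sketch of what such a tag might look like, with the compliance statement following the standard identifier on the same `@srrstats` line. The function `example_mean()` is hypothetical, and the bracketed statement is a placeholder to be replaced with a sentence specific to the code; it does not paraphrase the actual text of PD1.0.

```r
#' Mean of a numeric vector (hypothetical example function)
#'
#' @param .x A numeric vector.
#' @return A single numeric value.
#'
#' @srrstats {PD1.0} <One sentence stating how this function meets PD1.0.>
#' @export
example_mean <- function(.x) {
  # Reject non-numeric input early with an informative error
  if (!is.numeric(.x)) {
    stop("'.x' must be numeric.", call. = FALSE)
  }
  mean(.x)
}

example_mean(c(1, 2, 3))
```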

spsanderson · 2024-11-21T12:31:16Z

@mpadge thank you! I'll work on that and then ask you to look again before I proceed with the other ~150+ functions.
