Is my TidyDensity package a good fit here? #657
@spsanderson The stats team will discuss your query and get back to you asap. In regard to Probability Distributions standards, our stats dev-guide states:
This would be the first package that goes against that general principle. That said, it is clearly well developed and looks like a significant contribution to probability distributions in general. We just have to decide whether we'll relax that condition for your submission. |
Thank you. And if relaxing the standards is not going to happen, then maybe some direction on how to make it fit the standard? As in, maybe it also fits EDA? |
Good point - our criteria for fit with stats submission is very easy: If you think you could comply with half or more of all EDA standards, then it "fits" in that category too. Could you have a quick scan through those standards and report back? Seems to me it should comply. Thanks! |
@mpadge I will take a look now. I think it should fit EDA at face value as there is plotting, parameter estimation, summary table functions and AIC calculation functions. |
Here is an example of why I think it may fit with EDA as well:
Plotting:
library(TidyDensity)
library(dplyr)
# Create a data frame with 100 random numbers
set.seed(123)
df_tbl <- tidy_normal(.n = 100, .num_sims = 5)
# Plot different aspects of the data
# The default plot is a density plot
tidy_autoplot(df_tbl)
# A qq plot
tidy_autoplot(df_tbl, .plot_type = "qq")
# A quantile plot
tidy_autoplot(df_tbl, .plot_type = "quantile")
# A probability plot
tidy_autoplot(df_tbl, .plot_type = "probability")
Summary Stats:
# A tibble: 1 × 12
mean_val median_val std_val min_val max_val skewness kurtosis range iqr variance ci_low ci_high
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 0.0346 0.0207 0.973 -2.66 3.24 0.0861 2.95 5.90 1.26 0.946 -1.77 2.00
> tidy_distribution_summary_tbl(df_tbl, sim_number)
# A tibble: 5 × 13
sim_number mean_val median_val std_val min_val max_val skewness kurtosis range iqr variance
<fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 0.0904 0.0618 0.913 -2.31 2.19 0.0605 2.84 4.50 1.19 0.833
2 2 -0.108 -0.226 0.967 -2.05 3.24 0.639 3.66 5.29 1.27 0.935
3 3 0.120 0.0359 0.950 -1.76 2.29 0.326 2.46 4.05 1.29 0.902
4 4 -0.0362 -0.00351 1.04 -2.47 2.57 -0.0312 2.74 5.04 1.42 1.08
5 5 0.106 0.165 0.989 -2.66 2.40 -0.431 3.31 5.06 1.12 0.979
# ℹ 2 more variables: ci_low <dbl>, ci_high <dbl>
Bootstrapping the data:
> # Bootstrap plotting a single simulation of df_tbl
> tidy_bootstrap(filter(df_tbl, sim_number == 1)[["y"]])
# A tibble: 2,000 × 2
sim_number bootstrap_samples
<fct> <list>
1 1 <dbl [80]>
2 2 <dbl [80]>
3 3 <dbl [80]>
4 4 <dbl [80]>
5 5 <dbl [80]>
6 6 <dbl [80]>
7 7 <dbl [80]>
8 8 <dbl [80]>
9 9 <dbl [80]>
10 10 <dbl [80]>
# ℹ 1,990 more rows
# ℹ Use `print(n = ...)` to see more rows
Bootstrap Plots
tidy_bootstrap(filter(df_tbl, sim_number == 1)[["y"]]) |>
  bootstrap_stat_plot(y)
tidy_bootstrap(filter(df_tbl, sim_number == 1)[["y"]], .num_sims = 100) |>
  bootstrap_stat_plot(y, .show_groups = TRUE)
Warning message:
Setting '.num_sims' to less than 2000 means that results can be potentially unstable. Consider
setting to 2000 or more.
> tidy_bootstrap(filter(df_tbl, sim_number == 1)[["y"]], .num_sims = 100) |>
+ bootstrap_stat_plot(y, .show_groups = TRUE, .stat = "cmax")
Warning message:
Setting '.num_sims' to less than 2000 means that results can be potentially unstable. Consider
setting to 2000 or more.
Stats about the distribution
> # Stats about the distribution
> # All distributions
> util_normal_stats_tbl(df_tbl) |>
+ glimpse()
Rows: 1
Columns: 17
$ tidy_function <chr> "tidy_gaussian"
$ function_call <chr> "Gaussian c(0, 1)"
$ distribution <chr> "Gaussian"
$ distribution_type <chr> "continuous"
$ points <dbl> 100
$ simulations <dbl> 5
$ mean <dbl> 0
$ median <dbl> 0.02071715
$ mode <dbl> 0
$ std_dv <dbl> 1
$ coeff_var <dbl> Inf
$ skewness <dbl> 0
$ kurtosis <dbl> 3
$ computed_std_skew <dbl> 0.0861201
$ computed_std_kurt <dbl> 2.953595
$ ci_lo <dbl> -1.774248
$ ci_hi <dbl> 1.99998
> # A single distribution
> util_normal_stats_tbl(df_tbl |> filter(sim_number == 2)) |>
+ glimpse()
Rows: 1
Columns: 17
$ tidy_function <chr> "tidy_gaussian"
$ function_call <chr> "Gaussian c(0, 1)"
$ distribution <chr> "Gaussian"
$ distribution_type <chr> "continuous"
$ points <dbl> 100
$ simulations <dbl> 5
$ mean <dbl> 0
$ median <dbl> -0.22583
$ mode <dbl> 0
$ std_dv <dbl> 1
$ coeff_var <dbl> Inf
$ skewness <dbl> 0
$ kurtosis <dbl> 3
$ computed_std_skew <dbl> 0.6387938
$ computed_std_kurt <dbl> 3.656954
$ ci_lo <dbl> -1.610118
$ ci_hi <dbl> 2.051234
Random walks of data and Viz
# Create and plot a random walk of the data
tidy_random_walk(df_tbl |> filter(sim_number == 1), .initial_value = 100,
  .value_type = "cum_prod") |>
  tidy_random_walk_autoplot()
tidy_random_walk(df_tbl, .initial_value = 100, .value_type = "cum_sum") |>
  tidy_random_walk_autoplot()
Getting AIC
> # Getting AIC of distribution
> util_normal_aic(df_tbl |> filter(sim_number == 1) |> pull(y))
[1] 268.5385 |
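For readers unfamiliar with the quantity reported above, here is a minimal base-R sketch of an AIC for a normal fit. It assumes maximum-likelihood estimates of the mean and standard deviation and two estimated parameters. The helper name `normal_aic_sketch` is hypothetical, and this is not necessarily how `util_normal_aic()` computes its value, so results may differ from the package's output.

```r
# Hypothetical helper for illustration; not TidyDensity's implementation.
normal_aic_sketch <- function(x) {
  stopifnot(is.numeric(x), length(x) > 1L)
  mu_hat <- mean(x)
  # The maximum-likelihood estimate of sigma divides by n, not n - 1
  sigma_hat <- sqrt(mean((x - mu_hat)^2))
  log_lik <- sum(stats::dnorm(x, mean = mu_hat, sd = sigma_hat, log = TRUE))
  # AIC = 2k - 2 * logLik, with k = 2 estimated parameters (mean, sd)
  2 * 2 - 2 * log_lik
}

set.seed(123)
y <- stats::rnorm(100)
normal_aic_sketch(y)
```

Under these assumptions the value should agree with `AIC(lm(y ~ 1))`, which is a convenient cross-check in base R.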
@mpadge I just wanted to check back in and get your thoughts on whether I should also check off EDA. Thank you :) |
@spsanderson Yes, I definitely think so. That category is "Exploratory Data Analysis and Summary Statistics", and your package certainly has a strong focus on that latter component. Seems like you'd easily tick off at least half of those standards, so please go ahead and document compliance with those. When you're done you can call |
@mpadge thank you! I have checked it off and will now call the bot |
I'm sorry human, I don't understand that. You can see what commands I support by typing:
|
@ropensci-review-bot check srr |
Sorry @spsanderson, I'll find time tomorrow to debug why the bot hasn't responded there |
This is not an 'srr' package |
Hi @mpadge the bot says this is not an srr package, should I do something else? Thanks |
Yes, you should indeed. You'll need to document compliance with the statistical standards. The general procedure is described in https://stats-devguide.ropensci.org, and the specific tools for documenting compliance within your code are part of our srr package. (The bot response above just indicates that you've not yet done that.) Feel free to ask any questions you might have during the process. |
Hi @mpadge, I'm finally getting down to it! Would you be able to tell me if the below is what you're looking for? I'm pretty sure it is.
#' Bootstrap Density Tibble
#'
#' @family Bootstrap
#' @family Augment Function
#'
#' @author Steven P. Sanderson II, MPH
#'
#' @details This function takes as input the output of the `tidy_bootstrap()` or
#' `bootstrap_unnest_tbl()` and returns an augmented tibble that has the following
#' columns added to it: _`x`_, _`y`_, _`dx`_, and _`dy`_.
#'
#' It looks for an attribute that comes from using `tidy_bootstrap()` or
#' `bootstrap_unnest_tbl()` so it will not work unless the data comes from one of
#' those functions.
#'
#' @description Add density information to the output of `tidy_bootstrap()`, and
#' `bootstrap_unnest_tbl()`.
#'
#' @param .data The data that is passed from the `tidy_bootstrap()` or
#' `bootstrap_unnest_tbl()` functions.
#'
#' @examples
#' x <- mtcars$mpg
#'
#' tidy_bootstrap(x) |>
#' bootstrap_density_augment()
#'
#' tidy_bootstrap(x) |>
#' bootstrap_unnest_tbl() |>
#' bootstrap_density_augment()
#'
#' @return
#' A tibble
#'
#' @srrstats {PD1.0}.
#' @export
#'
bootstrap_density_augment <- function(.data) {
  atb <- attributes(.data)
  # Checks
  if (!is.data.frame(.data)) {
    rlang::abort(
      message = "'.data' is expecting a data.frame/tibble. Please supply.",
      use_cli_format = TRUE
    )
  }
  if (!atb$tibble_type %in% c("tidy_bootstrap", "tidy_bootstrap_nested")) {
    rlang::abort(
      message = "Must pass data to this function from either tidy_bootstrap() or
bootstrap_unnest_tbl().",
      use_cli_format = TRUE
    )
  }
  # Add density data
  if (atb$tibble_type == "tidy_bootstrap_nested") {
    df_tbl <- dplyr::as_tibble(.data) |>
      TidyDensity::bootstrap_unnest_tbl()
  }
  if (atb$tibble_type == "tidy_bootstrap") {
    df_tbl <- dplyr::as_tibble(.data)
  }
  df_tbl <- df_tbl |>
    dplyr::nest_by(sim_number) |>
    dplyr::mutate(dens_tbl = list(
      stats::density(unlist(data),
        n = nrow(data)
      )[c("x", "y")] |>
        purrr::set_names("dx", "dy") |>
        dplyr::as_tibble()
    )) |>
    tidyr::unnest(cols = c(data, dens_tbl)) |>
    dplyr::mutate(x = dplyr::row_number()) |>
    dplyr::ungroup() |>
    dplyr::select(sim_number, x, y, dx, dy, dplyr::everything())
  # Return
  attr(df_tbl, "tibble_type") <- "bootstrap_density"
  attr(df_tbl, "incoming_tibble_type") <- atb$tibble_type
  attr(df_tbl, ".num_sims") <- atb$.num_sims
  attr(df_tbl, "dist_with_params") <- atb$dist_with_params
  attr(df_tbl, "distribution_family_type") <- atb$distribution_family_type
  return(df_tbl)
} |
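One note on the attribute check in the function above: for a plain data frame, `attributes(.data)$tibble_type` is `NULL`, so `!NULL %in% c(...)` evaluates to `logical(0)` and `if ()` stops with "argument is of length zero" before the intended message is reached. A minimal base-R sketch of a stricter check follows. The helper name `has_valid_tibble_type` is hypothetical and is not part of TidyDensity.

```r
# Hypothetical helper for illustration; not part of TidyDensity.
allowed_types <- c("tidy_bootstrap", "tidy_bootstrap_nested")

has_valid_tibble_type <- function(x) {
  tt <- attr(x, "tibble_type", exact = TRUE)
  # Always returns a single TRUE/FALSE, even when the attribute is missing
  is.character(tt) && length(tt) == 1L && tt %in% allowed_types
}

plain <- data.frame(y = 1:3)
tagged <- plain
attr(tagged, "tibble_type") <- "tidy_bootstrap"

has_valid_tibble_type(plain)   # FALSE
has_valid_tibble_type(tagged)  # TRUE
```

Because the helper always yields a length-one logical, it can sit directly inside an `if ()` ahead of the `rlang::abort()` call.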
@spsanderson That's getting there, but the idea is to write a statement of how you comply with each standard. A single sentence generally suffices. The compliance statements are intended to help reviewers instantly see how the code they're about to look at complies with the standard. Have a look at some of our other stats packages for examples. |
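To illustrate the shape of such a statement, here is a hedged sketch of an `@srrstats` tag followed by a one-sentence compliance note. The function `demo_density` and the wording of the statement are hypothetical. Whether that wording would actually satisfy PD1.0 is for reviewers to judge, and the text of the standard itself is in the stats dev-guide.

```r
#' Kernel density of a numeric vector (hypothetical helper)
#'
#' @param x A numeric vector.
#' @return A data frame with columns `dx` and `dy`.
#'
#' @srrstats {PD1.0} Hypothetical wording: densities come from `stats::density()`.
#' @noRd
demo_density <- function(x) {
  stopifnot(is.numeric(x), length(x) > 1L)
  # Evaluate the kernel density at as many points as there are observations
  d <- stats::density(x, n = length(x))
  data.frame(dx = d$x, dy = d$y)
}

demo_density(mtcars$mpg)
```

The point is only that the standard's identifier in braces is followed by a short sentence saying how the code meets it, rather than the bare `{PD1.0}.` used in the snippet earlier in the thread.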
@mpadge thank you! I'll work on that and then ask you to look again before I proceed with the other ~150+ functions. |
Submitting Author Name: Steven P. Sanderson II, MPH
Submitting Author Github Handle: @spsanderson
Other Package Authors Github handles: None
Repository: https://github.com/spsanderson/TidyDensity
Submission type: Pre-submission
Language: en
Scope
Please indicate which category or categories from our package fit policies or statistical package categories this package falls under. (Please check one or more appropriate boxes below):
Data Lifecycle Packages
Statistical Packages
Explain how and why the package falls under these categories (briefly, 1-2 sentences). Please note any areas you are unsure of: I think this package fits here because it is designed to help people generate distributions and to see how distributions change in shape when their parameters change.
If submitting a statistical package, have you already incorporated documentation of standards into your code via the srr package? No, I have not used the srr package.
Who is the target audience and what are scientific applications of this package? Anyone that wants to see how distributions change with parameters (students) or anyone that is looking for parameter estimation etc.
Are there other R packages that accomplish the same thing? If so, how does yours differ or meet our criteria for best-in-category? There are packages like Distributional, actuar and EnvStats, but to my knowledge nothing out there that has as many parameter estimation tools and nothing that brings back all generators as tidyverse compliant tibbles with r, p, d, and q all set in one function.
(If applicable) Does your package comply with our guidance around Ethics, Data Privacy and Human Subjects Research? I don't collect any telemetry so I would say yes.
Any other questions or issues we should be aware of?: I don't really know.