Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

QuadratiK: A Collection of Methods Using Kernel-Based Quadratic Distances for Statistical Inference and Clustering #632

Open
1 of 20 tasks
giovsaraceno opened this issue Mar 13, 2024 · 68 comments

Comments

@giovsaraceno
Copy link

giovsaraceno commented Mar 13, 2024

Submitting Author Giovanni Saraceno
Submitting Author Github Handle: @giovsaraceno
Other Package Authors Github handles: @rmj3197
Repository: https://github.com/giovsaraceno/QuadratiK-package§
Version submitted:1.1.1
Submission type: Stats
Badge grade: gold
Editor: @emitanaka
Reviewers: @kasselhingee, @emitanaka

Due date for @kasselhingee: 2024-07-16

Due date for @emitanaka: 2025-02-05
Archive: TBD
Version accepted: TBD


  • DESCRIPTION file:
Type: Package
Package: QuadratiK
Title: A Collection of Methods Using Kernel-Based Quadratic Distances for 
       Statistical Inference and Clustering
Version: 1.0.0
Authors@R: c(
person("Giovanni", "Saraceno", , "[email protected]", role = c("aut", "cre"),
comment = "ORCID 000-0002-1753-2367"),
person("Marianthi", "Markatou", role = "aut"),
person("Raktim", "Mukhopadhyay", role = "aut"),
person("Mojgan", "Golzy", role = "ctb")
)
Maintainer: Giovanni Saraceno <[email protected]>
Description: The package includes test for multivariate normality, test for
uniformity on the Sphere, non-parametric two- and k-sample tests,
random generation of points from the Poisson kernel-based density and a
clustering algorithm for spherical data. For more information see
Saraceno, G., Markatou, M., Mukhopadhyay, R., Golzy, M. (2024)
<arXiv:2402.02290>, Ding, Y., Markatou, M., Saraceno, G. (2023)
<doi:10.5705/ss.202022.0347>, and Golzy, M., Markatou, M. (2020)
<doi:10.1080/10618600.2020.1740713>.
License: GPL (>= 3)
URL: https://cran.r-project.org/web/packages/QuadratiK/index.html, 
     https://github.com/giovsaraceno/QuadratiK-package
BugReports: https://github.com/giovsaraceno/QuadratiK-package/issues
Depends: 
R (>= 3.5.0)
Imports: 
cluster,
clusterRepro,
doParallel,
foreach,
ggplot2,
ggpp,
ggpubr,
MASS,
mclust,
methods,
moments,
movMF,
mvtnorm,
Rcpp,
RcppEigen,
rgl,
rlecuyer,
rrcov,
sn,
stats,
Tinflex
Suggests: 
knitr,
rmarkdown,
roxygen2,
testthat (>= 3.0.0)
LinkingTo: 
Rcpp,
RcppEigen
VignetteBuilder: 
knitr
Config/testthat/edition: 3
Encoding: UTF-8
LazyData: true
Roxygen: list(markdown=TRUE, roclets=c("namespace", "rd", "srr::srr_stats_roclet"))
RoxygenNote: 7.2.3

Scope

Data Lifecycle Packages

  • data retrieval
  • data extraction
  • data munging
  • data deposition
  • data validation and testing
  • workflow automation
  • version control
  • citation management and bibliometrics
  • scientific software wrappers
  • field and lab reproducibility tools
  • database software bindings
  • geospatial data
  • text analysis

Statistical Packages

  • Bayesian and Monte Carlo Routines

  • Dimensionality Reduction, Clustering, and Unsupervised Learning

  • Machine Learning

  • Regression and Supervised Learning

  • Exploratory Data Analysis (EDA) and Summary Statistics

  • Spatial Analyses

  • Time Series Analyses

  • Explain how and why the package falls under these categories (briefly, 1-2 sentences). Please note any areas you are unsure of:

This category is the most suitable due to QuadratiK's clustering technique, specifically designed for spherical data. The package's clustering algorithm falls within the realm of unsupervised learning, where the focus is on identifying groupings in the data without pre-labeled categories. The two- and k-sample tests serve as additional tools for testing the differences between the identified groups.
Following the link https://stats-devguide.ropensci.org/standards.html we noticed in the "Table of contents" that category 6.9 refers to Probability Distribution. We are unsure how we fit and if we fit this category. Can you please advise?

Yes, we have incorporated documentation of standards into our QuadratiK package by utilizing the srr package, considering the categories "General" and "Dimensionality Reduction, Clustering, and Unsupervised Learning", in line with the recommendations provided in the rOpenSci Statistical Software Peer Review Guide.

  • Who is the target audience and what are scientific applications of this package?

The QuadratiK package offers robust tools for goodness-of-fit testing, a fundamental aspect in statistical analysis, where accurately assessing the fit of probability distributions is essential. This is especially critical in research domains where model accuracy has direct implications on conclusions and further research directions. Spherical data structures are common in fields such as biology, geosciences and astronomy, where data points are naturally mapped to a sphere. QuadratiK provides a tailored approach to effectively handle and interpret these data. Furthermore, this package is also of particular interest to professionals in health and biological sciences, where understanding and interpreting spherical data can be crucial in studies ranging from molecular biology to epidemiology. Moreover, its implementation in both R and Python broadens its accessibility, catering to a wide audience accustomed to these popular programming languages.

Yes, there are other R packages that address goodness-of-fit (GoF) testing and multivariate analysis. Notable among these are the energy package for energy statistics-based tests. The function kmmd in the kernlab package offers a kernel-based test which has similar mathematical formulation. The package sphunif provides all the tests for uniformity on the sphere available in literature. The list of implemented tests includes the test for uniformity based on the Poisson kernel. However, there are fundamental differences between the methods encoded in the aforementioned packages and those offered in the QuadratiK package.

QuadratiK uniquely focuses on kernel-based quadratic distances methods for GoF testing, offering a comprehensive set of tools for one-sample, two-sample, and k-sample tests. This specialization provides more nuanced and robust methodologies for statistical analysis, especially in complex multivariate contexts. QuadratiK is optimized for high-dimensional datasets, employing efficient C++ implementations. This makes it particularly suitable for contemporary large-scale data analysis challenges. The package introduces advanced methods for kernel centering and critical value computation, as well as optimal tuning parameter selection based on midpower analysis. QuadratiK includes a unique clustering algorithm for spherical data. These innovations are not covered in other available packages. With implementations in both R and Python, QuadratiK appeals to a wider audience across different programming communities. We also provide a user-friendly dashboard application which further enhances accessibility, catering to users with varying levels of statistical and programming expertise.

In summary there are fundamental differences between QuadratiK and all existing R packages:

  1. The goodness-of-fit tests are U-statistics based on centered kernels. The concept and methodology of centering is novel and unique to our methods and is not part of the methods of other existing packages.
  2. An algorithm for connecting the tuning parameter with the statistical properties of the test, namely power and degrees of freedom of the kernel (DOF) is provided. This feature differentiates our novel methods from all encoded methods in the aforementioned R packages.
  3. A new clustering algorithm for data that reside on the sphere is offered. This aspect is not a feature of existing packages.
  4. We also offer algorithms for generating random samples from Poisson kernel-based densities. This capability is also unique to our package.

Yes, our package, QuadratiK, is compliant with the rOpenSci guidelines on Ethics, Data Privacy, and Human Subjects Research. We have carefully considered and adhered to ethical standards and data privacy laws relevant to our work.

  • Any other questions or issues we should be aware of?:

Please see the question posed in the first bullet.

@ldecicco-USGS
Copy link

@ropensci-review-bot check srr

1 similar comment
@maelle
Copy link
Member

maelle commented Mar 18, 2024

@ropensci-review-bot check srr

@ropensci-review-bot
Copy link
Collaborator

'srr' standards compliance:

  • Complied with: 57 / 101 = 56.4% (general: 37 / 68; unsupervised: 20 / 33)
  • Not complied with: 44 / 101 = 43.6% (general: 31 / 68; unsupervised: 13 / 33)

✔️ This package complies with > 50% of all standads and may be submitted.

@ldecicco-USGS
Copy link

Thanks for the submission @giovsaraceno ! I'm getting some advice from the other editors about your question. One thing that would be really helpful - could you push up your documentation to a GitHub page?

From the usethis package, there's a function that helps setting it up:
https://usethis.r-lib.org/reference/use_github_pages.html

@mpadge
Copy link
Member

mpadge commented Mar 20, 2024

Hi @giovsaraceno, Mark here from the rOpenSci stats team to answer your question. We've done our best to clarify the role of Probability Distributions Standards:

Unlike most other categories of standards, packages which fit in this category will also generally be expected to fit into at least one other category of statistical software. Reflecting that expectation, standards for probability distributions will be expected to only pertain to some (potentially small) portion of code in any package.

So packages should generally fit within some main category, with Probability Distributions being an additional category. In your case, Dimensionality Reduction seems like the appropriate main category, but it seems like your package would also fit within Probability Distributions. Given that, the next step would be for you to estimate what proportion of those standards you think might apply to your package? Our general rule-of-thumb is that at least 50% should apply, but for Probability Distributions as an additional category, that figure may be lower.

We are particularly keen to document compliance with this category, because it is where our standards have a large overlap with many core routines of the R language itself. As always, we encourage feedback on our standards, so please also feel very welcome to open issues in the Stats Software repository, or add comments or questions in the discussion pages. Thanks for you submission!

@giovsaraceno
Copy link
Author

Thanks for the submission @giovsaraceno ! I'm getting some advice from the other editors about your question. One thing that would be really helpful - could you push up your documentation to a GitHub page?

From the usethis package, there's a function that helps setting it up: https://usethis.r-lib.org/reference/use_github_pages.html

Thanks @ldecicco-USGS for your guidance during this process. Following your suggestion, I've now pushed the documentation for the QuadratiK package to a GitHub page. You can find it displayed on the main page of the GitHub repository. Here's the direct link for easy access: QuadratiK package GitHub page.

@giovsaraceno
Copy link
Author

Hi @giovsaraceno, Mark here from the rOpenSci stats team to answer your question. We've done our best to clarify the role of Probability Distributions Standards:

Unlike most other categories of standards, packages which fit in this category will also generally be expected to fit into at least one other category of statistical software. Reflecting that expectation, standards for probability distributions will be expected to only pertain to some (potentially small) portion of code in any package.

So packages should generally fit within some main category, with Probability Distributions being an additional category. In your case, Dimensionality Reduction seems like the appropriate main category, but it seems like your package would also fit within Probability Distributions. Given that, the next step would be for you to estimate what proportion of those standards you think might apply to your package? Our general rule-of-thumb is that at least 50% should apply, but for Probability Distributions as an additional category, that figure may be lower.

We are particularly keen to document compliance with this category, because it is where our standards have a large overlap with many core routines of the R language itself. As always, we encourage feedback on our standards, so please also feel very welcome to open issues in the Stats Software repository, or add comments or questions in the discussion pages. Thanks for you submission!

Hi Mark,

Thank you for the additional clarification regarding the standards for Probability Distributions and their integration with other statistical software categories. Following your guidance, we have conducted a thorough review of the standards applicable to the Probability Distributions category in relation to our package.

Based on our assessment, we found that the current version of our package satisfies 14% of the standards directly. Furthermore, we identified that an additional 36% of the standards could potentially apply to our package, but this would require us to make some enhancements, including the addition of checks and test codes. We feel the remaining 50% of the standards are not applicable to our package.

We are committed to improve our package and aim to fulfill the applicable standards. To this end, we plan to work on a separate branch dedicated to implementing these enhancements, with the goal of meeting the 50% of the standards for the Probability Distributions category. Before proceeding, we would greatly appreciate your opinion on this plan.    

Thank you for your time and support. Giovanni

@giovsaraceno
Copy link
Author

Hi @giovsaraceno, Mark here from the rOpenSci stats team to answer your question. We've done our best to clarify the role of Probability Distributions Standards:

Unlike most other categories of standards, packages which fit in this category will also generally be expected to fit into at least one other category of statistical software. Reflecting that expectation, standards for probability distributions will be expected to only pertain to some (potentially small) portion of code in any package.

So packages should generally fit within some main category, with Probability Distributions being an additional category. In your case, Dimensionality Reduction seems like the appropriate main category, but it seems like your package would also fit within Probability Distributions. Given that, the next step would be for you to estimate what proportion of those standards you think might apply to your package? Our general rule-of-thumb is that at least 50% should apply, but for Probability Distributions as an additional category, that figure may be lower.
We are particularly keen to document compliance with this category, because it is where our standards have a large overlap with many core routines of the R language itself. As always, we encourage feedback on our standards, so please also feel very welcome to open issues in the Stats Software repository, or add comments or questions in the discussion pages. Thanks for you submission!

Hi Mark,

Thank you for the additional clarification regarding the standards for Probability Distributions and their integration with other statistical software categories. Following your guidance, we have conducted a thorough review of the standards applicable to the Probability Distributions category in relation to our package.

Based on our assessment, we found that the current version of our package satisfies 14% of the standards directly. Furthermore, we identified that an additional 36% of the standards could potentially apply to our package, but this would require us to make some enhancements, including the addition of checks and test codes. We feel the remaining 50% of the standards are not applicable to our package.

We are committed to improve our package and aim to fulfill the applicable standards. To this end, we plan to work on a separate branch dedicated to implementing these enhancements, with the goal of meeting the 50% of the standards for the Probability Distributions category. Before proceeding, we would greatly appreciate your opinion on this plan.    

Thank you for your time and support. Giovanni

Hi Mark,

We addressed the enhancements we discussed, and our package now meets 50% of the standards for the Probability Distributions category. These updates are in the probability-distributions-standards branch of our repository.
We would like your opinion on merging this branch with the submitted version of the package.

Thank you, Giovanni

@mpadge
Copy link
Member

mpadge commented Mar 27, 2024

Hi Giovanni, your srrstats tags for probability distribution standards definitely look good enough to proceed. That said, one aspect which could be improved, and which I would request if I were reviewing the package, is the compliance statements in the tests. In both test-dpkb.R and test-rkpb.R you claim compliance in single statements at the start, yet I can't really see where or how a few of these are really complied with. In particular, there do not appear to be explicit tests for output values, as these are commonly tested using test_equal with an explicit tolerance parameter, which you don't have. It is also not clear to me where and how you compare results of different distributions, because you have no annotations in the tests about what the return values of the functions are.

Those are very minor points which you may ignore for the moment if you'd like to get the review process started, or you could quickly address them straight away if you prefer. Either way, feel free to ask the bot to check srr when you think you're ready to proceed. Thanks!

@giovsaraceno
Copy link
Author

Hi, thank you for your suggestions on our compliance statements and testing practices.
Regarding the explicit testing for output values and the use of test_equal with a tolerance parameter, we aimed to ensure that our functions return the expected outputs. However, we recognize that our current tests may not explicitly demonstrate compliance with this standard in the way you've described. We're uncertain about the best approach to incorporate test_equal with a tolerance parameter effectively, for testing the numeric equality of outputs from the provided random generation and density functions. Can you provide some tips?

As for comparing results from different distributions, the rpkb function in our package provides options to generate random observations using three distinct algorithms based on different probability distributions. We've conducted tests to confirm that each method functions as intended. We added also a new vignette in which the methods are compared by graphically displaying the generated points. Is this what you are looking for?

We're inclined to address them promptly. We would appreciate if we can get an answer to the questions posed above so that we can start the review process.
Thanks, Giovanni

@noamross
Copy link
Contributor

Sorry we didn't reply faster, @giovsaraceno. In, say, a single-variable distribution tests might include:

  • A correctness that the density function with given parameters has means, modes, or variances as theoretically expected.
  • A parameter recovery that the mean of a sufficiently large number of randomly generated values is within a window of expectations.
    In your case my understanding is that you are generating multivariate outputs. Ultimately we aim to see tests that those outputs are as expected, so for both density and random values. I think the thing to do is test that summary properties of those outputs, deterministic for density and within bounds for random, match those expected based on the input parameters

@giovsaraceno
Copy link
Author

Thanks @noamross for your explanation. We have taken your suggestions into consideration and have implemented them accordingly.
We are now ready to request the automatic bot check for our package. We look forward to any further instructions or feedback that might come from this next step.

@ldecicco-USGS
Copy link

@ropensci-review-bot check package

@ropensci-review-bot
Copy link
Collaborator

Thanks, about to send the query.

@ropensci-review-bot
Copy link
Collaborator

🚀

The following problems were found in your submission template:

  • HTML variable [editor] is missing
  • HTML variable [reviewers-list] is missing
  • HTML variable [due-dates-list] is missing
    Editors: Please ensure these problems with the submission template are rectified. Package checks have been started regardless.

👋

@ropensci-review-bot
Copy link
Collaborator

Checks for QuadratiK (v1.0.0)

git hash: 21541a40

  • ✔️ Package is already on CRAN.
  • ✔️ has a 'codemeta.json' file.
  • ✔️ has a 'contributing' file.
  • ✔️ uses 'roxygen2'.
  • ✔️ 'DESCRIPTION' has a URL field.
  • ✔️ 'DESCRIPTION' has a BugReports field.
  • ✔️ Package has at least one HTML vignette
  • ✔️ All functions have examples.
  • ✔️ Package has continuous integration checks.
  • ✔️ Package coverage is 78.2%.
  • ✖️ Package contains unexpected files.
  • ✔️ R CMD check found no errors.
  • ✖️ R CMD check found 1 warning.
  • 👀 Function names are duplicated in other packages

Important: All failing checks above must be addressed prior to proceeding

(Checks marked with 👀 may be optionally addressed.)

Package License: GPL (>= 3)


1. rOpenSci Statistical Standards (srr package)

This package is in the following category:

  • Dimensionality Reduction, Clustering and Unsupervised Learning

✔️ All applicable standards [v0.2.0] have been documented in this package (204 complied with; 49 N/A standards)

Click to see the report of author-reported standards compliance of the package with links to associated lines of code, which can be re-generated locally by running the srr_report() function from within a local clone of the repository.


2. Package Dependencies

Details of Package Dependency Usage (click to open)

The table below tallies all function calls to all packages ('ncalls'), both internal (r-base + recommended, along with the package itself), and external (imported and suggested packages). 'NA' values indicate packages to which no identified calls to R functions could be found. Note that these results are generated by an automated code-tagging system which may not be entirely accurate.

type package ncalls
internal base 382
internal QuadratiK 50
internal utils 10
internal grDevices 1
imports stats 29
imports methods 26
imports sn 14
imports ggpp 2
imports cluster 1
imports mclust 1
imports moments 1
imports rrcov 1
imports clusterRepro NA
imports doParallel NA
imports foreach NA
imports ggplot2 NA
imports ggpubr NA
imports MASS NA
imports movMF NA
imports mvtnorm NA
imports Rcpp NA
imports RcppEigen NA
imports rgl NA
imports rlecuyer NA
imports Tinflex NA
suggests knitr NA
suggests rmarkdown NA
suggests roxygen2 NA
suggests testthat NA
linking_to Rcpp NA
linking_to RcppEigen NA

Click below for tallies of functions used in each package. Locations of each call within this package may be generated locally by running 's <- pkgstats::pkgstats(<path/to/repo>)', and examining the 'external_calls' table.

base

list (46), data.frame (26), matrix (24), nrow (23), t (20), log (19), rep (19), ncol (18), c (14), numeric (12), for (11), sqrt (10), length (8), mean (8), as.numeric (6), return (6), sample (6), T (6), vapply (6), apply (5), as.factor (5), table (5), unique (5), as.vector (4), cumsum (4), exp (4), rbind (4), sum (4), as.matrix (3), kappa (3), lapply (3), lgamma (3), pi (3), q (3), replace (3), unlist (3), as.integer (2), diag (2), max (2), readline (2), rownames (2), rowSums (2), which (2), which.max (2), with (2), beta (1), colMeans (1), expand.grid (1), F (1), factor (1), if (1), levels (1), norm (1), rep.int (1), round (1), seq_len (1), subset (1)

QuadratiK

DOF (3), kbNormTest (3), normal_CV (3), C_d_lambda (2), compute_CV (2), cv_ksample (2), d2lpdf (2), dlpdf (2), lpdf (2), norm_vec (2), objective_norm (2), poisson_CV (2), rejvmf (2), sample_hypersphere (2), statPoissonUnif (2), compare_qq (1), compute_stats (1), computeKernelMatrix (1), computePoissonMatrix (1), dpkb (1), elbowMethod (1), generate_SN (1), NonparamCentering (1), objective_2 (1), objective_k (1), ParamCentering (1), pkbc_validation (1), rejacg (1), rejpsaw (1), select_h (1), stat_ksample_cpp (1), stat2sample (1)

stats

df (12), quantile (4), dist (2), rnorm (2), runif (2), aggregate (1), cov (1), D (1), qchisq (1), sd (1), sigma (1), uniroot (1)

methods

setMethod (12), setGeneric (8), new (3), setClass (3)

sn

rmsn (14)

utils

data (8), prompt (2)

ggpp

annotate (2)

cluster

silhouette (1)

grDevices

colorRampPalette (1)

mclust

adjustedRandIndex (1)

moments

skewness (1)

rrcov

PcaLocantore (1)

NOTE: Some imported packages appear to have no associated function calls; please ensure with author that these 'Imports' are listed appropriately.


3. Statistical Properties

This package features some noteworthy statistical properties which may need to be clarified by a handling editor prior to progressing.

Details of statistical properties (click to open)

The package has:

  • code in C++ (17% in 2 files) and R (83% in 12 files)
  • 4 authors
  • 5 vignettes
  • 1 internal data file
  • 21 imported packages
  • 24 exported functions (median 14 lines of code)
  • 56 non-exported functions in R (median 16 lines of code)
  • 16 R functions (median 13 lines of code)

Statistical properties of package structure as distributional percentiles in relation to all current CRAN packages
The following terminology is used:

  • loc = "Lines of Code"
  • fn = "function"
  • exp/not_exp = exported / not exported

All parameters are explained as tooltips in the locally-rendered HTML version of this report generated by the checks_to_markdown() function

The final measure (fn_call_network_size) is the total number of calls between functions (in R), or more abstract relationships between code objects in other languages. Values are flagged as "noteworthy" when they lie in the upper or lower 5th percentile.

measure value percentile noteworthy
files_R 12 65.5
files_src 2 79.1
files_vignettes 5 96.9
files_tests 10 90.7
loc_R 1408 76.6
loc_src 281 34.1
loc_vignettes 235 55.3
loc_tests 394 70.0
num_vignettes 5 97.9 TRUE
data_size_total 11842 71.9
data_size_median 11842 80.1
n_fns_r 80 70.4
n_fns_r_exported 24 72.5
n_fns_r_not_exported 56 70.6
n_fns_src 16 40.4
n_fns_per_file_r 5 67.1
n_fns_per_file_src 8 69.1
num_params_per_fn 5 69.6
loc_per_fn_r 15 46.1
loc_per_fn_r_exp 14 35.1
loc_per_fn_r_not_exp 16 54.8
loc_per_fn_src 13 41.6
rel_whitespace_R 24 82.7
rel_whitespace_src 18 36.2
rel_whitespace_vignettes 16 29.2
rel_whitespace_tests 34 78.1
doclines_per_fn_exp 50 62.8
doclines_per_fn_not_exp 0 0.0 TRUE
fn_call_network_size 50 66.3

3a. Network visualisation

Click to see the interactive network visualisation of calls between objects in package


4. goodpractice and other checks

Details of goodpractice checks (click to open)

3a. Continuous Integration Badges

(There do not appear to be any)

GitHub Workflow Results

id name conclusion sha run_number date
8851531581 pages build and deployment success 21541a 25 2024-04-26
8851531648 pkgcheck failure 21541a 60 2024-04-26
8851531643 pkgdown success 21541a 25 2024-04-26
8851531649 R-CMD-check success 21541a 83 2024-04-26
8851531642 test-coverage success 21541a 83 2024-04-26

3b. goodpractice results

R CMD check with rcmdcheck

R CMD check generated the following warning:

  1. checking whether package ‘QuadratiK’ can be installed ... WARNING
    Found the following significant warnings:
    Warning: 'rgl.init' failed, running with 'rgl.useNULL = TRUE'.
    See ‘/tmp/RtmpQrtXuf/file133861d90686/QuadratiK.Rcheck/00install.out’ for details.

R CMD check generated the following note:

  1. checking installed package size ... NOTE
    installed size is 16.6Mb
    sub-directories of 1Mb or more:
    libs 15.0Mb

R CMD check generated the following check_fails:

  1. no_import_package_as_a_whole
  2. rcmdcheck_examples_run_without_warnings
  3. rcmdcheck_significant_compilation_warnings
  4. rcmdcheck_reasonable_installed_size

Test coverage with covr

Package coverage: 78.21

Cyclocomplexity with cyclocomp

The following function have cyclocomplexity >= 15:

function cyclocomplexity
select_h 46

Static code analyses with lintr

lintr found the following 20 potential issues:

message number of times
Avoid library() and require() calls in packages 9
Lines should not be more than 80 characters. 9
Use <-, not =, for assignment. 2


5. Other Checks

Details of other checks (click to open)

✖️ Package contains the following unexpected files:

  • src/RcppExports.o
  • src/kernel_function.o

✖️ The following function name is duplicated in other packages:

    • extract_stats from ggstatsplot


Package Versions

package version
pkgstats 0.1.3.13
pkgcheck 0.1.2.21
srr 0.1.2.9


Editor-in-Chief Instructions:

Processing may not proceed until the items marked with ✖️ have been resolved.

@giovsaraceno
Copy link
Author

giovsaraceno commented May 13, 2024

We have solved all the marked items and we are now ready to request the automatic bot check.
Thanks

@jooolia
Copy link
Member

jooolia commented May 29, 2024

@ropensci-review-bot check package

@jooolia jooolia self-assigned this May 29, 2024
@ropensci-review-bot
Copy link
Collaborator

Thanks, about to send the query.

@ropensci-review-bot
Copy link
Collaborator

🚀

The following problems were found in your submission template:

  • HTML variable [editor] is missing
  • HTML variable [reviewers-list] is missing
  • HTML variable [due-dates-list] is missing
    Editors: Please ensure these problems with the submission template are rectified. Package checks have been started regardless.

👋

@giovsaraceno
Copy link
Author

Hi @jooolia,
thanks for checking the package. Can you give us indications on how we should address the listed problems?
At the moment, we do not know which information to insert in the mentioned fields (editor, reviewers and due-dates list).
Thanks in advance

@mpadge
Copy link
Member

mpadge commented May 31, 2024

@jooolia The automated checks failed because of issue linked to above. @giovsaraceno When you've fixed this issue and confirmed that pkgcheck workflows once again succeed in your repo, please call @ropensci-review-bot check package here to run checks again. Thanks

@giovsaraceno
Copy link
Author

@ropensci-review-bot check package

@ropensci-review-bot
Copy link
Collaborator

Thanks, about to send the query.

@ropensci-review-bot
Copy link
Collaborator

🚀

The following problems were found in your submission template:

  • HTML variable [editor] is missing
  • HTML variable [reviewers-list] is missing
  • HTML variable [due-dates-list] is missing
    Editors: Please ensure these problems with the submission template are rectified. Package checks have been started regardless.

👋

@giovsaraceno
Copy link
Author

@mpadge I have modified the initial issue text by additing the version submitted and choosing the badge grade. Please let us know if anything else is needed.
Thanks!

@giovsaraceno
Copy link
Author

Please add this badge to the README of your package repository:

[![Status at rOpenSci Software Peer Review](https://badges.ropensci.org/632_status.svg)](https://github.com/ropensci/software-review/issues/632)

Furthermore, if your package does not have a NEWS.md file yet, please create one to capture the changes made during the review process. See https://devguide.ropensci.org/releasing.html#news

We have added the provided badge into the README file and added the NEWS.md file into the package.

@emitanaka
Copy link

@ropensci-review-bot add @kasselhingee as reviewer

@ropensci-review-bot
Copy link
Collaborator

@kasselhingee added to the reviewers list. Review due date is 2024-07-16. Thanks @kasselhingee for accepting to review! Please refer to our reviewer guide.

rOpenSci’s community is our best asset. We aim for reviews to be open, non-adversarial, and focused on improving software quality. Be respectful and kind! See our reviewers guide and code of conduct for more.

@ropensci-review-bot
Copy link
Collaborator

@kasselhingee: If you haven't done so, please fill this form for us to update our reviewers records.

@emitanaka
Copy link

It is working now. Thank you @mpadge !

@kasselhingee
Copy link

Package Review

  • I have no relationship past or present with the package authors.
  • As the reviewer I confirm that there are no conflicts of interest for me to review this work (if you are unsure whether you are in conflict, please speak to your editor before starting your review).

Documentation

The package includes all the following forms of documentation:

  • A statement of need: clearly stating problems the software is designed to solve and its target audience in README
  • Installation instructions: for the development version of package and any non-standard dependencies in README
  • Vignette(s): demonstrating major functionality that runs successfully locally
  • Function Documentation: for all exported functions
  • Examples: (that run successfully locally) for all exported functions
  • Community guidelines: including contribution guidelines in the README or CONTRIBUTING, and DESCRIPTION with URL, BugReports and Maintainer (which may be autogenerated via Authors@R).

Functionality

  • Installation: Installation succeeds as documented.
  • Functionality: Any functional claims of the software have been confirmed.
  • Performance: Any performance claims of the software have been confirmed.
  • Automated tests: Unit tests cover essential functions of the package and a reasonable range of inputs and conditions. All tests pass on the local machine.
  • Packaging guidelines: The package conforms to the rOpenSci packaging guidelines.

Estimated hours spent reviewing:

  • Should the author(s) deem it appropriate, I agree to be acknowledged as a package reviewer ("rev" role) in the package DESCRIPTION file.

Review Comments

I'm only partially familiar with the area of spherical data. Both your package's tests of uniformity and clustering sound useful. The kb.test() tests sound complicated but I'm sure with more explanation their use will become clear.

I found it really hard to understand your package initially. That was because most of the documentation ignores G1.3 on explaining statistical terms. For example, I jumped into the kb.test() part of the package which was really confusing until I found an arXiv document on QuadratiK that had more information.
Although the other parts were less confusing to me, they still didn't explain themselves well, often stating that the function performs "the [a bespoke method by authors]" without explanation of the method.
Because of this, currently your package feels only usable by expert statisticians who have read your papers and want to try out your methods.

I would love it if your readme described the major benefits with more detail. For example, that pk.test() performs much better than competitors when the alternative is multimodal.
Your answers to rOpenSci here state that novel and unique kernel-based methods are used, but you don't say what is good about them. Also, please clarify if the kernel in the 'Poisson kernel-based densities' different from the kernel in your 'kernel-based quadratic distances'.

More on following G1.3. There are many unexplained terms I've never heard of before, a few used differently to expectation, and others used vaguely. At times it felt like these were all the terms!

  • An intro to PKB distributions would be nice.
  • And exact meaning of parameters in rpkb().

Package-Level Comments

Major Functionality Claims

The following appear to be met from the documentation and function results:

  • A test of uniformity on the sphere for multimodal alternatives
  • Clustering algorithms
  • Equal-distribution tests (this is my current guess of what the k-sample tests are)
  • Goodness of fit to an estimated PKB distribution?
  • Some graphics help

However, I haven't checked if they are implemented correctly and the package lacks tests to confirm results in simple situations.

These claims seem unmet to me:

  • I can't see the bridge between statistics and machine learning claimed in the README
  • I haven't got the scatterplots of clustering results to work
  • I can't see the Python version
  • In answering some rOpenSci questions you mentioned a dashboard application. I completely missed it! Where is it?

Statement of need in README

  • Statement of need lacks precision. When should I use your methods vs existing methods?
  • The majority of methods are based on statistical methods published with peer-review by the authors, so I haven't checked the novelty or importance of the methods myself.
    An exception is the 'k-sample goodness of fit tests', which are in a manuscript I couldn't find. Not sure what they are exactly, or how good they are compared to existing methods.
  • Two and k-sample tests often refer to testing equality between groups (or similar).
    • Please be more clear in the README what the hypotheses are in these tests.
    • I'm especially confused by a k-sample test of goodness of fit to probability distribution. Without reading your arXiv document it sounded like it would test the fit of each sample to the model probability distributions.
    • What model does the single-sample goodness of fit test?
    • Specify what class of probability distributions (PKB?)
  • Target audience is unclear in README. Currently I'm thinking expert statisticians and machine learners who have read your papers/manuscripts, but I see your answers to rOpenSci questions hope for more general use.

Installation

  • Doesn't have installation instructions in README for the development version or CRAN version (but should be very easy to add: I did it with devtools::install_github("https://github.com/giovsaraceno/QuadratiK-package/tree/master"))

Dependencies

  • Why is ggplot2, rlecuyer, doParallel, foreach, imported but aren't called according to the automatic check?
  • Package has lots of dependencies (imports) which could make it hard to maintain. rrcov, mvtnorm, movMF, moments, mclust, clusterRepro, cluster, Tinflex, ggpp, ggpubr all seem to be only called once. Are they all crucial? Could they be optional extra functionality?

Parts of the Package

Vignettes

  • Seem to cover the main uses of the package

  • One vignette can't be run locally

  • Overall they appear to be written for people already familiar with the authors' papers or the other vignettes.

  • wireless_clustering vignette

    • Please justify this conversion to the sphere more. "Given that the Wi-Fi signal strength takes values in a limited range, it is appropriate to consider the spherically transformed observations, by L2 normalization, and consequently perform the clustering algorithm on the 7-dimensional sphere." On the face of it a sphere seems in appropriate, why doesn't the total length of a vector in the space matter?
    • Typo: "it returns an object with IGP the InGroup Proportion (IGP), and metrics a table of computed evaluation measures" --> remove 'IGP the'.
    • A chunk is visible but isn't evaluated, which confused me a lot, because the function inside the chunk called validation() doesn't exist. I'm guessing it is now pkbc_validation().
    • What is Average Silhouette Width etc. from pkbc_validation()?
    • The paragraph on plotting is unclear which object the plot method is for. A plot(res_pk) would be nice, with an example plot showing why K=4 looks appropriate.
  • k-sample test vignette

    • "Recall that the computed test statistics correspond to the omnibus tests." Recall from where? And what are the omnibus tests?
    • The commented-out line in the code chunk "#k_test_h <- kb.test(x=x, y=y)" confused me. I think you mean it to be an example of automatic use of select_h()?
    • h_k <- select_h(x=x, y=y, alternative="skewness") takes a really long time on my machine, would be good to mention in the help for select_h() that it takes a long time.
  • I quickly scanned and checked the remaining vignettes could render, but didn't run the code myself.

Function help

  • I suspect help assumes user has read your references e.g.,

    • For pkbc_validation(), how do I interpret all the measures?
    • For kb.test() see my comments about k-sample tests in the README.
  • kb.test()

    • What happens if the reference distribution parameters are NULL?
    • I googled for your 2024 manuscript but couldn't find it, so not sure what the tests are.
    • There are many aspects not seen documented in kb.test(). What are the U-statistics, what is Vn, and a V-statistic etc?
    • The function clearly does a lot of things, but what are those things? All you've said is "the kernel-based quadratic distance tests using the Gaussian kernel with bandwidth parameter h". Please describe what they are.
  • kb.test class

    • Great to have this class documented explicitly.
    • Link to kb.test() and vice versa to explain the objects.
    • The class and the function are so closely linked so consider documenting them in the same .Rd file. But it is okay for them to be separate.
  • pk.test() and pk.test-class

    • What happens if rho is NULL?
    • Would be great to say somewhere when pk.test() is advised - from the referenced paper it is when the alternative is multimodal.
  • pkbc()

    • Written like the user is familiar with the referenced paper
  • pkbc_validation()

    • I'm guessing the columns of pkbc_validation(res)$metrics are the different cluster numbers in res?
  • plot.pkbc()

    • For some reason on my system after installing from CRAN: ?plot.pkbc and help(plot.pkbc) both find nothing. But, for example help(plot.TinflexC) gets me to the appropriate help. Do you know why it isn't working? The manual for plot.pkbc is crucial to understanding the plots so should be accessible from the console.
    • Can you please make it so that it can be used non-interactively?
      • This is crucial for reproducibility for anyone using this function.
    • Please document what the options mean. In the wireless_clustering vignette plot(res_pk) choosing scatter plot then 4 clusters to display didn't make a plot at all! Is this a bug, or what it should be doing? And surely it means displaying the 4-cluster model, rather than say displaying the first 4 clusters of the 6-cluster model?
    • It would be nice if the elbow plot panels made clear which metric they are using, but I know that sort of thing can be hard to program. I'm hoping the order corresponds to the order in the help (left = Euclidean, right = cosine similarity).
  • select_h()

    • Can take a really long time - worth warning the user
  • wine data set

    • Source seems to be missing some spaces
  • wireless data set

    • Source seems to be missing some spaces

Examples

I get warnings from geom_table_npc() when I run the examples. I'm using version 0.5.7 of package ggpp.

Test Coverage

  • Could you test that pk.test() rejects uniformity correctly?
  • You could test rpkb against a known approximation with preexisting simulation methods (e.g. vMF, kent etc?) (currently you only test the mean)
  • Test of pkbc() doesn't seem to check that it gets the clustering correct in a simple situation
  • Likewise 'stats_clusters()andpkbc_validation()` (these all seem to test that the structure/class of the output is correct, but not the actual values of the output)
  • Considering your value of the graphical methods (and that I couldn't get one to work), testing that they work would be good!

Packaging guidelines

  • README should include a brief demonstration
  • No citation instructions yet
  • Guidelines suggest a general vignette to introduce the package
  • Vignettes are missing citations
  • There are places where cross-links would be appropriate to add (e.g. between pb.test help and pb.test-class, link to plot.kpbc help from kbpc and vice-versa).
  • Do you have a pkgdown website yet?

More

  • G1.6 suggests a demo of pk.test() on multimodal data vs another uniform testing method
  • I can see a few spelling errors (e.g. silhouette) according to devtools::spell_check()

@giovsaraceno
Copy link
Author

Hello @kasselhingee
Thank you very much for your review, which we found to be very helpful.

All the points raised in this review will be addressed. I do have some questions associated with some of the points and I would greatly appreciate if you could please provide some guidance.

Here are my questions:

  • Dependencies: some of the listed packages are used for providing additional features to the main functionalities. For example, the packages mclust, clusterRepro and cluster are used for computing the measures of ARI, IGP and ASW in the pkbc_validation() function. This portion could be an optional extra functionality. Could you advise on the best approach to code this?
  • Test coverage for rpkb(): in the second point it is suggested to compare the function against a known approximation method. The proposed function rpkb() is the first in the literature for random generation from a Poisson kernel-based distribution. While using different circular distributions, such as the vMF and Kent, might illustrate the differences between these distributions, it is unclear what specific tests would be most appropriate in this context. Could you provide further clarification or suggestions on this matter?

Thank you again for your time and the helpful review.
Giovanni

@ropensci-review-bot
Copy link
Collaborator

📆 @kasselhingee you have 2 days left before the due date for your review (2024-07-16).

@kasselhingee
Copy link

Hi @giovsaraceno , glad it was helpful.

  • optional extra functionality: I haven't done this much myself, but I would put these packages as 'Suggested' in the DESCRIPTION file and inside pkbc_validation() check for mclust etc (using require(pkgname) say). I've can imagine two situations then (1) ARI, IGP and ASW are core to pkbc_validation(), in which case I've seen other people's packages ask me if I want to install the packages (not sure how they did it) OR (2) they aren't core to pkbc_validation() and you could just print a message() if the packages aren't available. In either case you'll have to make sure the help for pkbc_validation() explains the behaviour.

  • Test coverage for rpkb(): I got the impression that certain parameter settings for the Poisson kernel-based distribution can closely approximate a vMF or Kent etc. I suggest checking that the results of rpkb() are very similar to the results for existing vMF and Kent etc simulators in at least one of these situations. (And if the situation is close to a typical use case for rpkb() that will be an even better test!).

@giovsaraceno
Copy link
Author

Hello @kasselhingee,
Thank you very much for your insightful review. We apologize for the delay in our response. We have addressed the points raised in the review and you can find below point-by-point response.

Review Comments

I'm only partially familiar with the area of spherical data. Both your package's tests of uniformity and clustering sound useful. The kb.test() tests sound complicated but I'm sure with more explanation their use will become clear.

I found it really hard to understand your package initially. That was because most of the documentation ignores G1.3 on explaining statistical terms. For example, I jumped into the kb.test() part of the package which was really confusing until I found an arXiv document on QuadratiK that had more information. Although the other parts were less confusing to me, they still didn't explain themselves well, often stating that the function performs "the [a bespoke method by authors]" without explanation of the method. Because of this, currently your package feels only usable by expert statisticians who have read your papers and want to try out your methods.

I would love it if your readme described the major benefits with more detail. For example, that pk.test() performs much better than competitors when the alternative is multimodal. Your answers to rOpenSci here state that novel and unique kernel-based methods are used, but you don't say what is good about them. Also, please clarify if the kernel in the 'Poisson kernel-based densities' different from the kernel in your 'kernel-based quadratic distances'.

More on following G1.3. There are many unexplained terms I've never heard of before, a few used differently to expectation, and others used vaguely. At times it felt like these were all the terms!

Thank you very much for your detailed and constructive feedback. We acknowledge the complexity of the statistical methods implemented in the package, particularly for users who may not be familiar with spherical data analysis or the specific kernel-based techniques we have employed.
We have taken significant steps to improve the clarity and accessibility of our documentation. Specifically, we have expanded the README to include more detailed explanations of the major benefits of our methods, such as the advantages of the methods; we have added comprehensive descriptions in the function help files, ensuring that users can understand and effectively utilize these methods without needing to consult external research papers.
Moreover, we have reviewed and revised the use of statistical terminology throughout the package to align with the best practices recommended in G1.3, ensuring that terms are used consistently and are well-explained. Our goal is to make the package accessible to a broader audience, not just expert statisticians who are familiar with our previous work.
We hope these improvements address your concerns and make the package more user-friendly and accessible, while still providing the robust statistical tools that are central to its purpose.

  • An intro to PKB distributions would be nice.
  • And exact meaning of parameters in rpkb().

We have added an introduction of the PKB distributions in the README file and the help of the dpkb() and rpkb() functions, including the description of the parameters.

Package-Level Comments

Major Functionality Claims

The following appear to be met from the documentation and function results:

  • A test of uniformity on the sphere for multimodal alternatives
  • Clustering algorithms
  • Equal-distribution tests (this is my current guess of what the k-sample tests are)
  • Goodness of fit to an estimated PKB distribution?
  • Some graphics help

However, I haven't checked if they are implemented correctly and the package lacks tests to confirm results in simple situations.

The package documentation has been significantly enhanced in order to clarify the introduction and usage of the mentioned points. It now provides a more comprehensive overview, featuring a brief introduction to the methods, a clear explanation of the theoretical foundations, and a discussion of the advantages and appropriate use cases. Additionally, we have incorporated further tests in line with the reviewer's suggestions.

These claims seem unmet to me:

  • I can't see the bridge between statistics and machine learning claimed in the README

Please notice that we have removed the mentioned statement from the README file.

  • I haven't got the scatterplots of clustering results to work

Thank you for identifying this bug. We have fixed this and now the scatter-plots are correctly generated.

  • I can't see the Python version

In the README file, we have added an Installation section which includes the link for the Python package. It is implemented separately and it is now under review with pyOpenSci.

  • In answering some rOpenSci questions you mentioned a dashboard application. I completely missed it! Where is it?

The dashboard is generated using the associated Python package. You can find the link with detailed instructions on how to access it in the Installation section of the README file.

Statement of need in README

  • Statement of need lacks precision. When should I use your methods vs existing methods?

The README file and the documentation for the main functions now include a statement highlighting the advantages of the proposed methods and the scenarios in which they are most appropriate.

  • The majority of methods are based on statistical methods published with peer-review by the authors, so I haven't checked the novelty or importance of the methods myself.

Thank you for your feedback. We appreciate your acknowledgment of the peer-reviewed nature of the statistical methods included in the package. While the methods implemented have been extensively validated in the literature and are recognized for their contributions to the field, we believe the practical utility and ease of access these methods offer to users is a significant contribution. We welcome any further insights or suggestions you might have regarding the implementation or documentation of these methods.

An exception is the 'k-sample goodness of fit tests', which are in a manuscript I couldn't find. Not sure what they are exactly, or how good they are compared to existing methods.

We apologize for the inconvenience, it was not available at the time of the first submission. More information about the two and k-sample tests can be found in the newly released arXiv pre-print
Markatou Marianthi & Saraceno Giovanni (2024). “A Unified Framework for Multivariate Two- and k-Sample Kernel-based Quadratic Distance Goodness-of-Fit Tests.” arXiv:2407.16374
including extensive simulation studies comparing the proposed methods with existing methods.

  • Two and k-sample tests often refer to testing equality between groups (or similar).

    • Please be more clear in the README what the hypotheses are in these tests.

Please see our answer above where we state clearly that with the kb.test we offer tests for 1) testing normality of a single sample; 2) testing equality of the distributions of two samples; 3) testing the equality of the distributions of $k$ samples.

  • I'm especially confused by a k-sample test of goodness of fit to probability distribution. Without reading your arXiv document it sounded like it would test the fit of each sample to the model probability distributions.

Apologies, we hope it is clear now.

  • What model does the single-sample goodness of fit test?

It tests if the sample follows a multivariate normal distributions with a given mean vector and covariance matrix. If these parameters are not provided, the standard normal distribution is considered. We hope now it is clearly explained.

  • Specify what class of probability distributions (PKB?)

We refer here to general distributions for the two and $k$-sample tests, that is $H_0:F_1 = F_2 = F$ and $H_0:F_1 = \ldots = F_k = F$, where the distribution $F$ is general and does not need to be specified.

  • Target audience is unclear in README. Currently I'm thinking expert statisticians and machine learners who have read your papers/manuscripts, but I see your answers to rOpenSci questions hope for more general use.

Thank you for your feedback. We have improved the documentation to clearly outline the specific use cases for the package’s functionalities. Our goal is to make the package accessible to a broader audience, including those who may not be familiar with the associated papers. For those interested in a deeper understanding, we have included references to the original research.

Installation

  • Doesn't have installation instructions in README for the development version or CRAN version (but should be very easy to add: I did it with devtools::install_github("https://github.com/giovsaraceno/QuadratiK-package/tree/master"))

The README file now has an Installation section with instructions for installing the CRAN version or the development version of the package.

Dependencies

  • Why is ggplot2, rlecuyer, doParallel, foreach, imported but aren't called according to the automatic check?

The packages 'rlecuyer', 'doParallel', 'foreach' and 'parallel' are fundamental for performing the parallel computing in the select_h function, hence they are kept in the 'Imports' section, as well as the package 'moments'. They are now called, except for 'rlecuyer' which is needed for parallel computing but it has not a direct call.

  • Package has lots of dependencies (imports) which could make it hard to maintain. rrcov, mvtnorm, movMF, moments, mclust, clusterRepro, cluster, Tinflex, ggpp, ggpubr all seem to be only called once. Are they all crucial? Could they be optional extra functionality?

The packages 'mclust', 'clusterRepro' and 'cluster' are used for comupting the ARI, IGP and ASW in the pkbc_validation function. These packages are removed from the 'Imports' and included in the 'Suggests' section of the DESCRIPTION file. If the pkbc_validation function is performed, the user is asked if he/she wants to install them in order to use the function. They are also called during the automatic check. It is clearly explained in the documentation of the pkbc_validation function.

The packages 'Tinflex' and 'movMF' are used for the random generation of points from the Poisson kernel-based distribution through the rpkb() function. Also in this case, the function asks if the user wants to install the packages. They are also called during the automatic check. This is explained in the help documentation of the rpkb() function.

The package 'ggpp' is no longer used.

The packages 'rrcov', 'mvtnorm' and 'ggpubr' are now called according the automatic check. They are not moved to the 'Suggests' section since they are used for the package's functionalities.

There are also the packages 'sphunif' and 'circular' that appear not be called according the automatic check. This happens since they are used in the tests or the vignettes.

Parts of the Package

Vignettes

  • Seem to cover the main uses of the package

Thank you for the comment.

  • One vignette can't be run locally

We checked that all the vignettes can be run locally.

  • Overall they appear to be written for people already familiar with the authors' papers or the other vignettes.

Thank you for the comment. In the revised vignettes we added more details and it is not necessary that readers are familiar with the papers or the other vignettes.

  • wireless_clustering vignette

    • Please justify this conversion to the sphere more. "Given that the Wi-Fi signal strength takes values in a limited range, it is appropriate to consider the spherically transformed observations, by L2 normalization, and consequently perform the clustering algorithm on the 7-dimensional sphere." On the face of it a sphere seems in appropriate, why doesn't the total length of a vector in the space matter?

The proposed test for uniformity and clustering algorithm are tailored for data points on the sphere $\mathcal{S}^{d-1} = {\mathbf{x} \in \mathbb{R}^d : ||\mathbf{x}||=1}$ where $||\mathbf{x}||=\sqrt{x_1^2 + x_2^2 + \ldots + x_d^2}$. The L2 normalization is necessary for performing the methods. The sentence has been rewritten more clearly. It is also specified in the Note section of the help documentation of the pkbc() function.

  • Typo: "it returns an object with IGP the InGroup Proportion (IGP), and metrics a table of computed evaluation measures" --> remove 'IGP the'.

Corrected.

  • A chunk is visible but isn't evaluated, which confused me a lot, because the function inside the chunk called validation() doesn't exist. I'm guessing it is now pkbc_validation().

The chunck has been corrected using the function pkbc_validation() and now it is evaluated correctly.

  • What is Average Silhouette Width etc. from pkbc_validation()?

Each cluster is represented by a so-called silhouette which is based on the comparison of its tightness and separation. The average silhouette width provides an evaluation of clustering validity, and might be used to select an ‘appropriate’ number of clusters.
We have added this information in the documentation of the pkbc_validation() function and specified the corresponding reference.

  • The paragraph on plotting is unclear which object the plot method is for. A plot(res_pk) would be nice, with an example plot showing why K=4 looks appropriate.

We have modified the plot() function such that it directly displays the scatter plot of data points with the elbow plots. An example is added to the wireless_clustering vignette.

  • k-sample test vignette

    • "Recall that the computed test statistics correspond to the omnibus tests." Recall from where? And what are the omnibus tests?

Apologies, now it is clearly stated. We avoided the term 'omnibus tests'.

  • The commented-out line in the code chunk "#k_test_h <- kb.test(x=x, y=y)" confused me. I think you mean it to be an example of automatic use of select_h()?

This chunck shows the usage of kb.test() when $h$ is not specified. In the example, we show the usage of the function select_h() separately.

  • h_k <- select_h(x=x, y=y, alternative="skewness") takes a really long time on my machine, would be good to mention in the help for select_h() that it takes a long time.

Thanks for this comment. We have modified the code such that the mentioned line can be run with the sample size considered in the example. However, this aspect is now mentioned in the help of the select_h() function.

  • I quickly scanned and checked the remaining vignettes could render, but didn't run the code myself.

Thanks, we have checked again the correct compilation of the remaining vignettes.

Function help

  • I suspect help assumes user has read your references e.g.,

We have modified the help such that the corresponding references do not need to be read.

  • For pkbc_validation(), how do I interpret all the measures?

In the details section of the pkbc_validation() help documentation we have added a brief explanation of the computed measures.

  • For kb.test() see my comments about k-sample tests in the README.

We have improved the description of the function kb.test() in the Details section, clearly stating the possible usage.

  • kb.test()

    • What happens if the reference distribution parameters are NULL?

The role of reference distribution parameters has been specified in the Details section.

  • I googled for your 2024 manuscript but couldn't find it, so not sure what the tests are.

Apologies. Our 2024 manuscript is now available on arXiv and we updated the corresponding references.

  • There are many aspects not seen documented in kb.test(). What are the U-statistics, what is Vn, and a V-statistic etc?

In the kb.test() documentation, we now introduce the computed statistics and corresponding results, without the need of knowing the mentioned references. We also added a note with the explanation of the used terms.

  • The function clearly does a lot of things, but what are those things? All you've said is "the kernel-based quadratic distance tests using the Gaussian kernel with bandwidth parameter h". Please describe what they are.

The help of the kb.test() function now includes a detailed description.

  • kb.test class

    • Great to have this class documented explicitly.

Thank you for the comment.

  • Link to kb.test() and vice versa to explain the objects.

We have added the corresponding links.

  • The class and the function are so closely linked so consider documenting them in the same .Rd file. But it is okay for them to be separate.

Thank you for the suggestion. At this step we prefer to organize them separately.

  • pk.test() and pk.test-class

    • What happens if rho is NULL?

rho must be provided for computing the tests. Now this argument has no default value.

  • Would be great to say somewhere when pk.test() is advised - from the referenced paper it is when the alternative is multimodal.

We have added this information in the README file and the help of the pk.test() function.

  • pkbc()

    • Written like the user is familiar with the referenced paper

We added all the relevant information for introducing the clustering algorithm in the help documentation, including information about the initialization and stopping rule.

  • pkbc_validation()

    • I'm guessing the columns of pkbc_validation(res)$metrics are the different cluster numbers in res?

It is correct. It is now clearly explained in the help of the pkbc_validation function.

  • plot.pkbc()

    • For some reason on my system after installing from CRAN: ?plot.pkbc and help(plot.pkbc) both find nothing. But, for example help(plot.TinflexC) gets me to the appropriate help. Do you know why it isn't working? The manual for plot.pkbc is crucial to understanding the plots so should be accessible from the console.

We thank the reviewer for pointing this out. The ?plot.pkb was not accessible for a missing tag for the 'roxygen2' package which is used for building the documentation. Now, the help documentation for the plot.pkbc function is displayed correctly.

  • Can you please make it so that it can be used non-interactively?

    • This is crucial for reproducibility for anyone using this function.

We have modified the plot function as suggested.

  • Please document what the options mean. In the wireless_clustering vignette plot(res_pk) choosing scatter plot then 4 clusters to display didn't make a plot at all! Is this a bug, or what it should be doing? And surely it means displaying the 4-cluster model, rather than say displaying the first 4 clusters of the 6-cluster model?

We thank the reviewer for pointing this out and we fixed the mentioned bug. The scatter plot displays data points in color and the color indicates the cluster membership of each point obtained for the indicated number of clusters. In the current version, if the number of clusters is not provided, the function shows the scatterplot for each possible value of clusters considered, as indicated in the arguments.

  • It would be nice if the elbow plot panels made clear which metric they are using, but I know that sort of thing can be hard to program. I'm hoping the order corresponds to the order in the help (left = Euclidean, right = cosine similarity).

The missing information is now included in the details of the help documentation of the function, as well as it is indicated as header in the created plots.

  • select_h()

    • Can take a really long time - worth warning the user

This information has been added in the Note section of the documentation of the select_h() function.

  • wine data set

    • Source seems to be missing some spaces
  • wireless data set

    • Source seems to be missing some spaces

Corrected.

Examples

I get warnings from geom_table_npc() when I run the examples. I'm using version 0.5.7 of package ggpp.

The function geom_table_npc() is used for displaying the tables of computed statistics together with the qqplots for normality tests, uniformity test and the two-sample test. Since these statistics are reported in the output of the summary() function, we removed the geom_table_npc() function and ggpp package to avoid the warnings.

Test Coverage

  • Could you test that pk.test() rejects uniformity correctly?

It has been added.

  • You could test rpkb against a known approximation with preexisting simulation methods (e.g. vMF, kent etc?) (currently you only test the mean)

Golzy and Markatou (2020) proposed an acceptance-rejection method for simulating data from a PKBD. Furthermore Sablica, Hornik and Leydold (2023) proposed new ways for simulating from the PKBD. We have checked the results in the case $d=2$ and $\rho=0.5$. Generating data from PKBD in this case and data from the Wrapped Chauchy distribution with concentration parameter $\rho=0.5$ and the same location as PKBD is equivalent.

  • Test of pkbc() doesn't seem to check that it gets the clustering correct in a simple situation

We have added a test where three well separed clusters are generated. We test that the clustering algorithm correcly identifies the three clusters.

  • Likewise 'stats_clusters()andpkbc_validation()` (these all seem to test that the structure/class of the output is correct, but not the actual values of the output)

We also added tests for these functions.

  • Considering your value of the graphical methods (and that I couldn't get one to work), testing that they work would be good!

We added a test checking that the plot method does not return any error or warning.

Packaging guidelines

  • README should include a brief demonstration

README file includes installation instructions, describes the main functionalities, and indicates the corresponding citations and references. There is also a link pointing to the introductory vignette of the package.

  • No citation instructions yet

The Citation section has been added.

  • Guidelines suggest a general vignette to introduce the package

An introductory vignette has been added presenting the key features of the packages with simple examples and useful links.

  • Vignettes are missing citations

Citations have been added.

  • There are places where cross-links would be appropriate to add (e.g. between pb.test help and pb.test-class, link to plot.kpbc help from kbpc and vice-versa).

Cross-links have been added

  • Do you have a pkgdown website yet?

Yes, the pkgdown website can be assessed from the GitHub page of the package at the link https://giovsaraceno.github.io/QuadratiK-package/. This link is already present in the DESCRIPTION file and in the Citation section of the README file.

More

  • G1.6 suggests a demo of pk.test() on multimodal data vs another uniform testing method

In the vignette which shows the usage of the test for uniformity on the sphere we added an example generating data from a multimodal distribution and we compared the obtained results with two tests from the literature.

  • I can see a few spelling errors (e.g. silhouette) according to devtools::spell_check()

The suggested command is run and the spelling errors are handled.

@kasselhingee
Copy link

Hi @giovsaraceno great to hear back from you. I've had a busy few weeks, but I hope to get something back to you by the end of next week.

@giovsaraceno
Copy link
Author

Hi @kasselhingee, thanks for the update and for letting me know. I look forward to hearing from you soon.

@kasselhingee
Copy link

@ropensci-review-bot check package

@ropensci-review-bot
Copy link
Collaborator

Thanks, about to send the query.

@ropensci-review-bot
Copy link
Collaborator

🚀

Editor check started

👋

@ropensci-review-bot
Copy link
Collaborator

Checks for QuadratiK (v1.1.2)

git hash: 32b794e7

  • ✔️ Package is already on CRAN.
  • ✔️ has a 'codemeta.json' file.
  • ✔️ has a 'contributing' file.
  • ✔️ uses 'roxygen2'.
  • ✔️ 'DESCRIPTION' has a URL field.
  • ✔️ 'DESCRIPTION' has a BugReports field.
  • ✔️ Package has at least one HTML vignette
  • ✔️ All functions have examples.
  • ✔️ Package has continuous integration checks.
  • ✔️ Package coverage is 100%.
  • ✔️ R CMD check found no errors.
  • ✔️ R CMD check found no warnings.

Package License: GPL (>= 3)


1. rOpenSci Statistical Standards (srr package)

This package is in the following category:

  • Dimensionality Reduction, Clustering and Unsupervised Learning

✔️ All applicable standards [v0.2.0] have been documented in this package (213 complied with; 49 N/A standards)

Click to see the report of author-reported standards compliance of the package with links to associated lines of code, which can be re-generated locally by running the srr_report() function from within a local clone of the repository.


2. Package Dependencies

Details of Package Dependency Usage (click to open)

The table below tallies all function calls to all packages ('ncalls'), both internal (r-base + recommended, along with the package itself), and external (imported and suggested packages). 'NA' values indicate packages to which no identified calls to R functions could be found. Note that these results are generated by an automated code-tagging system which may not be entirely accurate.

type package ncalls
internal base 414
internal QuadratiK 54
internal utils 13
internal graphics 6
internal grDevices 4
imports stats 31
imports methods 27
imports sn 8
imports ggpubr 3
imports scatterplot3d 2
imports parallel 1
imports doParallel 1
imports foreach 1
imports moments 1
imports mvtnorm 1
imports rrcov 1
imports ggplot2 NA
imports Rcpp NA
imports RcppEigen NA
imports rlecuyer NA
suggests Tinflex 3
suggests cluster 1
suggests mclust 1
suggests movMF 1
suggests knitr NA
suggests rmarkdown NA
suggests roxygen2 NA
suggests testthat NA
suggests rgl NA
suggests sphunif NA
suggests circular NA
suggests clusterRepro NA
linking_to Rcpp NA
linking_to RcppEigen NA

Click below for tallies of functions used in each package. Locations of each call within this package may be generated locally by running 's <- pkgstats::pkgstats(<path/to/repo>)', and examining the 'external_calls' table.

base

list (55), c (33), matrix (22), nrow (21), t (21), data.frame (20), rep (20), log (19), ncol (19), for (12), sqrt (12), length (10), as.numeric (9), return (9), mean (8), numeric (8), T (7), unique (7), apply (6), det (6), sample (6), vapply (6), as.factor (5), table (5), as.vector (4), cumsum (4), exp (4), rbind (4), sum (4), as.matrix (3), diag (3), kappa (3), lgamma (3), pi (3), q (3), replace (3), unlist (3), max (2), readline (2), rowSums (2), which (2), which.max (2), beta (1), colMeans (1), expand.grid (1), factor (1), if (1), lapply (1), levels (1), norm (1), rep.int (1), round (1), seq_along (1), seq_len (1), subset (1), with (1)

QuadratiK

kbNormTest (4), compute_CV (3), normal_CV (3), C_d_lambda (2), cv_ksample (2), d2lpdf (2), dlpdf (2), DOF_norm (2), lpdf (2), norm_vec (2), objective_norm (2), poisson_CV (2), rejvmf (2), sample_hypersphere (2), statPoissonUnif (2), compare_qq (1), compute_stats (1), computeKernelMatrix (1), computePoissonMatrix (1), DOF (1), dpkb (1), elbowMethod (1), generate_SN (1), NonparamCentering (1), objective_2 (1), objective_k (1), ParamCentering (1), pkbc_validation (1), rejacg (1), rejpsaw (1), root_func (1), select_h (1), stat_ksample_cpp (1), stat2sample (1), var_norm (1)

stats

df (12), quantile (5), dist (2), qchisq (2), rnorm (2), runif (2), aggregate (1), cov (1), D (1), sd (1), sigma (1), uniroot (1)

methods

setMethod (12), setGeneric (8), new (3), setClass (3), is (1)

utils

data (11), prompt (2)

sn

rmsn (8)

graphics

par (6)

grDevices

colors (2), rainbow (2)

ggpubr

ggarrange (3)

Tinflex

Tinflex.sample (2), Tinflex.setup.C (1)

scatterplot3d

scatterplot3d (2)

cluster

silhouette (1)

doParallel

registerDoParallel (1)

foreach

foreach (1)

mclust

adjustedRandIndex (1)

moments

skewness (1)

movMF

rmovMF (1)

mvtnorm

rmvnorm (1)

parallel

makeCluster (1)

rrcov

PcaLocantore (1)

NOTE: Some imported packages appear to have no associated function calls; please ensure with author that these 'Imports' are listed appropriately.


3. Statistical Properties

This package features some noteworthy statistical properties which may need to be clarified by a handling editor prior to progressing.

Details of statistical properties (click to open)

The package has:

  • code in C++ (20% in 2 files) and R (80% in 14 files)
  • 4 authors
  • 6 vignettes
  • 3 internal data files
  • 15 imported packages
  • 28 exported functions (median 10 lines of code)
  • 60 non-exported functions in R (median 14 lines of code)
  • 20 R functions (median 13 lines of code)

Statistical properties of package structure as distributional percentiles in relation to all current CRAN packages
The following terminology is used:

  • loc = "Lines of Code"
  • fn = "function"
  • exp/not_exp = exported / not exported

All parameters are explained as tooltips in the locally-rendered HTML version of this report generated by the checks_to_markdown() function

The final measure (fn_call_network_size) is the total number of calls between functions (in R), or more abstract relationships between code objects in other languages. Values are flagged as "noteworthy" when they lie in the upper or lower 5th percentile.

measure value percentile noteworthy
files_R 14 68.7
files_src 2 79.5
files_vignettes 12 99.5
files_tests 10 87.4
loc_R 1529 76.0
loc_src 387 44.0
loc_vignettes 762 85.5
loc_tests 638 76.0
num_vignettes 6 97.6 TRUE
data_size_total 77179 81.0
data_size_median 11842 79.6
n_fns_r 88 71.6
n_fns_r_exported 28 75.7
n_fns_r_not_exported 60 70.6
n_fns_src 20 54.9
n_fns_per_file_r 5 72.2
n_fns_per_file_src 10 80.5
num_params_per_fn 4 51.2
loc_per_fn_r 14 43.1
loc_per_fn_r_exp 10 25.3
loc_per_fn_r_not_exp 14 46.7
loc_per_fn_src 13 45.0
rel_whitespace_R 23 81.5
rel_whitespace_src 21 50.9
rel_whitespace_vignettes 31 85.9
rel_whitespace_tests 30 82.1
doclines_per_fn_exp 40 49.5
doclines_per_fn_not_exp 0 0.0 TRUE
doclines_per_fn_src 1 72.6
fn_call_network_size 57 67.7

3a. Network visualisation

Click to see the interactive network visualisation of calls between objects in package


4. goodpractice and other checks

Details of goodpractice checks (click to open)

3a. Continuous Integration Badges

badge License: GPL v3|arXiv CRAN version GitHub version | Project Status: Active – The project has reached a stable, usable state and is being actively developed. Status at rOpenSci Software Peer Review codecov R-CMD-check Lifecycle

GitHub Workflow Results

id name conclusion sha run_number date
11571834514 pages build and deployment success 32b794 145 2024-10-29
11571834630 pkgcheck failure 32b794 182 2024-10-29
11571834646 pkgdown success 32b794 147 2024-10-29
11571834640 R-CMD-check success 32b794 205 2024-10-29
11571834627 test-coverage success 32b794 205 2024-10-29

3b. goodpractice results

R CMD check with rcmdcheck

R CMD check generated the following note:

  1. checking installed package size ... NOTE
    installed size is 20.0Mb
    sub-directories of 1Mb or more:
    libs 18.1Mb

R CMD check generated the following check_fails:

  1. no_import_package_as_a_whole
  2. rcmdcheck_reasonable_installed_size

Test coverage with covr

Package coverage: 100

Cyclocomplexity with cyclocomp

The following functions have cyclocomplexity >= 15:

function cyclocomplexity
select_h 47
pkbc_validation 23
rpkb 15

Static code analyses with lintr

lintr found no issues with this package!


Package Versions

package version
pkgstats 0.2.0.47
pkgcheck 0.1.2.63
srr 0.1.3.26


Editor-in-Chief Instructions:

This package is in top shape and may be passed on to a handling editor

@kasselhingee
Copy link

Hi @giovsaraceno the package is much improved with much more information in the documentation. I noticed a few small things to change and some queries, but I do not need to see it again. Once the small changes have been made I recommend this package for Gold badge.

The below is a report I generated with the help of pkgreviewr::pkgreview_create().

QuadratiK - package review

Reviewer: @kasselhingee

Review Submitted:




This report contains documents associated with the review of rOpenSci submitted package:

QuadratiK: ropensci/software-review issue #632.


Package info

Description:

It includes test for multivariate normality, test for uniformity on the d-dimensional
Sphere, non-parametric two- and k-sample tests, random generation of points from the Poisson
kernel-based density and clustering algorithm for spherical data. For more information see
Saraceno G., Markatou M., Mukhopadhyay R. and Golzy M. (2024)
doi:10.48550/arXiv.2402.02290
Markatou, M. and Saraceno, G. (2024) doi:10.48550/arXiv.2407.16374,
Ding, Y., Markatou, M. and Saraceno, G. (2023) doi:10.5705/ss.202022.0347,
and Golzy, M. and Markatou, M. (2020) doi:10.1080/10618600.2020.1740713.

Author: Giovanni Saraceno [email protected] [aut, cre] (ORCID 000-0002-1753-2367), Marianthi Markatou [aut], Raktim Mukhopadhyay [aut], Mojgan Golzy [aut]

repo url: https://github.com/giovsaraceno/QuadratiK-package

website url: https://giovsaraceno.github.io/QuadratiK-package/

Review info

See reviewer guidelines for further information on the rOpenSci review process.

key review checks:

  • Does the code comply with general principles in the Mozilla reviewing guide?
  • Does the package comply with the ROpenSci packaging guide?
  • Are there improvements that could be made to the code style?
  • Is there code duplication in the package that should be reduced?
  • Are there user interface improvements that could be made?
  • Are there performance improvements that could be made?
  • Is the documentation (installation instructions/vignettes/examples/demos) clear and sufficient?

Please be respectful and kind to the authors in your reviews. The rOpenSci code of conduct is mandatory for everyone involved in our review process.


Outcome

The package is much improved with much more information in the documentation. I noticed a few small things to change and some queries, but I do not need to see it again. Once the small changes have been made I recommend this package for Gold badge.

High Level Check List

Please check off boxes as applicable, and elaborate in comments below. Your review is not limited to these topics, as described in the reviewer guide

  • Briefly describe any working relationship you have (had) with the package authors.
  • As the reviewer I confirm that there are no conflicts of interest for me to review this work (if you are unsure whether you are in conflict, please speak to your editor before starting your review).

Documentation

The package includes all the following forms of documentation:

  • A statement of need: clearly stating problems the software is designed to solve and its target audience in README
  • Installation instructions: for the development version of package and any non-standard dependencies in README
  • Vignette(s): demonstrating major functionality that runs successfully locally
  • Function Documentation: for all exported functions
  • Examples: (that run successfully locally) for all exported functions
  • Community guidelines: including contribution guidelines in the README or CONTRIBUTING, and DESCRIPTION with URL, BugReports and Maintainer (which may be autogenerated via Authors@R).

Functionality

  • Installation: Installation succeeds as documented.
  • Functionality: Any functional claims of the software have been confirmed.
  • Performance: Any performance claims of the software have been confirmed.
  • Automated tests: Unit tests cover essential functions of the package and a reasonable range of inputs and conditions. All tests pass on the local machine.
  • Packaging guidelines: The package conforms to the rOpenSci packaging guidelines.

Estimated hours spent reviewing: 6 hours this time - there are so many things to check!

  • Should the author(s) deem it appropriate, I agree to be acknowledged as a package reviewer ("rev" role) in the package DESCRIPTION file.

session info

sessionInfo()
#> R version 4.4.1 (2024-06-14)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 24.04.1 LTS
#> 
#> Matrix products: default
#> BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.12.0 
#> LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.12.0
#> 
#> locale:
#>  [1] LC_CTYPE=en_AU.UTF-8       LC_NUMERIC=C              
#>  [3] LC_TIME=en_AU.UTF-8        LC_COLLATE=en_AU.UTF-8    
#>  [5] LC_MONETARY=en_AU.UTF-8    LC_MESSAGES=en_AU.UTF-8   
#>  [7] LC_PAPER=en_AU.UTF-8       LC_NAME=C                 
#>  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
#> [11] LC_MEASUREMENT=en_AU.UTF-8 LC_IDENTIFICATION=C       
#> 
#> time zone: Australia/Sydney
#> tzcode source: system (glibc)
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#> [1] devtools_2.4.5     usethis_3.0.0.9000 magrittr_2.0.3    
#> 
#> loaded via a namespace (and not attached):
#>  [1] miniUI_0.1.1.1    jsonlite_1.8.9    compiler_4.4.1    promises_1.3.0   
#>  [5] Rcpp_1.0.13-1     later_1.3.2       jquerylib_0.1.4   yaml_2.3.10      
#>  [9] fastmap_1.2.0     mime_0.12         R6_2.5.1          knitr_1.49       
#> [13] htmlwidgets_1.6.4 profvis_0.4.0     shiny_1.9.1       bslib_0.8.0      
#> [17] rlang_1.1.4       cachem_1.1.0      httpuv_1.6.15     xfun_0.49        
#> [21] fs_1.6.5          sass_0.4.9        pkgload_1.4.0     memoise_2.0.1    
#> [25] cli_3.6.3         digest_0.6.37     rstudioapi_0.17.1 xtable_1.8-4     
#> [29] remotes_2.5.0     lifecycle_1.0.4   vctrs_0.6.5       evaluate_1.0.1   
#> [33] glue_1.8.0        urlchecker_1.0.1  sessioninfo_1.2.2 pkgbuild_1.4.5   
#> [37] rmarkdown_2.29    purrr_1.0.2       tools_4.4.1       ellipsis_0.3.2   
#> [41] htmltools_0.5.8.1

Test installation

test local QuadratiK install:

install(pkg_dir, dependencies = T, build_vignettes = T)
remove.packages("QuadratiK")
#> Removing package from '/home/kassel/R/x86_64-pc-linux-gnu-library/4.4'
#> (as 'lib' is unspecified)

comments:

Local install was seamless.


test install of QuadratiK from GitHub with:

devtools::install_github("giovsaraceno/QuadratiK-package", dependencies = T, build_vignettes = T)
#> Downloading GitHub repo giovsaraceno/QuadratiK-package@HEAD
#> 
#> ── R CMD build ─────────────────────────────────────────────────────────────────
#>   
   checking for file/tmp/Rtmpix2IQB/remotesa0043e3633a2/giovsaraceno-QuadratiK-package-32b794e/DESCRIPTION...checking for file/tmp/Rtmpix2IQB/remotesa0043e3633a2/giovsaraceno-QuadratiK-package-32b794e/DESCRIPTION#> preparingQuadratiK:
#>    checking DESCRIPTION meta-information ...checking DESCRIPTION meta-information
#> ─  cleaning src
#> installing the package to build vignettes
#> 
  
   creating vignettes ...creating vignettes (1m 7.6s)
#> cleaning src
#> checking for LF line-endings in source and make files and shell scripts (685ms)
#> checking for empty or unneeded directories
#> buildingQuadratiK_1.1.2.tar.gz#> 
  
   
#> 
#> Installing package into '/home/kassel/R/x86_64-pc-linux-gnu-library/4.4'
#> (as 'lib' is unspecified)

comments:

Installing from GitHub works.


Check package integrity

run checks on QuadratiK source:

devtools::check(pkg_dir)

comments:

Checks pass.


run tests on QuadratiK source:

comments:

Tests pass.


check QuadratiK for goodpractice:

goodpractice::gp(pkg_dir)
#> ℹ Preparing: covr
#> ℹ Preparing: cyclocomp
#> ── R CMD build ─────────────────────────────────────────────────────────────────
#>   
   checking for file/tmp/Rtmpp2nRfX/remotesb5005dc86caa/QuadratiK/DESCRIPTION...checking for file/tmp/Rtmpp2nRfX/remotesb5005dc86caa/QuadratiK/DESCRIPTION#> preparingQuadratiK:
#> 
  
   checking DESCRIPTION meta-information ...checking DESCRIPTION meta-information
#> cleaning src
#> 
  
   checking vignette meta-information ...checking vignette meta-information
#> checking for LF line-endings in source and make files and shell scripts (880ms)
#> checking for empty or unneeded directories
#> buildingQuadratiK_1.1.2.tar.gz#> 
  
   
#> 
#> ℹ Preparing: description
#> ℹ Preparing: lintr
#> ℹ Preparing: namespace
#> ℹ Preparing: rcmdcheck
#> ── GP QuadratiK ────────────────────────────────────────────────────────────────
#> 
#> It is good practice to
#> 
#>   ✖ avoid long code lines, it is bad for readability. Also, many people prefer
#>     editor windows that are about 80 characters wide. Try make your lines
#>     shorter than 80 characters
#> 
#>     'R/clustering_functions.R:1048:81'
#>     'R/QuadratiK-package.R:62:81'
#>     'tests/testthat/test-pkbc.R:199:81'
#>     'tests/testthat/test-select_h.R:93:81'
#> 
#>   ✖ avoid 1:length(...), 1:nrow(...), 1:ncol(...), 1:NROW(...) and 1:NCOL(...)
#>     expressions. They are error prone and result 1:0 if the expression on the
#>     right hand side is zero. Use seq_len() or seq_along() instead.
#> 
#>     'R/h_selection.R:366:28'
#> 
#>   ✖ not import packages as a whole, as this can cause name clashes between the
#>     imported packages, especially over time as packages change. Instead, import
#>     only the specific functions you need.
#>   ✖ fix this R CMD check NOTE: installed size is 17.3Mb sub-directories of 1Mb
#>     or more: doc 1.0Mb libs 15.3Mb
#> ────────────────────────────────────────────────────────────────────────────────

comments:

I agree with the above suggestions, but not sure if they are part of rOpenSci standards. Particularly the suggestion about reducing size I would ignore (though I am surprised libs is so large).


Check package metadata files

inspect

  • is d the dimension of the ambient Euclidean space or the dimension of the sphere? Seems like you're using it for both which I found hard to untangle.

  • Is the Poisson kernal's $x$ on the sphere too? ($d$-variate makes it sound like Euclidean space)

  • x matrix of data points: please specify whether each data point a row in the matrix (like in tidyverse) or a column?

  • Namespace has a duplication. I'm not sure what R would actually be doing in this case.

useDynLib(QuadratiK)
useDynLib(QuadratiK, .registration = TRUE)
  • A few packages are fully imported according to the NAMESPACE. This makes sense when you're using alot of the functionality of that package, or a lot of functions from that package. I'd guess that most can be converted to importFrom() statements.
import(Rcpp)
import(RcppEigen)
import(foreach)
import(ggplot2)
import(rlecuyer)

spell check

devtools::spell_check(pkg_dir)
#>   WORD               FOUND IN
#> Aeberhard          wine.Rd:27,51
#> Ajne               uniformity.Rmd:107,120
#> al                 dpkb.Rd:40,42
#>                    kb.test.Rd:151,162
#>                    plot.pkbc.Rd:36,59
#>                    wireless.Rd:49
#> ARI                pkbc_validation.Rd:31,42,44,66,69
#>                    wireless_clustering.Rmd:42,51
#> arxiv              select_h.Rd:154
#> arXiv              compute_CV.Rd:56
#>                    kb.test.Rd:288
#>                    QuadratiK-package.Rd:75,89
#>                    select_h.Rd:149
#>                    description:5,6
#>                    README.md:5,67,67,88
#>                    Introduction.Rmd:55,236
#> ASW                pkbc_validation.Rd:70
#>                    wireless_clustering.Rmd:42
#> Bhatt              wireless.Rd:21,46
#> Bingham            uniformity.Rmd:107,120
#> biostatistics      pkbc_validation.Rd:95
#> Biostatistics      pkbc_validation.Rd:94
#> cdot               uniformity.Rmd:34
#> cen                uniformity.Rmd:23,34
#> CMD                NEWS.md:13,73
#>                    README.md:5
#> codecov            README.md:5
#> codemeta           NEWS.md:81
#> Coomans            wine.Rd:51
#> cr                 TwoSample_test.Rmd:59
#> dBm                wireless.Rd:35
#>                    wireless_clustering.Rmd:27
#> De                 wine.Rd:51
#> det                kSample_test.Rmd:40
#>                    TwoSample_test.Rmd:45
#> DOF                DOF_norm.Rd:5,15,19
#>                    DOF.Rd:5,15,19
#> doi                breast_cancer.Rd:14
#>                    compute_CV.Rd:56
#>                    kb.test.Rd:288
#>                    pk.test.Rd:119
#>                    pkbc_validation.Rd:95
#>                    plot.pkbc.Rd:61
#>                    poisson_CV.Rd:38
#>                    QuadratiK-package.Rd:79,89
#>                    select_h.Rd:149
#>                    wine.Rd:29
#>                    wireless.Rd:23,51
#>                    description:5,6,7,8
#>                    README.md:84
#>                    Introduction.Rmd:238
#>                    uniformity.Rmd:126
#> DOI                dpkb.Rd:118
#>                    kb.test.Rd:293
#>                    pkbc.Rd:154
#>                    QuadratiK-package.Rd:84
#>                    README.md:86
#>                    generate_rpkb.Rmd:126,130
#>                    Introduction.Rmd:240
#>                    wireless_clustering.Rmd:93
#> DW                 breast_cancer.Rd:14
#> EJS                generate_rpkb.Rmd:130
#> et                 dpkb.Rd:40,42
#>                    kb.test.Rd:151,162
#>                    plot.pkbc.Rd:36,59
#>                    wireless.Rd:49
#> Fi                 wireless.Rd:30,31,35
#>                    wireless_clustering.Rmd:27,29
#> Fk                 var_k.Rd:21
#> FNA                breast_cancer.Rd:22
#> Forina             wine.Rd:27
#> frac               README.md:16
#>                    generate_rpkb.Rmd:29
#>                    kSample_test.Rmd:32,32,40,47,68
#>                    TwoSample_test.Rmd:36,36,36,40,40,45,52
#>                    uniformity.Rmd:23,34
#> ge                 README.md:11
#>                    kSample_test.Rmd:128
#> geospatial         Introduction.Rmd:14
#> ggplot             compare_qq.Rd:5,20
#> gitignore          NEWS.md:84
#> Hornik             dpkb.Rd:80,120
#>                    generate_rpkb.Rmd:48,128
#> https              breast_cancer.Rd:14
#>                    compute_CV.Rd:56
#>                    kb.test.Rd:288
#>                    pkbc_validation.Rd:95
#>                    plot.pkbc.Rd:61
#>                    QuadratiK-package.Rd:89
#>                    select_h.Rd:149,154
#>                    wine.Rd:29
#>                    wireless.Rd:23,51
#> hypersphere        pkbc.Rd:85
#>                    sample_hypersphere.Rd:5,18
#> IGP                pkbc_validation.Rd:35,38,69
#>                    wireless_clustering.Rmd:42
#> ij                 kSample_test.Rmd:32,57
#> InGroup            wireless_clustering.Rmd:42
#> initializers       pkbc.Rd:104
#> interpretability   QuadratiK-package.Rd:55
#>                    README.md:20
#>                    Introduction.Rmd:193
#> json               NEWS.md:81
#> Kaleida            QuadratiK-package.Rd:63
#>                    README.md:92
#> Kapp               pkbc_validation.Rd:30,40,93
#> KBQD               compute_CV.Rd:5
#>                    cv_ksample.Rd:5,44
#>                    kb.test.Rd:7,146
#>                    normal_CV.Rd:5,30,35
#>                    pk.test.Rd:63,72
#>                    poisson_CV.Rd:5
#>                    select_h.Rd:80,86
#>                    TwoSample_test.Rmd:30,59
#>                    uniformity.Rmd:23,34
#> kxj                pkbc_validation.Rd:95
#> ldots              README.md:13
#>                    Introduction.Rmd:122
#>                    kSample_test.Rmd:17,20,47
#>                    TwoSample_test.Rmd:18,59
#>                    uniformity.Rmd:17
#> le                 kSample_test.Rmd:20
#> Leydold            dpkb.Rd:80,120
#>                    generate_rpkb.Rmd:48,128
#> Lifecycle          README.md:5
#> Locantore          plot.pkbc.Rd:36,59
#> logLikelihood      summary.pkbc.Rd:14
#> Mangasarian        breast_cancer.Rd:11,32
#> Marron             plot.pkbc.Rd:59
#> mathbb             README.md:16
#>                    generate_rpkb.Rmd:23,34
#>                    Introduction.Rmd:142
#>                    kSample_test.Rmd:47
#>                    TwoSample_test.Rmd:52
#> mathbf             README.md:16
#>                    generate_rpkb.Rmd:23,29
#>                    kSample_test.Rmd:17,18,30,32,40,47,57
#>                    TwoSample_test.Rmd:35,36,36,40,40,45,52
#>                    uniformity.Rmd:23,25,34,36,48
#> mathcal            README.md:16
#>                    generate_rpkb.Rmd:23
#>                    Introduction.Rmd:93,142
#>                    uniformity.Rmd:17,48
#> mathrm             kSample_test.Rmd:57
#>                    TwoSample_test.Rmd:35
#> mbox               kSample_test.Rmd:32,36,51
#> md                 NEWS.md:6,60,64,77,82
#> Memb               predict.pkbc.Rd:23
#> milliwatts         wireless.Rd:35
#>                    wireless_clustering.Rmd:27
#> Mises              dpkb.Rd:37,79
#>                    generate_rpkb.Rmd:48
#>                    uniformity.Rmd:84
#> Narayanan          wireless.Rd:46
#> Nonparam           kb.test.Rd:78
#>                    NEWS.md:87
#> nonumber           TwoSample_test.Rmd:41
#> Param              kb.test.Rd:78
#> parametrically     kb.test.Rd:190
#> PCALocantore       plot.pkbc.Rd:21
#> Perumal            wireless.Rd:46
#> pkbc               plot.pkbc.Rd:30
#>                    summary.pkbc.Rd:20
#> PKBD               dpkb.Rd:6,52,74,79,81,120
#>                    QuadratiK-package.Rd:31,33
#>                    summary.pkbc.Rd:6
#>                    NEWS.md:16
#>                    README.md:16,16
#>                    generate_rpkb.Rmd:48,128
#>                    Introduction.Rmd:151,153,230
#> PKBDs              dpkb.Rd:71,72
#>                    generate_rpkb.Rmd:34
#> pkgcheck           NEWS.md:73
#> pkgdown            NEWS.md:65
#> poisson            poisson_CV.Rd:26
#> preprint           QuadratiK-package.Rd:75
#>                    README.md:67
#>                    Introduction.Rmd:55
#> Probs              predict.pkbc.Rd:24
#> PyPI               QuadratiK-package.Rd:67
#>                    README.md:39
#>                    Introduction.Rmd:32
#> qq                 summary.kb.test.Rd:20
#>                    summary.pk.test.Rd:20
#>                    TwoSample_test.Rmd:94
#>                    uniformity.Rmd:73,78
#> qquad              kSample_test.Rmd:32,36
#> Rbuildignore       NEWS.md:81,82
#> README             NEWS.md:8,77
#> rejacg             dpkb.Rd:46,48,97
#> Rohra              wireless.Rd:46
#> rOpenSci           NEWS.md:13
#>                    README.md:5
#> Rousseeuw          pkbc_validation.Rd:31,52,97
#> roxygen            NEWS.md:27
#> Sablica            dpkb.Rd:40,42,80,120
#>                    generate_rpkb.Rmd:48,128
#> Satterthwaite      kb.test.Rd:160
#>                    pk.test.Rd:87
#> Sinica             pk.test.Rd:118
#>                    poisson_CV.Rd:38
#>                    QuadratiK-package.Rd:79
#>                    README.md:84
#>                    Introduction.Rmd:238
#>                    uniformity.Rmd:125
#> SPIE               breast_cancer.Rd:34
#> Springer           wireless.Rd:51
#> srr                NEWS.md:61
#> ss                 pk.test.Rd:119
#>                    poisson_CV.Rd:38
#>                    QuadratiK-package.Rd:79
#>                    description:7
#>                    README.md:84
#>                    Introduction.Rmd:238
#>                    uniformity.Rmd:126
#> Statistica         pk.test.Rd:118
#>                    poisson_CV.Rd:38
#>                    QuadratiK-package.Rd:79
#>                    README.md:84
#>                    Introduction.Rmd:238
#>                    uniformity.Rmd:125
#> testthat           NEWS.md:15
#> th                 kb.test.Rd:168
#>                    pk.test.Rd:97
#>                    predict.pkbc.Rd:25
#> Thakur             wireless.Rd:46
#> Tibshirani         pkbc_validation.Rd:30,40,93
#> UCI                breast_cancer.Rd:13
#>                    wine.Rd:28
#>                    wireless.Rd:22
#>                    wireless_clustering.Rmd:20
#> Un                 kb.test-class.Rd:30
#>                    kb.test.Rd:99,100
#>                    pk.test-class.Rd:20,23
#>                    pk.test.Rd:20,35,37,46
#> unbiasedness       kb.test.Rd:259
#>                    pk.test.Rd:106
#> Vel                wine.Rd:51
#> Vn                 kb.test-class.Rd:26,32
#>                    kb.test.Rd:103,104
#>                    pk.test-class.Rd:27,31
#>                    pk.test.Rd:38,39,42
#> von                dpkb.Rd:37,79
#>                    generate_rpkb.Rmd:48
#>                    uniformity.Rmd:84
#> wcss               plot.pkbc.Rd:41,47
#>                    summary.pkbc.Rd:15
#> Wi                 wireless.Rd:30,31,35
#>                    wireless_clustering.Rmd:27,29
#> Wolberg            breast_cancer.Rd:11,32
#> Yuxin              poisson_CV.Rd:36

comments:

Spelling looks okay.


Check documentation

online documentation: https://giovsaraceno.github.io/QuadratiK-package/

  • Is the documentation (installation instructions/vignettes/examples/demos) clear and sufficient?
  • The documentation is generally clear and sufficient.

  • I still have an issue with your reason for L2 normalising: `Given that Wi-Fi signal strength values are inherently bounded within a certain range, it is possible to consider the spherically transformed data points using L2 normalization.' To me this would usually not be a good enough reason in a statistical application because the strength within that range could be important. As this is just a demonstration I suggest:

    • Say that the normalisation is for demonstration purposes
    • OR say why the strength doesn't matter even within the bounded range (is it irrelevant to the question or does it have a too much noise etc)

test QuadratiK function help files:

help(package = "QuadratiK")

comments:

  • x matrix of data points: is each data point a row (like in tidyverse) or a column?

  • it is very nice now!!

  • `so-called silhouette which is based on the comparison of its tightness and separation' tells me what it is based on, but still I don't actually know what it represents. E.g. the ratio of tightness to separation?? Does it have an analogy with a shadow projected onto a surface?

  • A more statistical methodology question: I haven't looked into your bootstrapping method for your test so you likely already doing this: I'm curious if you are resampling from the null hypothesis and if your statistic is pivotal or at least stays on the same scale? The reasons for doing this are nicely summarised by Hall and Wilson (1991) doi:10.2307/2532163.

  • it is unusual that rpkb() returns a list of items, more than just the simulated data. Most samplers in R (eg. runif()) return just a vector or matrix of simulated data. Not necessary, but users would find it easier if you followed this convention. Please document in either case. Also is it rows or columns that are a `data point' in the return?


test QuadratiK vignettes:

comments:

Introduction:

  • In the uniformity test, the $d$-dimensional sphere is usually written $S^d$ and is a manifold of $d$-dimensions. You seem to want to mean the unit sphere in $R^d$ space, which is a $d-1$-dimensional sphere.

  • Why set $\rho = 0.7$ for pk.test()? Would other values work better in other situations? If there is a good broadly useful value of $\rho$ then it would be good to make that the default.

  • Is pk.test(x, rho = 1) meant to work? I get NA values, but maybe I should get the same error as for $\rho = 10$?

  • haven't looked at other vignettes


Test functionality:

  • Are there user interface improvements that could be made?
  • Are there performance improvements that could be made?
library("QuadratiK")
exports <-ls("package:QuadratiK")
exports
#>   [1] "_QuadratiK_computeKernelMatrix"  "_QuadratiK_computePoissonMatrix"
#>   [3] "_QuadratiK_kbNormTest"           "_QuadratiK_NonparamCentering"   
#>   [5] "_QuadratiK_ParamCentering"       "_QuadratiK_stat_ksample_cpp"    
#>   [7] "_QuadratiK_stat2sample"          "_QuadratiK_statPoissonUnif"     
#>   [9] "_QuadratiK_var_k"                "_QuadratiK_var_two"             
#>  [11] "%:%"                             "%+%"                            
#>  [13] "%+replace%"                      "%do%"                           
#>  [15] "%dopar%"                         "accumulate"                     
#>  [17] "aes"                             "aes_"                           
#>  [19] "aes_all"                         "aes_auto"                       
#>  [21] "aes_q"                           "aes_string"                     
#>  [23] "after_scale"                     "after_stat"                     
#>  [25] "aggregate"                       "alpha"                          
#>  [27] "annotate"                        "annotation_custom"              
#>  [29] "annotation_logticks"             "annotation_map"                 
#>  [31] "annotation_raster"               "arrow"                          
#>  [33] "as_label"                        "as_labeller"                    
#>  [35] "autolayer"                       "autoplot"                       
#>  [37] "AxisSecondary"                   "benchplot"                      
#>  [39] "binned_scale"                    "borders"                        
#>  [41] "breast_cancer"                   "calc_element"                   
#>  [43] "check_device"                    "clusterExport"                  
#>  [45] "combine_vars"                    "compare_qq"                     
#>  [47] "compileAttributes"               "compute_CV"                     
#>  [49] "compute_stats"                   "computeKernelMatrix"            
#>  [51] "computePoissonMatrix"            "continuous_scale"               
#>  [53] "Coord"                           "coord_cartesian"                
#>  [55] "coord_equal"                     "coord_fixed"                    
#>  [57] "coord_flip"                      "coord_map"                      
#>  [59] "coord_munch"                     "coord_polar"                    
#>  [61] "coord_quickmap"                  "coord_radial"                   
#>  [63] "coord_sf"                        "coord_trans"                    
#>  [65] "CoordCartesian"                  "CoordFixed"                     
#>  [67] "CoordFlip"                       "CoordMap"                       
#>  [69] "CoordPolar"                      "CoordQuickmap"                  
#>  [71] "CoordRadial"                     "CoordSf"                        
#>  [73] "CoordTrans"                      "cov"                            
#>  [75] "cpp_object_dummy"                "cpp_object_initializer"         
#>  [77] "cppFunction"                     "cut_interval"                   
#>  [79] "cut_number"                      "cut_width"                      
#>  [81] "cv_ksample"                      "datetime_scale"                 
#>  [83] "demangle"                        "derive"                         
#>  [85] "detectCores"                     "discrete_scale"                 
#>  [87] "dist"                            "DOF"                            
#>  [89] "DOF_norm"                        "dpkb"                           
#>  [91] "draw_key_abline"                 "draw_key_blank"                 
#>  [93] "draw_key_boxplot"                "draw_key_crossbar"              
#>  [95] "draw_key_dotplot"                "draw_key_label"                 
#>  [97] "draw_key_linerange"              "draw_key_path"                  
#>  [99] "draw_key_point"                  "draw_key_pointrange"            
#> [101] "draw_key_polygon"                "draw_key_rect"                  
#> [103] "draw_key_smooth"                 "draw_key_text"                  
#> [105] "draw_key_timeseries"             "draw_key_vline"                 
#> [107] "draw_key_vpath"                  "dup_axis"                       
#> [109] "el_def"                          "elbowMethod"                    
#> [111] "element_blank"                   "element_grob"                   
#> [113] "element_line"                    "element_rect"                   
#> [115] "element_render"                  "element_text"                   
#> [117] "enexpr"                          "enexprs"                        
#> [119] "enquo"                           "enquos"                         
#> [121] "ensym"                           "ensyms"                         
#> [123] "evalCpp"                         "expand_limits"                  
#> [125] "expand_scale"                    "expansion"                      
#> [127] "exposeClass"                     "expr"                           
#> [129] "Facet"                           "facet_grid"                     
#> [131] "facet_null"                      "facet_wrap"                     
#> [133] "FacetGrid"                       "FacetNull"                      
#> [135] "FacetWrap"                       "fastLm"                         
#> [137] "fastLmPure"                      "fill_alpha"                     
#> [139] "find_panel"                      "flip_data"                      
#> [141] "flipped_names"                   "foreach"                        
#> [143] "formals<-"                       "fortify"                        
#> [145] "generate_SN"                     "Geom"                           
#> [147] "geom_abline"                     "geom_area"                      
#> [149] "geom_bar"                        "geom_bin_2d"                    
#> [151] "geom_bin2d"                      "geom_blank"                     
#> [153] "geom_boxplot"                    "geom_col"                       
#> [155] "geom_contour"                    "geom_contour_filled"            
#> [157] "geom_count"                      "geom_crossbar"                  
#> [159] "geom_curve"                      "geom_density"                   
#> [161] "geom_density_2d"                 "geom_density_2d_filled"         
#> [163] "geom_density2d"                  "geom_density2d_filled"          
#> [165] "geom_dotplot"                    "geom_errorbar"                  
#> [167] "geom_errorbarh"                  "geom_freqpoly"                  
#> [169] "geom_function"                   "geom_hex"                       
#> [171] "geom_histogram"                  "geom_hline"                     
#> [173] "geom_jitter"                     "geom_label"                     
#> [175] "geom_line"                       "geom_linerange"                 
#> [177] "geom_map"                        "geom_path"                      
#> [179] "geom_point"                      "geom_pointrange"                
#> [181] "geom_polygon"                    "geom_qq"                        
#> [183] "geom_qq_line"                    "geom_quantile"                  
#> [185] "geom_raster"                     "geom_rect"                      
#> [187] "geom_ribbon"                     "geom_rug"                       
#> [189] "geom_segment"                    "geom_sf"                        
#> [191] "geom_sf_label"                   "geom_sf_text"                   
#> [193] "geom_smooth"                     "geom_spoke"                     
#> [195] "geom_step"                       "geom_text"                      
#> [197] "geom_tile"                       "geom_violin"                    
#> [199] "geom_vline"                      "GeomAbline"                     
#> [201] "GeomAnnotationMap"               "GeomArea"                       
#> [203] "GeomBar"                         "GeomBlank"                      
#> [205] "GeomBoxplot"                     "GeomCol"                        
#> [207] "GeomContour"                     "GeomContourFilled"              
#> [209] "GeomCrossbar"                    "GeomCurve"                      
#> [211] "GeomCustomAnn"                   "GeomDensity"                    
#> [213] "GeomDensity2d"                   "GeomDensity2dFilled"            
#> [215] "GeomDotplot"                     "GeomErrorbar"                   
#> [217] "GeomErrorbarh"                   "GeomFunction"                   
#> [219] "GeomHex"                         "GeomHline"                      
#> [221] "GeomLabel"                       "GeomLine"                       
#> [223] "GeomLinerange"                   "GeomLogticks"                   
#> [225] "GeomMap"                         "GeomPath"                       
#> [227] "GeomPoint"                       "GeomPointrange"                 
#> [229] "GeomPolygon"                     "GeomQuantile"                   
#> [231] "GeomRaster"                      "GeomRasterAnn"                  
#> [233] "GeomRect"                        "GeomRibbon"                     
#> [235] "GeomRug"                         "GeomSegment"                    
#> [237] "GeomSf"                          "GeomSmooth"                     
#> [239] "GeomSpoke"                       "GeomStep"                       
#> [241] "GeomText"                        "GeomTile"                       
#> [243] "GeomViolin"                      "GeomVline"                      
#> [245] "get_alt_text"                    "get_element_tree"               
#> [247] "get_guide_data"                  "getDoParName"                   
#> [249] "getDoParRegistered"              "getDoParVersion"                
#> [251] "getDoParWorkers"                 "getDoSeqName"                   
#> [253] "getDoSeqRegistered"              "getDoSeqVersion"                
#> [255] "getDoSeqWorkers"                 "getErrorIndex"                  
#> [257] "getErrorValue"                   "getexports"                     
#> [259] "getRcppVersion"                  "getResult"                      
#> [261] "gg_dep"                          "ggarrange"                      
#> [263] "ggplot"                          "ggplot_add"                     
#> [265] "ggplot_build"                    "ggplot_gtable"                  
#> [267] "ggplotGrob"                      "ggproto"                        
#> [269] "ggproto_parent"                  "ggsave"                         
#> [271] "ggtitle"                         "Guide"                          
#> [273] "guide_axis"                      "guide_axis_logticks"            
#> [275] "guide_axis_stack"                "guide_axis_theta"               
#> [277] "guide_bins"                      "guide_colorbar"                 
#> [279] "guide_colorsteps"                "guide_colourbar"                
#> [281] "guide_coloursteps"               "guide_custom"                   
#> [283] "guide_gengrob"                   "guide_geom"                     
#> [285] "guide_legend"                    "guide_merge"                    
#> [287] "guide_none"                      "guide_train"                    
#> [289] "guide_transform"                 "GuideAxis"                      
#> [291] "GuideAxisLogticks"               "GuideAxisStack"                 
#> [293] "GuideAxisTheta"                  "GuideBins"                      
#> [295] "GuideColourbar"                  "GuideColoursteps"               
#> [297] "GuideCustom"                     "GuideLegend"                    
#> [299] "GuideNone"                       "GuideOld"                       
#> [301] "guides"                          "has_flipped_aes"                
#> [303] "initialize"                      "install.packages"               
#> [305] "IQR"                             "is.Coord"                       
#> [307] "is.facet"                        "is.ggplot"                      
#> [309] "is.ggproto"                      "is.theme"                       
#> [311] "kb.test"                         "kbNormTest"                     
#> [313] "label_both"                      "label_bquote"                   
#> [315] "label_context"                   "label_parsed"                   
#> [317] "label_value"                     "label_wrap_gen"                 
#> [319] "labeller"                        "labs"                           
#> [321] "last_plot"                       "layer"                          
#> [323] "layer_data"                      "layer_grob"                     
#> [325] "layer_scales"                    "layer_sf"                       
#> [327] "Layout"                          "LdFlags"                        
#> [329] "legend"                          "library.dynam"                  
#> [331] "library.dynam.unload"            "lims"                           
#> [333] "loadModule"                      "loadRcppClass"                  
#> [335] "loadRcppModules"                 "makeAccum"                      
#> [337] "makeCluster"                     "map_data"                       
#> [339] "margin"                          "max_height"                     
#> [341] "max_width"                       "mean_cl_boot"                   
#> [343] "mean_cl_normal"                  "mean_sdl"                       
#> [345] "mean_se"                         "median"                         
#> [347] "median_hilow"                    "merge_element"                  
#> [349] "Module"                          "new"                            
#> [351] "new_guide"                       "NonparamCentering"              
#> [353] "normal_CV"                       "old_guide"                      
#> [355] "panel_cols"                      "panel_rows"                     
#> [357] "par"                             "ParamCentering"                 
#> [359] "pattern_alpha"                   "PcaLocantore"                   
#> [361] "pk.test"                         "pkbc"                           
#> [363] "pkbc_validation"                 "plot"                           
#> [365] "poisson_CV"                      "populate"                       
#> [367] "Position"                        "position_dodge"                 
#> [369] "position_dodge2"                 "position_fill"                  
#> [371] "position_identity"               "position_jitter"                
#> [373] "position_jitterdodge"            "position_nudge"                 
#> [375] "position_stack"                  "PositionDodge"                  
#> [377] "PositionDodge2"                  "PositionFill"                   
#> [379] "PositionIdentity"                "PositionJitter"                 
#> [381] "PositionJitterdodge"             "PositionNudge"                  
#> [383] "PositionStack"                   "power"                          
#> [385] "predict"                         "prompt"                         
#> [387] "qchisq"                          "qplot"                          
#> [389] "qqnorm"                          "quantile"                       
#> [391] "quickplot"                       "quo"                            
#> [393] "quo_name"                        "quos"                           
#> [395] "rainbow"                         "Rcpp.package.skeleton"          
#> [397] "Rcpp.plugin.maker"               "RcppEigen.package.skeleton"     
#> [399] "RcppLdFlags"                     "register_theme_elements"        
#> [401] "registerDoParallel"              "registerDoSEQ"                  
#> [403] "registerPlugin"                  "rejacg"                         
#> [405] "rejpsaw"                         "rejvmf"                         
#> [407] "rel"                             "remove_missing"                 
#> [409] "render_axes"                     "render_strips"                  
#> [411] "reset_theme_settings"            "resolution"                     
#> [413] "rmsn"                            "rmvnorm"                        
#> [415] "rnorm"                           "rpkb"                           
#> [417] "runif"                           "sample_hypersphere"             
#> [419] "Scale"                           "scale_alpha"                    
#> [421] "scale_alpha_binned"              "scale_alpha_continuous"         
#> [423] "scale_alpha_date"                "scale_alpha_datetime"           
#> [425] "scale_alpha_discrete"            "scale_alpha_identity"           
#> [427] "scale_alpha_manual"              "scale_alpha_ordinal"            
#> [429] "scale_color_binned"              "scale_color_brewer"             
#> [431] "scale_color_continuous"          "scale_color_date"               
#> [433] "scale_color_datetime"            "scale_color_discrete"           
#> [435] "scale_color_distiller"           "scale_color_fermenter"          
#> [437] "scale_color_gradient"            "scale_color_gradient2"          
#> [439] "scale_color_gradientn"           "scale_color_grey"               
#> [441] "scale_color_hue"                 "scale_color_identity"           
#> [443] "scale_color_manual"              "scale_color_ordinal"            
#> [445] "scale_color_steps"               "scale_color_steps2"             
#> [447] "scale_color_stepsn"              "scale_color_viridis_b"          
#> [449] "scale_color_viridis_c"           "scale_color_viridis_d"          
#> [451] "scale_colour_binned"             "scale_colour_brewer"            
#> [453] "scale_colour_continuous"         "scale_colour_date"              
#> [455] "scale_colour_datetime"           "scale_colour_discrete"          
#> [457] "scale_colour_distiller"          "scale_colour_fermenter"         
#> [459] "scale_colour_gradient"           "scale_colour_gradient2"         
#> [461] "scale_colour_gradientn"          "scale_colour_grey"              
#> [463] "scale_colour_hue"                "scale_colour_identity"          
#> [465] "scale_colour_manual"             "scale_colour_ordinal"           
#> [467] "scale_colour_steps"              "scale_colour_steps2"            
#> [469] "scale_colour_stepsn"             "scale_colour_viridis_b"         
#> [471] "scale_colour_viridis_c"          "scale_colour_viridis_d"         
#> [473] "scale_continuous_identity"       "scale_discrete_identity"        
#> [475] "scale_discrete_manual"           "scale_fill_binned"              
#> [477] "scale_fill_brewer"               "scale_fill_continuous"          
#> [479] "scale_fill_date"                 "scale_fill_datetime"            
#> [481] "scale_fill_discrete"             "scale_fill_distiller"           
#> [483] "scale_fill_fermenter"            "scale_fill_gradient"            
#> [485] "scale_fill_gradient2"            "scale_fill_gradientn"           
#> [487] "scale_fill_grey"                 "scale_fill_hue"                 
#> [489] "scale_fill_identity"             "scale_fill_manual"              
#> [491] "scale_fill_ordinal"              "scale_fill_steps"               
#> [493] "scale_fill_steps2"               "scale_fill_stepsn"              
#> [495] "scale_fill_viridis_b"            "scale_fill_viridis_c"           
#> [497] "scale_fill_viridis_d"            "scale_linetype"                 
#> [499] "scale_linetype_binned"           "scale_linetype_continuous"      
#> [501] "scale_linetype_discrete"         "scale_linetype_identity"        
#> [503] "scale_linetype_manual"           "scale_linewidth"                
#> [505] "scale_linewidth_binned"          "scale_linewidth_continuous"     
#> [507] "scale_linewidth_date"            "scale_linewidth_datetime"       
#> [509] "scale_linewidth_discrete"        "scale_linewidth_identity"       
#> [511] "scale_linewidth_manual"          "scale_linewidth_ordinal"        
#> [513] "scale_radius"                    "scale_shape"                    
#> [515] "scale_shape_binned"              "scale_shape_continuous"         
#> [517] "scale_shape_discrete"            "scale_shape_identity"           
#> [519] "scale_shape_manual"              "scale_shape_ordinal"            
#> [521] "scale_size"                      "scale_size_area"                
#> [523] "scale_size_binned"               "scale_size_binned_area"         
#> [525] "scale_size_continuous"           "scale_size_date"                
#> [527] "scale_size_datetime"             "scale_size_discrete"            
#> [529] "scale_size_identity"             "scale_size_manual"              
#> [531] "scale_size_ordinal"              "scale_type"                     
#> [533] "scale_x_binned"                  "scale_x_continuous"             
#> [535] "scale_x_date"                    "scale_x_datetime"               
#> [537] "scale_x_discrete"                "scale_x_log10"                  
#> [539] "scale_x_reverse"                 "scale_x_sqrt"                   
#> [541] "scale_x_time"                    "scale_y_binned"                 
#> [543] "scale_y_continuous"              "scale_y_date"                   
#> [545] "scale_y_datetime"                "scale_y_discrete"               
#> [547] "scale_y_log10"                   "scale_y_reverse"                
#> [549] "scale_y_sqrt"                    "scale_y_time"                   
#> [551] "ScaleBinned"                     "ScaleBinnedPosition"            
#> [553] "ScaleContinuous"                 "ScaleContinuousDate"            
#> [555] "ScaleContinuousDatetime"         "ScaleContinuousIdentity"        
#> [557] "ScaleContinuousPosition"         "ScaleDiscrete"                  
#> [559] "ScaleDiscreteIdentity"           "ScaleDiscretePosition"          
#> [561] "scatterplot3d"                   "scatterplotMethod"              
#> [563] "sd"                              "sec_axis"                       
#> [565] "select_h"                        "set_last_plot"                  
#> [567] "setDoPar"                        "setDoSeq"                       
#> [569] "setRcppClass"                    "sf_transform_xy"                
#> [571] "should_stop"                     "show"                           
#> [573] "sizeof"                          "skewness"                       
#> [575] "sourceCpp"                       "stage"                          
#> [577] "standardise_aes_names"           "stat"                           
#> [579] "Stat"                            "stat_align"                     
#> [581] "stat_bin"                        "stat_bin_2d"                    
#> [583] "stat_bin_hex"                    "stat_bin2d"                     
#> [585] "stat_binhex"                     "stat_boxplot"                   
#> [587] "stat_contour"                    "stat_contour_filled"            
#> [589] "stat_count"                      "stat_density"                   
#> [591] "stat_density_2d"                 "stat_density_2d_filled"         
#> [593] "stat_density2d"                  "stat_density2d_filled"          
#> [595] "stat_ecdf"                       "stat_ellipse"                   
#> [597] "stat_function"                   "stat_identity"                  
#> [599] "stat_ksample_cpp"                "stat_qq"                        
#> [601] "stat_qq_line"                    "stat_quantile"                  
#> [603] "stat_sf"                         "stat_sf_coordinates"            
#> [605] "stat_smooth"                     "stat_spoke"                     
#> [607] "stat_sum"                        "stat_summary"                   
#> [609] "stat_summary_2d"                 "stat_summary_bin"               
#> [611] "stat_summary_hex"                "stat_summary2d"                 
#> [613] "stat_unique"                     "stat_ydensity"                  
#> [615] "stat2sample"                     "StatAlign"                      
#> [617] "StatBin"                         "StatBin2d"                      
#> [619] "StatBindot"                      "StatBinhex"                     
#> [621] "StatBoxplot"                     "StatContour"                    
#> [623] "StatContourFilled"               "StatCount"                      
#> [625] "StatDensity"                     "StatDensity2d"                  
#> [627] "StatDensity2dFilled"             "StatEcdf"                       
#> [629] "StatEllipse"                     "StatFunction"                   
#> [631] "StatIdentity"                    "statPoissonUnif"                
#> [633] "StatQq"                          "StatQqLine"                     
#> [635] "StatQuantile"                    "stats_clusters"                 
#> [637] "StatSf"                          "StatSfCoordinates"              
#> [639] "StatSmooth"                      "StatSum"                        
#> [641] "StatSummary"                     "StatSummary2d"                  
#> [643] "StatSummaryBin"                  "StatSummaryHex"                 
#> [645] "StatUnique"                      "StatYdensity"                   
#> [647] "stopCluster"                     "summarise_coord"                
#> [649] "summarise_layers"                "summarise_layout"               
#> [651] "summary"                         "sym"                            
#> [653] "syms"                            "system.file"                    
#> [655] "theme"                           "theme_bw"                       
#> [657] "theme_classic"                   "theme_dark"                     
#> [659] "theme_get"                       "theme_gray"                     
#> [661] "theme_grey"                      "theme_light"                    
#> [663] "theme_linedraw"                  "theme_minimal"                  
#> [665] "theme_replace"                   "theme_set"                      
#> [667] "theme_test"                      "theme_update"                   
#> [669] "theme_void"                      "times"                          
#> [671] "transform_position"              "translate_shape_string"         
#> [673] "uniroot"                         "unit"                           
#> [675] "update_geom_defaults"            "update_labels"                  
#> [677] "update_stat_defaults"            "var_k"                          
#> [679] "var_norm"                        "var_two"                        
#> [681] "vars"                            "waiver"                         
#> [683] "when"                            "wine"                           
#> [685] "wireless"                        "wrap_dims"                      
#> [687] "xlab"                            "xlim"                           
#> [689] "ylab"                            "ylim"                           
#> [691] "zeroGrob"
set.seed(3)
x <- rpkb(100, mu = c(1, 0, 0, 0), rho = 0.2)
pk.test(x$x, 0.3)
#> 
#>  Poisson Kernel-based quadratic distance test of 
#>                         Uniformity on the Sphere 
#> Selected consentration parameter rho:  0.3 
#> 
#> U-statistic:
#> 
#> H0 is rejected:  TRUE 
#> Statistic Un:  9.199349 
#> Critical value:  1.853823 
#> 
#> V-statistic:
#> 
#> H0 is rejected:  TRUE 
#> Statistic Vn:  11.43926 
#> Critical value:  4.504034

On some earthquake moment data, try clustering:

sphdf <- readRDS("../../../SphericalRegression_mobius/data/PapuaMoments/sphdf.rds")

cluster_res <- pkbc(sphdf[, c("s1", "s2", "sMrt", "sMrf", "sMtf")], nClust = 6:12)
summary(cluster_res)
#> Poisson Kernel-Based Clustering on the Sphere (pkbc) Results
#> ------------------------------------------------------------
#> 
#> Summary:
#>         LogLik     WCSS
#> [1,] -688.8587 343.2059
#> [2,] -655.2140 347.6755
#> [3,] -625.4191 341.1606
#> [4,] -603.4364 336.9385
#> [5,] -584.0470 333.7499
#> [6,] -560.5431 334.7955
#> [7,] -548.3379 321.4967
#> 
#> Results for 6 clusters:
#> Estimated Mixing Proportions (alpha):
#> [1] 0.05257730 0.32943288 0.25355351 0.21368909 0.11473290 0.03601432
#> 
#> Clustering table:
#> 
#>   1   2   3   4   5   6 
#>  14 104  75  55  30   9 
#> 
#> 
#> Results for 7 clusters:
#> Estimated Mixing Proportions (alpha):
#> [1] 0.29738989 0.12025126 0.05733429 0.21906057 0.05480797 0.08904655 0.16210947
#> 
#> Clustering table:
#> 
#>  1  2  3  4  5  6  7 
#> 93 31 15 60 17 25 46 
#> 
#> 
#> Results for 8 clusters:
#> Estimated Mixing Proportions (alpha):
#> [1] 0.089068159 0.035917344 0.003522938 0.327756205 0.052475138 0.118944538
#> [7] 0.162009565 0.210306112
#> 
#> Clustering table:
#> 
#>   1   2   3   4   5   6   7   8 
#>  25   9   1 104  14  32  46  56 
#> 
#> 
#> Results for 9 clusters:
#> Estimated Mixing Proportions (alpha):
#> [1] 0.115259566 0.035650660 0.248469219 0.059157674 0.162459051 0.228479959
#> [7] 0.055219772 0.089593080 0.005711019
#> 
#> Clustering table:
#> 
#>  1  2  3  4  5  6  7  8  9 
#> 28  8 81 15 46 65 17 25  2 
#> 
#> 
#> Results for 10 clusters:
#> Estimated Mixing Proportions (alpha):
#>  [1] 0.154161434 0.126900123 0.036690467 0.088649827 0.067957772 0.007410158
#>  [7] 0.233193216 0.222564065 0.055204595 0.007268343
#> 
#> Clustering table:
#> 
#>  1  2  3  4  5  6  7  8  9 10 
#> 45 31  9 25 17  2 67 73 16  2 
#> 
#> 
#> Results for 11 clusters:
#> Estimated Mixing Proportions (alpha):
#>  [1] 0.238977834 0.006129128 0.036053913 0.044206237 0.007254788 0.122687998
#>  [7] 0.065938152 0.180849861 0.162140234 0.090438169 0.045323686
#> 
#> Clustering table:
#> 
#>  1  2  3  4  5  6  7  8  9 10 11 
#> 81  2  8  9  2 32 16 52 46 25 14 
#> 
#> 
#> Results for 12 clusters:
#> Estimated Mixing Proportions (alpha):
#>  [1] 0.051682557 0.137302948 0.058818614 0.157742864 0.062945028 0.007379592
#>  [7] 0.146301496 0.071738513 0.163399829 0.046295540 0.092838897 0.003554124
#> 
#> Clustering table:
#> 
#>  1  2  3  4  5  6  7  8  9 10 11 12 
#> 13 40 15 52 19  2 38 19 48 14 26  1
validation <- pkbc_validation(cluster_res, true_label = sphdf$region)
plot(cluster_res)

comments:

  • When does pkbc() need nClust? If it is always then it should not default to NULL.

  • summary(cluster_res) where cluster_res is a pbkc object prints a table of LogLik and WCSS. This is great. A small tweak showing the number of clusters each row corresponds to would be really nice.

  • Why do you use cosine similarity rather than the angular distance. The latter is the natural metric for the sphere and I'm guessing would give elbow plots that have same shape as the Euclidean distance (where smaller WCSS --> better clustering).


Inspect code:

srr::srr_report("../QuadratiK-package/")


comments:

  • Not necessary, but if you found that rejvmf was too slow, you could speed it up by doing many tries at once and using matrix operations, or by using C++ of course. (while and for loops are very slow in R). Similar with rejacg.

  • Why do you specify pk.test() to be a 'generic' function? As far as I can tell there is only one method for it so pk.test() could be a plain function. Ditto kb.test() and pkbc()

  • Optional: (Hadley)[https://adv-r.hadley.nz/s4.html#s4-generics] advises that setGeneric("myGeneric", function(x) standardGeneric("myGeneric")) is faster than the way you've written it using {.

  • Code appears to follow standards that are relevant. I scanned all code in /R and /tests then compared to srr_report(). It all seemed to be as I expected.

  • Difficulties with srr not with QuadratiK (i.e.~I should probably tell the srr team about them):

    • with the srr_report() function misreading the DESCRIPTION file. The DESCRIPTION file has a comma-separated list of urls. The hyperlinks in the report find the correct url, but then include the comma at the end of it - leading to broken links. Something for the maintainers of srr to fix.
    • srr_report() output has the standard (great!) but not the authors reason. For example in QuadratiK there is a line:
#' @srrstatsNA {G2.14b,G2.14c} we just report errors in case of missing data

Would be lovely if the text "we just report errors in case of missing data" was also included in srr_report().

Review test suite:

See guidance on testing for further details.

test coverage

covr::package_coverage(pkg_dir)
#> QuadratiK Coverage: 100.00%
#> R/clustering_functions.R: 100.00%
#> R/critical_value.R: 100.00%
#> R/h_selection.R: 100.00%
#> R/kb.test.R: 100.00%
#> R/pk.test.R: 100.00%
#> R/pkbd_functions.R: 100.00%
#> R/utility.R: 100.00%
#> src/kernel_function.cpp: 100.00%

inspect tests

comments:

  • dont need library(testthat) at the start of test-validation.R
  • testing suite looks good otherwise

@giovsaraceno
Copy link
Author

Hi @kasselhingee,
Thank you for your thorough and thoughtful review of the package. We are pleased to hear that you find the package much improved and recommend it for the Gold badge.
All the changes you mentioned in the review have been made, and the repository on GitHub has been updated accordingly.
We are grateful for your contributions as a reviewer and agree to acknowledge you as a package reviewer in the DESCRIPTION file.
Below, we address the comments and methodological questions you raised in the review for clarity and completeness.

Could you kindly provide information on the next steps in the review process?

Thank you again for your time and effort in reviewing this package.

check QuadratiK for goodpractice:

#> ── GP QuadratiK ───────────────────────────────────────────────────
#>
#> It is good practice to
#>
#> ✖ avoid long code lines, it is bad for readability. Also, many people prefer
#> editor windows that are about 80 characters wide. Try make your lines
#> shorter than 80 characters
#>
#> 'R/clustering_functions.R:1048:81'
#> 'R/QuadratiK-package.R:62:81'
#> 'tests/testthat/test-pkbc.R:199:81'
#> 'tests/testthat/test-select_h.R:93:81'
#>
#> ✖ avoid 1:length(...), 1:nrow(...), 1:ncol(...), 1:NROW(...) and 1:NCOL(...)
#> expressions. They are error prone and result 1:0 if the expression on the
#> right hand side is zero. Use seq_len() or seq_along() instead.
#>
#> 'R/h_selection.R:366:28'
#>
#> ✖ not import packages as a whole, as this can cause name clashes between the
#> imported packages, especially over time as packages change. Instead, import
#> only the specific functions you need.
#> ✖ fix this R CMD check NOTE: installed size is 17.3Mb sub-directories of 1Mb
#> or more: doc 1.0Mb libs 15.3Mb
#> ─────────────────────────────────────────────────────


#### **comments:**
I agree with the above suggestions, but not sure if they are part of rOpenSci standards. Particularly the suggestion about reducing size I would ignore (though I am surprised `libs` is so large).

The indicated lines are now shorter than 80 characters. The function seq_len() is used as suggested at the indicated line.
For the submission we reviewed each single package included in the imports and we moved a portion of them in the ‘Suggests’ section. I think we don’t need to change anything in terms of packages imports.
The NOTE on the sub-directories sizes is something that is displayed depending on the Operating system used during installation and check, also the reported sizes are not always the same. We have looked for a way to avoid this NOTE, but we did not find any clear explanation.

Check package metadata files

inspect

  • README

  • DESCRIPTION

  • NAMESPACE

  • is d the dimension of the ambient Euclidean space or the dimension of the sphere? Seems like you're using it for both which I found hard to untangle.

Now d is used for the dimension in Euclidean space. Hence in the documentation of the methods for spherical data, they are indicated as point on the (d-1)-dimensional sphere defined in $\mathbb{R}^d$.

  • Is the Poisson kernal's x on the sphere too? (d-variate makes it sound like Euclidean space)
  • x matrix of data points: please specify whether each data point a row in the matrix (like in tidyverse) or a column?

The dimension of the data matrix $x$ has been specified when describing the arguments of the functions for spherical data dpkb(), rpkb(), pk.test() and pkbc(), as well as that observations are d-dimensional vectors lying on $\mathcal{S}^{(d-1)}$.

  • Namespace has a duplication. I'm not sure what R would actually be doing in this case.
useDynLib(QuadratiK)
useDynLib(QuadratiK, .registration = TRUE)

One has been removed.

  • A few packages are fully imported according to the NAMESPACE. This makes sense when you're using alot of the functionality of that package, or a lot of functions from that package. I'd guess that most can be converted to importFrom() statements.
import(Rcpp)
import(RcppEigen)
import(foreach)
import(ggplot2)
import(rlecuyer)

The packages foreach and ggplo2 have been modified using importFrom. The package rlecuyer is needed for the paralleling computing and it is needed as full import. The same applies for the packages Rcpp and RcppEigen since they are necessary for running the algorithms as C++ codes.

Check documentation

  • I still have an issue with your reason for L2 normalising: `Given that Wi-Fi signal strength values are inherently bounded within a certain range, it is possible to consider the spherically transformed data points using L2 normalization.' To me this would usually not be a good enough reason in a statistical application because the strength within that range could be important. As this is just a demonstration I suggest:

    • Say that the normalisation is for demonstration purposes
    • OR say why the strength doesn't matter even within the bounded range (is it irrelevant to the question or does it have a too much noise etc)

In general, the L2 transformation creates vectors of length 1 that lie on the unit sphere of dimension $(d-1)$. This transformation is appropriate when: (i) the absolute length of the measurements is irrelevant or too noisy; (ii) if we are more interested in the relative distributions (or angular relationships) between data points.
Transforming the wireless dataset to spherically normalized data points is reasonable for this analysis because it emphasizes the relative signal strengths across routers, which are more relevant to the underlying spatial patterns and device positioning than the absolute magnitudes. This transformation also makes the analysis more robust to possible noise and environmental variability while capturing the consistent patterns in router signal ratios. Given that absolute signal strength is not critical to the research question, the spherical representation provides a meaningful and interpretable framework for studying the dataset.
We agree that the limited range of signal strength is neither a sufficient nor a necessary condition for considering spherically transformed data points, and the corresponding sentence has been removed. The sentence in the vignette ha been changed as follows:

“In many wireless applications, the relative signal strengths across routers are more relevant to the underlying spatial patterns and device positioning than the absolute magnitudes. Additionally, absolute signal strength can be affected by noise, device orientation or environmental factors. In this case, it is reasonable to consider the spherically transformed data points using L2 normalization. This transformation maps the data onto the surface of a 6-dimensional sphere, ensuring that each observation has a uniform length.
Given that absolute signal strength is not critical to the research question, the spherical representation provides a meaningful and interpretable framework for studying the data set.
In general, it is appropriate to consider spherically transformed data points when: (i) the absolute length of the measurements is irrelevant or too noisy; (ii) if we are more interested in the relative distributions (or angular relationships) between data points.”

test QuadratiK function help files:

comments:

  • `so-called silhouette which is based on the comparison of its tightness and separation' tells me what it is based on, but still I don't actually know what it represents. E.g. the ratio of tightness to separation?? Does it have an analogy with a shadow projected onto a surface?

We have changed the description of ASW as follows:
“The average silhouette width quantifies the quality of clustering by measuring how well each object fits within its assigned cluster. It is the mean of silhouette values, which compare the tightness of an object within its cluster to its separation from other clusters. Higher values indicate well-separated, cohesive clusters, making it useful for selecting the appropriate number of clusters (Rousseeuw 1987).”

  • A more statistical methodology question: I haven't looked into your bootstrapping method for your test so you likely already doing this: I'm curious if you are resampling from the null hypothesis and if your statistic is pivotal or at least stays on the same scale? The reasons for doing this are nicely summarised by Hall and Wilson (1991) doi:10.2307/2532163.

The two guidelines in the reported reference highlight two important aspects of using bootstrap for hypothesis testing.
Resampling from the null hypothesis: in our implementation of k-sample tests we correctly resample from the null hypothesis of equality among the related distributions considering the pooled sample.
Pivotal statistics: in our implementation we take care of this aspect, indeed we compute the variance under H0 of the tests statistics which is used for scaling the test statistics.

  • it is unusual that rpkb() returns a list of items, more than just the simulated data. Most samplers in R (eg. runif()) return just a vector or matrix of simulated data. Not necessary, but users would find it easier if you followed this convention. Please document in either case. Also is it rows or columns that are a `data point' in the return?

This is true and the function rpkb() has been changed accordingly. It has also been specified that this function returns a data matrix where each row is a generated data point.

test QuadratiK vignettes:

comments:

Introduction:

  • In the uniformity test, the d-dimensional sphere is usually written $S^d$ and is a manifold of d-dimensions. You seem to want to mean the unit sphere in $\mathbb{R}^d$ space, which is a $(d−1)$-dimensional sphere.

This has been changed along the package.

  • Why set $\rho = 0.7$ for pk.test()? Would other values work better in other situations? If there is a good broadly useful value of $\rho$ then it would be good to make that the default.

The value of $\rho=0.7$ is chosen for demonstration purposes. In general, $\rho$ can be selected via a grid search algorithm discussed in Ding et al. (2025, Statistica Sinica, Section 5.1 – Algorithm 1).

  • Is pk.test(x, rho = 1) meant to work? I get NA values, but maybe I should get the same error as for $\rho = 10$?

$\rho$ should be in (0,1), indeed our function includes a portion of code checking this.

comments:

  • When does pkbc() need nClust? If it is always then it should not default to NULL.

nClust has no default values now.

  • summary(cluster_res) where cluster_res is a pbkc object prints a table of LogLik and WCSS. This is great. A small tweak showing the number of clusters each row corresponds to would be really nice.

We have added an additional column indicating the corresponding number of clusters.

  • Why do you use cosine similarity rather than the angular distance. The latter is the natural metric for the sphere and I'm guessing would give elbow plots that have same shape as the Euclidean distance (where smaller WCSS --> better clustering).

If two data points are identical, their cosine similarity is 1, which results in a contribution to WCSS, unlike Euclidean distance where identical points contribute 0. Thus, the WCSS calculated using cosine similarity increases with more clusters. Cosine similarity is a widely used measure when working with spherically transformed data, because it emphasizes the relative angular relationships between points, rather than their absolute distances. The use of cosine similarity aligns with the common interpretation of WCSS in elbow plots, where it increases with an increasing number of clusters, and the number of clusters showing the biggest difference in slope is suggested as the optimal number of clusters.

Inspect code:

comments:

  • Why do you specify pk.test() to be a 'generic' function? As far as I can tell there is only one method for it so pk.test() could be a plain function. Ditto kb.test() and pkbc()

The ‘pkbc’ class has some methods defined for this type of objects. We think it is better to leave these functions the way they are since the current version allows for easier future developments and extensions.

  • Optional: (Hadley)[https://adv-r.hadley.nz/s4.html#s4-generics] advises that setGeneric("myGeneric", function(x) standardGeneric("myGeneric")) is faster than the way you've written it using {.

We followed this suggestion.

  • dont need library(testthat) at the start of test-validation.R

Removed

@kasselhingee
Copy link

I think the next step is for @emitanaka to look at my review and your response.

@emitanaka
Copy link

I am on travel at the moment so will look at this on my return on the week starting 16th Dec.

@giovsaraceno
Copy link
Author

Hi @emitanaka,
I hope you're doing well! I wanted to kindly follow up on the status of the review and feedback. Please let me know if there’s anything I can assist with to move things forward.
Thanks in advance

@emitanaka
Copy link

Thank you for the prompt @giovsaraceno and sorry I let it slip in my travel/leave

@emitanaka
Copy link

@ropensci-review-bot add @emitanaka as reviewer

@ropensci-review-bot
Copy link
Collaborator

@emitanaka added to the reviewers list. Review due date is 2025-02-05. Thanks @emitanaka for accepting to review! Please refer to our reviewer guide.

rOpenSci’s community is our best asset. We aim for reviews to be open, non-adversarial, and focused on improving software quality. Be respectful and kind! See our reviewers guide and code of conduct for more.

@ropensci-review-bot
Copy link
Collaborator

@emitanaka: If you haven't done so, please fill this form for us to update our reviewers records.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

9 participants