butterfly: Verification of continually updating timeseries data where we expect new values, but want to ensure previous data remains unchanged. #676

thomaszwagerman opened this issue Nov 29, 2024 · 26 comments


@thomaszwagerman

thomaszwagerman commented Nov 29, 2024

Submitting Author Name: Thomas Zwagerman
Submitting Author Github Handle: @thomaszwagerman
Repository: https://github.com/antarctica/butterfly
Version submitted: 1.0.0
Submission type: Standard
Editor: @emilyriederer
Reviewers: @qdread, @TheAnalyticalEdge

Due date for @qdread: 2025-01-04

Due date for @TheAnalyticalEdge: 2025-01-06
Archive: TBD
Version accepted: TBD
Language: en


  • Paste the full DESCRIPTION file inside a code block below:
Package: butterfly
Title: Verification For Continually Updating Timeseries Data
Version: 1.0.0
Authors@R: 
    person("Thomas", "Zwagerman", , "[email protected]", role = c("aut", "cre"),
           comment = c(ORCID = "0009-0003-3742-3234"))
Description: Verification of continually updating timeseries data where we expect new values, but want to ensure previous data remains unchanged.
License: MIT + file LICENSE
Encoding: UTF-8
Roxygen: list(markdown = TRUE)
RoxygenNote: 7.3.2
Imports: 
    cli,
    dplyr,
    lifecycle,
    rlang,
    waldo
Suggests: 
    knitr,
    rmarkdown,
    testthat (>= 3.0.0)
Config/testthat/edition: 3
Depends: 
    R (>= 2.10)
LazyData: true
VignetteBuilder: knitr
URL: https://github.com/antarctica/butterfly
BugReports: https://github.com/antarctica/butterfly/issues

Scope

  • Please indicate which category or categories from our package fit policies this package falls under: (Please check an appropriate box below. If you are unsure, we suggest you make a pre-submission inquiry.):

    • data retrieval
    • data extraction
    • data munging
    • data deposition
    • data validation and testing
    • workflow automation
    • version control
    • citation management and bibliometrics
    • scientific software wrappers
    • field and lab reproducibility tools
    • database software bindings
    • geospatial data
    • text analysis
  • Explain how and why the package falls under these categories (briefly, 1-2 sentences):

This package was written to handle the verification of continually updating time-series data, where we expect new values over time but want to ensure previous data remains unchanged and timesteps remain continuous. It provides functionality that can be used as part of a data pipeline to check and flag changes to previous data, preventing them from going unnoticed.
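For illustration, a minimal sketch of the intended usage, assuming the `loupe(df_current, df_previous, datetime_variable)` signature described in the package documentation (the column names here are invented):

```r
library(butterfly)

# Previously published version of a timeseries
df_previous <- data.frame(
  time  = as.Date(c("2024-01-01", "2024-02-01")),
  count = c(10, 12)
)

# Current version: one new row appended, previous rows unchanged
df_current <- rbind(
  df_previous,
  data.frame(time = as.Date("2024-03-01"), count = 9)
)

# Returns TRUE when previously published rows are unchanged;
# entirely new rows are reported but do not cause a failure
loupe(df_current, df_previous, datetime_variable = "time")
```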

  • Who is the target audience and what are scientific applications of this package?

Researchers, data engineers, data stewards, etc. We have implemented this package in an operational data pipeline, which extracts ERA5 data and performs some calculations to generate a new publishable dataset. ERA5 can be subject to retrospective changes, so to prevent these changes from going unnoticed and affecting our dataset, we use butterfly to verify our data has not unexpectedly changed compared to previously published versions. New since pre-submission: the package also contains functionality to check the continuity of timeseries data (e.g. gaps due to instrument failure) and to handle time series data with varying measurement frequencies.

(duplicated from pre-submission #665)

To my knowledge, there is no other package that handles comparing iterations of the same dataset, i.e. we want to verify there are no changes in previous data, but at the same time we do not want new data to throw an error and stop our pipeline.

waldo - butterfly uses waldo::compare() in every function to provide a report on differences. There is therefore significant overlap; however, butterfly builds on waldo by providing the functionality of comparing objects where we expect changes in some places (entirely new rows of data) but not in others (previously published data). butterfly also provides extra user feedback for clarity on what it is and isn't comparing, due to the nature of comparing only "matched" rows.
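To illustrate what butterfly adds, here is a sketch of the manual waldo equivalent, reusing `df_current` and `df_previous` from the sketch above (butterfly performs this row matching internally; the simple subsetting on a key column is an illustrative simplification):

```r
# Compare only the rows of the current data that overlap with the
# previous version, so entirely new rows do not register as changes
df_matched <- df_current[df_current$time %in% df_previous$time, ]

waldo::compare(df_matched, df_previous)
```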

diffdf - similar to waldo, but specifically for data frames, diffdf provides the ability to compare data frames directly. diffdf::diffdf() could have been used in our case, but I prefer waldo's more explicit and clear user feedback. That said, there is significant overlap in functionality: butterfly::loupe() and diffdf::diffdf_has_issues() both provide a TRUE/FALSE difference check, while diffdf::diffdf_issuerows() and butterfly::catch() both return the rows where changes have occurred. However, diffdf lacks butterfly's flexibility to compare objects where we expect changes in some places but not in others.

assertr - assertr provides assertion functionality that can be used as part of a pipeline to test assertions on a particular dataset, but it does not offer tools for comparison. I would use the two hand-in-hand; butterfly's purpose is comparison rather than assertion.

daiquiri - daiquiri provides tools to check data quality and visually inspect timeseries data. It is a quality assurance package for timeseries, but does not have a comparison/verification purpose like butterfly.

N/A

  • If you made a pre-submission inquiry, please paste the link to the corresponding issue, forum post, or other discussion, or @tag the editor you contacted.

#665

Since pre-submission, two new functions have been added to the package: timeline() and timeline_group(). This has expanded the package to also deal with time series data from instruments/sensors, specifically those which are prone to error, or those which intentionally measure time-stamped data at variable frequencies. Further feedback on controlling tolerance was also incorporated.
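A brief sketch of the new functions; the argument names (`datetime_variable`, `expected_lag`) and the unit of `expected_lag` (hours, for this POSIXct data) are assumptions for illustration:

```r
library(butterfly)

# An hourly sensor record with a gap where the instrument failed
df_sensor <- data.frame(
  time = as.POSIXct("2024-01-01 00:00:00", tz = "UTC") +
    c(0, 1, 2, 6, 7) * 3600,
  value = c(1.2, 1.3, 1.1, 1.4, 1.5)
)

# Check continuity: are consecutive timesteps within the expected lag?
timeline(df_sensor, datetime_variable = "time", expected_lag = 1)

# Split the record into distinct groups of continuous measurement
timeline_group(df_sensor, datetime_variable = "time", expected_lag = 1)
```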

Another pre-submission suggestion, on checking relationships between variables, was not implemented. I had a few passes at this, but I could not think of a generalised method of doing it effectively. I currently don't have a specific use case that suits it, but I would be open to revisiting this and to suggestions.

  • Explain reasons for any pkgcheck items which your package is unable to pass.

The only warning is "Function names are duplicated in other packages". Currently I've named functions in the way that made sense for this package, following the "butterfly" theme. However, I recognise that in specific cases this might not be ideal. I would be very open to feedback/guidance from reviewers on best practices for naming these (with a personal preference for sticking with the "butterfly" theme), perhaps prepending functions as bf_*()? In the documentation I've tried to specify functions using butterfly::*().

Technical checks

Confirm each of the following by checking the box.

This package:

Publication options

  • Do you intend for this package to go on CRAN?

  • Do you intend for this package to go on Bioconductor?

  • Do you wish to submit an Applications Article about your package to Methods in Ecology and Evolution? If so:

MEE Options
  • The package is novel and will be of interest to the broad readership of the journal.
  • The manuscript describing the package is no longer than 3000 words.
  • You intend to archive the code for the package in a long-term repository which meets the requirements of the journal (see MEE's Policy on Publishing Code)
  • (Scope: Do consider MEE's Aims and Scope for your manuscript. We make no guarantee that your manuscript will be within MEE scope.)
  • (Although not required, we strongly recommend having a full manuscript prepared when you submit here.)
  • (Please do not submit your package separately to Methods in Ecology and Evolution)

Code of conduct

@ropensci-review-bot
Collaborator

Thanks for submitting to rOpenSci; our editors and @ropensci-review-bot will reply soon. Type @ropensci-review-bot help for help.

@ropensci-review-bot
Collaborator

🚀

Editor check started

👋

@ropensci-review-bot
Collaborator

Checks for butterfly (v1.0.0)

git hash: ed8586e4

  • ✔️ Package name is available
  • ✔️ has a 'codemeta.json' file.
  • ✔️ has a 'contributing' file.
  • ✔️ uses 'roxygen2'.
  • ✔️ 'DESCRIPTION' has a URL field.
  • ✔️ 'DESCRIPTION' has a BugReports field.
  • ✔️ Package has at least one HTML vignette
  • ✔️ All functions have examples.
  • ✔️ Package has continuous integration checks.
  • ✔️ Package coverage is 100%.
  • ✔️ R CMD check found no errors.
  • ✔️ R CMD check found no warnings.
  • 👀 Function names are duplicated in other packages

(Checks marked with 👀 may be optionally addressed.)

Package License: MIT + file LICENSE


1. Package Dependencies

Details of Package Dependency Usage (click to open)

The table below tallies all function calls to all packages ('ncalls'), both internal (r-base + recommended, along with the package itself), and external (imported and suggested packages). 'NA' values indicate packages to which no identified calls to R functions could be found. Note that these results are generated by an automated code-tagging system which may not be entirely accurate.

| type | package | ncalls |
|------------|-----------|--------|
| internal | base | 11 |
| internal | butterfly | 6 |
| imports | dplyr | 4 |
| imports | waldo | 1 |
| imports | cli | NA |
| imports | lifecycle | NA |
| imports | rlang | NA |
| suggests | knitr | NA |
| suggests | rmarkdown | NA |
| suggests | testthat | NA |
| linking_to | NA | NA |

Click below for tallies of functions used in each package. Locations of each call within this package may be generated locally by running 's <- pkgstats::pkgstats(<path/to/repo>)', and examining the 'external_calls' table.

base

col (4), suppressMessages (2), by (1), character (1), cumsum (1), data.frame (1), list (1)

butterfly

create_object_list (3), timeline_group (2), catch (1)

dplyr

anti_join (1), bind_rows (1), case_when (1), semi_join (1)

waldo

compare (1)

NOTE: Some imported packages appear to have no associated function calls; please ensure with author that these 'Imports' are listed appropriately.


2. Statistical Properties

This package features some noteworthy statistical properties which may need to be clarified by a handling editor prior to progressing.

Details of statistical properties (click to open)

The package has:

  • code in R (100% in 8 files) and
  • 1 authors
  • 1 vignette
  • 2 internal data files
  • 5 imported packages
  • 6 exported functions (median 19 lines of code)
  • 6 non-exported functions in R (median 50 lines of code)

Statistical properties of package structure as distributional percentiles in relation to all current CRAN packages
The following terminology is used:

  • loc = "Lines of Code"
  • fn = "function"
  • exp/not_exp = exported / not exported

All parameters are explained as tooltips in the locally-rendered HTML version of this report generated by the checks_to_markdown() function

The final measure (fn_call_network_size) is the total number of calls between functions (in R), or more abstract relationships between code objects in other languages. Values are flagged as "noteworthy" when they lie in the upper or lower 5th percentile.

| measure | value | percentile | noteworthy |
|-----------------------------|-------|------------|------------|
| files_R | 8 | 47.7 | |
| files_vignettes | 1 | 61.9 | |
| files_tests | 7 | 81.9 | |
| loc_R | 279 | 29.5 | |
| loc_vignettes | 138 | 34.1 | |
| loc_tests | 307 | 61.1 | |
| num_vignettes | 1 | 58.9 | |
| data_size_total | 822 | 58.2 | |
| data_size_median | 411 | 58.0 | |
| n_fns_r | 12 | 17.0 | |
| n_fns_r_exported | 6 | 30.0 | |
| n_fns_r_not_exported | 6 | 14.0 | |
| n_fns_per_file_r | 1 | 1.9 | TRUE |
| num_params_per_fn | 4 | 51.2 | |
| loc_per_fn_r | 35 | 81.4 | |
| loc_per_fn_r_exp | 19 | 44.5 | |
| loc_per_fn_r_not_exp | 50 | 90.3 | |
| rel_whitespace_R | 16 | 31.1 | |
| rel_whitespace_vignettes | 51 | 48.0 | |
| rel_whitespace_tests | 10 | 43.2 | |
| doclines_per_fn_exp | 53 | 65.9 | |
| doclines_per_fn_not_exp | 0 | 0.0 | TRUE |
| fn_call_network_size | 4 | 19.8 | |

2a. Network visualisation

Click to see the interactive network visualisation of calls between objects in package


3. goodpractice and other checks

Details of goodpractice checks (click to open)

3a. Continuous Integration Badges

R-CMD-check.yaml
pkgcheck

GitHub Workflow Results

| id | name | conclusion | sha | run_number | date |
|-------------|-----------------------------|------------|--------|------------|------------|
| 12088199123 | pages build and deployment | success | ded264 | 24 | 2024-11-29 |
| 12088183822 | pkgcheck | success | ed8586 | 8 | 2024-11-29 |
| 12088183810 | pkgdown.yaml | success | ed8586 | 43 | 2024-11-29 |
| 12088183807 | R-CMD-check.yaml | success | ed8586 | 45 | 2024-11-29 |
| 12088183819 | test-coverage.yaml | success | ed8586 | 46 | 2024-11-29 |

3b. goodpractice results

R CMD check with rcmdcheck

rcmdcheck found no errors, warnings, or notes

Test coverage with covr

Package coverage: 100

Cyclocomplexity with cyclocomp

No functions have cyclocomplexity >= 15

Static code analyses with lintr

lintr found no issues with this package!


4. Other Checks

Details of other checks (click to open)

✖️ The following 3 function names are duplicated in other packages:

    • catch from catch, gnn, promises, qrmtools, TULIP
    • release from packager, simmer
    • timeline from ndtv, timeline, trackeR


Package Versions

| package | version |
|----------|----------|
| pkgstats | 0.2.0.48 |
| pkgcheck | 0.1.2.77 |


Editor-in-Chief Instructions:

This package is in top shape and may be passed on to a handling editor

@emilyriederer

Hi @thomaszwagerman ! Thanks for your submission. We're excited to move forward with this review, and it looks like the package is in great shape. I will begin to look for a handling editor.

@emilyriederer

@ropensci-review-bot assign @username as editor

@emilyriederer

@ropensci-review-bot assign @emilyriederer as editor

@ropensci-review-bot
Collaborator

Assigned! @emilyriederer is now the editor

@emilyriederer

Hi again @thomaszwagerman ! I'll also be handling editor for this one. Looking forward to working with you

@emilyriederer

Editor checks:

  • Documentation: The package has sufficient documentation available online (README, pkgdown docs) to allow for an assessment of functionality and scope without installing the package. In particular,
    • Is the case for the package well made?
    • Is the reference index page clear (grouped by topic if necessary)?
    • Are vignettes readable, sufficiently detailed and not just perfunctory?
  • Fit: The package meets criteria for fit and overlap.
  • Installation instructions: Are installation instructions clear enough for human users?
  • Tests: If the package has some interactivity / HTTP / plot production etc. are the tests using state-of-the-art tooling?
  • Contributing information: Is the documentation for contribution clear enough e.g. tokens for tests, playgrounds?
  • License: The package has a CRAN or OSI accepted license.
  • Project management: Are the issue and PR trackers in a good shape, e.g. are there outstanding bugs, is it clear when feature requests are meant to be tackled?

Editor comments

Overall, the package looks in great shape for review. The level of documentation is especially impressive; I particularly like the graphics.

The repo also seems well organized, making the package easy to understand and install, and its development easy to observe.

I marked the "Contributing information" box as complete since the package doesn't require any specific dev setup; however, you could consider adding explicit information on what types of contributions you welcome and how folks should engage (e.g. open an issue before beginning development? See example here).


@emilyriederer

@ropensci-review-bot help

@ropensci-review-bot
Collaborator

Hello @emilyriederer, here are the things you can ask me to do:


# Add a review's info to the ROpenSci logs
@ropensci-review-bot submit review <REVIEW_URL> time <REVIEW_HOURS(ex. 10.5)>

# List all available commands
@ropensci-review-bot help

# Show our Code of Conduct
@ropensci-review-bot code of conduct

# Checks peer-review badge is in README.md
@ropensci-review-bot check readme

# Switch to 'seeking reviewers'
@ropensci-review-bot seeking reviewers

# Approves a package. This command will close the issue.
@ropensci-review-bot approve package-name

# Invite the author of a package to the corresponding rOpenSci team. This command should be issued by the author of the package.
@ropensci-review-bot invite me to ropensci/package-name

# Adds package's repo to the rOpenSci team. This command should be issued after approval and transfer of the package.
@ropensci-review-bot finalize transfer of package-name

# Mint package as [bronze/silver/gold]
@ropensci-review-bot mint silver

# Add a user to this issue's reviewers list
@ropensci-review-bot assign xxxxx as reviewer

# Remove a user from the reviewers list
@ropensci-review-bot remove xxxxx from reviewers

# Assign a user as the editor of this submission
@ropensci-review-bot assign @username as editor

# Put the submission on hold for the next 90 days
@ropensci-review-bot put on hold

# Remove the editor assigned to this submission
@ropensci-review-bot remove editor

# Change or add a review's due date for a reviewer
@ropensci-review-bot set due date for @reviewer to YYYY-MM-DD

# Close the issue
@ropensci-review-bot out of scope

# Various package checks
@ropensci-review-bot check package

# Checks srr documentation for stats packages
@ropensci-review-bot check srr

@emilyriederer

@ropensci-review-bot seeking reviewers

@ropensci-review-bot
Collaborator

Please add this badge to the README of your package repository:

[![Status at rOpenSci Software Peer Review](https://badges.ropensci.org/676_status.svg)](https://github.com/ropensci/software-review/issues/676)

Furthermore, if your package does not have a NEWS.md file yet, please create one to capture the changes made during the review process. See https://devguide.ropensci.org/releasing.html#news

@thomaszwagerman
Author

thomaszwagerman commented Dec 12, 2024

Thank you for your comments @emilyriederer - a very good point regarding being explicit about contributions. I have updated the CONTRIBUTING.md using the example you provided, and made it more bespoke to the package (antarctica/butterfly#30).

A small point of general clarification, before submitting for review I followed the contribution guide, which states it is compulsory to have CONTRIBUTING.md under either .github/ or docs/. Is this intended to discourage having the CONTRIBUTING.md at the top level? This is where I would probably expect it to be, as it is in {targets} (I briefly forgot I already had one, under .github/).

@emilyriederer

@ropensci-review-bot assign @qdread as reviewer

@ropensci-review-bot
Collaborator

@qdread added to the reviewers list. Review due date is 2025-01-04. Thanks @qdread for accepting to review! Please refer to our reviewer guide.

rOpenSci’s community is our best asset. We aim for reviews to be open, non-adversarial, and focused on improving software quality. Be respectful and kind! See our reviewers guide and code of conduct for more.

@ropensci-review-bot
Collaborator

@qdread: If you haven't done so, please fill this form for us to update our reviewers records.

@emilyriederer

@thomaszwagerman - let me introduce @qdread as our first reviewer! He's an experienced rOpenSci reviewer with lots of expertise in longitudinal data.

@qdread - regarding the change in your email, please use the form above to update!

@emilyriederer

@thomaszwagerman - great question and thanks for the catch! You're probably right that it's better in one of those locations, to avoid confusion when building the R package (otherwise we can tell the build to ignore it).

@emilyriederer

@ropensci-review-bot assign @TheAnalyticalEdge as reviewer

@ropensci-review-bot
Collaborator

@TheAnalyticalEdge added to the reviewers list. Review due date is 2025-01-06. Thanks @TheAnalyticalEdge for accepting to review! Please refer to our reviewer guide.

rOpenSci’s community is our best asset. We aim for reviews to be open, non-adversarial, and focused on improving software quality. Be respectful and kind! See our reviewers guide and code of conduct for more.

@ropensci-review-bot
Collaborator

@TheAnalyticalEdge: If you haven't done so, please fill this form for us to update our reviewers records.

@emilyriederer

@thomaszwagerman - I'm delighted to announce @TheAnalyticalEdge as our second reviewer!

@TheAnalyticalEdge and @qdread - Please do not stress about the dates the bot "autoassigned" you. As we all discussed offline, the end of December is quite busy for all with the holidays. We can easily bump the date when the time comes.

@qdread

qdread commented Dec 29, 2024

My review is below. Happy New Year to all! And specifically for @emilyriederer Go Tar Heels!

Package Review

  • Briefly describe any working relationship you have (had) with the package authors. None to declare.
  • As the reviewer I confirm that there are no conflicts of interest for me to review this work (if you are unsure whether you are in conflict, please speak to your editor before starting your review).

Documentation

The package includes all the following forms of documentation:

  • A statement of need: clearly stating problems the software is designed to solve and its target audience in README
  • Installation instructions: for the development version of package and any non-standard dependencies in README
  • Vignette(s): demonstrating major functionality that runs successfully locally
  • Function Documentation: for all exported functions
  • Examples: (that run successfully locally) for all exported functions
  • Community guidelines: including contribution guidelines in the README or CONTRIBUTING, and DESCRIPTION with URL, BugReports and Maintainer (which may be autogenerated via Authors@R).

Functionality

  • Installation: Installation succeeds as documented.
  • Functionality: Any functional claims of the software have been confirmed.
  • Performance: Any performance claims of the software have been confirmed. Not applicable.
  • Automated tests: Unit tests cover essential functions of the package and a reasonable range of inputs and conditions. All tests pass on the local machine.
  • Packaging guidelines: The package conforms to the rOpenSci packaging guidelines.

Estimated hours spent reviewing: 5

  • Should the author(s) deem it appropriate, I agree to be acknowledged as a package reviewer ("rev" role) in the package DESCRIPTION file.

Review Comments

General comments

I enjoyed reviewing this package. Thanks for giving me the opportunity to take a look at it! Below I have written some elaborations on the checklist items that I did not mark complete, and point-by-point comments on different things I noticed while looking over the package. Most of these comments are suggestions for possible improvements rather than fatal flaws that must be fixed. In general, the package does a well-defined job, does it correctly, and documents itself well.

One general comment I have is that the code execution seemed fairly slow for the examples that only use input datasets with a few dozen rows. I would be concerned that it would scale poorly to large datasets. It might be useful to test this with (realistically) large input datasets to see if the performance is still acceptable.
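For example, a rough timing sketch under the same assumed loupe() signature as above (a fuller benchmark could use bench::mark()):

```r
library(butterfly)

n <- 1e6  # a realistically large hourly timeseries

df_previous <- data.frame(
  time  = seq.POSIXt(as.POSIXct("2000-01-01", tz = "UTC"),
                     by = "hour", length.out = n),
  value = rnorm(n)
)

# Current version: the same data plus 24 new hourly rows
df_current <- rbind(
  df_previous,
  data.frame(
    time  = max(df_previous$time) + (1:24) * 3600,
    value = rnorm(24)
  )
)

system.time(loupe(df_current, df_previous, datetime_variable = "time"))
```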

Explanation of checklist items not marked complete

  • Community guidelines: There is no explicit mention of how to contribute in the README and there is no CONTRIBUTING file. This is easy to fix: just add a few sentences to the README saying that would-be contributors should submit a well-documented pull request, or something along those lines.
  • Automated tests: The tests currently in the test suite are all great. They all pass when run on my local machine and have 100% code coverage (I verified this by running covr::report() locally). That is excellent. What I think is potentially missing from the test suite are tests for what the functions return if passed invalid or incorrectly formatted input. All the tests currently use butterflycount, which is good for demonstrating that expected output is returned for clean, well-formatted input. It would be good to make sure that appropriate errors, warnings, etc. are returned if the input isn't so clean (a sketch of such a test follows this list).
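A minimal sketch of such a test, assuming the loupe() signature above and that the package should signal some error when the named datetime column is absent (the exact condition is for the author to define):

```r
library(testthat)
library(butterfly)

test_that("loupe() fails informatively on malformed input", {
  df_bad <- data.frame(x = 1:3)  # no datetime column

  # Assumed behaviour: an error rather than a silent result
  expect_error(loupe(df_bad, df_bad, datetime_variable = "time"))
})
```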

Specific comments: suggested improvements to the code

  • It is good that verbosity of the output can be set globally with options(), but it might also be helpful to include an argument in each individual function allowing verbose = FALSE to be set on a per-call basis.
  • I don't quite understand the purpose of loupe() when catch() exists. When I run loupe() I see which rows, if any, have changed, but they just appear printed out and can't be assigned to a data frame. What is the usefulness of loupe() if catch() gives me the same side effect and also returns something useful? It would almost make more sense to combine those two functions into one that could both display the changed rows and return them. A nifty way to do this might be to make catch() return an object with a special class that inherits from data.frame and has its own print method showing the nice informative messages about changed rows, while still being usable as a data frame if needed (see the sketch after this list).
  • One other minor suggestion is that in loupe(), the messages are slightly confusing when no rows are new. For changed rows it makes sense: if rows have changed, a message that rows have changed is followed by a list of changed rows, and if no rows have changed, we see "And there are no differences with previous data." But regardless of whether there are new rows or not, the message "The following rows are new in 'df_current':" is printed. I would recommend replacing this with "There are no new rows in 'df_current'" when no rows are new.
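A hedged sketch of that combined idea; the helper find_changed_rows() and the class name are hypothetical stand-ins, not part of butterfly's current API, and verbosity falls back to a global option as suggested in the first bullet:

```r
library(dplyr)

# Illustrative stand-in for butterfly's internal matching logic:
# among rows with overlapping timestamps, keep those whose values differ
find_changed_rows <- function(df_current, df_previous, datetime_variable) {
  df_matched <- semi_join(df_current, df_previous, by = datetime_variable)
  anti_join(df_matched, df_previous, by = colnames(df_previous))
}

# Hypothetical: catch() returns the changed rows wrapped in a small
# S3 class inheriting from data.frame, printed only when verbose
catch <- function(df_current, df_previous, datetime_variable,
                  verbose = getOption("butterfly.verbose", TRUE)) {
  changed <- find_changed_rows(df_current, df_previous, datetime_variable)
  out <- structure(changed, class = c("butterfly_catch", class(changed)))
  if (verbose) print(out)
  invisible(out)
}

print.butterfly_catch <- function(x, ...) {
  if (nrow(x) == 0) {
    cli::cli_alert_success("And there are no differences with previous data.")
  } else {
    cli::cli_alert_warning("{nrow(x)} row{?s} changed in previous data:")
    print(as.data.frame(x), ...)
  }
  invisible(x)
}
```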

Specific comments: vignettes

  • The main vignette doesn't really start with an intro that clearly explains what the package does, and the brief intro sentences may be slightly confusing. I would recommend adding another sentence or two explaining, in big-picture terms, what inputs you need and what outputs/results you should expect.
  • The graphics on the main vignette showing what the package does are very nice!
  • The operational data pipeline vignette was a nice read; it really helped illustrate the utility of the package. Again, I liked the flowchart graphics a lot. The only minor comment I have is that I would note somewhere that the scripts are run in a Bash shell; it says this in the header of one of the scripts, but that may not be obvious to people unfamiliar with shell scripting.
  • The title of the third vignette is very long, so it doesn't display well in the "Articles" dropdown menu on the pkgdown site. I noticed this is an unfinished article, so I wasn't sure why it is already published on the public-facing site. I did not review it very closely because it seemed unfinished.
  • I noticed a few typos (grammar, punctuation) in the vignettes; they might need another pass.

Specific comments: function documentation

  • The Description field for loupe() explains the function's name but does not say what the function does. Because the first sentence after the title becomes the Description field in the help documentation, why not switch sentences 1 and 2?
  • Either the documentation or a vignette should say somewhere that waldo::compare() is an expanded version of all.equal(), since more people are familiar with that base R function. At least for me, I did not know about compare(), but I understood better what it was doing once I realised it is sort of a fancier version of all.equal().
  • The documentation of catch(), line 52, says inner join, but I think it should say anti join.
  • The naming of timeline() and timeline_group() is not really consistent with the other functions; the names aren't that descriptive of what they do (and unfortunately they do not have a cool lepidopteran-themed name!). I might recommend changing those names, keeping the old ones as deprecated aliases for backward compatibility (a sketch of that pattern follows this list).
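A sketch of that deprecation pattern using lifecycle; the replacement name bf_timeline(), its argument list, and the version number are all hypothetical:

```r
# Hypothetical new name for the existing implementation
bf_timeline <- butterfly::timeline

# Deprecated alias: warns once per session, then forwards on
timeline <- function(df, datetime_variable, expected_lag = 1) {
  lifecycle::deprecate_warn("1.1.0", "timeline()", "bf_timeline()")
  bf_timeline(df, datetime_variable, expected_lag)
}
```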

@ropensci-review-bot
Collaborator

📆 @qdread you have 2 days left before the due date for your review (2025-01-04).

@ropensci-review-bot
Collaborator

📆 @TheAnalyticalEdge you have 2 days left before the due date for your review (2025-01-06).
