Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Generate table of scikit-image's entire runtime API #6905

Draft
wants to merge 12 commits into
base: main
Choose a base branch
from

Conversation

lagru
Copy link
Member

@lagru lagru commented Apr 21, 2023

Description

In previous discussions, it was suggested to create a resource that spells out scikit-image's complete Python API. This is meant as a first step in identifying common patterns in our API to help with the skimage2 transition and also with typing our API (see the proposal at scikit-image/skimage-archive#29). @jni suggested to aggregate this information in a spreadsheet and pointed to similar work in scikit-image/boilerplate-utils#5. Depending on how this shakes out, I think this could be a very useful tool in general to inspect and measure our API.

While still work in progress, I'm already happy to share the current state. Some notes on the current behavior:

  • The script assumes that there is exactly one unique source path for each object, but multiple discovery paths are possible. E.g. skimage.util.unique.unique_rows is the source path, while skimage.util.unique_rows and skimage.util.unique.unique_rows are discovery paths.
  • Methods that aren't overwritten in inherited classes, are considered additional discovery paths. Currently the script just uses the first discovered path as the source path here.
  • Decorated function (having a __wrapped__ attribute) are treated as an additional discovery path.
  • Variable declarations are detected and listed as well, e.g. ski.color.ahx_from_rgb. This was tricky to do.
  • lazy.attach_stub overwrites the the __dir__ function of modules. Because this function is used under the hood to crawl the object tree, things that are actually reachable but not discover-able at runtime are hidden from the script! E.g. img_as_float64 is available in the current skimage/__init__.py but not discover-able because it's not included in the matching PYI file. I'm slightly in favor of arguing that the script should only list objects that are part of the object tree as defined __dir__. So this is actually working as intended.

Checklist

  • Docstrings for all functions
  • Gallery example in ./doc/examples (new features only)
  • Benchmark in ./benchmarks, if your changes aren't covered by an
    existing benchmark
  • Unit tests
  • Clean style in the spirit of PEP8
  • Descriptive commit messages (see below)

For reviewers

  • Check that the PR title is short, concise, and will make sense 1 year
    later.
  • Check that new functions are imported in corresponding __init__.py.
  • Check that new features, API changes, and deprecations are mentioned in
    doc/release/release_dev.rst.
  • There is a bot to help automate backporting a PR to an older branch. For
    example, to backport to v0.19.x after merging, add the following in a PR
    comment: @meeseeksdev backport to v0.19.x
  • To run benchmarks on a PR, add the run-benchmark label. To rerun, the label
    can be removed and then added again. The benchmark output can be checked in
    the "Actions" tab.

lagru added 3 commits April 21, 2023 12:08
Class properties don't seem to be picked up yet...
Attributes of modules or classes whose module cannot be determined at
runtime are still included as long as the module of the parent can be
determined. This makes public variables visible.

The visit_api function should probably be refactored and simplified
going forward.

Currently, print_full_api returns 14484 objects.
@lagru lagru added the 🤖 type: Infrastructure CI, packaging, tools and automation label Apr 21, 2023
@stefanv
Copy link
Member

stefanv commented Apr 21, 2023

I wasn't part of the initial conversation, so I'm not sure of the goal here. How is this different from what we publish in the API reference guide?

The basic guide is: __all__ on any submodule, and only one level deep, except for a few notable exceptions: filters.rank, future.* etc.

@jni
Copy link
Member

jni commented Apr 24, 2023

@stefanv the goal here is to ensure that semantically-similar functions have a consistent API, so that it is easy to experiment switching one for another. (Same name for required arguments, same name and input types for similar optional keyword arguments, same output types, etc.) Kind of a functional equivalent to sklearn's consistent Classifier API having identical fit and predict methods.

Historically we have struggled with this — for example, watershed used to be in morphology instead of segmentation.

The first step is to identify all our public functions and group them according to the (currently imprecisely defined) criterion of semantic similarity.

This script is able to iterate the member tree of skimage. The
discovered information can be exported into two CSV files, one on the
general discovered tree and one on discovered parameters and return
objects of callables.
@lagru
Copy link
Member Author

lagru commented Apr 24, 2023

How is this different from what we publish in the API reference guide?

Basically, I view this as a very useful tool and start to inspect our API. This script is able to walk the entire member tree of a module that is given to it. So the degree of automation is significantly higher and its easy to keep up with changes to our API. Additionally, the implementation is able to record public attributes (of modules and classes) that are not functions or classes themselves (e.g. the color tuples such as skimage.color.rgb_colors.bisque) which are not visible in our HTML API documentation.

This might also be something that we or even other projects can build on. In addition to this, I can think of a few applications of the bat that might find this useful:

  • Semi-automatically generate a diff of the API between releases. Might be something we want to include in some way in the release notes.
  • Check if API as defined by __all__ is consistent with what is reachable via public "import paths". E.g. the function warn_inconsistent__all__ in this script already does this. In the future I would want to go in a direction were there is no difference between both "public API definitions". This script can help with testing these kind of assumptions.
  • Find signatures with missing type annotations / automated testing of these.

I've created a Google Sheet with the output from python tools/inspect_api.py --csv -v -- skimage here: https://docs.google.com/spreadsheets/d/1sfF0MsotNDqOHA3WldyKYk0gZKRhqAbR-HVVGfLLcJY
For some reason the filters are not showing, though you can create temporary filter in the Data > Filter Views menu.

@stefanv
Copy link
Member

stefanv commented Apr 24, 2023

I have a vague memory that @Carreau has worked on something similar.

@stefanv
Copy link
Member

stefanv commented Apr 24, 2023

This might also be something that we or even other projects can build on.

If we can figure out how this can be generally useful, we can host it under scientific-python.

@stefanv
Copy link
Member

stefanv commented Apr 24, 2023

I also think @Erotemic may be interested.

@lagru
Copy link
Member Author

lagru commented Apr 24, 2023

If we can figure out how this can be generally useful, we can host it under scientific-python.

Happy to do so. I can already apply this to SciPy, NumPy, Matplotlib, etc (though I have to limit recursion depth for the former two). With a bit of polishing this should be generally applicable. 😊

@Carreau
Copy link
Contributor

Carreau commented Apr 25, 2023

@stefanv yes I originally worked on https://github.com/carreau/frappuccino, but I would suggest looking at https://github.com/mkdocstrings/griffe which seem more maintained and does so.

@lagru
Copy link
Member Author

lagru commented Apr 25, 2023

Wow, griffe looks very cool! Thanks for recommending it. I wasn't able to find any of these tools when I checked but should have known that I was reinventing the wheel a little bit. 🙈

griffe already seems to collect all the information that this little script does. It exports to JSON but it shouldn't be to hard to transform it into a tabular format if we deem something like https://docs.google.com/spreadsheets/d/1sfF0MsotNDqOHA3WldyKYk0gZKRhqAbR-HVVGfLLcJY useful...

@jni
Copy link
Member

jni commented May 2, 2023

@lagru cool!

Other automated columns I'd add:

  • number of required arguments
  • number of keyword-only arguments

We then have to think about manual (?) columns for our input types and return types, based on the typing stuff we've talked about: image, binary image, label image, coordinates, indices (different from coordinates — e.g. rr, cc tuple), spacing, dtype, etc.

@jni
Copy link
Member

jni commented May 2, 2023

(or maybe we do typing first and then add types to the automated columns)

@lagru
Copy link
Member Author

lagru commented May 2, 2023

Other automated columns I'd add

Can do. After talking a bit with @tlambert03 on Zulip, I'm currently having a naive agenda in mind. It's not really a well defined or detailed list of bullet points yet, but useful to communicate my thoughts anyway 😅 :

  1. Start a draft PR that adds a type to every public signature. Each type that we use goes into a list in skimage.types (or similar).
    • The intention is to accumulate a list of used types, that forces and enables us to think about which types we want to support in our API.
    • I think it's more natural to use our existing workflow and accumulate this list in a Python file. That way we can review and use code insight to jump between type declaration and their usages. Though, I intend to use the generated table to check for missing types, and comparison with our docstring type annotation.
    • I don't intend to apply a type checker at this stage yet. PEP 646 is not supported by mypy yet. Instead I want to use simple placeholders, to get the discussion started about where we want to go. E.g. something like RGBArrayLike = ArrayLike[*Shape, L[3], *Shape] (not sure if that will be possible) but you get the idea.
  2. Once we've settled on a set of types we intend to use, start moving our API and documentation in that direction.
    • We don't have to wait for 1. to be finished here. Agreed on stuff can already be updated (or postponed to skimage2); though definitely in separate PRs.
  3. At this stage support for things like PEP 646 will hopefully be more mature and we can start to actually apply and enforce type checking.

Of those 3 points, 1. is probably the one that will require the most communication and consensus seeking effort. That's why I want to start that as early as possible.

@stefanv
Copy link
Member

stefanv commented May 2, 2023

  • The intention is to accumulate a list of used types, that forces and enables us to think about which types we want to support in our API.

I presume here we will not focus so much on the underlying type (class), but rather of the intent of the parameter (often an array). E.g., you would annotate with Image, Labels, Mask, etc.? Otherwise, you'll have a bunch of Array or Any.

lagru added 5 commits May 24, 2023 16:29
This reverts commit 7576912. I think
the runtime based approach might be more useful in the long-term. This
might make it easier for packages such as NumPy to apply this script as
well if they are interested.
Also fix the bug where None would show up as an empty default for
parameters.
@lagru
Copy link
Member Author

lagru commented May 25, 2023

@jni, I added the suggested columns to the script and updated https://docs.google.com/spreadsheets/d/1sfF0MsotNDqOHA3WldyKYk0gZKRhqAbR-HVVGfLLcJY accordingly. I intend to stay with the runtime based approach for now (as opposed to using griffe's static approach) as this actually checking what the user sees. This might not make much of a difference for us but e.g. for a project like NumPy it would.

Copy link

Hello scikit-image core devs! There hasn't been any activity on this PR for more than 180 days. I have marked it as "dormant" to make it easy to find.
To our contributors, thank you for your contribution and apologies if this contribution fell through the cracks! Hopefully this ping will help bring some fresh attention to the review. If you need help, you can always reach out on our forum
If you think that this PR is no longer relevant, you may close it, or we may do it at some point (either way, it will be done manually). If you think the PR is valuable but you no longer have the bandwidth to update it, please let us know, so that someone can take it over. 🙏

@github-actions github-actions bot added the 😴 Dormant No recent activity label Nov 23, 2023
@lagru
Copy link
Member Author

lagru commented Nov 23, 2023

Haha, I'm sure this will come in handy at some point. Though, rather as a standalone tool. Until I have the time to transfer it, I'd keep this open. :)

@github-actions github-actions bot removed the 😴 Dormant No recent activity label Nov 24, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
🤖 type: Infrastructure CI, packaging, tools and automation
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants