
xdoctest #59

Open
ev-br opened this issue Apr 29, 2023 · 8 comments
Labels: question (Further information is requested), some day

Comments

ev-br (Member) commented Apr 29, 2023

Prior art: take a look at https://github.com/Erotemic/xdoctest; it is used by pytorch (?)

ev-br added the "some day" label on Jul 3, 2023
ev-br added the "question" label on Jul 24, 2023
Erotemic commented:

Hi, I'm the author of xdoctest. I'd be interested in knowing whether there is something about xdoctest that does not suit your needs; if so, perhaps we can integrate features there. I think many of the directives xdoctest has can handle the use cases scipy-doctest was written for.

Xdoctest has skip blocks via # xdoctest: +SKIP. It also has conditional skips based on module requirements and environment variables: # xdoctest: +REQUIRES(module:matplotlib) and # xdoctest: +REQUIRES(env:ENABLE_MPL_DOCTEST). You can ignore the "got/want" check with # xdoctest: +IGNORE_WANT on a single line, or apply it to an entire block.
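
For illustration, a made-up snippet combining the directives mentioned above (a directive on its own line applies to the lines below it):

>>> # xdoctest: +REQUIRES(module:matplotlib)
>>> import matplotlib.pyplot as plt
>>> plt.show()  # xdoctest: +SKIP
>>> print('took 0.12 seconds')  # xdoctest: +IGNORE_WANT
took 0.34 seconds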

ev-br (Member, Author) commented Aug 16, 2024

Hey @Erotemic !
I'm indeed all for integrating features. Let's start by comparing features.
Big picture first: scipy-doctest

  • strives to keep all directives in the tool and out of docstrings, or to have directives human-readable (hence # may vary, not +SKIP et al.; see the sketch after this list)
  • Allows mix-and-match Checker, Parser, Finder and Runner
  • Default Finder's strategy accounts for public/private split in complex packages
  • Default Checker is floating-point aware, uses numpy and is heavily informed by the needs of the existing numpy and scipy documentation.
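
As a quick sketch of the first point, a scipy-style example using the human-readable marker could look like this (made-up output):

>>> import numpy as np
>>> np.random.rand(3)
array([0.417, 0.7203, 0.0001])  # may vary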

What would you list as major improvements of xdoctest over the vanilla doctest?

Erotemic commented:

Default Checker is floating-point aware, uses numpy and is heavily informed by the needs of the existing numpy and scipy documentation.

I'd be interested in adding a flexible floating point checker to the xdoctest checker. Currently there is quite a bit of normalization that happens by default and anything that isn't normalized can typically be handled with use of ellipsis, which is enabled by default.
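
For example, with ellipsis matching on by default, a floating-point "want" can be written loosely (illustrative snippet):

>>> import math
>>> math.pi * 2
6.28...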

Default Finder's strategy accounts for public/private split in complex packages

Not quite sure what this means. You don't test doctests in private scopes by default? I think that would be easy to integrate into xdoctest, because each doctest knows where it came from, so an option to disable doctests in locations with a leading "_" wouldn't be hard.

Allows mix-and-match Checker, Parser, Finder and Runner

xdoctest is configurable, but for the most part I've been the only person with eyes on the core code. I'd be interested in at least seeing if there are internal architecture improvements that could be made. I know for a fact my parser could be faster, but it's not that slow to begin with so there's been little need to hammer on it.

strives to keep all directives in the tool and out of docstrings or have directives human-readable (hence # may vary not +SKIP et al)

I find directives useful when used sparingly. I think it's important that they are readily identifiable as directives, though. Personally, I think # may vary just looks like a comment. There is nothing indicating to me as a programmer that it will do something special. In pytorch we just added a sphinx rule that strips them out of the readthedocs pages before they are published.

However, in xdoctest I am fairly flexible about what counts as a directive, as my work on this started fairly organically ~2015 in the IBEIS project. There are some hard-coded things: for example, # SKIP or # DISABLE_DOCTEST at the start of a doctest will skip it, rather than using the explicit directives I prefer now.

In any case I'm certainly not opposed to expanding the existing directives.

What would you list as major improvements of xdoctest over the vanilla doctest?

See the Enhancements section of the README, as well as the enhancements demo.

More info:

AST Parser

The biggest improvement is that it doesn't use regex to "parse" Python's grammar, and instead uses the more appropriate AST. This means instead of being forced to write:

>>> x = func(statements='with',
...          multiple='lines')

and manually identify which lines are continuations of a previous statement, the parser does that for you, so you can use a simpler rule to create a doctest: just prefix everything with ">>> ":

>>> x = func(statements='with',
>>>          multiple='lines')

(Although, if you look at the xdoctest README you will see it is even more flexible than that).

New Directives

I mentioned some of the new directives in the previous post.

There are also multi-line directives, which means they apply to all lines underneath them. The old doctest module can only put directives inline.

old:

>>> x = 1 # xdoctest: +SKIP
>>> x = 2 # xdoctest: +SKIP
>>> x = 3 # xdoctest: +SKIP
>>> x = 4 # xdoctest: +SKIP

new:

>>> # xdoctest: +SKIP
>>> x = 1
>>> x = 2
>>> x = 3
>>> x = 4

You can "undo" a directive by using the "-" instead of "+", e.g.

>>> # xdoctest: +SKIP
>>> x = 1
>>> x = 2
>>> # xdoctest: -SKIP
>>> x = 3
>>> x = 4

It's not very pretty, so I rarely use this, but it does come in handy.

Better Runner

You can run every doctest in a package with the xdoctest CLI by specifying the module name, or the path to the module. It has the option to statically (new behavior) or dynamically (original behavior) extract the doctests. I like the static option because it guarantees there are no side effects at discovery time.
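
For reference, both invocation styles look roughly like this (the first is the same form as the scipy run mentioned later in this thread):

xdoctest scipy
python -m xdoctest path/to/module.py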

Better Output

When a doctest runs in verbose mode it will print the code it is about to execute. For failures it will print the failed line relative to the doctest (original behavior) as well as the absolute line relative to the file it lives in (new feature).

If there are failures, it will also output a CLI invocation of xdoctest to re-test just the specific tests that failed, which makes debugging easier.

Forgiving Defaults

Users should be encouraged to write examples, and that means preventing annoying failures.

In the old doctest, if you didn't specify a "want" statement for something that produces stdout or a value, it would fail. In xdoctest it will just assume you don't want to test that line of output.

It also lets you compare all stdout at the end of the doctest instead of always one-by-one. The following test fails with old doctests, but works with xdoctest:

>>> print('one')
>>> print('two')
>>> print('three')
one 
two
three

Misc

There are a lot more miscellaneous things that I might not remember off the top of my head. There are multiple "finders" in the sense that the "style" can be to parse google-style docstrings or simple freeform docstrings (which works well for numpy style).

In pytorch it helped a lot to be able to force every doctest to have an implicit "import torch" at the top. That is available via the global-exec option, which inserts custom code before each doctest.
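
From memory, the invocation looks something like the following; the exact flag spelling is an assumption on my part, so double-check it against the xdoctest docs (mymodule is a placeholder name):

python -m xdoctest mymodule --global-exec "import mymodule"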

There is an experimental feature to export all doctests as standalone unit-tests.

I also have a PyCon 2020 Presentation where I talk about some of the improvements. Corresponding slides are here.

Erotemic commented Aug 18, 2024

I've been going through the scipy_doctest code to understand it better and I have a few thoughts / observations.

  • scipy_doctest extends stdlib doctest, whereas xdoctest is a complete replacement / re-implementation. This means xdoctest has more code to maintain, but it is also unconstrained by stdlib baggage (although xdoctest does aim to be backwards compatible, and that does mean there is some conceptual baggage in xdoctest)

  • I like the idea of stopwords. Currently I implement all of these via directives. In my code you will see a very common pattern, e.g.:

Example:
    >>> data = make_some_demodata()
    >>> result = do_some_doctest(data)
    >>> # xdoctest: +REQUIRES(--show)
    >>> import kwplot
    >>> kwplot.autompl()
    >>> kwplot.plt.do_some_plotting_stuff()
    >>> kwplot.show_if_requested()

Using a stopword list could make my doctests more concise.
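
For comparison, here is roughly how a stopword is registered in scipy-doctest's config (a sketch; the attribute and import names are from memory, so verify against the scipy_doctest docs):

from scipy_doctest import DTConfig

config = DTConfig()
# Lines whose source contains a stopword are executed, but their output is not checked.
config.stopwords.add('plt.show')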

  • I think the idea of adding a skiplist to the xdoctest config is supportable, although I do think it makes sense to have the doctest tag itself as a skip. But I try not to be too opinionated in xdoctest, my philosophy is that making it easier for anyone to use doctests the way they like is the main objective.

  • I would love if the code for the floating-point aware checker could be ported to xdoctest. I just use ellipses to handle those cases, but I think the more like the actual output the "want" part of the doctest looks the better. It's annoying to see ellipses everywhere. In fact, the difficulty of getting "got/want" string to check correctly AND look nice is the driving motivation behind ubelt.urepr, which is what I typically use to consistently format numpy / float / OrderedDict / etc. structures across multiple versions of Python. In fact, this is a big reason I almost never use "got/want" and instead just write asserts in my doctests as if they were unit-tests, which is often less ideal.

Thinking more about flexible checkers, it might even be nice if there was a way to inject custom checkers. The idea is the user provides the doctest configuration with a set of candidate normalization functions that it can use. When the doctest runner gets a got/want test, it sequentially tries all normalizations to see if any pass, and if none do, then fail. I think that's what both xdoctest and scipy_doctest are roughly doing, so an extension to allow user supplied normalizations would probably not be difficult.
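
A minimal sketch of that idea (the names here are illustrative, not an existing API in either tool):

import re

def strip_trailing_ws(s):
    # Remove trailing whitespace on each line.
    return '\n'.join(line.rstrip() for line in s.splitlines())

def strip_ansi(s):
    # Remove ANSI color/escape sequences.
    return re.sub(r'\x1b\[[0-9;]*[a-zA-Z]', '', s)

# User-supplied candidate normalizations, tried in order.
normalizers = [strip_trailing_ws, strip_ansi]

def check(want, got):
    if want == got:
        return True
    for normalize in normalizers:
        if normalize(want) == normalize(got):
            return True
    return False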

  • I like the parse_namedtuples option a lot. Very neat idea.

  • On nameerror_after_exception, I think xdoctest's block-based runner works around this naturally. There is no concept of code executing after an unhandled exception. Just like if you were to define a unit test, if the chunk of code fails it does not continue.

  • I don't quite get how "dt_config.local_resources" is used. For tests that need resources, I try to ensure the module I'm working with always has a way to construct some sort of "demodata", and I usually use an appdir via ubelt.Path.appdir to work with local files. I typically like it if a doctest can be directly converted to a unit test with little-to-no changes, so I worry a bit about how much magic this is adding.

  • Does rndm_markers stand for random-markers? I don't quite get the name. It seems like skip_markers would be a better name.

  • What are you using user_context_mgr for? I'm wondering how important that is for scipy/numpy use-cases?

Also for fun, I ran xdoctest on a develop install of scipy via xdoctest scipy and got === 455 failed, 599 passed, 9 warnings in 181.22 seconds ===. Failed tests look like they are mostly due to the numpy array comparison issue. Also I see the reason to have plt.show in a skiplist. Lots of cool plots kept popping up.

ev-br (Member, Author) commented Aug 26, 2024

Thanks @Erotemic for starting the discussion! I'm going to have to respond piecemeal over the next few days.

I like the parse_namedtuples option a lot. Very neat idea.

Glad that you liked it :-). To me, this was a hack to work around the fact that in scipy namedtuples are not part of the API, so one cannot eval things like MoodResult(pvalue=0.9, statistic=42).
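
For context, the "want" part of such an example looks something like this (illustrative values; some_test is a hypothetical function):

>>> some_test(x, y)
MoodResult(statistic=42.0, pvalue=0.9)

Since MoodResult cannot be constructed in the doctest namespace, parse_namedtuples instead, roughly speaking, matches the repr field by field and compares the values numerically.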

local_resources

are useful when a doctest wants to access a local file, yes. It's not always possible or convenient to construct one on the fly. Here's one example, where the user guide talks about reading Matlab/Octave files: https://docs.scipy.org/doc/scipy/tutorial/io.html

rndm_markers

yes, "random". Out of the box, it's # may vary, # random in scipy; numpy also uses # uninitialized
These are subtly different from just plain skips, as they do not skip checking that example.source is valid python.

What are you using user_context_mgr for?

That's basically a placeholder for library-specific user context. Two examples: https://github.com/scipy/scipy/blob/main/scipy/conftest.py#L295 and https://github.com/numpy/numpy/blob/main/numpy/conftest.py#L173
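
A stripped-down sketch of what such a context manager can look like (illustrative only; the linked conftest files do considerably more, e.g. also controlling the random seed):

import warnings
from contextlib import contextmanager

@contextmanager
def warnings_errors(test=None):
    # Turn warnings into errors for the duration of one doctest.
    with warnings.catch_warnings():
        warnings.simplefilter('error')
        yield

dt_config.user_context_mgr = warnings_errors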

Also for fun, I ran xdoctest on a develop install of scipy via xdoctest scipy and got === 455 failed, 599 passed, 9 warnings in 181.22 seconds ===. Failed tests look like they are mostly due to the numpy array comparison issue. Also I see the reason to have plt.show in a skiplist.

Yeah. Pretty much all of scipy-doctest has grown out of accounting for various failures observed in the scipy docs.
Whitespace in NumPy array output is a big one, namedtuples, etc.

Lots of cool plots kept popping up.

scipy-doctest uses a context manager to switch to a display-less MPL backend:
https://github.com/scipy/scipy_doctest/blob/main/scipy_doctest/util.py#L15 and
https://github.com/scipy/scipy_doctest/blob/main/scipy_doctest/frontend.py#L250 (there's an equivalent usage in the pytest plugin runner).

ev-br (Member, Author) commented Aug 26, 2024

Now for the bigger picture. scipy-doctest also started fairly organically in 2015 :-). It was first shaped by the scipy docs, later ported/adapted to also account for numpy, and recently moved to a separate package. IIUC pytest-doctestplus started only a little earlier and grew out of the needs of astropy. So the need was clearly there (and still is!).

Personally, I think # may vary just looks like a comment. There is nothing indicating to me as a programmer that it will do something special. In pytorch we just added a sphinx rule that strips them out of the readthedocs before they are published.

That it looks like a comment is by design!
One of the early (and key, IMO) decisions was that the primary target audience is a user, not a programmer, and that users read docstrings in the terminal, too (via e.g. ipython). So we rather heavily avoid using directives (or rather: for almost every directive there is a way to store the relevant info in the tool). This avoids the need for a special sphinx rule to remove the directives (I think pytest-doctestplus has a variant of such a rule, too).
Of course, if directives are used sparingly, it's no big deal (# doctest: +SKIPBLOCK is used once or twice in NumPy IIRC and it's no problem).

Regardless, undoing directives is a very cool idea, actually!

One other move in the design space where it seems xdoctest and scipy-doctest differ is that I tried to minimize deviations from the stdlib doctest module. My reasoning was (and still is) that the doctest syntax is quirky enough, and adding more flexibility adds more cognitive load.

W.r.t. extending stdlib doctest vs reimplementation --- I was and still am lazy, so the more I can reuse and the less I need to reimplement, the better! (and I wouldn't touch the regex parsing in stdlib doctest with anything other than a very long stick). Your extensions to the doctest syntax are nice though, so the benefit from reimplementing is clear.

I'm not entirely sure though why you need to account for different docstring styles (numpydoc vs google etc)? As long as there's >>> to mark a line as a doctest, why is it not enough?

In the old doctest, if you didn't specify a "want" statement, for something that produce stdout or a value, it would fail. In xdoctest it will just assume you don't want to test that line of output.

Where does this logic live, is it the runner or a parser?

I would love if the code for the floating-point aware checker could be ported to xdoctest. I just use ellipses to handle those cases, but I think the more like the actual output the "want" part of the doctest looks the better. It's annoying to see ellipses everywhere. In fact, the difficulty of getting "got/want" string to check correctly AND look nice is the driving motivation behind ubelt.urepr, which is what I typically use to consistently format numpy / float / OrderedDict / etc. structures across multiple versions of Python. In fact, this is a big reason I almost never use "got/want" and instead just write asserts in my doctests as if they were unit-tests, which is often less ideal.

Having a tool dictating how to write doctest examples is something scipy-doctest definitely tries to avoid! And a sizeable fraction of the hoops the scipy-doctest DTChecker jumps through is exactly about this.

I definitely love the idea of porting DTChecker to xdoctest. One thing though: the DTChecker is very reliant on NumPy, and I've not a clear picture of how to make it work for e.g. pytorch. Do you? Maybe array API is the answer, or a subclass to override DTChecker._do_check to replace np.allclose with a pytorch alternative?
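
To make the subclass idea a bit more concrete, a rough sketch (untested; the import path and the _do_check signature are assumptions on my part):

import torch
from scipy_doctest.impl import DTChecker  # import path is a guess

class TorchDTChecker(DTChecker):
    def _do_check(self, want, got):
        # Try a torch-based numeric comparison first; fall back to the
        # NumPy-based logic of the parent class if that does not apply.
        try:
            return bool(torch.allclose(torch.as_tensor(want), torch.as_tensor(got)))
        except Exception:
            return super()._do_check(want, got)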

I'll definitely need to study the xdoctest code some more. I'll also take a look if I can plug the xdoctest's Parser, along the lines of https://github.com/scipy/scipy_doctest/blob/main/scipy_doctest/tests/test_runner.py#L94

ev-br (Member, Author) commented Aug 26, 2024

Going through things I've missed in your comments:

Default Finder's strategy accounts for public/private split in complex packages

Not quite sure what this means. You don't test doctests in private scopes by default? I think that would be easy to integrate into xdoctest, because each doctest knows where it came from, so an option to disable doctests in locations with a leading "_" wouldn't be hard.

Yes, basically. There's a worked example in https://github.com/scipy/scipy_doctest?tab=readme-ov-file#rough-edges-and-sharp-bits (under doctest collection strategies)

xdoctest is configurable, but for the most part I've been the only person with eyes on the core code. I'd be interested in at least seeing if there are internal architecture improvements that could be made. I know for a fact my parser could be faster, but it's not that slow to begin with so there's been little need to hammer on it.

Where does the configuration live, is there a central-ish place to collect various bits of configuration?
I'll take a look at xdoctest internals.

In pytorch it helped a lot to be able to force every doctest to have an implicit "import torch" at the top. That is available by the global-exec option to insert custom code before each doctest.

Yeah, we had an implicit import numpy as np until recently. Then the decision in scipy was to add the imports explicitly so that examples are self-contained. This is still configurable via DTConfig.default_namespace.
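
Roughly (a minimal sketch; dt_config is the DTConfig instance):

import numpy
dt_config.default_namespace = {'np': numpy}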

It has the option to statically (new behavior) or dynamically (original behavior) extract the doctests. I like the static option because it guarantees there are no side effects at discovery time.

Not entirely sure what the difference between dynamic and static is here?

Thinking more about flexible checkers, it might even be nice if there was a way to inject custom checkers. The idea is the user provides the doctest configuration with a set of candidate normalization functions that it can use. When the doctest runner gets a got/want test, it sequentially tries all normalizations to see if any pass, and if none do, then fail.

Mind expanding on what you mean by normalizations here?

Erotemic commented Sep 2, 2024

My reasoning was (and still is) that the doctest syntax is quirky enough, and adding more flexibility adds more cognitive load.

I'm of the opinion that the stdlib doctest module is fundamentally flawed (e.g. due to the underpowered parsing, insistence on checking for output on every single line) and that the extra developer cognitive load is justified if it results in a net decrease in the complexity that needs to be taught to a new user. But I acknowledge there is a trade-off here.

I'm not entirely sure though why you need to account for different docstring styles (numpydoc vs google etc)? As long as there's >>> to mark a line as a doctest, why is it not enough?

Because you might want different blocks of doc-strings to represent fundamentally different tests that are not impacted by the previous state and can pass or fail independently. However, by default xdoctest uses the "freeform" parser, which effectively puts all doctest parts into the same test object.

Where does this [wrt to got/want skipping] logic live, is it the runner or a parser?

The runner. Additionally, something I really value is the ability to "accumulate" output and then match it all at the end. This specifically happens here. If a statement produces output and there is no requested "want" statement, the runner will remember it. Then if a later part does get a "want" string, there are two ways for it to pass:

  1. It can match the exact output of the previous line
  2. It can match all of the accumulated, previously unchecked output.

This means you can write a test as I previously described in the "forgiving defaults" section:

>>> print('one')
>>> print('two')
>>> print('three')
one 
two
three

An opinion I've formed while working on this is that the doctest runner should work really hard to find a way in which the "want" output does match the "got" output. A "got/want" test should only fail if there is something very obviously wrong (and even then, the frequency with which you see # doctest: +SKIP shows that developers correctly don't value these as real tests). These tests are for illustration and coverage purposes, and they should not be relied on for proper and exact testing of a module. That should be handled by explicit assertions (and ideally those live in unit tests).

One thing though: the DTChecker is very reliant on NumPy, and I've not a clear picture of how to make it work for e.g. pytorch. Do you?

It might make sense to have numpy be an optional package, and if it exists, it enriches xdoctest with additional checks that it can perform on "got/want" tests. For torch I just made heavy use of ellipses, IGNORE_WANT, and regular in-doctest assert statements to avoid the "got/want" pattern.

Where does the configuration live, is there a central-ish place to collect various bits of configuration?

The config lives in the xdoctest.doctest_example.DoctestConfig class. You can see it used to help populate the basic CLI here and the pytest CLI here.

The default directive state is defined here, and it can be overwritten by the above config class.

Admittedly, the config system has a bit of baggage and could be simplified, but at its core it is just a dictionary that gets assigned to each xdoctest.doctest_example.DocTest object that is created (which is the container class that stores multiple xdoctest.doctest_parts.DoctestPart objects and handles their execution and reporting).

Not entirely sure what the difference between dynamic and static is here?

This is one of the major differences between xdoctest and stdlib doctest. By default we do not import the user code when we find the doctests. Instead we make use of the AST (in a different way than it is used when actually parsing the docstrings) to statically (i.e. with no executed user code) enumerate all of the top-level functions and classes with docstrings. Using this module-parsing strategy, user code is only executed at doctest runtime, which can help prevent side effects and improve the response time of the gathering step. Now, this is great if you code all of your docstrings explicitly, but it does break down if you programmatically generate an API. Thus xdoctest offers a "dynamic" parsing strategy which basically works the same way regular doctest does: import the module, enumerate its members, look at the .__doc__ attribute of each top-level item, and pray that there are no significant import-time side effects.
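
To illustrate the static strategy, here is a minimal sketch of discovering docstrings without importing the module (this is not xdoctest's actual code, just the idea):

import ast

def iter_docstrings(path):
    # Parse the source text; nothing from the module is executed.
    with open(path) as f:
        tree = ast.parse(f.read(), filename=path)
    if ast.get_docstring(tree) is not None:
        yield '<module>', ast.get_docstring(tree)
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            doc = ast.get_docstring(node)
            if doc is not None:
                yield node.name, doc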

Mind expanding on what you mean by normalizations here?

A normalization is a way to flexibly match got/want strings: e.g. use regex to remove trailing whitespace, remove newlines, remove ANSI escape sequences, replace quotes. It's a bit simpler than going into the want string, trying to extract a Python object out of it, and then comparing the objects on a programmatic level. Normalize the strings and check if they are equal; if so, pass, otherwise fall back to more heavyweight pattern matching. Thinking about it, it might not be possible to frame float comparison as a normalization, so perhaps this point is moot.
