Lua function pandoc.read seems to ignore ReaderOptions.extensions #10593

Open
rnwst opened this issue Feb 3, 2025 · 13 comments

@rnwst
Contributor

rnwst commented Feb 3, 2025

Explain the problem.
When running the following Lua filter (test.lua)

function Pandoc(doc)
    local reader_options = pandoc.ReaderOptions(PANDOC_READER_OPTIONS)
    -- remove 'auto_identifiers' extension
    for i, v in ipairs(reader_options.extensions) do
        if v == 'auto_identifiers' then
            table.remove(reader_options.extensions, i)
        end
    end
    local doc = pandoc.read('# Header {.class key=val}', 'markdown', reader_options)
    io.stderr:write(pandoc.write(doc, 'markdown'))
end

with the following command (where test.md is any document)

pandoc --from=markdown --to=native test.md --lua-filter=test.lua > /dev/null

the following text is printed to stderr:

# Header {#header .class key="val"}

It seems as if removing the auto_identifiers extension from reader_options.extensions has no effect and the header is automatically assigned an Id regardless. The expected stderr output is

# Header {.class key="val"}

Pandoc version?
pandoc 3.6.2
Features: +server +lua
Scripting engine: Lua 5.4

OS: Linux

rnwst added the bug label Feb 3, 2025
@jgm
Owner

jgm commented Feb 3, 2025

@tarleb would know what is happening here.

@tarleb
Collaborator

tarleb commented Feb 3, 2025

This is indeed ignored, and I seem to remember a discussion about this behavior, but forgot the exact context. I'll try to find a link.

The main issue is that we assume that the format specifier is the authoritative source of reader extensions, so the ReaderOptions field gets overridden.

@tarleb
Collaborator

tarleb commented Feb 3, 2025

It seems I was wrong: I can't find any discussion. The current behavior came about because the reader options parameter was added later, and the only obvious, backwards-compatible way to introduce it was with the current behavior.

Related: #9587

@jgm
Owner

jgm commented Feb 3, 2025

So putting markdown-auto_identifiers as the target format should have the desired effect?

@rnwst
Contributor Author

rnwst commented Feb 4, 2025

I can confirm that

local doc = pandoc.read('# Header {.class key=val}', 'markdown-auto_identifiers', reader_options)

has the desired effect.

I have written some filters where I convert parts of the AST back to pandoc's Markdown, perform some string manipulations, and then convert it back to an AST, like so (I would assume that this is a fairly common theme):

local writer_opts = pandoc.WriterOptions({extensions = PANDOC_READER_OPTIONS.extensions})
local md = pandoc.write(pandoc.Pandoc(blocks), 'markdown', writer_opts)
-- Manipulate md.
-- Then convert it back to an AST:
local ast_fragment = pandoc.read(md, 'markdown', PANDOC_READER_OPTIONS).blocks

To do this accurately (so that one ends up with the same markdown as in the source file, or at least as close as possible), one needs to consider the markdown extensions that were applied when the document was parsed, as is done in the code above. Of course, this doesn't work if those extensions are ignored by the read and write functions. One could instead append all the non-default extensions to the format specifier (markdown) based on the contents of PANDOC_READER_OPTIONS.extensions, but this would not be nearly as elegant as the code above.

May I therefore propose the following alternative behaviour (which relies on the reader_options/writer_options arguments being optional):

  • If reader_options/writer_options is not specified, the default markdown extensions along with those specified in the format string are used (e.g. if format is markdown-auto_identifiers, the extensions are default minus auto_identifiers). This is the current behaviour and requires no change.
  • If reader_options/writer_options is specified, the corresponding extensions are used instead of those specified in the format string. If the format string specifies non-default extensions, the read/write functions could either error out or print a warning that the extensions specified by the reader_options/writer_options take precedence and those specified in the format string are ignored.
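
To make the first bullet concrete: a format string such as markdown-auto_identifiers can be read as a base format name plus a set of changes to the default extensions. Below is a minimal pure-Lua sketch of that reading. The helper names parse_format_spec and apply_changes and the tiny defaults table are made up for illustration and are not part of pandoc's API; the sketch also assumes that format names themselves never contain + or -.

```lua
-- Hypothetical helper (not part of pandoc): split a format string into
-- its base name and the extension changes it requests.
local function parse_format_spec(spec)
  local name = spec:match('^[^+%-]+') or ''
  local changes = {}
  for sign, ext in spec:gmatch('([+%-])([%w_]+)') do
    changes[ext] = (sign == '+')   -- '+' enables, '-' disables
  end
  return name, changes
end

-- Hypothetical helper: overlay the requested changes on a defaults table
-- that maps extension names to booleans.
local function apply_changes(defaults, changes)
  local result = {}
  for ext, on in pairs(defaults) do result[ext] = on end
  for ext, on in pairs(changes) do result[ext] = on end
  return result
end

-- Illustrative defaults only; the real set for 'markdown' is much larger.
local defaults = { auto_identifiers = true, smart = true, emoji = false }

local name, changes = parse_format_spec('markdown-auto_identifiers')
local effective = apply_changes(defaults, changes)
print(name, effective.auto_identifiers, effective.smart)
```

Under the second bullet, a non-empty changes table combined with explicit reader_options/writer_options would be the situation in which a warning or error is raised.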

Please let me know what you think.

@rnwst
Contributor Author

rnwst commented Feb 4, 2025

Another option might be that reader_options.extensions are used if format includes no non-default extensions, and the extensions in format are used instead if non-default extensions are specified there. That would be more backwards compatible with the current behaviour, I suppose.

@tarleb
Collaborator

tarleb commented Feb 4, 2025

This should give the desired effect:

local extensions = PANDOC_READER_OPTIONS.extensions
local flavored_format = { format = 'markdown', extensions = extensions }
local md = pandoc.write(pandoc.Pandoc(blocks), flavored_format)
-- Manipulate md.
-- Then convert it back to an AST:
local ast_fragment = pandoc.read(md, flavored_format).blocks

I agree that the current situation is not really satisfying. Some kind of warning would be good.

@tarleb
Collaborator

tarleb commented Feb 4, 2025

I'm not happy with either of the options for when to issue a warning, as all of them might lead to warnings in perfectly fine code. For now I'm just going to document the current behavior a bit better.

@jgm, if you have any preferences for how these options should interact, then I'd be most happy to implement that.

@jgm
Owner

jgm commented Feb 4, 2025

Perhaps a function that takes a format and a list of extensions and outputs something like markdown-auto_identifiers would address the use case above (using the same extensions as the input, with some alterations).

@rnwst
Contributor Author

rnwst commented Feb 4, 2025

> This should give the desired effect:
>
> local extensions = PANDOC_READER_OPTIONS.extensions
> local flavored_format = { format = 'markdown', extensions = extensions }
> local md = pandoc.write(pandoc.Pandoc(blocks), flavored_format)
> -- Manipulate md.
> -- Then convert it back to an AST:
> local ast_fragment = pandoc.read(md, flavored_format).blocks
>
> I agree that the current situation is not really satisfying. Some kind of warning would be good.

Thank you for that Albert, I didn't realise that you could pass a table instead of the format string as the second argument - I should've done a better job reading the documentation! @jgm, that addresses the use case I presented above, so I don't think there's a need for the kind of function you have described.

Regarding the behaviour of pandoc.read and pandoc.write, I understand that you don't like the idea of warnings or errors if there is nothing wrong with the code @tarleb - that makes sense. Thinking about this a bit more, I think we can achieve 'expected behaviour' without printing any warnings or errors by doing the following:

  • If extensions are specified in the second argument of pandoc.read or pandoc.write (either by appending them to the format string, like so: markdown-auto_identifiers, or by passing a table with extensions, like so: {format = 'markdown', extensions = extensions}), then these extensions are always used and the extensions in ReaderOptions or WriterOptions are ignored, if given. This is the current behaviour.
  • If no extensions are specified in the second argument of pandoc.read or pandoc.write and ReaderOptions or WriterOptions are passed as the third argument, then the extensions specified in the third argument are used instead. This is unlike the current behaviour.
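
The two rules above can be emulated today in a filter, without changing pandoc, by deciding which format argument to pass. What follows is only a sketch: format_has_extensions and effective_format are hypothetical helper names, not pandoc functions, and the string check assumes that format names themselves never contain + or -.

```lua
-- Hypothetical helper (not part of pandoc): does the format argument carry
-- extension information of its own?
local function format_has_extensions(format)
  if type(format) == 'table' then
    return format.extensions ~= nil
  end
  -- Format string: look for a '+ext' or '-ext' modifier.
  return format:find('[+%-]') ~= nil
end

-- Hypothetical helper: pick the format argument according to the two
-- rules proposed above.
local function effective_format(format, options)
  if format_has_extensions(format) then
    return format                      -- rule 1: the format argument wins
  end
  if options and options.extensions then
    local name = type(format) == 'table' and format.format or format
    return { format = name, extensions = options.extensions }  -- rule 2
  end
  return format                        -- no options given: defaults apply
end
```

In a filter this might be called as pandoc.read(md, effective_format('markdown', PANDOC_READER_OPTIONS), PANDOC_READER_OPTIONS), relying on the table form of the format argument shown earlier in the thread.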

I think this behaviour would be closest to what the user expects these functions to do.

@tarleb
Collaborator

tarleb commented Feb 5, 2025

Partially off topic: I wanted to see if it's possible/easy to write a Lua function to stringify a format with extensions, and here's the result:

local function flavored_format_spec (format_name, format_extensions_list)
  local supported_exts = pandoc.format.extensions(format_name)
  local supported_extensions_list = pandoc.List:new(pairs(supported_exts))
  supported_extensions_list:sort()   -- sort list so we get deterministic results
  local results = {format_name}
  for _, ext in ipairs(supported_extensions_list) do
    if supported_exts[ext] ~= format_extensions_list:includes(ext) then
      results[#results + 1] = (supported_exts[ext] and '-' or '+') .. ext
    end
  end
  return table.concat(results)
end

@bpj

bpj commented Feb 5, 2025

@tarleb why is it ~= in the conditional? I would have expected a hit if they are both the same. (And yes I'm probably missing something trivial! :-)

@tarleb
Collaborator

tarleb commented Feb 5, 2025

The input is a format name and the list of enabled extensions, e.g. PANDOC_READER_OPTIONS.extensions. In the function we first get the set of extensions that are enabled by default for that format. We need to find those extensions that are enabled now but disabled by default, and vice versa. Hence the ~=, as only then does the extension have a non-default polarity. The poor man's ternary operator (… and … or …) then determines the polarity + or -.
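
The polarity check can be tried without pandoc. In this small pure-Lua sketch, the defaults table is a hand-written stand-in for the result of pandoc.format.extensions (its contents are illustrative, not the real markdown defaults), enabled is a hypothetical set of currently active extensions, and format_spec is a made-up name for a simplified version of the function above.

```lua
-- Stand-in for pandoc.format.extensions(name): true means enabled by
-- default, false means supported but disabled by default. Illustrative only.
local defaults = {
  auto_identifiers = true,
  smart = true,
  emoji = false,
  hard_line_breaks = false,
}

-- Hypothetical set of extensions that are enabled right now.
local enabled = { smart = true, emoji = true }

local function format_spec(name, defaults, enabled)
  -- Collect and sort the extension names for deterministic output.
  local names = {}
  for ext in pairs(defaults) do names[#names + 1] = ext end
  table.sort(names)

  local parts = { name }
  for _, ext in ipairs(names) do
    local is_default = defaults[ext]
    local is_enabled = enabled[ext] == true
    -- Only extensions whose state differs from the default are listed.
    if is_default ~= is_enabled then
      -- Default on but now off gives '-'; default off but now on gives '+'.
      parts[#parts + 1] = (is_default and '-' or '+') .. ext
    end
  end
  return table.concat(parts)
end

print(format_spec('markdown', defaults, enabled))
-- prints: markdown-auto_identifiers+emoji
```

Here smart and hard_line_breaks match their defaults and are skipped, while auto_identifiers and emoji differ from theirs and so appear with the sign that undoes the default.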
