Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support for GitHub-Flavoured Markdown tables #292

Merged
merged 11 commits into from
Nov 20, 2022

Conversation

bobatkey
Copy link
Contributor

First cut of support for GitHub-Flavoured Markdown (GFM) tables. No tests yet, but I thought I'd submit a preliminary pull request to make sure that there was interest. This handles simple tables that always have headers, and only allow inline content in cells, and no multi-column or multi-row cells.

All of the existing tests pass.

Addresses part of issue #205 .

GFM-style tables are documented here: https://github.github.com/gfm/#tables-extension- . The code so far seems to work for all the examples in that document, but I haven't made a proper test suite yet, mostly because I don't yet understand how the testing works. I've noticed that there are a few differences between the GFM documentation and Pandoc's gfm implementation too.

I added some functions to the StrSlice module to make the parsing of tables easier.

There are choices in how to represent the cell alignment in the HTML output. I went for the one chosen by Pandoc (CSS style information), but the align="left/right/center" used in the GFM document might be better.

I made up a converstion to Sexprs. I also noticed that inline code and images are not properly converted into Sexprs. What uses this translation?

I'm not sure if I'm handling link definitions properly in the case of lines at the end of blocks that look like they might be table headers, but aren't (line 80 in block_parser.ml).

@shonfeder shonfeder self-requested a review November 1, 2022 17:54
@bobatkey
Copy link
Contributor Author

bobatkey commented Nov 4, 2022

These four new commits do the following:

  1. Fix up the parsing to cover some more cases, and to bring it inline with the GitHub-Flavoured Markdown specification. There is one exception, where the GFM spec contradicts itself on whether or not backslashes are always preserved in code spans.
  2. Adjust the HTML output to better match the GFM spec when printing tables. This makes testing easier.
  3. Add some tests. There are two sets of test. The first are the excerpt of the GFM specification that covers tables. I modified the one that has the non-CommonMark backslash-in-code-span behaviour. The second set of tests are some more examples covering cases that I discovered while deciding how to parse properly.
  4. Fix the code formatting by applying ocamlformat.

I hope this is a useful contribution!

@shonfeder
Copy link
Collaborator

Thank you very much for the contribution: It is most welcome!

At a superficial survey, the extensions look great, and the additions to the StrSlice module most welcome. I will set aside some time for a proper review this weekend :)

@shonfeder
Copy link
Collaborator

There are choices in how to represent the cell alignment in the HTML output. I went for the one chosen by Pandoc

I think following the lead of pandoc is a good choice, unless there's compelling reasons to differ :)

I made up a converstion to Sexprs. I also noticed that inline code and images are not properly converted into Sexprs. What uses this translation?

These aren't used for anything in Omd. I think they were only meant for debugging and possibly for interop with some other libraries?

Thank for including a s-expr conversion!

Copy link
Collaborator

@shonfeder shonfeder left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Finally had a chance to review properly. Thanks for this lovely addition!

I left a few suggestions and questions. I should have bandwidth on Sunday to make the changes myself. But let me know if you'd rather make them yourself or if you find any of my suggestions objectionable.

@@ -41,6 +50,7 @@ module Make (C : BlockContent) = struct
| Code_block of 'attr * string * string
| Html_block of 'attr * string
| Definition_list of 'attr * 'attr def_elt list
| Table of 'attr * ('attr C.t * cell_alignment) list * 'attr C.t list list
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if it's worth making the last two fields an inline record to identify for readers the header from the rows... not sure if that's worth it, but at least a comment to explain would be helpful.

Comment on lines +206 to +208
(* Makes sure that there are the same number of delimiters
as headers. See
https://github.github.com/gfm/#example-203 *)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks very much for the helpful comments and references!

src/parser.ml Outdated
@@ -755,25 +756,48 @@ let indented_code ind s =
if indent s + ind < 4 then raise Fail;
Lindented_code (StrSlice.offset (4 - ind) s)

(* A sequence of cell contents separated by unescaped '|'
characters. *)
let table_row pipe_prefix s =
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using a labeled arg for pipe_prefix will help communicate the meaning of this combinator at call sites. E.g., table_row ~pipe_prefix:true.

@@ -15,13 +15,23 @@ and inline = function
| Text (_, s) -> Atom s
| Emph (_, il) -> List [ Atom "emph"; inline il ]
| Strong (_, il) -> List [ Atom "strong"; inline il ]
| Code _ -> Atom "code"
| Code _ -> Atom "code" (* FIXME: this seems broken? *)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Noted in #293. Thanks!

src/strSlice.ml Outdated
Comment on lines 66 to 69
let take_n n s =
if n < 0 then invalid_arg "take_n";
let len = min n s.len in
{ s with len }
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The other take also takes n. I wonder if the name here might be clearer as take_prefix or maybe even prefix length:int s?

| `normal, c when c = sep -> Some (idx - s.off)
| `normal, _ -> loop (idx + 1) `normal
| `escape, _ -> loop (idx + 1) `normal
| `verbatim_open n, '`' -> loop (idx + 1) (`verbatim_open (n + 1))
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Really helpful use of polymorphic variants to document and track the state 👍

Comment on lines +174 to +179
let trim s =
let is_whitespace = function
| ' ' | '\t' | '\010' .. '\013' -> true
| _ -> false
in
drop_while is_whitespace (drop_last_while is_whitespace s)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not necessary for this pr but: tho it's only a few lines of code, but I still wonder if it's worth trying to share the function defined at https://github.com/ocaml/omd/blob/7b59e5926a1f811e1d89b4b37b163a459692d308/src/parser.ml#L11-L13

Could mean adding a util module or....

src/strSlice.mli Outdated
@@ -13,6 +13,12 @@ val index : (char -> bool) -> t -> int option
(** [index c s] is [Some i] where [i] is the index of the character in [s] for
which [f] is first true, or [None] if [f] holds for no characters in [s]. *)

val index_unescaped : char -> t -> int option
(** [index_unescaped c s] is [Some i] where [i] index of the first
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
(** [index_unescaped c s] is [Some i] where [i] index of the first
(** [index_unescaped c s] is [Some i] where [i] is the index of the first

src/strSlice.mli Outdated
Comment on lines 40 to 41
(** [take_n n s] returns the slice consisting of the first [n]
characters of [s]. *)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just to be consistent with the "equational idiom" used in the rest of the docs for this module:

Suggested change
(** [take_n n s] returns the slice consisting of the first [n]
characters of [s]. *)
(** [take_n n s] is the slice consisting of the first [n]
characters of [s]. *)

</tr>
</thead>
</table>
````````````````````````````````
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for adding such good test coverage for this feature! 🙏

bobatkey and others added 10 commits November 20, 2022 15:12
support more github flavoury tables

fix HTML generation for tables

Switch to using slices for table lines
- Ignore `|` characters within code spans for the purposes of spotting
  table cell delimiters. Correctly allow for code spans containing ``
  ` `` characters.

- Disallow tables with no columns.

- Allow rows with no `|` characters, including rows that look like
  setext headings (sequences of `=` or `-`)
- Add a new kind of `element_type` to the HTML trees for the table
  elements `table`, `thead`, `tbody`, and `tr`.

- This makes the HTML output match newline placement used the GFM
  specification.
New tests for the GFM-style table support, in the same style as the
existing tests, comparing HTML output.

1. Excerpt the GFM specification's table examples, with one exception
   in the treatment of escape characters inside code spans (noted in the
   test file).

2. Twelve additional tests for various corner cases.
@shonfeder shonfeder force-pushed the topic/support-tables branch from b40cc11 to 543418e Compare November 20, 2022 20:13
@shonfeder shonfeder enabled auto-merge (rebase) November 20, 2022 20:13
@shonfeder shonfeder merged commit 8cc423b into ocaml-community:master Nov 20, 2022
@tmattio
Copy link

tmattio commented Dec 3, 2022

Thanks a lot for the fantastic work on this!

@shonfeder do you think we could cut a release of omd? Having this would solve a rather annoying issue in ocaml.org: ocaml/ocaml.org#59

@shonfeder
Copy link
Collaborator

Hi @tmattio! I’ll make time to cut a new alpha release next week. :)

@shonfeder
Copy link
Collaborator

@tmattio a bit delayed, but see ocaml/opam-repository#22654

@tmattio
Copy link

tmattio commented Dec 13, 2022

Thanks a lot @shonfeder!

mseri pushed a commit to ocaml/opam-repository that referenced this pull request Dec 13, 2022
CHANGES:

- Expose the HTML escape function `htmlentities` (ocaml-community/omd#295 @cuihtlauac)

- Support generation of identifiers in headers (ocaml-community/omd#294, @tatchi)

- Support GitHub-Flavoured Markdown tables (ocaml-community/omd#292, @bobatkey)

- Update parser to support CommonMark Spec 0.30 (ocaml-community/omd#266, @SquidDev)

- Preserve the order of input files in the HTML output to stdout (ocaml-community/omd#258,
  @patricoferris)

- Fix all deviations from CommonMark Spec 0.30 (ocaml-community/omd#284, ocaml-community/omd#283, ocaml-community/omd#278, ocaml-community/omd#277, ocaml-community/omd#269,
  @tatchi)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants