docx table with w:firstColumn unmerges cells #10627

Closed
speleo3 opened this issue Feb 18, 2025 · 8 comments

speleo3 commented Feb 18, 2025

Explain the problem.

A docx table that has w:firstColumn set and contains merged cells is converted to HTML with the cells unmerged.

Example

colspan.docx

pandoc -t html5 -o colspan.html colspan.docx

Expected output with pandoc 3.6.1:

Col 1 Col 2 Col 3 Col 4 Col 5 Col 6 Col 7
Row 1 A B C D E F
Merge all columns (a single cell spanning all 7 columns)
Row 3 G H I J K L

Unexpected output with pandoc 3.6.3:

Col 1 Col 2 Col 3 Col 4 Col 5 Col 6 Col 7
Row 1 A B C D E F
Merge all columns (in the first column only, followed by 6 empty cells)
Row 3 G H I J K L

Pandoc version?

  • 3.6.1 (OK)
  • 3.6.2 (broken)
  • 3.6.3 (broken)

OS: Ubuntu (WSL2)

speleo3 added the bug label Feb 18, 2025

jgm (Owner) commented Feb 18, 2025

Just to be clear, if w:firstColumn is not set in this document, you do get the merged columns?

jgm (Owner) commented Feb 18, 2025

cbe67b9 is the commit that introduced sensitivity to w:firstColumn (pandoc 3.6.2).

speleo3 (Author) commented Feb 18, 2025

> Just to be clear, if w:firstColumn is not set in this document, you do get the merged columns?

Not sure if I'm looking at the correct thing, but colspan.docx/word/document.xml does contain w:firstColumn="1":

  <w:body>
    <w:p w14:paraId="0406BEDF" w14:textId="25B6C475" w:rsidR="00BD3F59" w:rsidRDefault="00BD3F59" w:rsidP="00BD3F59">
      <w:pPr>
        <w:pStyle w:val="Caption" />
        <w:ind w:left="0" w:firstLine="0" />
      </w:pPr>
    </w:p>
    <w:tbl>
      <w:tblPr>
        <w:tblStyle w:val="StandardtabelleRB" />
        <w:tblW w:w="5001" w:type="pct" />
        <w:tblLayout w:type="fixed" />
        <w:tblLook w:val="04A0" w:firstRow="1" w:lastRow="0" w:firstColumn="1" w:lastColumn="0" w:noHBand="0" w:noVBand="1" />
      </w:tblPr>

Changing it from "1" to "0" and running pandoc 3.6.2 again gives me merged columns in the html output.
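
For anyone who needs a stopgap on an affected version, the manual edit above can be scripted. This is only a rough sketch: the function name `clear_first_column` is made up for illustration, and it assumes the table properties live in `word/document.xml`, that the file is UTF-8, and that the namespace prefix is literally `w:` as in the snippet above. It leaves the `w:val` bitmask alone, which matches the manual edit described here.

```python
import io
import re
import zipfile

# Assumption: the attribute appears inside a <w:tblLook ...> element,
# as in the document.xml excerpt from this thread.
_TBL_LOOK = re.compile(r"<w:tblLook\b[^>]*>")


def clear_first_column(docx_bytes: bytes) -> bytes:
    """Return a copy of the .docx with w:firstColumn="1" changed to "0"
    in every w:tblLook element of word/document.xml."""
    src = zipfile.ZipFile(io.BytesIO(docx_bytes))
    out_buf = io.BytesIO()
    with zipfile.ZipFile(out_buf, "w", zipfile.ZIP_DEFLATED) as dst:
        for item in src.infolist():
            data = src.read(item.filename)
            if item.filename == "word/document.xml":
                xml = data.decode("utf-8")
                # Only touch the attribute when it sits inside w:tblLook.
                xml = _TBL_LOOK.sub(
                    lambda m: m.group(0).replace(
                        'w:firstColumn="1"', 'w:firstColumn="0"'
                    ),
                    xml,
                )
                data = xml.encode("utf-8")
            # Every other archive member is copied through unchanged.
            dst.writestr(item, data)
    return out_buf.getvalue()
```

Reading `colspan.docx` as bytes, passing it through this function, and writing the result to a new file should give a document equivalent to the hand-edited one.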

jgm (Owner) commented Feb 18, 2025

I can confirm that the issue arises in the docx reader, not the HTML writer.
We get:

        , Row
            ( "" , [] , [] )
            [ Cell
                ( "" , [] , [] )
                AlignCenter
                (RowSpan 1)
                (ColSpan 1)
                [ Plain
                    [ Str "Merge"
                    , Space
                    , Str "all"
                    , Space
                    , Str "columns"
                    ]
                ]
            , Cell
                ( "" , [] , [] ) AlignDefault (RowSpan 1) (ColSpan 1) []
            , Cell
                ( "" , [] , [] ) AlignDefault (RowSpan 1) (ColSpan 1) []
            , Cell
                ( "" , [] , [] ) AlignDefault (RowSpan 1) (ColSpan 1) []
            , Cell
                ( "" , [] , [] ) AlignDefault (RowSpan 1) (ColSpan 1) []
            , Cell
                ( "" , [] , [] ) AlignDefault (RowSpan 1) (ColSpan 1) []
            , Cell
                ( "" , [] , [] ) AlignDefault (RowSpan 1) (ColSpan 1) []
            ]
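
To check the same thing without reading the full native dump, the column spans can be pulled out of pandoc's JSON output. The sketch below assumes the JSON layout used by current pandoc-types for tables (Table as `[attr, caption, colspecs, head, bodies, foot]`, TableBody as `[attr, rowHeadColumns, headRows, bodyRows]`, Cell as `[attr, align, rowspan, colspan, blocks]`); `body_row_colspans` is a hypothetical helper name, and only top-level tables are inspected.

```python
def body_row_colspans(doc: dict) -> list[list[int]]:
    """Collect the ColSpan of every cell, row by row, for the bodies of
    all top-level tables in a pandoc JSON document."""
    rows = []
    for block in doc.get("blocks", []):
        if block.get("t") != "Table":
            continue
        # Table contents: [attr, caption, colspecs, head, bodies, foot]
        _attr, _caption, _colspecs, _head, bodies, _foot = block["c"]
        for body in bodies:
            # TableBody: [attr, rowHeadColumns, headRows, bodyRows]
            _battr, _rhc, head_rows, body_rows = body
            for row in head_rows + body_rows:
                # Row: [attr, cells]; the ColSpan is the 4th field of a Cell.
                rows.append([cell[3] for cell in row[1]])
    return rows
```

Feeding it the parsed output of `pandoc -f docx -t json colspan.docx` (for example via `json.load(sys.stdin)`) should show `[7]` for the merged row when the span survives, and seven `1`s when it does not.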

jgm (Owner) commented Feb 19, 2025

I have localized the issue.
https://github.com/jgm/pandoc/blob/main/src/Text/Pandoc/Readers/Docx.hs#L842-L847

bodyCells is

...
, Row
    ( "" , [] , [] )
    [ Cell
        ( "" , [] , [] )
        AlignCenter
        (RowSpan 1)
        (ColSpan 7)
        [ Plain [ Str "Merge" , Space , Str "all" , Space , Str "columns" ]
        ]
    ]
...

but the output of tableWith (from Text.Pandoc.Builder) is

...
        , Row
            ( "" , [] , [] )
            [ Cell
                ( "" , [] , [] )
                AlignCenter
                (RowSpan 1)
                (ColSpan 1)
                [ Plain
                    [ Str "Merge"
                    , Space
                    , Str "all"
                    , Space
                    , Str "columns"
                    ]
                ]
            , Cell
                ( "" , [] , [] ) AlignDefault (RowSpan 1) (ColSpan 1) []
            , Cell
                ( "" , [] , [] ) AlignDefault (RowSpan 1) (ColSpan 1) []
            , Cell
                ( "" , [] , [] ) AlignDefault (RowSpan 1) (ColSpan 1) []
            , Cell
                ( "" , [] , [] ) AlignDefault (RowSpan 1) (ColSpan 1) []
            , Cell
                ( "" , [] , [] ) AlignDefault (RowSpan 1) (ColSpan 1) []
            , Cell
                ( "" , [] , [] ) AlignDefault (RowSpan 1) (ColSpan 1) []
            ]
...

So this has to do with the table normalization done by the tableWith builder, introduced in jgm/pandoc-types#65.
@desprec might have a better understanding than I do of why; I haven't had a chance to delve into that code.
The relevant function is presumably normalizeBodySection:
https://github.com/jgm/pandoc-types/blob/master/src/Text/Pandoc/Builder.hs#L633-L650
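
The before/after dumps are consistent with a normalizer that does not let a cell cross the boundary between the row-header columns and the rest of the row. The following is a simplified Python model of that idea, not the pandoc-types code itself: `fill_section` and `normalize_row` are hypothetical names, cells are reduced to their column spans, and row spans are ignored.

```python
def fill_section(cells, width):
    """Place cells left to right into `width` grid columns.

    Each cell's span is clipped to the space left in the section, and any
    remaining columns are padded with empty single-column cells.
    Returns (placed_spans, leftover_cells).
    """
    placed = []
    remaining = width
    rest = list(cells)
    while remaining > 0 and rest:
        span = min(max(rest.pop(0), 1), remaining)
        placed.append(span)
        remaining -= span
    placed.extend([1] * remaining)  # padding cells
    return placed, rest


def normalize_row(colspans, table_width, row_head_columns):
    """Normalize one row: fill the row-head section first, then the body
    section, so that no cell spans across the boundary between them."""
    head, rest = fill_section(colspans, row_head_columns)
    body, _ = fill_section(rest, table_width - row_head_columns)
    return head + body
```

Under this model, `normalize_row([7], 7, 1)` clips the merged cell to the one header column and pads the rest, giving seven single-column cells as in the tableWith output above, while `normalize_row([7], 7, 0)` leaves the span of 7 intact.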

jgm (Owner) commented Feb 19, 2025

The unfortunate thing is that, in my experience, Word always sets w:firstColumn="1" by default for tables. You have to find the Table Design tab and explicitly uncheck "First Column" to make this go away. So, by changing the docx reader to be sensitive to w:firstColumn="1", we have in effect broken colspans for just about everyone. Unless there's a straightforward fix, it might be worth considering rolling back cbe67b9.

jgm (Owner) commented Feb 19, 2025

I'll say one more thing: conceptually, it's quite odd to have a table that designates the left-hand column as a heading but then has a cell in that column span all 7 columns. I don't imagine anyone would want that. The problem really arises because of Word's odd default of setting w:firstColumn="1".
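
One way to express that intuition is a guard that only honours w:firstColumn when no cell in the first column spans further to the right. This is just an illustrative sketch in Python with a hypothetical function name; it is not a description of what any pandoc commit actually does, and it looks only at column spans.

```python
def effective_row_head_columns(first_column_flag, rows):
    """Decide how many row-header columns to use for a table.

    `first_column_flag` is the value of w:firstColumn from w:tblLook;
    `rows` is a list of rows, each a list of cell column spans.
    Returns 1 only when the flag is set and every row's first cell stays
    within the first column; otherwise 0.
    """
    if not first_column_flag:
        return 0
    for spans in rows:
        if spans and spans[0] > 1:
            # A first-column cell is merged across other columns, so treating
            # the first column as a header would force the span to be split.
            return 0
    return 1
```

For the table in this issue the rows would be seven 1s, `[7]`, and seven 1s, so the guard returns 0 and the merged row keeps its span.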

speleo3 (Author) commented Feb 19, 2025

Thanks for the excellent analysis. Word's default behavior and the hard-to-discover "First Column" checkbox are really unfortunate.

jgm closed this as completed in 3caf2b1 Feb 20, 2025