Suppress hyphenations (splits) on multiple contiguous lines? #5

PhilterPaper · 2020-12-29T16:37:04Z

When adding a third sample text to examples/KP.pl, I saw three consecutive lines with hyphenated (split) words, including the penultimate line. It is my understanding that the K-P algorithm is supposed to avoid such runs of hyphenation (as well as hyphenating the next-to-last line), so I'm going to consider this a bug.

PhilterPaper · 2021-01-09T18:18:42Z

The last version of typeset (the JavaScript library, https://github.com/bramstein/typeset, that Text::KnuthPlass is based on) is about a year newer than 1.02, the last release of this package before I picked it up. It would therefore be a good idea to go over the last 18 to 24 months of typeset development (starting 6 to 12 months before the KP 1.02 release) and see whether any bug fixes from that period should go into Text::KnuthPlass. On top of that, there has been further development until at least 2017, which should also be looked at.

PhilterPaper · 2021-03-03T15:44:40Z

bramstein/typeset#27 mentions that a demerit value of 3000 was used in typesetting Seminumerical Algorithms, while the value used in typeset defaults to 100. Perhaps this has something to do with this reported problem, and could be helped by allowing Text::KnuthPlass to set this parameter to a higher value.
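For reference, here is a minimal Python sketch of how a TeX-style demerit calculation can charge extra for two hyphenated (flagged) breaks in a row. It is not the code of typeset or Text::KnuthPlass. The function and parameter names are hypothetical, the `line_penalty` default of 10 and the `INFINITY` value are assumptions for illustration, and only the 100 and 3000 values come from the discussion above.

```python
INFINITY = 10000  # assumed magnitude that marks a forced break


def demerits(badness, penalty, flagged, prev_flagged,
             line_penalty=10, flagged_demerit=100):
    """Demerits for one candidate line ending at a given breakpoint (sketch)."""
    base = (line_penalty + badness) ** 2
    if penalty >= 0:
        d = base + penalty ** 2      # ordinary break with a cost
    elif penalty > -INFINITY:
        d = base - penalty ** 2      # encouraged break
    else:
        d = base                     # forced break
    # Extra charge only when this break AND the previous one are flagged,
    # i.e. two consecutive lines end in a hyphen.
    if flagged and prev_flagged:
        d += flagged_demerit
    return d


# Same line, with the previous line also hyphenated:
low = demerits(5, 100, True, True)                         # default of 100
high = demerits(5, 100, True, True, flagged_demerit=3000)  # value cited above
```

With these inputs the larger setting adds 2900 demerits to the second hyphenated line, which may be enough to tip the optimizer toward a different set of breaks.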

PhilterPaper · 2022-09-20T12:48:31Z

My first pass through @bramstein / typeset hasn't turned up any explicit code to find and handle contiguous hyphenated lines, so the only thing it may be doing is discouraging hyphenation overall via a small penalty (and a lack of adjacent hyphenated lines is just a beneficial byproduct). This case of three in a row may just be bad luck.

bramstein · 2022-09-20T12:58:14Z

@PhilterPaper That's correct, there's no explicit handling of consecutive hyphenated lines. It is all handled by the penalty system. It is thus quite possible to get several consecutive hyphenated lines if that is the optimal justification.

PhilterPaper · 2022-09-20T13:11:09Z

Thanks for responding so quickly. My thought is that once a line ending in a hyphenated word is laid down and "frozen", it would not be too difficult, when the next line also ends with a hyphenated word, to go back and check the previous line for a hyphenation. If there is one, add an additional penalty to this line's hyphenation. On the other hand, if the previous line isn't frozen until it's too late to rearrange the paragraph, that might not do any good.

I see that the final output (at least for the Perl version) has a hyphenation penalty after every word fragment's box, so I'm not sure where in the code KP ends up counting only the line-end hyphens for penalties. Back for another dive into the code at some point, I guess. I also need to check whether a naturally hyphenated word split at that hyphen (e.g., "absent-minded") gets the hyphenation penalty.

bramstein · 2022-09-20T14:23:07Z

The KP algorithm tries to minimize the penalties over the entire paragraph, so what is chosen is most likely the optimal choice. That means that if you try to avoid consecutive hyphenated lines, something else has to give (for example, the inter-word spacing). So it's a trade-off: you can avoid the hyphenation, but something else will get (slightly) worse.

As for naturally hyphenated words, it depends on the code that generates the sequence of boxes, glue, and penalties. If I remember correctly (and it has been a while), I allow a linebreak after the hyphen but do not add a penalty for naturally hyphenated words.
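A rough Python sketch of such a node generator follows, assuming a simplified model in which every word arrives already split into fragments. The class names, `word_nodes`, `HYPHEN_PENALTY`, and the `flag_natural` switch are all hypothetical, and `len` stands in for real font metrics. It is not taken from typeset or Text::KnuthPlass.

```python
from dataclasses import dataclass

HYPHEN_PENALTY = 100  # hypothetical cost of breaking at an inserted hyphen


@dataclass
class Box:
    text: str
    width: float


@dataclass
class Penalty:
    width: float    # extra width added only if the line breaks here
    penalty: float  # cost charged only if the line breaks here
    flagged: bool   # True marks a break that counts as a hyphenation


def word_nodes(fragments, measure=len, flag_natural=False):
    """Turn the fragments of one word into boxes with a penalty between them."""
    nodes = []
    for i, frag in enumerate(fragments):
        nodes.append(Box(frag, measure(frag)))
        if i == len(fragments) - 1:
            break  # no break opportunity after the final fragment
        if frag.endswith("-"):
            # Explicit hyphen already in the text: free break, nothing inserted.
            nodes.append(Penalty(0, 0, flag_natural))
        else:
            # Soft hyphenation point: a hyphen is inserted and the break costs.
            nodes.append(Penalty(measure("-"), HYPHEN_PENALTY, True))
    return nodes


natural = word_nodes(["absent-", "minded"])
soft = word_nodes(["hy", "phen", "ation"])
```

A penalty node sits after every fragment, but its cost and width only come into play at the node where a line actually ends. Setting `flag_natural=True` is one possible way to let an explicit hyphen count toward a run of hyphenated lines without carrying a penalty of its own.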

If you haven't already, you can implement the Unicode line breaking algorithm to find the line breaking properties of the input tokens (for example, see: https://github.com/bramstein/unicode-tokenizer).

PhilterPaper · 2022-09-20T15:30:28Z

I am aware that KP is trying to globally optimize (minimize) penalties over an entire paragraph, so "fixing" contiguous hyphenated lines may force something else to "give". I still think that multiple contiguous hyphenated lines are a glaring fault (one that really catches the eye), and it would be good to get rid of them, even at the cost of slightly worse fitting elsewhere. For N lines in a row hyphenated, perhaps line 1 gets the standard penalty, line 2 gets 1.5 penalties, line 3 gets 3 penalties, etc.
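The escalating scheme could be sketched in Python roughly as follows. The function name `run_penalty` and its parameters are hypothetical. Since "etc." leaves the multipliers beyond line 3 unspecified, this sketch assumes the last listed multiplier is simply reused for longer runs.

```python
def run_penalty(base, run_length, multipliers=(1.0, 1.5, 3.0)):
    """Hyphenation penalty for the run_length-th consecutive hyphenated line.

    base        -- the standard hyphenation penalty
    run_length  -- 1 for the first hyphenated line in a run, 2 for the second, ...
    multipliers -- configurable scale factors, one per position in the run
    """
    if run_length < 1:
        return 0.0  # the line is not hyphenated at all
    # Assumption: runs longer than the table reuse the last multiplier.
    index = min(run_length, len(multipliers)) - 1
    return base * multipliers[index]


# With a standard penalty of 100, the first three lines of a run cost:
costs = [run_penalty(100, n) for n in (1, 2, 3)]
```

Because the run length depends on which earlier breaks were chosen, a function like this would have to be evaluated per candidate path during the search (with each active breakpoint remembering its current run length) rather than baked into the node list up front.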

I have briefly looked at the Unicode TR 14 (I think that's the one) on allowed line-breaking points; I will take a look at your code, too. Thanks for the suggestion!

PhilterPaper · 2023-01-15T16:40:23Z

Anyone with thoughts on how to best update the KP algorithm to discourage multiple contiguous hyphenated lines is welcome to chime in. It should be configurable, of course. I don't recall seeing anything in Text::KnuthPlass to allow "naturally hyphenated" words (I presume that means a compound word with explicit hyphen(s) already in it) to be split at the hyphen without penalty, although it should probably count as the first of a run (without its own penalty) if following line(s) want to be hyphenated. When I get back to this later this year, I'll see about handling it in that manner (after checking whether typeset actually does it).

Thinking ahead to adding KP to PDF::Builder to do proper paragraph shaping, I may have to extend the code to handle an array of text chunks, each with its own font and size, rather than just one long string. In addition, line lengths may vary unpredictably due to column shapes and where a line falls due to font and image vertical extents (i.e., you can't give a fixed list of line lengths in advance). It promises to be a major task! I'm also mulling over writing my own hyphenation package, improving upon Text::Hyphen et al. by being able to switch among multiple human languages on the fly (among other things) and by handling the letter changes/repeats needed by some languages. Alex Holkner's thesis (https://citeseerx.ist.psu.edu/pdf/ee95750a9dd047b52901efda59819864bb9ede4a) page 11 has an interesting data structure for such things.

PhilterPaper added the bug label Dec 29, 2020

PhilterPaper mentioned this issue Mar 6, 2021

Enhancements #4

Open

PhilterPaper added the enhancement label and removed the bug label Sep 23, 2022
