UTF8 Support #1

yhirose · 2025-02-28T03:36:53Z

Ported the UTF8 support for the original linenoise.

mkdir build
cd build
meson ..
ninja
./linenoise
😀 hello>

Summary by Sourcery

Implements UTF-8 encoding support to enable the use of a broader set of characters in the linenoise library. This includes adding functions for handling UTF-8 encoding, determining character widths, and correctly positioning the cursor when working with multi-byte characters.

New Features:

Adds UTF-8 encoding support, allowing the input and display of a wider range of characters.
Introduces functions to handle UTF-8 characters, including determining the length of the previous and next characters, converting UTF-8 bytes to Unicode code points, and identifying wide and combining characters.
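
As an illustration of the byte-to-code-point conversion mentioned above, here is a minimal sketch. The name `decodeUtf8` and its signature are assumptions made for this example, not the functions added by the PR, and the sketch does not reject overlong encodings or surrogate values.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical helper (not the PR's actual function): decodes the UTF-8
// sequence that starts at buf[0]. Returns the number of bytes consumed and
// stores the code point in *cp, or returns 0 for empty, truncated, or
// malformed input. Overlong forms and surrogates are not rejected here.
std::size_t decodeUtf8(const char* buf, std::size_t len, std::uint32_t* cp) {
    if (len == 0) return 0;
    const unsigned char b0 = static_cast<unsigned char>(buf[0]);
    std::size_t need = 0;
    std::uint32_t value = 0;
    if (b0 < 0x80) {                 // 0xxxxxxx: ASCII
        need = 1; value = b0;
    } else if ((b0 & 0xE0) == 0xC0) { // 110xxxxx: 2-byte sequence
        need = 2; value = b0 & 0x1F;
    } else if ((b0 & 0xF0) == 0xE0) { // 1110xxxx: 3-byte sequence
        need = 3; value = b0 & 0x0F;
    } else if ((b0 & 0xF8) == 0xF0) { // 11110xxx: 4-byte sequence
        need = 4; value = b0 & 0x07;
    } else {
        return 0;                    // stray continuation or invalid lead byte
    }
    if (len < need) return 0;        // truncated sequence
    for (std::size_t i = 1; i < need; ++i) {
        const unsigned char b = static_cast<unsigned char>(buf[i]);
        if ((b & 0xC0) != 0x80) return 0;  // continuation must be 10xxxxxx
        value = (value << 6) | (b & 0x3F);
    }
    *cp = value;
    return need;
}
```

The return value doubles as the "length of the next character" in bytes, so a caller can advance the cursor by it.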

sourcery-ai · 2025-02-28T03:36:57Z

Reviewer's Guide by Sourcery

This pull request adds UTF-8 encoding support to the linenoise library. It includes functions for handling UTF-8 characters, identifying wide and combining characters, and correctly refreshing the display in the presence of UTF-8 characters. It also includes a python script to generate the unicode data tables.

Sequence diagram for inserting a character

sequenceDiagram
    participant linenoiseEditFeed
    participant linenoiseEditInsert
    participant write
    participant refreshLine

    linenoiseEditFeed->>linenoiseEditInsert: linenoiseEditInsert(l, cbuf, nread)
    alt l->len + clen <= l->buflen
        alt l->len == l->pos
            linenoiseEditInsert->>write: write(l->ofd, cbuf, clen)
        else l->len != l->pos
            linenoiseEditInsert->>refreshLine: refreshLine(l)
        end
    end

Sequence diagram for backspace

sequenceDiagram
    participant linenoiseEditBackspace
    participant prevCharLen
    participant memmove
    participant refreshLine

    linenoiseEditBackspace->>prevCharLen: prevCharLen(l->buf, l->len, l->pos, NULL)
    linenoiseEditBackspace->>memmove: memmove(l->buf+l->pos-chlen, l->buf+l->pos, l->len-l->pos)
    linenoiseEditBackspace->>refreshLine: refreshLine(l)
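
The backspace flow in the diagram above can be sketched on a `std::string` buffer. The helpers `prevCodePointLen` and `backspace` are hypothetical stand-ins written for this example; they step over a single code point and, unlike the PR's `prevCharLen`, make no attempt to group combining characters with their base character.

```cpp
#include <cstddef>
#include <string>

// Byte length of the code point that ends just before `pos`, found by
// walking back over UTF-8 continuation bytes (10xxxxxx). Returns 0 at pos 0.
std::size_t prevCodePointLen(const std::string& buf, std::size_t pos) {
    std::size_t start = pos;
    while (start > 0) {
        --start;
        if ((static_cast<unsigned char>(buf[start]) & 0xC0) != 0x80) break;
    }
    return pos - start;
}

// Hypothetical stand-in for the backspace step: erases the code point before
// the cursor and returns the new cursor byte offset. std::string::erase plays
// the role of the memmove call shown in the diagram.
std::size_t backspace(std::string& buf, std::size_t pos) {
    if (pos == 0 || pos > buf.size()) return pos;
    const std::size_t chlen = prevCodePointLen(buf, pos);
    buf.erase(pos - chlen, chlen);
    return pos - chlen;
}
```

Deleting by the whole character length, rather than one byte, is what keeps the buffer valid UTF-8 after the edit.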

Sequence diagram for moving cursor left

sequenceDiagram
    participant linenoiseEditMoveLeft
    participant prevCharLen
    participant refreshLine

    linenoiseEditMoveLeft->>prevCharLen: prevCharLen(l->buf, l->len, l->pos, NULL)
    linenoiseEditMoveLeft->>refreshLine: refreshLine(l)

Sequence diagram for moving cursor right

sequenceDiagram
    participant linenoiseEditMoveRight
    participant nextCharLen
    participant refreshLine

    linenoiseEditMoveRight->>nextCharLen: nextCharLen(l->buf, l->len, l->pos, NULL)
    linenoiseEditMoveRight->>refreshLine: refreshLine(l)

Sequence diagram for deleting a character

sequenceDiagram
    participant linenoiseEditDelete
    participant nextCharLen
    participant memmove
    participant refreshLine

    linenoiseEditDelete->>nextCharLen: nextCharLen(l->buf, l->len, l->pos, NULL)
    linenoiseEditDelete->>memmove: memmove(l->buf+l->pos, l->buf+l->pos+chlen, l->len-l->pos-chlen)
    linenoiseEditDelete->>refreshLine: refreshLine(l)

Updated class diagram for encoding functions

classDiagram
    class linenoisePrevCharLen {
        +size_t prevCharLen(const char *buf, size_t buf_len, size_t pos, size_t *col_len)
    }
    class linenoiseNextCharLen {
        +size_t nextCharLen(const char *buf, size_t buf_len, size_t pos, size_t *col_len)
    }
    class linenoiseReadCode {
        +size_t readCode(int fd, char *buf, size_t buf_len, int* c)
    }
    class linenoiseState {
        +size_t pos
        +size_t len
        +size_t plen
        +size_t oldcolpos
    }

    linenoiseState : +size_t oldcolpos
    linenoiseState : +size_t pos
    linenoiseState : +size_t len
    linenoiseState : +size_t plen

File-Level Changes

| Change | Details | Files |
| --- | --- | --- |
| Added UTF-8 encoding support to linenoise. | Added functions to determine the length of UTF-8 code points and graphemes. Added functions to convert UTF-8 bytes to Unicode code points. Added tables and functions to identify wide and combining characters. Modified the line refresh logic to handle UTF-8 characters correctly, including adjusting cursor positions and hint display. Updated the edit insert and delete functions to work with UTF-8 characters. Added a function to read a Unicode code point from a file descriptor. | `linenoise.cpp` `linenoise.h` `example.cpp` `scripts/generate_unicode_data_tables.py` |

sourcery-ai

Hey @yhirose - I've reviewed your changes - here's some feedback:

Overall Comments:

Consider adding a configuration option to enable or disable UTF-8 support, as it might introduce overhead for users who only need ASCII characters.
The wide character and combining character tables are quite large; consider compressing them or using a more efficient data structure like a bloom filter.
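
For context on the table-size comment, one common compact representation is a sorted table of code point ranges searched with binary search; unlike a bloom filter, it gives exact answers. The sketch below is an assumption-laden illustration, not the PR's code: the tables hold only a tiny sample of ranges, and the names `CodePointRange`, `inRanges`, and `columnWidth` are invented for this example.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <iterator>

struct CodePointRange {
    std::uint32_t first;
    std::uint32_t last;  // inclusive
};

// Illustrative samples only, sorted and non-overlapping. Real tables would be
// generated from the Unicode Character Database and be far larger.
static const CodePointRange kWideSample[] = {
    {0x1100, 0x115F},  // Hangul Jamo initial consonants
    {0x4E00, 0x9FFF},  // CJK Unified Ideographs
    {0xAC00, 0xD7A3},  // Hangul Syllables
};
static const CodePointRange kCombiningSample[] = {
    {0x0300, 0x036F},  // Combining Diacritical Marks
    {0x20D0, 0x20F0},  // Combining Diacritical Marks for Symbols
};

// Binary search: find the first range whose end is not below cp, then check
// that the range actually starts at or before cp.
template <std::size_t N>
bool inRanges(const CodePointRange (&table)[N], std::uint32_t cp) {
    const CodePointRange* it = std::lower_bound(
        std::begin(table), std::end(table), cp,
        [](const CodePointRange& r, std::uint32_t value) { return r.last < value; });
    return it != std::end(table) && it->first <= cp;
}

// Terminal columns occupied by a code point under the sample tables:
// 0 for combining marks, 2 for wide characters, 1 otherwise.
int columnWidth(std::uint32_t cp) {
    if (inRanges(kCombiningSample, cp)) return 0;
    if (inRanges(kWideSample, cp)) return 2;
    return 1;
}
```

Lookup cost grows with the logarithm of the number of ranges, so even a full table of several hundred ranges needs only about ten comparisons per character.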

Here's what I looked at during the review

🟢 General issues: all looks good
🟢 Security: all looks good
🟢 Testing: all looks good
🟡 Complexity: 1 issue found
🟢 Documentation: all looks good

sourcery-ai · 2025-02-28T03:37:44Z

linenoise.cpp

@@ -227,6 +227,445 @@ static void lndebug(const char *, ...) {
 }
 #endif

+/* ========================== Encoding functions ============================= */


issue (complexity): Consider extracting UTF-8 handling, wide/combining character checks, and column calculation routines into a dedicated module or namespace to improve code organization and reduce complexity.

Consider extracting the UTF‑8 handling, wide/combining character checks, and column‐calculation routines into a dedicated module or namespace. For example, move functions such as `prevUtf8CodePointLen()`, `utf8BytesToCodePoint()`, `isWideChar()`, `isCombiningChar()`, and the three similar column routines (`columnPos()`, `promptTextColumnLen()`, `columnPosForMultiLine()`) into a separate file (e.g. `Utf8Utils.cpp` with header `Utf8Utils.h`). For example, you might do the following:

```cpp
// Utf8Utils.h
#pragma once
#include <cstddef>

namespace Utf8Utils {
  size_t prevUtf8CodePointLen(const char* buf, int pos);
  size_t utf8BytesToCodePoint(const char* buf, size_t len, int* cp);
  bool isWideChar(unsigned long cp);
  bool isCombiningChar(unsigned long cp);

  // Unified column calculation helper:
  size_t computeColumnPos(const char *buf, size_t buf_len, size_t pos);
  size_t promptTextColumnLen(const char *prompt, size_t plen);
  size_t columnPosForMultiLine(const char *buf, size_t buf_len, size_t pos,
                               size_t cols, size_t ini_pos);
}
```

```cpp
// Utf8Utils.cpp
#include "Utf8Utils.h"
#include <cstring>
#include <cstdio>

// Implement previous functions by moving the current logic here.
// Example:
size_t Utf8Utils::prevUtf8CodePointLen(const char* buf, int pos) {
  int end = pos--;
  while (pos >= 0 && ((unsigned char)buf[pos] & 0xC0) == 0x80)
    pos--;
  return end - pos;
}

// Similarly implement utf8BytesToCodePoint(), isWideChar(), isCombiningChar()
// and merge similar column functions into computeColumnPos() where possible.
```

Then update the main file to use these utilities:

```cpp
#include "Utf8Utils.h"

// Example in refreshSingleLine():
size_t pcollen = Utf8Utils::promptTextColumnLen(l->prompt, strlen(l->prompt));
size_t currCol = Utf8Utils::computeColumnPos(l->buf, l->len, l->pos);
// … use pcollen and currCol in the refresh logic …
```

This refactoring isolates the encoding and column handling logic from the terminal refresh and editing code, reduces duplication, and makes the code easier to manage and test.

yhirose · 2025-02-28T04:20:55Z

@ericcurtin I ported the UTF-8 support from the original linenoise to linenoise.cpp as the default behavior. It can handle Latin characters with diacritics, CJK wide characters, and singleton emoji characters. It currently supports the Unicode 16.0 standard. Running `scripts/generate_unicode_data_tables.py` updates the Unicode data tables in linenoise.cpp from the latest Unicode Character Database.

This simple implementation cannot handle more complex characters such as emoji ZWJ sequences, Indic syllables, or Korean Hangul characters. To support such characters, the default implementation needs to be replaced with a better one via `linenoiseSetEncodingFunctions`.
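
To show why swappable encoding callbacks matter, here is a minimal sketch. The callback shape follows the `prevCharLen` signature from the class diagram in the reviewer's guide; the `EncodingFunctions` holder, `bytePrevCharLen`, `utf8PrevCharLen`, and `moveLeft` are hypothetical names for this example and do not reproduce how the library stores or registers its callbacks.

```cpp
#include <cstddef>

// Callback shape taken from the class diagram in the reviewer's guide.
using PrevCharLenFn = std::size_t (*)(const char* buf, std::size_t buf_len,
                                      std::size_t pos, std::size_t* col_len);

// Hypothetical holder for the active callback (only prevCharLen, for brevity).
struct EncodingFunctions {
    PrevCharLenFn prevCharLen;
};

// One byte per character: correct for ASCII input only.
std::size_t bytePrevCharLen(const char*, std::size_t, std::size_t pos,
                            std::size_t* col_len) {
    const std::size_t n = pos > 0 ? 1 : 0;
    if (col_len) *col_len = n;
    return n;
}

// One code point per character: skips back over UTF-8 continuation bytes.
// Assumes every code point is one column wide, which is a simplification.
std::size_t utf8PrevCharLen(const char* buf, std::size_t, std::size_t pos,
                            std::size_t* col_len) {
    std::size_t start = pos;
    while (start > 0) {
        --start;
        if ((static_cast<unsigned char>(buf[start]) & 0xC0) != 0x80) break;
    }
    if (col_len) *col_len = pos > start ? 1 : 0;
    return pos - start;
}

// Editing code only talks to the callback, so a grapheme-aware implementation
// could be dropped in without touching the cursor logic.
std::size_t moveLeft(const EncodingFunctions& enc, const char* buf,
                     std::size_t len, std::size_t pos) {
    return pos - enc.prevCharLen(buf, len, pos, nullptr);
}
```

With the byte-wise callback, moving left from the end of a two-byte character lands in the middle of it; with the UTF-8 callback the cursor lands on the character boundary.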

I believe this default Unicode behavior will satisfy most users. When I have time, I'll try making a better opt-in implementation with cpp-unicodelib, which supports grapheme text segmentation based on UAX #29 and should handle the complex characters mentioned above.

ericcurtin · 2025-02-28T11:15:04Z

@yhirose thanks, this works great and all sounds good to me. It is breaking the shellcheck build, but I'll catch that in another PR.

ericcurtin · 2025-02-28T11:17:03Z

@benoitf tagging for awareness

ericcurtin · 2025-02-28T11:17:43Z

#4

benoitf · 2025-02-28T11:28:32Z

awesome thanks 😎

UTF8 Support

a141e13

sourcery-ai bot reviewed Feb 28, 2025

yhirose mentioned this pull request Feb 28, 2025

Another implementation for utf8 support. antirez/linenoise#187

Open

ericcurtin merged commit 39d72ad into ericcurtin:main Feb 28, 2025
0 of 2 checks passed

yhirose deleted the utf8-support branch February 28, 2025 12:35
