UTF8 Support #1

Merged
merged 1 commit into from
Feb 28, 2025

yhirose
Contributor

@yhirose yhirose commented Feb 28, 2025

Ported the UTF-8 support from the original linenoise.

mkdir build
cd build
meson ..
ninja
./linenoise
😀 hello>

Summary by Sourcery

Implements UTF-8 encoding support to enable the use of a broader set of characters in the linenoise library. This includes adding functions for handling UTF-8 encoding, determining character widths, and correctly positioning the cursor when working with multi-byte characters.

New Features:

  • Adds UTF-8 encoding support, allowing the input and display of a wider range of characters.
  • Introduces functions to handle UTF-8 characters, including determining the length of the previous and next characters, converting UTF-8 bytes to Unicode code points, and identifying wide and combining characters.
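As a rough illustration of the "UTF-8 bytes to Unicode code points" step listed above, here is a minimal sketch. The name `decodeUtf8` and its signature are hypothetical and are not taken from the PR. It assumes the caller substitutes U+FFFD for malformed input, and it does not reject overlong encodings or surrogates.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical helper (not the PR's function): decodes the UTF-8 sequence
// that starts at buf[0]. Returns the number of bytes consumed (0 when len
// is 0) and stores the decoded code point in *cp.
// Sketch only: overlong forms and surrogates are not rejected.
std::size_t decodeUtf8(const char* buf, std::size_t len, std::uint32_t* cp) {
    if (len == 0) return 0;
    const unsigned char b0 = static_cast<unsigned char>(buf[0]);
    std::size_t need = 1;
    std::uint32_t value = b0;
    if ((b0 & 0x80) == 0x00)      { need = 1; value = b0; }          // 0xxxxxxx
    else if ((b0 & 0xE0) == 0xC0) { need = 2; value = b0 & 0x1F; }   // 110xxxxx
    else if ((b0 & 0xF0) == 0xE0) { need = 3; value = b0 & 0x0F; }   // 1110xxxx
    else if ((b0 & 0xF8) == 0xF0) { need = 4; value = b0 & 0x07; }   // 11110xxx
    else { *cp = 0xFFFD; return 1; }  // stray continuation or invalid lead byte
    if (need > len) { *cp = 0xFFFD; return 1; }  // truncated sequence
    // Each continuation byte contributes its low six bits.
    for (std::size_t i = 1; i < need; ++i)
        value = (value << 6) | (static_cast<unsigned char>(buf[i]) & 0x3F);
    *cp = value;
    return need;
}
```

For the prompt emoji shown earlier, the four bytes `F0 9F 98 80` decode to U+1F600.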

sourcery-ai bot commented Feb 28, 2025

Reviewer's Guide by Sourcery

This pull request adds UTF-8 encoding support to the linenoise library. It includes functions for handling UTF-8 characters, identifying wide and combining characters, and correctly refreshing the display in the presence of UTF-8 characters. It also includes a Python script to generate the Unicode data tables.

Sequence diagram for inserting a character

sequenceDiagram
    participant linenoiseEditFeed
    participant linenoiseEditInsert
    participant write
    participant refreshLine

    linenoiseEditFeed->>linenoiseEditInsert: linenoiseEditInsert(l, cbuf, nread)
    alt l->len + clen <= l->buflen
        alt l->len == l->pos
            linenoiseEditInsert->>write: write(l->ofd, cbuf, clen)
        else l->len != l->pos
            linenoiseEditInsert->>refreshLine: refreshLine(l)
        end
    end

Sequence diagram for backspace

sequenceDiagram
    participant linenoiseEditBackspace
    participant prevCharLen
    participant memmove
    participant refreshLine

    linenoiseEditBackspace->>prevCharLen: prevCharLen(l->buf, l->len, l->pos, NULL)
    linenoiseEditBackspace->>memmove: memmove(l->buf+l->pos-chlen, l->buf+l->pos, l->len-l->pos)
    linenoiseEditBackspace->>refreshLine: refreshLine(l)
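
The backspace flow in the diagram above can be sketched on a `std::string` buffer. This is a simplified illustration rather than the PR's code: `prevCodePointLen` and `backspace` are hypothetical names, and the sketch steps back over a single code point only, whereas the ported `prevCharLen` is described as also handling combining characters.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: byte length of the code point that ends just before
// pos. Walks back over continuation bytes (10xxxxxx) to the lead byte.
std::size_t prevCodePointLen(const std::string& buf, std::size_t pos) {
    if (pos == 0) return 0;
    std::size_t start = pos - 1;
    while (start > 0 && (static_cast<unsigned char>(buf[start]) & 0xC0) == 0x80)
        --start;
    return pos - start;
}

// Removes the code point before the cursor and moves the cursor back by the
// same number of bytes. Returns false when the cursor is at the start.
bool backspace(std::string& buf, std::size_t& pos) {
    const std::size_t chlen = prevCodePointLen(buf, pos);
    if (chlen == 0) return false;
    buf.erase(pos - chlen, chlen);  // plays the role of the memmove in the diagram
    pos -= chlen;
    return true;
}
```

A real implementation would follow this with a line refresh, as the diagram shows.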

Sequence diagram for moving cursor left

sequenceDiagram
    participant linenoiseEditMoveLeft
    participant prevCharLen
    participant refreshLine

    linenoiseEditMoveLeft->>prevCharLen: prevCharLen(l->buf, l->len, l->pos, NULL)
    linenoiseEditMoveLeft->>refreshLine: refreshLine(l)

Sequence diagram for moving cursor right

sequenceDiagram
    participant linenoiseEditMoveRight
    participant nextCharLen
    participant refreshLine

    linenoiseEditMoveRight->>nextCharLen: nextCharLen(l->buf, l->len, l->pos, NULL)
    linenoiseEditMoveRight->>refreshLine: refreshLine(l)

Sequence diagram for deleting a character

sequenceDiagram
    participant linenoiseEditDelete
    participant nextCharLen
    participant memmove
    participant refreshLine

    linenoiseEditDelete->>nextCharLen: nextCharLen(l->buf, l->len, l->pos, NULL)
    linenoiseEditDelete->>memmove: memmove(l->buf+l->pos, l->buf+l->pos+chlen, l->len-l->pos-chlen)
    linenoiseEditDelete->>refreshLine: refreshLine(l)
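
Moving right and deleting both depend on the length of the character under the cursor. The sketch below illustrates that idea under simplifying assumptions: the names `nextCodePointLen`, `moveRight`, and `deleteAtCursor` are hypothetical, the length is taken from the lead byte alone, and trailing combining characters are not merged into the step.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: byte length of the code point starting at pos,
// derived from the lead byte and clamped to the end of the buffer.
std::size_t nextCodePointLen(const std::string& buf, std::size_t pos) {
    if (pos >= buf.size()) return 0;
    const unsigned char b = static_cast<unsigned char>(buf[pos]);
    std::size_t n = 1;  // ASCII, or an invalid byte treated as one unit
    if ((b & 0xE0) == 0xC0) n = 2;
    else if ((b & 0xF0) == 0xE0) n = 3;
    else if ((b & 0xF8) == 0xF0) n = 4;
    const std::size_t remaining = buf.size() - pos;
    return n < remaining ? n : remaining;
}

// Moves the cursor one code point to the right. Returns false at end of line.
bool moveRight(const std::string& buf, std::size_t& pos) {
    const std::size_t chlen = nextCodePointLen(buf, pos);
    if (chlen == 0) return false;
    pos += chlen;
    return true;
}

// Deletes the code point under the cursor. The cursor itself does not move.
bool deleteAtCursor(std::string& buf, std::size_t pos) {
    const std::size_t chlen = nextCodePointLen(buf, pos);
    if (chlen == 0) return false;
    buf.erase(pos, chlen);
    return true;
}
```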

Updated class diagram for encoding functions

classDiagram
    class linenoisePrevCharLen {
        +size_t prevCharLen(const char *buf, size_t buf_len, size_t pos, size_t *col_len)
    }
    class linenoiseNextCharLen {
        +size_t nextCharLen(const char *buf, size_t buf_len, size_t pos, size_t *col_len)
    }
    class linenoiseReadCode {
        +size_t readCode(int fd, char *buf, size_t buf_len, int* c)
    }
    class linenoiseState {
        +size_t pos
        +size_t len
        +size_t plen
        +size_t oldcolpos
    }


File-Level Changes

Change: Added UTF-8 encoding support to linenoise.

Details:
  • Added functions to determine the length of UTF-8 code points and graphemes.
  • Added functions to convert UTF-8 bytes to Unicode code points.
  • Added tables and functions to identify wide and combining characters.
  • Modified the line refresh logic to handle UTF-8 characters correctly, including adjusting cursor positions and hint display.
  • Updated the edit insert and delete functions to work with UTF-8 characters.
  • Added a function to read a Unicode code point from a file descriptor.

Files:
  • linenoise.cpp
  • linenoise.h
  • example.cpp
  • scripts/generate_unicode_data_tables.py
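
One common way to implement "tables and functions to identify wide and combining characters" is a sorted array of code point ranges searched with binary search. The sketch below assumes that layout, and the PR's actual tables may be organized differently. The table contents are a tiny illustrative excerpt, not the generated Unicode 16.0 data, and all names are hypothetical.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <iterator>

struct CodePointRange {
    std::uint32_t first;
    std::uint32_t last;  // inclusive
};

// Illustrative excerpts only. Each table is sorted and non-overlapping.
static const CodePointRange kWideSample[] = {
    {0x1100, 0x115F},    // Hangul Jamo initial consonants
    {0x3041, 0x3096},    // Hiragana
    {0x4E00, 0x9FFF},    // CJK Unified Ideographs
    {0x1F600, 0x1F64F},  // Emoticons
};
static const CodePointRange kCombiningSample[] = {
    {0x0300, 0x036F},    // Combining Diacritical Marks
    {0x20D0, 0x20F0},    // Combining Diacritical Marks for Symbols
};

// Binary search: find the first range whose end is >= cp, then check
// that the range also starts at or before cp.
template <std::size_t N>
bool inRanges(const CodePointRange (&table)[N], std::uint32_t cp) {
    const CodePointRange* it = std::lower_bound(
        std::begin(table), std::end(table), cp,
        [](const CodePointRange& r, std::uint32_t value) { return r.last < value; });
    return it != std::end(table) && it->first <= cp;
}

bool isWideSample(std::uint32_t cp) { return inRanges(kWideSample, cp); }
bool isCombiningSample(std::uint32_t cp) { return inRanges(kCombiningSample, cp); }
```

A range table gives exact answers in O(log n) lookups, which matters here because a false positive would misplace the cursor.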


@sourcery-ai sourcery-ai bot left a comment

Hey @yhirose - I've reviewed your changes - here's some feedback:

Overall Comments:

  • Consider adding a configuration option to enable or disable UTF-8 support, as it might introduce overhead for users who only need ASCII characters.
  • The wide character and combining character tables are quite large; consider compressing them or using a more efficient data structure like a bloom filter.
Here's what I looked at during the review
  • 🟢 General issues: all looks good
  • 🟢 Security: all looks good
  • 🟢 Testing: all looks good
  • 🟡 Complexity: 1 issue found
  • 🟢 Documentation: all looks good


@@ -227,6 +227,445 @@ static void lndebug(const char *, ...) {
}
#endif

/* ========================== Encoding functions ============================= */

issue (complexity): Consider extracting UTF-8 handling, wide/combining character checks, and column calculation routines into a dedicated module or namespace to improve code organization and reduce complexity.

Functions such as `prevUtf8CodePointLen()`, `utf8BytesToCodePoint()`, `isWideChar()`, `isCombiningChar()`, and the three similar column routines (`columnPos()`, `promptTextColumnLen()`, `columnPosForMultiLine()`) could move into a separate file (e.g. `Utf8Utils.cpp` with header `Utf8Utils.h`). For example:

```cpp
// Utf8Utils.h
#pragma once
#include <cstddef>

namespace Utf8Utils {
    size_t prevUtf8CodePointLen(const char* buf, int pos);
    size_t utf8BytesToCodePoint(const char* buf, size_t len, int* cp);
    bool isWideChar(unsigned long cp);
    bool isCombiningChar(unsigned long cp);

    // Unified column calculation helper:
    size_t computeColumnPos(const char *buf, size_t buf_len, size_t pos);
    size_t promptTextColumnLen(const char *prompt, size_t plen);
    size_t columnPosForMultiLine(const char *buf, size_t buf_len, size_t pos, size_t cols, size_t ini_pos);
}
```

```cpp
// Utf8Utils.cpp
#include "Utf8Utils.h"
#include <cstring>
#include <cstdio>

// Implement the functions above by moving the current logic here.
// Example:
size_t Utf8Utils::prevUtf8CodePointLen(const char* buf, int pos) {
    int end = pos--;
    // Step back over continuation bytes (10xxxxxx) to the lead byte.
    while (pos >= 0 && ((unsigned char)buf[pos] & 0xC0) == 0x80)
        pos--;
    return end - pos;
}

// Similarly implement utf8BytesToCodePoint(), isWideChar(), isCombiningChar()
// and merge similar column functions into computeColumnPos() where possible.
```

Then update the main file to use these utilities:

```cpp
#include "Utf8Utils.h"

// Example in refreshSingleLine():
size_t pcollen = Utf8Utils::promptTextColumnLen(l->prompt, strlen(l->prompt));
size_t currCol = Utf8Utils::computeColumnPos(l->buf, l->len, l->pos);
// … use pcollen and currCol in the refresh logic …
```

This refactoring isolates the encoding and column handling logic from the terminal refresh and editing code, reduces duplication, and makes the code easier to manage and test.
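
To make the column calculation concrete, here is a minimal self-contained sketch of what a helper like the suggested `computeColumnPos()` might do. It is not the PR's implementation. The width rules are assumptions reduced to two stand-in checks (`isCombining`, `isWide`) that cover only a few ranges, and `decodeAt` assumes mostly well-formed input.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Simplified stand-ins for the real table lookups (assumptions, not the
// library's data): one combining block and two wide blocks.
static bool isCombining(std::uint32_t cp) { return cp >= 0x0300 && cp <= 0x036F; }
static bool isWide(std::uint32_t cp) {
    return (cp >= 0x4E00 && cp <= 0x9FFF) || (cp >= 0x1F600 && cp <= 0x1F64F);
}

// Decodes one code point at buf[pos] and returns its byte length.
// A truncated sequence is reported as a single U+FFFD byte.
static std::size_t decodeAt(const std::string& buf, std::size_t pos, std::uint32_t& cp) {
    const unsigned char b0 = static_cast<unsigned char>(buf[pos]);
    std::size_t n = 1;
    cp = b0;
    if ((b0 & 0xE0) == 0xC0) { n = 2; cp = b0 & 0x1F; }
    else if ((b0 & 0xF0) == 0xE0) { n = 3; cp = b0 & 0x0F; }
    else if ((b0 & 0xF8) == 0xF0) { n = 4; cp = b0 & 0x07; }
    if (pos + n > buf.size()) { cp = 0xFFFD; return 1; }
    for (std::size_t i = 1; i < n; ++i)
        cp = (cp << 6) | (static_cast<unsigned char>(buf[pos + i]) & 0x3F);
    return n;
}

// Terminal columns occupied by the bytes in buf[0, pos):
// combining marks take 0 columns, wide characters 2, everything else 1.
std::size_t columnPos(const std::string& buf, std::size_t pos) {
    std::size_t cols = 0;
    std::size_t off = 0;
    while (off < pos && off < buf.size()) {
        std::uint32_t cp = 0;
        off += decodeAt(buf, off, cp);
        if (isCombining(cp)) continue;
        cols += isWide(cp) ? 2 : 1;
    }
    return cols;
}
```

With these rules, the 12-byte prompt `😀 hello> ` occupies 10 columns, which is why the refresh logic cannot use byte offsets for cursor placement.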

@yhirose
Contributor Author

yhirose commented Feb 28, 2025

@ericcurtin I ported the UTF-8 support from the original linenoise to linenoise.cpp as the default behavior. It can handle Latin characters with diacritics, CJK wide characters, and singleton emoji characters. It currently supports the Unicode 16.0 standard. If you run scripts/generate_unicode_data_tables.py, it will update the Unicode data tables in linenoise.cpp with the latest Unicode Character Database.

This simple implementation cannot handle more complex characters such as emoji ZWJ sequences, Indic syllables, and Korean Hangul characters. If we need to support such characters, the default implementation needs to be replaced with a better one via linenoiseSetEncodingFunctions.

I believe that this default Unicode behavior can satisfy most users. But when I have time, I'll try making a better opt-in implementation with cpp-unicodelib, which supports grapheme cluster segmentation based on UAX #29. It is supposed to handle the complex characters that I mentioned above.
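
For readers wondering what a replacement passed to linenoiseSetEncodingFunctions could look like, here is a minimal sketch of byte-oriented (ASCII-only) callbacks. The parameter lists follow the `prevCharLen`, `nextCharLen`, and `readCode` signatures shown in the reviewer's guide. The function names, the return value on read failure, and the argument order in the commented-out registration call are assumptions, so check linenoise.h before relying on them.

```cpp
#include <cstddef>
#include <unistd.h>

// Every character is treated as one byte that occupies one column.
std::size_t asciiPrevCharLen(const char* /*buf*/, std::size_t /*buf_len*/,
                             std::size_t /*pos*/, std::size_t* col_len) {
    if (col_len) *col_len = 1;
    return 1;
}

std::size_t asciiNextCharLen(const char* /*buf*/, std::size_t /*buf_len*/,
                             std::size_t /*pos*/, std::size_t* col_len) {
    if (col_len) *col_len = 1;
    return 1;
}

// Reads a single byte from fd into buf and reports it through *c.
// Assumption: 0 is returned on EOF, error, or an empty buffer.
std::size_t asciiReadCode(int fd, char* buf, std::size_t buf_len, int* c) {
    if (buf_len < 1) return 0;
    const ssize_t nread = read(fd, buf, 1);
    if (nread != 1) return 0;
    *c = static_cast<unsigned char>(buf[0]);
    return 1;
}

// Registration (assumed argument order, not verified against linenoise.h):
// linenoiseSetEncodingFunctions(asciiPrevCharLen, asciiNextCharLen, asciiReadCode);
```

A grapheme-aware replacement would keep the same three entry points and change only how lengths and column widths are computed.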

@ericcurtin
Owner

@yhirose thanks, this works great and all sounds good to me. This is breaking the build on shellcheck, but I'll catch that in another PR.

@ericcurtin ericcurtin merged commit 39d72ad into ericcurtin:main Feb 28, 2025
0 of 2 checks passed
@ericcurtin
Owner

@benoitf tagging for awareness

@ericcurtin
Owner

#4

@benoitf

benoitf commented Feb 28, 2025

awesome thanks 😎

@yhirose yhirose deleted the utf8-support branch February 28, 2025 12:35
3 participants