-
Notifications
You must be signed in to change notification settings - Fork 4
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
UTF8 Support #1
UTF8 Support #1
Conversation
Reviewer's Guide by SourceryThis pull request adds UTF-8 encoding support to the linenoise library. It includes functions for handling UTF-8 characters, identifying wide and combining characters, and correctly refreshing the display in the presence of UTF-8 characters. It also includes a python script to generate the unicode data tables. Sequence diagram for inserting a charactersequenceDiagram
participant linenoiseEditFeed
participant linenoiseEditInsert
participant write
participant refreshLine
linenoiseEditFeed->>linenoiseEditInsert: linenoiseEditInsert(l, cbuf, nread)
alt l->len + clen <= l->buflen
alt l->len == l->pos
linenoiseEditInsert->>write: write(l->ofd, cbuf, clen)
else l->len != l->pos
linenoiseEditInsert->>refreshLine: refreshLine(l)
end
end
Sequence diagram for backspacesequenceDiagram
participant linenoiseEditBackspace
participant prevCharLen
participant memmove
participant refreshLine
linenoiseEditBackspace->>prevCharLen: prevCharLen(l->buf, l->len, l->pos, NULL)
linenoiseEditBackspace->>memmove: memmove(l->buf+l->pos-chlen, l->buf+l->pos, l->len-l->pos)
linenoiseEditBackspace->>refreshLine: refreshLine(l)
Sequence diagram for moving cursor leftsequenceDiagram
participant linenoiseEditMoveLeft
participant prevCharLen
participant refreshLine
linenoiseEditMoveLeft->>prevCharLen: prevCharLen(l->buf, l->len, l->pos, NULL)
linenoiseEditMoveLeft->>refreshLine: refreshLine(l)
Sequence diagram for moving cursor rightsequenceDiagram
participant linenoiseEditMoveRight
participant nextCharLen
participant refreshLine
linenoiseEditMoveRight->>nextCharLen: nextCharLen(l->buf, l->len, l->pos, NULL)
linenoiseEditMoveRight->>refreshLine: refreshLine(l)
Sequence diagram for deleting a charactersequenceDiagram
participant linenoiseEditDelete
participant nextCharLen
participant memmove
participant refreshLine
linenoiseEditDelete->>nextCharLen: nextCharLen(l->buf, l->len, l->pos, NULL)
linenoiseEditDelete->>memmove: memmove(l->buf+l->pos, l->buf+l->pos+chlen, l->len-l->pos-chlen)
linenoiseEditDelete->>refreshLine: refreshLine(l)
Updated class diagram for encoding functionsclassDiagram
class linenoisePrevCharLen {
+size_t prevCharLen(const char *buf, size_t buf_len, size_t pos, size_t *col_len)
}
class linenoiseNextCharLen {
+size_t nextCharLen(const char *buf, size_t buf_len, size_t pos, size_t *col_len)
}
class linenoiseReadCode {
+size_t readCode(int fd, char *buf, size_t buf_len, int* c)
}
class linenoiseState {
+size_t pos
+size_t len
+size_t plen
+size_t oldcolpos
}
linenoiseState : +size_t oldcolpos
linenoiseState : +size_t pos
linenoiseState : +size_t len
linenoiseState : +size_t plen
File-Level Changes
Tips and commandsInteracting with Sourcery
Customizing Your ExperienceAccess your dashboard to:
Getting Help
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hey @yhirose - I've reviewed your changes - here's some feedback:
Overall Comments:
- Consider adding a configuration option to enable or disable UTF-8 support, as it might introduce overhead for users who only need ASCII characters.
- The wide character and combining character tables are quite large; consider compressing them or using a more efficient data structure like a bloom filter.
Here's what I looked at during the review
- 🟢 General issues: all looks good
- 🟢 Security: all looks good
- 🟢 Testing: all looks good
- 🟡 Complexity: 1 issue found
- 🟢 Documentation: all looks good
Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.
@@ -227,6 +227,445 @@ static void lndebug(const char *, ...) { | |||
} | |||
#endif | |||
|
|||
/* ========================== Encoding functions ============================= */ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
issue (complexity): Consider extracting UTF-8 handling, wide/combining character checks, and column calculation routines into a dedicated module or namespace to improve code organization and reduce complexity.
Consider extracting the UTF‑8 handling, wide/combining character checks, and column‐calculation routines into a dedicated module or namespace. For example, move functions such as `prevUtf8CodePointLen()`, `utf8BytesToCodePoint()`, `isWideChar()`, `isCombiningChar()`, and the three similar column routines (`columnPos()`, `promptTextColumnLen()`, `columnPosForMultiLine()`) into a separate file (e.g. `Utf8Utils.cpp` with header `Utf8Utils.h`).
For example, you might do the following:
// Utf8Utils.h
```cpp
#pragma once
#include <cstddef>
namespace Utf8Utils {
size_t prevUtf8CodePointLen(const char* buf, int pos);
size_t utf8BytesToCodePoint(const char* buf, size_t len, int* cp);
bool isWideChar(unsigned long cp);
bool isCombiningChar(unsigned long cp);
// Unified column calculation helper:
size_t computeColumnPos(const char *buf, size_t buf_len, size_t pos);
size_t promptTextColumnLen(const char *prompt, size_t plen);
size_t columnPosForMultiLine(const char *buf, size_t buf_len, size_t pos, size_t cols, size_t ini_pos);
}
// Utf8Utils.cpp
#include "Utf8Utils.h"
#include <cstring>
#include <cstdio>
// Implement previous functions by moving the current logic here.
// Example:
size_t Utf8Utils::prevUtf8CodePointLen(const char* buf, int pos) {
int end = pos--;
while (pos >= 0 && ((unsigned char)buf[pos] & 0xC0) == 0x80)
pos--;
return end - pos;
}
// Similarly implement utf8BytesToCodePoint(), isWideChar(), isCombiningChar()
// and merge similar column functions into computeColumnPos() where possible.
Then update the main file to use these utilities:
#include "Utf8Utils.h"
// Example in refreshSingleLine():
size_t pcollen = Utf8Utils::promptTextColumnLen(l->prompt, strlen(l->prompt));
size_t currCol = Utf8Utils::computeColumnPos(l->buf, l->len, l->pos);
// … use pcollen and currCol in the refresh logic …
This refactoring isolates the encoding and column handling logic from the terminal refresh and editing code, reduces duplication, and makes the code easier to manage and test.
@ericcurtin I ported the UTF8 support in the original linenoise to linenoise.cpp as the default behavior. It can handle latin characters with diacritic(s), CJK wide characters, and singleton Emoji characters. It currently supports Unicode 16.0 standard. If you run This simple implementation cannot handle more complex characters such as 'Emoji ZWJ Sequences', Indic syllables, Korean Hangul characters. If we need to support such characters, the default implementation needs to be replaced with a better implementation with I believe that this default Unicode behavior can satisfy most users. But when I have time, I'll try making a better opt-in implementation with cpp-unitcodelib which supports the Grapheme text segmentation based on the UAX #29. It is supposed to handle the complex characters that I mentioned above. |
@yhirose thanks, this works great, all sounds good to me, this is breaking the build on shellcheck but I'll catch that in another PR. |
@benoitf tagging for awareness |
awesome thanks 😎 |
Ported the UTF8 support for the original linenoise.
Summary by Sourcery
Implements UTF-8 encoding support to enable the use of a broader set of characters in the linenoise library. This includes adding functions for handling UTF-8 encoding, determining character widths, and correctly positioning the cursor when working with multi-byte characters.
New Features: