Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Shorten CWD path #854

Draft
wants to merge 1 commit into
base: main
Choose a base branch
from
Draft

Shorten CWD path #854

wants to merge 1 commit into from

Conversation

onlined
Copy link

@onlined onlined commented Oct 21, 2021

For showing the CWD path in a more compact way, this function

  • shortens path parts before last from right to left, until first
    character of the path stays (directory -> direc~ or d~)
  • combines parts before last from right with left (a~/b~ -> a~~)
  • shortens the last part from middle, preferring to keep characters
    from left (longdirectoryname -> long~ame)

in order (of 2nd and 3rd, the one that removes more characters is applied first) until given maximum length is reached.

Closes #462.

Copy link
Member

@BenBE BenBE left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for this PR. There's still some problems with it, that should be addressed. One major one is the missing support for multibyte charsets like UTF-8 and as we've seen even usernames with kanji it's essential that paths containing such characters are handled properly, as paths with special characters are even more common than usernames with such arcane characters.

When updating your code feel free to squash and rebase your branch freely to keep a clean commit history.

Process.c Outdated Show resolved Hide resolved
Process.c Outdated Show resolved Hide resolved
Process.c Outdated
strcpy(ctx->cwd + writeIndex, ctx->parts[i]);
writeIndex += ctx->partLengths[i];
if (i < ctx->partsLen - 1) {
strcpy(ctx->cwd + writeIndex, "/");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use string functions from XUtils.h for copying things around.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copied character directly instead.

Process.c Outdated Show resolved Hide resolved
Process.c Outdated Show resolved Hide resolved
Process.c Outdated Show resolved Hide resolved
Process.c Outdated
size_t i;
for (i = ctx->partsLen - 2; i > 1; i--) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't quite like re-using a loop variable outside the initial loop scope it was worked with …

Copy link
Author

@onlined onlined Oct 21, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The other way would be adding an extra variable and update it with i's value every time, or maybe converting loop condition into an if at the and and assign in the block just before break, but these are both verbose and unnecessary IMO. Since the function body is fairly small and i is relevant until the end, I don't think that it causes any risk.

Process.c Outdated Show resolved Hide resolved
Process.c Outdated Show resolved Hide resolved
Process.c Outdated Show resolved Hide resolved
@onlined onlined force-pushed the shorten-cwd-path branch 7 times, most recently from 253b91a to b08dbcf Compare October 21, 2021 22:35
Process.c Outdated
return diff;
}

static size_t shortenCwdLastPart(shortenCwdContext *ctx, bool doActualWork) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Personally I don't like a boolean data type in parameters, what does doActualWork do and what's the difference between when I set it to true and set to false?

Copy link
Author

@onlined onlined Oct 22, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If doActualWork is true, the function does the actual work but if it's false, only the number of characters to be removed is calculated. Adding comments could help, like you mentioned before.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if you can split the length calculation and the "actual work" part into two routines, or redesign the parameter so it makes more sense (e.g. copy and work with the path in a new buffer instead of doing that in-place). The boolean parameter introduces control coupling that may be unnecessary.

Process.c Outdated
Comment on lines 848 to 855
shortenCwdParts(&ctx);
if (shortenCwdLastPart(&ctx, false) > collapseCwdParts(&ctx, false)) {
shortenCwdLastPart(&ctx, true);
collapseCwdParts(&ctx, true);
} else {
collapseCwdParts(&ctx, true);
shortenCwdLastPart(&ctx, true);
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you should document the path shortening behavior somewhere in the code comments. It isn't obvious on how the path would be shortened as your algorithm looks not trivial enough.
It would be better to also put in examples, to help people test the behaviors.

@BenBE BenBE added this to the 3.2.0 milestone Oct 22, 2021
@onlined
Copy link
Author

onlined commented Oct 22, 2021

@BenBE I see that you indicated that the code must be multibyte aware. I totally agree with you and I started working on it. I have one question: Is it OK to use standard library multibyte functions (mblen, mbsrtowcs etc.) or do I need another solution for it?

Process.c Outdated Show resolved Hide resolved
@BenBE
Copy link
Member

BenBE commented Oct 22, 2021

@BenBE I see that you indicated that the code must be multibyte aware. I totally agree with you and I started working on it. I have one question: Is it OK to use standard library multibyte functions (mblen, mbsrtowcs etc.) or do I need another solution for it?

Using standard library is totally fine and much preferred over including additional libraries we do not yet link with.

@onlined onlined force-pushed the shorten-cwd-path branch 2 times, most recently from 25d449d to e2b9b21 Compare October 25, 2021 23:17
@Explorer09
Copy link
Contributor

I finally have a chance to test your code, and it might be late to inform you that there's one thing you need to address.
htop CWD with CJK characters screenshot

People who didn't live in East Asia wouldn't know, that not all characters occupy the same width in the terminal. Specifically,

  • CJK characters may occupy twice the width of the Latin letters, and
  • Combining characters occupy no width, but modifies the character before it.

The above screenshot shows CJK characters (Chinese numerals '一二三四五六'), I've also tested with some Zalgo text and it displays weirdly, too.

@onlined
Copy link
Author

onlined commented Oct 26, 2021

@Explorer09 @BenBE I tried to find a solution for finding out how many columns a character occupies. I saw wcwidth and wcswidth but their behavior changes according to locale, so it's not much sane to use them. Do you know any portable (POSIX is enough probably) and reliable way for finding how much space a wide character occupies?

@Explorer09
Copy link
Contributor

@online I already expected you would get stuck with the problem when I mentioned it. It troubled me too (when I worked in a project of my company), and I have a conservative approach:

  1. Set terminal cursor position to where the character would be printed (make sure it's safe to print junk characters there).
  2. Print the character (one at a time).
  3. Query cursor position to determine the width.
  4. Repeat step 2.

Yes, the safest way is to just print it out. And you are right that the width is locale-dependent. If you have read the UAX #11 East Asian Width, there is an "East Asian Ambiguous" category of characters, whose rendering widths are determined by the terminal's default character encoding (full-width when a CJK encoding is selected, and half-width when a non-CJK encoding is selected). I don't think the locale data is more reliable than the terminal setting when it comes to rendering, and that's why I suggest the above steps instead.

@onlined
Copy link
Author

onlined commented Oct 30, 2021

@Explorer09 But I need to preprocess the string and decide which characters to keep or remove, before actually printing the characters. Do you suggest printing to a non-visible or dummy terminal?

@Explorer09
Copy link
Contributor

@onlined My idea is to use the space of the CWD table cell -- you print the characters just as you process it. It's okay to overprint the space multiple time until you finally shorten the path.

@BenBE
Copy link
Member

BenBE commented Oct 30, 2021

@onlined My idea is to use the space of the CWD table cell -- you print the characters just as you process it. It's okay to overprint the space multiple time until you finally shorten the path.

Problem is, this cell is not actually available for reference in that routine. The function should be independent of which screen position it is later on printed too. In particular, the printed string is only rendered into a RichString later on and that is finally written onto the display. So shortening to 25 code points with wcwidth/wcswidth is fully acceptable IMHO (even if this may differ somewhat from what the terminal will finally end up actually writing). Cf. mbswidth(3)

@Explorer09
Copy link
Contributor

@BenBE That's a structural problem. It would be better to fix the routine to accept a terminal position as the parameter. You don't need to prepare the string until it's printed at least once. So it's not unreasonable to request a terminal position for the routine.

@BenBE
Copy link
Member

BenBE commented Oct 30, 2021

@BenBE That's a structural problem. It would be better to fix the routine to accept a terminal position as the parameter. You don't need to prepare the string until it's printed at least once. So it's not unreasonable to request a terminal position for the routine.

Yes and no.

Yes, we should delay the preparation of this string to when we first actually write it to the display. But that's not how this works internally in htop right now and would take a bit of structural work to change it that way. That's not part of this PR and if we go that route, then we'd need to do quite a bit of related work elsewhere too to keep things consistent.

And no, you shouldn't print temporary things to a terminal. True story: I had a text-based app once that had a progress bar for showing the load progress when it was initializing different things. So far, so normal. One of the users had a braille display connected to the computer and sent in a complaint about the braille display going brrrrrrrrrr whenever the program started. Long story short: DON'T write temporary or fast-changing information to a terminal.

@Explorer09
Copy link
Contributor

@BenBE But htop is ncursed-based! If htop prints everything sequentially (to maintain compatibility with line terminals), I won't suggest this approach.
I know what you mean when you say progress bar as an example: wget is the example of having two ways to output the progress. One prints everything sequentially (incuding the progress indicator), the other uses overprinting to indicate the progress (and it prints the filename being downloaded in a scrolling text).

@BenBE
Copy link
Member

BenBE commented Oct 31, 2021

@BenBE But htop is ncursed-based! If htop prints everything sequentially (to maintain compatibility with line terminals), I won't suggest this approach. I know what you mean when you say progress bar as an example: wget is the example of having two ways to output the progress. One prints everything sequentially (incuding the progress indicator), the other uses overprinting to indicate the progress (and it prints the filename being downloaded in a scrolling text).

Both methods cause the mal-behavior I described. The problem is not the sequential printing, but the printing itself.

Copy link
Member

@BenBE BenBE left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you very much for this update. I haven't yet taken a closer look at CwdUtils.c; but I noticed some points with the surrounding changes.

Process.c Outdated Show resolved Hide resolved
XUtils.c Outdated

char* xWcstombs(const wchar_t *wcs) {
size_t len = wcslen(wcs);
size_t mbsSize = len * sizeof(wchar_t) + 1;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Missing overflow check.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

size_t mbsSize = len * sizeof(wchar_t) + 1;
char *mbs = xMalloc(mbsSize);
mbstate_t mbstate = {0};
wcsrtombs(mbs, &wcs, mbsSize, &mbstate);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this will fail for multi-byte string, where a sizable number of characters is more than 4 bytes. This may happen with strings containing many emoji or other characters that use more than 4 bytes for UTF-8 encoding.

Copy link
Contributor

@Explorer09 Explorer09 Jan 19, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@BenBE For your info, emoji sequences are considered multiple characters in Unicode. It is only that the display of the characters would look combined together.

The real problem is we can't ensure the sizeof(wchar_t) is at least 4. For (hypothetical) systems that use 2 bytes for wchar_t (thus only be able to handle the Unicode BMP), this would surely overflow the mbs buffer.

The right element size to use is MB_CUR_MAX, not sizeof(wchar_t).

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@BenBE For your info, emoji sequences are considered multiple characters in Unicode. It is only that the display of the characters would look combined together.

I was aware of this (after all 🐈‍⬛ is f0 9f 90 88 e2 80 8d e2 ac 9b, which is 3 Unicode codepoints), but through a mix-up on my side I kinda had in mind that some of the non-BMP characters needed 5/6 bytes. Further fueled by MySQL, who had a utf8mb4 charset, which they added to support things beyond the BMP.

But …

The real problem is we can't ensure the sizeof(wchar_t) is at least 4. For (hypothetical) systems that use 2 bytes for wchar_t (thus only be able to handle the Unicode BMP), this would surely overflow the mbs buffer.

The right element size to use is MB_CUR_MAX, not sizeof(wchar_t).

Exactly. Or encode twice and use the first encoding to query how much space is actually needed.

XUtils.c Outdated
Comment on lines 439 to 440
for (size_t i = 0; i < len && str[i]; i++)
data[i] = str[i];
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use memcpy and write explicit terminator?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I preferred wcsncpy, but I can change it to memcpy if you like.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@onlined
wcsncpy has an additional behavior of padding L'\0' to the unused character space. If you don't require this behavior, it's better to avoid it as it wastes processor cycles.

Copy link
Author

@onlined onlined Jan 19, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

memcpy would copy len bytes, so it wastes cycles too, so should I go back to explicitly copying in a loop?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@onlined The compiler can inline memcpy when it's useful. When it comes to performance, memcpy is arguably the best.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed to memcpy.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just realized a problem, for using memcpy we have to assume that str has more characters than len or we can risk accessing unallocated memory and crash. Because of that concern, I changed it to use Wstring_safeWcsncpy.

XUtils.c Outdated Show resolved Hide resolved
@onlined onlined force-pushed the shorten-cwd-path branch 5 times, most recently from 4d439ae to f2209e4 Compare January 19, 2025 20:36
Comment on lines +136 to +155
shortenCwdParts(&ctx);
if (shortenCwdLastPart(&ctx, false) > collapseCwdParts(&ctx, false)) {
shortenCwdLastPart(&ctx, true);
collapseCwdParts(&ctx, true);
} else {
collapseCwdParts(&ctx, true);
shortenCwdLastPart(&ctx, true);
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The rules/logic in this area is arcane and hard to read. What's the reasoning behind these function calls and conditionals?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A quick comment about the rationale of the check would be fine.

CwdUtils.c Outdated
Comment on lines 125 to 134
ShortenCwdContext ctx = {
.maxLength = maxLength,
.len = len,
};
ctx.parts = Wstring_split(wcwd, L'/', &ctx.partsLen);
free(wcwd);
wcwd = NULL;
ctx.partLengths = xCalloc(ctx.partsLen, sizeof(size_t));
for (size_t i = 0; i < ctx.partsLen; i++)
ctx.partLengths[i] = wcslen(ctx.parts[i]);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This part of code looks like a constructor to ShortenCwdContext, but with complex malloc operations. Can we have a ShortenCwdContext_new(wchar_t* wstr) function?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added a ShortenCwdContext_init for not to allocate unnecessarily.

return 0;

size_t lastPartLen = ctx->partLengths[ctx->partsLen - 1];
if (lastPartLen <= 3)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unexplained magic number 3.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minimal length of directory/part strings (after shortening, if not shorter before).

CwdUtils.c Outdated Show resolved Hide resolved
@onlined onlined force-pushed the shorten-cwd-path branch 6 times, most recently from 070af35 to 21b5eca Compare January 20, 2025 15:34
#define HEADER_CwdUtils

#include <stddef.h>

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Second blank line after the header includes.

CwdUtils.c Show resolved Hide resolved
Comment on lines +13 to +17
size_t len;
wchar_t **parts;
size_t partsLen;
size_t *partLengths;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The fields partsLen and partLengths have quite similar names which may cause confusion.

To mitigate this, using a naming scheme like strSize/partCount/partPtr/partSize might be somewhat more easy to disambiguate.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should be partPtrs and partSizes because the latter two are arrays.

CwdUtils.c Outdated Show resolved Hide resolved
CwdUtils.c Outdated Show resolved Hide resolved
}

char* CwdUtils_shortenCwd(const char *cwd, const size_t maxLength) {
wchar_t *wcwd = xMbstowcs(cwd);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I quick pre-test could be:

   if (strnlen(cwd, maxLength) < maxLength) {
      return xStrdup(cwd);
   }

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't the comparison be <=?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, because == would mean that maxLength bytes were read and no NUL byte was encountered. You couldn't distinguish the case of exactly maxLength bytes from the one with more bytes. This doesn't hurt though, as the == case is caught below in the propr check (with conversion) …

You could do strnlen(cwd, maxLength + 1) < maxLength if you wanted to catch that, but strictly speaking you'd then also have to include an overflow check for the maxLength + 1


static void shortenCwdParts(ShortenCwdContext *ctx) {
for (size_t i = ctx->partsLen - 2; i != (size_t)-1 && ctx->len > ctx->maxLength; i--) {
if (ctx->partLengths[i] < 3)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
if (ctx->partLengths[i] < 3)
if (ctx->partLengths[i] <= 3)

Also, documenting where the 3 comes from (initial, separator, trailer) would be nice.

return 0;

size_t lastPartLen = ctx->partLengths[ctx->partsLen - 1];
if (lastPartLen <= 3)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minimal length of directory/part strings (after shortening, if not shorter before).

For showing the CWD path in a more compact way, this function

    - shortens path parts before last from right to left, until first
      character of the path stays (directory -> direc~ or d~)
    - combines parts before last from right with left (a~/b~ -> a~~)
    - shortens the last part from middle, preferring to keep characters
      from left (longdirectoryname -> long~ame)

in order (of 2nd and 3rd, the one that removes more characters is
applied first) until given maximum length is reached.
@Explorer09
Copy link
Contributor

Before I comment on the code quality further, I noticed the entire PR had a wrong assumption that every Unicode character occupy the same width on the terminal, which is not the case.

Forget my previous comments (that was left 3 years ago) about printing the strings on the terminal twice. We need to determine the display width of the characters through wcwidth(3) or similar. Otherwise we might shorten the paths in all the wrong way.

Another thing that confused me is, why are we converting the wide character strings back to multi-byte strings (xWcstombs) when they are for display? I believe a better conversion would be to a RichString instance directly.

And by using RichString I have an idea to do one more thing: Color the abbreviated tildes (~) with a different color from surrounding text. This can give a stronger hint that the tildes are abbreviations and not part of directory names.

@onlined
Copy link
Author

onlined commented Jan 21, 2025

Should I still store this short CWD (this time as RichString) or should I create it on the fly?

@BenBE
Copy link
Member

BenBE commented Jan 21, 2025

Before I comment on the code quality further, I noticed the entire PR had a wrong assumption that every Unicode character occupy the same width on the terminal, which is not the case.

I remarked about this about 3 years ago re wcwidth/wcswidth

So, this is one point that is on the todo list before inclusion.

Forget my previous comments (that was left 3 years ago) about printing the strings on the terminal twice. We need to determine the display width of the characters through wcwidth(3) or similar. Otherwise we might shorten the paths in all the wrong way.

Furthermore, we should take into account when sequences of Unicode codepoints need to go together like 🐈‍⬛ or 🏳️‍⚧️. Also, for several characters, in particular in the CJK range, the width of single characters may be larger than 1.

Another thing that confused me is, why are we converting the wide character strings back to multi-byte strings (xWcstombs) when they are for display? I believe a better conversion would be to a RichString instance directly.

And by using RichString I have an idea to do one more thing: Color the abbreviated tildes (~) with a different color from surrounding text. This can give a stronger hint that the tildes are abbreviations and not part of directory names.

That's actually a good idea. Given the metadata the functions collect internally, both outputting into a raw string and a formatted RichString should be feasible without too much extra work.

Should I still store this short CWD (this time as RichString) or should I create it on the fly?

When creating a RichString you will need about 400 bytes of memory (for each instance). With about 3000 LWP on my kinda idle system, we are talking 1.2 MiB of extra memory (without optimization) for the shortened strings alone.

Contrary, if we assumed kinda optimized storage of the metadata to produce the RichStrings of like 10 parts max and 8 bytes per part, we'd be down to ~250 KiB. If we were to use bit masks instead, we could get this to 1 bit per wchar + string data (~30 bytes -> 90 KiB).

So the important step is less the output as RichString, but to be able to produce it from as little temporary data as possible.

NB: Storing an uint8_t per 8 wchars gives the last estimate, IFF you don't need to store the (maximum) length of the string. If done that way, you can just loop over each character in the output string and color it in (shade) when a bit in the bitmask is set. Extra handling for the OS directory prefixes may cause this scheme to take up more than 1 bit though.

Overall, there has been some discussion on IRC about this topic. During this discussion we found that we would like to give this feature a bit more time so the following things can be handled and implemented properly:

  1. Proper character (display) width handling
  2. RichString support for printing
  3. Memory footprint
  4. Handling together with highlighting default OS directories

@BenBE BenBE modified the milestones: 3.4.0, 3.5.0 Jan 21, 2025
@Explorer09
Copy link
Contributor

Furthermore, we should take into account when sequences of Unicode codepoints need to go together like 🐈‍⬛ or 🏳️‍⚧️. Also, for several characters, in particular in the CJK range, the width of single characters may be larger than 1.

Just for everyone's info, I've seen a wcswidth implementation that does address multiple character emojis (a la ZWJ sequences). But's it's Python only.

I believe it's still harmless if we miscalculated the width of emojis right now, because they happen more rarely in practice, and, depending on terminal display, they might not display as a combined icon but a separated sequence of emoji icons (which is not necessary a bug - Unicode permits such a fallback sequence display, especially for emojis that are added in later Unicode versions).

So the important step is less the output as RichString, but to be able to produce it from as little temporary data as possible.

My personal concern with the current state of the PR is that it might use too many malloc operations to hold the split tokens. malloc have chances to fail, and you have to free each of them when you have done. I wonder if it's feasible to hold a wchar_t [] or a RichString buffer containing the whole path string, and shorten the strings in-place. Just an idea.

@fasterit
Copy link
Member

Furthermore, we should take into account when sequences of Unicode codepoints need to go together like 🐈‍⬛ or 🏳️‍⚧️. Also, for several characters, in particular in the CJK range, the width of single characters may be larger than 1.

Just for everyone's info, I've seen a wcswidth implementation that does address multiple character emojis (a la ZWJ sequences). But's it's Python only.

wcswidth(3) is good enough™ for the htop use case. There will always be ways to cause glitches in a curses display on virtual terminals. We don't have to go to extremes to cover the (only) deliberately-constructed corner cases, too.
Just completely ignoring Unicode these days doesn't do either (as per the 2021 implementation).
/DLange

@BenBE
Copy link
Member

BenBE commented Jan 22, 2025

So the important step is less the output as RichString, but to be able to produce it from as little temporary data as possible.

My personal concern with the current state of the PR is that it might use too many malloc operations to hold the split tokens. malloc have chances to fail, and you have to free each of them when you have done. I wonder if it's feasible to hold a wchar_t [] or a RichString buffer containing the whole path string, and shorten the strings in-place. Just an idea.

Yes, working with a single buffer for splitting is possible, but will require a second allocation to hold pointers for where each part begins (or some in-band information where the string terminates). Given you know the maximum final width, you can determine the number of splits for the result. For the input this is in the worst case wcslen(input) / 2.

NB: The actual conversion performance isn't too pressing, as the CWD doesn't change often AND we can cache the result when it stayed the same.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Shorten path for CWD column when displaying
4 participants