-
-
Notifications
You must be signed in to change notification settings - Fork 452
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Shorten CWD path #854
base: main
Are you sure you want to change the base?
Shorten CWD path #854
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for this PR. There's still some problems with it, that should be addressed. One major one is the missing support for multibyte charsets like UTF-8 and as we've seen even usernames with kanji it's essential that paths containing such characters are handled properly, as paths with special characters are even more common than usernames with such arcane characters.
When updating your code feel free to squash and rebase your branch freely to keep a clean commit history.
Process.c
Outdated
strcpy(ctx->cwd + writeIndex, ctx->parts[i]); | ||
writeIndex += ctx->partLengths[i]; | ||
if (i < ctx->partsLen - 1) { | ||
strcpy(ctx->cwd + writeIndex, "/"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Use string functions from XUtils.h
for copying things around.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Copied character directly instead.
Process.c
Outdated
size_t i; | ||
for (i = ctx->partsLen - 2; i > 1; i--) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't quite like re-using a loop variable outside the initial loop scope it was worked with …
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The other way would be adding an extra variable and update it with i
's value every time, or maybe converting loop condition into an if
at the and and assign in the block just before break
, but these are both verbose and unnecessary IMO. Since the function body is fairly small and i
is relevant until the end, I don't think that it causes any risk.
253b91a
to
b08dbcf
Compare
Process.c
Outdated
return diff; | ||
} | ||
|
||
static size_t shortenCwdLastPart(shortenCwdContext *ctx, bool doActualWork) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Personally I don't like a boolean data type in parameters, what does doActualWork
do and what's the difference between when I set it to true and set to false?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If doActualWork
is true
, the function does the actual work but if it's false
, only the number of characters to be removed is calculated. Adding comments could help, like you mentioned before.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if you can split the length calculation and the "actual work" part into two routines, or redesign the parameter so it makes more sense (e.g. copy and work with the path in a new buffer instead of doing that in-place). The boolean parameter introduces control coupling that may be unnecessary.
Process.c
Outdated
shortenCwdParts(&ctx); | ||
if (shortenCwdLastPart(&ctx, false) > collapseCwdParts(&ctx, false)) { | ||
shortenCwdLastPart(&ctx, true); | ||
collapseCwdParts(&ctx, true); | ||
} else { | ||
collapseCwdParts(&ctx, true); | ||
shortenCwdLastPart(&ctx, true); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you should document the path shortening behavior somewhere in the code comments. It isn't obvious on how the path would be shortened as your algorithm looks not trivial enough.
It would be better to also put in examples, to help people test the behaviors.
b08dbcf
to
f4f0a2a
Compare
@BenBE I see that you indicated that the code must be multibyte aware. I totally agree with you and I started working on it. I have one question: Is it OK to use standard library multibyte functions ( |
f4f0a2a
to
f7ca2ab
Compare
Using standard library is totally fine and much preferred over including additional libraries we do not yet link with. |
25d449d
to
e2b9b21
Compare
I finally have a chance to test your code, and it might be late to inform you that there's one thing you need to address. People who didn't live in East Asia wouldn't know, that not all characters occupy the same width in the terminal. Specifically,
The above screenshot shows CJK characters (Chinese numerals |
@Explorer09 @BenBE I tried to find a solution for finding out how many columns a character occupies. I saw |
@online I already expected you would get stuck with the problem when I mentioned it. It troubled me too (when I worked in a project of my company), and I have a conservative approach:
Yes, the safest way is to just print it out. And you are right that the width is locale-dependent. If you have read the UAX #11 East Asian Width, there is an "East Asian Ambiguous" category of characters, whose rendering widths are determined by the terminal's default character encoding (full-width when a CJK encoding is selected, and half-width when a non-CJK encoding is selected). I don't think the locale data is more reliable than the terminal setting when it comes to rendering, and that's why I suggest the above steps instead. |
@Explorer09 But I need to preprocess the string and decide which characters to keep or remove, before actually printing the characters. Do you suggest printing to a non-visible or dummy terminal? |
@onlined My idea is to use the space of the CWD table cell -- you print the characters just as you process it. It's okay to overprint the space multiple time until you finally shorten the path. |
Problem is, this cell is not actually available for reference in that routine. The function should be independent of which screen position it is later on printed too. In particular, the printed string is only rendered into a |
@BenBE That's a structural problem. It would be better to fix the routine to accept a terminal position as the parameter. You don't need to prepare the string until it's printed at least once. So it's not unreasonable to request a terminal position for the routine. |
Yes and no. Yes, we should delay the preparation of this string to when we first actually write it to the display. But that's not how this works internally in And no, you shouldn't print temporary things to a terminal. True story: I had a text-based app once that had a progress bar for showing the load progress when it was initializing different things. So far, so normal. One of the users had a braille display connected to the computer and sent in a complaint about the braille display going brrrrrrrrrr whenever the program started. Long story short: DON'T write temporary or fast-changing information to a terminal. |
@BenBE But htop is ncursed-based! If htop prints everything sequentially (to maintain compatibility with line terminals), I won't suggest this approach. |
Both methods cause the mal-behavior I described. The problem is not the sequential printing, but the printing itself. |
e2b9b21
to
91630c7
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you very much for this update. I haven't yet taken a closer look at CwdUtils.c
; but I noticed some points with the surrounding changes.
XUtils.c
Outdated
|
||
char* xWcstombs(const wchar_t *wcs) { | ||
size_t len = wcslen(wcs); | ||
size_t mbsSize = len * sizeof(wchar_t) + 1; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Missing overflow check.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
size_t mbsSize = len * sizeof(wchar_t) + 1; | ||
char *mbs = xMalloc(mbsSize); | ||
mbstate_t mbstate = {0}; | ||
wcsrtombs(mbs, &wcs, mbsSize, &mbstate); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this will fail for multi-byte string, where a sizable number of characters is more than 4 bytes. This may happen with strings containing many emoji or other characters that use more than 4 bytes for UTF-8 encoding.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@BenBE For your info, emoji sequences are considered multiple characters in Unicode. It is only that the display of the characters would look combined together.
The real problem is we can't ensure the sizeof(wchar_t)
is at least 4. For (hypothetical) systems that use 2 bytes for wchar_t
(thus only be able to handle the Unicode BMP), this would surely overflow the mbs
buffer.
The right element size to use is MB_CUR_MAX
, not sizeof(wchar_t)
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@BenBE For your info, emoji sequences are considered multiple characters in Unicode. It is only that the display of the characters would look combined together.
I was aware of this (after all 🐈⬛ is f0 9f 90 88 e2 80 8d e2 ac 9b, which is 3 Unicode codepoints), but through a mix-up on my side I kinda had in mind that some of the non-BMP characters needed 5/6 bytes. Further fueled by MySQL, who had a utf8mb4 charset, which they added to support things beyond the BMP.
But …
The real problem is we can't ensure the
sizeof(wchar_t)
is at least 4. For (hypothetical) systems that use 2 bytes forwchar_t
(thus only be able to handle the Unicode BMP), this would surely overflow thembs
buffer.The right element size to use is
MB_CUR_MAX
, notsizeof(wchar_t)
.
Exactly. Or encode twice and use the first encoding to query how much space is actually needed.
XUtils.c
Outdated
for (size_t i = 0; i < len && str[i]; i++) | ||
data[i] = str[i]; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Use memcpy
and write explicit terminator?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I preferred wcsncpy
, but I can change it to memcpy
if you like.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@onlined
wcsncpy
has an additional behavior of padding L'\0'
to the unused character space. If you don't require this behavior, it's better to avoid it as it wastes processor cycles.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
memcpy
would copy len
bytes, so it wastes cycles too, so should I go back to explicitly copying in a loop?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@onlined The compiler can inline memcpy
when it's useful. When it comes to performance, memcpy
is arguably the best.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changed to memcpy
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I just realized a problem, for using memcpy
we have to assume that str
has more characters than len
or we can risk accessing unallocated memory and crash. Because of that concern, I changed it to use Wstring_safeWcsncpy
.
4d439ae
to
f2209e4
Compare
shortenCwdParts(&ctx); | ||
if (shortenCwdLastPart(&ctx, false) > collapseCwdParts(&ctx, false)) { | ||
shortenCwdLastPart(&ctx, true); | ||
collapseCwdParts(&ctx, true); | ||
} else { | ||
collapseCwdParts(&ctx, true); | ||
shortenCwdLastPart(&ctx, true); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The rules/logic in this area is arcane and hard to read. What's the reasoning behind these function calls and conditionals?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A quick comment about the rationale of the check would be fine.
CwdUtils.c
Outdated
ShortenCwdContext ctx = { | ||
.maxLength = maxLength, | ||
.len = len, | ||
}; | ||
ctx.parts = Wstring_split(wcwd, L'/', &ctx.partsLen); | ||
free(wcwd); | ||
wcwd = NULL; | ||
ctx.partLengths = xCalloc(ctx.partsLen, sizeof(size_t)); | ||
for (size_t i = 0; i < ctx.partsLen; i++) | ||
ctx.partLengths[i] = wcslen(ctx.parts[i]); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This part of code looks like a constructor to ShortenCwdContext
, but with complex malloc
operations. Can we have a ShortenCwdContext_new(wchar_t* wstr)
function?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added a ShortenCwdContext_init
for not to allocate unnecessarily.
return 0; | ||
|
||
size_t lastPartLen = ctx->partLengths[ctx->partsLen - 1]; | ||
if (lastPartLen <= 3) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unexplained magic number 3.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Minimal length of directory/part strings (after shortening, if not shorter before).
070af35
to
21b5eca
Compare
#define HEADER_CwdUtils | ||
|
||
#include <stddef.h> | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Second blank line after the header includes.
size_t len; | ||
wchar_t **parts; | ||
size_t partsLen; | ||
size_t *partLengths; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The fields partsLen
and partLengths
have quite similar names which may cause confusion.
To mitigate this, using a naming scheme like strSize
/partCount
/partPtr
/partSize
might be somewhat more easy to disambiguate.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should be partPtrs
and partSizes
because the latter two are arrays.
} | ||
|
||
char* CwdUtils_shortenCwd(const char *cwd, const size_t maxLength) { | ||
wchar_t *wcwd = xMbstowcs(cwd); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I quick pre-test could be:
if (strnlen(cwd, maxLength) < maxLength) {
return xStrdup(cwd);
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't the comparison be <=
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, because ==
would mean that maxLength
bytes were read and no NUL byte was encountered. You couldn't distinguish the case of exactly maxLength
bytes from the one with more bytes. This doesn't hurt though, as the ==
case is caught below in the propr check (with conversion) …
You could do strnlen(cwd, maxLength + 1) < maxLength
if you wanted to catch that, but strictly speaking you'd then also have to include an overflow check for the maxLength + 1
…
|
||
static void shortenCwdParts(ShortenCwdContext *ctx) { | ||
for (size_t i = ctx->partsLen - 2; i != (size_t)-1 && ctx->len > ctx->maxLength; i--) { | ||
if (ctx->partLengths[i] < 3) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if (ctx->partLengths[i] < 3) | |
if (ctx->partLengths[i] <= 3) |
Also, documenting where the 3 comes from (initial, separator, trailer) would be nice.
return 0; | ||
|
||
size_t lastPartLen = ctx->partLengths[ctx->partsLen - 1]; | ||
if (lastPartLen <= 3) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Minimal length of directory/part strings (after shortening, if not shorter before).
For showing the CWD path in a more compact way, this function - shortens path parts before last from right to left, until first character of the path stays (directory -> direc~ or d~) - combines parts before last from right with left (a~/b~ -> a~~) - shortens the last part from middle, preferring to keep characters from left (longdirectoryname -> long~ame) in order (of 2nd and 3rd, the one that removes more characters is applied first) until given maximum length is reached.
21b5eca
to
430e2bb
Compare
Before I comment on the code quality further, I noticed the entire PR had a wrong assumption that every Unicode character occupy the same width on the terminal, which is not the case. Forget my previous comments (that was left 3 years ago) about printing the strings on the terminal twice. We need to determine the display width of the characters through wcwidth(3) or similar. Otherwise we might shorten the paths in all the wrong way. Another thing that confused me is, why are we converting the wide character strings back to multi-byte strings ( And by using |
Should I still store this short CWD (this time as |
I remarked about this about 3 years ago re So, this is one point that is on the todo list before inclusion.
Furthermore, we should take into account when sequences of Unicode codepoints need to go together like 🐈⬛ or 🏳️⚧️. Also, for several characters, in particular in the CJK range, the width of single characters may be larger than 1.
That's actually a good idea. Given the metadata the functions collect internally, both outputting into a raw string and a formatted
When creating a Contrary, if we assumed kinda optimized storage of the metadata to produce the So the important step is less the output as NB: Storing an Overall, there has been some discussion on IRC about this topic. During this discussion we found that we would like to give this feature a bit more time so the following things can be handled and implemented properly:
|
Just for everyone's info, I've seen a wcswidth implementation that does address multiple character emojis (a la ZWJ sequences). But's it's Python only. I believe it's still harmless if we miscalculated the width of emojis right now, because they happen more rarely in practice, and, depending on terminal display, they might not display as a combined icon but a separated sequence of emoji icons (which is not necessary a bug - Unicode permits such a fallback sequence display, especially for emojis that are added in later Unicode versions).
My personal concern with the current state of the PR is that it might use too many |
wcswidth(3) is good enough™ for the |
Yes, working with a single buffer for splitting is possible, but will require a second allocation to hold pointers for where each part begins (or some in-band information where the string terminates). Given you know the maximum final width, you can determine the number of splits for the result. For the input this is in the worst case NB: The actual conversion performance isn't too pressing, as the CWD doesn't change often AND we can cache the result when it stayed the same. |
For showing the CWD path in a more compact way, this function
character of the path stays (
directory
->direc~
ord~
)a~/b~
->a~~
)from left (
longdirectoryname
->long~ame
)in order (of 2nd and 3rd, the one that removes more characters is applied first) until given maximum length is reached.
Closes #462.