Add Unicode support to flexdll (was: Add wide-character version of flexdll_dlopen) #34
flexdll.c
Outdated
{
  int exec = mode & FLEXDLL_RTLD_NOEXEC ? 0 : 1;
  if (!file) return &main_unit;
  return flexdll_dlopen_aux(ll_dlopen(file, exec), mode);
I think this is wrong: ll_dlopen would now be called before setting FLEXDLL_RELOCATE, which can be read by the DLL's entry point.
Indeed, I had missed that!
OK, I pushed a different approach. To avoid duplicating any code I extracted the two functions that need to be adapted ( What do you think?
I'm not a big fan of putting such logic in the build system and introducing a custom syntax. It seems relying on the C preprocessor would work as well (simply include the same file twice, defining macros with different content each time). Alternatively, you could use a single function taking a flag to choose between A and W variants; the string argument can be a (void*), cast to either (char*) or (WCHAR*) according to the flag.
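The second alternative (one function, a flag, and a `void*` argument) could look roughly like the sketch below. It is portable C that uses `wchar_t` in place of `WCHAR`; `name_length` and `NAME_WIDE` are hypothetical names chosen for illustration and are not part of flexdll.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical flag: when set, the buffer holds wide characters. */
#define NAME_WIDE 1

/* Returns the number of characters in `name`, interpreted as char* or
   wchar_t* depending on `flags`. A loader could branch the same way to
   pick between the A and W variant of a Windows API. */
size_t name_length(const void *name, int flags)
{
  assert(name != NULL);
  if (flags & NAME_WIDE)
    return wcslen((const wchar_t *)name);
  return strlen((const char *)name);
}
```

The cost of this design is that the compiler can no longer check that the flag matches the pointer type, so a mismatched call fails only at run time.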
OK, I pushed another approach. We make the wide version the "main" one and the existing version becomes a small wrapper around the wide one converting between the two encodings. I will now be testing this to make sure it works :)
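As a rough illustration of that layout (not the code from this PR), the narrow entry point can convert its argument and delegate to the wide one. The sketch uses the standard `mbstowcs`, which follows the current C locale; on Windows one would more likely call `MultiByteToWideChar`, which is an assumption here. `open_wide` and `open_narrow` are hypothetical stand-ins for the real functions.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical "main" implementation working on wide strings.
   It only counts characters; it stands in for the real work. */
size_t open_wide(const wchar_t *name)
{
  return wcslen(name);
}

/* Narrow entry point: convert to wide characters, then delegate.
   Returns (size_t)-1 on allocation or conversion failure. */
size_t open_narrow(const char *name)
{
  size_t cap, n, result;
  wchar_t *wide;

  assert(name != NULL);
  cap = strlen(name) + 1;  /* wide chars needed <= bytes, plus terminator */
  wide = malloc(cap * sizeof *wide);
  if (wide == NULL) return (size_t)-1;
  n = mbstowcs(wide, name, cap);
  if (n == (size_t)-1) {   /* input is not valid in the current locale */
    free(wide);
    return (size_t)-1;
  }
  result = open_wide(wide);
  free(wide);
  return result;
}
```

Keeping all logic in the wide function means the wrapper is the only place where an encoding assumption is made.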
Tested by bootstrapping
I added one more commit with a primitive solution for #36, to be discussed.
I think the Cygwin case is broken, as reported by the warning:
Also the tests are failing:
Tests can be fixed by including windows.h in flexdll.h instead of flexdll.c.
The Cygwin problem is trickier. We want to use Cygwin's dlopen to allow using POSIX paths, but I could not find a "w" variant of it, so it is not clear what to do for flexdll_wdlopen. One could try to recode the WCHAR string to the current code page, but this can obviously fail.
Ah yes, sorry about this. I will fix the tests. Re the Cygwin issue, a simple possibility is to define a macro (say,
For Cygwin, your suggestion means that the caller needs to behave differently depending on whether it is compiled through Cygwin or a native compiler. It also means that only filenames that can be expressed in the current code page can be used. Why not, but at this point, I think it's better to tell users to use only flexdll_dlopen under Cygwin (either hide flexdll_wdlopen or have it fail at runtime). Or we could try to recode to the local code page in ll_dlopen for Cygwin (so that flexdll_wdlopen can be used as long as the filename can be encoded). Yet another option is to add a fallback to LoadLibraryW when the filename cannot be encoded (so the caller can use either a POSIX path that can be encoded in the local code page or an arbitrary Windows path).
Yes, you are right - my suggestion was a bit muddled; what I meant is this: in
void *flexdll_dlopen(const char *, int);
#ifdef _WIN32
void *flexdll_wdlopen(const WCHAR *, int);
#endif
#if defined(__CYGWIN__) || !defined(_UNICODE)
#define flexdll_tdlopen flexdll_dlopen
#else
#define flexdll_tdlopen flexdll_wdlopen
#endif
Then
Personally I think this is simpler than trying to recode under Cygwin and cleaner than using a fallback.
BTW, is https://github.com/alainfrisch/flexdll/blob/master/flexdll.c#L362 correct? Shouldn't it be
We pass -DCYGWIN (resp. -DMSVC/-DMINGW) in the Makefile when we compile flexdll.c.
I think it's useful to have access to flexdll_wdlopen even under Cygwin. But perhaps it's ok to declare that it doesn't support POSIX paths (so that it can always use LoadLibraryW). Only flexdll_dlopen would support POSIX paths (through Cygwin's dlopen).
What would be the use case of having
The case I had in mind was simply opening a DLL specified by a POSIX path with "non-local" characters (i.e. "/foo/Ђ").
Maybe I am missing something, but what is a "POSIX path" in this context? I assumed that Cygwin's dlopen (and other filename-related functions) always took the argument to be encoded in the local codepage. They must make an encoding assumption somewhere along the way in order to implement those functions using native Windows APIs, right?
POSIX path: a path interpreted by Cygwin (supporting "/foo/bar", following Cygwin symlinks, etc). I share your interpretation that Cygwin has to make an assumption about the encoding of file names when mapping to the Windows API. I guess they rely on the local codepage (perhaps by calling *A functions, not *W ones). So:
So, it turns out that Cygwin functions assume their arguments are encoded in the "current locale" and translate them internally into UTF-16, so that there is no need for See https://cygwin.com/cygwin-ug-net/setup-locale.html for more.
I amended this PR as discussed (no wide version under Cygwin), and cleaned it up. AFAIK, the only remaining issue is that the UTF-16 response file support was not backwards compatible, as it assumes that OCaml strings are UTF-8 encoded. To handle this, I implemented the usual fallback: strings are assumed to be UTF-8 but if this is not the case then we assume they are in the local code page and fall back to the usual non UTF-16-encoded response files. Let me know if I am forgetting about anything else, otherwise I think this should be good to go!
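The fallback described here hinges on deciding whether a byte string is well-formed UTF-8. The PR appears to do this on the OCaml side; the C sketch below only illustrates the idea, and `is_valid_utf8`, `guess_encoding` and the `encoding` enum are hypothetical names, not functions from flexdll.

```c
#include <assert.h>
#include <stddef.h>

/* Returns 1 if the `len` bytes at `s` are well-formed UTF-8, else 0.
   Rejects overlong forms, surrogates and values above U+10FFFF. */
int is_valid_utf8(const unsigned char *s, size_t len)
{
  size_t i = 0;
  assert(s != NULL || len == 0);
  while (i < len) {
    unsigned char c = s[i];
    size_t extra, k;        /* continuation bytes expected, loop index */
    unsigned long cp, min;  /* decoded value, smallest legal value */

    if (c < 0x80)                { i++; continue; }
    else if ((c & 0xE0) == 0xC0) { extra = 1; cp = c & 0x1F; min = 0x80; }
    else if ((c & 0xF0) == 0xE0) { extra = 2; cp = c & 0x0F; min = 0x800; }
    else if ((c & 0xF8) == 0xF0) { extra = 3; cp = c & 0x07; min = 0x10000; }
    else return 0;                   /* stray continuation or bad lead byte */

    if (len - i <= extra) return 0;  /* sequence is truncated */
    for (k = 1; k <= extra; k++) {
      if ((s[i + k] & 0xC0) != 0x80) return 0;
      cp = (cp << 6) | (s[i + k] & 0x3F);
    }
    if (cp < min) return 0;                      /* overlong encoding */
    if (cp >= 0xD800 && cp <= 0xDFFF) return 0;  /* surrogate range */
    if (cp > 0x10FFFFul) return 0;
    i += extra + 1;
  }
  return 1;
}

/* Hypothetical policy mirroring the fallback: UTF-8 if it validates,
   otherwise treat the bytes as being in the local code page. */
enum encoding { ENC_UTF8, ENC_LOCAL_CODE_PAGE };

enum encoding guess_encoding(const unsigned char *s, size_t len)
{
  return is_valid_utf8(s, len) ? ENC_UTF8 : ENC_LOCAL_CODE_PAGE;
}
```

A heuristic like this can misclassify a code-page string that happens to be valid UTF-8, which is the usual trade-off of such fallbacks.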
let cp n = Buffer.add_char b (Char.chr (n land 0xFF)); Buffer.add_char b (Char.chr ((n lsr 8) land 0xFF)) in
while !i < String.length s do
  let n = utf8_next s i in
  if n <= 0xFFFF then cp n else (cp (0xD7C0 + (n lsl 10)); cp (0xDC00 + (n land 0x3FF)))
Is this supposed to be UTF-16 surrogate encoding? The formula for the high surrogate is (very) wrong, I'm afraid - although sufficiently so that I'm wondering if I'm missing something else.
else let n = n - 0x10000 in (cp (0xD800 + (n lsr 10)); ...)
is allowing me to run make world
from a directory called 🐫
Ooops, did I make a mistake? I will take a closer look in a bit, thanks!
Indeed, the first term seems to be really wrong - I can't understand what I must have been thinking to be honest... And it did work in the few tests I did -- pure luck I guess!
Can you please make a PR with this fix? Thanks!
Doh! Just for the proverbial record, it's described on Wikipedia.
Over my ☕, I couldn't resist the puzzle: your code works for the high surrogates for exactly 16 of the non-BMP characters (precisely ones where you get the extra 1 required to correct the base bit pattern for 0xD800)... perhaps very appropriately, lots of them are from Linear-B 😄 𐀁 𐁁 𐂁 𐃁 𐄁 𐅁 𐆁 𐊁 𐋁 𐌁 𐍁 𐎁 𐏁
See #47
See #33, ocaml/ocaml#1200.