Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add Unicode support to flexdll (was: Add wide-character version of flexdll_dlopen) #34

Merged
merged 3 commits into from
Jul 25, 2017

Conversation

nojb
Copy link
Collaborator

@nojb nojb commented Jun 26, 2017

flexdll.c Outdated
{
int exec = mode & FLEXDLL_RTLD_NOEXEC ? 0 : 1;
if (!file) return &main_unit;
return flexdll_dlopen_aux(ll_dlopen(file, exec), mode);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is wrong: ll_dlopen would now be called before setting FLEXDLL_RELOCATE, which can be read by the DLL's entry point.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed, I had missed that!

@nojb
Copy link
Collaborator Author

nojb commented Jun 27, 2017

OK, I pushed a different approach.

To avoid duplicating any code I extracted the two functions that need to be adapted (flexdll_dlopen and ll_dlopen) into the file flexdll_dlopen-templ.c with some %%VARIABLE%% placeholders. The Makefile generates flexdll_dlopen.c and flexdll_dlwopen.c by making the necessary substitutions (using sed) and those two files are #included from flexdll.c.

What do you think ?

@alainfrisch
Copy link
Collaborator

I'm not a big fan of putting such logic in the build system and introducing a custom syntax. It seems relying on the C preprocessor would work as well (simply include the same file twice, defining macros with different content each time).

Alternatively, you could use a single function taking a flag to choose between A and W variants; the string argument can be a (void*), cast to either (char*) or (WCHAR*) according to the flag.

@nojb
Copy link
Collaborator Author

nojb commented Jun 29, 2017

OK, I pushed another approach. We make the wide version the "main" one and the existing version becomes a small wrapper around the wide one converting between the two encodings.

I will now be testing this to make sure it works :)

@nojb
Copy link
Collaborator Author

nojb commented Jun 30, 2017

Tested by bootstrapping flexdll and ocaml from the official distribution and using the resulting native-code compiler to compile a small example, all of which seemed to work well.

@nojb nojb changed the title Add wide-character version of flexdll_dlopen Add Unicode support to flexdll (was: Add wide-character version of flexdll_dlopen) Jun 30, 2017
@nojb
Copy link
Collaborator Author

nojb commented Jun 30, 2017

I added one more commit with a primitive solution for #36, to be discussed.

@alainfrisch
Copy link
Collaborator

I think the Cygwin case is broken, as reported by the warning:

i686-pc-cygwin-gcc -c -DCYGWIN -o flexdll_cygwin.o flexdll.c
flexdll.c: In function ‘flexdll_wdlopen’:
flexdll.c:380:22: warning: passing argument 1 of ‘ll_dlopen’ from incompatible pointer type [-Wincompatible-pointer-types]
   handle = ll_dlopen(file, exec);
                      ^
flexdll.c:52:14: note: expected ‘const char *’ but argument is of type ‘const WCHAR * {aka const short unsigned int *}’
 static void *ll_dlopen(const char *libname, int for_execution) {
              ^

@alainfrisch
Copy link
Collaborator

Also the tests are failing:

  • MSVC:
$ make demo_msvc
(cd test && LIB="C:/cygwin/home/frisch/SHARED~2/msvc/v9/Lib;C:/cygwin/home/frisch/SHARED~2/winsdk/v7.0/Lib" INCLUDE="C:/cygwin/home/frisch/SHARED~2/msvc/v9/Include;C:/cygwin/home/frisch/SHARED~2/winsdk/v7.0/Include" make clean demo CHAIN=msvc CC="C:/cygwin/home/frisch/SHARED~2/msvc/v9/Bin/cl.exe /nologo /MD -D_CRT_SECURE_NO_DEPRECATE /GS-" O=obj)
make[1]: Entering directory '/home/frisch/flexdll/test'
rm -f *.o *.obj *.dll *.exe *~ *.manifest
C:/cygwin/home/frisch/SHARED~2/msvc/v9/Bin/cl.exe /nologo /MD -D_CRT_SECURE_NO_DEPRECATE /GS- -I.. -c dump.c
dump.c
..\flexdll.h(25) : error C2143: syntax error : missing ')' before '*'
..\flexdll.h(25) : error C2143: syntax error : missing '{' before '*'
..\flexdll.h(25) : error C2059: syntax error : ','
..\flexdll.h(25) : error C2059: syntax error : ')'
  • MINGW:
$ make demo_mingw
(cd test && make clean demo CHAIN=mingw CC="i686-w64-mingw32-gcc" O=o)
make[1]: Entering directory '/home/frisch/flexdll/test'
rm -f *.o *.obj *.dll *.exe *~ *.manifest
i686-w64-mingw32-gcc -I.. -c dump.c
In file included from dump.c:14:0:
../flexdll.h:25:29: error: unknown type name ‘WCHAR’
 void *flexdll_wdlopen(const WCHAR *, int);
                             ^

@alainfrisch
Copy link
Collaborator

Tests can be fixed by including windows.h in flexdll.h instead of flexdll.c.

@alainfrisch
Copy link
Collaborator

The Cygwin problem is more tricky. We want to use Cygwin's dlopen to allow using POSIX paths. But I could not find a "w" variant of it. So it is not clear what to do for flexdll_wdlopen. One could try to recode the WCHAR to the current code page, but this can obviously fail.

@nojb
Copy link
Collaborator Author

nojb commented Jul 13, 2017

Ah yes, sorry about this. I will fix the tests.

Re the Cygwin issue, a simple possibility is to define a macro (say, TCHAR) which stands for char in Cygwin and WCHAR on WIN32, we use that for the signature of flexdll_wdlopen, and and we make flexdll_wdlopen equal to flexdll_dlopen on Cygwin.

@alainfrisch
Copy link
Collaborator

For Cygwin, your suggestion means that the caller needs to behave differently if compiled through cygwin or a native compiler. And this means only filenames that can be expressed in the current code page can be used. Why not, but at this point, it think it's better to tell users to use flexdll_dlopen only under Cywin (either hide flexdll_wdlopen or have it fail at runtime). Or we could try to recode to local code page in ll_dlopen for Cygwin (so that flexdll_wdlopen can be used as long as the filename can be encoded). Yet another option is to add a fallback to LoadLibraryW when the filename cannot be encoded (so the caller can use either a POSIX path that can be encoded in the local code page or an arbitrary Windows path).

@nojb
Copy link
Collaborator Author

nojb commented Jul 13, 2017

Yes, you are right - my suggestion was a bit muddled; what I meant is this: in flexdll.h we write the following

void *flexdll_dlopen(const char *, int);
#ifdef _WIN32
void *flexdll_wdlopen(const WCHAR *, int);
#endif

#if defined(__CYGWIN__) || !defined(_UNICODE)
#define flexdll_tdlopen flexdll_dlopen
#else
#define flexdll_tdlopen flexdll_wdlopen
#endif

Then

  1. One can use flexdll_dlopen both in Cygwin and native and its behaviour is the same as today.
  2. One can use flexdll_wdlopen in native if one wants that.
  3. One can use flexdll_tdlopen in both Cygwin and native to automatically choose the wide or narrow version (compatible with the way Windows headers work). Under Cygwin this is always flexdll_dlopen.

Personally I think this is simpler than trying to recode under Cygwin and cleaner than using a fallback.

@nojb
Copy link
Collaborator Author

nojb commented Jul 13, 2017

BTW, is https://github.com/alainfrisch/flexdll/blob/master/flexdll.c#L362 correct? Shouldn't it be __CYGWIN__?

@alainfrisch
Copy link
Collaborator

We pass -DCYGWIN (resp. -DMSVC/-DMINGW) in the Makefile when we compile flexdll.c.

@alainfrisch
Copy link
Collaborator

I think it's useful to have access to flexdll_wdlopen even under Cygwin. But perhaps it's ok to declare that it doesn't support POSIX paths (so that it can always use LoadLibraryW). Only flexdll_dlopen would support POSIX path (through Cygwin's dlopen).

@nojb
Copy link
Collaborator Author

nojb commented Jul 17, 2017

What would be the usecase of having flexdll_wdlopen under Cygwin ? Is it in case one wants/needs to use both Cygwin's POSIX API and the native Win API ?

@alainfrisch
Copy link
Collaborator

The case I had in mind was simply opening a DLL specified by a POSIX path with "non-local" characters (i.e. "/foo/Ђ").

@nojb
Copy link
Collaborator Author

nojb commented Jul 17, 2017

I think it's useful to have access to flexdll_wdlopen even under Cygwin. But perhaps it's ok to declare that it doesn't support POSIX paths (so that it can always use LoadLibraryW). Only flexdll_dlopen would support POSIX path (through Cygwin's dlopen).

Maybe I am missing something, but what is "POSIX path" in this context ? I assumed that Cygwin's dllopen (and other filename-related functions) always assumed the argument to be encoded in the local codepage. They must make an encoding assumption somewhere along the way in order to implement those functions using native Windows APIs, right ?

@alainfrisch
Copy link
Collaborator

POSIX path: a path interpreted by Cygwin (supporting "/foo/bar", following Cygwin symlinks, etc).

I share your interpretation that Cygwin has to make an assumption about the encoding of file name when mapping to the Windows API. I guess they rely on the local codepage (probabperhaps by calling *A functions, not *W ones).

So:

  • It would be a regression for flexdll under Cygwin not to allow POSIX path in flexdll_dlopen => we should keep the previous behavior and call Cygwin's dlopen in that case.

  • It would be a limitation of the Cygwin port if flexdll_wdlopen were not available => let's expose it (it calls LoadLibraryW internally), and document that this function does not support POSIX paths. We can lift this restriction later if we find a way around it (e.g. by manualling translating the POSIX path to a Windows path, using a Cygwin-specific function).

@alainfrisch alainfrisch reopened this Jul 18, 2017
@nojb
Copy link
Collaborator Author

nojb commented Jul 19, 2017

So, it turns out that Cygwin functions assume their arguments are encoded in the "current locale" and translated internally into UTF-16, so that there is no need for flexdll_wdlopen. Here, "current locale" is the one set at process startup (changes during the lifetime of the process do not change the behaviour of the Cygwin API).

See https://cygwin.com/cygwin-ug-net/setup-locale.html for more.

@nojb
Copy link
Collaborator Author

nojb commented Jul 22, 2017

I amended this PR as discussed (no wide version under Cygwin), and cleaned it up. AFAIK, the only remaining issue is that the UTF-16 response file support was not backwards compatible, as it assumes that OCaml strings are UTF-8 encoded. To handle this, I implemented the usual fallback: strings are assumed to be UTF-8 but if this is not the case then we assume they are in the local code page and fall back to the usual non UTF-16-encoded response files.

Let me know if I am forgetting about anything else, otherwise I think this should be good to go!

@alainfrisch alainfrisch merged commit ecdc6fb into ocaml:master Jul 25, 2017
let cp n = Buffer.add_char b (Char.chr (n land 0xFF)); Buffer.add_char b (Char.chr ((n lsr 8) land 0xFF)) in
while !i < String.length s do
let n = utf8_next s i in
if n <= 0xFFFF then cp n else (cp (0xD7C0 + (n lsl 10)); cp (0xDC00 + (n land 0x3FF)))
Copy link
Member

@dra27 dra27 Oct 20, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this supposed to be UTF-16 surrogate encoding? The formula for the high surrogate is (very) wrong, I'm afraid - although sufficiently so that I'm wondering if I'm missing something else.

else let n = n - 0x10000 in (cp (0xD800 + (n lsr 10)); ...)

is allowing me to run make world from a directory called 🐫

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ooops, did I make a mistake? I will take a closer look in a bit, thanks!

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed, the first term seems to be really wrong - I can't understand what I must have been thinking to be honest... And it did work in the few tests I did -- pure luck I guess!

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you please make a PR with this fix? Thanks!

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doh! Just the for proverbial record, it's described on Wikipedia.

Over my ☕, I couldn't resist the puzzle: your code works for the high surrogates for exactly 16 of the non-BMP characters (precisely ones where you get the extra 1 required to correct the base bit pattern for 0xD800)... perhaps very appropriately, lots of them are from Linear-B 😄 𐀁 𐁁 𐂁 𐃁 𐄁 𐅁 𐆁 𐊁 𐋁 𐌁 𐍁 𐎁 𐏁

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See #47

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants