[RFC] Add Unicode support #48
Conversation
This is a first try at adding Unicode support to `ocaml-re`. The approach is the one suggested at the end of #24: translate *unicode* regular expressions into *byte*-oriented regular expressions that match UTF-8-encoded strings.

This patchset adds a new module, `Re_unicode`, which defines the type of unicode regular expressions. A new type is needed because it is not safe to mix unicode and byte-oriented regular expressions. The interface is a slightly modified version of the existing interface in `Re`. There is also a corresponding findlib library, `re.unicode`.

On the implementation side, `Re_unicode` uses the same implementation as `Re`, but overloads the `Set` variant to mean sets of *unicode* code points. Before compilation the unicode regular expression is traversed (see `handle_unicode` in `re_unicode.ml`) to translate it into a byte-oriented one, and then everything goes through as before.

The same POSIX character classes (`alnum`, `digit`, etc.) present in `Re` are offered by this module, but implemented in terms of unicode as suggested by [UTS #18](http://www.unicode.org/reports/tr18/#Compatibility_Properties). More fine-grained unicode character sets can be added easily. The definition of unicode character sets depends on static data contained in the module `Unicode_groups`. This module can be regenerated from the Unicode Character Database by a tool called `gen_unicode_groups`, found in the `tools/` directory. The tool itself depends on the [`uucd`](http://erratique.ch/software/uucd) library; no extra dependency is required for `re`.

There are some difficulties in integrating unicode elegantly into the current code base, arising from the fact that this library makes some assumptions about the character set when compiling and interpreting REs:

- The `bol`, `eol`, `eow`, `bow` combinators are ASCII-specific and handled specially in the RE engine. These are not currently present in the `Re_unicode` interface.
- (**BREAKING CHANGE**) The `case`, `no_case` combinators have different semantics for unicode, because a single unicode character can change case to a *sequence* of unicode characters, so it is no longer true that applying one of these combinators to a character set gives a character set. In the present patchset the result of applying one of these combinators to a character set is no longer considered to be a character set (which is the right behaviour for unicode, I think). This can be reverted back to the old behaviour for the `Re` module with a little bit of refactoring.
- The `~pos` and `~len` arguments in the `exec*`, `all*`, `split*` functions require indexing into a UTF-8 string, which takes time linear in the length of the string. These arguments are not present in the current changeset, but could easily be added.

In summary, I think the best course of action would be to refactor the library so that the core is completely independent of the character set and deals only with arbitrary bytes, with both ASCII and unicode engines added on top. There is some special handling of the `bol`, `eol`, `eow`, `bow` combinators in the RE engine that would have to be factored out, but someone more knowledgeable about the internals needs to explain what is involved in this.

Lastly, this changeset has only been tested very lightly and more thorough testing is required, but I wanted to put it out now in order to receive feedback from those interested. Any and all comments welcome!
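The core of the translation is that every unicode code point corresponds to a fixed byte sequence under UTF-8, so a set of code points can be compiled into byte-oriented alternatives. A minimal sketch of that correspondence (illustrative only; `utf8_encode` is my name, not a function from the patch):

```ocaml
(* Sketch: encode a unicode code point as its UTF-8 byte sequence.
   A singleton code-point set in a unicode RE thus becomes a literal
   byte sequence in the translated byte-oriented RE. *)
let utf8_encode cp =
  let b = Buffer.create 4 in
  if cp < 0x80 then Buffer.add_char b (Char.chr cp)
  else if cp < 0x800 then begin
    Buffer.add_char b (Char.chr (0xC0 lor (cp lsr 6)));
    Buffer.add_char b (Char.chr (0x80 lor (cp land 0x3F)))
  end else if cp < 0x10000 then begin
    Buffer.add_char b (Char.chr (0xE0 lor (cp lsr 12)));
    Buffer.add_char b (Char.chr (0x80 lor ((cp lsr 6) land 0x3F)));
    Buffer.add_char b (Char.chr (0x80 lor (cp land 0x3F)))
  end else begin
    Buffer.add_char b (Char.chr (0xF0 lor (cp lsr 18)));
    Buffer.add_char b (Char.chr (0x80 lor ((cp lsr 12) land 0x3F)));
    Buffer.add_char b (Char.chr (0x80 lor ((cp lsr 6) land 0x3F)));
    Buffer.add_char b (Char.chr (0x80 lor (cp land 0x3F)))
  end;
  Buffer.contents b

(* e.g. U+00E9 (é) becomes the two-byte string "\xC3\xA9" *)
let () = assert (utf8_encode 0xE9 = "\xC3\xA9")
```

A code-point *range* is handled similarly, except that it must first be split at encoding-length boundaries (0x80, 0x800, 0x10000) before it can be expressed as byte ranges.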
It is not currently needed.
Matches any UTF-8 character except '\n'. This is what re2 does.
`any_byte` => `any`
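The byte-oriented expansion of "any UTF-8 character except `'\n'`" can be sketched with `Re`'s existing combinators, one alternative per encoding length (this is my illustration, not the patch's code, and it ignores surrogates and overlong encodings for brevity; it assumes the `re` opam package):

```ocaml
(* "any UTF-8 scalar except newline", expressed over raw bytes *)
let utf8_any_but_newline =
  let open Re in
  let cont = rg '\x80' '\xBF' in                    (* continuation byte *)
  alt
    [ alt [ rg '\x00' '\x09'; rg '\x0B' '\x7F' ]    (* 1 byte, minus '\n' *)
    ; seq [ rg '\xC2' '\xDF'; cont ]                (* 2 bytes *)
    ; seq [ rg '\xE0' '\xEF'; cont; cont ]          (* 3 bytes, simplified *)
    ; seq [ rg '\xF0' '\xF4'; cont; cont; cont ]    (* 4 bytes, simplified *)
    ]

let () =
  let re = Re.compile (Re.whole_string utf8_any_but_newline) in
  assert (Re.execp re "\xC3\xA9");      (* a two-byte sequence matches *)
  assert (not (Re.execp re "\n"))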
This is in preparation for implementing `case`, `no_case` for unicode using `simple_case_folding` (which is a 1:1 mapping between unicode characters).
The case-folding table still takes up too much space, but it can be reduced further by using the same tricks as in re2 (see https://github.com/nelhage/re2/blob/master/re2/unicode_casefold.h).
Untranslated unicode code points were being passed to `compile_1`. Also, changed the signature of `case_insens` back to `Cset.t -> Cset.t` in preparation for implementing `case`, `no_case` for Unicode.
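The re2 trick referenced here is to store runs of consecutive code points that fold by a common rule as a single range plus a delta, instead of one entry per character. A toy sketch of the idea (the table excerpt is hypothetical in shape and tiny; the real table covers all of Unicode):

```ocaml
(* One entry per *run* of code points sharing a folding rule. *)
type fold_delta =
  | Delta of int   (* add a fixed offset, e.g. +32 for ASCII A-Z *)
  | EvenOdd        (* U+0100..U+0137-style pairs: even folds to odd *)

type fold_range = { lo : int; hi : int; delta : fold_delta }

let table =
  [ { lo = 0x41;  hi = 0x5A;  delta = Delta 32 }   (* A-Z -> a-z *)
  ; { lo = 0x100; hi = 0x137; delta = EvenOdd }    (* Ā ā Ă ă ... *)
  ]

let fold cp =
  match List.find_opt (fun r -> r.lo <= cp && cp <= r.hi) table with
  | None -> cp                                     (* folds to itself *)
  | Some { delta = Delta d; _ } -> cp + d
  | Some { delta = EvenOdd; _ } -> if cp land 1 = 0 then cp + 1 else cp

let () =
  assert (fold (Char.code 'A') = Char.code 'a');
  assert (fold 0x100 = 0x101);   (* Ā -> ā *)
  assert (fold 0x3F = 0x3F)
```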
Same as in `Re`.
The `case`, `no_case` now work with Unicode. A more compact encoding of the `Unicode_groups.foldcase` table will be introduced later.
For some reason, removing them causes some tests to fail. It is also not clear whether removing or keeping them is the better option.
They are implemented using `uutf`. Later they could be rewritten directly so that we do not depend on any external library.
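Dropping the `uutf` dependency would amount to hand-rolling a decoder along these lines (a minimal sketch of mine, not code from the patch; error handling is deliberately simplified to "malformed input yields U+FFFD with width 1"):

```ocaml
(* Decode the UTF-8 sequence starting at offset i of s; return the
   code point and the width of its encoding. *)
let decode s i =
  let b k = Char.code s.[i + k] in
  let n = String.length s - i in
  let c0 = b 0 in
  if c0 < 0x80 then (c0, 1)                         (* ASCII *)
  else if c0 < 0xC0 then (0xFFFD, 1)                (* lone continuation *)
  else if c0 < 0xE0 && n >= 2 then
    (((c0 land 0x1F) lsl 6) lor (b 1 land 0x3F), 2)
  else if c0 < 0xF0 && n >= 3 then
    (((c0 land 0x0F) lsl 12) lor ((b 1 land 0x3F) lsl 6)
       lor (b 2 land 0x3F), 3)
  else if n >= 4 then
    (((c0 land 0x07) lsl 18) lor ((b 1 land 0x3F) lsl 12)
       lor ((b 2 land 0x3F) lsl 6) lor (b 3 land 0x3F), 4)
  else (0xFFFD, 1)

let () =
  assert (decode "a" 0 = (0x61, 1));
  assert (decode "\xC3\xA9" 0 = (0xE9, 2))   (* é *)
```

A production version would also reject overlong encodings and surrogate code points, which is part of what `uutf` already does.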
This is great! Thanks a lot! I agree the library will need to be refactored. Regarding the
Great! Regarding the problematic combinators (
Indeed, these combinators are special cases of lookahead and lookbehind, and I think that's how they should be implemented. The way to implement a lookbehind subexpression

The implementation of ocaml-re is DFA-based. Intuitively, each state of the DFA is the disjunction of the positions we may have reached in the regular expression. In fact, we have a tree structure rather than simply a disjunction, to deal with the matching semantics (longest, shortest or first match). To deal with lookahead, we will need something like a disjunction of conjunctions (both the main regular expression and the lookahead expression(s) must match). We need a couple of restrictions on lookbehind and lookahead expressions to make this work:
And it's going to be much simpler to implement if we do not allow group matching inside lookahead and lookbehind expressions.
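The DFA intuition above ("a state is where a match may stand in the expression") can be illustrated with a toy Brzozowski-derivative matcher, where the state is a residual expression rather than a set of positions. This is only a teaching sketch of mine, not ocaml-re's actual engine; lookahead would extend such a state into a disjunction of conjunctions, as described above.

```ocaml
type re =
  | Empty                       (* matches nothing *)
  | Eps                         (* matches "" *)
  | Chr of char
  | Alt of re * re
  | Seq of re * re
  | Star of re

let rec nullable = function
  | Eps | Star _ -> true
  | Empty | Chr _ -> false
  | Alt (a, b) -> nullable a || nullable b
  | Seq (a, b) -> nullable a && nullable b

(* deriv r c = expression matching what may follow after consuming c *)
let rec deriv r c =
  match r with
  | Empty | Eps -> Empty
  | Chr c' -> if c = c' then Eps else Empty
  | Alt (a, b) -> Alt (deriv a c, deriv b c)
  | Seq (a, b) ->
    let d = Seq (deriv a c, b) in
    if nullable a then Alt (d, deriv b c) else d
  | Star a -> Seq (deriv a c, Star a)

let matches r s = nullable (String.fold_left deriv r s)

let () =
  let r = Seq (Chr 'a', Star (Chr 'b')) in   (* ab* *)
  assert (matches r "abb");
  assert (not (matches r "ba"))
```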
👍
A fun hack, but no more than that.
This is a first try to adding Unicode support to `ocaml-re`. The approach is the one suggested at the end of #24. Namely, translate unicode regular expressions into byte-oriented regular expressions that match UTF-8-encoded strings.

This patchset adds a new module, `Re_unicode`, which defines the type of unicode regular expressions. A new type is needed because it is not safe to mix unicode and byte-oriented regular expressions. The interface is a slightly modified version of the existing interface in `Re`. There is also a corresponding findlib library `re.unicode`. The module `Re_unicode` uses `uutf` to decode UTF-8 strings, but this is not a hard dependency and could easily be swapped out later to keep dependencies to a minimum.

On the implementation side, `Re_unicode` uses the same implementation as `Re`, but overloading the `Set` variant to mean sets of unicode code points. Before compilation the unicode regular expression is traversed (see `handle_unicode` in `re_unicode.ml`) to translate unicode character sets into suitable byte-oriented regular expressions. After this step, everything goes through as before.

The same POSIX character classes (`alnum`, `digit`, etc.) present in `Re` are offered by this module, but implemented in terms of unicode as suggested by UTS #18. More fine-grained unicode character sets can be added easily. The definition of unicode character sets depends on static data contained in the module `Unicode_groups`. This module can be regenerated from the Unicode Character Database by a tool called `gen_unicode_groups` found in the `tools/` directory. This tool depends on the `uucd` library.

There are some difficulties in integrating unicode elegantly to the current code base arising from the fact that this library makes some assumptions about the character set when compiling and interpreting REs:

- The `bol`, `eol`, `eow`, `bow` combinators are ASCII-specific and handled specially in the RE engine. These are not currently present in the `Re_unicode` interface.
- (**BREAKING CHANGE**) The `case`, `no_case` combinators have different semantics for unicode because a single unicode character can change case to a sequence of unicode characters, so it is no longer true that applying one of these combinators to a character set gives a character set. In the present patchset the result of applying one of these combinators to a character set is no longer considered to be a character set (which is the right behaviour for unicode, I think). This can be reverted back to the old behaviour for the `Re` module with a little bit of refactoring. This is solved by using the `simple_case_folding` property of unicode characters, which is 1:1, and so does not suffer from the above problem. **UPDATE** The `case`, `no_case` combinators are now working with Unicode.
- The `~pos` and `~len` arguments in the `exec*`, `all*`, `split*` functions require indexing into a UTF-8 string which takes time linear on the length of the string. These arguments are not present in the current changeset, but could be easily added.

In summary, I think that, long-term, the best course of action would be to refactor the library so that the core is completely independent of the character set used and deals only with arbitrary bytes. Both ASCII and unicode engines should be added on top. Right now, there is some special handling of the `bol`, `eol`, `eow`, `bow` combinators in the RE engine that would have to be factored out, but someone more knowledgeable about the internals needs to explain what is involved in this.

Any and all comments welcome!