Collation: case_level doesn't seem to work properly to ignore accents but take case into account (ICU StringSearch and UCollator; UCOL_STRENGTH=UCOL_PRIMARY, UCOL_CASE_LEVEL=UCOL_ON) #415
Comments
Could you please paste the output of a call to |
I wonder if strength=2 is what you might need: stringi::stri_detect_coll(c("Mario", "mario", "Mário", "mário"), "mario", strength = 2L,case_level = TRUE, locale="pt_BR")
## [1] TRUE TRUE FALSE FALSE
stringi::stri_detect_coll(c("Mario", "mario", "Mário", "mário"), "mario", strength = 2L,case_level = TRUE)
## [1] TRUE TRUE FALSE FALSE |
$Unicode.version $ICU.version $Locale $Locale$Country $Locale$Variant $Locale$Name $Charset.internal $Charset.native $Charset.native$Name.ICU $Charset.native$Name.UTR22 $Charset.native$Name.IBM $Charset.native$Name.WINDOWS $Charset.native$Name.JAVA $Charset.native$Name.IANA $Charset.native$Name.MIME $Charset.native$ASCII.subset $Charset.native$Unicode.1to1 $Charset.native$CharSize.8bit $Charset.native$CharSize.min $Charset.native$CharSize.max $ICU.system $ICU.UTF8 |
Actually, what I want is to ignore accents, but not case. According to the ICU documentation mentioned above, I should achieve that by setting strength to level 1 and case level to On. I also work with PostgreSQL, where it works as documented, but with stringi I can only get the same result by setting locale to "", regardless of the case_level setting.
|
First of all, thanks, there was a bug; |
instead, stringi::stri_detect_coll(c("Mario", "mario", "Mário", "mário"), "mario", strength = 1L, locale="POSIX")
## [1] FALSE TRUE FALSE TRUE |
Yes, but I still believe that it can get a bit confusing: to ignore accents but not case in stringi, you have to set the locale rather than case_level. That is not how PostgreSQL (and probably other tools) works, i.e., there you must turn the case level on along with the primary strength. This is also what the ICU documentation says:
|
Hmmm.... interestingly, a collator-based string comparison honours the above rule... > stringi::stri_cmp_equiv(c("Mario", "mario", "Mário", "mário"), "mario", case_level=TRUE, strength=2L)
[1] FALSE TRUE FALSE FALSE
> stringi::stri_cmp_equiv(c("Mario", "mario", "Mário", "mário"), "mario", case_level=TRUE, strength=1L)
[1] FALSE TRUE FALSE TRUE
> stringi::stri_cmp_equiv(c("Mario", "mario", "Mário", "mário"), "mario", strength=1L)
[1] TRUE TRUE TRUE TRUE
> stringi::stri_cmp_equiv(c("Mario", "mario", "Mário", "mário"), "mario", strength=2L)
[1] TRUE TRUE FALSE FALSE |
I was trying hard to figure out why stringi::stri_detect_coll(c("Mario", "mario", "Mário", "mário"), "mario", case_level=TRUE, strength=1L)
## [1] TRUE TRUE TRUE TRUE
stringi::stri_cmp_equiv(c("Mario", "mario", "Mário", "mário"), "mario", case_level=TRUE, strength=1L)
## [1] FALSE TRUE FALSE TRUE |
(note to self): ICU 69.1 gives the results as above. @todo: create a minimal reproducible example outside of stringi |
[note to self] Yes, this is reproducible outside of stringi:
/* g++ -std=c++11 icu_test_bug_ucol_caselevel.cpp -licui18n -licuuc -licudata && ./a.out */
#include <unicode/ustring.h>
#include <unicode/rbbi.h>
#include <unicode/coll.h>
#include <unicode/ucol.h>
#include <unicode/stsearch.h>
#include <cstdio>
using namespace icu;

/* Reset status, run f, and bail out of main() on any ICU error. */
#define WITH_CHECK_STATUS(f) \
    status = U_ZERO_ERROR; \
    f; \
    if (U_FAILURE(status)) {printf("error %s!\n", u_errorName(status));return 1;}

int main()
{
    const char* haystacks[] = {
        "mario", "Mario", "MARIO", "MÁRIO", "Mário", "mário", "dario"
    };
    const char* needle = "mario";
    UErrorCode status;
    UCollator* col;
    UStringSearch* matcher;
    int v;

    printf("U_ICU_VERSION=%s\n", U_ICU_VERSION);

    /* Primary strength with the case level on: ignore accents, respect case. */
    WITH_CHECK_STATUS(col = ucol_open("pt_BR", &status))
    WITH_CHECK_STATUS(ucol_setAttribute(col, UCOL_STRENGTH, UCOL_PRIMARY, &status))
    WITH_CHECK_STATUS(ucol_setAttribute(col, UCOL_CASE_LEVEL, UCOL_ON, &status))

    UnicodeString _needle = UnicodeString::fromUTF8(needle);
    UnicodeString haystack = UnicodeString::fromUTF8("whatever");

    /* C API search object; its text is replaced on every iteration below. */
    WITH_CHECK_STATUS(matcher = usearch_openFromCollator(
        _needle.getBuffer(),
        _needle.length(),
        haystack.getBuffer(),
        haystack.length(),
        col, NULL, &status
    ))

    for (size_t i=0; i<sizeof(haystacks)/sizeof(haystacks[0]); ++i) {
        printf("%s vs. %s: ", needle, haystacks[i]);
        haystack = UnicodeString::fromUTF8(haystacks[i]);

        /* 1. C API string search */
        WITH_CHECK_STATUS(
            usearch_setText(matcher, haystack.getBuffer(), haystack.length(), &status)
        )
        usearch_reset(matcher);
        WITH_CHECK_STATUS(v = ((int)USEARCH_DONE!=usearch_first(matcher, &status)))
        printf(" usearch=%d", v);

        /* 2. whole-string comparison with the same collator */
        WITH_CHECK_STATUS(v = (int)UCOL_EQUAL==ucol_strcollUTF8(col,
            haystacks[i], -1, needle, -1, &status))
        printf(" ucol=%d", v);

        /* 3. C++ API string search */
        RuleBasedCollator* rbc = RuleBasedCollator::rbcFromUCollator(col);
        WITH_CHECK_STATUS(StringSearch matcher2(needle, haystack, rbc, NULL, status))
        WITH_CHECK_STATUS(v = ((int)USEARCH_DONE!=matcher2.first(status)))
        printf(" StringSearch=%d", v);
        printf("\n");
    }

    /* Close the search object before the collator it refers to. */
    usearch_close(matcher);
    ucol_close(col);
    return 0;
}
yields:
|
All right, it turns out that this issue has already been reported. It is ICU-related. |
Thank you very much for taking the time to work this out. I followed your posts in this thread and tried to find the issue on ICU's bug page, but couldn't find it. |
According to the Unicode Technical Standard #35, when the case level is set to On and the strength is set to primary, ICU should ignore accents but take case into account. This is not the behavior I find in the stringi package. To achieve that, I have to set locale = "", which is not what I expected. Examples:
This is ok:
This should return FALSE:
In order to achieve that, I need to set locale to "" in stringi:
Here I respect case but don't add an accent, and get the expected result: