Apply NFKC normalisation #1
Comments
Thanks for this, I'll have to look into it. I'll leave it open until I fix it |
https://stackoverflow.com/questions/5258623/remove-special-characters-from-string I think this method: >>> unicodedata.normalize('NFKD', source).encode('ascii', 'ignore') is the simplest and most correct one here. In fact, I think you could just compare the text with the encoded/cleaned version and it would be OK. |
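For reference, a minimal sketch of the approach suggested in the comment above, assuming the input is already a Python 3 `str`. The helper names `to_ascii` and `has_non_ascii_residue` are made up for illustration and are not part of this project.

```python
import unicodedata


def to_ascii(source: str) -> str:
    """Decompose with NFKD, then silently drop every non-ASCII code point."""
    decomposed = unicodedata.normalize("NFKD", source)
    # errors="ignore" discards anything ASCII cannot represent,
    # including the combining marks that NFKD split off.
    return decomposed.encode("ascii", "ignore").decode("ascii")


def has_non_ascii_residue(source: str) -> bool:
    """True if cleaning changes the text, i.e. something non-ASCII was present."""
    return to_ascii(source) != source


sample = "na\u00efve \ufb01le"  # precomposed i-diaeresis and the "fi" ligature
print(to_ascii(sample))               # naive file
print(has_non_ascii_residue(sample))  # True
```

Note that characters with no ASCII decomposition (for example Cyrillic letters) are dropped entirely rather than replaced, so the cleaned text can lose content.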
Why would you re-encode as ASCII? |
To strip out all non-ASCII chars, just to make sure there is nothing at all that could be used to fingerprint the text. |
With ASCII you can still fingerprint on:
And on a bunch of things that are probably out of scope
Remember the attacker only needs about log2(number of people with access) bits of identifying changes to survive any sanitisation and conversion. |
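To put a number on the log2 estimate above, here is a small sketch. It assumes each surviving mark carries exactly one bit (present or absent) and that every recipient gets a unique combination; `bits_needed` is a hypothetical helper name.

```python
def bits_needed(people_with_access: int) -> int:
    """Smallest number of binary marks that can distinguish this many recipients.

    Equivalent to ceil(log2(n)), computed with integers to avoid float rounding.
    """
    if people_with_access < 1:
        raise ValueError("need at least one person with access")
    return (people_with_access - 1).bit_length()


print(bits_needed(1000))  # 10
print(bits_needed(16))    # 4
```

So with 1000 people holding a copy, ten independent yes/no differences that survive cleaning would be enough to single out one of them.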
The number of spaces is easy to spot and also easy to fix, e.g. collapse all runs of spaces to a single one. Typos could be dealt with, but I agree it is hard to do that automatically. I think it's about lowering the probability, not removing the possibility of such an attack altogether. |
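A rough sketch of the space-collapsing idea from the comment above. It only handles runs of ordinary spaces and tabs within a line and leaves line breaks alone; `collapse_spaces` is an illustrative name, not something from this project.

```python
import re


def collapse_spaces(text: str) -> str:
    """Replace each run of spaces and/or tabs with a single space."""
    return re.sub(r"[ \t]+", " ", text)


print(collapse_spaces("two  spaces   or\ta tab"))  # two spaces or a tab
```

Other Unicode space characters are not matched by this pattern, so it would need to run after whatever normalisation step is chosen.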
Reading through this:
Thanks for the feedback, wanted to say I appreciate it. I'll try to get around to implementing things within the next few days. And of course, feel free to submit a pull request! |
Otherwise I can fingerprint on diacritic form, ligatures, etc.
I don't know if it also removes the homoglyphs. Might want to look into that.
NFKC does change the appearance of the text a bit if you're using display variants, e.g. black-letter h vs. Latin h, but NFC normalisation permits too many fingerprinting options.
http://unicode.org/reports/tr15/#Canon_Compat_Equivalence
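A quick way to see the difference between the two forms, using only the standard library. The sample characters are my own picks for illustration; `forms` is a hypothetical helper.

```python
import unicodedata


def forms(text: str) -> tuple[str, str]:
    """Return the (NFC, NFKC) normalisations of text."""
    return (
        unicodedata.normalize("NFC", text),
        unicodedata.normalize("NFKC", text),
    )


samples = {
    "fi ligature": "\ufb01",
    "black-letter capital H": "\u210c",
    "e + combining acute": "e\u0301",
    "Cyrillic a (homoglyph of Latin a)": "\u0430",
}

for label, text in samples.items():
    nfc, nfkc = forms(text)
    # ascii() shows escapes, so visually identical results stay distinguishable
    print(f"{label}: NFC={ascii(nfc)} NFKC={ascii(nfkc)}")
```

NFC leaves the ligature and the black-letter H alone, while NFKC folds them to plain "fi" and "H"; both forms compose the accented e the same way. Neither form touches the Cyrillic letter, so cross-script homoglyphs would need separate handling.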