
Apply NFKC normalisation #1

Open
cmcaine opened this issue Jan 1, 2018 · 7 comments
cmcaine commented Jan 1, 2018

Otherwise I can fingerprint on diacritic form, ligatures, etc.

I don't know if it also removes the homoglyphs. Might want to look into that.

NFKC does change the appearance of the text a bit if you're using display variants, e.g. blackletter h vs. Latin h, but NFC normalisation permits too many fingerprinting options.

http://unicode.org/reports/tr15/#Canon_Compat_Equivalence
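The distinction raised above can be checked directly in Python: NFKC folds compatibility characters (ligatures, circled digits, mathematical letter variants) into their plain forms, but it does not touch cross-script homoglyphs, which are canonically distinct characters. A small sketch:

```python
import unicodedata

# Compatibility characters are folded by NFKC:
assert unicodedata.normalize("NFKC", "ﬁ") == "fi"   # LATIN SMALL LIGATURE FI
assert unicodedata.normalize("NFKC", "①") == "1"    # CIRCLED DIGIT ONE
assert unicodedata.normalize("NFKC", "𝕙") == "h"    # MATHEMATICAL DOUBLE-STRUCK SMALL H

# Homoglyphs are not: Cyrillic а (U+0430) has no decomposition,
# so NFKC leaves it distinct from Latin a.
assert unicodedata.normalize("NFKC", "\u0430") != "a"
```

So homoglyphs would indeed need separate handling on top of NFKC.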

@DavidJacobson
Owner

Thanks for this, I'll have to look into it. I'll leave it open until I fix it.

@Visgean

Visgean commented Jan 2, 2018

https://stackoverflow.com/questions/5258623/remove-special-characters-from-string

I think this method:

>>> unicodedata.normalize('NFKD', source).encode('ascii', 'ignore')

is the simplest and most correct approach here; in fact I think you could just compare the text with the encoded/cleaned version and it would be ok.
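One wrinkle with the snippet above: `.encode('ascii', 'ignore')` returns `bytes`, so comparing against the original string needs a `.decode()` as well. A minimal sketch of the suggested check (the helper name is illustrative, not code from this project):

```python
import unicodedata

def to_ascii(source: str) -> str:
    # NFKD splits characters into base letter + combining marks;
    # encoding to ASCII with errors="ignore" then drops everything
    # non-ASCII (the combining marks included).
    return (
        unicodedata.normalize("NFKD", source)
        .encode("ascii", "ignore")
        .decode("ascii")
    )

assert to_ascii("café") == "cafe"

# The proposed comparison: flag text that changes under cleaning.
suspicious = to_ascii("café") != "café"
assert suspicious
```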

@cmcaine
Author

cmcaine commented Jan 2, 2018

Why would you re-encode as ASCII?

@Visgean

Visgean commented Jan 2, 2018

To strip off all non-ASCII chars, just to make sure there is nothing at all that could be used to fingerprint the text.

@cmcaine
Author

cmcaine commented Jan 2, 2018

With ASCII you can still fingerprint on:

  • Number of whitespace characters
  • Extra/changed characters hidden as typos and/or wrong punctuation (unicode just expands this option)

And on a bunch of things that are probably out of scope

  • Exact numbers used
  • Rephrasings
  • Restructuring (moving sections, paragraphs, etc around)

Remember the attacker only needs about log2(number of people with access) bits of identifying changes to survive any sanitisation and conversion.

@Visgean

Visgean commented Jan 2, 2018

Number of spaces is easy to spot and also easy to fix, e.g. by collapsing all runs of spaces to a single one. Typos could be dealt with, but I agree it is hard to do automatically.

I think it's about lowering the probability of such an attack, not removing the possibility altogether.
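The space-collapsing fix mentioned above could be as simple as a regex pass (this is a sketch of the idea, not code from this project):

```python
import re

def collapse_whitespace(text: str) -> str:
    # Replace any run of whitespace (spaces, tabs, newlines) with a
    # single space, removing one easy fingerprinting channel.
    return re.sub(r"\s+", " ", text).strip()

assert collapse_whitespace("two  spaces\tand a tab") == "two spaces and a tab"
```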

@DavidJacobson
Owner

Reading through this:

  1. I'll add the normalization; it looks pretty useful.
  2. As for the comments about re-encoding as ASCII: I'm going to agree with @Visgean in that we want to remove anything non-ASCII. This would be a concern if the tool were to be used with other languages, but really I'm centering it around the Latin character set.
  3. @cmcaine You raise valid points; it's just easier to clean once all the "questionable" characters have been removed. And regarding your last 3 bullet points, you are entirely correct. However, I'm trying to address the issue of fingerprinting in text - not fingerprinting through language patterns/word choice. In the future, I may try to add something that swaps out words with synonyms, but that's down the road.

Thanks for the feedback, wanted to say I appreciate it. I'll try to get around to implementing things within the next few days. And of course, feel free to submit a pull request!
