
Apply NFKC normalisation #1

Open
cmcaine opened this issue Jan 1, 2018 · 7 comments
cmcaine commented Jan 1, 2018

Otherwise I can fingerprint on diacritic form, ligatures, etc.

I don't know if it also removes the homoglyphs. Might want to look into that.

NFKC does change the appearance of the text a bit if you're using display variants, e.g. blackletter h vs. Latin h, but NFC normalisation permits too many fingerprinting options.

http://unicode.org/reports/tr15/#Canon_Compat_Equivalence
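The distinction raised above can be checked directly in Python: NFKC folds compatibility characters (ligatures, circled digits, mathematical letter variants) into their plain forms, but it does not touch cross-script homoglyphs, which are canonically distinct characters. A small sketch:

```python
import unicodedata

# Compatibility characters are folded by NFKC:
assert unicodedata.normalize("NFKC", "ﬁ") == "fi"   # LATIN SMALL LIGATURE FI
assert unicodedata.normalize("NFKC", "①") == "1"    # CIRCLED DIGIT ONE
assert unicodedata.normalize("NFKC", "𝕙") == "h"    # MATHEMATICAL DOUBLE-STRUCK SMALL H

# Homoglyphs are not: Cyrillic а (U+0430) has no decomposition,
# so NFKC leaves it distinct from Latin a.
assert unicodedata.normalize("NFKC", "\u0430") != "a"
```

So homoglyphs would indeed need separate handling on top of NFKC.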

@DavidJacobson
Owner

Thanks for this, I'll have to look into it. I'll leave it open until I fix it.

@Visgean

Visgean commented Jan 2, 2018

https://stackoverflow.com/questions/5258623/remove-special-characters-from-string

I think this method:

>>> unicodedata.normalize('NFKD', source).encode('ascii', 'ignore')

is the simplest and most correct approach here; in fact I think you could just compare the text with the encoded/cleaned version and it would be ok.
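One wrinkle with the snippet above: `.encode('ascii', 'ignore')` returns `bytes`, so comparing against the original string needs a `.decode()` as well. A minimal sketch of the suggested check (the helper name is illustrative, not code from this project):

```python
import unicodedata

def to_ascii(source: str) -> str:
    # NFKD splits characters into base letter + combining marks;
    # encoding to ASCII with errors="ignore" then drops everything
    # non-ASCII (the combining marks included).
    return (
        unicodedata.normalize("NFKD", source)
        .encode("ascii", "ignore")
        .decode("ascii")
    )

assert to_ascii("café") == "cafe"

# The proposed comparison: flag text that changes under cleaning.
suspicious = to_ascii("café") != "café"
assert suspicious
```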

@cmcaine
Author

cmcaine commented Jan 2, 2018

Why would you re-encode as ASCII?

@Visgean

Visgean commented Jan 2, 2018

To strip off all non-ASCII chars, just to make sure there is nothing at all that could be used to fingerprint the text.

@cmcaine
Author

cmcaine commented Jan 2, 2018

With ASCII you can still fingerprint on:

  • Number of whitespace characters
  • Extra/changed characters hidden as typos and/or wrong punctuation (unicode just expands this option)

And on a bunch of things that are probably out of scope

  • Exact numbers used
  • Rephrasings
  • Restructuring (moving sections, paragraphs, etc around)

Remember the attacker only needs about log2(number of people with access) bits of identifying changes to survive any sanitisation and conversion.

@Visgean

Visgean commented Jan 2, 2018

Number of spaces is easy to spot and also easy to fix, e.g. by collapsing all runs of spaces to a single one. Typos could be dealt with, but I agree it is hard to do automatically.

I think it's about lowering the probability of such an attack, not removing the possibility altogether.
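The space-collapsing fix mentioned above could be as simple as a regex pass (this is a sketch of the idea, not code from this project):

```python
import re

def collapse_whitespace(text: str) -> str:
    # Replace any run of whitespace (spaces, tabs, newlines) with a
    # single space, removing one easy fingerprinting channel.
    return re.sub(r"\s+", " ", text).strip()

assert collapse_whitespace("two  spaces\tand a tab") == "two spaces and a tab"
```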

@DavidJacobson
Owner

Reading through this:

  1. I'll add the normalization; it looks pretty useful.
  2. As for the comments about re-encoding as ASCII: I'm going to agree with @Visgean in that we want to remove anything non-ASCII. This would be a concern if the tool were to be used with other languages, but really I'm centering it around the Latin character set.
  3. @cmcaine You raise valid points; it's just easier to clean once all the "questionable" characters have been removed. And regarding your last 3 bullet points, you are entirely correct. However, I'm trying to address the issue of fingerprinting in text - not fingerprinting through language patterns/word choice. In the future, I may try to add something that swaps out words with synonyms, but that's down the road.

Thanks for the feedback, wanted to say I appreciate it. I'll try to get around to implementing things within the next few days. And of course, feel free to submit a pull request!
