Wikipedia extraction seems to be giving bigrams #3

@MikeHopcroft

Repro:

BitFunnel: 9e9e96ecb32841c53edc4542813ed1531fd4c4a9
Workbench: 580b74b

StatisticsBuilder c:\git\Wikipedia\Manifest100.txt c:\temp\wiki\out100 -statistics -text

The output shouldn't have bigrams and shouldn't have capital letters:

Bigram where none expected (also capital letter):
72a2c4b53c781027,1,1,0.000144196,zephyrinus
bd01f0b68e57b2a7,1,1,0.000144196,sveshtari
3fad0c4faf3cb52b,1,0,0.000144196,Algebraic geometry
50c9029d9d3c5378,1,1,0.000144196,darabont
a2f5153a7612c5d0,1,1,0.000144196,up─üsik─ü

3ca7b8a975b95d4d,1,1,0.000144196,crisplock

Capital letter
49fc77672b6b54c4,1,0,0.000144196,Alexander Graham Bell

7d8b10a0a2b9f455,1,0,0.000144196,Evolutionarily stable strategy

Random garbage
b651bc4fddcd84af,1,1,0.000144196,86p
6cc733ca24bc18e,1,1,0.000144196,ಹರಿವೆ
5847567b67dc03cb,1,1,0.000144196,xis

a31bc33fc17f3fc6,1,1,0.000144196,लाख

b3d2c5a33dd1efc6,1,1,0.000144196,k├╢nigsberger
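
To make the symptoms above easier to spot across a whole output file, here is a rough sketch of a checker. It assumes each statistics line has five comma-separated fields with the term text last; that layout is inferred from the samples in this issue, not taken from the BitFunnel source. `classify` is a hypothetical helper name, not part of StatisticsBuilder.

```python
def classify(line):
    """Return a list of problems found in one statistics line.

    Assumption: five comma-separated fields, with the term text as the
    last field (inferred from the sample lines in this issue).
    """
    # Split on at most four commas so a term containing commas stays whole.
    fields = line.rstrip("\n").split(",", 4)
    if len(fields) != 5:
        return ["malformed"]
    text = fields[4]
    problems = []
    # Whitespace inside the term suggests a bigram or longer phrase.
    if any(ch.isspace() for ch in text):
        problems.append("multi-word")
    # Any uppercase character means the term was not lowercased.
    if any(ch.isupper() for ch in text):
        problems.append("uppercase")
    return problems


sample = [
    "72a2c4b53c781027,1,1,0.000144196,zephyrinus",
    "3fad0c4faf3cb52b,1,0,0.000144196,Algebraic geometry",
    "49fc77672b6b54c4,1,0,0.000144196,Alexander Graham Bell",
]
for line in sample:
    print(classify(line), line)
```

Running this over the real output file, one line at a time, would list every term that is multi-word or contains capitals. It does not try to detect the "random garbage" entries, since those need a judgment call about which scripts and tokens are expected.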
