Tokenizing bug. Some tokens are split into 2 #27
Comments
@florian-pe thanks for this. Out of curiosity, do you have the same issue if you change the length of |
@oalders You are right, it does fix the problem. |
@florian-pe what happens if you set
my $p = myparser->new;
$p->unbroken_text(1);
$p->parse_file(shift // exit); |
@oalders Yes it fixes the bug.
I have read the man page at |
Yes @florian-pe, I think for your use case you want this option enabled in the parser. So, it does not appear to me to be a bug. |
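For reference, the effect of the option discussed above can be sketched as follows. This is a minimal illustration, not code from the thread: the markup, the split point inside "Darwin", and the helper name text_events are assumptions made for the example, and the accent is dropped from "épaules" to keep encoding out of the picture. It feeds the same input to HTML::Parser in two pieces, once with unbroken_text off and once with it on, and collects every text event.

```perl
use strict;
use warnings;
use HTML::Parser ();

# Hypothetical helper: parse the same markup in two pieces and return
# every text event the parser reports.
sub text_events {
    my ($unbroken) = @_;
    my @events;
    my $p = HTML::Parser->new(
        api_version => 3,
        text_h      => [ sub { push @events, $_[0] }, 'dtext' ],
    );
    $p->unbroken_text($unbroken);

    # The split point is arbitrary; it only simulates a chunk boundary.
    $p->parse('<p>Sur les epaules de Dar');
    $p->parse('win</p>');
    $p->eof;
    return @events;
}

my @default  = text_events(0);
my @unbroken = text_events(1);

print scalar(@default),  " event(s) by default: ",
    join(' | ', @default), "\n";
print scalar(@unbroken), " event(s) with unbroken_text: ",
    join(' | ', @unbroken), "\n";
```

With unbroken_text enabled, the text between <p> and </p> should arrive as a single event. In the default mode the parser is free to report it in several pieces, which is the behaviour being debated here.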
@oalders I don't understand how it's not a bug if the problem originates from the subroutine |
I didn't write the code, but just for some history, this sub entered the codebase in 1996 with a chunk size of 2048: aeb6d0ba14e680e6#diff-abe42eabebfc8528859aa468da65d562ea1c37c368905ddc25d8b10ad1f801b0R298 Not sure how relevant that is, but it's a fun fact! I had a closer look at the docs for |
Alright, here's a simple example.
We can use this one-liner to find the number of "a" characters needed for the token "splitted token" to be split.
Then if we run
If we count the number of characters with this
we see that the 512th character happens to be the last "n" of "splitted token". I redid that same little experiment, but removing
I hope that helps, and that it convinces you that it is indeed a bug. |
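The arithmetic behind the 512th-character observation can be sketched in plain Perl. This is not the one-liner from the comment above. The document layout (a run of "a" followed by a <p> element), the variable names, and the 512-byte read size are assumptions inferred from the thread. The sketch pads the input so that the last "n" of "splitted token" is character 512, then reads the document back in fixed-size chunks to show where the boundary falls.

```perl
use strict;
use warnings;

my $chunk_size = 512;                    # assumed read size, inferred from the thread
my $prefix     = "\n<p>splitted token";  # hypothetical layout around the token
my $pad        = $chunk_size - length $prefix;
my $doc        = ('a' x $pad) . $prefix . "</p>\n";

# Read the document back in fixed-size chunks through an in-memory filehandle.
open my $fh, '<', \$doc or die "cannot open in-memory file: $!";
my @chunks;
my $chunk = '';
while (read($fh, $chunk, $chunk_size)) {
    push @chunks, $chunk;
}
close $fh;

printf "padding: %d, chunks: %d\n", $pad, scalar @chunks;
printf "character %d: '%s'\n", $chunk_size, substr($doc, $chunk_size - 1, 1);
printf "first chunk ends with: '%s'\n", substr($chunks[0], -14);
```

Under these assumptions the first chunk ends exactly on the token, so a parser that receives the chunks one at a time cannot yet tell whether the last word continues in the next chunk.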
Thanks for this @florian-pe. That does look like a bug to me. Are you motivated to fix this? |
That's a nice challenge. I tried adding a print statement and it compiles OK. The problem is that, when I try to use the hand-compiled module in a script, even with
So I cannot even begin to poke around the C code. |
Ok never mind. |
This problem happens on a particular webpage
https://www.radiofrance.fr/franceinter/podcasts
This is my golfed script which shows the bug
Unfortunately, I can't post a golfed HTML snippet because when I try to reduce the size of the webpage, the bug disappears. So I will have to explain the exact steps I took to reproduce the bug.
In Chromium, go to https://www.radiofrance.fr/franceinter/podcasts.
Then load the entire webpage by scrolling to the bottom and clicking on "VOIR PLUS DE PODCASTS" repeatedly until everything is loaded.
Then save the webpage.
After that you just have to run the example script with the downloaded page as argument.
The script prints all the text which is outside of any tag. Like this
/tag>TEXT HERE<othertag
THE BUG
The bug is that some "text elements" are split in two.
This happens for several podcast names.
"Sur les épaules de Darwin"
is one of those. You can see that the script will output
instead of just
"Sur les épaules de Darwin"
This also happens to
"Sur Les routes de la musique"
(just below) and a few others.
Now, I found that when deleting
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
just at the top of <head></head>, the bug disappears. The bug also disappears when deleting just "; charset=UTF-8".
The problem is that the bug also disappears when I leave the charset as is and delete a bunch of the stuff inside <head></head>, or delete a lot of the divs corresponding to the other podcast entries of the index.
This is all the information that I have.