Tokenizing bug. Some tokens are split into 2 #27

Open
@florian-pe

This problem happens on a particular webpage
https://www.radiofrance.fr/franceinter/podcasts

This is my golfed script that shows the bug:

#!/usr/bin/perl
package myparser;
use strict;
use warnings;
use v5.10;
use base qw(HTML::Parser);

sub text {
    my ($self, $text, $is_cdata) = @_;
    say "\"$text\"";
}

package main;
use strict;
use warnings;

my $p = myparser->new;
$p->parse_file(shift // exit);

Unfortunately, I can't post a golfed HTML snippet, because the bug disappears when I try to reduce the size of the webpage. So I will have to explain the exact steps I took to reproduce it.

In Chromium, go to https://www.radiofrance.fr/franceinter/podcasts.
Then load the entire webpage by scrolling to the bottom and clicking on "VOIR PLUS DE PODCASTS" repeatedly until everything is loaded.
Then save the webpage.

After that, you just have to run the example script with the saved page as its argument.

The script prints every piece of text that sits outside of any tag, like this: /tag>TEXT HERE<othertag

THE BUG
The bug is that some "text elements" are split in two.
This happens for several podcast names. "Sur les épaules de Darwin" is one of those.

You can see that the script will output

"Sur les épaules de"
" Darwin"

instead of just "Sur les épaules de Darwin"
This also happens to "Sur Les routes de la musique" (just below) and a few others.

Now, I found that when I delete <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">, at the very top of <head></head>, the bug disappears. It also disappears when I delete just the ; charset=UTF-8 part.

The problem is that the bug also disappears when I leave the charset as is and instead delete a bunch of the stuff inside <head></head>, or delete a lot of the divs corresponding to the other podcast entries of the index.

This is all the information that I have.
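
A possible workaround, sketched under an assumption that is not confirmed anywhere in this report: that the split happens because HTML::Parser hands text to the text handler as soon as it has it, and the input arrives in chunks. That would fit the fact that the bug comes and goes when the size of the page changes. HTML::Parser documents an unbroken_text option that makes it report each block of text in one piece. The collect_text helper and the two hand-made chunks below are hypothetical, made up only to imitate a read boundary falling in the middle of a podcast name.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use v5.10;
use HTML::Parser;

# Hypothetical helper (not part of HTML::Parser): feed the given chunks
# to a fresh parser and return every string passed to the text handler.
sub collect_text {
    my ($unbroken, @chunks) = @_;
    my @seen;
    my $p = HTML::Parser->new(
        api_version => 3,
        text_h      => [ sub { push @seen, shift }, 'text' ],
    );
    # Documented option: report each block of text in one piece.
    $p->unbroken_text($unbroken);
    $p->parse($_) for @chunks;   # imitate input arriving in pieces
    $p->eof;
    return @seen;
}

# Assumed chunk boundary in the middle of a podcast name.
my @chunks = ('<p>Sur Les routes de', ' la musique</p>');
say "\"$_\"" for collect_text(1, @chunks);
```

In the script at the top, the equivalent change would be a single line, $p->unbroken_text(1);, right after myparser->new. I have not checked this against the saved page.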
