Tokenizing bug. Some tokens are split into 2

This problem happens on a particular webpage
https://www.radiofrance.fr/franceinter/podcasts

This is my golfed script, which shows the bug:
```
#!/usr/bin/perl
package myparser;
use strict;
use warnings;
use v5.10;
use base qw(HTML::Parser);

# Called by HTML::Parser for each piece of text found between tags.
sub text {
    my ($self, $text, $is_cdata) = @_;
    say "\"$text\"";
}

package main;
use strict;
use warnings;

my $p = myparser->new;
# Parse the file given as the first argument; exit silently if there is none.
$p->parse_file(shift // exit);
```
Unfortunately, I can't post a golfed HTML snippet, because the bug disappears when I try to reduce the size of the webpage. So I will have to explain the exact steps I took to reproduce the bug.

In Chromium, go to [https://www.radiofrance.fr/franceinter/podcasts](https://www.radiofrance.fr/franceinter/podcasts).
Then load the entire webpage by scrolling to the bottom and clicking on "VOIR PLUS DE PODCASTS" repeatedly until everything is loaded.
Then save the webpage.

After that you just have to execute the script example with the downloaded page as argument.

The script prints all the text that is outside of any tag, like this: `/tag>TEXT HERE<othertag`

**THE BUG**
The bug is that some "text elements" are split in 2.
This happens for several podcast names. `"Sur les épaules de Darwin"` is one of those.

You can see that the script will output
```
"Sur les épaules de"
" Darwin"
```
instead of just `"Sur les épaules de Darwin"`
This also happens to `"Sur Les routes de la musique"` (just below) and a few others.
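
A small self-contained sketch that may help narrow this down. It assumes (not verified against the full page) that the split comes from the parser receiving its input in several chunks and reporting text as soon as it can. The markup, the cut position and the `@texts` array are made up for illustration; `unbroken_text` is the documented HTML::Parser option that asks for each text run to be reported in one piece.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use v5.10;
use HTML::Parser;

my @texts;    # illustrative: collects every text event

my $p = HTML::Parser->new(
    api_version   => 3,
    unbroken_text => 1,    # report each run of text as a single event
    text_h        => [ sub { push @texts, shift }, "text" ],
);

# Feed the markup in two pieces, cutting inside a word, to mimic
# a chunk boundary landing in the middle of a podcast name.
$p->parse('<a>Sur Les routes de la mus');
$p->parse('ique</a>');
$p->eof;

say qq("$_") for @texts;
```

If the assumption holds, calling `$p->unbroken_text(1);` before `parse_file` in the script above would be the equivalent change for the subclass version.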

Now, I found that when I delete `<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">`, just at the top of `<head></head>`, the bug disappears. It also disappears when I delete only `; charset=UTF-8`.

The problem is that the bug also disappears when I leave the charset as is and delete a bunch of the stuff inside `<head></head>`, or delete a lot of the `div`s corresponding to the other podcast entries of the index.
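
One way to check whether these observations are really about byte offsets rather than about the charset itself is sketched below. It assumes that `parse_file` reads the file in fixed-size blocks and that the block size is 512 bytes; both are assumptions to be checked against the installed HTML::Parser. `straddles_boundary` is a hypothetical helper, not part of any library. The script reports every byte offset where a given string occurs and whether a block boundary falls inside it.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use v5.10;

# Hypothetical helper: true if the first and last byte of a match
# starting at $offset fall into different blocks of size $block.
sub straddles_boundary {
    my ($offset, $length, $block) = @_;
    return int($offset / $block) != int(($offset + $length - 1) / $block);
}

my $block = 512;    # ASSUMPTION: read size used by parse_file

my ($file, $needle) = @ARGV;
if (defined $file && defined $needle) {
    # Read raw bytes so that offsets match what the parser would see.
    open my $fh, '<:raw', $file or die "$file: $!";
    my $html = do { local $/; <$fh> };
    close $fh;

    my $pos = 0;
    while (($pos = index($html, $needle, $pos)) >= 0) {
        my $verdict = straddles_boundary($pos, length $needle, $block)
            ? "crosses a block boundary"
            : "inside one block";
        say "$pos: $verdict";
        $pos++;
    }
}
```

Run it with the saved page and a word from an affected title, for example `Darwin`. If the affected occurrences are the ones that cross a boundary, then deleting bytes anywhere earlier in the file would shift the boundaries, which would match what I see.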

This is all the information that I have.
