Description
This problem occurs on one particular webpage:
https://www.radiofrance.fr/franceinter/podcasts
Here is my golfed script that shows the bug:
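    #!/usr/bin/perl
    package myparser;
    use strict;
    use warnings;
    use v5.10;
    use base qw(HTML::Parser);

    # Called by HTML::Parser for each piece of text found between tags.
    sub text {
        my ($self, $text, $is_cdata) = @_;
        say "\"$text\"";
    }

    package main;
    use strict;
    use warnings;

    # Parse the file given as first argument; exit quietly if none is given.
    my $p = myparser->new;
    $p->parse_file(shift // exit);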
Unfortunately, I can't post a golfed HTML snippet, because whenever I try to reduce the size of the webpage the bug disappears. So I will have to explain the exact steps I took to reproduce it.
In Chromium, go to https://www.radiofrance.fr/franceinter/podcasts.
Then load the entire webpage by scrolling to the bottom and clicking "VOIR PLUS DE PODCASTS" repeatedly until everything is loaded.
Then save the webpage.
After that, run the example script with the downloaded page as its argument.
The script prints all text that lies outside any tag, like this: /tag>TEXT HERE<othertag
THE BUG
The bug is that some "text elements" are split in two.
This happens for several podcast names. "Sur les épaules de Darwin" is one of them.
You can see that the script outputs:
"Sur les épaules de"
" Darwin"
instead of just "Sur les épaules de Darwin"
This also happens to "Sur Les routes de la musique" (just below) and a few others.
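One possible explanation, which I have not confirmed on this page: the HTML::Parser documentation says that a run of text may be delivered in several text events unless unbroken_text is enabled. Below is a minimal sketch to test that assumption. The two-chunk input is made up to mimic a chunk boundary falling inside the title; it is not taken from the saved page.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use v5.10;
use HTML::Parser;

# Collect every text event so the number of pieces can be inspected.
my @texts;
my $p = HTML::Parser->new(
    api_version => 3,
    text_h      => [ sub { push @texts, shift }, 'text' ],
);

# Ask the parser to buffer text until the next non-text event.
$p->unbroken_text(1);

# Hypothetical input: the title is fed in two pieces on purpose.
$p->parse('<p>Sur les épaules de');
$p->parse(' Darwin</p>');
$p->eof;

say "\"$_\"" for @texts;
```

In the golfed script above, the equivalent experiment would be to call $p->unbroken_text(1) before $p->parse_file(...). If the split titles then come out whole, the splitting is probably chunked text delivery rather than something specific to this page.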
Now, I found that when I delete <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">, right at the top of <head></head>, the bug disappears. It also disappears when I delete just ; charset=UTF-8
The problem is that the bug also disappears when I leave the charset as is and delete a lot of the content inside <head></head>, or delete many of the divs corresponding to the other podcast entries in the index.
This is all the information that I have.