Problem with New York Times stories #76
I looked at the HTML for that page. Unfortunately, the NYTimes markup does a lot that makes it hard to grab the content in a generic way. Here's a sample of the HTML:
<h2 class="interactive-headline">
The U.S. Is the Biggest Carbon Polluter in History. It Just Walked Away From the Paris Climate Deal.
</h2>
<p class="interactive-summary">
The United States has emitted more planet-warming carbon dioxide into the atmosphere than any other country. Now it is walking back a promise to lower emissions.
</p>
</figcaption>
<div class="interactive-image-container">
<div class="interactive-image">
<img src="https://static01.nyt.com/images/2017/05/27/climate/the-us-led-the-world-in-carbon-emissions-its-falling-behind-on-solutions-1495840402762/the-us-led-the-world-in-carbon-emissions-its-falling-behind-on-solutions-1495840402762-master495-v3.jpg" />
</div>
<div class="interactive-overlay">
<i class="icon sprite-icon interactive-overlay-icon"></i>
</div>
</div>
</a>
</figure>
<p class="story-body-text story-content" data-para-count="230" data-total-count="6521">On Twitter, Miguel Arias Cañete, the European Union’s commissioner for climate, said that “today’s announcement has galvanized us rather than weakened us, and this vacuum will be filled by new broad committed leadership.”</p>
<div id="story-ad-3" class="story-ad ad ad-placeholder nocontent robots-nocontent ad-aggro_4-4-8 ad-aggro_4-5-7">
<div class="accessibility-ad-header visually-hidden">
<p>Advertisement</p>
</div>
<a class="visually-hidden skip-to-text-link" href="#story-continues-12">Continue reading the main story</a>
</div>
They are breaking up the text into such small bits. For this case, I'd recommend writing some custom code to parse NYTimes pages. Here's some simple Node.js code that will grab the contents of the story:
// Load a saved copy of the article page
const fs = require('fs');
const cheerio = require('cheerio');
const html = fs.readFileSync('story.html').toString();
const $ = cheerio.load(html);
// Select every body paragraph of the story
const textTags = $('.story-body-text');
// .text() returns the combined text of all matched elements
const storyText = textTags.text();
console.log(storyText);
When I run that, I get:
I came here to report this as well. This only extracts about 200 words of this article.
The text field is empty when running unfluff on the HTML from a New York Times story. For example, if I request a story from nytimes.com in the node console and then pass the page HTML to unfluff, the returned text field is empty:
Result:
{ title: 'Trump Will Withdraw U.S. From Paris Climate Agreement',
  softTitle: 'Trump Will Withdraw U.S. From Paris Climate Agreement',
  date: '2017-06-01T14:48:08-04:00',
  author: [ 'Michael D. Shear', 'https://www.nytimes.com/by/michael-d-shear' ],
  publisher: undefined,
  copyright: '2017 The New York Times Company',
  favicon: 'https://static01.nyt.com/favicon.ico',
  description: 'The withdrawal process could take four years to complete, meaning a final decision would be up to the American voters in the next presidential election.',
  keywords: 'United Nations Framework Convention on Climate Change,Trump Donald J,United States Politics and Government,Global Warming',
  lang: 'en',
  canonicalLink: 'https://www.nytimes.com/2017/06/01/climate/trump-paris-climate-agreement.html',
  tags: [],
  image: 'https://static01.nyt.com/images/2017/06/02/us/02climatesub-alpha1/02climatesub-alpha1-facebookJumbo.jpg',
  videos: [],
  links: [],
  text: '' }
I've tried a couple of different Times URLs and confirmed that the request method is indeed passing the correct page HTML to the callback.
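Since the metadata fields come back fine and only `text` is empty (or, as another commenter reports, truncated), one possible workaround is to keep the generic result and swap in a site-specific extractor only when the text looks incomplete. The following is a rough sketch, not part of unfluff: `extractWithFallback`, `countWords`, the `usedFallback` flag, and the `minWords` threshold of 250 are all hypothetical names and values chosen for illustration. The two extractors are passed in as functions so the helper makes no assumptions about how either one is implemented.

```javascript
// Hypothetical helper, not part of unfluff.
// genericExtract: (html) => object with a `text` field (e.g. a wrapper around unfluff)
// siteExtract:    (html) => string from a site-specific extractor
// minWords:       assumed threshold below which the generic text counts as incomplete

function countWords(text) {
  const trimmed = text.trim();
  return trimmed === '' ? 0 : trimmed.split(/\s+/).length;
}

function extractWithFallback(html, genericExtract, siteExtract, minWords = 250) {
  const result = genericExtract(html) || {};
  const genericText = typeof result.text === 'string' ? result.text : '';

  // The generic extractor did well enough: return its result untouched.
  if (countWords(genericText) >= minWords) {
    return result;
  }

  // Otherwise try the site-specific extractor and keep whichever text is longer,
  // so a weak fallback never makes the result worse.
  const siteText = siteExtract(html);
  const useSite = countWords(siteText) > countWords(genericText);
  return {
    ...result,
    text: useSite ? siteText : genericText,
    usedFallback: useSite,
  };
}

// Stub extractors stand in for the real ones so the sketch runs on its own.
const stubGeneric = () => ({
  title: 'Trump Will Withdraw U.S. From Paris Climate Agreement',
  text: '',
});
const stubSite = () => 'On Twitter, Miguel Arias Cañete, the European Union’s commissioner for climate, said that ...';

const article = extractWithFallback('<html></html>', stubGeneric, stubSite);
console.log(article.title, article.usedFallback, countWords(article.text));
```

In real use, `genericExtract` could wrap a call to unfluff and `siteExtract` could wrap the `.story-body-text` selector suggested in the comments. The word threshold would need tuning per site.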