New OceanParcels website #113
Conversation
This is the final script I used to extract the article data to port to the new site:
"""Script to scrape articles from old OceanParcels website."""
import requests
from bs4 import BeautifulSoup
import json
import re
import sys
def scrape_articles(url):
try:
# Fetch the webpage
response = requests.get(url)
response.raise_for_status() # Raise an exception for bad status codes
# Parse the HTML content
soup = BeautifulSoup(response.text, "html.parser")
# Find all card elements
cards = soup.find_all("div", class_="card")
# List to store extracted article information
articles = []
# Process each card
for card in cards:
try:
# Extract title from h5 element
title_elem = card.find("h5")
title = title_elem.get_text(strip=True) if title_elem else ""
# Extract authors (text immediately after h5)
authors = ""
if title_elem and title_elem.next_sibling:
authors = (
title_elem.next_sibling.strip()
if isinstance(title_elem.next_sibling, str)
else ""
)
# Extract published info (journal, volume, pages)
published_info = ""
if title_elem:
# Find all text between authors and <br/>
next_elem = title_elem.find_next_sibling()
while next_elem and next_elem.name != "br":
if isinstance(next_elem, str):
published_info += next_elem.strip() + " "
else:
published_info += next_elem.get_text(strip=True) + " "
next_elem = next_elem.next_sibling
published_info = published_info.strip()
# Extract DOI from card-link
# Extract DOI from card-link
doi_link = card.find(
"a", class_="card-link", href=lambda href: href and "doi" in href
)
if doi_link:
doi = doi_link.get("href", "")
# Extract abstract from card-body
card_body = card.find("div", class_="card-body")
abstract = card_body.get_text(strip=True) if card_body else ""
# Clean up abstract by replacing newlines and multiple spaces with single space
authors = authors.rstrip(",")
published_info = re.sub(r"\s*,", ",", published_info)
# Create article dictionary
article = {
"title": title,
"published_info": published_info,
"authors": authors,
"doi": doi,
"abstract": abstract,
}
article = {k: re.sub(r"\n\s*", " ", v) for k, v in article.items()}
articles.append(article)
except Exception as card_error:
print(f"Error processing card: {card_error}")
print("Problematic card HTML:")
print(card.prettify())
sys.exit(1)
# Make articles chronological
articles.reverse()
# Save to JSON file
with open("articles.json", "w", encoding="utf-8") as f:
json.dump(articles, f, indent=2, ensure_ascii=False)
print(f"Successfully scraped {len(articles)} articles.")
return articles
except requests.RequestException as e:
print(f"Error fetching URL: {e}")
sys.exit(1)
# Main execution
if __name__ == "__main__":
url = "https://oceanparcels.org/articles.html"
scrape_articles(url) |
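A quick way to sanity-check the output after a run is to reload the JSON and skim the entries. This is a minimal sketch, assuming `articles.json` is in the working directory:

```python
import json

# Reload the scraped articles and print a one-line summary per entry.
with open("articles.json", encoding="utf-8") as f:
    articles = json.load(f)

for article in articles:
    # Every field is a plain string; "doi" holds the full link from the card.
    print(f"{article['title'][:60]} ({article['doi']})")
```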
@erikvansebille
Let's merge on Monday :)
Should we remove the placeholder?
Good point, I just removed it. Will write it in the coming week(?) But perhaps you(?) can write a short blog post celebrating the new website launch, highlighting that we thank xarray for the design?
done :) |
Rename sponsors to funders

Created the new OceanParcels website. Used https://xarray.dev as a starting point.
Made sure in the migration to bring `example_data` across, as that is how parcels downloads example datasets (see the sketch after this description).

Items still TODO:
Fixes #112
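For context on the `example_data` note above: parcels pulls its tutorial datasets from that directory on the website, so the migration had to keep it reachable. A minimal sketch of that download path, assuming the `download_example_dataset` helper available in recent parcels versions and a dataset name used here for illustration:

```python
from parcels import download_example_dataset

# Fetches the named example dataset from the example_data directory hosted
# on oceanparcels.org and returns the local path it was downloaded to.
data_path = download_example_dataset("MovingEddies_data")
print(data_path)
```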