feat(fixtures): new blog post "Ten years of CERN Open Data portal" #112
Merged
2 commits
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,2 @@ | ||
# Allow non-heading start since the heading will be added by the portal system | ||
MD041: false |
19 changes: 19 additions & 0 deletions
...ures/data/docs/ten-years-of-cern-open-data-portal/ten-years-of-cern-open-data-portal.json
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,19 @@ | ||
[ | ||
{ | ||
"author": "CERN Open Data team", | ||
"body": { | ||
"content": "ten-years-of-cern-open-data-portal.md", | ||
"format": "md" | ||
}, | ||
"date_published": "2024-12-10", | ||
"short_description": { | ||
"content": "The CERN Open Data portal celebrates its ten year birthday! Find out about its journey and today's challenges." | ||
}, | ||
"featured": 1, | ||
"slug": "ten-years-of-cern-open-data-portal", | ||
"title": "Ten years of CERN Open Data portal", | ||
"type": { | ||
"primary": "News" | ||
} | ||
} | ||
] |
114 changes: 114 additions & 0 deletions
...a/docs/ten-years-of-cern-open-data-portal/ten-years-of-cern-open-data-portal.md
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,114 @@ | ||
Ten years ago, a handful of enthusiastic researchers from the | ||
[ALICE](https://alice.cern), [ATLAS](https://atlas.cern/), | ||
[CMS](https://cms.cern/) and [LHCb](https://lhcb.web.cern.ch/) collaboration | ||
open access teams, together with a handful of software engineers from the CERN | ||
[Department of Information | ||
Technology](https://information-technology.web.cern.ch/) and the information | ||
specialists from the CERN [Scientific Information | ||
Service](https://sis.web.cern.ch/), grouped together to build the CERN Open | ||
Data portal. | ||
|
||
Under the umbrella of the [Data Preservation in High Energy | ||
Physics](https://dphep.web.cern.ch/), the work started in summer 2014 by | ||
devising a metadata schema that would neatly describe the open data from the | ||
LHC experiments in both their technical and physics-oriented facets. We had to | ||
design the web site portal (based on the | ||
[Invenio](https://inveniosoftware.org/) digital repository framework) to | ||
provide web pages for general public, including visualising collision events, | ||
all the while ensuring the best data preservation practices to describe, manage | ||
and disseminate the data for researchers. The data needed to be made searchable | ||
and downloadable in a way that would be attractive to both general public for | ||
educational purposes and usable by independent researchers for referencing and | ||
independent theoretical analysis. (Some of the technical challenges behind the | ||
CERN Open Data portal were described in a [later | ||
interview](https://superuser.openinfra.org/articles/cern-open-data-portal/) for | ||
the SuperUser magazine.) | ||
|
||
The efforts concluded in the launch of the CERN Open Data portal on [November | ||
20th | ||
2014](https://home.web.cern.ch/news/news/accelerators/cern-makes-public-first-data-lhc-experiments). | ||
The portal managed about 30 terabytes of open data from LHC experiments in a | ||
ground-breaking service at the time. The [Reddit AskMeAnything | ||
session](https://www.reddit.com/r/IAmA/comments/2nxwkb/a_few_days_ago_cern_launched_an_open_data_portal/) | ||
organised alongside the release attracted large attention and many tens of | ||
thousands of portal visitors, more than the total number of particle physicists | ||
in the world. | ||
|
||
Fast-forward ten years to the present time. The CERN Open Data portal now | ||
disseminates more than 5 petabytes of open data, which is a whopping 200 times | ||
more data than at launch. More particle physics experiments have joined the | ||
open data portal, with [DELPHI](https://delphi-www.web.cern.ch/delphi-www/), | ||
[OPERA](https://en.wikipedia.org/wiki/OPERA_experiment), | ||
[PHENIX](https://www.phenix.bnl.gov/), and | ||
[TOTEM](https://totem-experiment.web.cern.ch/) releasing data samples or even | ||
full data collections. More experiments are in the pipeline, such as | ||
[JADE](https://www.mpp.mpg.de/en/research/data-preservation/jade/). The CERN | ||
Open Data portal is becoming a sort of "HEP Open Data" portal, covering not | ||
only the LHC experiments, but the particle physics domain at large, further | ||
demonstrating success of the original idea. | ||
|
||
Looking back at the origins and the path travelled in the past ten years, any | ||
sceptical concerns whether these data would be understandable and usable for | ||
independent theoretical research have been positively answered. The [leading | ||
publication](https://news.mit.edu/2017/first-open-access-data-large-collider-subatomic-particle-patterns-0929) | ||
by Jesse Thaler's team at MIT analysing CMS open data showed that independent | ||
theoretical publications are not only possible, but that they enrich the | ||
collaboration research practices themselves, with CMS collaboration starting to | ||
cite the independent theoretical work in their own publications. There are now | ||
more than 70 research papers published on the CMS open data and the [number of | ||
published papers is | ||
growing](https://cms.cern/news/cms-celebrates-decade-open-data). Matt Strassler | ||
published a series of blog posts [on the importance of open | ||
data](https://profmattstrassler.com/2019/03/19/the-importance-and-challenges-of-open-data-at-the-large-hadron-collider/) | ||
in this realm. | ||
|
||
The independent usage of the released data for research has led to the | ||
strengthening of published data provenance information when releasing the data | ||
in order to provide physics context and auxiliary information about the data as | ||
accurately and as completely as possible. The data are being published together | ||
with analysis examples demonstrating how to extract physics objects out of the | ||
data and how to use them in one's own analyses. The care about the data usage | ||
patterns and the further usability and reinterpretability of data [has | ||
naturally led](https://www.nature.com/articles/s41567-018-0342-2) to sister | ||
projects dedicated towards facilitating [reproducible | ||
analyses](https://www.reana.io) and [continuous | ||
reuse](https://zenodo.org/records/10263204) of the data. | ||
|
||
Besides independent theoretical research, the data are being used in numerous | ||
masterclasses and education programs to train the next generation of | ||
scientists, as well as by software engineers in the efforts to benchmark | ||
software tools to ensure their feasibility in the forthcoming high-luminosity | ||
experimental data-taking era. | ||
|
||
The bottom-up efforts on preparing and releasing open data were complemented by | ||
the top-down efforts and support from CERN laboratory management towards open | ||
science. The efforts by CERN as the supporting hosting lab together with LHC | ||
collaboration management boards as the data producers and owners paved the way | ||
towards the formal establishment of the [CERN Open Data | ||
policy](https://cds.cern.ch/record/2745133/files/CERN-OPEN-2020-013.pdf) in | ||
2020, and, two years later, the [CERN Open Science | ||
Policy](https://cds.cern.ch/record/2835057/files/CERN-OPEN-2022-013.pdf). It is | ||
under these auspices that the open data pilot efforts progressively took shape | ||
to what they are today whilst seeking the long-term sustainability of making | ||
science open. | ||
|
||
Looking into the future, there are clear challenges ahead. The growing number | ||
of open data releases calls for using more efficient data publishing workflows | ||
leveraging scientific data managers used in collaborations, such as | ||
[Rucio](https://rucio.cern.ch/) in ATLAS and CMS and | ||
[DIRAC](https://dirac.readthedocs.io/) in LHCb. The vast quantities of | ||
published data call for implementing an efficient "hot" and "cold" storage | ||
mechanism behind the portal in order to save on storage costs. All this content | ||
necessitates efficient tape backups and the on-demand data staging for users | ||
from the cold storage, when necessary. Finally, the experimental collaborations | ||
plan to release even more data during the LHC Run-3 phase, which calls for | ||
novel approaches to open data publishing that are going beyond the digital | ||
repository domain, such as the nascent system allowing theorists to ask for | ||
custom LHCb open data production via a dedicated [Ntupling | ||
Wizard](https://arxiv.org/pdf/2302.14235v2) service. | ||
|
||
It has been a blast working together between software engineers, information | ||
specialists and particle physicists on fostering open and reproducible science | ||
practices in particle physics. | ||
|
||
Looking forward to working together in the next decennial! |
Do we still need `"featured"`?
Some time ago, we discussed that we don't really feature old articles, so the portal can simply display the last 3 (or last 6) news items based on date. This should do perfectly well, and we could remove `"featured"` from the content repository articles too. Has this been implemented already?
We still need the field. The portal displays the last 6 news items that have been `featured`. The idea behind this is that there might be some news items that do not deserve to be advertised on the featured page. The main difference is that the field now used to sort the `featured` news is the date (instead of the value of `featured`). That means that the previous entries do not have to be modified when a new featured news item is inserted.
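For illustration, the selection rule described in this comment could be sketched roughly as follows. This is not the portal's actual code: the function name `latest_featured` and the sample records are hypothetical, and only the `featured` and `date_published` field names are taken from the fixture in this PR.

```python
from datetime import date


def latest_featured(records, limit=6):
    """Return up to `limit` featured records, newest first.

    Hypothetical helper: `featured` acts only as an on/off flag,
    and the ordering comes from `date_published` alone.
    """
    featured = [r for r in records if r.get("featured")]
    featured.sort(
        key=lambda r: date.fromisoformat(r["date_published"]),
        reverse=True,
    )
    return featured[:limit]


# Made-up sample records shaped like the JSON fixture in this PR.
sample = [
    {"slug": "a", "date_published": "2024-12-10", "featured": 1},
    {"slug": "b", "date_published": "2024-11-01", "featured": 0},
    {"slug": "c", "date_published": "2023-05-02", "featured": 1},
    {"slug": "d", "date_published": "2025-01-15"},
]

print([r["slug"] for r in latest_featured(sample)])
```

Under this scheme, adding a new featured article only requires setting its own `featured` flag and date; older entries stay untouched, which matches the behaviour described above.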