
Commit ed5ef21

Updated dataset service names to full versions and fixed validation errors
1 parent 38f0a13 commit ed5ef21

279 files changed: +16743 -13442 lines changed

Original file line number | Diff line number | Diff line change
@@ -1,35 +1,47 @@
1-
Deprecated: True
1+
Deprecated: true
22
DeprecatedNotice: Amazon is no longer hosting this Data Lakehouse Ready dataset
3-
Name: 1000 Genomes Phase 3 Reanalysis with DRAGEN 3.5 - Data Lakehouse Ready
4-
Description: "The 1000 Genomes Project is an international collaboration which has established the most detailed catalogue of human genetic variation, including SNPs, structural variants, and their haplotype context. There were a total of 3202 individuals sequenced as part of Phase 3 of this project. The high coverage samples were processed using the Illumina DRAGEN v3.5.7b pipeline and are available at s3://1000genomes-dragen/. This dataset contains the VCFs transformed to Parquet/ORC in 3 different schemas - partitioned by samples, partitioned by chromosome and a nested data format. These representations of the 1000 Genomes DRAGEN data are stored in Parquet/ORC format and can be queried through [Amazon Athena](https://aws.amazon.com/athena/?whats-new-cards.sort-by=item.additionalFields.postDateTime&whats-new-cards.sort-order=desc). To add these tables to your Glue Data Catalog and for sample queries on this dataset, please refer to the link in our Documentation."
3+
Name: 1000 Genomes Phase 3 Reanalysis with DRAGEN 3.5 - Data Lakehouse Ready
4+
Description: The 1000 Genomes Project is an international collaboration which has
5+
established the most detailed catalogue of human genetic variation, including SNPs,
6+
structural variants, and their haplotype context. There were a total of 3202 individuals
7+
sequenced as part of Phase 3 of this project. The high coverage samples were processed
8+
using the Illumina DRAGEN v3.5.7b pipeline and are available at s3://1000genomes-dragen/.
9+
This dataset contains the VCFs transformed to Parquet/ORC in 3 different schemas
10+
- partitioned by samples, partitioned by chromosome and a nested data format. These
11+
representations of the 1000 Genomes DRAGEN data are stored in Parquet/ORC format
12+
and can be queried through [Amazon Athena](https://aws.amazon.com/athena/?whats-new-cards.sort-by=item.additionalFields.postDateTime&whats-new-cards.sort-order=desc).
13+
To add these tables to your Glue Data Catalog and for sample queries on this dataset,
14+
please refer to the link in our Documentation.
515
Documentation: https://github.com/aws-samples/data-lake-as-code/tree/roda#readme
616
Contact: https://github.com/aws-samples/data-lake-as-code/issues
7-
ManagedBy: "[Amazon Web Services](https://aws.amazon.com/)"
17+
ManagedBy: '[Amazon Web Services](https://aws.amazon.com/)'
818
UpdateFrequency: Not updated
919
Tags:
10-
- biology
11-
- bioinformatics
12-
- genetic
13-
- genomic
14-
- Homo sapiens
15-
- life sciences
16-
- parquet
17-
- population genetics
18-
- vcf
19-
License: "Data from the 1000 Genomes Project is now available without embargo, following the final publication from the project. Use of the data should be cited in the usual way, with current details available at http://www.internationalgenome.org/faq/how-do-i-cite-1000-genomes-project."
20+
- biology
21+
- bioinformatics
22+
- genetic
23+
- genomic
24+
- Homo sapiens
25+
- life sciences
26+
- parquet
27+
- population genetics
28+
- vcf
29+
License: Data from the 1000 Genomes Project is now available without embargo, following
30+
the final publication from the project. Use of the data should be cited in the usual
31+
way, with current details available at http://www.internationalgenome.org/faq/how-do-i-cite-1000-genomes-project.
2032
Resources:
21-
- Description: Parquet representations of 1000 Genomes VCF outputs from DRAGEN, ready for enrollment into Data Lake as Code.
22-
ARN: arn:aws:s3:::aws-roda-hcls-datalake/thousandgenomes_dragen
23-
Region: us-east-1
24-
Type: S3 Bucket
33+
- Description: Parquet representations of 1000 Genomes VCF outputs from DRAGEN, ready
34+
for enrollment into Data Lake as Code.
35+
ARN: arn:aws:s3:::aws-roda-hcls-datalake/thousandgenomes_dragen
36+
Region: us-east-1
37+
Type: S3 Bucket
2538
DataAtWork:
2639
Tutorials:
27-
- Title: Sample Queries on the 1000 Genomes, gnomAD and ClinVar data Lake
28-
URL: https://github.com/aws-samples/aws-genomics-datalake/blob/main/1000Genomes.ipynb
29-
AuthorName: Sujaya Srinivasan
30-
Services:
31-
- Athena
32-
- Glue
33-
Tools & Applications:
34-
Publications:
35-
40+
- Title: Sample Queries on the 1000 Genomes, gnomAD and ClinVar data Lake
41+
URL: https://github.com/aws-samples/aws-genomics-datalake/blob/main/1000Genomes.ipynb
42+
AuthorName: Sujaya Srinivasan
43+
Services:
44+
- Amazon Athena
45+
- AWS Glue
46+
Tools & Applications: null
47+
Publications: null

datasets/4dnucleome.yaml

+35-32
@@ -1,41 +1,44 @@
11
Name: 4D Nucleome (4DN)
2-
Description: |
3-
The goal of the National Institutes of Health (NIH) Common Fund’s 4D Nucleome (4DN) program
4-
is to study the three-dimensional organization of the nucleus in space and time (the 4th dimension).
5-
The nucleus of a cell contains DNA, the genetic “blueprint” that encodes all of the genes a living
6-
organism uses to produce proteins needed to carry out life-sustaining cellular functions. Understanding
7-
the conformation of the nuclear DNA and how it is maintained or changes in response to environmental
8-
and cellular cues over time will provide insights into basic biology as well as aspects of human
9-
health and disease. The 4DN is an international consortium of researchers who generate data that
10-
include results from a variety of genomics and imaging assays with a focus on, but not exclusive to,
11-
those that demonstrate close contact between chromatin loci that are non-adjacent on the linear DNA
12-
sequence of chromosomes. Additional assays probe the nuclear landscape in the context of interactions
13-
of chromatin with specific proteins, RNAs and epigenetic changes.
2+
Description: "The goal of the National Institutes of Health (NIH) Common Fund\u2019\
3+
s 4D Nucleome (4DN) program\nis to study the three-dimensional organization of the\
4+
\ nucleus in space and time (the 4th dimension).\nThe nucleus of a cell contains\
5+
\ DNA, the genetic \u201Cblueprint\u201D that encodes all of the genes a living\n\
6+
organism uses to produce proteins needed to carry out life-sustaining cellular functions.\
7+
\ Understanding\nthe conformation of the nuclear DNA and how it is maintained or\
8+
\ changes in response to environmental\nand cellular cues over time will provide\
9+
\ insights into basic biology as well as aspects of human\nhealth and disease. The\
10+
\ 4DN is an international consortium of researchers who generate data that\ninclude\
11+
\ results from a variety of genomics and imaging assays with a focus on, but not\
12+
\ exclusive to,\nthose that demonstrate close contact between chromatin loci that\
13+
\ are non-adjacent on the linear DNA\nsequence of chromosomes. Additional assays\
14+
\ probe the nuclear landscape in the context of interactions\nof chromatin with\
15+
\ specific proteins, RNAs and epigenetic changes.\n"
1416
1517
ManagedBy: 4DN Data Coordination and Integration Center (4DN-DCIC)
1618
Documentation: https://data.4dnucleome.org
1719
UpdateFrequency: Daily
1820
Tags:
19-
- biology
20-
- bioinformatics
21-
- genetic
22-
- genomic
23-
- imaging
24-
- life sciences
25-
- aws-pds
26-
License: External data users may freely download, analyze, and publish results based on any 4DN data provided here without restrictions.
21+
- biology
22+
- bioinformatics
23+
- genetic
24+
- genomic
25+
- imaging
26+
- life sciences
27+
- aws-pds
28+
License: External data users may freely download, analyze, and publish results based
29+
on any 4DN data provided here without restrictions.
2730
Resources:
28-
- Description: Released and archived 4DNucleome data
29-
ARN: arn:aws:s3:::4dn-open-data-public
30-
Region: us-east-1
31-
Type: S3 Bucket
31+
- Description: Released and archived 4DNucleome data
32+
ARN: arn:aws:s3:::4dn-open-data-public
33+
Region: us-east-1
34+
Type: S3 Bucket
3235
DataAtWork:
3336
Tutorials:
34-
- Title: Finding and Downloading 4DN Data files
35-
URL: https://data.4dnucleome.org/help/user-guide/downloading-files
36-
AuthorName: 4DN-DCIC
37-
AuthorURL: data.4dnucleome.org
38-
- Title: Using jupyterhub on the 4DN data portal
39-
URL: https://data.4dnucleome.org/tools/jupyterhub
40-
AuthorName: 4DN-DCIC
41-
AuthorURL: data.4dnucleome.org
37+
- Title: Finding and Downloading 4DN Data files
38+
URL: https://data.4dnucleome.org/help/user-guide/downloading-files
39+
AuthorName: 4DN-DCIC
40+
AuthorURL: data.4dnucleome.org
41+
- Title: Using jupyterhub on the 4DN data portal
42+
URL: https://data.4dnucleome.org/tools/jupyterhub
43+
AuthorName: 4DN-DCIC
44+
AuthorURL: data.4dnucleome.org

datasets/abeja-cc-ja.yaml

+22-20
@@ -1,27 +1,29 @@
11
Name: ABEJA CC JA
2-
Description: A large Japanese language corpus created through preprocessing Common Crawl data
2+
Description: A large Japanese language corpus created through preprocessing Common
3+
Crawl data
34
Documentation: https://github.com/abeja-inc/Megatron-LM/blob/main/docs/dataset/about_data.md
4-
5-
ManagedBy: "[ABEJA inc.](https://www.abejainc.com/)"
5+
6+
ManagedBy: '[ABEJA inc.](https://www.abejainc.com/)'
67
UpdateFrequency: None
78
Tags:
8-
- natural language processing
9-
- web archive
10-
- internet
11-
- japanese
12-
License: "This data is available for anyone to use under the [Common Crawl Terms of Use](https://commoncrawl.org/terms-of-use/)"
9+
- natural language processing
10+
- web archive
11+
- internet
12+
- japanese
13+
License: This data is available for anyone to use under the [Common Crawl Terms of
14+
Use](https://commoncrawl.org/terms-of-use/)
1315
Resources:
14-
- Description: Text corpus
15-
ARN: arn:aws:s3:::abeja-cc-ja
16-
Region: ap-northeast-1
17-
Type: S3 Bucket
16+
- Description: Text corpus
17+
ARN: arn:aws:s3:::abeja-cc-ja
18+
Region: ap-northeast-1
19+
Type: S3 Bucket
1820
DataAtWork:
1921
Tutorials:
20-
- Title: Tutorial of ABEJA CC JA dataset
21-
URL: https://github.com/abeja-inc/Megatron-LM/blob/main/docs/dataset/tutorials.md
22-
AuthorName: Kyo Hattori
23-
Tools & Applications:
24-
Publications:
25-
- Title: "Building a Large-Scale Japanese Corpus from Common Crawl and Its Preprocessing"
26-
URL: https://tech-blog.abeja.asia/entry/abeja-nedo-project-part2-202405
27-
AuthorName: Kyo Hattori
22+
- Title: Tutorial of ABEJA CC JA dataset
23+
URL: https://github.com/abeja-inc/Megatron-LM/blob/main/docs/dataset/tutorials.md
24+
AuthorName: Kyo Hattori
25+
Tools & Applications: null
26+
Publications:
27+
- Title: Building a Large-Scale Japanese Corpus from Common Crawl and Its Preprocessing
28+
URL: https://tech-blog.abeja.asia/entry/abeja-nedo-project-part2-202405
29+
AuthorName: Kyo Hattori

datasets/aev-a2d2.yaml

+29-33
@@ -1,40 +1,36 @@
1-
Name: "A2D2: Audi Autonomous Driving Dataset"
2-
Description:
3-
An open multi-sensor dataset for autonomous driving research.
4-
This dataset comprises semantically segmented images, semantic
5-
point clouds, and 3D bounding boxes. In addition, it contains
6-
unlabelled 360 degree camera images, lidar, and bus data for
7-
three sequences. We hope this dataset will further facilitate
8-
active research and development in AI, computer vision, and
9-
robotics for autonomous driving.
1+
Name: 'A2D2: Audi Autonomous Driving Dataset'
2+
Description: An open multi-sensor dataset for autonomous driving research. This dataset
3+
comprises semantically segmented images, semantic point clouds, and 3D bounding
4+
boxes. In addition, it contains unlabelled 360 degree camera images, lidar, and
5+
bus data for three sequences. We hope this dataset will further facilitate active
6+
research and development in AI, computer vision, and robotics for autonomous driving.
107
118
Documentation: http://a2d2.audi
12-
ManagedBy: "[Audi AG](http://a2d2.audi/)"
13-
UpdateFrequency:
14-
The dataset may be updated with additional or corrected data
15-
on a need-to-update basis.
9+
ManagedBy: '[Audi AG](http://a2d2.audi/)'
10+
UpdateFrequency: The dataset may be updated with additional or corrected data on a
11+
need-to-update basis.
1612
Tags:
17-
- autonomous vehicles
18-
- deep learning
19-
- computer vision
20-
- lidar
21-
- mapping
22-
- machine learning
23-
- robotics
24-
- aws-pds
13+
- autonomous vehicles
14+
- deep learning
15+
- computer vision
16+
- lidar
17+
- mapping
18+
- machine learning
19+
- robotics
20+
- aws-pds
2521
License: https://creativecommons.org/licenses/by-nd/4.0/
2622
Resources:
27-
- Description: http://a2d2.audi
28-
ARN: arn:aws:s3:::aev-autonomous-driving-dataset
29-
Region: eu-central-1
30-
Type: S3 Bucket
23+
- Description: http://a2d2.audi
24+
ARN: arn:aws:s3:::aev-autonomous-driving-dataset
25+
Region: eu-central-1
26+
Type: S3 Bucket
3127
DataAtWork:
3228
Tutorials:
33-
- Title: Autonomous Driving Data Service (ADDS)
34-
URL: https://github.com/aws-samples/amazon-eks-autonomous-driving-data-service
35-
AuthorName: Ajay Vohra, Amazon
36-
Services:
37-
- EKS
38-
- Redshift
39-
- S3
40-
- FSx
29+
- Title: Autonomous Driving Data Service (ADDS)
30+
URL: https://github.com/aws-samples/amazon-eks-autonomous-driving-data-service
31+
AuthorName: Ajay Vohra, Amazon
32+
Services:
33+
- Amazon EKS
34+
- Amazon Redshift
35+
- Amazon S3
36+
- Amazon FSx
+40-38
@@ -1,45 +1,47 @@
11
Name: A region-wide, multi-year set of crop field boundary labels for Africa
2-
Description: >
3-
Crop field boundaries digitized in Planet imagery collected across Africa
4-
between 2017 and 2023, developed by [Farmerline](https://farmerline.co/),
5-
[Spatial Collective](https://spatialcollective.com/), and the
6-
[Agricultural Impacts Research Group](https://agroimpacts.info/) at
7-
[Clark University](https://www.clarku.edu/), with support from the
8-
[Lacuna Fund](https://lacunafund.org/)
9-
Documentation: "https://github.com/agroimpacts/lacunalabels/"
2+
Description: 'Crop field boundaries digitized in Planet imagery collected across Africa
3+
between 2017 and 2023, developed by [Farmerline](https://farmerline.co/), [Spatial
4+
Collective](https://spatialcollective.com/), and the [Agricultural Impacts Research
5+
Group](https://agroimpacts.info/) at [Clark University](https://www.clarku.edu/),
6+
with support from the [Lacuna Fund](https://lacunafund.org/)
7+
8+
'
9+
Documentation: https://github.com/agroimpacts/lacunalabels/
1010
11-
ManagedBy: "[The Agricultural Impacts Research Group](https://agroimpacts.info/)"
12-
UpdateFrequency: "Updated versions of the dataset are added as they are developed"
11+
ManagedBy: '[The Agricultural Impacts Research Group](https://agroimpacts.info/)'
12+
UpdateFrequency: Updated versions of the dataset are added as they are developed
1313
Tags:
14-
- agriculture
15-
- machine learning
16-
- land cover
17-
- satellite imagery
18-
- cog
19-
- labeled
20-
License: "[Planet NICFI participant license agreement](https://assets.planet.com/docs/Planet_ParticipantLicenseAgreement_NICFI.pdf)"
14+
- agriculture
15+
- machine learning
16+
- land cover
17+
- satellite imagery
18+
- cog
19+
- labeled
20+
License: '[Planet NICFI participant license agreement](https://assets.planet.com/docs/Planet_ParticipantLicenseAgreement_NICFI.pdf)'
2121
Resources:
22-
- Description: Field boundary labels and corresponding Planet images
23-
ARN: arn:aws:s3:::africa-field-boundary-labels
24-
Region: us-west-2
25-
Type: S3 Bucket
22+
- Description: Field boundary labels and corresponding Planet images
23+
ARN: arn:aws:s3:::africa-field-boundary-labels
24+
Region: us-west-2
25+
Type: S3 Bucket
2626
DataAtWork:
2727
Tutorials:
28-
- Title: Instructions on data access and label-making demonstration notebook
29-
URL: https://github.com/agroimpacts/lacunalabels
30-
NotebookURL: https://github.com/agroimpacts/lacunalabels/blob/main/notebooks/makelabels/label-chips.ipynb
31-
AuthorName: Lyndon Estes
32-
AuthorURL: https://github.com/ldemaz
28+
- Title: Instructions on data access and label-making demonstration notebook
29+
URL: https://github.com/agroimpacts/lacunalabels
30+
NotebookURL: https://github.com/agroimpacts/lacunalabels/blob/main/notebooks/makelabels/label-chips.ipynb
31+
AuthorName: Lyndon Estes
32+
AuthorURL: https://github.com/ldemaz
3333
Publications:
34-
- Title: A region-wide, multi-year set of crop field boundary labels for Africa
35-
URL: https://zenodo.org/records/11060871
36-
AuthorName: Wussah et al. (2023)
37-
- Title: Technical report on label develop and processing
38-
URL: https://github.com/agroimpacts/lacunalabels/blob/main/docs/report/technical-report.pdf
39-
AuthorName: Wussah et al. (2023)
40-
- Title: High resolution, annual maps of field boundaries for smallholder-dominated croplands at national scales
41-
URL: https://www.frontiersin.org/article/10.3389/frai.2021.744863
42-
AuthorName: Estes et al. (2022)
43-
- Title: A platform for crowdsourcing the creation of representative, accurate landcover maps
44-
URL: http://www.sciencedirect.com/science/article/pii/S136481521630010X
45-
AuthorName: Estes et al. (2016)
34+
- Title: A region-wide, multi-year set of crop field boundary labels for Africa
35+
URL: https://zenodo.org/records/11060871
36+
AuthorName: Wussah et al. (2023)
37+
- Title: Technical report on label develop and processing
38+
URL: https://github.com/agroimpacts/lacunalabels/blob/main/docs/report/technical-report.pdf
39+
AuthorName: Wussah et al. (2023)
40+
- Title: High resolution, annual maps of field boundaries for smallholder-dominated
41+
croplands at national scales
42+
URL: https://www.frontiersin.org/article/10.3389/frai.2021.744863
43+
AuthorName: Estes et al. (2022)
44+
- Title: A platform for crowdsourcing the creation of representative, accurate landcover
45+
maps
46+
URL: http://www.sciencedirect.com/science/article/pii/S136481521630010X
47+
AuthorName: Estes et al. (2016)
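
The service renames visible in the diffs above (for example `Athena` to `Amazon Athena`, `Glue` to `AWS Glue`, `EKS` to `Amazon EKS`) could be applied with a small lookup table. The following is a minimal sketch, not the script used for this commit: the mapping contains only the six renames that appear in this excerpt, and the names `FULL_SERVICE_NAMES` and `expand_service_names` are hypothetical.

```python
# Short-to-full service names. Assumption: limited to the renames
# visible in this diff excerpt; the real commit may cover more services.
FULL_SERVICE_NAMES = {
    "Athena": "Amazon Athena",
    "Glue": "AWS Glue",
    "EKS": "Amazon EKS",
    "Redshift": "Amazon Redshift",
    "S3": "Amazon S3",
    "FSx": "Amazon FSx",
}


def expand_service_names(services):
    """Return a new list with known short names replaced by full names.

    Names that are not in the mapping, including names that are already
    in full form, pass through unchanged.
    """
    return [FULL_SERVICE_NAMES.get(name, name) for name in services]


if __name__ == "__main__":
    # The Services list from datasets/aev-a2d2.yaml before the commit.
    print(expand_service_names(["EKS", "Redshift", "S3", "FSx"]))
```

Because unknown names pass through untouched, running the function a second time on its own output changes nothing, which makes it safe to re-run across all dataset files.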
