From b14a74a21e366f7e48f490ffa94f98b880bf5f5a Mon Sep 17 00:00:00 2001 From: Lavanya Ashokkumar Date: Tue, 30 Sep 2025 11:58:08 -0500 Subject: [PATCH 1/2] updated readme file with resources folder for videos --- README.md | 190 ++++++++++++++++++++++++++++++++++++++++++------------ 1 file changed, 149 insertions(+), 41 deletions(-) diff --git a/README.md b/README.md index 03da9ce2..93bc4e78 100644 --- a/README.md +++ b/README.md @@ -5,79 +5,153 @@ [![DOI](https://zenodo.org/badge/153786129.svg)](https://zenodo.org/doi/10.5281/zenodo.10724716) ## Introduction - The pyQuARC (*pronounced "pie-quark"*) library was designed to read and evaluate descriptive metadata used to catalog Earth observation data products and files. This type of metadata focuses and limits attention to important aspects of data, such as the spatial and temporal extent, in a structured manner that can be leveraged by data catalogs and other applications designed to connect users to data. Therefore, poor quality metadata (e.g. inaccurate, incomplete, improperly formatted, inconsistent) can yield subpar results when users search for data. Metadata that inaccurately represents the data it describes risks matching users with data that does not reflect their search criteria and, in the worst-case scenario, can make data impossible to find. Given the importance of high quality metadata, it is necessary that metadata be regularly assessed and updated as needed. pyQuARC is a tool that can help streamline the process of assessing metadata quality by automating it as much as possible. In addition to basic validation checks (e.g. adherence to the metadata schema, controlled vocabularies, and link checking), pyQuARC flags opportunities to improve or add contextual metadata information to help the user connect to, access, and better understand the data product. pyQuARC also ensures that information common to both data product (i.e. collection) and the file-level (i.e. 
granule) metadata are consistent and compatible. As open source software, pyQuARC can be adapted and customized to allow for quality checks unique to different needs. -## pyQuARC Base Package +## pyQuARC Metadata Quality Framework +pyQuARC was designed to assess metadata in NASA’s [Common Metadata Repository (CMR)](https://earthdata.nasa.gov/eosdis/science-system-description/eosdis-components), a centralized repository for all of NASA’s Earth observation data products. In addition, the CMR contains metadata for Earth observation products submitted by external partners. The CMR serves as the backend for NASA’s Earthdata Search ([search.earthdata.nasa.gov](https://search.earthdata.nasa.gov/)) and is also the authoritative metadata source for NASA’s [Earth Observing System Data and Information System (EOSDIS)](https://earthdata.nasa.gov/eosdis). -pyQuARC was specifically designed to assess metadata in NASA’s [Common Metadata Repository (CMR)](https://earthdata.nasa.gov/eosdis/science-system-description/eosdis-components), which is a centralized metadata repository for all of NASA’s Earth observation data products. In addition to NASA’s ~9,000 data products, the CMR also holds metadata for over 40,000 additional Earth observation data products submitted by external data partners. The CMR serves as the backend for NASA’s Earthdata Search (search.earthdata.nasa.gov) and is also the authoritative metadata source for NASA’s [Earth Observing System Data and Information System (EOSDIS).](https://earthdata.nasa.gov/eosdis) +pyQuARC was initially developed by a group called the [Analysis and Review of the CMR (ARC)](https://www.earthdata.nasa.gov/data/projects/analysis-review-cmr-project) team. The ARC team conducted quality assessments of NASA’s metadata records in the CMR, identified opportunities for improvement in the metadata records, and collaborated with the data archive centers to resolve any identified issues. 
ARC has developed a [metadata quality assessment framework](http://doi.org/10.5334/dsj-2021-017) which specifies a common set of assessment criteria. These criteria focus on correctness, completeness, and consistency with the goal of making data more discoverable, accessible, and usable. The ARC metadata quality assessment framework is the basis for the metadata checks that have been incorporated into the pyQuARC base package. Specific quality criteria for each CMR metadata element are documented in the [Earthdata Wiki space](https://wiki.earthdata.nasa.gov/display/CMR/CMR+Metadata+Best+Practices%3A+Landing+Page). -pyQuARC was developed by a group called the [Analysis and Review of the CMR (ARC)](https://earthdata.nasa.gov/esds/impact/arc) team. The ARC team conducts quality assessments of NASA’s metadata records in the CMR, identifies opportunities for improvement in the metadata records, and collaborates with the data archive centers to resolve any identified issues. ARC has developed a [metadata quality assessment framework](http://doi.org/10.5334/dsj-2021-017) which specifies a common set of assessment criteria. These criteria focus on correctness, completeness, and consistency with the goal of making data more discoverable, accessible, and usable. The ARC metadata quality assessment framework is the basis for the metadata checks that have been incorporated into pyQuARC base package. Specific quality criteria for each CMR metadata element is documented in the following wiki: -[https://wiki.earthdata.nasa.gov/display/CMR/CMR+Metadata+Best+Practices%3A+Landing+Page](https://wiki.earthdata.nasa.gov/display/CMR/CMR+Metadata+Best+Practices%3A+Landing+Page) +Each metadata element’s wiki page includes a “Metadata Validation and QA/QC” section that lists quality criteria categorized by priority levels, referred to as a priority matrix. 
Priority levels in the [priority matrix](https://wiki.earthdata.nasa.gov/spaces/CMR/pages/109874556/ARC+Priority+Matrix) are designated as high (red), medium (yellow), or low (blue), and are intended to communicate the importance of meeting the specified criteria. -There is an “ARC Metadata QA/QC” section on the wiki page for each metadata element that lists quality criteria categorized by level of [priority. Priority categories](https://wiki.earthdata.nasa.gov/display/CMR/ARC+Priority+Matrix) are designated as high (red), medium (yellow), or low (blue), and are intended to communicate the importance of meeting the specified criteria. +The CMR is designed around its own metadata standard called the [Unified Metadata Model (UMM)](https://www.earthdata.nasa.gov/about/esdis/eosdis/cmr/umm). In addition to being an extensible metadata model, the UMM provides a crosswalk for mapping among the various CMR-supported metadata standards, including DIF10, ECHO10, ISO 19115-1, and ISO 19115-2. -The CMR is designed around its own metadata standard called the [Unified Metadata Model (UMM).](https://earthdata.nasa.gov/eosdis/science-system-description/eosdis-components/cmr/umm) In addition to being an extensible metadata model, the UMM also provides a cross-walk for mapping between the various CMR-supported metadata standards. 
CMR-supported metadata standards currently include: -* [DIF10](https://earthdata.nasa.gov/esdis/eso/standards-and-references/directory-interchange-format-dif-standard) (Collection/Data Product-level only) -* [ECHO10](https://earthdata.nasa.gov/esdis/eso/standards-and-references/echo-metadata-standard) (Collection/Data Product and Granule/File-level metadata) -* [ISO19115-1 and ISO19115-2](https://earthdata.nasa.gov/esdis/eso/standards-and-references/iso-19115) (Collection/Data Product and Granule/File-level metadata) +pyQuARC currently supports the following metadata standards: * [UMM-JSON](https://wiki.earthdata.nasa.gov/display/CMR/UMM+Documents) (UMM) - * UMM-C (Collection/Data Product-level metadata) - * UMM-G (Granule/File-level metadata) - * UMM-S (Service metadata) - * UMM-T (Tool metadata) + * Collection/Data Product-level metadata (UMM-C) + * Granule/File-level metadata (UMM-G) +* [ECHO10](https://earthdata.nasa.gov/esdis/eso/standards-and-references/echo-metadata-standard) + * Collection/Data Product-level metadata (ECHO-C) + * Granule/File-level metadata (ECHO-G) +* [DIF10](https://earthdata.nasa.gov/esdis/eso/standards-and-references/directory-interchange-format-dif-standard) + * Collection/Data Product-level only +## pyQuARC User Demo Series +A series of user demos has been created to explain what pyQuARC does and how it can be used. These demos cover the process of installing, activating, and using the library for a specific schema. The demo files are available in the **resources** folder of the pyQuARC GitHub repository. -pyQuARC supports DIF10 (collection only), ECHO10 (collection and granule), UMM-C, and UMM-G standards. At this time, there are no plans to add ISO 19115 or UMM-S/T specific checks. **Note that pyQuARC development is still underway, so further enhancements and revisions are planned.** +## Install and Clone the Repository +The pyQuARC library requires `Python 3.10` to function properly across all operating systems. 
-**For inquiries, please email: sheyenne.kirkland@uah.edu** +### 1. Open your Command Prompt or Terminal and use the following command to clone the pyQuARC repository: +* `git clone https://github.com/NASA-IMPACT/pyQuARC.git` -## pyQuARC as a Service (QuARC) +Note: If you see the message `fatal: destination path 'pyQuARC' already exists and is not an empty directory` when running this command, it means the repository has already been cloned. To reclone it, delete the folder and its contents using the command for your operating system before running the original command again. -QuARC is pyQuARC deployed as a service and can be found here: https://quarc.nasa-impact.net/docs/. +* `rmdir /s /q pyQuARC` # for Windows Command Prompt; deletes the directory (be cautious) +* `rm -rf pyQuARC` # for Linux/MacOS operating systems; deletes the directory (be cautious) -QuARC is still in beta but is regularly synced with the latest version of pyQuARC on GitHub. Fully cloud-native, the architecture diagram of QuARC is shown below: +Additional note: If you want to know where your freshly cloned pyQuARC folder ended up, you can use the following command to print your working directory: -![QuARC](https://user-images.githubusercontent.com/17416300/179866276-7c025699-01a1-4d3e-93cd-50e12c5a5ec2.png) +* `pwd` # for Linux/MacOS operating systems +* `cd` # for Windows operating systems + +This will show you the full path to the directory where the cloned pyQuARC repository is located. You can then append `\pyQuARC` (Windows) or `/pyQuARC` (Linux/MacOS) to the end of the path to get the full path to the folder. + +### 2. Configure and Activate Environment: +Create an environment to set up an isolated workspace for using pyQuARC. You can do this with Anaconda/Miniconda (Option A) or with Python’s built-in `venv` module (Option B). + +**A. Use the Conda package manager to create and name the environment:** +* `conda create --name ` # - Replace `` with the name of your environment. + +**B. 
Use the Python interpreter to create a virtual environment in your current directory:** +* `python -m venv env` + +Next, activate the environment using either Option A or Option B, depending on how you created it in the previous step: + +**A. Activate the Conda environment with the Conda package manager:** +* `conda activate ` + +**B. Activate the Python virtual environment** +For macOS/Linux operating systems, use the following: +* `source env/bin/activate` + +For Windows operating systems, use the following command: +* `env\Scripts\activate` + +Note: On Windows, you may encounter an error with this command. If that happens, use: +* `.\env\Scripts\Activate.ps1` -## Architecture +Be sure to reference the correct location of the env directory, as you may need to activate either the `.bat` or `.ps1` script. This error is uncommon. +### 3. Install Requirements +Next, install the required packages. The requirements are included as a text file in the repository and will be available on your local machine automatically once you clone the pyQuARC repository. Before installing the requirements, make sure you are in your working directory and navigate to the pyQuARC folder. + +Navigate to your directory: +* `cd` + +Navigate to the pyQuARC folder: +* `cd pyQuARC` + +Install the requirements: +* `pip install -r requirements.txt` + +You are almost there! Open your code editor (e.g., VS Code), navigate to the location where you cloned the repository, select the pyQuARC folder, and click Open. You should now be able to see all the existing files and contents of the pyQuARC folder in your code editor. Voilà! You are ready to use pyQuARC! + +## pyQuARC Architecture ![pyQuARC Architecture](/images/architecture.png) -The Downloader is used to obtain a copy of a metadata record of interest from the CMR. 
This is accomplished using a [CMR API query,](https://cmr.earthdata.nasa.gov/search/site/docs/search/api.html) where the metadata record of interest is identified by its unique identifier in the CMR (concept_id). CMR API documentation can be found here: -[https://cmr.earthdata.nasa.gov/search/site/docs/search/api.html](https://cmr.earthdata.nasa.gov/search/site/docs/search/api.html) +pyQuARC uses a Downloader to obtain a copy of a metadata record of interest from the CMR API. This is accomplished using a [CMR API query](https://cmr.earthdata.nasa.gov/search/site/docs/search/api.html), where the metadata record of interest is identified by its unique identifier in the CMR (concept_id). For more information, please visit the [CMR API documentation](https://cmr.earthdata.nasa.gov/search/site/docs/search/api.html). -There is also the option to select and run pyQuARC on a metadata record already downloaded to your local desktop. +After cloning the repository, you can find a set of files in the `schemas` folder including `checks.json`, `rule_mapping.json`, and `check_messages.json` that define and apply the rules used to evaluate metadata. Each rule is specified by its `rule_id`, associated function, and any dependencies on specific metadata elements. -The `checks.json` file includes a comprehensive list of rules. Each rule is specified by its `rule_id,` associated function, and any dependencies on specific metadata elements. +* The `checks.json` file contains a comprehensive list of all metadata quality rules used by pyQuARC. Each rule in this file includes a `check_function` that specifies the name of the check. +* The `check_messages.json` file contains the messages that are displayed when a check fails. You can use the `check_function` name from the `checks.json` file to locate the output message associated with each check. +* The `rule_mapping.json` file specifies which metadata element(s) each rule applies to. 
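As a rough illustration of how the three files relate, the sketch below joins what each one says about a single rule. It is not pyQuARC's own code: the layout (each file as a JSON object keyed by the same rule name), the field names `fields_to_apply` and `failure`, the helper names, and the sample entries are all assumptions made for the example, so check the real files in `schemas` before relying on it.

```python
import json
from pathlib import Path


def load_schema_file(schemas_dir, filename):
    """Read one JSON file from the schemas folder into a dict."""
    with open(Path(schemas_dir) / filename, encoding="utf-8") as handle:
        return json.load(handle)


def summarize_rule(rule_name, checks, rule_mapping, check_messages):
    """Collect what the three files say about one rule.

    Assumed (hypothetical) layout: each file is a JSON object keyed by rule name.
    """
    check = checks.get(rule_name, {})
    mapping = rule_mapping.get(rule_name, {})
    message = check_messages.get(rule_name, {})
    return {
        "rule": rule_name,
        "check_function": check.get("check_function"),
        "applies_to": mapping.get("fields_to_apply", []),
        "failure_message": message.get("failure"),
    }


# Invented sample entries standing in for the real files.
sample_checks = {"example_rule": {"check_function": "example_check"}}
sample_mapping = {"example_rule": {"fields_to_apply": ["Collection/ExampleField"]}}
sample_messages = {"example_rule": {"failure": "The example field is invalid."}}

summary = summarize_rule("example_rule", sample_checks, sample_mapping, sample_messages)
print(summary)
```

To try it against a clone, load each file with `load_schema_file`, passing the path to the `schemas` folder, and adjust the field names to whatever the real files contain.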
-The `rule_mapping.json` file specifies which metadata element(s) each rule applies to. The `rule_mapping.json` also references the `messages.json` file which includes messages that can be displayed when a check passes or fails. +Furthermore, the `rule_mapping.json` file specifies the severity level associated with a failure. If a check fails, it is assigned one of three categories: ❌ Error, ⚠️ Warning, or ℹ️ Info. These categories correspond to priority levels in [ARC’s priority matrix](https://wiki.earthdata.nasa.gov/display/CMR/ARC+Priority+Matrix) and indicate the importance of the failed check. Default severity values are based on ARC’s metadata quality assessment framework but can be customized to meet individual needs. -Furthermore, the `rule_mapping.json` file specifies the level of severity associated with a failure. If a check fails, it will be assigned a severity category of “error”, “warning”, or "info.” These categories correspond to priority categorizations in [ARC’s priority matrix](https://wiki.earthdata.nasa.gov/display/CMR/ARC+Priority+Matrix) and communicate the importance of the failed check, with “error” being the most critical category, “warning” indicating a failure of medium priority, and “info” indicating a minor issue or inconsistency. Default severity values are assigned based on ARC’s metadata quality assessment framework, but can be customized to meet individual needs. +❌ Error → most critical issues +⚠️ Warning → medium-priority issues +ℹ️ Info → minor issues -## Customization -pyQuARC is designed to be customizable. Output messages can be modified using the `messages_override.json` file - any messages added to `messages_override.json` will display over the default messages in the `message.json` file. Similarly, there is a `rule_mapping_override.json` file which can be used to override the default settings for which rules/checks are applied to which metadata elements. 
+In the `code` folder, you will find a series of Python files containing the implementations for each check. For example, the `data_format_gcmd_check` listed in the `checks.json` file can be found in the `string_validator.py` file, where the code performs the check using a string validator. -There is also the opportunity for more sophisticated customization. New QA rules can be added and existing QA rules can be edited or removed. Support for new metadata standards can be added as well. Further details on how to customize pyQuARC will be provided in the technical user’s guide below. +## Run pyQuARC on a Single Record -While the pyQuARC base package is currently managed by the ARC team, the long term goal is for it to be owned and governed by the broader EOSDIS metadata community. +### Locating the Concept ID +To run pyQuARC on a single record, either at the collection (data product) level or the granule (individual file) level, you will need the associated Concept ID. If you don’t know the Concept ID for the record, you can find it by following these steps: -## Install/User’s Guide -### Running the program +1. Go to NASA [Earthdata Search](https://search.earthdata.nasa.gov/) and locate the data product of interest. +2. Click Collection Details and locate the dataset’s Short Name, which is often highlighted in gray along with the Version number (for example: Short Name = Aqua_AIRS_MODIS1km_IND, Version = 1). +3. Copy the Short Name and Version number, then modify the following path: -*Note:* This program requires `Python 3.8` installed in your system. +* `https://cmr.earthdata.nasa.gov/search/collections.umm-json?entry_id=SHORTNAME_VERSION#&all_revisions=true` -**Clone the repo:** [https://github.com/NASA-IMPACT/pyQuARC/](https://github.com/NASA-IMPACT/pyQuARC/) +You will need to replace `SHORTNAME` in the path with the actual Short Name of the dataset (for example: Aqua_AIRS_MODIS1km_IND). 
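The substitution of the Short Name and Version number into the path can also be sketched in Python. This is only an illustration: the function name `build_collection_lookup_url` is hypothetical and not part of pyQuARC, and the sketch assumes the entry ID is simply the Short Name and Version joined by an underscore, as in the path above.

```python
from urllib.parse import urlencode

CMR_COLLECTIONS_URL = "https://cmr.earthdata.nasa.gov/search/collections.umm-json"


def build_collection_lookup_url(short_name, version):
    """Return the CMR query URL for a collection's Short Name and Version."""
    entry_id = f"{short_name}_{version}"  # SHORTNAME_VERSION#, as in the path above
    query = urlencode({"entry_id": entry_id, "all_revisions": "true"})
    return f"{CMR_COLLECTIONS_URL}?{query}"


print(build_collection_lookup_url("Aqua_AIRS_MODIS1km_IND", "1"))
```

Opening the printed URL in a browser returns the collection's UMM-JSON records, which include the Concept ID.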
+You will also need to replace `VERSION#` in the path with the actual Version number listed under Collection Details in Earthdata Search (for example: 1). -**Go to the project directory:** `cd pyQuARC` +For the dataset “Aqua AIRS-MODIS 1-km Matchup Indexes V1 (Aqua_AIRS_MODIS1km_IND) at GES_DISC” with Short Name Aqua_AIRS_MODIS1km_IND and Version 1, the path is modified as follows: + +* `https://cmr.earthdata.nasa.gov/search/collections.umm-json?entry_id=Aqua_AIRS_MODIS1km_IND_1&all_revisions=true` + +You should now be able to find the `concept-id` for that collection (data product). + +For individual files (granules), locating the Concept ID is straightforward. In [Earthdata Search](https://search.earthdata.nasa.gov/), find the file of interest, click View Details, and then check the Information tab to see the Concept ID. -**Create a python virtual environment:** `python -m venv env` +### Running pyQuARC Using the Concept ID +Now that you have identified the Concept ID for the collection (data product) or granule (individual file) metadata, you can use the following command in your code editor to curate it: -**Activate the environment:** `source env/bin/activate` +* `python pyQuARC/main.py --concept_ids CONCEPT_ID --format FORMAT` -**Install the requirements:** `pip install -r requirements.txt` +`CONCEPT_ID` should be replaced with the Concept ID of the collection or granule-level metadata (for example: `C2515837343-GES_DISC`). +`FORMAT` should be replaced with the schema you are using to validate the metadata. This will differ depending on whether you are curating collection- or granule-level metadata. 
The list of acceptable formats is as follows: + +- `umm-c` (for collection) +- `umm-g` (for granule) +- `echo-c` (for collection) +- `echo-g` (for granule) +- `dif10` (for collection only) + +**Example** +For `C2515837343-GES_DISC`, the command above can be modified as follows: + +`python pyQuARC/main.py --concept_ids C2515837343-GES_DISC --format umm-c` + +In this example, `CONCEPT_ID` has been replaced with `C2515837343-GES_DISC`, and `FORMAT` has been replaced with `umm-c`. + +### Running pyQuARC on a Local File +There is also the option to select and run pyQuARC on a metadata record already downloaded to your local desktop. **Run `main.py`:** @@ -110,8 +184,33 @@ or ▶ python pyQuARC/main.py --file "/Users/batman/projects/pyQuARC/tests/fixtures/test_cmr_metadata.echo10" ``` -### Adding a custom rule +## Run pyQuARC on Multiple Records +pyQuARC can run metadata checks on multiple collection or granule IDs, so several records can be validated in a single run. All records in a run must share the same schema format, which must be one of the following: `umm-c`, `umm-g`, `echo-c`, `echo-g`, or `dif10`. + +To run pyQuARC on multiple records, use one of the following options/commands: + +A. List the collection IDs consecutively, separated by commas. The results will be displayed in the console. + +`python pyQuARC/main.py --concept_ids , , , …. --format umm-c` + +B. If you have many collection IDs (e.g., more than 10 records), it is recommended to create a text file listing the collection IDs. The format of the records should be: + + + + +…… + + +`python pyQuARC/main.py --concept_ids $(cat pyQuARC/files.txt) --format umm-c` + +C. If you prefer to save the output from multiple records to a `.csv` file for reference, use the following command. 
Note that the output format may not be perfectly structured due to the default settings used when writing output from the Python console. +`python pyQuARC/main.py --concept_ids , , , …. --format umm-c > pyquarc_output.csv` + +## Customization +pyQuARC is designed to be customizable. Output messages can be modified using the `messages_override.json` file - any messages added to `messages_override.json` will display over the default messages in the `message.json` file. Similarly, there is a `rule_mapping_override.json` file which can be used to override the default settings for which rules/checks are applied to which metadata elements. There is also the opportunity for more sophisticated customization. New QA rules can be added and existing QA rules can be edited or removed. Support for new metadata standards can be added as well. + +### Adding a custom rule To add a custom rule, follow the following steps: **Add an entry to the `schemas/rule_mapping.json` file in the form:** @@ -389,7 +488,6 @@ The values 0 and 1 do not amount to a true value >>> ... ``` - **To provide custom messages for new or old fields:** ```python @@ -418,3 +516,13 @@ The values 0 and 1 do not amount to a true value >>> validator.validate() >>> ... ``` + +## pyQuARC as a Service (QuARC) +QuARC is pyQuARC deployed as a service and can be found here: https://quarc.nasa-impact.net/docs/. + +QuARC is still in beta but is regularly synced with the latest version of pyQuARC on GitHub. Fully cloud-native, the architecture diagram of QuARC is shown below: + +![QuARC](https://user-images.githubusercontent.com/17416300/179866276-7c025699-01a1-4d3e-93cd-50e12c5a5ec2.png) + +## Have a question? +If you have any questions, please contact us at **earthdata-support@nasa.gov**. 
From 3ede23b517a3fbcdcdaf66c430fc855a1127ed9b Mon Sep 17 00:00:00 2001 From: Lavanya Ashokkumar Date: Wed, 8 Oct 2025 13:22:40 -0500 Subject: [PATCH 2/2] removed demo videos -- Updated README file #354 --- README.md | 3 --- 1 file changed, 3 deletions(-) diff --git a/README.md b/README.md index 93bc4e78..6e4d61b1 100644 --- a/README.md +++ b/README.md @@ -28,9 +28,6 @@ pyQuARC currently supports the following metadata standards: * [DIF10](https://earthdata.nasa.gov/esdis/eso/standards-and-references/directory-interchange-format-dif-standard) * Collection/Data Product-level only -## pyQuARC User Demo Series -A series of user demos has been created to explain what pyQuARC does and how it can be used. These demos cover the process of installing, activating, and using the library for a specific schema. The demo files are available in the **resources** folder of the pyQuARC GitHub repository. - ## Install and Clone the Repository The pyQuARC library requires `Python 3.10` to function properly across all operating systems.
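Before installing, it can help to confirm that the active interpreter matches the stated requirement. The following is a minimal sketch, not part of pyQuARC: the function name `meets_requirement` is hypothetical, and it assumes "requires `Python 3.10`" means exactly the 3.10 release series. If later releases also turn out to work, the comparison could be relaxed to `>=`.

```python
import sys

REQUIRED = (3, 10)  # version stated in this README


def meets_requirement(version_info=sys.version_info, required=REQUIRED):
    """Return True when the interpreter's major.minor equals the required release."""
    return tuple(version_info[:2]) == required


if meets_requirement():
    print("Python version matches the stated requirement.")
else:
    found = ".".join(str(part) for part in sys.version_info[:3])
    print(f"Found Python {found}; this README states Python 3.10 is required.")
```

Run it with the same interpreter you plan to use for the Conda or `venv` environment.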