1 change: 1 addition & 0 deletions .gitignore
@@ -4,3 +4,4 @@
.tox/
__pycache__/
dist/
.idea
123 changes: 123 additions & 0 deletions CLAUDE.md
@@ -0,0 +1,123 @@
# CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

## Project Overview

backups2datalad is a Python tool for mirroring Dandisets (datasets from the DANDI neuroscience data archive) and their Zarr files as git-annex repositories. It works with the DANDI API to fetch metadata and data, creating local mirrors that can be pushed to GitHub organizations.

The tool handles both public and embargoed Dandisets. Embargoed Dandisets are mirrored as private GitHub repositories, which are automatically converted to public when they are unembargoed.

## Development Environment Setup

### Prerequisites

- Python 3.10+
- git-annex version 10.20240430 or newer
- DANDI API token (set as environment variable `DANDI_API_KEY`)
- For pushing to GitHub, a GitHub access token stored in the `hub.oauthtoken` key in `~/.gitconfig`
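Assuming a POSIX shell, the token setup above might be done as follows (both token values are placeholders to be replaced with real credentials):

```bash
# Placeholder values -- substitute your real tokens.
export DANDI_API_KEY="your-dandi-token"

# Store the GitHub access token where the tool expects it (~/.gitconfig).
git config --global hub.oauthtoken "your-github-token"
```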

### Installation

```bash
# Install in development mode
pip install -e .
```

## Common Commands

### Running Tests

```bash
# Run all tests
tox

# Run specific test environment
tox -e lint # Run linting checks
tox -e typing # Run type checking
tox -e py3 # Run Python tests

# Run a specific test file
pytest test/test_core.py

# Run a specific test
pytest test/test_core.py::test_1
```

Before committing code, make sure that the typing check (`tox -e typing`) passes.

### Linting and Type Checking

```bash
# Run linting checks
flake8 src test

# Run type checking
mypy src test
```

## Architecture Overview

backups2datalad is structured around these key components:

1. **Command Line Interface**: Implemented using `asyncclick` for async operations, defined in `__main__.py`.

2. **Configuration**: `BackupConfig` class in `config.py` handles loading and validation of configuration settings from YAML files.

3. **Core Components**:
- `DandiDatasetter` in `datasetter.py`: Main class for mirroring operations
- `AsyncDandiClient` in `adandi.py`: Async client for interacting with DANDI API
- `AsyncDataset` in `adataset.py`: Wrapper around DataLad Dataset for async operations
- `Syncer` in `syncer.py`: Handles synchronization of assets

4. **Manager and GitHub Integration**: `Manager` class with GitHub API integration for pushing repositories.

5. **Zarr Support**: Special handling for Zarr files, with checksumming and specialized mirroring.
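A minimal configuration might look like the sketch below. The key names are inferred from the `BackupConfig` attributes referenced in the code (`gh_org`, `zarrs.github_org`, `zarrs.remote`) and may not match the real YAML schema exactly:

```yaml
# Hypothetical config sketch; verify key names against config.py.
gh_org: dandisets            # GitHub org for Dandiset mirrors
zarrs:
  github_org: dandizarrs     # GitHub org for Zarr mirrors
  remote: backup             # git-annex special remote for Zarr content
```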

## Embargo Handling

The system supports working with both public and embargoed Dandisets:

1. **Embargoed Dandisets**:
- Stored in git-annex with embargo status tracked in `.datalad/config`
- When pushed to GitHub, they are created as private repositories
- Special handling for authentication when accessing embargoed Dandisets

2. **Unembargoed Dandisets**:
- When a Dandiset is unembargoed, the system updates its status
- GitHub repositories are converted from private to public
- S3 URLs for assets are registered with git-annex

3. **Status Tracking**:
- The embargo status of a Dandiset is tracked and synchronized between the remote server and local backup
- GitHub repository access status (private/public) is stored in the superdataset's `.gitmodules` file
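For example, an embargoed Zarr submodule's entry in the superdataset's `.gitmodules` might look like this (the path and URL are hypothetical; the `github-access-status` key is the one the code sets via `set_repo_config`):

```ini
[submodule "sub-01/sub-01_image.zarr"]
	path = sub-01/sub-01_image.zarr
	url = https://github.com/dandizarrs/<zarr-id>
	github-access-status = private
```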

## Main Workflow

1. Configuration is loaded from a YAML file
2. DANDI API client is initialized with an API token
3. The mirroring command (e.g., `update-from-backup`) is executed, which:
- Fetches Dandiset metadata from the DANDI API
- Creates or updates local git-annex repositories
- Sets appropriate embargo status for each Dandiset
- Synchronizes assets between DANDI and local repositories
- Optionally pushes changes to GitHub organizations (with appropriate privacy settings)
- Creates tags for published versions
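The privacy decision in the steps above reduces to a simple mapping from embargo status to GitHub repository visibility. A minimal self-contained sketch, using a stand-in for `dandi.consts.EmbargoStatus`:

```python
from enum import Enum


class EmbargoStatus(Enum):
    # Stand-in for dandi.consts.EmbargoStatus.
    OPEN = "OPEN"
    UNEMBARGOING = "UNEMBARGOING"
    EMBARGOED = "EMBARGOED"


def github_access_status(embargo: EmbargoStatus) -> str:
    """Mirror the rule used in the code: only EMBARGOED maps to private."""
    return "private" if embargo is EmbargoStatus.EMBARGOED else "public"
```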

## Testing

The project uses pytest for testing, with fixtures for:
- Setting up Docker-based DANDI instances
- Creating sample Dandisets
- Managing temporary directories

The tests verify:
- Proper syncing of Dandisets
- Creation and updating of local repositories
- Handling of published versions and tagging
- Error conditions and edge cases
- Embargo status handling

## Important Environment Variables

- `DANDI_API_KEY`: Required API token for the DANDI instance being mirrored
27 changes: 27 additions & 0 deletions src/backups2datalad/datasetter.py
@@ -509,12 +509,15 @@
timestamp=None,
asset_paths=[asset.path],
)
# Get embargo status from parent Dandiset
dandiset_embargo_status = await ds.get_embargo_status()
await sync_zarr(
asset,
zarr_digest,
zarr_dspath,
self.manager.with_sublogger(f"Zarr {asset.zarr}"),
link=zl,
embargo_status=dandiset_embargo_status,
)
log.info("Zarr %s: Moving dataset", asset.zarr)
shutil.move(str(zarr_dspath), str(ultimate_dspath))
@@ -564,6 +567,30 @@
path=[asset.path],
commit_date=ts,
)
# Add github-access-status for the Zarr submodule based on parent
# Dandiset's embargo status
if self.config.zarr_gh_org is not None:
embargo = await ds.get_embargo_status()
access_status = (
"private" if embargo is EmbargoStatus.EMBARGOED else "public"
)
log.debug(
"Setting github-access-status to %s for Zarr submodule %s",
access_status,
asset.path,
)
await ds.set_repo_config(
f"submodule.{asset.path}.github-access-status",
access_status,
file=".gitmodules",
)
await ds.commit_if_changed(
f"[backups2datalad] Update github-access-status for "
f"Zarr {asset.zarr}",
paths=[".gitmodules"],
check_dirty=False,
commit_date=ts,
)
ds.assert_no_duplicates_in_gitmodules()
log.debug("Zarr %s: Changes saved", asset.zarr)
# now that we have as a subdataset and know that it is all good,
71 changes: 71 additions & 0 deletions src/backups2datalad/syncer.py
@@ -1,6 +1,7 @@
from __future__ import annotations

from dataclasses import dataclass, field
from pathlib import Path

from dandi.consts import EmbargoStatus
from ghrepo import GHRepo
@@ -72,6 +73,10 @@ async def update_embargo_status(self) -> None:
private=False,
)

# Update GitHub access status for all Zarr repositories
if self.config.zarr_gh_org is not None:
await self.update_zarr_repos_privacy()

async def sync_assets(self) -> None:
self.log.info("Syncing assets...")
report = await async_assets(
@@ -135,3 +140,69 @@ def get_commit_message(self) -> str:
if not msgparts:
msgparts.append("Only some metadata updates")
return f"[backups2datalad] {', '.join(msgparts)}"

async def update_zarr_repos_privacy(self) -> None:
"""
Update all Zarr GitHub repositories to public when the parent Dandiset
is unembargoed. Also updates the github-access-status in .gitmodules
for all Zarr submodules.
"""
# Only proceed if we have GitHub org configured for both
# Dandisets and Zarrs
if not (self.config.gh_org and self.config.zarr_gh_org):
return

self.log.info("Updating privacy for Zarr repositories...")

# Get all submodules from the dataset
submodules = await self.ds.get_subdatasets()

# Track which submodules we've updated for .gitmodules
updated_submodules = {}

for submodule in submodules:
path = submodule["path"]
basename = Path(path).name

# Check if this is a Zarr submodule (typical zarr files end
# with .zarr or .ngff)
if basename.endswith((".zarr", ".ngff")):
submodule_path = submodule["gitmodule_path"]
zarr_id = Path(submodule["gitmodule_url"]).name

# Update the GitHub repository privacy to public
try:
self.log.info("Making Zarr repository %s public", zarr_id)
await self.manager.edit_github_repo(
GHRepo(self.config.zarr_gh_org, zarr_id),
private=False,
)

# Track for updating .gitmodules
updated_submodules[submodule_path] = "public"
except Exception as e:
self.log.error(
"Failed to update Zarr repository %s privacy: %s",
zarr_id,
str(e),
)

# Update github-access-status in .gitmodules for all Zarr submodules
if updated_submodules:
self.log.info(
"Updating github-access-status in .gitmodules for %d Zarr "
"submodules",
len(updated_submodules),
)

for path, status in updated_submodules.items():
await self.ds.set_repo_config(
f"submodule.{path}.github-access-status", status, file=".gitmodules"
)

# Commit the changes to .gitmodules
await self.ds.commit_if_changed(
"[backups2datalad] Update github-access-status for Zarr " "submodules",
paths=[".gitmodules"],
check_dirty=False,
)
11 changes: 10 additions & 1 deletion src/backups2datalad/zarr.py
@@ -13,6 +13,7 @@
from aiobotocore.config import AioConfig
from aiobotocore.session import get_session
from botocore import UNSIGNED
from dandi.consts import EmbargoStatus
from pydantic import BaseModel
from zarr_checksum.tree import ZarrChecksumTree

@@ -508,6 +509,7 @@ async def sync_zarr(
manager: Manager,
link: ZarrLink | None = None,
error_on_change: bool = False,
embargo_status: EmbargoStatus = EmbargoStatus.OPEN,
) -> None:
async with manager.config.zarr_limit:
assert manager.config.zarrs is not None
@@ -524,6 +526,7 @@
backup_remote=manager.config.zarrs.remote,
backend="MD5E",
cfg_proc=None,
embargo_status=embargo_status,
)
if not (ds.pathobj / ".dandi" / ".gitattributes").exists():
manager.log.debug("Excluding .dandi/ from git-annex")
@@ -540,10 +543,16 @@
)
if (zgh := manager.config.zarrs.github_org) is not None:
manager.log.debug("Creating GitHub sibling")
# Override default embargo status (from dataset) with parent
# dandiset's status
await ds.set_embargo_status(embargo_status)
await ds.create_github_sibling(
owner=zgh, name=asset.zarr, backup_remote=manager.config.zarrs.remote
)
manager.log.debug("Created GitHub sibling")
manager.log.debug(
"Created GitHub sibling with privacy %s",
"private" if embargo_status is EmbargoStatus.EMBARGOED else "public",
)
if await ds.is_dirty():
raise RuntimeError(
f"Zarr {asset.zarr} in Dandiset {asset.dandiset_id} is dirty;"