feat: ignore files not in remote when push is false #10749

Open: @Northo wants to merge 4 commits into main from fix/10317/ignore-not-in-remote-push-false

@Northo Northo commented May 22, 2025

Fixes #10317

Adds an opt-in way to exclude stage outputs marked push: false from the not_in_remote results of data status.

Notable changes:

  • Add outs_no_push to the keys in dvc.stage.utils.fill_stage_outputs, to make it easier to create outputs with push: false.
  • In data status, when the flag is enabled, go through the files reported as not_in_remote and drop those whose output has can_push set to false.
  • Add the corresponding --respect-no-push flag to the CLI.

Open to suggestions on how to make the flag names more intuitive!

Corresponding PR for the docs: iterative/dvc.org#5373
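
The filtering described in the notable changes can be sketched roughly as follows. This is an illustration only, not the code in this PR: the `Out` class and the `filter_pushable` helper are hypothetical stand-ins, and only the `can_push` attribute name comes from the description above.

```python
from dataclasses import dataclass


@dataclass
class Out:
    """Hypothetical stand-in for a stage output (not DVC's Output class)."""

    path: str
    can_push: bool = True


def filter_pushable(not_in_remote, outs):
    """Drop paths that belong to an output with can_push == False."""
    no_push = [out.path.rstrip("/") for out in outs if not out.can_push]

    def is_no_push(path):
        # A path is covered if it is the output itself or lies inside it
        # (the latter handles files reported from a directory output).
        return any(path == p or path.startswith(p + "/") for p in no_push)

    return [path for path in not_in_remote if not is_no_push(path)]


outs = [Out("foo", can_push=False), Out("bar")]
remaining = filter_pushable(["foo", "bar"], outs)
print(remaining)
```

With the flag off, the caller would simply skip `filter_pushable` and report the list unchanged.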

Thank you for the contribution - we'll try to review it as soon as possible. 🙏

codecov bot commented May 22, 2025

Codecov Report

Attention: Patch coverage is 93.10345% with 2 lines in your changes missing coverage. Please review.

Project coverage is 91.10%. Comparing base (2431ec6) to head (b5a6a58).
Report is 51 commits behind head on main.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| dvc/repo/data.py | 86.66% | 1 Missing and 1 partial ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##             main   #10749      +/-   ##
==========================================
+ Coverage   90.68%   91.10%   +0.41%     
==========================================
  Files         504      504              
  Lines       39795    39997     +202     
  Branches     3141     3160      +19     
==========================================
+ Hits        36087    36438     +351     
+ Misses       3042     2934     -108     
+ Partials      666      625      -41     

@Northo Northo force-pushed the fix/10317/ignore-not-in-remote-push-false branch 3 times, most recently from a3fe1e0 to 4537674 Compare May 22, 2025 12:33
@Northo Northo force-pushed the fix/10317/ignore-not-in-remote-push-false branch from 4537674 to b5a6a58 Compare May 22, 2025 12:58
@Northo Northo changed the title Fix/10317/ignore not in remote push false feat: ignore files not in remote when push is false May 22, 2025
@Northo Northo marked this pull request as ready for review May 22, 2025 13:24
@skshetry skshetry added this to DVC May 25, 2025
@skshetry skshetry moved this to Review In Progress in DVC May 25, 2025
Author

Northo commented Jun 5, 2025

@skshetry, have you had time to look at this? This feature would be really great for our team!

@skshetry skshetry moved this from Review In Progress to In Progress in DVC Jun 7, 2025
Comment on lines +183 to +188
data_status_parser.add_argument(
    "--respect-no-push",
    action="store_true",
    default=False,
    help="Respect the `push: false` flag in the DVC stage's outs.",
)
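
For reference, a minimal standalone sketch of how a `store_true` flag like this one parses. The parser here is a plain `argparse.ArgumentParser` standing in for DVC's `data_status_parser`, and the `prog` name is made up; this is not DVC's actual CLI setup.

```python
import argparse

# Stand-in parser; in the PR the argument is added to data_status_parser.
parser = argparse.ArgumentParser(prog="data-status-sketch")
parser.add_argument(
    "--respect-no-push",
    action="store_true",
    default=False,
    help="Respect the `push: false` flag in the DVC stage's outs.",
)

# Without the flag the attribute keeps its default; with it, it flips to True.
off = parser.parse_args([])
on = parser.parse_args(["--respect-no-push"])
print(off.respect_no_push, on.respect_no_push)
```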
Collaborator

I think we can enable this by default. dvc status does the same.

See:

Comment on lines +263 to +268
not_in_remote = uncommitted_diff.pop("not_in_remote", [])

if respect_no_push:
    logger.debug("Filtering out paths that are not pushable")
    not_in_remote = _filter_out_push_false_outs(repo, not_in_remote)

Collaborator

I think we have to extract the "not_in_remote" checks from the _diff() above, as they have nothing to do with the diff between the worktree and the committed changes (index). We need to calculate them only for the given index (committed changes).

That is what repo.index is. It's an index of committed changes. You can filter that index to a view, using worktree_view:

def worktree_view(

if not_in_remote:
    view = worktree_view(repo.index, push=True)
    # ... existing logic

push=True gives us:

push: Whether the view should be restricted to pushable data only.

You can access the DataIndex using view.data["repo"], and then call iteritems(shallow=not granular) on it.

Let me know if you need help. I can take over the PR if you prefer that way.

Collaborator

@skshetry skshetry Jun 9, 2025

You can extract the not_in_remote logic above. It uses the Change type, but with a minor modification it can use the Entry type instead: change.old and change.new are just Entry objects.

Something like the following, maybe:

data_index = view.data["repo"]
for key, entry in data_index.iteritems(shallow=not granular):
    if not (entry and entry.hash_info):
        continue

    k = (*key, "") if entry.meta and entry.meta.isdir else key
    try:
        if not data_index.storage_map.remote_exists(entry, refresh=remote_refresh):
            yield os.path.sep.join(k)
    except StorageError:
        pass
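
To see what a loop like the one above yields, here is a self-contained sketch that uses stand-in classes instead of DVC's index types. `Meta`, `Entry`, `paths_not_in_remote` and the `exists_in_remote` callback are all hypothetical; the real code would query the storage map as in the snippet above.

```python
import os
from dataclasses import dataclass
from typing import Callable, Iterable, Iterator, Optional, Tuple

Key = Tuple[str, ...]


@dataclass
class Meta:
    """Hypothetical stand-in carrying only the isdir flag."""

    isdir: bool = False


@dataclass
class Entry:
    """Hypothetical stand-in for an index entry."""

    hash_info: Optional[str] = None
    meta: Optional[Meta] = None


def paths_not_in_remote(
    items: Iterable[Tuple[Key, Optional[Entry]]],
    exists_in_remote: Callable[[Entry], bool],
) -> Iterator[str]:
    for key, entry in items:
        # Skip missing entries and entries without a hash.
        if not (entry and entry.hash_info):
            continue
        # A trailing empty component makes directories end with a separator.
        k = (*key, "") if entry.meta and entry.meta.isdir else key
        if not exists_in_remote(entry):
            yield os.path.sep.join(k)


items = [
    (("data", "raw"), Entry("h1", Meta(isdir=True))),
    (("model.pkl",), Entry("h2", Meta())),
    (("untracked",), None),
    (("pushed.csv",), Entry("h3", Meta())),
]
# Pretend only the entry with hash "h3" is already in the remote.
missing = list(paths_not_in_remote(items, lambda e: e.hash_info == "h3"))
print(missing)
```

The directory entry is reported with a trailing path separator, which is the effect of the `(*key, "")` trick in the reviewer's snippet.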

@@ -62,6 +62,7 @@ def fill_stage_outputs(stage, **kwargs):
         "plots_persist_no_cache",
         "outs_no_cache",
         "outs",
+        "outs_no_push",
Collaborator

@skshetry skshetry Jun 9, 2025

Let's not add these please. These are also used in CLIs.

I'll suggest a different way to create a stage in tests below with push=False.

Comment on lines +432 to +434
dvc.stage.add(
    name="create-foo", cmd="echo foo > foo", deps=["fixed"], outs_no_push=["foo"]
)
Collaborator

@skshetry skshetry Jun 9, 2025

This is a bit verbose, but it's better than adding outs_no_push arg to stage.add() function.

Suggested change
-    dvc.stage.add(
-        name="create-foo", cmd="echo foo > foo", deps=["fixed"], outs_no_push=["foo"]
-    )
+    stage = dvc.stage.create(
+        name="create-foo", cmd="echo foo > foo", deps=["fixed"], outs=["foo"]
+    )
+    stage.outs[0].can_push = False
+    stage.dump()

Collaborator

Btw, you don't have to create a pipeline stage, which can be slower.

Suggested change
-    dvc.stage.add(
-        name="create-foo", cmd="echo foo > foo", deps=["fixed"], outs_no_push=["foo"]
-    )
+    stage = dvc.stage.create(single_stage=True, outs=["foo"])
+    stage.outs[0].can_push = False
+    stage.dump()

Comment on lines +438 to +440
assert set(
    dvc.data_status(remote_refresh=True, not_in_remote=True)["not_in_remote"]
) == {"foo", "bar"}
Collaborator

Please move all of these steps into the test itself. If we get rid of --respect-no-push, we'd only need one single test.

@skshetry skshetry moved this from In Progress to Review In Progress in DVC Jun 9, 2025
Projects
Status: Review In Progress
Development

Successfully merging this pull request may close these issues.

data status returns files as "Not in remote" even though they are marked as push: false in pipeline
2 participants