lotus-miner deal status not in sync with on chain status of deals #185

Closed
11 of 18 tasks
f8-ptrk opened this issue Jan 10, 2022 · 42 comments
Labels
area/deals Area: Deals area/retrieval Area: Retrieval kind/bug Kind: Bug

Comments

@f8-ptrk
Contributor

f8-ptrk commented Jan 10, 2022

Checklist

  • This is not a security-related bug/issue. If it is, please follow the security policy.
  • This is not a question or a support request. If you have any lotus related questions, please ask in the lotus forum.
  • This is not a new feature request. If it is, please file a feature request instead.
  • This is not an enhancement request. If it is, please file an improvement suggestion instead.
  • I have searched on the issue tracker and the lotus forum, and there is no existing related issue or discussion.
  • I am running the latest release, or the most recent RC (release candidate) for the upcoming release or the dev branch (master), or have an issue updating to any of these.
  • I did not make any code changes to lotus.

Lotus component

  • lotus daemon - chain sync
  • lotus miner - mining and block production
  • lotus miner/worker - sealing
  • lotus miner - proving(WindowPoSt)
  • lotus miner/market - storage deal
  • lotus miner/market - retrieval deal
  • lotus miner/market - data transfer
  • lotus client
  • lotus JSON-RPC API
  • lotus message management (mpool)
  • Other

Lotus Version

1.13.2-rc4

Describe the Bug

when lotus-miner declares deals as erroneous and then seals them anyway, the deal state in the list output for deals and the active deals count in the info output don't represent the chain state of the deals.

short: lotus-miner does not show active deals as active.

might be more than one bug that leads to this phenomenon.

Logging Information

ask if needed

Repro Steps

  1. Run a couple of hundred deals; maybe it appears, maybe not. It is hard to say without checking every single sector a miner has on chain and comparing it with the lotus-miner outputs.
@Reiers

Reiers commented Jan 10, 2022

Hi @f8-ptrk

Thanks for creating the ticket.
I have added labels and will assign it to the right (several) teams for analysis.
We might tag you again for more information.

@jennijuju
Member

this is great ux feedback - transferring to boost so they will make sure they cover this in their design.

In lotus, we should make the sector fsm states up to date for sector list. @Reiers could you please confirm if we have a ticket for this? If not, let's def create one; that is not too hard to improve but has been quite a pain in the ass.

@jennijuju jennijuju transferred this issue from filecoin-project/lotus Jan 10, 2022
@RobQuistNL
Contributor

Related: #260, #261

@f8-ptrk
Contributor Author

f8-ptrk commented Feb 22, 2022

@Reiers this should actually live in the lotus repo

@SBudo

SBudo commented Feb 23, 2022

I have the same issue:
On random deals, after the deal gets handed to the sealing subsystem for PC1, the market node pops up the error below and marks the deal as failed ("Error"). The thing is that the sealing process actually goes fine, the sector is finalised and the deal is activated on the chain.
In other words, on my side (in my logs), I see a "StorageDealError" status, while on the client side and on the chain the deal is fine and active.
Note that the daemon is absolutely fine, no chain pruning, nor have I enabled Splitstore.

It's been happening a LOT more lately, and since we're only doing deals to increase storage, it's becoming a big problem as I have to check manually whether each errored deal has gone through :-(

Storage Deal status (lotus-miner storage-deals list -v):

error awaiting deal pre-commit: failed to set up called handler: called check error (h: 1570875): failed to look up deal on chain: deal 3964785 not found - deal may not have completed sealing before deal proposal start epoch, or deal may have been slashed

Market node logs (with the hand off first)

2022-02-21T19:58:30.126+1100 INFO providerstates providerstates/provider_states.go:329 handing off deal to sealing subsystem {"pieceCid": "baga6ea4seaqpm5ipi346kviurqgeglar3qkg4yn2r32fb5aanuza3z24jghymky", "proposalCid": "bafyreihr743zllr2eckgfiweouiap7pgcjnoyqldqa3mg3t75jjt7sfcpu"}

2022-02-21T20:00:54.519+1100 INFO providerstates providerstates/provider_states.go:376 successfully handed off deal to sealing subsystem {"pieceCid": "baga6ea4seaqpm5ipi346kviurqgeglar3qkg4yn2r32fb5aanuza3z24jghymky", "proposalCid": "bafyreihr743zllr2eckgfiweouiap7pgcjnoyqldqa3mg3t75jjt7sfcpu"}

2022-02-21T20:00:54.522+1100 INFO markets loggers/loggers.go:20 storage provider event {"name": "ProviderEventDealHandedOff", "proposal CID": "bafyreihr743zllr2eckgfiweouiap7pgcjnoyqldqa3mg3t75jjt7sfcpu", "state": "StorageDealAwaitingPreCommit", "message": ""}

2022-02-21T20:06:01.555+1100 INFO markets loggers/loggers.go:20 storage provider event {"name": "ProviderEventDealPrecommitFailed", "proposal CID": "bafyreihr743zllr2eckgfiweouiap7pgcjnoyqldqa3mg3t75jjt7sfcpu", "state": "StorageDealFailing", "message": "error awaiting deal pre-commit: failed to set up called handler: called check error (h: 1570875): failed to look up deal on chain: deal 3964785 not found - deal may not have completed sealing before deal proposal start epoch, or deal may have been slashed"}

2022-02-21T20:06:01.559+1100 WARN providerstates providerstates/provider_states.go:561 deal bafyreihr743zllr2eckgfiweouiap7pgcjnoyqldqa3mg3t75jjt7sfcpu failed: error awaiting deal pre-commit: failed to set up called handler: called check error (h: 1570875): failed to look up deal on chain: deal 3964785 not found - deal may not have completed sealing before deal proposal start epoch, or deal may have been slashed

2022-02-21T20:06:03.950+1100 INFO markets loggers/loggers.go:20 storage provider event {"name": "ProviderEventFailed", "proposal CID": "bafyreihr743zllr2eckgfiweouiap7pgcjnoyqldqa3mg3t75jjt7sfcpu", "state": "StorageDealError", "message": "error awaiting deal pre-commit: failed to set up called handler: called check error (h: 1570875): failed to look up deal on chain: deal 3964785 not found - deal may not have completed sealing before deal proposal start epoch, or deal may have been slashed"}
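To avoid checking each errored deal by hand, the log lines above could be scanned for the proposal CID and on-chain deal ID of every deal that ended in StorageDealError. This is only a minimal sketch: it assumes the log format is exactly as pasted above (a JSON payload after "storage provider event"), and the function name `failed_deals` is made up for illustration. The resulting deal IDs would still need to be checked against the chain by whatever means you prefer.

```python
import json
import re

# Matches the on-chain deal ID inside the error text shown in the logs above.
DEAL_ID_RE = re.compile(r"deal (\d+) not found")

# Text that precedes the JSON payload in the market node log lines above.
EVENT_MARKER = "storage provider event "


def failed_deals(log_lines):
    """Yield (proposal_cid, deal_id) for each event whose state is StorageDealError.

    deal_id is None when the message does not contain a "deal N not found" fragment.
    """
    for line in log_lines:
        idx = line.find(EVENT_MARKER)
        if idx == -1:
            continue  # not a storage provider event line
        try:
            event = json.loads(line[idx + len(EVENT_MARKER):])
        except json.JSONDecodeError:
            continue  # payload is not valid JSON; skip rather than guess
        if event.get("state") != "StorageDealError":
            continue
        match = DEAL_ID_RE.search(event.get("message", ""))
        deal_id = int(match.group(1)) if match else None
        yield event.get("proposal CID"), deal_id
```

Feeding it the lines of the market node log (for example `failed_deals(open(path))`, where `path` is wherever your log lives) gives a list of candidates to compare with the chain.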

@sypro2020

I am affected by this too, it looks like more than half of my deals.

[Screenshot 2022-05-10, 5:12:05 AM]

@f8-ptrk f8-ptrk closed this as completed Jun 14, 2022
@f8-ptrk f8-ptrk moved this to Done in Boost Jun 14, 2022
@f8-ptrk f8-ptrk reopened this Jun 14, 2022
@f8-ptrk
Contributor Author

f8-ptrk commented Jun 14, 2022

is this still an issue with boost?

@SBudo

SBudo commented Jun 15, 2022

I haven't seen it happening on boost, so I think it might have been a market node problem. I would close it. We can start a new issue if it appears on boost.

@RobQuistNL
Contributor

80% of miners are still using Markets
70% of data from them is unretrievable because of this

It's still an issue

@f8-ptrk
Contributor Author

f8-ptrk commented Jun 15, 2022

i think markets are dead and dev is fully on boost by now?

@marshyonline

I'm seeing this issue on more than one monolith miner+markets setup.
Deals are in error status, but power still shows the correct adjusted power, as if the deals were active.

@SBudo

SBudo commented Jun 16, 2022

@f8-ptrk @RobQuistNL Yeah, I don't think there is any more dev on markets. Boost v1 just got released and they are working on a boost version that will work with the v16 network upgrade, so I'm going to make a guess that markets will be dead after v16 (?).

@marshyonline those issues seem to have been fixed in boost

@f8-ptrk
Contributor Author

f8-ptrk commented Jun 16, 2022

seem to or have been fixed? will this clean up the mess the market v1 created?

@SBudo

SBudo commented Jun 16, 2022

I've been on boost for a couple of months now, and the error has not appeared.
Whether or not it's fixed: only time will tell I guess, but the above is a good sign

@momack2

momack2 commented Jun 17, 2022

hey @dirkmc - are you the DRI for fully resolving this issue now that boost is GA and the default path going forward? I want to make sure we resolve this sync issue more holistically than this new "register shard" one-off command (#517) currently does. Effectively, we should ensure there is a process that audits the dagstore and repairs shards automatically, not wait for an SP to detect and manually fix an issue.

Please LMK if you need more support from the lotus team on this - but I hear this is still a major issue for Evergreen SPs effectively serving retrieval issues - and therefore think this should be a P1 with a clear, dedicated DRI.

@dirkmc
Contributor

dirkmc commented Jun 20, 2022

Switching from markets to Boost will resolve the "failed to lookup deal on chain" problem.
Boost also gives the user the option to retry or cancel the deal when there is a non-fatal error at any stage of the process.

With regards to syncing the deal state from the state of the chain, we have an open ticket to do so: Investigate syncing Boost with the latest state of the miner and/or chain.
It seems like this is less of a priority given that the underlying problem is solved in the ways mentioned above.

@f8-ptrk
Contributor Author

f8-ptrk commented Jun 20, 2022

this is not a failed to look up deal on chain problem. this is a local database problem. the deal actually never fails - it fails to be recorded in the local database.

the only way to make this clean is to sync the local database with what's on chain and then sync with the sector store. everything else will fail again in the future

in the end it's a design problem of using multiple databases that record the "same information" and aren't able to stay in sync!

@dirkmc
Contributor

dirkmc commented Jun 20, 2022

The underlying issue is that after the deal has been submitted to the sealing node, markets attempts to look up the deal state on chain, and this lookup fails.
Boost doesn't do this lookup, so this is not an issue in Boost.

@dirkmc
Contributor

dirkmc commented Jun 20, 2022

Let me clarify that last sentence. Boost doesn't do the lookup for deals made with v1.2.0 of the deal proposal protocol.
Deals made with v1.1.0 of the deal proposal protocol (legacy deals) go through the old markets code base, so this will still be an issue for v1.1.0 deals.

One straightforward fix would be for us to modify the legacy markets code base so that it no longer tries to look up the state of the deal on chain after handing it off to the sealing subsystem. My understanding is that people are not really interested in checking the state of the deal on chain by using lotus-miner storage-deals list etc. Instead, people normally just look up the deal using web UIs like filfox. Would anyone have objections to us no longer reporting the status of the deal on chain through the lotus-miner storage-deals CLI commands?

@RobQuistNL
Contributor

I would like my local miner to know what deals are on there without needing to parse the entire chain or use an external API to do that.

It would also be nice if we were able to retrieve all the exabytes of storage that are on the network now from the slingshot events. A lot of that is not retrievable now, and if I'm reading this correctly you're saying it's never going to work because they are old deals.

@stuberman

@f8-ptrk does this look related to 547?

@SBudo

SBudo commented Jun 20, 2022

I don't mind not having the "storage-deals list" on the miner since this function is now available on the boost node.
For me, I think all market calls should be located/instantiated from the boost node (remove the load/complexities from the miner)

@f8-ptrk
Contributor Author

f8-ptrk commented Jun 21, 2022

i very much care about not having that call removed from lotus-miner! boost is an option to use and we shouldn't rely on it being present!

i agree with rob: without this being fixed for past deals we have a huge problem as we have 100PB dead data on the chain that cannot be retrieved in the end

@SBudo

SBudo commented Jun 21, 2022

Two separate issues here:

  • Depending on the roadmap, if boost is to replace the market in the future, fixing the issue in market for upcoming deals is a bit moot (well, depending on how long it takes to migrate across). That would mean that, going forward, the issue no longer appears.
  • Fixing previously errored deals whose state in the miner does not match the chain. Whether you use boost or the miner to check the deal status doesn't matter, as this is across both boost and market. The only way to fix that one would be to have a tool that checks on-chain state vs. the database (in market or boost).
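The on-chain vs. database check from the second point could look roughly like the sketch below. It is only an illustration of the comparison step: `find_mismatches` is a hypothetical helper, `local_states` stands for a mapping of deal ID to state exported from the market or boost database, and `active_on_chain` stands for the set of deal IDs that the chain reports as active. How those two inputs are obtained is deliberately left out, since that depends on the setup.

```python
def find_mismatches(local_states, active_on_chain):
    """Return deal IDs marked as failed locally but active on chain.

    local_states:    dict of deal ID -> local state string (hypothetical export
                     from the market/boost database)
    active_on_chain: set of deal IDs the chain reports as active (hypothetical
                     input, fetched separately)
    """
    return sorted(
        deal_id
        for deal_id, state in local_states.items()
        if state == "StorageDealError" and deal_id in active_on_chain
    )


if __name__ == "__main__":
    # Sample data only; deal 3964785 is the one from the log excerpt earlier
    # in this thread, the other entries are made up.
    local = {3964785: "StorageDealError", 1000001: "StorageDealError", 1000002: "StorageDealActive"}
    on_chain = {3964785, 1000002}
    print(find_mismatches(local, on_chain))
```

The deals it returns are the ones whose local record would need to be repaired, which is the part that still needs a real tool.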

@f8-ptrk
Contributor Author

f8-ptrk commented Jun 21, 2022

where does the wrong data reside right now? in the miner or in markets/boost? who is responsible for the underlying database that causes the problems? that's where it needs to be fixed for past deals.

@benjaminh83

I agree that needs a fix.

Let’s put this into perspective - the proofs and lotus teams have put a great effort into implementing snap up, which potentially allows us to activate all the CC out there.

Now, on the other hand we cannot prioritise fixing broken deal data? Then we might just as well count all this as dead data. It’s even worse than CC, as it cannot be snapped up, and it cannot be retrieved.

I guess I just don’t buy that we won’t get a fix, as it is boring to fix old stuff, rather than building new features.

@jennijuju
Member

@dirkmc Afaik lotus-miner storage-deals list is commonly used by SPs for checking deal status regularly - and developers use this for monitoring devnet development testing as well. I wasn’t aware/don’t think checking deals using an explorer was more common than using this CLI.

@benjaminh83 while I understand your point, I think it’s important to point out that the core devs/maintainers for boost were also the main contributors to market protocol v1 & lotus market. The same team is now not only developing new deal-making software, but also developing a new version of the market protocol. It’s common that certain bug fixes are only included in the newer version of a protocol/software, without backporting to the older/potentially-to-be-deprecated version.
With such a small team (2 main engineers atm), prioritizing among all incoming bug reports & feature requests for two versions of the market protocol & two pieces of software is not easy, and in order to spend the engineering resources on the highest-impact items, some asks will get cut off by the effort<>impact trade-off analysis. (afaik, the team is focusing on scaling/perfecting storage deal onboarding atm more than retrieval)
What I’m trying to say here is that I def don’t think the boost team is avoiding bug fixes cuz it’s less exciting work; however, I think the team is missing insights on:

  • why this is critical to storage providers: what the bug is, and what exactly it is blocking and how (if you can provide an example of a retrieval deal to your SP, the log when it errors out, or even how to repro, I think that can be helpful)
  • why SPs still need this to be fixed on the lotus market side when boost doesn’t have the issue; more precisely, why SPs aren’t converting their production line to boost in the first place, and what the concern/blocker is there.

@f8-ptrk
Contributor Author

f8-ptrk commented Jun 21, 2022

it's critical as a lot of deals are relying on first being able to be retrieved (either evergreen or avoiding re-transfer for renewals). evergreen records a 98% fail rate for retrievals afaik for example. Most of the slingshot deals cannot be retrieved.

as a storage provider, with that bug in existence, it is hard to make serious storage deals knowing that the data cannot be retrieved by the client anymore.

the problem is that it affects possibly a lot of old deals and boost doesn't seem to fix the issue for old deals and they stay irretrievable forever - are dead weight, as benjamin said.

@RobQuistNL
Contributor

Shouldn't this be fixed by manually being able to reimport a missing piece?

LexLuthr already backported this function: filecoin-project/lotus#8645

If I'm not mistaken (I haven't tried 1.17.0-rc1 yet, but will soon) - we can use this version of lotus to;

the new command < lotus-miner storage-deals list | grep "key not found"

and then we should be able to retroactively fix all the broken ones? Once again I'm not sure, but if it's a one-off thing (that we can potentially simply throw into the dagstore migration code) we should be able to add that in (and maybe even run it once a week, I'm fine with that workaround)

@dirkmc
Contributor

dirkmc commented Jun 21, 2022

Sounds like there are two issues here:

  1. In lotus the deal ends up in the Error state with an error message, but the deal still makes it on chain and gets sealed.
    This is an annoying UX bug but doesn't seem to have other consequences.
  2. The deal data doesn't get indexed.
    It's not possible to retrieve the deal data, this is a serious problem

We're planning to write a boost doctor command that will help SPs diagnose and fix problems with deals: Add a boost doctor command to enable SP's to surface known issues
For example this command will check if the deal has been indexed, and if not it will index the deal so that it can be retrieved.

As Jennifer pointed out, we are a small team with limited resources so it's a question of what to prioritize. After working on the existing codebase for about a year we realized that

  • we were spending a lot of time tracking down bugs, and in the long run we would save time by moving to a newer codebase that is easier to reason about
  • we were tied to the lotus release cycle
  • some of the assumptions made when the code was originally written turned out to be incorrect (eg unsealing on-the-fly is not realistic)
  • we would benefit from using a battle-tested data transfer protocol (HTTP)
  • using uuids as identifiers would mitigate a lot of bugs

To fix this issue we'll need to spend a significant amount of time to figure out what's causing the problem, fix and test, and then you'll need to wait for the next stable release of lotus to upgrade.
If you switch over to Boost and ask your clients to use the Boost client, then the problem will be fixed for you.

We've now released Boost v1.0.0 and we'd like to help SPs switch over. Please let us know in the #boost-help channel if there are things that are holding you back from upgrading and we'll try to address them.
If there are good reasons not to switch over then we can take a look at fixing this issue.

@f8-ptrk
Contributor Author

f8-ptrk commented Jun 21, 2022

there is no boost version for v16 that could be tested on calibration net - thats a show stopper.

for storage providers the issue is: retrievals for a lot of deals, most likely the majority, on the network don't work. for storage providers it is not relevant what happens to future deals, but what happens to the ones that are already stored.

sure, future deals are important, but the work of the last 12+ months was in the end for nothing if the stuff that was stored, 100PB, cannot be retrieved.

@dirkmc
Contributor

dirkmc commented Jun 21, 2022

> there is no boost version for v16 that could be tested on calibration net - thats a show stopper

Release v1.1-rc1 targets lotus version 16. We're working on fixing a couple of bugs in it, hopefully we'll have v1.1-rc2 today or tomorrow.

> retrievals for a lot of deals, most likely the majority, on the network don't work

We're planning to write a boost doctor command that will help SPs diagnose and fix problems with deals: Add a boost doctor command to enable SP's to surface known issues
For example this command will check if the deal has been indexed, and if not it will index the deal so that it can be retrieved.

@jennijuju
Member

@f8-ptrk #606 is a thing now.

@f8-ptrk
Contributor Author

f8-ptrk commented Jun 21, 2022

> We're planning to write a boost doctor command that will help SPs diagnose and fix problems with deals: #582
> For example this command will check if the deal has been indexed, and if not it will index the deal so that it can be retrieved

P0!

@jennijuju looking at it and someone will try it out

@momack2

momack2 commented Jun 21, 2022

Just chiming in here - it is absolutely a priority that the 100PiB of data that is currently stored on Filecoin (ex major datasets like below⬇️) are retrievable, and we fix the bugs in retrieving these v1.1.0 deals. Great to fix this in boost (yay), but we also need retrievals on old deals to work. The priority of the CLI command for listing local deals is definitely less important than retrievals working for all old deals, imho - so if that's the fastest way to fix that for now, I would be supportive. But I also agree with @RobQuistNL's comment that ideally there's a fix that does still allow listing out old deals (afaik Boost will only list out deals made with boost, and there are a lot of deals out there that have already been made and deserve support, attention, love, and proper consideration.)

image

@dirkmc
Contributor

dirkmc commented Jun 22, 2022

> Boost will only list out deals made with boost

In the Boost UI you can see deals made with v1.1.0 of the deal proposal protocol (legacy deals) and deals made with v1.2.0 of the deal proposal protocol. See https://boost.filecoin.io for screenshots.

Note that in Boost, deals made with either version of the protocol are retrievable.

This issue is open against the Boost repo so I've been discussing it in terms of how to fix it in Boost.
With regards to fixing retrievals for legacy deals made against lotus: when a deal fails in lotus before the indexing stage, the piece does not get registered with the DAG store. We added a command to allow SPs to manually register the piece: feat: dagstore: add dagstore register-shard command. With this command users can tell the DAG store to register the piece and index it, so that it will be retrievable:

lotus-miner dagstore register-shard <piece cid>

This command is available in v1.17.x of lotus
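For SPs with many affected pieces, the command above could be run in a loop. The following is only a sketch under stated assumptions: the only CLI call it relies on is `lotus-miner dagstore register-shard <piece cid>` as quoted above, the helper names `register_shard_commands` and `run_all` are hypothetical, and the list of piece CIDs has to come from your own records. It defaults to a dry run that just prints the commands.

```python
import subprocess


def register_shard_commands(piece_cids):
    """Build one `lotus-miner dagstore register-shard <piece cid>` argv per unique piece CID."""
    seen = set()
    commands = []
    for cid in piece_cids:
        cid = cid.strip()
        if not cid or cid in seen:
            continue  # skip blank lines and duplicates
        seen.add(cid)
        commands.append(["lotus-miner", "dagstore", "register-shard", cid])
    return commands


def run_all(commands, dry_run=True):
    """Print each command, and execute it only when dry_run is False."""
    for argv in commands:
        print(" ".join(argv))
        if not dry_run:
            # check=True stops at the first failing registration
            subprocess.run(argv, check=True)
```

For example, `run_all(register_shard_commands(open("piece_cids.txt")))` would print one command per piece CID listed in a (hypothetical) `piece_cids.txt` file; passing `dry_run=False` would actually execute them.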

@dirkmc
Contributor

dirkmc commented Jun 23, 2022

Just to be clear we are committed to making sure all deals can be retrieved, both new deals and deals made in the past, whether SPs are running lotus or boost.
If you have questions please also feel free to ask in #boost-help

@stuberman

> there are a lot of deals out there that have already been made and deserve support, attention, love, and proper consideration

+1 for giving those deals plenty of love...

@flyworker

flyworker commented Jun 30, 2022

Hey, this is Charles from the Filswan team. We suffer from this problem when we do cross-chain storage: because the sync status is not accurate on the lotus client, we cannot unlock user funds. Something like this:

StorageDealError
 error waiting for deal pre-commit message to appear on chain: failed to set up called handler: called check error (h: 1905218): failed to look up deal on chain: deal 7579639 not found - deal may not have completed sealing before deal proposal start epoch, or deal may have been slashed

lotus 1.5
But the deal actually is active. Do we have a workaround for it before the fix?
Thanks

@momack2

momack2 commented Jul 4, 2022

@flyworker - can you try the command @dirkmc mentioned in #185 (comment) and see if that manually fixes the issue you're seeing?

@f8-ptrk
Contributor Author

f8-ptrk commented Jul 5, 2022

1.17.x is not in a stable state yet @momack2

@jacobheun
Contributor

Hi everyone, as this issue is currently referencing several issues (chain status not in sync, and multiple, separate retrieval problems), we've separated out the larger retrieval problem into a tracking issue, #645, and will be aggregating all bugs, fixes, and new features we develop to resolve these issues. We'll be posting updates there at least weekly, so please subscribe to that issue if you want to follow updates. Our goal is to leverage that issue to track the problem more holistically.

Thank you to everyone who's been submitting issues in Boost and Lotus, we'll be cross referencing issues from Lotus and Slack over the next few days, and ensure they're tracked in #645.

Improving retrieval is currently the priority for the team, for all deals (old and new), with an emphasis on supporting deal renewal programs. While our primary focus will be landing fixes in Boost, as we can ship these more frequently, we will be ensuring that we have workarounds, at minimum, in Lotus markets for critical issues.

@jacobheun jacobheun added area/retrieval Area: Retrieval kind/bug Kind: Bug area/deals Area: Deals labels Nov 25, 2022
@f8-ptrk f8-ptrk closed this as completed Aug 22, 2024
Projects
Status: Done