[LU-128] OSSs frequent crashes due to LBUG/[ASSERTION(last_rcvd>=le64_to_cpu(lcd->lcd_last_transno)) failed] in recovery Created: 15/Mar/11 Updated: 25/Aug/11 Resolved: 14/Jun/11 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.0.0 |
| Fix Version/s: | Lustre 2.1.0 |
| Type: | Bug | Priority: | Major |
| Reporter: | Sebastien Buisson (Inactive) | Assignee: | Niu Yawei (Inactive) |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Environment: |
x86_64, RHEL6 |
||
| Attachments: |
|
| Severity: | 3 |
| Bugzilla ID: | 24420 |
| Rank (Obsolete): | 5040 |
| Description |
|
As suggested by Peter Jones, we are opening a Jira ticket for this issue in order to get the fix landed in 2.1. We hit this bug when we reboot some OSSs; it is raised during the recovery phase. Each time it crashes, the panicking thread's stack trace looked like the following: In a particular analysis from our on-site support, we got the following values when the LBUG was hit: lcd_last_transno=0x4ddebb. So the client (lcd_last_transno) has a bad transaction number compared with the actual value. I could see there is a similar bug (bz23296), but I don't think that bug is related to this one. Does this sound like a known bug to you? What would be the best way to work around this bug? I also attach here the patch from bugzilla 24420 that has already landed in 1.8.6. Thanks, |
| Comments |
| Comment by Mikhail Pershin [ 15/Mar/11 ] |
|
> lcd_last_transno=0x4ddebb This is not correct: lcd_last_transno is not the client's transno but the latest transno written on the server; it is stored in 'client_data' per export. Given that it is on disk, it was committed, and therefore all lower transactions must be committed as well. But the client tries to replay transaction 0x4ddeba, which is bad and shouldn't happen. Therefore we have the following possible cases: I doubt we have case 1), as it is generic code for the MDS/OSS. Also, this always happens with IO replays; maybe something is wrong in that path. I'd propose to track how such replays are handled on the server. |
| Comment by Mikhail Pershin [ 15/Mar/11 ] |
|
The patch is correct, but only in that it avoids an assertion on wire data; the bug itself is not fixed and still exists. It is fundamentally wrong to try to execute the lower-transno request. Do you have a lustre log from that crash? |
| Comment by Diego Moreno (Inactive) [ 15/Mar/11 ] |
|
Thanks for your feedback, Mikhail. We asked our client to activate these debug levels:
They cannot set the full debug level as their filesystem is in production. Do you think this will be enough to trace transnos? |
| Comment by Sebastien Buisson (Inactive) [ 15/Mar/11 ] |
|
Hi, The patch is only considered a work-around for us, but it is vital, because without it production is completely blocked at the customer site. Unfortunately we do not have the lustre debug logs; all the information we have is in bugzilla 24420. Regards, |
| Comment by Mikhail Pershin [ 15/Mar/11 ] |
|
Diego, HA should help a lot. Sebastien, am I right that bug was seen only with 1.8.x so far? |
| Comment by Sebastien Buisson (Inactive) [ 15/Mar/11 ] |
|
No, our customer runs Lustre 2.0.0.1; this is where we have seen this issue. |
| Comment by Peter Jones [ 15/Mar/11 ] |
|
Niu, could you please look at this one? Thanks, Peter |
| Comment by Niu Yawei (Inactive) [ 15/Mar/11 ] |
|
Hi, tappro. Though 0x4ddebb was already committed, we can't guarantee that the reply with last_committed transno >= 0x4ddebb arrived at the client before the server crash, so the replay requests could still have been in the client's replay list, and they would be replayed on recovery. Did I miss anything? To my understanding, this kind of assert is incorrect (the first replay request might have a lower transno); we can simply remove it. |
| Comment by Mikhail Pershin [ 15/Mar/11 ] |
|
Upon connect the real last_committed is reported back to the client, so it shouldn't send lower replays. That is the first question: why doesn't that work? Also, the server can always compare them during replay and must drop them, because processing them breaks consistency. The assertion is in filter_finish_transno, which is the endpoint; it shouldn't get that far. Note that this assertion has been there for quite a long time, and all the cases are about bulk replay, which was introduced recently. I think this is the root of the problem; we need to review those paths. What last_committed value does the client have after reconnect? What are the server's values for next_transno, committed, and current at the moment of replay? After that it will be clearer what is wrong. |
| Comment by Mikhail Pershin [ 15/Mar/11 ] |
|
in target_queue_recovery_request() we have an exclusion for lower-transno requests:

    if (transno < obd->obd_next_recovery_transno) {
            /* Processing the queue right now, don't re-add. */
            LASSERT(cfs_list_empty(&req->rq_list));
            cfs_spin_unlock(&obd->obd_recovery_task_lock);
            RETURN(1);
    }

That is done for open replays, which can be older than the committed value, but such requests are specially kept on the client. That makes sense only on the MDS, and I am not sure about bulk replays; maybe they can also be kept even when lower than committed for some reason, though I can't imagine any. So the only question remaining is why bulk replays are possible with a lower transno; see my previous comment about last_committed, which must be applied after reconnect. |
| Comment by Niu Yawei (Inactive) [ 15/Mar/11 ] |
|
Right, the client should get last_committed upon connect and clean up all the requests with lower transnos; I missed that part. Will rethink it. Thank you, Tappro. |
| Comment by Mikhail Pershin [ 20/Mar/11 ] |
|
1) Such replays will be replayed during the replay phase; there is no mechanism to replay something after the replay phase. The request stays in the replay queue and will either be replayed again first (if not committed on the server) or freed if committed. The server's last_committed value must be equal to its transno or less. 2) Replay errors cause import invalidation; the client is evicted from the server. Anyway, we have no assertion for the '==' case; it is allowed. That doesn't look like the source of the problem. I suspect the source of the problem is in the client code, which replays requests with a transno less than the server's last_committed. I don't know how yet, but that is a good point to start with. The problem is not the assertion; it is simply fundamentally wrong to replay a lesser transno over a greater one. |
| Comment by Niu Yawei (Inactive) [ 21/Mar/11 ] |
|
Hmm, whenever a replay request gets an error or a timeout (lost reply), ptlrpc_replay_interpret() calls ptlrpc_connect_import() to restart recovery. If that replay wasn't committed before the reconnect was replied to, then the client will resend it as the first replay request upon reconnect. The transno of this resent replay might be less than obd_next_recovery_transno, so it will be processed immediately. For the filter this looks like a defect to me, because the OSS doesn't check for non-idempotent resent requests as the MDS does; however, it shouldn't trigger the LBUG, because the transno is at most equal to lcd_last_transno in this case. I can't imagine a case of the client sending a lower transno either; maybe some more logs would be helpful. |
| Comment by Niu Yawei (Inactive) [ 23/Mar/11 ] |
|
Hi, Tappro. It looks to me like there is a serious race in target_handle_connect() which could cause this LBUG. target_handle_connect() checks obd_recovering to see whether we are in the recovery state; if it is true, then MSG_CONNECT_RECOVERING will be packed in the reply to inform the client to start replay. However, recovery might finish just after this check is done, and in the end this client would be allowed to replay after the recovery window. Do you agree with me on this? If you think it's a race too, I'll make a patch to fix it soon. Thanks. |
| Comment by Mikhail Pershin [ 24/Mar/11 ] |
|
Niu, that shouldn't happen. Recovery waits for all clients to start and cannot end until every connected client has finished recovery. That said, if a client was connected, then 1) it will be added to recovery, and 2) if recovery stops before that, this client will be evicted as having missed the recovery window. I see no way it can do replays after the recovery window. Please note that this bug would be seen much more often if it existed, but we have started to see it only recently, and only with bulk replays. I think OSS recovery is the right place to investigate. |
| Comment by Niu Yawei (Inactive) [ 29/Mar/11 ] |
|
Hi, Diego & Sebastien. Has this LBUG been reproduced with the HA and WARNING logs? So far, I don't see how it can be triggered from code inspection alone; any logs indicating which recovery stage it was in and whether any client was evicted would be helpful. Thanks |
| Comment by Sebastien Buisson (Inactive) [ 29/Mar/11 ] |
|
Hi Niu, We have asked our on-site Support Team for the logs, but unfortunately they have not been sent to us yet. Sebastien. |
| Comment by Peter Jones [ 29/Mar/11 ] |
|
Dropping priority to reflect importance to CEA. Please just attach logs when you are able to gather them and we will see what we can do then. |
| Comment by Mikhail Pershin [ 01/Apr/11 ] |
|
Peter, does it make sense to land the 24420 patch on master? It is safe and replaces assertions on wire data. I have an updated version of the patch, so it will require just inspection and testing. |
| Comment by Peter Jones [ 07/Apr/11 ] |
|
Mike, yes, Oleg is OK with this approach. Please upload the patch to gerrit so that we can get the inspections underway. Thanks, Peter |
| Comment by Peter Jones [ 21/Apr/11 ] |
|
Mike Are you able to hand off your patch to Niu so he can complete the testing and landing of this patch? Peter |
| Comment by Mikhail Pershin [ 21/Apr/11 ] |
|
Peter, I've just added patch to gerrit for review. Local tests are passed. |
| Comment by Peter Jones [ 21/Apr/11 ] |
|
thanks Mike! |
| Comment by Build Master (Inactive) [ 11/May/11 ] |
|
Integrated in Oleg Drokin : 2bb3a7f6b9889af696485267eb254db7980fe193
|
| Comment by Gregoire Pichon [ 18/May/11 ] |
|
The patch has been installed on our client's cluster, and they have started to get MDS crashes. The MDS panic thread's stack trace looks like the following: Looking at the stack trace, the failing ASSERTION(req_is_replay(req)) likely comes from the fix in lustre/mdt/mdt_recovery.c. It appears some scenarios are still not covered by the patch. Could you have a look? Grégoire. |
| Comment by Mikhail Pershin [ 18/May/11 ] |
|
Can you provide more information about the setup: what versions are on the clients and servers? Does the bug occur immediately after start, or occasionally during normal operation? |
| Comment by Gregoire Pichon [ 19/May/11 ] |
|
It is the same cluster as the initial problem (Lustre 2.0.0, x86_64, RHEL6). Here is the information from the client: The patch has been removed from the cluster. |
| Comment by Mikhail Pershin [ 19/May/11 ] |
|
The reason may be something other than the patch, namely the 2.0.0 code itself, with some bug causing this assertion. The assert can be removed from the patch; in that case there will only be client evictions, plus more debug info about why this is happening. |
| Comment by Peter Jones [ 25/May/11 ] |
|
Adding as 2.1 blocker upon advice of Bull because this patch is landed for 2.1 and caused issues when deployed in production at CEA. |
| Comment by Mikhail Pershin [ 25/May/11 ] |
|
Peter, that is not quite correct: the patch was landed for 2.1, but the problems with it were seen on 2.0.0. I am afraid the differences between 2.1 and 2.0.0 may be the reason for this. So it is not a blocker for 2.1 (at least we have seen no issues with it so far), but we need a patch which works correctly with 2.0.0. I'd propose to cook a special patch for Bull with the assertions removed, just to make it tolerant of any error, so we will be able to see the debug messages if the issues occur again. |
| Comment by Peter Jones [ 25/May/11 ] |
|
Ah thanks for clarifying Mike. This can certainly remain an important support issue for CEA and a priority for us without being considered a 2.1 blocker. I have adjusted the status accordingly. |
| Comment by Peter Jones [ 13/Jun/11 ] |
|
Mike, are you still expecting to be able to create a patch based on 2.0 for CEA? CEA, would you deploy such a patch, or is the window until you rebase on 2.1 small enough that it would not be worthwhile? Peter |
| Comment by Sebastien Buisson (Inactive) [ 14/Jun/11 ] |
|
Peter, As suggested by Mike, we cooked a patch for CEA with the assertions removed, just to make it tolerant of any error. These error messages are deactivated by default but can be activated via a kernel module option. That way CEA will be able to collect debug messages when the issue reoccurs. Sebastien. |
| Comment by Peter Jones [ 14/Jun/11 ] |
|
ok then I think that we can close this ticket and reopen it if it transpires that we need to take any further action before CEA realign on 2.1 |
| Comment by Sebastien Buisson (Inactive) [ 14/Jun/11 ] |
|
I agree. Thank you, |
| Comment by Niu Yawei (Inactive) [ 25/Aug/11 ] |
This should be fixed in |