[LU-128] Frequent OSS crashes due to LBUG/[ASSERTION(last_rcvd>=le64_to_cpu(lcd->lcd_last_transno)) failed] in recovery Created: 15/Mar/11  Updated: 25/Aug/11  Resolved: 14/Jun/11

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.0.0
Fix Version/s: Lustre 2.1.0

Type: Bug Priority: Major
Reporter: Sebastien Buisson (Inactive) Assignee: Niu Yawei (Inactive)
Resolution: Fixed Votes: 0
Labels: None
Environment:

x86_64, RHEL6


Attachments: File 24420-b18.patch    
Severity: 3
Bugzilla ID: 24420
Rank (Obsolete): 5040

 Description   

As suggested by Peter Jones, we open a Jira ticket for this issue in order to get the fix landed in 2.1.
I simply copy here the initial description from bugzilla 24420:

We hit this bug when we reboot some OSSs. It is raised during the recovery phase and causes a long Lustre service interruption.

On each crash, the panicking thread's stack trace looked like the following:
=========================================================================
#0 [ffff881021fd1238] machine_kexec at ffffffff8102e66b
#1 [ffff881021fd1298] crash_kexec at ffffffff810a9ae8
#2 [ffff881021fd1368] panic at ffffffff8145210d
#3 [ffff881021fd13e8] lbug_with_loc at ffffffffa0454eeb
#4 [ffff881021fd1438] libcfs_assertion_failed at ffffffffa04607d6
#5 [ffff881021fd1488] filter_finish_transno at ffffffffa096c825
#6 [ffff881021fd1548] filter_do_bio at ffffffffa098e390
#7 [ffff881021fd15e8] filter_commitrw_write at ffffffffa0990a78
#8 [ffff881021fd17d8] filter_commitrw at ffffffffa09833d5
#9 [ffff881021fd1898] obd_commitrw at ffffffffa093affa
#10 [ffff881021fd1918] ost_brw_write at ffffffffa0943644
#11 [ffff881021fd1af8] ost_handle at ffffffffa094837a
#12 [ffff881021fd1ca8] ptlrpc_server_handle_request at ffffffffa060eb11
#13 [ffff881021fd1de8] ptlrpc_main at ffffffffa060feea
#14 [ffff881021fd1f48] kernel_thread at ffffffff8100d1aa
=========================================================================

In one analysis by our on-site support we got the following values when the LBUG was
raised in the filter_finish_transno() function:

lcd_last_transno=0x4ddebb
oti_transno=last_rcvd=0x4ddeba
lsd_last_transno=0x4de0ee

So the client's recorded transaction number (lcd_last_transno) is higher than the actual
transaction number being committed, which, according to the ASSERT, is invalid.
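
For reference, the failing check in filter_finish_transno() boils down to an assertion that the transno being committed is not lower than the one recorded for that export. Below is a minimal, illustrative sketch with simplified stand-in types, not the verbatim Lustre source:

#include <assert.h>
#include <stdint.h>

/* Stand-in for the per-export client record; the real structure has many
 * more fields. */
struct lsd_client_data_sketch {
        uint64_t lcd_last_transno;  /* last transno recorded on disk for this export */
};

static void filter_finish_transno_sketch(struct lsd_client_data_sketch *lcd,
                                         uint64_t last_rcvd /* transno of this request */)
{
        /* The server assumes it never commits a transno lower than the one
         * already recorded for the client.  With the crash values
         * (last_rcvd = 0x4ddeba, lcd_last_transno = 0x4ddebb) this check
         * fails and the OSS LBUGs. */
        assert(last_rcvd >= lcd->lcd_last_transno);

        lcd->lcd_last_transno = last_rcvd;
}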

I can see there is a similar bug (bz23296), but I don't think it is related to this one:
in bz23296 the problem comes from a bad initialization in obdecho/echo_client.c, which is
only used for tests, not in production as in our case.

Does this sound like a known bug to you? As a work-around, what would be the
consequences of disabling this LBUG? I think we would lose some data on a client, but I
don't know whether there is any other important consequence.

I also attach here the patch from bugzilla 24420 that is already landed in 1.8.6.

Thanks,
Sebastien.



 Comments   
Comment by Mikhail Pershin [ 15/Mar/11 ]

> lcd_last_transno=0x4ddebb
> oti_transno=last_rcvd=0x4ddeba
> lsd_last_transno=0x4de0ee
>
> So we have the client (lcd_last_transno) having a bad transaction number with the actual
> transaction number being lower than client's one which, according the the ASSERT, is bad.

This is not correct: lcd_last_transno is not the client's transno but the latest transno written on the server; it is stored in the per-export 'client_data'. Since it is on disk, it was committed, and therefore all lower transactions must be committed as well. Yet the client tries to replay transaction 0x4ddeba, which is wrong and shouldn't happen.
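
To make the two records concrete, here is a rough sketch of the fields involved; the field names follow the values quoted in the description, while the real structures in the Lustre source contain many more fields:

#include <stdint.h>

/* Server-wide record in last_rcvd. */
struct lr_server_data_sketch {
        uint64_t lsd_last_transno;   /* highest transno the server has assigned */
};

/* Per-export client slot in last_rcvd. */
struct lsd_client_data_sketch {
        uint64_t lcd_last_transno;   /* last transno written to disk for this
                                      * export, i.e. already committed */
};

/* Implied invariant: a replay carrying a transno lower than lcd_last_transno
 * refers to an already-committed operation (0x4ddeba < 0x4ddebb here) and
 * should never reach the commit path again. */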

Therefore we have the following possible cases:
1) a bug in the client replay code: it issues a lower transaction after a higher one
2) a bug on the server side: it should skip such a lower transno if the server's committed transno is higher

I doubt we have case 1) as it is generic code for the MDS/OSS. Also, this always happens with IO replays; maybe something is wrong in that path. I'd propose tracking how such replays are handled on the server.

Comment by Mikhail Pershin [ 15/Mar/11 ]

The patch is correct, but only in that it avoids asserting on wire data; the bug itself is not fixed and still exists. It is fundamentally wrong to try to execute a lower-transno request. Do you have the Lustre log for that crash?

Comment by Diego Moreno (Inactive) [ 15/Mar/11 ]

Thanks for your feedback Mikhail,

We asked our client to activate these debug levels:

  • On OSS: HA and warning
  • On MDS: HA, inode and warning
  • On clients: nothing in particular

They cannot set the full debug level as their filesystem is in production. Do you think this will be enough to trace the transnos?

Comment by Sebastien Buisson (Inactive) [ 15/Mar/11 ]

Hi,

For us the patch is only a work-around, but it is vital: without it, production is completely blocked at the customer site.

Unfortunately we do not have the lustre debug logs, all the information we have is in bugzilla 24420.

Regards,
Sebastien.

Comment by Mikhail Pershin [ 15/Mar/11 ]

Diego, HA should help a lot.

Sebastien, am I right that this bug has been seen only with 1.8.x so far?

Comment by Sebastien Buisson (Inactive) [ 15/Mar/11 ]

No, our customer runs Lustre 2.0.0.1; that is where we have seen this issue.

Comment by Peter Jones [ 15/Mar/11 ]

Niu

Could you please look at this one?

thanks

Peter

Comment by Niu Yawei (Inactive) [ 15/Mar/11 ]

Hi, tappro

Though 0x4ddebb was already committed, we can't guarantee that a reply with last_committed transno >= 0x4ddebb arrived at the client before the server crash, so the replay requests were probably still in the client's replay list and should be replayed on recovery. Did I miss anything?

To my understanding, this kind of assert is incorrect (the first replay request might have a lower transno), so we can simply remove it.

Comment by Mikhail Pershin [ 15/Mar/11 ]

Upon connect the real last_committed is reported back to the client, so it shouldn't send lower replays. The first question is why that doesn't work.

Also, the server can always compare them during replay and must drop them, because processing them breaks consistency. The assertion is in filter_finish_transno(), which is the endpoint; the request shouldn't get that far.

Note that this assertion has been there for quite a long time, and all cases involve bulk replay, which was introduced recently. I think this is the root of the problem, and we need to review those paths: what last_committed value does the client have after reconnect? What are the server's values for next_transno, committed, and current at the moment of replay? After that it will be clearer what is wrong.

Comment by Mikhail Pershin [ 15/Mar/11 ]

In target_queue_recovery_request() we have an exclusion for lower-transno requests:

if (transno < obd->obd_next_recovery_transno) {
        /* Processing the queue right now, don't re-add. */
        LASSERT(cfs_list_empty(&req->rq_list));
        cfs_spin_unlock(&obd->obd_recovery_task_lock);
        RETURN(1);
}

That is done for open replays, which can be older than committed values, but such requests are specially kept on the client. That makes sense only on the MDS, and I am not sure about bulk replays: maybe they can also be kept even if lower than committed for some reason, though I can't imagine one.

So the only remaining question is why bulk replays with a lower transno are possible; see my previous comment about last_committed, which must be applied after reconnect.

Comment by Niu Yawei (Inactive) [ 15/Mar/11 ]

Right, the client should get last_committed upon connect and clean up all requests with lower transnos; I missed that part. I will rethink it. Thank you, Tappro.

Comment by Mikhail Pershin [ 20/Mar/11 ]

1) Such replays will be replayed during the replay phase; there is no mechanism to replay something after the replay phase. A request is still in the replay queue and will either be replayed again first (if not committed on the server) or freed if committed. The server's last_committed value must be equal to its transno or less.

2) Replay errors cause import invalidation, and the client is evicted from the server. In any case we have no assertion for the '==' case; it is allowed.

That doesn't look like the source of the problem. I suspect the source is in the client code, which replays requests with a transno lower than the last_committed from the server. I don't know how yet, but that is a good point to start with.

The problem is not the assertion; it is simply fundamentally wrong to replay a lower transno over a greater one.

Comment by Niu Yawei (Inactive) [ 21/Mar/11 ]

Hmm, whenever a replay request gets an error or times out (lost reply), ptlrpc_replay_interpret() calls ptlrpc_connect_import() to restart recovery; if that replay wasn't committed before the reconnect was replied to, the client will resend it as the first replay request upon reconnect. The transno of this resent replay might be less than obd_next_recovery_transno, so it will be processed immediately. For the filter this looks like a defect to me, because the OSS doesn't check non-idempotent resent requests as the MDS does; however, it shouldn't trigger the LBUG, because the transno is at most equal to lcd_last_transno in this case.

I can't imagine a case of the client sending a lower transno either; some more logs would be helpful.

Comment by Niu Yawei (Inactive) [ 23/Mar/11 ]

Hi, Tappro

It looks to me like there is a serious race in target_handle_connect() which could cause this LBUG.

In target_handle_connect() we check obd_recovering to see whether we are in the recovery state; if it is true, MSG_CONNECT_RECOVERING is packed in the reply to inform the client to start replay. However, recovery might finish just after this check is done, and in the end this client would be allowed to replay after the recovery window.

Do you agree? If you also think it's a race, I'll make a patch to fix it soon. Thanks.
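
For illustration, the suspected window is a classic check-then-act race. Here is a simplified, self-contained sketch; the types and fields below are stand-ins, not the actual server code:

#include <stdbool.h>

struct obd_device_sketch {
        bool obd_recovering;        /* stand-in for the real recovery flag */
};

struct connect_reply_sketch {
        bool tell_client_to_replay; /* stand-in for MSG_CONNECT_RECOVERING */
};

static void target_handle_connect_sketch(struct obd_device_sketch *obd,
                                         struct connect_reply_sketch *rep)
{
        bool recovering = obd->obd_recovering;      /* (1) check */

        /* Recovery may complete on another thread right here, between the
         * check and the reply being packed... */

        if (recovering)
                rep->tell_client_to_replay = true;  /* (2) act: the client is
                                                     * told to replay although
                                                     * the recovery window may
                                                     * already be closed. */
}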

Comment by Mikhail Pershin [ 24/Mar/11 ]

Niu, that shouldn't happen. Recovery waits for all clients to start and cannot end until a connected client has finished recovery. That is, if a client was connected, then 1) it will be added to recovery, and 2) if recovery stops before that, the client will be evicted as having missed the recovery window. I see no way it can do replays after the recovery window.

Please note that this bug would be seen much more often if such a race existed, yet we only started to see it recently and only with bulk replays. I think OSS recovery is the right place to investigate.

Comment by Niu Yawei (Inactive) [ 29/Mar/11 ]

Hi, Diego & Sebastien

Has this LBUG been reproduced with the HA and WARNING logs? So far I don't see how it can be triggered from code inspection alone; if there are any logs indicating which recovery stage it was in and whether any client was evicted, that would be helpful.

Thanks
Niu

Comment by Sebastien Buisson (Inactive) [ 29/Mar/11 ]

Hi Niu,

We have asked our on-site Support Team for the logs, but unfortunately they have not been sent to us yet.

Sebastien.

Comment by Peter Jones [ 29/Mar/11 ]

Dropping priority to reflect importance to CEA. Please just attach logs when you are able to gather them and we will see what we can do then.

Comment by Mikhail Pershin [ 01/Apr/11 ]

Peter, does it make sense to land the 24420 patch on master? It is safe and replaces assertions on wire data. I have an updated version of the patch, so it will just require inspection and testing.

Comment by Peter Jones [ 07/Apr/11 ]

Mike

Yes, Oleg is OK with this approach. Please upload the patch to Gerrit so that we can get the inspections underway.

Thanks

Peter

Comment by Peter Jones [ 21/Apr/11 ]

Mike

Are you able to hand off your patch to Niu so he can complete the testing and landing of this patch?

Peter

Comment by Mikhail Pershin [ 21/Apr/11 ]

Peter, I've just added the patch to Gerrit for review. Local tests pass.

Comment by Peter Jones [ 21/Apr/11 ]

thanks Mike!

Comment by Build Master (Inactive) [ 11/May/11 ]

Integrated in lustre-master » x86_64,client,el5,inkernel #117
LU-128 Avoid assertion on wire data in last_rcvd update

Oleg Drokin : 2bb3a7f6b9889af696485267eb254db7980fe193
Files :

  • lustre/mdt/mdt_recovery.c
  • lustre/mdt/mdt_open.c
  • lustre/obdfilter/filter.c
Comment by Gregoire Pichon [ 18/May/11 ]

The patch was installed on our customer's cluster, and they started to get MDS crashes
with LBUG/ASSERTION(req_is_replay(req)) failed.

The MDS panic thread's stack trace looks like the following:
=======================================================
panic()
lbug_with_loc()
libcfs_assertion_failed()
mdt_txn_stop_cb()
dt_txn_hook_stop()
osd_trans_stop()
mdd_trans_stop()
mdd_create()
cml_create()
mdt_reint_open()
mdt_reint_rec()
mdt_reint_internal()
mdt_intent_reint()
mdt_intent_policy()
ldlm_lock_enqueue()
ldlm_handle_enqueue0()
mdt_enqueue()
mdt_handle_common()
mdt_regular_handle()
ptlrpc_server_handle_request()
ptlrpc_main()
kernel_thread()
=======================================================

Looking at the stack trace, the failing ASSERTION(req_is_replay(req)) likely comes from the fix in lustre/mdt/mdt_recovery.c.

It appears some scenarios are still not covered by the patch.

Could you have a look?
Thanks,

Grégoire.

Comment by Mikhail Pershin [ 18/May/11 ]

Can you provide more information about the setup? What versions are on the clients and servers? When does the bug occur: immediately after start, or occasionally during normal operation?

Comment by Gregoire Pichon [ 19/May/11 ]

It is the same cluster as the initial problem (Lustre 2.0.0, x86_64, RHEL6).

Here is the information from the customer:
The first occurrence was during normal operations, and the next ones during restart+recovery.
There were no messages about connection problems or client evictions at that time.

The patch has been removed from the cluster.

Comment by Mikhail Pershin [ 19/May/11 ]

The reason could just be the 2.0.0 code itself: some bug that causes this assertion. The assert can be removed from the patch; in that case there would only be client evictions, plus more debug info about why this is happening.

Comment by Peter Jones [ 25/May/11 ]

Adding as a 2.1 blocker upon Bull's advice, because this patch has landed for 2.1 and caused issues when deployed in production at CEA.

Comment by Mikhail Pershin [ 25/May/11 ]

Peter, that is not quite correct: the patch was landed for 2.1, but the problems with it were seen on 2.0.0. I am afraid the differences between 2.1 and 2.0.0 may be the reason for this. So it is not a blocker for 2.1, at least we have seen no issues with it there so far, but we need a patch which works correctly with 2.0.0.

I'd propose cooking a special patch for Bull with the assertions removed, just to make it tolerant of any error, so we will be able to see debug messages if the issue occurs again.

Comment by Peter Jones [ 25/May/11 ]

Ah thanks for clarifying Mike. This can certainly remain an important support issue for CEA and a priority for us without being considered a 2.1 blocker. I have adjusted the status accordingly.

Comment by Peter Jones [ 13/Jun/11 ]

Mike

Are you still expecting to be able to create a patch based on 2.0 for CEA?

CEA,

Would you deploy such a patch or is the window until you rebase on 2.1 small enough that it would not be worthwhile?

Peter

Comment by Sebastien Buisson (Inactive) [ 14/Jun/11 ]

Peter,

As suggested by Mike, we cooked a patch for CEA with the assertions removed, to make it tolerant of any error. The resulting error messages are deactivated by default but can be activated via a kernel module option. That way CEA will be able to collect debug messages when the issue reoccurs.
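
A minimal sketch of that kind of change, where the fatal assertion becomes an error message gated by a module option (the parameter name, message text, and helper below are invented for illustration, not the actual CEA patch):

#include <stdint.h>
#include <stdio.h>

/* Stands in for a module_param(); off by default. */
static int report_transno_errors;

static void check_transno_sketch(const char *obd_name,
                                 uint64_t last_rcvd, uint64_t lcd_last_transno)
{
        /* Previously: LASSERT(last_rcvd >= le64_to_cpu(lcd->lcd_last_transno)); */
        if (last_rcvd < lcd_last_transno && report_transno_errors)
                fprintf(stderr,
                        "%s: replay transno %#llx is lower than recorded %#llx\n",
                        obd_name,
                        (unsigned long long)last_rcvd,
                        (unsigned long long)lcd_last_transno);

        /* Processing continues instead of LBUG'ing the server. */
}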

Sebastien.

Comment by Peter Jones [ 14/Jun/11 ]

OK, then I think we can close this ticket and reopen it if it transpires that we need to take any further action before CEA realigns on 2.1.

Comment by Sebastien Buisson (Inactive) [ 14/Jun/11 ]

I agree.

Thank you,
Sebastien.

Comment by Niu Yawei (Inactive) [ 25/Aug/11 ]

> The patch has been installed on our client's cluster, and he started to get MDS crashes
> for LBUG/ASSERTION(req_is_replay(req)) failed.
> [stack trace identical to the 18/May/11 comment above]
> Looking at the stack trace, the failing ASSERTION(req_is_replay(req)) is likely to come from the fix in lustre/mdt/mdt_recovery.c.

This should be fixed in LU-617.
