[LU-5128] ASSERTION( atomic_read(&obd->obd_req_replay_clients) == 0 ) failed Created: 01/Jun/14  Updated: 18/Aug/14  Resolved: 06/Aug/14

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.4.3
Fix Version/s: Lustre 2.7.0, Lustre 2.5.3

Type: Bug Priority: Minor
Reporter: Shuichi Ihara (Inactive) Assignee: Hongchao Zhang
Resolution: Fixed Votes: 0
Labels: duu, mn4
Environment:

Lustre-2.4.3


Severity: 3
Rank (Obsolete): 14153

 Description   

The MDS failed over, and once MDS recovery finished, many OSSes crashed due to the following ASSERTION.

2014-05-30 17:39:07 Lustre: Skipped 3 previous similar messages
2014-05-30 17:39:07 LustreError: 18967:0:(ldlm_lib.c:1851:target_next_replay_req()) ASSERTION( atomic_read(&obd->obd_req_replay_clients) == 0 ) failed: 
2014-05-30 17:39:07 LustreError: 18967:0:(ldlm_lib.c:1851:target_next_replay_req()) LBUG
2014-05-30 17:39:07 Pid: 18967, comm: tgt_recov
2014-05-30 17:39:07 
2014-05-30 17:39:07 Call Trace:
2014-05-30 17:39:07  [<ffffffffa0353895>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
2014-05-30 17:39:07  [<ffffffffa0353e97>] lbug_with_loc+0x47/0xb0 [libcfs]
2014-05-30 17:39:07  [<ffffffffa066f48c>] target_recovery_thread+0x14ac/0x1970 [ptlrpc]
2014-05-30 17:39:07  [<ffffffffa066dfe0>] ? target_recovery_thread+0x0/0x1970 [ptlrpc]
2014-05-30 17:39:07  [<ffffffff8100c0ca>] child_rip+0xa/0x20
2014-05-30 17:39:07  [<ffffffffa066dfe0>] ? target_recovery_thread+0x0/0x1970 [ptlrpc]
2014-05-30 17:39:07  [<ffffffffa066dfe0>] ? target_recovery_thread+0x0/0x1970 [ptlrpc]
2014-05-30 17:39:07  [<ffffffff8100c0c0>] ? child_rip+0x0/0x20
2014-05-30 17:39:07 
2014-05-30 17:39:07 Kernel panic - not syncing: LBUG
2014-05-30 17:39:07 Pid: 18967, comm: tgt_recov Not tainted 2.6.32-358.18.1.el6_lustre.x86_64 #1
2014-05-30 17:39:07 Call Trace:
2014-05-30 17:39:07  [<ffffffff8150de58>] ? panic+0xa7/0x16f
2014-05-30 17:39:07  [<ffffffffa0353eeb>] ? lbug_with_loc+0x9b/0xb0 [libcfs]
2014-05-30 17:39:07  [<ffffffffa066f48c>] ? target_recovery_thread+0x14ac/0x1970 [ptlrpc]
2014-05-30 17:39:07  [<ffffffffa066dfe0>] ? target_recovery_thread+0x0/0x1970 [ptlrpc]
2014-05-30 17:39:07  [<ffffffff8100c0ca>] ? child_rip+0xa/0x20
2014-05-30 17:39:07  [<ffffffffa066dfe0>] ? target_recovery_thread+0x0/0x1970 [ptlrpc]
2014-05-30 17:39:07  [<ffffffffa066dfe0>] ? target_recovery_thread+0x0/0x1970 [ptlrpc]
2014-05-30 17:39:07  [<ffffffff8100c0c0>] ? child_rip+0x0/0x20

LU-1522 and LU-2397 reported a similar problem, but those patches have already been merged into b2_4.



 Comments   
Comment by Peter Jones [ 02/Jun/14 ]

Hongchao

Could you please advise on this one?

Thanks

Peter

Comment by Hongchao Zhang [ 06/Jun/14 ]

Hi,

Could you please attach the full logs for this issue? Thanks!
By the way, did the OSS also fail over along with the MDS?

Thanks

Comment by Hongchao Zhang [ 06/Jun/14 ]

There could be a race between "target_process_req_flags" and "class_export_recovery_cleanup". If the replay request carries the flag
"MSG_REQ_REPLAY_DONE", then "target_process_req_flags" clears "exp->exp_req_replay_needed" and decrements "obd->obd_req_replay_clients"
under the protection of "exp_lock". However, "class_export_recovery_cleanup" checks "exp_req_replay_needed" without holding "exp_lock",
so it can decrement "obd_req_replay_clients" a second time, which triggers this assertion (see the sketch below).

The patch against b2_4 is tracked at http://review.whamcloud.com/#/c/10628/.
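
For illustration only, here is a minimal standalone C sketch of the locking pattern described above; it is not the actual Lustre code or the landed patch. Both paths that may clear the per-export "replay needed" flag do the check-and-clear under the export lock, so the counter is decremented at most once per export. The struct names (obd_device_sketch, obd_export_sketch) are simplified stand-ins, and a pthread mutex stands in for the export's exp_lock spinlock.

    /* Minimal sketch of the race fix: check-and-clear the per-export flag
     * under the export lock so obd_req_replay_clients is decremented at
     * most once per export. Simplified stand-in types, not Lustre code. */
    #include <pthread.h>
    #include <stdatomic.h>
    #include <stdbool.h>

    struct obd_device_sketch {
            atomic_int obd_req_replay_clients;  /* stands in for obd->obd_req_replay_clients */
    };

    struct obd_export_sketch {
            pthread_mutex_t exp_lock;           /* stands in for the export's exp_lock */
            bool            exp_req_replay_needed;
            struct obd_device_sketch *exp_obd;
    };

    /* Analogous to target_process_req_flags(): called when a replay request
     * carrying MSG_REQ_REPLAY_DONE is processed. */
    static void process_replay_done(struct obd_export_sketch *exp)
    {
            pthread_mutex_lock(&exp->exp_lock);
            if (exp->exp_req_replay_needed) {
                    exp->exp_req_replay_needed = false;
                    atomic_fetch_sub(&exp->exp_obd->obd_req_replay_clients, 1);
            }
            pthread_mutex_unlock(&exp->exp_lock);
    }

    /* Analogous to class_export_recovery_cleanup(): the bug was checking the
     * flag without the lock, allowing a second decrement; taking the lock
     * here closes the race. */
    static void recovery_cleanup(struct obd_export_sketch *exp)
    {
            pthread_mutex_lock(&exp->exp_lock);
            if (exp->exp_req_replay_needed) {
                    exp->exp_req_replay_needed = false;
                    atomic_fetch_sub(&exp->exp_obd->obd_req_replay_clients, 1);
            }
            pthread_mutex_unlock(&exp->exp_lock);
    }

With both paths serialized on the same lock, the flag can only transition from set to cleared once, so the recovery thread's assertion that obd_req_replay_clients reaches exactly zero holds.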

Comment by Shuichi Ihara (Inactive) [ 24/Jun/14 ]

Does this only happen on the b2_4 branch, or may the same problem occur on b2_5 as well?

Comment by Hongchao Zhang [ 25/Jun/14 ]

The issue tracked at http://review.whamcloud.com/#/c/10628/ also exists on b2_5.

Comment by Hongchao Zhang [ 26/Jun/14 ]

The patch against master is tracked at http://review.whamcloud.com/#/c/10849/.

Comment by wu libin (Inactive) [ 15/Jul/14 ]

Here is the patch for b2_5: http://review.whamcloud.com/#/c/11102/

Comment by Peter Jones [ 06/Aug/14 ]

Landed for 2.7
