[LU-5287] (ldlm_lib.c:2253:target_queue_recovery_request()) ASSERTION( req->rq_export->exp_lock_replay_needed ) failed Created: 02/Jul/14 Updated: 19/Sep/16 Resolved: 06/Nov/14 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.6.0, Lustre 2.7.0 |
| Fix Version/s: | Lustre 2.7.0, Lustre 2.5.4 |
| Type: | Bug | Priority: | Critical |
| Reporter: | John Hammond | Assignee: | Niu Yawei (Inactive) |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | llnl, ost |
| Issue Links: |
|
| Severity: | 3 |
| Rank (Obsolete): | 14750 |
| Description |
|
Running racer with 2 clients, MDSCOUNT=1, and 2.5.60-90-g37432a8 + http://review.whamcloud.com/#/c/5936/, I see this when restarting a crashed OST with some clients still mounted.

[ 230.089707] Lustre: Skipped 75 previous similar messages
[ 231.775205] Lustre: 2151:0:(client.c:1924:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1404323793/real 1404323793] req@ffff8801f78fc110 x1472540086110788/t0(0) o400->lustre-OST0001-osc-MDT0000@0@lo:28/4 lens 224/224 e 1 to 1 dl 1404323837 ref 1 fl Rpc:X/c0/ffffffff rc 0/-1
[ 237.775938] Lustre: lustre-OST0001: Denying connection for new client cc64d6dc-4180-e700-9f7e-ce147524a8f0 (at 0@lo), waiting for all 4 known clients (2 recovered, 1 in progress, and 1 evicted) to recover in 0:36
[ 237.781858] Lustre: Skipped 3 previous similar messages
[ 242.801254] LustreError: 2880:0:(ldlm_lib.c:2253:target_queue_recovery_request()) ASSERTION( req->rq_export->exp_lock_replay_needed ) failed:
[ 242.805102] LustreError: 2880:0:(ldlm_lib.c:2253:target_queue_recovery_request()) LBUG
[ 242.807953] Pid: 2880, comm: ll_ost00_007
[ 242.809274]
[ 242.809276] Call Trace:
[ 242.810585] [<ffffffffa02b98c5>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
[ 242.812764] [<ffffffffa02b9ec7>] lbug_with_loc+0x47/0xb0 [libcfs]
[ 242.814689] [<ffffffffa064ea0c>] target_queue_recovery_request+0xbac/0xc10 [ptlrpc]
[ 242.816347] [<ffffffffa06e122f>] tgt_handle_recovery+0x38f/0x520 [ptlrpc]
[ 242.817666] [<ffffffffa06e6b8d>] tgt_request_handle+0x18d/0xad0 [ptlrpc]
[ 242.818987] [<ffffffffa0699e31>] ptlrpc_main+0xcf1/0x1880 [ptlrpc]
[ 242.820261] [<ffffffffa0699140>] ? ptlrpc_main+0x0/0x1880 [ptlrpc]
[ 242.821440] [<ffffffff8109eab6>] kthread+0x96/0xa0
[ 242.822360] [<ffffffff8100c30a>] child_rip+0xa/0x20
[ 242.823303] [<ffffffff81554710>] ? _spin_unlock_irq+0x30/0x40
[ 242.824390] [<ffffffff8100bb10>] ? restore_args+0x0/0x30
[ 242.825391] [<ffffffff8109ea20>] ? kthread+0x0/0xa0
[ 242.826315] [<ffffffff8100c300>] ? child_rip+0x0/0x20
[ 242.827283] |
| Comments |
| Comment by Christopher Morrone [ 26/Aug/14 ] |
|
We hit this same assertion with Lustre version 2.4.2-14.1chaos (see github.com/chaos/lustre) while the OSTs were in recovery. |
| Comment by Peter Jones [ 05/Sep/14 ] |
|
Niu, could you please advise on this issue? Thanks, Peter |
| Comment by Nathaniel Clark [ 08/Sep/14 ] |
|
replay-single/73c on review-dne-part-2 on master: |
| Comment by Niu Yawei (Inactive) [ 09/Sep/14 ] |
|
The replay-single/73c test seems to have never been really exercised, since the fail_loc OBD_FAIL_TGT_LAST_REPLAY was never used (from the day it was introduced)... Not sure whether this test can trigger the bug more easily once that is fixed. I'm going to investigate it further. |
| Comment by Niu Yawei (Inactive) [ 11/Sep/14 ] |
|
Well, I found two places that modify exp_flags without holding exp_lock, which could result in concurrent exp_flags updates overwriting each other. Patch for master: http://review.whamcloud.com/11871 |
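[Editorial note] For context, the pattern the fix relies on is that every writer serializes its update of the shared flag word behind the export's lock. Below is a minimal userland sketch of that pattern, assuming nothing beyond what the comment states: the struct and function names only mimic Lustre's obd_export (exp_lock, exp_lock_replay_needed), a pthread mutex stands in for the kernel spinlock, and this is an illustration, not the actual patch.

    /* Sketch only: userland model of serializing export flag updates.
     * Names mimic Lustre's obd_export; pthread_mutex_t stands in for the
     * kernel spinlock.  This is illustrative, not the actual patch. */
    #include <pthread.h>

    struct obd_export_model {
            pthread_mutex_t exp_lock;                 /* guards the flag bits below */
            unsigned long   exp_lock_replay_needed:1, /* all of these bits share    */
                            exp_req_replay_needed:1,  /* one underlying word        */
                            exp_in_recovery:1;
    };

    /* Every writer takes exp_lock around the bitfield assignment, so the
     * load/modify/store of the shared word cannot interleave with another
     * writer's update and silently drop a bit. */
    static void exp_set_lock_replay_needed(struct obd_export_model *exp)
    {
            pthread_mutex_lock(&exp->exp_lock);
            exp->exp_lock_replay_needed = 1;
            pthread_mutex_unlock(&exp->exp_lock);
    }

With all writers taking the same lock, two concurrent flag changes can no longer overwrite each other's bits.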
| Comment by Niu Yawei (Inactive) [ 26/Sep/14 ] |
|
Andriy discovered another path to trigger this assertion. (see |
| Comment by Sarah Liu [ 29/Sep/14 ] |
|
Hit this bug on master branch build #2671 |
| Comment by Christopher Morrone [ 01/Oct/14 ] |
|
We just had 181 servers crash with this assertion when starting 2.4.2-16chaos (see github.com/chaos/lustre) on them for the first time. We need a patch for our branch as well. |
| Comment by Bob Glossman (Inactive) [ 01/Oct/14 ] |
|
Christopher, which patch(es) do you need backported to b2_4? We don't want to do too much or too little. |
| Comment by Christopher Morrone [ 01/Oct/14 ] |
|
All of the patches needed to fix this ticket's assertion. But actually, we are moving to 2.5 soon, so patches for b2_5 should be sufficient. |
| Comment by Bob Glossman (Inactive) [ 01/Oct/14 ] |
|
backports to b2_5: |
| Comment by Peter Jones [ 06/Nov/14 ] |
|
Landed for 2.5.4 and 2.7 |
| Comment by Vinayak (Inactive) [ 07/Sep/16 ] |
|
Hello Niu, I am trying to understand http://review.whamcloud.com/#/c/12162. Can you please help me understand how it affects the req->rq_export->exp_lock_replay_needed flag? Thanks in advance, |
| Comment by Niu Yawei (Inactive) [ 07/Sep/16 ] |
|
All export flags share the same "unsigned long", and the flag set operation "export->exp_xxx = 1" isn't atomic (it loads the whole unsigned long, changes one bit, and writes the whole unsigned long back), so flag changes need to take the lock. |
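[Editorial note] As an illustration of that read-modify-write, here is a self-contained userland example, not Lustre code; the struct, macros, and bit positions are invented for the sketch.

    /* Illustration only: what setting one export flag amounts to, and how
     * two unlocked writers can lose an update.  Plain userland C; names
     * and bit values are invented for the sketch. */
    #include <stdio.h>

    struct export_flags {
            unsigned long word;                  /* all flag bits live in this one word */
    };

    #define EXP_LOCK_REPLAY_NEEDED  (1UL << 0)
    #define EXP_REQ_REPLAY_NEEDED   (1UL << 1)

    /* Roughly what "export->exp_lock_replay_needed = 1" compiles to: */
    static void set_lock_replay_needed(struct export_flags *exp)
    {
            unsigned long tmp = exp->word;       /* 1. load the whole word  */
            tmp |= EXP_LOCK_REPLAY_NEEDED;       /* 2. change one bit       */
            exp->word = tmp;                     /* 3. store the whole word */
    }

    /*
     * Lost-update interleaving without the lock (T1 sets bit 0, T2 sets bit 1):
     *
     *   T1: load word        -> 0x0
     *   T2: load word        -> 0x0
     *   T1: set bit 0, store -> word == 0x1
     *   T2: set bit 1, store -> word == 0x2   (T1's bit is overwritten)
     *
     * Taking the export's lock around steps 1-3 makes the sequence appear
     * atomic to other writers, which is why the flag changes need the lock.
     */

    int main(void)
    {
            struct export_flags exp = { 0 };

            set_lock_replay_needed(&exp);
            printf("flags word = %#lx\n", exp.word);
            return 0;
    }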
| Comment by Vinayak (Inactive) [ 08/Sep/16 ] |
|
Thanks, Niu. Please correct me if I'm wrong:
If the operations are not concurrent, then it does not matter whether the bit-field change happens with or without the lock; the other bits of the unsigned long will not be affected if we change any one of them. Am I right?
What is the real importance of the lock here? Thanks in advance, |
| Comment by Niu Yawei (Inactive) [ 18/Sep/16 ] |
|
Right, it's safe if there aren't any concurrent changes. The importance of the lock here is that there could be concurrent changes; I didn't do a thorough retrospective of the code changes, but it looks like concurrency is possible at first glance. Even if there are no concurrent changes in the present code, I don't think we can assume that concurrent changes will never happen; taking the lock is the safe way to avoid nasty bugs. |
| Comment by Vinayak (Inactive) [ 19/Sep/16 ] |
|
Thanks, Niu. We have faced a similar issue on our setup. I will post my further investigation here. |