[LU-5287] (ldlm_lib.c:2253:target_queue_recovery_request()) ASSERTION( req->rq_export->exp_lock_replay_needed ) failed - Whamcloud Community JIRA

Details

Type: Bug
Resolution: Fixed
Priority: Critical
Fix Version/s: Lustre 2.7.0, Lustre 2.5.4
Affects Version/s: Lustre 2.6.0, Lustre 2.7.0
Labels:
- llnl
- ost

Severity:
3
Rank (Obsolete):
14750

Description

Running racer with 2 clients, MDSCOUNT=1, and 2.5.60-90-g37432a8 + http://review.whamcloud.com/#/c/5936/, I see this when restarting a crashed OST with some clients still mounted.

[  230.089707] Lustre: Skipped 75 previous similar messages
[  231.775205] Lustre: 2151:0:(client.c:1924:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1404323793/real 1404323793]  req@ffff8801f78fc110 x1472540086110788/t0(0) o400->lustre-OST0001-osc-MDT0000@0@lo:28/4 lens 224/224 e 1 to 1 dl 1404323837 ref 1 fl Rpc:X/c0/ffffffff rc 0/-1
[  237.775938] Lustre: lustre-OST0001: Denying connection for new client cc64d6dc-4180-e700-9f7e-ce147524a8f0 (at 0@lo), waiting for all 4 known clients (2 recovered, 1 in progress, and 1 evicted) to recover in 0:36
[  237.781858] Lustre: Skipped 3 previous similar messages
[  242.801254] LustreError: 2880:0:(ldlm_lib.c:2253:target_queue_recovery_request()) ASSERTION( req->rq_export->exp_lock_replay_needed ) failed: 
[  242.805102] LustreError: 2880:0:(ldlm_lib.c:2253:target_queue_recovery_request()) LBUG
[  242.807953] Pid: 2880, comm: ll_ost00_007
[  242.809274] 
[  242.809276] Call Trace:
[  242.810585]  [<ffffffffa02b98c5>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
[  242.812764]  [<ffffffffa02b9ec7>] lbug_with_loc+0x47/0xb0 [libcfs]
[  242.814689]  [<ffffffffa064ea0c>] target_queue_recovery_request+0xbac/0xc10 [ptlrpc]
[  242.816347]  [<ffffffffa06e122f>] tgt_handle_recovery+0x38f/0x520 [ptlrpc]
[  242.817666]  [<ffffffffa06e6b8d>] tgt_request_handle+0x18d/0xad0 [ptlrpc]
[  242.818987]  [<ffffffffa0699e31>] ptlrpc_main+0xcf1/0x1880 [ptlrpc]
[  242.820261]  [<ffffffffa0699140>] ? ptlrpc_main+0x0/0x1880 [ptlrpc]
[  242.821440]  [<ffffffff8109eab6>] kthread+0x96/0xa0
[  242.822360]  [<ffffffff8100c30a>] child_rip+0xa/0x20
[  242.823303]  [<ffffffff81554710>] ? _spin_unlock_irq+0x30/0x40
[  242.824390]  [<ffffffff8100bb10>] ? restore_args+0x0/0x30
[  242.825391]  [<ffffffff8109ea20>] ? kthread+0x0/0xa0
[  242.826315]  [<ffffffff8100c300>] ? child_rip+0x0/0x20
[  242.827283]

Attachments

Issue Links

duplicates

LU-5572 replay-single test_73b: import is not in FULL state

Closed

is related to

LU-5651 ASSERTION( req->rq_export->exp_lock_replay_needed ) failed

Resolved

Activity

Vinayak (Inactive) added a comment - 08/Sep/16 6:47 AM

Thanks Niu.

Please correct me if I am wrong.

"so flag change operation needs to take lock"

If the operations are not concurrent, then it does not matter whether the bit-field change happens with or without the lock. The other bits in the unsigned long will not be affected if we change any one of the bits. Am I right?

"load the whole unsigned long, change a bit, write the whole unsigned long"

What is the real importance of the lock here?

Thanks in advance,

Niu Yawei (Inactive) added a comment - 07/Sep/16 1:00 PM

All export flags share the same "unsigned long", and the flag-set operation "export->exp_xxx = 1" isn't atomic (load the whole unsigned long, change a bit, write the whole unsigned long), so any flag change needs to take the lock.
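
The load/change/write sequence described above can be modelled without real threads. The following is a minimal sketch, not Lustre code: `struct demo_export`, `exp_other_flag`, `racy_updates()` and `serialized_updates()` are hypothetical names made up for illustration, and only `exp_lock_replay_needed` is taken from this ticket. Each "CPU" is played by a local copy of the whole flag word, so the interleaving is deterministic.

```c
#include <assert.h>

/* Hypothetical stand-in for the export flags: two 1-bit fields that
 * share one unsigned long, as described in the comment above. */
struct demo_export {
    unsigned long exp_lock_replay_needed:1,
                  exp_other_flag:1;   /* invented second flag */
};

/* Model two unlocked writers whose load/modify/store steps interleave. */
struct demo_export racy_updates(void)
{
    struct demo_export shared = { 0 };
    struct demo_export cpu_a, cpu_b;

    cpu_a = shared;                     /* A loads the whole word */
    cpu_b = shared;                     /* B loads the same old word */
    cpu_a.exp_lock_replay_needed = 1;   /* A changes its bit */
    cpu_b.exp_other_flag = 1;           /* B changes a different bit */
    shared = cpu_a;                     /* A writes the whole word */
    shared = cpu_b;                     /* B writes too: A's bit is lost */
    return shared;
}

/* Model the same two updates when a lock forces them to run one at a time. */
struct demo_export serialized_updates(void)
{
    struct demo_export shared = { 0 };
    struct demo_export tmp;

    tmp = shared; tmp.exp_lock_replay_needed = 1; shared = tmp;
    tmp = shared; tmp.exp_other_flag = 1;         shared = tmp;
    return shared;
}

/* Self-check: the racy interleaving drops a flag, the serialized one does not. */
void demo_self_check(void)
{
    struct demo_export racy = racy_updates();
    struct demo_export ok = serialized_updates();

    assert(racy.exp_lock_replay_needed == 0 && racy.exp_other_flag == 1);
    assert(ok.exp_lock_replay_needed == 1 && ok.exp_other_flag == 1);
}
```

In this model the serialized case also covers updates that never run concurrently: without an interleaving, no bit is lost. The lock matters only because two writers may touch different flags in the same word at the same time.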

Vinayak (Inactive) added a comment - 07/Sep/16 8:46 AM - edited

Hello Niu,

I am trying to understand http://review.whamcloud.com/#/c/12162.

Can you please help me understand how it affects the req->rq_export->exp_lock_replay_needed flag?

Thanks in advance,

Peter Jones added a comment - 06/Nov/14 9:49 PM

Landed for 2.5.4 and 2.7

Bob Glossman (Inactive) added a comment - 01/Oct/14 9:06 PM

backports to b2_5:
http://review.whamcloud.com/#/c/12162
http://review.whamcloud.com/#/c/12163

Christopher Morrone (Inactive) added a comment - 01/Oct/14 7:48 PM

All of the patches needed to fix this ticket's assertion.

But actually, we are moving to 2.5 soon, so patches for b2_5 should be sufficient.

Bob Glossman (Inactive) added a comment - 01/Oct/14 6:45 PM

Christopher, which patch(es) do you need backported to b2_4? I don't want to do too much or too little.

Christopher Morrone (Inactive) added a comment - 01/Oct/14 6:24 PM

We just had 181 servers crash with this assertion when starting up 2.4.2-16chaos (see github.com/chaos/lustre) on the servers for the first time. We need a patch for our branch as well.

Sarah Liu added a comment - 29/Sep/14 11:41 PM

Hit this bug on master branch build #2671
https://testing.hpdd.intel.com/test_sets/d986a3a2-472c-11e4-a9ec-5254006e85c2

Niu Yawei (Inactive) added a comment - 26/Sep/14 12:10 PM

Andriy discovered another path that triggers this assertion (see LU-5651); the patch is being reviewed at: http://review.whamcloud.com/#/c/12015/

Niu Yawei (Inactive) added a comment - 11/Sep/14 11:04 AM

Well, I found two places that modify exp_flags without holding exp_lock, which could result in concurrent exp_flags updates overwriting each other.

patch for master: http://review.whamcloud.com/11871
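
The locking rule in this comment (every writer of the shared flag word holds the export lock) might look roughly like the sketch below. It is an illustration under stated assumptions, not the patch itself: `struct locked_export`, `exp_other_flag`, `demo_lock()`, `demo_unlock()` and the setter names are hypothetical, and a C11 `atomic_flag` spin loop stands in for the kernel spinlock so the example builds in user space.

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical user-space model of an export: a lock plus 1-bit flags
 * that share one unsigned long. Only exp_lock and
 * exp_lock_replay_needed are names taken from this ticket. */
struct locked_export {
    atomic_flag   exp_lock;              /* stand-in for a spinlock */
    unsigned long exp_lock_replay_needed:1,
                  exp_other_flag:1;      /* invented second flag */
};

void locked_export_init(struct locked_export *exp)
{
    atomic_flag_clear(&exp->exp_lock);
    exp->exp_lock_replay_needed = 0;
    exp->exp_other_flag = 0;
}

/* Spin until the lock is acquired. */
void demo_lock(struct locked_export *exp)
{
    while (atomic_flag_test_and_set_explicit(&exp->exp_lock,
                                             memory_order_acquire))
        ;   /* busy-wait */
}

void demo_unlock(struct locked_export *exp)
{
    atomic_flag_clear_explicit(&exp->exp_lock, memory_order_release);
}

/* Every flag writer takes the same lock, so the load/modify/store of
 * the shared word cannot interleave with another writer. */
void set_lock_replay_needed(struct locked_export *exp, int value)
{
    demo_lock(exp);
    exp->exp_lock_replay_needed = !!value;
    demo_unlock(exp);
}

void set_other_flag(struct locked_export *exp, int value)
{
    demo_lock(exp);
    exp->exp_other_flag = !!value;
    demo_unlock(exp);
}
```

The point of the pattern is that the lock protects the whole word rather than a single flag: a writer of `exp_other_flag` that skipped the lock could still overwrite a concurrent change to `exp_lock_replay_needed`.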

People

Assignee: Niu Yawei (Inactive)

Reporter: John Hammond

Votes: 0

Watchers: 11

Dates

Created: 02/Jul/14 6:06 PM

Updated: 19/Sep/16 2:51 AM

Resolved: 06/Nov/14 9:49 PM