Details
Bug
Resolution: Fixed
Critical
Lustre 2.6.0, Lustre 2.7.0
3
14750
Description
Running racer with 2 clients, MDSCOUNT=1, and 2.5.60-90-g37432a8 + http://review.whamcloud.com/#/c/5936/, I see this when restarting a crashed OST with some clients still mounted.
[ 230.089707] Lustre: Skipped 75 previous similar messages
[ 231.775205] Lustre: 2151:0:(client.c:1924:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1404323793/real 1404323793] req@ffff8801f78fc110 x1472540086110788/t0(0) o400->lustre-OST0001-osc-MDT0000@0@lo:28/4 lens 224/224 e 1 to 1 dl 1404323837 ref 1 fl Rpc:X/c0/ffffffff rc 0/-1
[ 237.775938] Lustre: lustre-OST0001: Denying connection for new client cc64d6dc-4180-e700-9f7e-ce147524a8f0 (at 0@lo), waiting for all 4 known clients (2 recovered, 1 in progress, and 1 evicted) to recover in 0:36
[ 237.781858] Lustre: Skipped 3 previous similar messages
[ 242.801254] LustreError: 2880:0:(ldlm_lib.c:2253:target_queue_recovery_request()) ASSERTION( req->rq_export->exp_lock_replay_needed ) failed:
[ 242.805102] LustreError: 2880:0:(ldlm_lib.c:2253:target_queue_recovery_request()) LBUG
[ 242.807953] Pid: 2880, comm: ll_ost00_007
[ 242.809274]
[ 242.809276] Call Trace:
[ 242.810585] [<ffffffffa02b98c5>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
[ 242.812764] [<ffffffffa02b9ec7>] lbug_with_loc+0x47/0xb0 [libcfs]
[ 242.814689] [<ffffffffa064ea0c>] target_queue_recovery_request+0xbac/0xc10 [ptlrpc]
[ 242.816347] [<ffffffffa06e122f>] tgt_handle_recovery+0x38f/0x520 [ptlrpc]
[ 242.817666] [<ffffffffa06e6b8d>] tgt_request_handle+0x18d/0xad0 [ptlrpc]
[ 242.818987] [<ffffffffa0699e31>] ptlrpc_main+0xcf1/0x1880 [ptlrpc]
[ 242.820261] [<ffffffffa0699140>] ? ptlrpc_main+0x0/0x1880 [ptlrpc]
[ 242.821440] [<ffffffff8109eab6>] kthread+0x96/0xa0
[ 242.822360] [<ffffffff8100c30a>] child_rip+0xa/0x20
[ 242.823303] [<ffffffff81554710>] ? _spin_unlock_irq+0x30/0x40
[ 242.824390] [<ffffffff8100bb10>] ? restore_args+0x0/0x30
[ 242.825391] [<ffffffff8109ea20>] ? kthread+0x0/0xa0
[ 242.826315] [<ffffffff8100c300>] ? child_rip+0x0/0x20
[ 242.827283]
Thanks Niu.
Please correct me if I am wrong.
If the operations are not concurrent, then it does not matter whether the bit field change happens with or without the lock. The whole unsigned long will not be affected if we change any of the bits. Am I right?
What is the real importance of the lock here?
Thanks in advance,
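
To make the question above concrete: a one-bit field is normally updated by loading the word that contains it, changing the bit, and storing the whole word back. The following is a minimal C sketch of that read-modify-write, assuming two flags share one unsigned long. The mask and function names are hypothetical and are not taken from the Lustre source. The interleaving is written out by hand instead of using threads, so the outcome is deterministic.

```c
#include <assert.h>

/* Hypothetical masks standing in for two adjacent one-bit fields that
 * share a single unsigned long; these are not Lustre's definitions. */
#define FLAG_REQ_REPLAY  (1UL << 0)
#define FLAG_LOCK_REPLAY (1UL << 1)

/* Unlocked interleaving: both writers load the word before either stores. */
unsigned long racy_update(unsigned long word)
{
    unsigned long a = word;      /* writer A loads the whole word */
    unsigned long b = word;      /* writer B loads the same old value */

    a |= FLAG_REQ_REPLAY;        /* A sets its bit in a private copy */
    word = a;                    /* A stores the whole word back */

    b |= FLAG_LOCK_REPLAY;       /* B sets its bit in a stale copy */
    word = b;                    /* B stores; A's bit is overwritten */
    return word;
}

/* Serialized updates, as a lock would enforce: each load sees the
 * previous writer's store. */
unsigned long locked_update(unsigned long word)
{
    word |= FLAG_REQ_REPLAY;     /* writer A: load, set, store */
    word |= FLAG_LOCK_REPLAY;    /* writer B: load, set, store */
    return word;
}

/* Small self-check showing the two outcomes starting from zero. */
void self_check(void)
{
    assert(racy_update(0) == FLAG_LOCK_REPLAY);
    assert(locked_update(0) == (FLAG_REQ_REPLAY | FLAG_LOCK_REPLAY));
}
```

Starting from 0, racy_update() returns 0x2 (the first writer's bit is lost) and locked_update() returns 0x3. If the updates really are never concurrent, the unlocked path behaves like locked_update(); the lock matters only when two writers can touch bits in the same word at the same time, even if they touch different bits.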