Details
-
Bug
-
Resolution: Cannot Reproduce
-
Minor
-
None
-
None
-
None
-
3
-
9223372036854775807
Description
Patrick Farrel did the analysis on this one. He'll be submitting a fix for this issue. The following is his analysis:
From log 33 (Lustre 2.7):
00010000:00010000:16.0:1470249261.958417:0:8202:0:(ldlm_resource.c:1244:ldlm_resource_add_lock()) ### About to add this lock:
ns: husk-OST0004-osc-ffff88082a5d7800 lock: ffff8807cc3d13c0/0xe5c877bd577e5377 lrc: 4/1,0 mode: -/PR res: [0x3409618:0x0:0x0].0 rrc: 14 type: EXT [6710886400->6715080703] (req 6710886400>6715080703) flags: 0x10000000020000 nid: local remote: 0x7df4b36cbdd565b2 expref: -99 pid: 8202 timeout: 0 lvb_type: 1
00010000:00000001:16.0:1470249261.958420:0:8202:0:(ldlm_lock.c:1859:ldlm_lock_enqueue()) Process leaving via out (rc=0 : 0 : 0x0)
00010000:00000001:16.0:1470249261.958421:0:8202:0:(ldlm_request.c:253:ldlm_completion_ast()) Process entered
00010000:00010000:16.0:1470249261.958421:0:8202:0:(ldlm_request.c:266:ldlm_completion_ast()) ### client-side enqueue returned a blocked lock, sleeping ns: husk-OST0004-osc-ffff88082a5d7800 lock: ffff8807cc3d13c0/0xe5c877bd577e5377 lrc: 4/1,0 mode: -/PR res: [0x3409618:0x0:0x0].0 rrc: 14 type: EXT [6710886400->6715080703] (req 6710886400>6715080703) flags: 0x20000 nid: local remote: 0x7df4b36cbdd565b2 expref: -99 pid: 8202 timeout: 0 lvb_type: 1
00010000:00010000:16.0:1470249261.958424:0:8202:0:(ldlm_request.c:283:ldlm_completion_ast()) ### waiting indefinitely because of NO_TIMEOUT ns: husk-OST0004-osc-ffff88082a5d7800 lock: ffff8807cc3d13c0/0xe5c877bd577e5377 lrc: 4/1,0 mode: -/PR res: [0x3409618:0x0:0x0].0 rrc: 14 type: EXT [6710886400->6715080703] (req 6710886400>6715080703) flags: 0x20000 nid: local remote: 0x7df4b36cbdd565b2 expref: -99 pid: 8202 timeout: 0 lvb_type: 1And, much later:
00010000:00010000:16.0:1470249264.458654:0:8202:0:(ldlm_request.c:310:ldlm_completion_ast()) ### client-side enqueue waking up: failed (4) ns: husk-OST0004-osc-ffff88082a5d7800 lock: ffff8807cc3d13c0/0xe5c877bd577e5377 lrc: 4/1,0 mode: --/PR res: [0x3409618:0x0:0x0].0 rrc: 9 type: EXT [6710886400->6715080703] (req 6710886400>6715080703) flags: 0x20000 nid: local remote: 0x7df4b36cbdd565b2 expref: -99 pid: 8202 timeout: 0 lvb_type: 1
00010000:00000001:16.0:1470249264.458659:0:8202:0:(ldlm_request.c:311:ldlm_completion_ast()) Process leaving (rc=18446744073709551612 : -4 : fffffffffffffffc)Note that in this case, there's only one user process here. In several other examples, the ldlm_completion_ast messages are only present for the wakeup with -4, not the going to sleep. (Just not in the logs.)
But in those cases, all of the processes wake up at the same time. I think we're sending a signal and waking them up - This is expected behavior.
And now I see why it doesn't happen on Lustre 2.5. Lustre 2.5 uses ldlm_completion_ast_async for OST locks like this. ldlm_completion_ast_async doesn't sleep (it's called from the ptlrpcd thread anyway).
Instead, the userspace threads in Lustre 2.5 are sleeping in cl_lock_state_wait. That code returns -ERESTARTSYS when interrupted.
Here's it being interrupted from the 2.5 logs:
32.dklog:00000020:00010000:18.0:1470253848.163773:0:8262:0:(cl_lock.c:151:cl_lock_trace0()) state wait lock: ffff880834494c40@(3 ffff88082fe5a3c0 1 1 0 1 1 0)(ffff880830a0e680/0/1) at cl_lock_state_wait():968
32.dklog:00000020:00000001:19.0:1470253851.931837:0:8264:0:(cl_lock.c:999:cl_lock_state_wait()) Process leaving (rc=18446744073709551104 : -512 : fffffffffffffe00)
32.dklog:00000020:00000001:3.0:1470253851.931837:0:8263:0:(cl_lock.c:999:cl_lock_state_wait()) Process leaving (rc=18446744073709551104 : -512 : fffffffffffffe00)
32.dklog:00000020:00000001:18.0:1470253851.931840:0:8262:0:(cl_lock.c:999:cl_lock_state_wait()) Process leaving (rc=18446744073709551104 : -512 : fffffffffffffe00)
32.dklog:00000020:00000001:17.0:1470253851.931843:0:8260:0:(cl_lock.c:999:cl_lock_state_wait()) Process leaving (rc=18446744073709551104 : -512 : fffffffffffffe00)
32.dklog:00000020:00000001:16.0:1470253851.931855:0:8258:0:(cl_lock.c:999:cl_lock_state_wait()) Process leaving (rc=18446744073709551104 : -512 : fffffffffffffe00)
32.dklog:00000020:00000001:18.0:1470253853.940209:0:8262:0:(cl_lock.c:962:cl_lock_state_wait()) Process enteredGetting back to Lustre 2.7 and ldlm_completion_ast:
So... We could return -ERESTARTSYS here, like 2.5 does from the cl_lock_state_wait code?Considering other choices: There is no timeout here. So we can't do a non-interruptible wait.
So we probably need to follow the model from osc_enter_cache (look at the l_wait_event there), and translate -EINTR from l_wait_event here in to -ERESTARTSYS.
If anyone is interested in seeing the logs referenced by patrick I can attach them to the ticket.