[LU-8494] IOR Aborts with EINTR in 2.7 but not 2.5 Created: 10/Aug/16  Updated: 06/Feb/22  Resolved: 06/Feb/22

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: Chris Horn Assignee: Patrick Farrell (Inactive)
Resolution: Cannot Reproduce Votes: 0
Labels: None

Issue Links:
Related
is related to LU-9340 PFL fails performance tests Spirit Resolved
is related to LU-3020 Lustre returns EINTR during writes wh... Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

Patrick Farrell did the analysis on this one. He'll be submitting a fix for this issue. The following is his analysis:

From log 33 (Lustre 2.7):
00010000:00010000:16.0:1470249261.958417:0:8202:0:(ldlm_resource.c:1244:ldlm_resource_add_lock()) ### About to add this lock:
ns: husk-OST0004-osc-ffff88082a5d7800 lock: ffff8807cc3d13c0/0xe5c877bd577e5377 lrc: 4/1,0 mode: -/PR res: [0x3409618:0x0:0x0].0 rrc: 14 type: EXT [6710886400->6715080703] (req 6710886400>6715080703) flags: 0x10000000020000 nid: local remote: 0x7df4b36cbdd565b2 expref: -99 pid: 8202 timeout: 0 lvb_type: 1
00010000:00000001:16.0:1470249261.958420:0:8202:0:(ldlm_lock.c:1859:ldlm_lock_enqueue()) Process leaving via out (rc=0 : 0 : 0x0)
00010000:00000001:16.0:1470249261.958421:0:8202:0:(ldlm_request.c:253:ldlm_completion_ast()) Process entered
00010000:00010000:16.0:1470249261.958421:0:8202:0:(ldlm_request.c:266:ldlm_completion_ast()) ### client-side enqueue returned a blocked lock, sleeping ns: husk-OST0004-osc-ffff88082a5d7800 lock: ffff8807cc3d13c0/0xe5c877bd577e5377 lrc: 4/1,0 mode: -/PR res: [0x3409618:0x0:0x0].0 rrc: 14 type: EXT [6710886400->6715080703] (req 6710886400>6715080703) flags: 0x20000 nid: local remote: 0x7df4b36cbdd565b2 expref: -99 pid: 8202 timeout: 0 lvb_type: 1
00010000:00010000:16.0:1470249261.958424:0:8202:0:(ldlm_request.c:283:ldlm_completion_ast()) ### waiting indefinitely because of NO_TIMEOUT ns: husk-OST0004-osc-ffff88082a5d7800 lock: ffff8807cc3d13c0/0xe5c877bd577e5377 lrc: 4/1,0 mode: -/PR res: [0x3409618:0x0:0x0].0 rrc: 14 type: EXT [6710886400->6715080703] (req 6710886400>6715080703) flags: 0x20000 nid: local remote: 0x7df4b36cbdd565b2 expref: -99 pid: 8202 timeout: 0 lvb_type: 1

And, much later:
00010000:00010000:16.0:1470249264.458654:0:8202:0:(ldlm_request.c:310:ldlm_completion_ast()) ### client-side enqueue waking up: failed (4) ns: husk-OST0004-osc-ffff88082a5d7800 lock: ffff8807cc3d13c0/0xe5c877bd577e5377 lrc: 4/1,0 mode: --/PR res: [0x3409618:0x0:0x0].0 rrc: 9 type: EXT [6710886400->6715080703] (req 6710886400>6715080703) flags: 0x20000 nid: local remote: 0x7df4b36cbdd565b2 expref: -99 pid: 8202 timeout: 0 lvb_type: 1
00010000:00000001:16.0:1470249264.458659:0:8202:0:(ldlm_request.c:311:ldlm_completion_ast()) Process leaving (rc=18446744073709551612 : -4 : fffffffffffffffc)
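For reference, the rc in these traces is printed in unsigned, signed, and hex form. A quick userspace check (illustrative only, not Lustre code) confirms that 0xfffffffffffffffc is -4, i.e. -EINTR, and that the 0xfffffffffffffe00 seen later in the 2.5 logs is -512, i.e. the kernel-internal -ERESTARTSYS:

#include <stdio.h>
#include <errno.h>

int main(void)
{
        /* The hex rc values from the dk logs, reinterpreted as signed longs. */
        long rc27 = (long)0xfffffffffffffffcUL;  /* 2.7 wakeup */
        long rc25 = (long)0xfffffffffffffe00UL;  /* 2.5 wakeup */

        printf("%ld %ld\n", rc27, rc25);  /* prints: -4 -512 */
        printf("EINTR = %d\n", EINTR);    /* 4; ERESTARTSYS (512) is kernel-internal */
        return 0;
}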

Note that in this case there's only one user process. In several other examples, the ldlm_completion_ast messages are present only for the wakeup with -4, not for going to sleep (the sleep messages just aren't in the logs).

But in those cases, all of the processes wake up at the same time. I think we're sending a signal and waking them up - this is expected behavior.

And now I see why it doesn't happen on Lustre 2.5: 2.5 uses ldlm_completion_ast_async for OST locks like this, and ldlm_completion_ast_async doesn't sleep (it's called from the ptlrpcd thread anyway).

Instead, the userspace threads in Lustre 2.5 are sleeping in cl_lock_state_wait. That code returns -ERESTARTSYS when interrupted.

Here it is being interrupted, from the 2.5 logs:
32.dklog:00000020:00010000:18.0:1470253848.163773:0:8262:0:(cl_lock.c:151:cl_lock_trace0()) state wait lock: ffff880834494c40@(3 ffff88082fe5a3c0 1 1 0 1 1 0)(ffff880830a0e680/0/1) at cl_lock_state_wait():968
32.dklog:00000020:00000001:19.0:1470253851.931837:0:8264:0:(cl_lock.c:999:cl_lock_state_wait()) Process leaving (rc=18446744073709551104 : -512 : fffffffffffffe00)
32.dklog:00000020:00000001:3.0:1470253851.931837:0:8263:0:(cl_lock.c:999:cl_lock_state_wait()) Process leaving (rc=18446744073709551104 : -512 : fffffffffffffe00)
32.dklog:00000020:00000001:18.0:1470253851.931840:0:8262:0:(cl_lock.c:999:cl_lock_state_wait()) Process leaving (rc=18446744073709551104 : -512 : fffffffffffffe00)
32.dklog:00000020:00000001:17.0:1470253851.931843:0:8260:0:(cl_lock.c:999:cl_lock_state_wait()) Process leaving (rc=18446744073709551104 : -512 : fffffffffffffe00)
32.dklog:00000020:00000001:16.0:1470253851.931855:0:8258:0:(cl_lock.c:999:cl_lock_state_wait()) Process leaving (rc=18446744073709551104 : -512 : fffffffffffffe00)
32.dklog:00000020:00000001:18.0:1470253853.940209:0:8262:0:(cl_lock.c:962:cl_lock_state_wait()) Process entered
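The wait shown above follows the standard kernel interruptible-sleep pattern. A minimal sketch of that behavior (illustrative names, not the actual 2.5 cl_lock_state_wait source):

#include <linux/wait.h>

/* Sketch: sleep interruptibly until the condition holds.
 * wait_event_interruptible() itself returns -ERESTARTSYS when the
 * sleeping task is signalled, which is what the rc=-512 lines show. */
static int state_wait_sketch(wait_queue_head_t *waitq, int *state_ok)
{
        return wait_event_interruptible(*waitq, *state_ok);
}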

Getting back to Lustre 2.7 and ldlm_completion_ast:
So... we could return -ERESTARTSYS here, like 2.5 does from the cl_lock_state_wait code?

Considering other options: there is no timeout here, so we can't fall back to a non-interruptible wait.

So we probably need to follow the model from osc_enter_cache (look at the l_wait_event there) and translate the -EINTR from l_wait_event here into -ERESTARTSYS.
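A minimal sketch of that translation, assuming an LWI_INTR-style interruptible l_wait_event as in osc_enter_cache (names like is_granted() are illustrative; this is not the actual ldlm_completion_ast code):

#include <linux/errno.h>

/* Sketch: wait interruptibly with no timeout, then map the -EINTR that
 * l_wait_event() reports on a signal to -ERESTARTSYS, so the VFS can
 * restart the syscall instead of the application (IOR) seeing EINTR. */
static int completion_wait_sketch(struct ldlm_lock *lock)
{
        struct l_wait_info lwi = LWI_INTR(NULL, NULL);  /* interruptible, no timeout */
        int rc;

        rc = l_wait_event(lock->l_waitq, is_granted(lock), &lwi);
        if (rc == -EINTR)
                rc = -ERESTARTSYS;
        return rc;
}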

If anyone is interested in seeing the logs referenced by Patrick, I can attach them to the ticket.



 Comments   
Comment by Gerrit Updater [ 10/Aug/16 ]

Patrick Farrell (paf@cray.com) uploaded a new patch: http://review.whamcloud.com/21863
Subject: LU-8494 ldlm: Return -ERESTARTSYS from wait
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 565e4c46d7e00d344de22ffb7277236bf624031d

Comment by Patrick Farrell (Inactive) [ 10/Aug/16 ]

I've tested the above patch with a reproducer for this problem.

Comment by Cory Spitz [ 24/Jan/22 ]

I think it is OK to resolve this and abandon https://review.whamcloud.com/#/c/21863/. Is it?

Comment by Patrick Farrell [ 06/Feb/22 ]

I agree, Cory - nothing's been reported here in ages. I'm not sure it's fixed, but there have been a bunch of changes in this area, so I'm comfortable abandoning it and picking it back up if it turns out to be needed.

Comment by Patrick Farrell [ 06/Feb/22 ]

Doesn't seem to be a problem in newer versions.
