Uploaded image for project: 'Lustre'
  1. Lustre
  2. LU-8494

IOR Aborts with EINTR in 2.7 but not 2.5

    XMLWordPrintable

Details

    • Bug
    • Resolution: Cannot Reproduce
    • Minor
    • None
    • None
    • None
    • 3
    • 9223372036854775807

    Description

      Patrick Farrel did the analysis on this one. He'll be submitting a fix for this issue. The following is his analysis:

      From log 33 (Lustre 2.7):
      00010000:00010000:16.0:1470249261.958417:0:8202:0:(ldlm_resource.c:1244:ldlm_resource_add_lock()) ### About to add this lock:
      ns: husk-OST0004-osc-ffff88082a5d7800 lock: ffff8807cc3d13c0/0xe5c877bd577e5377 lrc: 4/1,0 mode: -/PR res: [0x3409618:0x0:0x0].0 rrc: 14 type: EXT [6710886400->6715080703] (req 6710886400>6715080703) flags: 0x10000000020000 nid: local remote: 0x7df4b36cbdd565b2 expref: -99 pid: 8202 timeout: 0 lvb_type: 1
      00010000:00000001:16.0:1470249261.958420:0:8202:0:(ldlm_lock.c:1859:ldlm_lock_enqueue()) Process leaving via out (rc=0 : 0 : 0x0)
      00010000:00000001:16.0:1470249261.958421:0:8202:0:(ldlm_request.c:253:ldlm_completion_ast()) Process entered
      00010000:00010000:16.0:1470249261.958421:0:8202:0:(ldlm_request.c:266:ldlm_completion_ast()) ### client-side enqueue returned a blocked lock, sleeping ns: husk-OST0004-osc-ffff88082a5d7800 lock: ffff8807cc3d13c0/0xe5c877bd577e5377 lrc: 4/1,0 mode: -/PR res: [0x3409618:0x0:0x0].0 rrc: 14 type: EXT [6710886400->6715080703] (req 6710886400>6715080703) flags: 0x20000 nid: local remote: 0x7df4b36cbdd565b2 expref: -99 pid: 8202 timeout: 0 lvb_type: 1
      00010000:00010000:16.0:1470249261.958424:0:8202:0:(ldlm_request.c:283:ldlm_completion_ast()) ### waiting indefinitely because of NO_TIMEOUT ns: husk-OST0004-osc-ffff88082a5d7800 lock: ffff8807cc3d13c0/0xe5c877bd577e5377 lrc: 4/1,0 mode: -/PR res: [0x3409618:0x0:0x0].0 rrc: 14 type: EXT [6710886400->6715080703] (req 6710886400>6715080703) flags: 0x20000 nid: local remote: 0x7df4b36cbdd565b2 expref: -99 pid: 8202 timeout: 0 lvb_type: 1

      And, much later:
      00010000:00010000:16.0:1470249264.458654:0:8202:0:(ldlm_request.c:310:ldlm_completion_ast()) ### client-side enqueue waking up: failed (4) ns: husk-OST0004-osc-ffff88082a5d7800 lock: ffff8807cc3d13c0/0xe5c877bd577e5377 lrc: 4/1,0 mode: --/PR res: [0x3409618:0x0:0x0].0 rrc: 9 type: EXT [6710886400->6715080703] (req 6710886400>6715080703) flags: 0x20000 nid: local remote: 0x7df4b36cbdd565b2 expref: -99 pid: 8202 timeout: 0 lvb_type: 1
      00010000:00000001:16.0:1470249264.458659:0:8202:0:(ldlm_request.c:311:ldlm_completion_ast()) Process leaving (rc=18446744073709551612 : -4 : fffffffffffffffc)

      Note that in this case, there's only one user process here. In several other examples, the ldlm_completion_ast messages are only present for the wakeup with -4, not the going to sleep. (Just not in the logs.)

      But in those cases, all of the processes wake up at the same time. I think we're sending a signal and waking them up - This is expected behavior.

      And now I see why it doesn't happen on Lustre 2.5. Lustre 2.5 uses ldlm_completion_ast_async for OST locks like this. ldlm_completion_ast_async doesn't sleep (it's called from the ptlrpcd thread anyway).

      Instead, the userspace threads in Lustre 2.5 are sleeping in cl_lock_state_wait. That code returns -ERESTARTSYS when interrupted.

      Here's it being interrupted from the 2.5 logs:
      32.dklog:00000020:00010000:18.0:1470253848.163773:0:8262:0:(cl_lock.c:151:cl_lock_trace0()) state wait lock: ffff880834494c40@(3 ffff88082fe5a3c0 1 1 0 1 1 0)(ffff880830a0e680/0/1) at cl_lock_state_wait():968
      32.dklog:00000020:00000001:19.0:1470253851.931837:0:8264:0:(cl_lock.c:999:cl_lock_state_wait()) Process leaving (rc=18446744073709551104 : -512 : fffffffffffffe00)
      32.dklog:00000020:00000001:3.0:1470253851.931837:0:8263:0:(cl_lock.c:999:cl_lock_state_wait()) Process leaving (rc=18446744073709551104 : -512 : fffffffffffffe00)
      32.dklog:00000020:00000001:18.0:1470253851.931840:0:8262:0:(cl_lock.c:999:cl_lock_state_wait()) Process leaving (rc=18446744073709551104 : -512 : fffffffffffffe00)
      32.dklog:00000020:00000001:17.0:1470253851.931843:0:8260:0:(cl_lock.c:999:cl_lock_state_wait()) Process leaving (rc=18446744073709551104 : -512 : fffffffffffffe00)
      32.dklog:00000020:00000001:16.0:1470253851.931855:0:8258:0:(cl_lock.c:999:cl_lock_state_wait()) Process leaving (rc=18446744073709551104 : -512 : fffffffffffffe00)
      32.dklog:00000020:00000001:18.0:1470253853.940209:0:8262:0:(cl_lock.c:962:cl_lock_state_wait()) Process entered

      Getting back to Lustre 2.7 and ldlm_completion_ast:
      So... We could return -ERESTARTSYS here, like 2.5 does from the cl_lock_state_wait code?

      Considering other choices: There is no timeout here. So we can't do a non-interruptible wait.

      So we probably need to follow the model from osc_enter_cache (look at the l_wait_event there), and translate -EINTR from l_wait_event here in to -ERESTARTSYS.

      If anyone is interested in seeing the logs referenced by patrick I can attach them to the ticket.

      Attachments

        Issue Links

          Activity

            People

              paf Patrick Farrell
              hornc Chris Horn
              Votes:
              0 Vote for this issue
              Watchers:
              6 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: