
[LU-6368] ASSERTION( cur->oe_dlmlock == victim->oe_dlmlock ) failed

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Blocker
    • Fix Version/s: Lustre 2.8.0
    • Affects Version/s: Lustre 2.7.0
    • Environment: centos 6 + Lustre head of tree (2.7+)
    • Severity: 3

    Description

      I can consistently crash the Lustre client with the attached reproducer.

      Info from the logs:

      <0>LustreError: 26474:0:(osc_cache.c:519:osc_extent_merge()) ASSERTION( cur->oe_dlmlock == victim->oe_dlmlock ) failed: 
      <0>LustreError: 26474:0:(osc_cache.c:519:osc_extent_merge()) LBUG
      

      Stack trace from crash:

      crash> bt
      PID: 26474  TASK: ffff88003747caa0  CPU: 3   COMMAND: "llsendfile3"
       #0 [ffff88001a2835f0] machine_kexec at ffffffff81038f3b
       #1 [ffff88001a283650] crash_kexec at ffffffff810c5b62
       #2 [ffff88001a283720] panic at ffffffff815285a3
       #3 [ffff88001a2837a0] lbug_with_loc at ffffffffa0ac8eeb [libcfs]
       #4 [ffff88001a2837c0] osc_extent_merge at ffffffffa06ce57d [osc]
       #5 [ffff88001a2838d0] osc_extent_release at ffffffffa06d3efb [osc]
       #6 [ffff88001a283900] osc_io_end at ffffffffa06c520f [osc]
       #7 [ffff88001a283920] cl_io_end at ffffffffa0dfc270 [obdclass]
       #8 [ffff88001a283950] lov_io_end_wrapper at ffffffffa070f3b1 [lov]
       #9 [ffff88001a283970] lov_io_call at ffffffffa070f0fe [lov]
      #10 [ffff88001a2839a0] lov_io_end at ffffffffa0710fbc [lov]
      #11 [ffff88001a2839c0] cl_io_end at ffffffffa0dfc270 [obdclass]
      #12 [ffff88001a2839f0] cl_io_loop at ffffffffa0e00b52 [obdclass]
      #13 [ffff88001a283a20] ll_file_io_generic at ffffffffa125e20c [lustre]
      #14 [ffff88001a283b40] ll_file_aio_write at ffffffffa125e933 [lustre]
      #15 [ffff88001a283ba0] ll_file_write at ffffffffa125edd9 [lustre]
      #16 [ffff88001a283c10] vfs_write at ffffffff81188df8
      #17 [ffff88001a283c50] kernel_write at ffffffff811b8ded
      #18 [ffff88001a283c80] write_pipe_buf at ffffffff811b8e5a
      #19 [ffff88001a283cc0] splice_from_pipe_feed at ffffffff811b7a92
      #20 [ffff88001a283d10] __splice_from_pipe at ffffffff811b84ee
      #21 [ffff88001a283d50] splice_from_pipe at ffffffff811b8551
      #22 [ffff88001a283da0] default_file_splice_write at ffffffff811b858d
      #23 [ffff88001a283dc0] do_splice_from at ffffffff811b862e
      #24 [ffff88001a283e00] direct_splice_actor at ffffffff811b8680
      #25 [ffff88001a283e10] splice_direct_to_actor at ffffffff811b8956
      #26 [ffff88001a283e80] do_splice_direct at ffffffff811b8a9d
      #27 [ffff88001a283ed0] do_sendfile at ffffffff811891fc
      #28 [ffff88001a283f30] sys_sendfile64 at ffffffff81189294
      #29 [ffff88001a283f80] system_call_fastpath at ffffffff8100b072
          RIP: 0000003a522df7da  RSP: 00007fffe6f8add8  RFLAGS: 00010206
          RAX: 0000000000000028  RBX: ffffffff8100b072  RCX: 0000000000a00000
          RDX: 0000000000000000  RSI: 0000000000000003  RDI: 0000000000000004
          RBP: 0000000000000004   R8: 0000003a5258f300   R9: 0000003a51a0e9f0
          R10: 0000000000a00000  R11: 0000000000000206  R12: 0000000000000000
          R13: 00007fffe6f8aed0  R14: 0000000000401b90  R15: 0000000000000003
          ORIG_RAX: 0000000000000028  CS: 0033  SS: 002b
      

      This is related to the group lock taken on the target file. If the group lock is commented out of the reproducer, then no crash happens.
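      The attached reproducer (llsendfile3) is not included inline; the following is only a minimal sketch of the scenario it exercises, pieced together from this description and the comments below. The group ID, file names, and the <lustre/lustre_user.h> header path are assumptions, not taken from the attachment.

      /* Minimal sketch of the reported scenario (NOT the attached llsendfile3
       * reproducer): take a group lock on the target file on a Lustre mount,
       * then sendfile() a source file larger than one page into it. */
      #include <fcntl.h>
      #include <stdio.h>
      #include <sys/ioctl.h>
      #include <sys/sendfile.h>
      #include <sys/stat.h>
      #include <unistd.h>
      #include <lustre/lustre_user.h>  /* LL_IOC_GROUP_LOCK / LL_IOC_GROUP_UNLOCK (assumed header path) */

      int main(int argc, char **argv)
      {
              struct stat st;
              off_t off = 0;
              int gid = 1234;                 /* arbitrary group lock ID */

              if (argc < 3) {
                      fprintf(stderr, "usage: %s <src> <lustre-target>\n", argv[0]);
                      return 1;
              }

              int in = open(argv[1], O_RDONLY);                   /* source file, > 4 KiB */
              int out = open(argv[2], O_WRONLY | O_CREAT, 0644);  /* target on a Lustre mount */
              if (in < 0 || out < 0 || fstat(in, &st) < 0) {
                      perror("open/fstat");
                      return 1;
              }

              /* Group-lock the target; per the description, commenting this
               * out makes the crash go away. */
              if (ioctl(out, LL_IOC_GROUP_LOCK, gid) < 0) {
                      perror("LL_IOC_GROUP_LOCK");
                      return 1;
              }

              /* sendfile() the whole source file; on this kernel it ends up in
               * do_splice_direct -> ll_file_write one page at a time (see the
               * stack trace above). */
              if (sendfile(out, in, &off, st.st_size) < 0)
                      perror("sendfile");

              ioctl(out, LL_IOC_GROUP_UNLOCK, gid);
              close(in);
              close(out);
              return 0;
      }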


          Activity

            pjones Peter Jones added a comment -

            Landed for 2.8


            gerrit Gerrit Updater added a comment -

            Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/14093/
            Subject: LU-6368 ldlm: Do not use cbpending for group locks
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 3e6c20afa18a64c5cb949ecf2ed0f49202ba3e15

            paf Patrick Farrell (Inactive) added a comment - edited

            Jinshan - That flag is being set by ldlm_lock_decref_and_cancel, which is called specifically from osc_cancel_base just for group locks, and the only difference from regular ldlm_lock_decref is that CBPENDING is set on the lock in decref_and_cancel.

            Here's the comment on it:
            /**
             * Decrease reader/writer refcount for LDLM lock with handle
             * \a lockh and mark it for subsequent cancellation once r/w refcount
             * drops to zero instead of putting into LRU.
             *
             * Typical usage is for GROUP locks which we cannot allow to be cached.
             */
            void ldlm_lock_decref_and_cancel(struct lustre_handle *lockh, __u32 mode)
            {
                    struct ldlm_lock *lock = __ldlm_handle2lock(lockh, 0);
                    ENTRY;

                    LASSERT(lock != NULL);

                    LDLM_DEBUG(lock, "ldlm_lock_decref(%s)", ldlm_lockname[mode]);
                    lock_res_and_lock(lock);
                    ldlm_set_cbpending(lock);
                    unlock_res_and_lock(lock);
                    ldlm_lock_decref_internal(lock, mode);
                    LDLM_LOCK_PUT(lock);
            }
            EXPORT_SYMBOL(ldlm_lock_decref_and_cancel);


            So it's definitely intentional that CBPENDING be set on group locks. I don't fully understand the life cycle for LDLM and OSC locks, but here's what I see...

            The group lock request (from the ioctl) gets a writer reference on the lock, and this reference is not released until the lock is cancelled by the usual manual cancellation request.

            So I've decided to try removing the usage of CBPENDING for group locks and instead make ldlm_lock_decref_internal group lock aware. It will cancel group locks once they have zero reader & writer references, whether or not they have CBPENDING set.

            I'll put that patch up in gerrit in a moment. (Haven't tested it yet, but will do.)
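            For reference, a rough sketch of the shape of that idea; it paraphrases the approach described above and is not the actual patch (the surrounding ldlm_lock_decref_internal code is elided):

            /* Sketch only, paraphrasing the approach described above (not the
             * landed change): once a lock has no readers or writers left,
             * treat a GROUP lock like a lock with CBPENDING set and cancel it
             * instead of putting it into the LRU. */
            if (!lock->l_readers && !lock->l_writers &&
                (ldlm_is_cbpending(lock) || lock->l_req_mode == LCK_GROUP)) {
                    /* ... existing "cancel rather than cache in LRU" path ... */
            }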


            jay Jinshan Xiong (Inactive) added a comment -

            I agree that it's wrong to give up a group lock, so a group lock should never have had the CBPENDING bit set.

            paf Patrick Farrell (Inactive) added a comment -

            Client dk logs with special debug added. (The debug is part of my strided lock patches and so includes a lot of extra info in other places as well.)

            paf Patrick Farrell (Inactive) added a comment -

            The patch at http://review.whamcloud.com/14093 fixes the LBUG described here by attacking my "first thought" above: It causes search_queue to ignore the CBPENDING flag on existing group locks, and match to those locks anyway.

            I'm still concerned about the other two thoughts - Should the CBPENDING flag be set? Does it have any meaning for group locks? If not, should we be refusing to set it?

            And what about the fact that this lock was granted? It's a robustness concern rather than a fix for this bug, but it seems like we should have some sanity check that prevents granting such a lock. I'll take a stab at writing one in search_queue, but I'm not sure I like what I'm thinking of.
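            As a rough sketch (paraphrased, not the patch text), the match-time change amounts to excluding group locks from the CBPENDING skip in search_queue:

            /* Sketch of the fix described above (paraphrased): CBPENDING no
             * longer disqualifies an existing GROUP lock from matching, since
             * group locks are never called back. */
            if (ldlm_is_cbpending(lock) && lock->l_req_mode != LCK_GROUP &&
                !(flags & LDLM_FL_CBPENDING))
                    continue;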


            gerrit Gerrit Updater added a comment -

            Patrick Farrell (paf@cray.com) uploaded a new patch: http://review.whamcloud.com/14093
            Subject: LU-6368 ldlm: Ignore cbpending for group locks
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 8a9287c2fb29f98d83654c42f7ee3c61015bb243

            paf Patrick Farrell (Inactive) added a comment -

            For reference, here is dump_namespaces from the server shortly after a client crashed with this test (this is not the same run as the rest of the dk logs):
            00010000:00010000:0.0:1426493058.930173:0:2207:0:(ldlm_resource.c:1376:ldlm_resource_dump()) --- Resource: [0xf2:0x0:0x0].0 (ffff880125d5b0c0) refcount = 3
            00010000:00010000:0.0:1426493058.930174:0:2207:0:(ldlm_resource.c:1379:ldlm_resource_dump()) Granted locks (in reverse order):
            00010000:00010000:0.0:1426493058.930176:0:2207:0:(ldlm_resource.c:1382:ldlm_resource_dump()) ### ### ns: filter-centss01-OST0000_UUID lock: ffff880121730b40/0x82dec12a52ea32d1 lrc: 2/0,0 mode: GROUP/GROUP res: [0xf2:0x0:0x0].0 rrc: 3 type: EXT [0->18446744073709551615] (req 4096->8191) flags: 0x40000000000000 nid: 192.168.1.20@tcp remote: 0xcde4e8d126caf1be expref: 13 pid: 2175 timeout: 0 lvb_type: 0
            00010000:00010000:0.0:1426493058.930179:0:2207:0:(ldlm_resource.c:1382:ldlm_resource_dump()) ### ### ns: filter-centss01-OST0000_UUID lock: ffff880121730340/0x82dec12a52ea32ca lrc: 2/0,0 mode: GROUP/GROUP res: [0xf2:0x0:0x0].0 rrc: 3 type: EXT [0->18446744073709551615] (req 0->18446744073709551615) flags: 0x40000000020000 nid: 192.168.1.20@tcp remote: 0xcde4e8d126caf1b7 expref: 13 pid: 2175 timeout: 0 lvb_type: 0


            paf Patrick Farrell (Inactive) added a comment -

            So:

            The first thought I have is that for a group lock, it should still be matchable if callback is pending (on the existing lock), since group locks cannot be called back.

            But that makes me wonder: Is it correct to allow the callback pending flag to be set on a group lock? That seems like it might be wrong as well, for the same reason as above: Group locks cannot be called back, so callback pending has no meaning.

            Finally, and separately, I have not looked into this yet, but it seems clearly wrong that the server should issue two group locks on the same resource to the same client. Not sure how best to handle that, but it definitely shouldn't grant such a lock, unless I've really missed something.

            (I am not quite sure how/why the OSC layer decides to request a group lock for the second lock, rather than a normal PW lock.)


            paf Patrick Farrell (Inactive) added a comment - edited

            A few notes. The file must be > 1 page in size, so > 4 KiB. (It definitely happens with a 100 KiB file; I haven't tested intervening sizes.)

            The crash happens during the write of the second page. (It seems that with sendfile, ll_file_write is called for each page.)

            I'll get some dk logs attached shortly, but here's my initial analysis.

            The core problem is that two separate group locks are issued. From the client logs...
            00010000:00010000:1.0:1426488989.036499:0:1492:0:(ldlm_request.c:712:ldlm_cli_enqueue_fini()) ### client-side enqueue END ns: centss01-OST0000-osc-ffff8801383d5c00 lock: ffff8801395e17c0/0x17d7a5b11d7f9867 lrc: 4/0,1 mode: GROUP/GROUP res: [0xf2:0x0:0x0].0 rrc: 1 type: EXT [0->18446744073709551615] (req 0->18446744073709551615) flags: 0x20000 nid: local remote: 0x82dec12a52ea27a9 expref: -99 pid: 1492 timeout: 0 lvb_type: 1

            And without the first lock ever being released, it gets another group lock:
            00010000:00010000:0.0:1426488989.132324:0:1492:0:(ldlm_request.c:712:ldlm_cli_enqueue_fini()) ### client-side enqueue END ns: centss01-OST0000-osc-ffff8801383d5c00 lock: ffff8801395e19c0/0x17d7a5b11d7f986e lrc: 4/0,1 mode: GROUP/GROUP res: [0xf2:0x0:0x0].0 rrc: 2 type: EXT [0->18446744073709551615] (req 4096->8191) flags: 0x0 nid: local remote: 0x82dec12a52ea27b0 expref: -99 pid: 1492 timeout: 0 lvb_type: 1

            This happens in the sendfile code after it fails to match the first group lock. I've got an instrumented version of search_queue that shows the reason for the failed match:

                            /* llite sometimes wants to match locks that will be
                             * canceled when their users drop, but we allow it to match
                             * if it passes in CBPENDING and the lock still has users.
                             * this is generally only going to be used by children
                             * whose parents already hold a lock so forward progress
                             * can still happen. */
                            if (ldlm_is_cbpending(lock) &&
                                !(flags & LDLM_FL_CBPENDING)) {
                                    reason = 3;
                                    continue;
                            }
            

            So trying to match a requested group lock against the existing one (ns: centss01-OST0000-osc-ffff880139645000 lock: ffff88011286ad80/0x4166f63c6cfea1fd lrc: 4/0,1 mode: GROUP/GROUP res: [0xf2:0x0:0x0].0 rrc: 2 type: EXT [0->18446744073709551615] stride: 0 (req 0->18446744073709551615) flags: 0x20400020000 nid: local remote: 0x82dec12a52ea2ff2 expref: -99 pid: 6805 timeout: 0 lvb_type: 1), it fails with reason 3, as noted above.

            So at a first remove, the cause of this problem is two group locks being issued which cover the same extent. I'll give my further thoughts on this in another comment, and I'll also attach these dk logs. (I can make the dump available as well.)


            People

              Assignee: wc-triage WC Triage
              Reporter: fzago Frank Zago (Inactive)
              Votes: 0
              Watchers: 10
