Uploaded image for project: 'Lustre'
  1. Lustre
  2. LU-4269

ldlm_lock_put()) ASSERTION( (((( lock))->l_flags & (1ULL << 50)) != 0) ) failed

Details

    • Bug
    • Resolution: Fixed
    • Blocker
    • Lustre 2.6.0, Lustre 2.5.1
    • Lustre 2.6.0
    • Hyperion/LLNL
    • 3
    • 11723

    Description

      Running SWL - miranda IO test.

      2013-11-16 17:27:14 LustreError: 105183:0:(ldlm_lock.c:222:ldlm_lock_put()) ASSERTION( (((( lock))->l_flags & (1ULL << 50)) != 0) ) failed:
      2013-11-16 17:27:14 LustreError: 105183:0:(ldlm_lock.c:222:ldlm_lock_put()) LBUG
      2013-11-16 17:27:14 Pid: 105183, comm: miranda_io
      2013-11-16 17:27:14 Nov 16 17:27:14
      2013-11-16 17:27:14 Call Trace:
      2013-11-16 17:27:14 iwc48 kernel: LustreError: 10518 [<ffffffffa056b895>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
      2013-11-16 17:27:14 3:0:(ldlm_lock.c:222:ldlm_lock_p [<ffffffffa056be97>] lbug_with_loc+0x47/0xb0 [libcfs]
      2013-11-16 17:27:14 ut()) ASSERTION( (((( lock))->l_flags & (1ULL << [<ffffffffa080f94d>] ldlm_lock_put+0x44d/0x560 [ptlrpc]
      2013-11-16 17:27:14  50)) != 0) ) failed:
      2013-11-16 17:27:14 Nov 16 17:27:14 iwc48 kernel: LustreError [<ffffffffa0822d22>] ldlm_cli_cancel_list+0xf2/0x3e0 [ptlrpc]
      2013-11-16 17:27:14 : 105183:0:(ldlm_lock.c:222:ldlm_lock_put()) LBUG
      2013-11-16 17:27:14  [<ffffffffa0824952>] ldlm_cli_cancel+0x132/0x360 [ptlrpc]
      2013-11-16 17:27:14  [<ffffffffa09fdbce>] osc_lock_cancel+0xfe/0x1c0 [osc]
      2013-11-16 17:27:14  [<ffffffffa0708525>] cl_lock_cancel0+0x75/0x160 [obdclass]
      2013-11-16 17:27:14  [<ffffffffa07090db>] cl_lock_cancel+0x13b/0x140 [obdclass]
      2013-11-16 17:27:14  [<ffffffffa070d7e4>] cl_lock_enqueue_wait+0xc4/0x2d0 [obdclass]
      2013-11-16 17:27:14  [<ffffffffa070aa17>] ? cl_lock_mutex_put+0x77/0x90 [obdclass]
      2013-11-16 17:27:14  [<ffffffffa0a9126b>] lov_lock_enqueue+0x3fb/0x850 [lov]
      2013-11-16 17:27:14  [<ffffffffa070c65c>] cl_enqueue_try+0xfc/0x300 [obdclass]
      2013-11-16 17:27:14  [<ffffffffa070da5f>] cl_enqueue_locked+0x6f/0x1f0 [obdclass]
      2013-11-16 17:27:14  [<ffffffffa070e6ae>] cl_lock_request+0x7e/0x270 [obdclass]
      2013-11-16 17:27:14  [<ffffffffa071391c>] cl_io_lock+0x3cc/0x560 [obdclass]
      2013-11-16 17:27:14  [<ffffffffa0713b52>] cl_io_loop+0xa2/0x1b0 [obdclass]
      2013-11-16 17:27:14  [<ffffffffa0b0e290>] ll_file_io_generic+0x460/0x610 [lustre]
      2013-11-16 17:27:14  [<ffffffffa0701e39>] ? cl_env_get+0x29/0x350 [obdclass]
      2013-11-16 17:27:14  [<ffffffffa0b0ecb2>] ll_file_aio_write+0x142/0x2c0 [lustre]
      2013-11-16 17:27:14  [<ffffffffa0b0ef9c>] ll_file_write+0x16c/0x2a0 [lustre]
      2013-11-16 17:27:14  [<ffffffff81181398>] vfs_write+0xb8/0x1a0
      2013-11-16 17:27:14  [<ffffffff81181c91>] sys_write+0x51/0x90
      2013-11-16 17:27:14  [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
      2013-11-16 17:27:14
      

      Lctl dump attached

      Attachments

        1. analysis.txt
          334 kB
        2. iwc.console.gz
          29 kB
        3. iwc106.sysrq.txt
          375 kB
        4. iwc106.sysrq.txt
          375 kB
        5. iwc106.sysrq.txt.gz
          29 kB
        6. iwc106.sysrq.txt.gz
          29 kB
        7. iwc48.lctl.log.txt.gz
          0.2 kB
        8. log.bz2
          525 kB
        9. LU-4268-140123-log.tar.gz
          1.75 MB
        10. vmcore-dmesg.txt
          78 kB

        Activity

          [LU-4269] ldlm_lock_put()) ASSERTION( (((( lock))->l_flags & (1ULL << 50)) != 0) ) failed
          bogl Bob Glossman (Inactive) added a comment - backport to b2_5: http://review.whamcloud.com/9346

          Patch landed to Master. Please reopen ticket if more work is needed.

          jlevi Jodi Levi (Inactive) added a comment - Patch landed to Master. Please reopen ticket if more work is needed.
          bobijam Zhenyu Xu added a comment - updated patch http://review.whamcloud.com/#/c/8772/

          The root cause of this issue is pretty clear.

          Bobijam, can you please create a patch for this?

          jay Jinshan Xiong (Inactive) added a comment - The root cause of this issue is pretty clear. Bobijam, can you please create a patch for this?

          Li, Jinshan - Are either of you able to share how you're reproducing this?

          paf Patrick Farrell (Inactive) added a comment - Li, Jinshan - Are either of you able to share how you're reproducing this?

          Ah, yeah, the right lock order is lock->l_lock, res->lr_lock, ns->ns_lock. Thanks for poiting that error out.

          We are able to reproduce the problem steadily too. If you could push a patch, we can check whether it helps. I feel that there are more than one place where lock->l_flags is changed without the protection of lock->l_lock.

          lixi Li Xi (Inactive) added a comment - Ah, yeah, the right lock order is lock->l_lock, res->lr_lock, ns->ns_lock. Thanks for poiting that error out. We are able to reproduce the problem steadily too. If you could push a patch, we can check whether it helps. I feel that there are more than one place where lock->l_flags is changed without the protection of lock->l_lock.

          Hi Li Xi,

          That looks true. Somehow I looked at the code wrong. The dlm lock was set destroyed flag before taking out of LRU, so ldlm_clear_skipped() in ldlm_lock_remove_from_lru_nolock() is indeed the root cause of this problem.

          Your patch will cause deadlock because we should take res lock first and then ns_lock. My idea to fix this problem is to move ldlm_clear_skipped() out of ldlm_lock_remove_from_lru_nolock() into ldlm_lock_add_to_lru_nolock(). I think this will fix the problem as well.

          jay Jinshan Xiong (Inactive) added a comment - Hi Li Xi, That looks true. Somehow I looked at the code wrong. The dlm lock was set destroyed flag before taking out of LRU, so ldlm_clear_skipped() in ldlm_lock_remove_from_lru_nolock() is indeed the root cause of this problem. Your patch will cause deadlock because we should take res lock first and then ns_lock. My idea to fix this problem is to move ldlm_clear_skipped() out of ldlm_lock_remove_from_lru_nolock() into ldlm_lock_add_to_lru_nolock(). I think this will fix the problem as well.

          Hi Jinshan,

          According to your test results, this problem really looks like a race problem. Would you please try following patch to check that the lock is always held when clearing/setting the flag?

          http://review.whamcloud.com/#/c/8772/

          lixi Li Xi (Inactive) added a comment - Hi Jinshan, According to your test results, this problem really looks like a race problem. Would you please try following patch to check that the lock is always held when clearing/setting the flag? http://review.whamcloud.com/#/c/8772/

          The flags should be protected by ldlm_lock::l_lock. When I saw this issue, my first reaction was to check if there are unprotected writing to l_flags and the only place is:

          /**
           * Removes LDLM lock \a lock from LRU. Assumes LRU is already locked.
           */
          int ldlm_lock_remove_from_lru_nolock(struct ldlm_lock *lock)
          {
                  int rc = 0;
                  if (!cfs_list_empty(&lock->l_lru)) {
                          struct ldlm_namespace *ns = ldlm_lock_to_ns(lock);
          
                          LASSERT(lock->l_resource->lr_type != LDLM_FLOCK);
                          cfs_list_del_init(&lock->l_lru);
                          ldlm_clear_skipped(lock);
                          LASSERT(ns->ns_nr_unused > 0);
                          ns->ns_nr_unused--;
                          rc = 1;
                  }
                  return rc;
          }
          

          However, it was protected by ns_lock and if the lock is being destroyed, it must have been taken out of LRU list.

          jay Jinshan Xiong (Inactive) added a comment - The flags should be protected by ldlm_lock::l_lock. When I saw this issue, my first reaction was to check if there are unprotected writing to l_flags and the only place is: /** * Removes LDLM lock \a lock from LRU. Assumes LRU is already locked. */ int ldlm_lock_remove_from_lru_nolock(struct ldlm_lock *lock) { int rc = 0; if (!cfs_list_empty(&lock->l_lru)) { struct ldlm_namespace *ns = ldlm_lock_to_ns(lock); LASSERT(lock->l_resource->lr_type != LDLM_FLOCK); cfs_list_del_init(&lock->l_lru); ldlm_clear_skipped(lock); LASSERT(ns->ns_nr_unused > 0); ns->ns_nr_unused--; rc = 1; } return rc; } However, it was protected by ns_lock and if the lock is being destroyed, it must have been taken out of LRU list.

          Jinshan, according to the comment, the flags should be protected by lr_flags. Are there flag changes not protected by this lock?

          Also, I have some concern that using set_bit() will not set the same bits as the Lustre network protocol on all architectures. Do you know if there is a specific bit ordering used in all cases?

          adilger Andreas Dilger added a comment - Jinshan, according to the comment, the flags should be protected by lr_flags. Are there flag changes not protected by this lock? Also, I have some concern that using set_bit() will not set the same bits as the Lustre network protocol on all architectures. Do you know if there is a specific bit ordering used in all cases?

          I confirmed that changing the dlm flag operation to bitops fixed the problem.

          Check the attachment for the concept patch. The real patch should be worked out by modifying lustre/contrib/bit-masks/lustre_dlm_flags.tpl.

          diff --git a/lustre/include/lustre_dlm.h b/lustre/include/lustre_dlm.h
          index 98e03c9..bf4fb65 100644
          --- a/lustre/include/lustre_dlm.h
          +++ b/lustre/include/lustre_dlm.h
          @@ -772,7 +772,7 @@ struct ldlm_lock {
                   * Lock state flags. Protected by lr_lock.
                   * \see lustre_dlm_flags.h where the bits are defined.
                   */
          -       __u64                   l_flags;
          +       volatile __u64          l_flags;
           
                  /**
                   * Lock r/w usage counters.
          diff --git a/lustre/include/lustre_dlm_flags.h b/lustre/include/lustre_dlm_flags.h
          index 283546d..4be532f 100644
          --- a/lustre/include/lustre_dlm_flags.h
          +++ b/lustre/include/lustre_dlm_flags.h
          @@ -57,91 +57,91 @@
           
           /** extent, mode, or resource changed */
           #define LDLM_FL_LOCK_CHANGED            0x0000000000000001ULL // bit   0
          -#define ldlm_is_lock_changed(_l)        LDLM_TEST_FLAG(( _l), 1ULL <<  0)
          -#define ldlm_set_lock_changed(_l)       LDLM_SET_FLAG((  _l), 1ULL <<  0)
          -#define ldlm_clear_lock_changed(_l)     LDLM_CLEAR_FLAG((_l), 1ULL <<  0)
          +#define ldlm_is_lock_changed(_l)        LDLM_TEST_FLAG(( _l),  0)
          +#define ldlm_set_lock_changed(_l)       LDLM_SET_FLAG((  _l),  0)
          +#define ldlm_clear_lock_changed(_l)     LDLM_CLEAR_FLAG((_l),  0)
          
          ......
          ......
          
           /** test for ldlm_lock flag bit set */
          -#define LDLM_TEST_FLAG(_l, _b)    (((_l)->l_flags & (_b)) != 0)
          +#define LDLM_TEST_FLAG(_l, _b)    test_bit(_b, (volatile unsigned long *)&(_l)->l_flags)
           
           /** multi-bit test: are any of mask bits set? */
           #define LDLM_HAVE_MASK(_l, _m)    (((_l)->l_flags & LDLM_FL_##_m##_MASK) != 0)
           
           /** set a ldlm_lock flag bit */
          -#define LDLM_SET_FLAG(_l, _b)     ((_l)->l_flags |= (_b))
          +#define LDLM_SET_FLAG(_l, _b)     set_bit(_b, (volatile unsigned long *)&(_l)->l_flags)
           
           /** clear a ldlm_lock flag bit */
          -#define LDLM_CLEAR_FLAG(_l, _b)   ((_l)->l_flags &= ~(_b))
          +#define LDLM_CLEAR_FLAG(_l, _b)   clear_bit(_b, (volatile unsigned long *)&(_l)->l_flags)
           
           /** @} subgroup */
           /** @} group */
          
          jay Jinshan Xiong (Inactive) added a comment - I confirmed that changing the dlm flag operation to bitops fixed the problem. Check the attachment for the concept patch. The real patch should be worked out by modifying lustre/contrib/bit-masks/lustre_dlm_flags.tpl. diff --git a/lustre/include/lustre_dlm.h b/lustre/include/lustre_dlm.h index 98e03c9..bf4fb65 100644 --- a/lustre/include/lustre_dlm.h +++ b/lustre/include/lustre_dlm.h @@ -772,7 +772,7 @@ struct ldlm_lock { * Lock state flags. Protected by lr_lock. * \see lustre_dlm_flags.h where the bits are defined. */ - __u64 l_flags; + volatile __u64 l_flags; /** * Lock r/w usage counters. diff --git a/lustre/include/lustre_dlm_flags.h b/lustre/include/lustre_dlm_flags.h index 283546d..4be532f 100644 --- a/lustre/include/lustre_dlm_flags.h +++ b/lustre/include/lustre_dlm_flags.h @@ -57,91 +57,91 @@ /** extent, mode, or resource changed */ #define LDLM_FL_LOCK_CHANGED 0x0000000000000001ULL // bit 0 -#define ldlm_is_lock_changed(_l) LDLM_TEST_FLAG(( _l), 1ULL << 0) -#define ldlm_set_lock_changed(_l) LDLM_SET_FLAG(( _l), 1ULL << 0) -#define ldlm_clear_lock_changed(_l) LDLM_CLEAR_FLAG((_l), 1ULL << 0) +#define ldlm_is_lock_changed(_l) LDLM_TEST_FLAG(( _l), 0) +#define ldlm_set_lock_changed(_l) LDLM_SET_FLAG(( _l), 0) +#define ldlm_clear_lock_changed(_l) LDLM_CLEAR_FLAG((_l), 0) ...... ...... /** test for ldlm_lock flag bit set */ -#define LDLM_TEST_FLAG(_l, _b) (((_l)->l_flags & (_b)) != 0) +#define LDLM_TEST_FLAG(_l, _b) test_bit(_b, ( volatile unsigned long *)&(_l)->l_flags) /** multi-bit test: are any of mask bits set? */ #define LDLM_HAVE_MASK(_l, _m) (((_l)->l_flags & LDLM_FL_##_m##_MASK) != 0) /** set a ldlm_lock flag bit */ -#define LDLM_SET_FLAG(_l, _b) ((_l)->l_flags |= (_b)) +#define LDLM_SET_FLAG(_l, _b) set_bit(_b, ( volatile unsigned long *)&(_l)->l_flags) /** clear a ldlm_lock flag bit */ -#define LDLM_CLEAR_FLAG(_l, _b) ((_l)->l_flags &= ~(_b)) +#define LDLM_CLEAR_FLAG(_l, _b) clear_bit(_b, ( volatile unsigned long *)&(_l)->l_flags) /** @} subgroup */ /** @} group */

          People

            bobijam Zhenyu Xu
            cliffw Cliff White (Inactive)
            Votes:
            0 Vote for this issue
            Watchers:
            11 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved: