  1. Lustre
  2. LU-4269

ldlm_lock_put()) ASSERTION( (((( lock))->l_flags & (1ULL << 50)) != 0) ) failed

Details

    • Bug
    • Resolution: Fixed
    • Blocker
    • Lustre 2.6.0, Lustre 2.5.1
    • Lustre 2.6.0
    • Hyperion/LLNL
    • 3
    • 11723

    Description

      Running SWL - miranda IO test.

      2013-11-16 17:27:14 LustreError: 105183:0:(ldlm_lock.c:222:ldlm_lock_put()) ASSERTION( (((( lock))->l_flags & (1ULL << 50)) != 0) ) failed:
      2013-11-16 17:27:14 LustreError: 105183:0:(ldlm_lock.c:222:ldlm_lock_put()) LBUG
      2013-11-16 17:27:14 Pid: 105183, comm: miranda_io
      2013-11-16 17:27:14 Nov 16 17:27:14 iwc48 kernel: LustreError: 105183:0:(ldlm_lock.c:222:ldlm_lock_put()) ASSERTION( (((( lock))->l_flags & (1ULL << 50)) != 0) ) failed:
      2013-11-16 17:27:14 Nov 16 17:27:14 iwc48 kernel: LustreError: 105183:0:(ldlm_lock.c:222:ldlm_lock_put()) LBUG
      2013-11-16 17:27:14 Call Trace:
      2013-11-16 17:27:14  [<ffffffffa056b895>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
      2013-11-16 17:27:14  [<ffffffffa056be97>] lbug_with_loc+0x47/0xb0 [libcfs]
      2013-11-16 17:27:14  [<ffffffffa080f94d>] ldlm_lock_put+0x44d/0x560 [ptlrpc]
      2013-11-16 17:27:14  [<ffffffffa0822d22>] ldlm_cli_cancel_list+0xf2/0x3e0 [ptlrpc]
      2013-11-16 17:27:14  [<ffffffffa0824952>] ldlm_cli_cancel+0x132/0x360 [ptlrpc]
      2013-11-16 17:27:14  [<ffffffffa09fdbce>] osc_lock_cancel+0xfe/0x1c0 [osc]
      2013-11-16 17:27:14  [<ffffffffa0708525>] cl_lock_cancel0+0x75/0x160 [obdclass]
      2013-11-16 17:27:14  [<ffffffffa07090db>] cl_lock_cancel+0x13b/0x140 [obdclass]
      2013-11-16 17:27:14  [<ffffffffa070d7e4>] cl_lock_enqueue_wait+0xc4/0x2d0 [obdclass]
      2013-11-16 17:27:14  [<ffffffffa070aa17>] ? cl_lock_mutex_put+0x77/0x90 [obdclass]
      2013-11-16 17:27:14  [<ffffffffa0a9126b>] lov_lock_enqueue+0x3fb/0x850 [lov]
      2013-11-16 17:27:14  [<ffffffffa070c65c>] cl_enqueue_try+0xfc/0x300 [obdclass]
      2013-11-16 17:27:14  [<ffffffffa070da5f>] cl_enqueue_locked+0x6f/0x1f0 [obdclass]
      2013-11-16 17:27:14  [<ffffffffa070e6ae>] cl_lock_request+0x7e/0x270 [obdclass]
      2013-11-16 17:27:14  [<ffffffffa071391c>] cl_io_lock+0x3cc/0x560 [obdclass]
      2013-11-16 17:27:14  [<ffffffffa0713b52>] cl_io_loop+0xa2/0x1b0 [obdclass]
      2013-11-16 17:27:14  [<ffffffffa0b0e290>] ll_file_io_generic+0x460/0x610 [lustre]
      2013-11-16 17:27:14  [<ffffffffa0701e39>] ? cl_env_get+0x29/0x350 [obdclass]
      2013-11-16 17:27:14  [<ffffffffa0b0ecb2>] ll_file_aio_write+0x142/0x2c0 [lustre]
      2013-11-16 17:27:14  [<ffffffffa0b0ef9c>] ll_file_write+0x16c/0x2a0 [lustre]
      2013-11-16 17:27:14  [<ffffffff81181398>] vfs_write+0xb8/0x1a0
      2013-11-16 17:27:14  [<ffffffff81181c91>] sys_write+0x51/0x90
      2013-11-16 17:27:14  [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
      2013-11-16 17:27:14
      

      Lctl dump attached

      Attachments

        1. iwc48.lctl.log.txt.gz
          0.2 kB
        2. vmcore-dmesg.txt
          78 kB
        3. iwc106.sysrq.txt
          375 kB
        4. iwc106.sysrq.txt
          375 kB
        5. iwc106.sysrq.txt.gz
          29 kB
        6. iwc106.sysrq.txt.gz
          29 kB
        7. iwc.console.gz
          29 kB
        8. analysis.txt
          334 kB
        9. LU-4268-140123-log.tar.gz
          1.75 MB
        10. log.bz2
          525 kB

        Activity

          bogl Bob Glossman (Inactive) added a comment - backport to b2_5: http://review.whamcloud.com/9346

          jlevi Jodi Levi (Inactive) added a comment - Patch landed to Master. Please reopen ticket if more work is needed.
          bobijam Zhenyu Xu added a comment - updated patch http://review.whamcloud.com/#/c/8772/

          jay Jinshan Xiong (Inactive) added a comment - The root cause of this issue is pretty clear. Bobijam, can you please create a patch for this?

          paf Patrick Farrell (Inactive) added a comment - Li, Jinshan - Are either of you able to share how you're reproducing this?

          lixi Li Xi (Inactive) added a comment - Ah, yeah, the right lock order is lock->l_lock, res->lr_lock, ns->ns_lock. Thanks for pointing that error out.

          We are able to reproduce the problem reliably too. If you could push a patch, we can check whether it helps. I feel that there is more than one place where lock->l_flags is changed without the protection of lock->l_lock.
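The lock order agreed on in this comment (lock->l_lock, then res->lr_lock, then ns->ns_lock) can be expressed as a rank check. The following is only a minimal sketch, not Lustre code: the enum, the rank values, and the function name are hypothetical and exist purely to illustrate why taking ns_lock before the res lock is rejected.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical ranks for the three lock classes named in the comment;
 * a lock with a lower rank must be taken before one with a higher rank. */
enum lock_rank {
        RANK_L_LOCK  = 0,   /* lock->l_lock  */
        RANK_LR_LOCK = 1,   /* res->lr_lock  */
        RANK_NS_LOCK = 2    /* ns->ns_lock   */
};

/* Returns 1 if every lock in the acquisition sequence has a strictly
 * higher rank than the lock taken just before it, 0 otherwise. */
int acquisition_order_ok(const enum lock_rank *seq, size_t n)
{
        assert(seq != NULL || n == 0);

        for (size_t i = 1; i < n; i++) {
                if (seq[i] <= seq[i - 1])
                        return 0;
        }
        return 1;
}
```

A sequence of l_lock, lr_lock, ns_lock passes this check, while ns_lock followed by lr_lock fails it. Two threads that take the same pair of locks in opposite orders can each end up waiting on the other, which is the deadlock concern raised in the thread.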

          jay Jinshan Xiong (Inactive) added a comment - Hi Li Xi,

          That looks true. Somehow I read the code wrong. The destroyed flag was set on the DLM lock before the lock was taken out of the LRU, so ldlm_clear_skipped() in ldlm_lock_remove_from_lru_nolock() is indeed the root cause of this problem.

          Your patch will cause a deadlock because we should take the res lock first and then ns_lock. My idea for fixing this problem is to move ldlm_clear_skipped() out of ldlm_lock_remove_from_lru_nolock() and into ldlm_lock_add_to_lru_nolock(). I think this will fix the problem as well.
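The idea described in this comment can be sketched with simplified stand-in types. This is not the landed patch and not the real Lustre structures: struct lock_sk, struct ns_sk, the function names, and the bit position chosen for the skipped flag are all assumptions made for illustration. The point is only that LRU removal stops writing to the flags word, and the skipped bit is cleared when the lock is added to the LRU instead.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical bit position; only its role as "the skipped flag" matters. */
#define SK_FL_SKIPPED (1ULL << 45)

/* Simplified stand-ins for the DLM lock and its namespace. */
struct lock_sk { uint64_t flags; int on_lru; };
struct ns_sk   { int nr_unused; };

/* Removal only updates LRU bookkeeping; it no longer touches lk->flags.
 * Returns 1 if the lock was on the LRU, 0 otherwise. */
int lru_remove_sk(struct ns_sk *ns, struct lock_sk *lk)
{
        if (!lk->on_lru)
                return 0;

        assert(ns->nr_unused > 0);
        lk->on_lru = 0;
        ns->nr_unused--;
        return 1;
}

/* Any stale skipped bit is cleared when the lock goes onto the LRU. */
void lru_add_sk(struct ns_sk *ns, struct lock_sk *lk)
{
        assert(!lk->on_lru);
        lk->flags &= ~SK_FL_SKIPPED;
        lk->on_lru = 1;
        ns->nr_unused++;
}
```

With this arrangement a lock that is being destroyed and pulled off the LRU has its flags word written by the destroy path only, so the removal path cannot overwrite a concurrently set destroyed bit.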

          lixi Li Xi (Inactive) added a comment - Hi Jinshan,

          According to your test results, this really looks like a race. Would you please try the following patch to check that the lock is always held when clearing/setting the flag?

          http://review.whamcloud.com/#/c/8772/

          jay Jinshan Xiong (Inactive) added a comment - The flags should be protected by ldlm_lock::l_lock. When I saw this issue, my first reaction was to check whether there are any unprotected writes to l_flags, and the only place is:

          /**
           * Removes LDLM lock \a lock from LRU. Assumes LRU is already locked.
           */
          int ldlm_lock_remove_from_lru_nolock(struct ldlm_lock *lock)
          {
                  int rc = 0;
                  if (!cfs_list_empty(&lock->l_lru)) {
                          struct ldlm_namespace *ns = ldlm_lock_to_ns(lock);

                          LASSERT(lock->l_resource->lr_type != LDLM_FLOCK);
                          cfs_list_del_init(&lock->l_lru);
                          ldlm_clear_skipped(lock);
                          LASSERT(ns->ns_nr_unused > 0);
                          ns->ns_nr_unused--;
                          rc = 1;
                  }
                  return rc;
          }

          However, it was protected by ns_lock, and if the lock is being destroyed, it must already have been taken out of the LRU list.
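Why a write to l_flags under ns_lock alone is a problem can be shown by replaying one bad interleaving by hand. This is a deterministic, single-threaded sketch under stated assumptions: the macro and function names are hypothetical, the skipped bit position is arbitrary, and which two code paths actually raced on the test cluster is assumed rather than taken from the logs. Bit 50 is the bit tested by the failed assertion.

```c
#include <stdint.h>

#define FL_DESTROYED (1ULL << 50)   /* bit checked by the failed assertion */
#define FL_SKIPPED   (1ULL << 45)   /* hypothetical position */

/* Replays one interleaving of two non-atomic read-modify-write updates on
 * the same flags word, made under two different locks, and returns the
 * final value of the word. */
uint64_t replay_lost_update(uint64_t flags)
{
        /* CPU A (LRU removal, holds only ns_lock): loads the word as the
         * first half of "flags &= ~FL_SKIPPED". */
        uint64_t a_tmp = flags;

        /* CPU B (destroy path, holds a different lock): completes
         * "flags |= FL_DESTROYED". */
        flags |= FL_DESTROYED;

        /* CPU A: modifies its stale copy and stores it back, overwriting
         * the bit that CPU B just set. */
        a_tmp &= ~FL_SKIPPED;
        flags = a_tmp;

        return flags;
}

/* The same two updates, serialized as they would be under one common lock. */
uint64_t replay_serialized(uint64_t flags)
{
        flags &= ~FL_SKIPPED;
        flags |= FL_DESTROYED;
        return flags;
}
```

In the first replay the destroyed bit is missing from the result, which matches the shape of the assertion failure in ldlm_lock_put(). In the serialized replay the bit survives.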

          adilger Andreas Dilger added a comment - Jinshan, according to the comment, the flags should be protected by lr_lock. Are there flag changes not protected by this lock?

          Also, I have some concern that using set_bit() will not set the same bits as the Lustre network protocol on all architectures. Do you know if there is a specific bit ordering used in all cases?
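The bit-ordering concern can be illustrated with a small emulation. Linux set_bit() addresses bits in an array of unsigned long, so on a machine where unsigned long is 32 bits wide, bit number nr lands in word nr / 32. The sketch below is not kernel code: emulate_set_bit32() is a hypothetical helper that views a 64-bit flags value as two 32-bit words in either byte order and applies that addressing rule.

```c
#include <assert.h>
#include <stdint.h>

/* Emulates set_bit()-style addressing on a 64-bit value viewed as an array
 * of two 32-bit words. big_endian selects whether the word at the lowest
 * address (words[0]) holds the high or the low half of the value. */
uint64_t emulate_set_bit32(uint64_t flags, unsigned nr, int big_endian)
{
        uint32_t hi = (uint32_t)(flags >> 32);
        uint32_t lo = (uint32_t)flags;
        uint32_t words[2];

        assert(nr < 64);

        /* Lay the value out in memory order. */
        words[0] = big_endian ? hi : lo;
        words[1] = big_endian ? lo : hi;

        /* Word index and bit-within-word, as with a 32-bit unsigned long. */
        words[nr / 32] |= (uint32_t)1 << (nr % 32);

        /* Reassemble the 64-bit value from the memory layout. */
        hi = big_endian ? words[0] : words[1];
        lo = big_endian ? words[1] : words[0];
        return ((uint64_t)hi << 32) | lo;
}
```

Under this model, setting bit 50 on the little-endian layout yields 1ULL << 50, the same value as the mask in the assertion, while the big-endian 32-bit layout yields 1ULL << 18. On 64-bit machines the whole value is a single word and the two forms agree, so the mismatch is specific to 32-bit big-endian targets.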

          People

            bobijam Zhenyu Xu
            cliffw Cliff White (Inactive)