Uploaded image for project: 'Lustre'
  1. Lustre
  2. LU-4269

ldlm_lock_put()) ASSERTION( (((( lock))->l_flags & (1ULL << 50)) != 0) ) failed

Details

    • Bug
    • Resolution: Fixed
    • Blocker
    • Lustre 2.6.0, Lustre 2.5.1
    • Lustre 2.6.0
    • Hyperion/LLNL
    • 3
    • 11723

    Description

      Running SWL - miranda IO test.

      2013-11-16 17:27:14 LustreError: 105183:0:(ldlm_lock.c:222:ldlm_lock_put()) ASSERTION( (((( lock))->l_flags & (1ULL << 50)) != 0) ) failed:
      2013-11-16 17:27:14 LustreError: 105183:0:(ldlm_lock.c:222:ldlm_lock_put()) LBUG
      2013-11-16 17:27:14 Pid: 105183, comm: miranda_io
      2013-11-16 17:27:14 Nov 16 17:27:14
      2013-11-16 17:27:14 Call Trace:
      2013-11-16 17:27:14 iwc48 kernel: LustreError: 10518 [<ffffffffa056b895>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
      2013-11-16 17:27:14 3:0:(ldlm_lock.c:222:ldlm_lock_p [<ffffffffa056be97>] lbug_with_loc+0x47/0xb0 [libcfs]
      2013-11-16 17:27:14 ut()) ASSERTION( (((( lock))->l_flags & (1ULL << [<ffffffffa080f94d>] ldlm_lock_put+0x44d/0x560 [ptlrpc]
      2013-11-16 17:27:14  50)) != 0) ) failed:
      2013-11-16 17:27:14 Nov 16 17:27:14 iwc48 kernel: LustreError [<ffffffffa0822d22>] ldlm_cli_cancel_list+0xf2/0x3e0 [ptlrpc]
      2013-11-16 17:27:14 : 105183:0:(ldlm_lock.c:222:ldlm_lock_put()) LBUG
      2013-11-16 17:27:14  [<ffffffffa0824952>] ldlm_cli_cancel+0x132/0x360 [ptlrpc]
      2013-11-16 17:27:14  [<ffffffffa09fdbce>] osc_lock_cancel+0xfe/0x1c0 [osc]
      2013-11-16 17:27:14  [<ffffffffa0708525>] cl_lock_cancel0+0x75/0x160 [obdclass]
      2013-11-16 17:27:14  [<ffffffffa07090db>] cl_lock_cancel+0x13b/0x140 [obdclass]
      2013-11-16 17:27:14  [<ffffffffa070d7e4>] cl_lock_enqueue_wait+0xc4/0x2d0 [obdclass]
      2013-11-16 17:27:14  [<ffffffffa070aa17>] ? cl_lock_mutex_put+0x77/0x90 [obdclass]
      2013-11-16 17:27:14  [<ffffffffa0a9126b>] lov_lock_enqueue+0x3fb/0x850 [lov]
      2013-11-16 17:27:14  [<ffffffffa070c65c>] cl_enqueue_try+0xfc/0x300 [obdclass]
      2013-11-16 17:27:14  [<ffffffffa070da5f>] cl_enqueue_locked+0x6f/0x1f0 [obdclass]
      2013-11-16 17:27:14  [<ffffffffa070e6ae>] cl_lock_request+0x7e/0x270 [obdclass]
      2013-11-16 17:27:14  [<ffffffffa071391c>] cl_io_lock+0x3cc/0x560 [obdclass]
      2013-11-16 17:27:14  [<ffffffffa0713b52>] cl_io_loop+0xa2/0x1b0 [obdclass]
      2013-11-16 17:27:14  [<ffffffffa0b0e290>] ll_file_io_generic+0x460/0x610 [lustre]
      2013-11-16 17:27:14  [<ffffffffa0701e39>] ? cl_env_get+0x29/0x350 [obdclass]
      2013-11-16 17:27:14  [<ffffffffa0b0ecb2>] ll_file_aio_write+0x142/0x2c0 [lustre]
      2013-11-16 17:27:14  [<ffffffffa0b0ef9c>] ll_file_write+0x16c/0x2a0 [lustre]
      2013-11-16 17:27:14  [<ffffffff81181398>] vfs_write+0xb8/0x1a0
      2013-11-16 17:27:14  [<ffffffff81181c91>] sys_write+0x51/0x90
      2013-11-16 17:27:14  [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
      2013-11-16 17:27:14
      

      Lctl dump attached

      Attachments

        1. analysis.txt
          334 kB
        2. iwc.console.gz
          29 kB
        3. iwc106.sysrq.txt
          375 kB
        4. iwc106.sysrq.txt
          375 kB
        5. iwc106.sysrq.txt.gz
          29 kB
        6. iwc106.sysrq.txt.gz
          29 kB
        7. iwc48.lctl.log.txt.gz
          0.2 kB
        8. log.bz2
          525 kB
        9. LU-4268-140123-log.tar.gz
          1.75 MB
        10. vmcore-dmesg.txt
          78 kB

        Activity

          [LU-4269] ldlm_lock_put()) ASSERTION( (((( lock))->l_flags & (1ULL << 50)) != 0) ) failed
          bogl Bob Glossman (Inactive) added a comment - backport to b2_5: http://review.whamcloud.com/9346

          Patch landed to Master. Please reopen ticket if more work is needed.

          jlevi Jodi Levi (Inactive) added a comment - Patch landed to Master. Please reopen ticket if more work is needed.
          bobijam Zhenyu Xu added a comment - updated patch http://review.whamcloud.com/#/c/8772/

          The root cause of this issue is pretty clear.

          Bobijam, can you please create a patch for this?

          jay Jinshan Xiong (Inactive) added a comment - The root cause of this issue is pretty clear. Bobijam, can you please create a patch for this?

          Li, Jinshan - Are either of you able to share how you're reproducing this?

          paf Patrick Farrell (Inactive) added a comment - Li, Jinshan - Are either of you able to share how you're reproducing this?

          Ah, yeah, the right lock order is lock->l_lock, res->lr_lock, ns->ns_lock. Thanks for poiting that error out.

          We are able to reproduce the problem steadily too. If you could push a patch, we can check whether it helps. I feel that there are more than one place where lock->l_flags is changed without the protection of lock->l_lock.

          lixi Li Xi (Inactive) added a comment - Ah, yeah, the right lock order is lock->l_lock, res->lr_lock, ns->ns_lock. Thanks for poiting that error out. We are able to reproduce the problem steadily too. If you could push a patch, we can check whether it helps. I feel that there are more than one place where lock->l_flags is changed without the protection of lock->l_lock.

          Hi Li Xi,

          That looks true. Somehow I looked at the code wrong. The dlm lock was set destroyed flag before taking out of LRU, so ldlm_clear_skipped() in ldlm_lock_remove_from_lru_nolock() is indeed the root cause of this problem.

          Your patch will cause deadlock because we should take res lock first and then ns_lock. My idea to fix this problem is to move ldlm_clear_skipped() out of ldlm_lock_remove_from_lru_nolock() into ldlm_lock_add_to_lru_nolock(). I think this will fix the problem as well.

          jay Jinshan Xiong (Inactive) added a comment - Hi Li Xi, That looks true. Somehow I looked at the code wrong. The dlm lock was set destroyed flag before taking out of LRU, so ldlm_clear_skipped() in ldlm_lock_remove_from_lru_nolock() is indeed the root cause of this problem. Your patch will cause deadlock because we should take res lock first and then ns_lock. My idea to fix this problem is to move ldlm_clear_skipped() out of ldlm_lock_remove_from_lru_nolock() into ldlm_lock_add_to_lru_nolock(). I think this will fix the problem as well.

          Hi Jinshan,

          According to your test results, this problem really looks like a race problem. Would you please try following patch to check that the lock is always held when clearing/setting the flag?

          http://review.whamcloud.com/#/c/8772/

          lixi Li Xi (Inactive) added a comment - Hi Jinshan, According to your test results, this problem really looks like a race problem. Would you please try following patch to check that the lock is always held when clearing/setting the flag? http://review.whamcloud.com/#/c/8772/

          The flags should be protected by ldlm_lock::l_lock. When I saw this issue, my first reaction was to check if there are unprotected writing to l_flags and the only place is:

          /**
           * Removes LDLM lock \a lock from LRU. Assumes LRU is already locked.
           */
          int ldlm_lock_remove_from_lru_nolock(struct ldlm_lock *lock)
          {
                  int rc = 0;
                  if (!cfs_list_empty(&lock->l_lru)) {
                          struct ldlm_namespace *ns = ldlm_lock_to_ns(lock);
          
                          LASSERT(lock->l_resource->lr_type != LDLM_FLOCK);
                          cfs_list_del_init(&lock->l_lru);
                          ldlm_clear_skipped(lock);
                          LASSERT(ns->ns_nr_unused > 0);
                          ns->ns_nr_unused--;
                          rc = 1;
                  }
                  return rc;
          }
          

          However, it was protected by ns_lock and if the lock is being destroyed, it must have been taken out of LRU list.

          jay Jinshan Xiong (Inactive) added a comment - The flags should be protected by ldlm_lock::l_lock. When I saw this issue, my first reaction was to check if there are unprotected writing to l_flags and the only place is: /** * Removes LDLM lock \a lock from LRU. Assumes LRU is already locked. */ int ldlm_lock_remove_from_lru_nolock(struct ldlm_lock *lock) { int rc = 0; if (!cfs_list_empty(&lock->l_lru)) { struct ldlm_namespace *ns = ldlm_lock_to_ns(lock); LASSERT(lock->l_resource->lr_type != LDLM_FLOCK); cfs_list_del_init(&lock->l_lru); ldlm_clear_skipped(lock); LASSERT(ns->ns_nr_unused > 0); ns->ns_nr_unused--; rc = 1; } return rc; } However, it was protected by ns_lock and if the lock is being destroyed, it must have been taken out of LRU list.

          Jinshan, according to the comment, the flags should be protected by lr_flags. Are there flag changes not protected by this lock?

          Also, I have some concern that using set_bit() will not set the same bits as the Lustre network protocol on all architectures. Do you know if there is a specific bit ordering used in all cases?

          adilger Andreas Dilger added a comment - Jinshan, according to the comment, the flags should be protected by lr_flags. Are there flag changes not protected by this lock? Also, I have some concern that using set_bit() will not set the same bits as the Lustre network protocol on all architectures. Do you know if there is a specific bit ordering used in all cases?

          People

            bobijam Zhenyu Xu
            cliffw Cliff White (Inactive)
            Votes:
            0 Vote for this issue
            Watchers:
            11 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved: