Uploaded image for project: 'Lustre'
  1. Lustre
  2. LU-5883

DNE II testing: LustreError: 6618:0:(statahead.c:262:sa_kill()) ASSERTION( !list_empty(&entry->se_list) ) failed

Details

    • Bug
    • Resolution: Cannot Reproduce
    • Major
    • None
    • Lustre 2.7.0
    • None
    • 3
    • 16451

    Description

      While doing a general purpose test of master with DNE II (2 MDSes with 3 MDTs each, 6 total MDTs), we did an ls to check the status of something, and our client LBUGged. We turned up debug on a different client, did the ls again, and it crashed as well:

      2014-11-06T20:08:43.974554-06:00 c1-0c0s0n3 LustreError: 3797:0:(statahead.c:262:sa_kill()) ASSERTION( !list_empty(&entry->se_list) ) failed:
      2014-11-06T20:08:43.974591-06:00 c1-0c0s0n3 LustreError: 3797:0:(statahead.c:262:sa_kill()) LBUG
      2014-11-06T20:08:43.974597-06:00 c1-0c0s0n3 Pid: 3797, comm: ls
      2014-11-06T20:08:43.974604-06:00 c1-0c0s0n3 Call Trace:
      2014-11-06T20:08:43.974617-06:00 c1-0c0s0n3 [<ffffffff81006591>] try_stack_unwind+0x161/0x1a0
      2014-11-06T20:08:43.974622-06:00 c1-0c0s0n3 [<ffffffff81004de9>] dump_trace+0x89/0x440
      2014-11-06T20:08:43.974632-06:00 c1-0c0s0n3 [<ffffffffa0176897>] libcfs_debug_dumpstack+0x57/0x80 [libcfs]
      2014-11-06T20:08:43.974639-06:00 c1-0c0s0n3 [<ffffffffa0176de7>] lbug_with_loc+0x47/0xc0 [libcfs]
      2014-11-06T20:08:44.015207-06:00 c1-0c0s0n3 [<ffffffffa0750512>] sa_put+0x332/0x370 [lustre]
      2014-11-06T20:08:44.015232-06:00 c1-0c0s0n3 [<ffffffffa0752b6c>] do_statahead_enter+0xfdc/0x1cb0 [lustre]
      2014-11-06T20:08:44.015238-06:00 c1-0c0s0n3 [<ffffffffa073a695>] ll_lookup_it+0x7b5/0x1b70 [lustre]
      2014-11-06T20:08:44.015245-06:00 c1-0c0s0n3 [<ffffffffa073bad5>] ll_lookup_nd+0x85/0x570 [lustre]
      2014-11-06T20:08:44.015250-06:00 c1-0c0s0n3 [<ffffffff81148edc>] d_alloc_and_lookup+0x4c/0x80
      2014-11-06T20:08:44.015261-06:00 c1-0c0s0n3 [<ffffffff8114a57e>] do_lookup+0x29e/0x3a0
      2014-11-06T20:08:44.015273-06:00 c1-0c0s0n3 [<ffffffff8114cdb3>] path_lookupat+0xc3/0x5e0
      2014-11-06T20:08:44.015281-06:00 c1-0c0s0n3 [<ffffffff8114d305>] do_path_lookup+0x35/0xd0
      2014-11-06T20:08:44.015287-06:00 c1-0c0s0n3 [<ffffffff8114e043>] user_path_at_empty+0x83/0xb0
      2014-11-06T20:08:44.015303-06:00 c1-0c0s0n3 [<ffffffff8114e081>] user_path_at+0x11/0x20
      2014-11-06T20:08:44.015311-06:00 c1-0c0s0n3 [<ffffffff81142d85>] vfs_fstatat+0x55/0x90
      2014-11-06T20:08:44.054257-06:00 c1-0c0s0n3 [<ffffffff81142e2e>] vfs_lstat+0x1e/0x20
      2014-11-06T20:08:44.054275-06:00 c1-0c0s0n3 [<ffffffff81142e54>] sys_newlstat+0x24/0x50
      2014-11-06T20:08:44.054281-06:00 c1-0c0s0n3 [<ffffffff8133776b>] system_call_fastpath+0x16/0x1b
      2014-11-06T20:08:44.054293-06:00 c1-0c0s0n3 [<00007ffff78b9515>] 0x7ffff78b9515

      I'll make the client dump available in a few minutes.

      Attachments

        Activity

          [LU-5883] DNE II testing: LustreError: 6618:0:(statahead.c:262:sa_kill()) ASSERTION( !list_empty(&entry->se_list) ) failed
          laisiyao Lai Siyao added a comment - - edited

          Patch updated, if test is okay, I'll mark this duplicate.

          laisiyao Lai Siyao added a comment - - edited Patch updated, if test is okay, I'll mark this duplicate.
          laisiyao Lai Siyao added a comment - - edited

          this is a known issue: revalidate_statahead_dentry() may fail to wait for a statahead entry to become ready, in this case it should not release this entry. originally this was found in 2.4 statahead patches, and I'm waiting for their response. I'll update the last patch of LU-3270: http://review.whamcloud.com/#/c/9667/, so if you want to test statahead, it's better to apply this one and test.

          laisiyao Lai Siyao added a comment - - edited this is a known issue: revalidate_statahead_dentry() may fail to wait for a statahead entry to become ready, in this case it should not release this entry. originally this was found in 2.4 statahead patches, and I'm waiting for their response. I'll update the last patch of LU-3270 : http://review.whamcloud.com/#/c/9667/ , so if you want to test statahead, it's better to apply this one and test.
          di.wang Di Wang added a comment -

          From the debug log it seems not related with DNE, hmm, I saw

              LU-3270 statahead: statahead thread wait for RPCs to finish
              
              Statahead thread should wait for inflight stat RPCs to finish in
              case statahead RPC callback may access data allocated in statahead
              thread context.
              
              ll_sa_entry_fini() should keep old entry if stat RPC is not
              finished yet.
              
              Simplify sai refcounting:
              * newly allocated sai will hold one refcount, and it will put it
                after starting statahead thread.
              * statahead thread holds one refcount.
              * agl thread holds one refcount.
              * stat process calls do_statahead_enter() which will try to get
                sai, and if it's valid, it will revalidate from statahead cache,
                and put refcount after use.
              
              Signed-off-by: Lai Siyao <lai.siyao@intel.com>
              Change-Id: I55a4fe66a5f6c04595d3bc84f0cd3750f20e0ee4
              Reviewed-on: http://review.whamcloud.com/9663
              Tested-by: Jenkins
              Tested-by: Maloo <hpdd-maloo@intel.com>
              Reviewed-by: Fan Yong <fan.yong@intel.com>
              Reviewed-by: James Simmons <uja.ornl@gmail.com>
              Reviewed-by: Oleg Drokin <oleg.drokin@intel.com>
          

          Just changed these area in 2.6. Lai, could you please have a look. Thanks.

          di.wang Di Wang added a comment - From the debug log it seems not related with DNE, hmm, I saw LU-3270 statahead: statahead thread wait for RPCs to finish Statahead thread should wait for inflight stat RPCs to finish in case statahead RPC callback may access data allocated in statahead thread context. ll_sa_entry_fini() should keep old entry if stat RPC is not finished yet. Simplify sai refcounting: * newly allocated sai will hold one refcount, and it will put it after starting statahead thread. * statahead thread holds one refcount. * agl thread holds one refcount. * stat process calls do_statahead_enter() which will try to get sai, and if it's valid, it will revalidate from statahead cache, and put refcount after use. Signed-off-by: Lai Siyao <lai.siyao@intel.com> Change-Id: I55a4fe66a5f6c04595d3bc84f0cd3750f20e0ee4 Reviewed-on: http://review.whamcloud.com/9663 Tested-by: Jenkins Tested-by: Maloo <hpdd-maloo@intel.com> Reviewed-by: Fan Yong <fan.yong@intel.com> Reviewed-by: James Simmons <uja.ornl@gmail.com> Reviewed-by: Oleg Drokin <oleg.drokin@intel.com> Just changed these area in 2.6. Lai, could you please have a look. Thanks.

          Note that, weirdly, dirdata was off on MDT0. This system was used to troubleshoot LU-5626 and we forgot to reenable dirdata. We'll reenable it before the next run.

          paf Patrick Farrell (Inactive) added a comment - Note that, weirdly, dirdata was off on MDT0. This system was used to troubleshoot LU-5626 and we forgot to reenable dirdata. We'll reenable it before the next run.

          Full Lustre log extracted from the client is here:
          ftp.whamcloud.com/uploads/LU-5883/c1-0c0s1n2.lustre.log

          paf Patrick Farrell (Inactive) added a comment - Full Lustre log extracted from the client is here: ftp.whamcloud.com/uploads/ LU-5883 /c1-0c0s1n2.lustre.log

          Dump and associated kos (and system logs for that node) going up here:
          ftp.whamcloud.com/uploads/LU-5883/c1-0c0s1n2-20141107.tar.gz

          The node had full debugging enabled when it was crashed.

          Confirmed from the engineer testing it - The ls was of the root of the file system. There should not have been any remote directories in the root of the file system, so this may have nothing to do with DNE II.

          paf Patrick Farrell (Inactive) added a comment - - edited Dump and associated kos (and system logs for that node) going up here: ftp.whamcloud.com/uploads/ LU-5883 /c1-0c0s1n2-20141107.tar.gz The node had full debugging enabled when it was crashed. Confirmed from the engineer testing it - The ls was of the root of the file system. There should not have been any remote directories in the root of the file system, so this may have nothing to do with DNE II.

          People

            laisiyao Lai Siyao
            paf Patrick Farrell (Inactive)
            Votes:
            0 Vote for this issue
            Watchers:
            5 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved: