Uploaded image for project: 'Lustre'
  1. Lustre
  2. LU-5883

DNE II testing: LustreError: 6618:0:(statahead.c:262:sa_kill()) ASSERTION( !list_empty(&entry->se_list) ) failed

Details

    • Bug
    • Resolution: Cannot Reproduce
    • Major
    • None
    • Lustre 2.7.0
    • None
    • 3
    • 16451

    Description

      While doing a general purpose test of master with DNE II (2 MDSes with 3 MDTs each, 6 total MDTs), we did an ls to check the status of something, and our client LBUGged. We turned up debug on a different client, did the ls again, and it crashed as well:

      2014-11-06T20:08:43.974554-06:00 c1-0c0s0n3 LustreError: 3797:0:(statahead.c:262:sa_kill()) ASSERTION( !list_empty(&entry->se_list) ) failed:
      2014-11-06T20:08:43.974591-06:00 c1-0c0s0n3 LustreError: 3797:0:(statahead.c:262:sa_kill()) LBUG
      2014-11-06T20:08:43.974597-06:00 c1-0c0s0n3 Pid: 3797, comm: ls
      2014-11-06T20:08:43.974604-06:00 c1-0c0s0n3 Call Trace:
      2014-11-06T20:08:43.974617-06:00 c1-0c0s0n3 [<ffffffff81006591>] try_stack_unwind+0x161/0x1a0
      2014-11-06T20:08:43.974622-06:00 c1-0c0s0n3 [<ffffffff81004de9>] dump_trace+0x89/0x440
      2014-11-06T20:08:43.974632-06:00 c1-0c0s0n3 [<ffffffffa0176897>] libcfs_debug_dumpstack+0x57/0x80 [libcfs]
      2014-11-06T20:08:43.974639-06:00 c1-0c0s0n3 [<ffffffffa0176de7>] lbug_with_loc+0x47/0xc0 [libcfs]
      2014-11-06T20:08:44.015207-06:00 c1-0c0s0n3 [<ffffffffa0750512>] sa_put+0x332/0x370 [lustre]
      2014-11-06T20:08:44.015232-06:00 c1-0c0s0n3 [<ffffffffa0752b6c>] do_statahead_enter+0xfdc/0x1cb0 [lustre]
      2014-11-06T20:08:44.015238-06:00 c1-0c0s0n3 [<ffffffffa073a695>] ll_lookup_it+0x7b5/0x1b70 [lustre]
      2014-11-06T20:08:44.015245-06:00 c1-0c0s0n3 [<ffffffffa073bad5>] ll_lookup_nd+0x85/0x570 [lustre]
      2014-11-06T20:08:44.015250-06:00 c1-0c0s0n3 [<ffffffff81148edc>] d_alloc_and_lookup+0x4c/0x80
      2014-11-06T20:08:44.015261-06:00 c1-0c0s0n3 [<ffffffff8114a57e>] do_lookup+0x29e/0x3a0
      2014-11-06T20:08:44.015273-06:00 c1-0c0s0n3 [<ffffffff8114cdb3>] path_lookupat+0xc3/0x5e0
      2014-11-06T20:08:44.015281-06:00 c1-0c0s0n3 [<ffffffff8114d305>] do_path_lookup+0x35/0xd0
      2014-11-06T20:08:44.015287-06:00 c1-0c0s0n3 [<ffffffff8114e043>] user_path_at_empty+0x83/0xb0
      2014-11-06T20:08:44.015303-06:00 c1-0c0s0n3 [<ffffffff8114e081>] user_path_at+0x11/0x20
      2014-11-06T20:08:44.015311-06:00 c1-0c0s0n3 [<ffffffff81142d85>] vfs_fstatat+0x55/0x90
      2014-11-06T20:08:44.054257-06:00 c1-0c0s0n3 [<ffffffff81142e2e>] vfs_lstat+0x1e/0x20
      2014-11-06T20:08:44.054275-06:00 c1-0c0s0n3 [<ffffffff81142e54>] sys_newlstat+0x24/0x50
      2014-11-06T20:08:44.054281-06:00 c1-0c0s0n3 [<ffffffff8133776b>] system_call_fastpath+0x16/0x1b
      2014-11-06T20:08:44.054293-06:00 c1-0c0s0n3 [<00007ffff78b9515>] 0x7ffff78b9515

      I'll make the client dump available in a few minutes.

      Attachments

        Activity

          [LU-5883] DNE II testing: LustreError: 6618:0:(statahead.c:262:sa_kill()) ASSERTION( !list_empty(&entry->se_list) ) failed
          azenk Andrew Zenk added a comment -

          I see that the patch seems to have been applied, though we seem to be having a similar problem. We're using 2.6.93 on the clients and 2.6.92 on the servers. The problem can be reproduced reliably on our end, so I'd be happy to provide additional logs/diagnostic information (Is there a standard procedure for this? I couldn't seem to find anything via google.) I've included the kernel log messages that were dumped to the console as well as the stack trace below.

          kernel:LustreError: 13007:0:(statahead.c:262:sa_kill()) ASSERTION( !list_empty(&entry->se_list) ) failed:
          kernel:LustreError: 13007:0:(statahead.c:262:sa_kill()) LBUG

          PID: 13007 TASK: ffff880b0aea2040 CPU: 25 COMMAND: "rsync"
          #0 [ffff881c20ef16f0] machine_kexec at ffffffff8103b5bb
          /usr/src/debug/kernel-2.6.32-504.3.3.el6/linux-2.6.32-504.3.3.el6.x86_64/arch/x86/kernel/machine_kexec_64.c: 336
          #1 [ffff881c20ef1750] crash_kexec at ffffffff810c9852
          /usr/src/debug/kernel-2.6.32-504.3.3.el6/linux-2.6.32-504.3.3.el6.x86_64/kernel/kexec.c: 1106
          #2 [ffff881c20ef1820] panic at ffffffff8152927e
          /usr/src/debug/kernel-2.6.32-504.3.3.el6/linux-2.6.32-504.3.3.el6.x86_64/kernel/panic.c: 111
          #3 [ffff881c20ef18a0] lbug_with_loc at ffffffffa03c5eeb [libcfs]
          /usr/src/debug/lustre-2.6.93/libcfs/libcfs/linux/linux-debug.c: 175
          #4 [ffff881c20ef18c0] revalidate_statahead_dentry at ffffffffa0a0e31d [lustre]
          /usr/src/debug/lustre-2.6.93/lustre/llite/statahead.c: 263
          #5 [ffff881c20ef1a00] ll_statahead at ffffffffa0a0e642 [lustre]
          /usr/src/debug/lustre-2.6.93/libcfs/include/libcfs/libcfs_debug.h: 219
          #6 [ffff881c20ef1ae0] ll_lookup_it at ffffffffa09f8907 [lustre]
          /usr/src/debug/lustre-2.6.93/lustre/llite/namei.c: 541
          #7 [ffff881c20ef1ba0] ll_lookup_nd at ffffffffa09f9029 [lustre]
          /usr/src/debug/lustre-2.6.93/lustre/llite/namei.c: 771
          #8 [ffff881c20ef1bf0] do_lookup at ffffffff8119dc65
          /usr/src/debug/kernel-2.6.32-504.3.3.el6/linux-2.6.32-504.3.3.el6.x86_64/fs/namei.c: 1063
          #9 [ffff881c20ef1c50] __link_path_walk at ffffffff8119e8f4
          /usr/src/debug/kernel-2.6.32-504.3.3.el6/linux-2.6.32-504.3.3.el6.x86_64/fs/namei.c: 1239
          #10 [ffff881c20ef1d30] path_walk at ffffffff8119f40a
          /usr/src/debug/kernel-2.6.32-504.3.3.el6/linux-2.6.32-504.3.3.el6.x86_64/fs/namei.c: 558
          #11 [ffff881c20ef1d70] filename_lookup at ffffffff8119f61b
          /usr/src/debug/kernel-2.6.32-504.3.3.el6/linux-2.6.32-504.3.3.el6.x86_64/fs/namei.c: 1375
          #12 [ffff881c20ef1db0] user_path_at at ffffffff811a0747
          /usr/src/debug/kernel-2.6.32-504.3.3.el6/linux-2.6.32-504.3.3.el6.x86_64/fs/namei.c: 1597
          #13 [ffff881c20ef1e80] vfs_fstatat at ffffffff81193bc0
          /usr/src/debug/kernel-2.6.32-504.3.3.el6/linux-2.6.32-504.3.3.el6.x86_64/fs/stat.c: 84
          #14 [ffff881c20ef1ee0] vfs_lstat at ffffffff81193c7e
          /usr/src/debug/kernel-2.6.32-504.3.3.el6/linux-2.6.32-504.3.3.el6.x86_64/fs/stat.c: 107
          #15 [ffff881c20ef1ef0] sys_newlstat at ffffffff81193ca4
          /usr/src/debug/kernel-2.6.32-504.3.3.el6/linux-2.6.32-504.3.3.el6.x86_64/fs/stat.c: 257
          #16 [ffff881c20ef1f80] system_call_fastpath at ffffffff8100b072
          /usr/src/debug////////kernel-2.6.32-504.3.3.el6/linux-2.6.32-504.3.3.el6.x86_64/arch/x86/kernel/entry_64.S: 489
          RIP: 00000031f78daf75 RSP: 00007fff51e03da8 RFLAGS: 00010206
          RAX: 0000000000000006 RBX: ffffffff8100b072 RCX: 00007fff51e040b0
          RDX: 00007fff51dffb10 RSI: 00007fff51dffb10 RDI: 00007fff51e00ba0
          RBP: 0000000000000000 R8: 00007fff51e00be3 R9: 0000000000000000
          R10: 0000000000000000 R11: 0000000000000246 R12: 00007fff51e00ba0
          R13: 00007fff51dffb10 R14: 00007fff51dffb10 R15: 00007fff51e00ba0
          ORIG_RAX: 0000000000000006 CS: 0033 SS: 002b

          azenk Andrew Zenk added a comment - I see that the patch seems to have been applied, though we seem to be having a similar problem. We're using 2.6.93 on the clients and 2.6.92 on the servers. The problem can be reproduced reliably on our end, so I'd be happy to provide additional logs/diagnostic information (Is there a standard procedure for this? I couldn't seem to find anything via google.) I've included the kernel log messages that were dumped to the console as well as the stack trace below. kernel:LustreError: 13007:0:(statahead.c:262:sa_kill()) ASSERTION( !list_empty(&entry->se_list) ) failed: kernel:LustreError: 13007:0:(statahead.c:262:sa_kill()) LBUG PID: 13007 TASK: ffff880b0aea2040 CPU: 25 COMMAND: "rsync" #0 [ffff881c20ef16f0] machine_kexec at ffffffff8103b5bb /usr/src/debug/kernel-2.6.32-504.3.3.el6/linux-2.6.32-504.3.3.el6.x86_64/arch/x86/kernel/machine_kexec_64.c: 336 #1 [ffff881c20ef1750] crash_kexec at ffffffff810c9852 /usr/src/debug/kernel-2.6.32-504.3.3.el6/linux-2.6.32-504.3.3.el6.x86_64/kernel/kexec.c: 1106 #2 [ffff881c20ef1820] panic at ffffffff8152927e /usr/src/debug/kernel-2.6.32-504.3.3.el6/linux-2.6.32-504.3.3.el6.x86_64/kernel/panic.c: 111 #3 [ffff881c20ef18a0] lbug_with_loc at ffffffffa03c5eeb [libcfs] /usr/src/debug/lustre-2.6.93/libcfs/libcfs/linux/linux-debug.c: 175 #4 [ffff881c20ef18c0] revalidate_statahead_dentry at ffffffffa0a0e31d [lustre] /usr/src/debug/lustre-2.6.93/lustre/llite/statahead.c: 263 #5 [ffff881c20ef1a00] ll_statahead at ffffffffa0a0e642 [lustre] /usr/src/debug/lustre-2.6.93/libcfs/include/libcfs/libcfs_debug.h: 219 #6 [ffff881c20ef1ae0] ll_lookup_it at ffffffffa09f8907 [lustre] /usr/src/debug/lustre-2.6.93/lustre/llite/namei.c: 541 #7 [ffff881c20ef1ba0] ll_lookup_nd at ffffffffa09f9029 [lustre] /usr/src/debug/lustre-2.6.93/lustre/llite/namei.c: 771 #8 [ffff881c20ef1bf0] do_lookup at ffffffff8119dc65 /usr/src/debug/kernel-2.6.32-504.3.3.el6/linux-2.6.32-504.3.3.el6.x86_64/fs/namei.c: 1063 #9 [ffff881c20ef1c50] __link_path_walk at ffffffff8119e8f4 /usr/src/debug/kernel-2.6.32-504.3.3.el6/linux-2.6.32-504.3.3.el6.x86_64/fs/namei.c: 1239 #10 [ffff881c20ef1d30] path_walk at ffffffff8119f40a /usr/src/debug/kernel-2.6.32-504.3.3.el6/linux-2.6.32-504.3.3.el6.x86_64/fs/namei.c: 558 #11 [ffff881c20ef1d70] filename_lookup at ffffffff8119f61b /usr/src/debug/kernel-2.6.32-504.3.3.el6/linux-2.6.32-504.3.3.el6.x86_64/fs/namei.c: 1375 #12 [ffff881c20ef1db0] user_path_at at ffffffff811a0747 /usr/src/debug/kernel-2.6.32-504.3.3.el6/linux-2.6.32-504.3.3.el6.x86_64/fs/namei.c: 1597 #13 [ffff881c20ef1e80] vfs_fstatat at ffffffff81193bc0 /usr/src/debug/kernel-2.6.32-504.3.3.el6/linux-2.6.32-504.3.3.el6.x86_64/fs/stat.c: 84 #14 [ffff881c20ef1ee0] vfs_lstat at ffffffff81193c7e /usr/src/debug/kernel-2.6.32-504.3.3.el6/linux-2.6.32-504.3.3.el6.x86_64/fs/stat.c: 107 #15 [ffff881c20ef1ef0] sys_newlstat at ffffffff81193ca4 /usr/src/debug/kernel-2.6.32-504.3.3.el6/linux-2.6.32-504.3.3.el6.x86_64/fs/stat.c: 257 #16 [ffff881c20ef1f80] system_call_fastpath at ffffffff8100b072 /usr/src/debug////////kernel-2.6.32-504.3.3.el6/linux-2.6.32-504.3.3.el6.x86_64/arch/x86/kernel/entry_64.S: 489 RIP: 00000031f78daf75 RSP: 00007fff51e03da8 RFLAGS: 00010206 RAX: 0000000000000006 RBX: ffffffff8100b072 RCX: 00007fff51e040b0 RDX: 00007fff51dffb10 RSI: 00007fff51dffb10 RDI: 00007fff51e00ba0 RBP: 0000000000000000 R8: 00007fff51e00be3 R9: 0000000000000000 R10: 0000000000000000 R11: 0000000000000246 R12: 00007fff51e00ba0 R13: 00007fff51dffb10 R14: 00007fff51dffb10 R15: 00007fff51e00ba0 ORIG_RAX: 0000000000000006 CS: 0033 SS: 002b

          Thanks, Lai - We'll use that patch when we next update the system. That will be later this week, I'll try to give test results shortly after that.

          paf Patrick Farrell (Inactive) added a comment - - edited Thanks, Lai - We'll use that patch when we next update the system. That will be later this week, I'll try to give test results shortly after that.
          laisiyao Lai Siyao added a comment - - edited

          Patch updated, if test is okay, I'll mark this duplicate.

          laisiyao Lai Siyao added a comment - - edited Patch updated, if test is okay, I'll mark this duplicate.
          laisiyao Lai Siyao added a comment - - edited

          this is a known issue: revalidate_statahead_dentry() may fail to wait for a statahead entry to become ready, in this case it should not release this entry. originally this was found in 2.4 statahead patches, and I'm waiting for their response. I'll update the last patch of LU-3270: http://review.whamcloud.com/#/c/9667/, so if you want to test statahead, it's better to apply this one and test.

          laisiyao Lai Siyao added a comment - - edited this is a known issue: revalidate_statahead_dentry() may fail to wait for a statahead entry to become ready, in this case it should not release this entry. originally this was found in 2.4 statahead patches, and I'm waiting for their response. I'll update the last patch of LU-3270 : http://review.whamcloud.com/#/c/9667/ , so if you want to test statahead, it's better to apply this one and test.
          di.wang Di Wang added a comment -

          From the debug log it seems not related with DNE, hmm, I saw

              LU-3270 statahead: statahead thread wait for RPCs to finish
              
              Statahead thread should wait for inflight stat RPCs to finish in
              case statahead RPC callback may access data allocated in statahead
              thread context.
              
              ll_sa_entry_fini() should keep old entry if stat RPC is not
              finished yet.
              
              Simplify sai refcounting:
              * newly allocated sai will hold one refcount, and it will put it
                after starting statahead thread.
              * statahead thread holds one refcount.
              * agl thread holds one refcount.
              * stat process calls do_statahead_enter() which will try to get
                sai, and if it's valid, it will revalidate from statahead cache,
                and put refcount after use.
              
              Signed-off-by: Lai Siyao <lai.siyao@intel.com>
              Change-Id: I55a4fe66a5f6c04595d3bc84f0cd3750f20e0ee4
              Reviewed-on: http://review.whamcloud.com/9663
              Tested-by: Jenkins
              Tested-by: Maloo <hpdd-maloo@intel.com>
              Reviewed-by: Fan Yong <fan.yong@intel.com>
              Reviewed-by: James Simmons <uja.ornl@gmail.com>
              Reviewed-by: Oleg Drokin <oleg.drokin@intel.com>
          

          Just changed these area in 2.6. Lai, could you please have a look. Thanks.

          di.wang Di Wang added a comment - From the debug log it seems not related with DNE, hmm, I saw LU-3270 statahead: statahead thread wait for RPCs to finish Statahead thread should wait for inflight stat RPCs to finish in case statahead RPC callback may access data allocated in statahead thread context. ll_sa_entry_fini() should keep old entry if stat RPC is not finished yet. Simplify sai refcounting: * newly allocated sai will hold one refcount, and it will put it after starting statahead thread. * statahead thread holds one refcount. * agl thread holds one refcount. * stat process calls do_statahead_enter() which will try to get sai, and if it's valid, it will revalidate from statahead cache, and put refcount after use. Signed-off-by: Lai Siyao <lai.siyao@intel.com> Change-Id: I55a4fe66a5f6c04595d3bc84f0cd3750f20e0ee4 Reviewed-on: http://review.whamcloud.com/9663 Tested-by: Jenkins Tested-by: Maloo <hpdd-maloo@intel.com> Reviewed-by: Fan Yong <fan.yong@intel.com> Reviewed-by: James Simmons <uja.ornl@gmail.com> Reviewed-by: Oleg Drokin <oleg.drokin@intel.com> Just changed these area in 2.6. Lai, could you please have a look. Thanks.

          Note that, weirdly, dirdata was off on MDT0. This system was used to troubleshoot LU-5626 and we forgot to reenable dirdata. We'll reenable it before the next run.

          paf Patrick Farrell (Inactive) added a comment - Note that, weirdly, dirdata was off on MDT0. This system was used to troubleshoot LU-5626 and we forgot to reenable dirdata. We'll reenable it before the next run.

          Full Lustre log extracted from the client is here:
          ftp.whamcloud.com/uploads/LU-5883/c1-0c0s1n2.lustre.log

          paf Patrick Farrell (Inactive) added a comment - Full Lustre log extracted from the client is here: ftp.whamcloud.com/uploads/ LU-5883 /c1-0c0s1n2.lustre.log

          Dump and associated kos (and system logs for that node) going up here:
          ftp.whamcloud.com/uploads/LU-5883/c1-0c0s1n2-20141107.tar.gz

          The node had full debugging enabled when it was crashed.

          Confirmed from the engineer testing it - The ls was of the root of the file system. There should not have been any remote directories in the root of the file system, so this may have nothing to do with DNE II.

          paf Patrick Farrell (Inactive) added a comment - - edited Dump and associated kos (and system logs for that node) going up here: ftp.whamcloud.com/uploads/ LU-5883 /c1-0c0s1n2-20141107.tar.gz The node had full debugging enabled when it was crashed. Confirmed from the engineer testing it - The ls was of the root of the file system. There should not have been any remote directories in the root of the file system, so this may have nothing to do with DNE II.

          People

            laisiyao Lai Siyao
            paf Patrick Farrell (Inactive)
            Votes:
            0 Vote for this issue
            Watchers:
            5 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved: