[LU-6222] LustreError (statahead.c:262:sa_kill()) ASSERTION( !list_empty(&entry->se_list) )

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Minor
    • Fix Version: Lustre 2.7.0
    • Environment: 2.6.93 on the clients and 2.6.92 on the servers, CentOS with RPMs from the lustre-master Jenkins tree.
    • Severity: 3

    Description

      We seem to be having an issue similar to the one described in LU-5883. We're using 2.6.93 on the clients and 2.6.92 on the servers. The problem can be reproduced reliably on our end, so I'd be happy to provide additional logs/diagnostic information (is there a standard procedure for this? I couldn't find anything via Google). I've included the kernel log messages that were dumped to the console as well as the stack trace below.

      kernel:LustreError: 13007:0:(statahead.c:262:sa_kill()) ASSERTION( !list_empty(&entry->se_list) ) failed:
      kernel:LustreError: 13007:0:(statahead.c:262:sa_kill()) LBUG
      PID: 13007 TASK: ffff880b0aea2040 CPU: 25 COMMAND: "rsync"
      #0 [ffff881c20ef16f0] machine_kexec at ffffffff8103b5bb
      /usr/src/debug/kernel-2.6.32-504.3.3.el6/linux-2.6.32-504.3.3.el6.x86_64/arch/x86/kernel/machine_kexec_64.c: 336
      #1 [ffff881c20ef1750] crash_kexec at ffffffff810c9852
      /usr/src/debug/kernel-2.6.32-504.3.3.el6/linux-2.6.32-504.3.3.el6.x86_64/kernel/kexec.c: 1106
      #2 [ffff881c20ef1820] panic at ffffffff8152927e
      /usr/src/debug/kernel-2.6.32-504.3.3.el6/linux-2.6.32-504.3.3.el6.x86_64/kernel/panic.c: 111
      #3 [ffff881c20ef18a0] lbug_with_loc at ffffffffa03c5eeb [libcfs]
      /usr/src/debug/lustre-2.6.93/libcfs/libcfs/linux/linux-debug.c: 175
      #4 [ffff881c20ef18c0] revalidate_statahead_dentry at ffffffffa0a0e31d [lustre]
      /usr/src/debug/lustre-2.6.93/lustre/llite/statahead.c: 263
      #5 [ffff881c20ef1a00] ll_statahead at ffffffffa0a0e642 [lustre]
      /usr/src/debug/lustre-2.6.93/libcfs/include/libcfs/libcfs_debug.h: 219
      #6 [ffff881c20ef1ae0] ll_lookup_it at ffffffffa09f8907 [lustre]
      /usr/src/debug/lustre-2.6.93/lustre/llite/namei.c: 541
      #7 [ffff881c20ef1ba0] ll_lookup_nd at ffffffffa09f9029 [lustre]
      /usr/src/debug/lustre-2.6.93/lustre/llite/namei.c: 771
      #8 [ffff881c20ef1bf0] do_lookup at ffffffff8119dc65
      /usr/src/debug/kernel-2.6.32-504.3.3.el6/linux-2.6.32-504.3.3.el6.x86_64/fs/namei.c: 1063
      #9 [ffff881c20ef1c50] __link_path_walk at ffffffff8119e8f4
      /usr/src/debug/kernel-2.6.32-504.3.3.el6/linux-2.6.32-504.3.3.el6.x86_64/fs/namei.c: 1239
      #10 [ffff881c20ef1d30] path_walk at ffffffff8119f40a
      /usr/src/debug/kernel-2.6.32-504.3.3.el6/linux-2.6.32-504.3.3.el6.x86_64/fs/namei.c: 558
      #11 [ffff881c20ef1d70] filename_lookup at ffffffff8119f61b
      /usr/src/debug/kernel-2.6.32-504.3.3.el6/linux-2.6.32-504.3.3.el6.x86_64/fs/namei.c: 1375
      #12 [ffff881c20ef1db0] user_path_at at ffffffff811a0747
      /usr/src/debug/kernel-2.6.32-504.3.3.el6/linux-2.6.32-504.3.3.el6.x86_64/fs/namei.c: 1597
      #13 [ffff881c20ef1e80] vfs_fstatat at ffffffff81193bc0
      /usr/src/debug/kernel-2.6.32-504.3.3.el6/linux-2.6.32-504.3.3.el6.x86_64/fs/stat.c: 84
      #14 [ffff881c20ef1ee0] vfs_lstat at ffffffff81193c7e
      /usr/src/debug/kernel-2.6.32-504.3.3.el6/linux-2.6.32-504.3.3.el6.x86_64/fs/stat.c: 107
      #15 [ffff881c20ef1ef0] sys_newlstat at ffffffff81193ca4
      /usr/src/debug/kernel-2.6.32-504.3.3.el6/linux-2.6.32-504.3.3.el6.x86_64/fs/stat.c: 257
      #16 [ffff881c20ef1f80] system_call_fastpath at ffffffff8100b072
      /usr/src/debug/kernel-2.6.32-504.3.3.el6/linux-2.6.32-504.3.3.el6.x86_64/arch/x86/kernel/entry_64.S: 489
      RIP: 00000031f78daf75 RSP: 00007fff51e03da8 RFLAGS: 00010206
      RAX: 0000000000000006 RBX: ffffffff8100b072 RCX: 00007fff51e040b0
      RDX: 00007fff51dffb10 RSI: 00007fff51dffb10 RDI: 00007fff51e00ba0
      RBP: 0000000000000000 R8: 00007fff51e00be3 R9: 0000000000000000
      R10: 0000000000000000 R11: 0000000000000246 R12: 00007fff51e00ba0
      R13: 00007fff51dffb10 R14: 00007fff51dffb10 R15: 00007fff51e00ba0
      ORIG_RAX: 0000000000000006 CS: 0033 SS: 002b
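
      For context on the assertion itself: sa_kill() is only supposed to run on a statahead entry that is still linked on the sent/received list via its se_list head, and the LASSERT at statahead.c:262 trips when that head is empty, escalating to LBUG and the panic captured above. Below is a minimal userspace model of the check; it is a sketch only, the struct layout and helper names are borrowed from the trace for readability and are not the actual Lustre 2.6.93 source.

      /* Minimal model of the failing check; not the Lustre source. */
      #include <assert.h>

      /* Kernel-style list head: an "empty" head points back to itself. */
      struct list_node {
              struct list_node *prev, *next;
      };

      static int list_node_empty(const struct list_node *n)
      {
              return n->next == n;
      }

      /* Stand-in for struct sa_entry; se_list links the entry on the
       * statahead sent/received lists while a stat-ahead is in flight. */
      struct sa_entry_model {
              struct list_node se_list;
      };

      static void sa_kill_model(struct sa_entry_model *entry)
      {
              /* Models LASSERT(!list_empty(&entry->se_list)): killing an
               * entry that is not on any list fires the assertion, which in
               * the kernel escalates to LBUG and a client panic. */
              assert(!list_node_empty(&entry->se_list));

              /* Unlink the entry and leave its head self-linked (empty). */
              entry->se_list.prev->next = entry->se_list.next;
              entry->se_list.next->prev = entry->se_list.prev;
              entry->se_list.prev = entry->se_list.next = &entry->se_list;
      }

      int main(void)
      {
              /* Self-linked head: the entry was never added to a list. */
              struct sa_entry_model e = { { &e.se_list, &e.se_list } };

              sa_kill_model(&e);      /* fires the assert, mirroring the crash */
              return 0;
      }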

      Attachments

        Activity

          [LU-6222] LustreError (statahead.c:262:sa_kill()) ASSERTION( !list_empty(&entry->se_list) )

          gerrit Gerrit Updater added a comment -

          Lai Siyao (lai.siyao@intel.com) uploaded a new patch: http://review.whamcloud.com/15178
          Subject: LU-6222 statahead: add to list before make ready
          Project: fs/lustre-release
          Branch: b2_5
          Current Patch Set: 1
          Commit: 1091da447db103549ca8e20d5c5fb97679f7080c
          pjones Peter Jones added a comment -

          Landed for 2.7


          gerrit Gerrit Updater added a comment -

          Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/13708/
          Subject: LU-6222 statahead: add to list before make ready
          Project: fs/lustre-release
          Branch: master
          Current Patch Set:
          Commit: c59baf3ff10beca9841bc8aae211af120ab913dc
          azenk Andrew Zenk added a comment -

          That seems to have fixed it. The rsync script that consistently caused the issue after a minute or two has been running flawlessly for many hours. Thanks again.

          azenk Andrew Zenk added a comment -

          Thanks! We're testing it now.

          laisiyao Lai Siyao added a comment -

          Andrew, I just uploaded a fix for this issue; could you apply it and test again?


          gerrit Gerrit Updater added a comment -

          Lai Siyao (lai.siyao@intel.com) uploaded a new patch: http://review.whamcloud.com/13708
          Subject: LU-6222 statahead: add to list before make ready
          Project: fs/lustre-release
          Branch: master
          Current Patch Set: 1
          Commit: c77d1c16d45850e3cd4492d27746b30a89ba2beb
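
          Judging from the patch subject ("statahead: add to list before make ready"), the fix reorders how a new statahead entry is published: the entry is linked onto the list first and only marked ready afterwards, so a lookup thread that sees a ready entry can no longer find it off-list and trip the sa_kill() assertion. The sketch below is a small userspace model of that ordering, not the actual Lustre code; the thread roles, variable names, and busy-wait are illustrative. With fixed_order set to 0 it models the pre-patch ordering, where the assert can fire intermittently, matching the seemingly random panics during the rsync runs.

          /* Userspace model of the "add to list before make ready" ordering;
           * not the Lustre statahead code.  Build with: gcc -pthread */
          #include <assert.h>
          #include <pthread.h>
          #include <stdatomic.h>
          #include <stdio.h>

          /* "linked" stands in for the entry being on the sent/received list,
           * "ready" for the flag the lookup path waits on before using it. */
          static atomic_int linked;
          static atomic_int ready;

          static int fixed_order = 1;     /* 0 models the pre-patch ordering */

          /* Statahead-thread side: publishes a freshly completed entry. */
          static void *statahead_side(void *unused)
          {
                  (void)unused;
                  if (fixed_order) {
                          linked = 1;     /* add to the list first... */
                          ready = 1;      /* ...then mark the entry ready */
                  } else {
                          ready = 1;      /* marked ready while still off-list */
                          linked = 1;     /* the lookup side may already have run */
                  }
                  return NULL;
          }

          /* Lookup side: once the entry looks ready, killing it expects it
           * to be linked, just like the LASSERT in sa_kill(). */
          static void *lookup_side(void *unused)
          {
                  (void)unused;
                  while (!ready)
                          ;               /* wait for the entry to be published */
                  assert(linked);         /* models LASSERT(!list_empty(...)) */
                  return NULL;
          }

          int main(void)
          {
                  pthread_t producer, consumer;

                  pthread_create(&consumer, NULL, lookup_side, NULL);
                  pthread_create(&producer, NULL, statahead_side, NULL);
                  pthread_join(producer, NULL);
                  pthread_join(consumer, NULL);
                  puts("lookup side saw a linked entry");
                  return 0;
          }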
          azenk Andrew Zenk added a comment - - edited

          Attached output from foreach bt -l

          laisiyao Lai Siyao added a comment -

          Could you list all process backtraces in the dump?

          azenk Andrew Zenk added a comment -

          We do not use DNE. In this case a user is running an rsync as follows: "rsync --size-only --progress -av --prune-empty-dirs --include=/ --exclude=.man --exclude=_eot.txt --exclude=.MAN --exclude=_EOT.TXT --include=052903541090_01* --exclude=* /lustre_mountpoint/staging/orig/_uploads/DG_A11281 /lustre_mountpoint/somepath/northslope/". This command is run several times serially from a single client, with slight variations of the include pattern and source directory for each run. At some seemingly random point during the sequence of rsync jobs, the kernel on the client node panics.

          Configuration:
          We're using a single MDS with nine OSSs. Each OSS contains 3 direct-attached targets. Our system isn't designed for failover. The MDS and OSS nodes are booted from a common image built on CentOS 6.6 with this build: https://build.hpdd.intel.com/job/lustre-master/arch=x86_64,build_type=server,distro=el6.6,ib_stack=inkernel/2832/artifact/artifacts/RPMS/x86_64/.

          I'm happy to supply exact specs on RAID configurations and disk counts if you feel they're important, but we'll skip that for now. The MDS is using a single target on an SSD RAID 10. The OSTs are SATA of various types. All servers are connected to our QDR IB fabric as well as a gigabit VLAN; the latter is used for connecting 3 clients that aren't experiencing any issues.

          There are approximately 20 clients, which are also running CentOS 6.6. The Lustre clients are installed from the pre-built RPMs from lustre-master, just like the servers, though the clients are on slightly mixed build versions. The two clients we've reproduced the issue on were both running build #2835.

          The entire filesystem has a stripe count of 1.

          Let me know if you need any additional information. Thanks!


          People

            Assignee: laisiyao Lai Siyao
            Reporter: azenk Andrew Zenk
            Votes: 0
            Watchers: 7
