  Lustre / LU-6222

LustreError (statahead.c:262:sa_kill()) ASSERTION( !list_empty(&entry->se_list) )

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Minor
    • Fix Version/s: Lustre 2.7.0
    • None
    • None
    • Environment: 2.6.93 on the clients and 2.6.92 on the servers; CentOS with RPMs from the lustre-master Jenkins tree.
    • Severity: 3
    • 17399

    Description

      We seem to be having an issue similar to the one described in LU-5883. We're using 2.6.93 on the clients and 2.6.92 on the servers. The problem can be reproduced reliably on our end, so I'd be happy to provide additional logs/diagnostic information (is there a standard procedure for this? I couldn't find anything via Google). I've included the kernel log messages that were dumped to the console as well as the stack trace below.

      kernel:LustreError: 13007:0:(statahead.c:262:sa_kill()) ASSERTION( !list_empty(&entry->se_list) ) failed:
      kernel:LustreError: 13007:0:(statahead.c:262:sa_kill()) LBUG
      PID: 13007 TASK: ffff880b0aea2040 CPU: 25 COMMAND: "rsync"
      #0 [ffff881c20ef16f0] machine_kexec at ffffffff8103b5bb
      /usr/src/debug/kernel-2.6.32-504.3.3.el6/linux-2.6.32-504.3.3.el6.x86_64/arch/x86/kernel/machine_kexec_64.c: 336
      #1 [ffff881c20ef1750] crash_kexec at ffffffff810c9852
      /usr/src/debug/kernel-2.6.32-504.3.3.el6/linux-2.6.32-504.3.3.el6.x86_64/kernel/kexec.c: 1106
      #2 [ffff881c20ef1820] panic at ffffffff8152927e
      /usr/src/debug/kernel-2.6.32-504.3.3.el6/linux-2.6.32-504.3.3.el6.x86_64/kernel/panic.c: 111
      #3 [ffff881c20ef18a0] lbug_with_loc at ffffffffa03c5eeb [libcfs]
      /usr/src/debug/lustre-2.6.93/libcfs/libcfs/linux/linux-debug.c: 175
      #4 [ffff881c20ef18c0] revalidate_statahead_dentry at ffffffffa0a0e31d [lustre]
      /usr/src/debug/lustre-2.6.93/lustre/llite/statahead.c: 263
      #5 [ffff881c20ef1a00] ll_statahead at ffffffffa0a0e642 [lustre]
      /usr/src/debug/lustre-2.6.93/libcfs/include/libcfs/libcfs_debug.h: 219
      #6 [ffff881c20ef1ae0] ll_lookup_it at ffffffffa09f8907 [lustre]
      /usr/src/debug/lustre-2.6.93/lustre/llite/namei.c: 541
      #7 [ffff881c20ef1ba0] ll_lookup_nd at ffffffffa09f9029 [lustre]
      /usr/src/debug/lustre-2.6.93/lustre/llite/namei.c: 771
      #8 [ffff881c20ef1bf0] do_lookup at ffffffff8119dc65
      /usr/src/debug/kernel-2.6.32-504.3.3.el6/linux-2.6.32-504.3.3.el6.x86_64/fs/namei.c: 1063
      #9 [ffff881c20ef1c50] __link_path_walk at ffffffff8119e8f4
      /usr/src/debug/kernel-2.6.32-504.3.3.el6/linux-2.6.32-504.3.3.el6.x86_64/fs/namei.c: 1239
      #10 [ffff881c20ef1d30] path_walk at ffffffff8119f40a
      /usr/src/debug/kernel-2.6.32-504.3.3.el6/linux-2.6.32-504.3.3.el6.x86_64/fs/namei.c: 558
      #11 [ffff881c20ef1d70] filename_lookup at ffffffff8119f61b
      /usr/src/debug/kernel-2.6.32-504.3.3.el6/linux-2.6.32-504.3.3.el6.x86_64/fs/namei.c: 1375
      #12 [ffff881c20ef1db0] user_path_at at ffffffff811a0747
      /usr/src/debug/kernel-2.6.32-504.3.3.el6/linux-2.6.32-504.3.3.el6.x86_64/fs/namei.c: 1597
      #13 [ffff881c20ef1e80] vfs_fstatat at ffffffff81193bc0
      /usr/src/debug/kernel-2.6.32-504.3.3.el6/linux-2.6.32-504.3.3.el6.x86_64/fs/stat.c: 84
      #14 [ffff881c20ef1ee0] vfs_lstat at ffffffff81193c7e
      /usr/src/debug/kernel-2.6.32-504.3.3.el6/linux-2.6.32-504.3.3.el6.x86_64/fs/stat.c: 107
      #15 [ffff881c20ef1ef0] sys_newlstat at ffffffff81193ca4
      /usr/src/debug/kernel-2.6.32-504.3.3.el6/linux-2.6.32-504.3.3.el6.x86_64/fs/stat.c: 257
      #16 [ffff881c20ef1f80] system_call_fastpath at ffffffff8100b072
      /usr/src/debug/kernel-2.6.32-504.3.3.el6/linux-2.6.32-504.3.3.el6.x86_64/arch/x86/kernel/entry_64.S: 489
      RIP: 00000031f78daf75 RSP: 00007fff51e03da8 RFLAGS: 00010206
      RAX: 0000000000000006 RBX: ffffffff8100b072 RCX: 00007fff51e040b0
      RDX: 00007fff51dffb10 RSI: 00007fff51dffb10 RDI: 00007fff51e00ba0
      RBP: 0000000000000000 R8: 00007fff51e00be3 R9: 0000000000000000
      R10: 0000000000000000 R11: 0000000000000246 R12: 00007fff51e00ba0
      R13: 00007fff51dffb10 R14: 00007fff51dffb10 R15: 00007fff51e00ba0
      ORIG_RAX: 0000000000000006 CS: 0033 SS: 002b

      Attachments

        Activity


          gerrit Gerrit Updater added a comment -

          Lai Siyao (lai.siyao@intel.com) uploaded a new patch: http://review.whamcloud.com/15178
          Subject: LU-6222 statahead: add to list before make ready
          Project: fs/lustre-release
          Branch: b2_5
          Current Patch Set: 1
          Commit: 1091da447db103549ca8e20d5c5fb97679f7080c

          pjones Peter Jones added a comment -

          Landed for 2.7


          gerrit Gerrit Updater added a comment -

          Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/13708/
          Subject: LU-6222 statahead: add to list before make ready
          Project: fs/lustre-release
          Branch: master
          Current Patch Set:
          Commit: c59baf3ff10beca9841bc8aae211af120ab913dc

          azenk Andrew Zenk added a comment -

          That seems to have fixed it. The rsync script that consistently caused the issue after a minute or two has been running flawlessly for many hours. Thanks again.

          azenk Andrew Zenk added a comment -

          Thanks! We're testing it now.

          laisiyao Lai Siyao added a comment -

          Andrew, I just uploaded a fix for this issue; could you apply it and test again?


          gerrit Gerrit Updater added a comment -

          Lai Siyao (lai.siyao@intel.com) uploaded a new patch: http://review.whamcloud.com/13708
          Subject: LU-6222 statahead: add to list before make ready
          Project: fs/lustre-release
          Branch: master
          Current Patch Set: 1
          Commit: c77d1c16d45850e3cd4492d27746b30a89ba2beb

          azenk Andrew Zenk added a comment (edited) -

          Attached the output from "foreach bt -l".

          laisiyao Lai Siyao added a comment -

          Could you list all of the process backtraces in the dump?

          azenk Andrew Zenk added a comment -

          We do not use DNE. In this case a user is running an rsync as follows: "rsync --size-only --progress -av --prune-empty-dirs --include=/ --exclude=.man --exclude=_eot.txt --exclude=.MAN --exclude=_EOT.TXT --include=052903541090_01* --exclude=* /lustre_mountpoint/staging/orig/_uploads/DG_A11281 /lustre_mountpoint/somepath/northslope/". This command is run several times serially from a single client, with slight variations of the include pattern and source directory for each run. At some seemingly random point during the sequence of rsync jobs, the kernel on the client node panics.

          Configuration:
          We're using a single MDS with nine OSSs. Each OSS contains three direct-attached targets. Our system isn't designed for failover. The MDS and OSS nodes are booted from a common image built on CentOS 6.6 with this build: https://build.hpdd.intel.com/job/lustre-master/arch=x86_64,build_type=server,distro=el6.6,ib_stack=inkernel/2832/artifact/artifacts/RPMS/x86_64/.

          I'm happy to supply exact specs on RAID configurations and disk counts if you feel it's important, but we'll skip that for now. The MDS is using a single target on an SSD RAID 10. The OSTs are SATA disks of various types. All servers are connected to our QDR IB fabric as well as a gigabit VLAN; the latter is used for connecting three clients that aren't experiencing any issues.

          There are approximately 20 clients, which are also running CentOS 6.6. The Lustre clients are installed via the pre-built RPMs from lustre-master, just like the servers, though the clients are running slightly mixed build versions. The two clients we've reproduced the issue on were both running build #2835.

          The entire filesystem has a stripe count of 1.

          Let me know if you need any additional information. Thanks!

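The serial rsync workload described above can be sketched as a loop over include patterns. Paths and patterns here are illustrative placeholders, and the commands are echoed rather than executed, so the sketch is safe to run anywhere:

```shell
#!/bin/sh
# Illustrative sketch of the serial rsync workload from the report.
# SRC_BASE, DST, and the include patterns are placeholders, not the
# real dataset.
SRC_BASE=/lustre_mountpoint/staging/orig/_uploads
DST=/lustre_mountpoint/somepath/northslope/

for inc in '052903541090_01*' '052903541090_02*' '052903541090_03*'; do
    # echo instead of running rsync, so nothing is actually copied
    echo rsync --size-only --progress -av --prune-empty-dirs \
        --include=/ --exclude=.man --exclude=_eot.txt \
        --exclude=.MAN --exclude=_EOT.TXT \
        "--include=$inc" '--exclude=*' \
        "$SRC_BASE/DG_A11281" "$DST"
done
```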
          green Oleg Drokin added a comment -

          Can you please detail your reproduction steps if this is something we can easily replicate?
          Do you use DNE? What's your exact configuration of Lustre?


          People

            laisiyao Lai Siyao
            azenk Andrew Zenk
            Votes: 0
            Watchers: 7
