Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Blocker
    • Fix Version/s: Lustre 2.4.0
    • Affects Version/s: Lustre 2.4.0
    • Environment: Sequoia LAC node, 2.3.54-2chaos (github.com/lustre/chaos)
    • 3
    • 5663

    Description

      We have companion nodes to Sequoia called "LAC nodes". LAC stands for "large application compile", or something like that. These nodes are NOT normal BG/Q architecture; instead they are regular large Power 7 architecture nodes. But otherwise they are similar to Sequoia I/O Nodes in that they are ppc64, use 64K pages, and run RHEL for the OS.

      seqlac2 crashed on 2012-11-19 16:18:40, hitting the following Oops:

      2012-11-19 16:18:40 Faulting instruction address: 0xd0000000139b6710
      2012-11-19 16:18:40 Oops: Kernel access of bad area, sig: 11 [#1]
      2012-11-19 16:18:40 SMP NR_CPUS=1024 NUMA pSeries
      2012-11-19 16:18:40 Modules linked in: xt_owner nf_conntrack_ipv4 nf_defrag_ipv4 xt_state nf_conntrack lmv(U) mgc(U) lustre(U) mdc(U) fid(U) fld(U) lov(U) osc(U) ptlrpc(U) obdclass(U) lvfs(U) nfs fscache lockd nfs_acl auth_rpcgss ko2iblnd(U) lnet(U) sha512_generic sha256_generic libcfs(U) sunrpc ipt_LOG xt_multiport iptable_filter ip_tables ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_addr ipv6 ib_sa uinput raid1 ses enclosure sg mlx4_ib ib_mad ib_core mlx4_en mlx4_core e1000e ehea ext4 jbd2 mbcache raid456 async_pq async_xor xor async_raid6_recov raid6_pq async_memcpy async_tx sd_mod crc_t10dif ipr dm_mirror dm_region_hash dm_log dm_mod [last unloaded: scsi_wait_scan]
      2012-11-19 16:18:40 NIP: d0000000139b6710 LR: d0000000139b6700 CTR: c0000000005be960
      2012-11-19 16:18:40 REGS: c000000f54aab080 TRAP: 0300   Not tainted  (2.6.32-279.11.1.1chaos.bgq62.ppc64)
      2012-11-19 16:18:40 MSR: 8000000000009032 <EE,ME,IR,DR>  CR: 24008442  XER: 20000000
      2012-11-19 16:18:40 DAR: 0000000000000008, DSISR: 0000000042000000
      2012-11-19 16:18:40 TASK = c000000f539312d0[10079] 'ls' THREAD: c000000f54aa8000 CPU: 32
      2012-11-19 16:18:40 GPR00: c00000025f6e5d90 c000000f54aab300 d000000013a0d878 c000000ecf32d120 
      2012-11-19 16:18:40 GPR04: 5a5a5a5a5a5a5a5a 0000000000000004 c0000005ff4e506c 0000000000000002 
      2012-11-19 16:18:40 GPR08: c000000000ff7500 0000000000000000 c000000001099004 0000000000000000 
      2012-11-19 16:18:40 GPR12: d0000000139d2610 c000000000ff7500 0000000010022ff8 00000000000000e8 
      2012-11-19 16:18:40 GPR16: c000000739a8f200 0000000000000000 0000000000000001 0000000000000003 
      2012-11-19 16:18:40 GPR20: c000000ecf32cee0 0000000000000000 c00000025f6e5d80 0000000000000000 
      2012-11-19 16:18:40 GPR24: c000000f45ae0000 c00000025f6e5d80 d00000000cecfdc0 d00000000cecfdc4 
      2012-11-19 16:18:40 GPR28: 0000000000003300 c000000ecf32cd80 d000000013a0aca8 c000000f54aab300 
      2012-11-19 16:18:40 NIP [d0000000139b6710] .ll_sai_unplug+0x210/0x860 [lustre]
      2012-11-19 16:18:40 LR [d0000000139b6700] .ll_sai_unplug+0x200/0x860 [lustre]
      2012-11-19 16:18:40 Call Trace:
      2012-11-19 16:18:40 [c000000f54aab300] [d0000000139b6700] .ll_sai_unplug+0x200/0x860 [lustre] (unreliable)
      2012-11-19 16:18:40 [c000000f54aab430] [d0000000139b7bc8] .do_statahead_enter+0x218/0x22b0 [lustre]
      2012-11-19 16:18:40 [c000000f54aab5f0] [d000000013933dec] .ll_revalidate_it+0x82c/0x21f0 [lustre]
      2012-11-19 16:18:40 [c000000f54aab750] [d00000001393595c] .ll_revalidate_nd+0x1ac/0x560 [lustre]
      2012-11-19 16:18:40 [c000000f54aab810] [c0000000001d55b0] .do_lookup+0xa0/0x2d0
      2012-11-19 16:18:40 [c000000f54aab8e0] [c0000000001d8c50] .__link_path_walk+0x9b0/0x15a0
      2012-11-19 16:18:40 [c000000f54aaba10] [c0000000001d9bf8] .path_walk+0x98/0x180
      2012-11-19 16:18:40 [c000000f54aabab0] [c0000000001d9eec] .do_path_lookup+0x7c/0xf0
      2012-11-19 16:18:40 [c000000f54aabb40] [c0000000001dac50] .user_path_at+0x60/0xb0
      2012-11-19 16:18:40 [c000000f54aabc90] [c0000000001ce144] .vfs_fstatat+0x44/0xb0
      2012-11-19 16:18:40 [c000000f54aabd30] [c0000000001ce314] .SyS_newlstat+0x24/0x50
      2012-11-19 16:18:40 [c000000f54aabe30] [c000000000008564] syscall_exit+0x0/0x40
      2012-11-19 16:18:40 Instruction dump:
      2012-11-19 16:18:40 81f90058 ea1d0000 79ef06e0 3a10ff08 39ef00e8 79e31764 7c7d1a14 4801bf15 
      2012-11-19 16:18:40 e8410028 e9390018 e9790010 38190010 <f92b0008> f9690000 f8190018 f8190010 
      

      Note that this node was hitting many order:1 page allocation failures that day, with the most recent one less than a minute before the Oops in Lustre.

      The attached file "console.seqlac2.txt" contains a more complete console log.

      Using gdb to look up the line number of ll_sai_unplug+0x200 revealed that it was at lustre/llite/statahead.c:133, which puts it in ll_sa_entry_unhash() at the first cfs_spin_lock(). So if functions were not inlined, the end of the backtrace would probably look like this:

      ll_sa_entry_unhash
      do_sai_entry_fini
      ll_sa_entry_fini
      ll_sai_unplug
      do_statahead_enter
      ll_statahead_enter
      ll_revalidate_it
      ll_revalidate_nd
      

      Because of the inlining, I can not be entirely certain which of the three do_sai_entry_fini() calls in ll_sa_entry_fini() this was.
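
      For reference, the symbol+offset lookup described above can be reproduced with gdb against a lustre.ko that still has its debuginfo (the module path below is hypothetical):

      $ gdb /path/to/lustre.ko
      (gdb) list *(ll_sai_unplug+0x200)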

    Attachments

    Issue Links

    Activity

    [LU-2388] Oops in ll_sai_unplug
            pjones Peter Jones added a comment -

            Landed for 2.4

            laisiyao Lai Siyao added a comment -

            patch is on http://review.whamcloud.com/#change,5842
            laisiyao Lai Siyao added a comment -

            The culprit should be in ll_sa_entry_fini():

                    /* drop old entry from sent list */
                    cfs_list_for_each_entry_safe(pos, next, &sai->sai_entries_sent,
                                                 se_list) {
                            if (is_omitted_entry(sai, pos->se_index))
                                    do_sai_entry_fini(sai, pos);
                            else
                                    break;
                    }
            
                    /* drop old entry from stated list */
                    cfs_list_for_each_entry_safe(pos, next, &sai->sai_entries_stated,
                                                 se_list) {
                            if (is_omitted_entry(sai, pos->se_index))
                                    do_sai_entry_fini(sai, pos);
                            else
                                    break;
                    }
            

            Iterating these two lists is not protected by ll_sai_lock. Although no entry will be removed concurrently (this is the only place entries are removed), an entry might be unlinked from a list and left dangling, so while iterating, both next and next->next may end up pointing at the same entry, and one entry may be removed twice, which could cause the crash above.

            I'll commit a fix later.
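
            To make that staleness concrete, here is a small user-space sketch (illustration only, not Lustre code; the list helpers below are minimal stand-ins for the kernel primitives) of how a "_safe" walker's cached next pointer goes stale when another context unlinks that entry:

            #include <stdio.h>
            #include <stddef.h>

            /* Minimal user-space stand-ins for the kernel list primitives. */
            struct list_head { struct list_head *next, *prev; };

            #define LIST_HEAD_INIT(name) { &(name), &(name) }

            static void list_add_tail(struct list_head *n, struct list_head *h)
            {
                    n->prev = h->prev;
                    n->next = h;
                    h->prev->next = n;
                    h->prev = n;
            }

            static void list_del_init(struct list_head *e)
            {
                    e->next->prev = e->prev;        /* __list_del() */
                    e->prev->next = e->next;
                    e->next = e->prev = e;          /* re-initialize */
            }

            struct entry { int se_index; struct list_head se_list; };

            #define entry_of(ptr) \
                    ((struct entry *)((char *)(ptr) - offsetof(struct entry, se_list)))

            int main(void)
            {
                    struct list_head sent = LIST_HEAD_INIT(sent);
                    struct entry a = { .se_index = 1 }, b = { .se_index = 2 };

                    list_add_tail(&a.se_list, &sent);
                    list_add_tail(&b.se_list, &sent);

                    /* A cfs_list_for_each_entry_safe()-style walker processing "a"
                     * has already cached next = a.se_list.next (that is, "b") ... */
                    struct list_head *next = a.se_list.next;

                    /* ... when another context unlinks "b". */
                    list_del_init(&b.se_list);

                    /* The walker still advances to the stale cached pointer and
                     * revisits "b" even though it is off the sent list, so the
                     * entry can be handed to the fini path a second time. */
                    printf("walker revisits entry %d; still linked? %s\n",
                           entry_of(next)->se_index,
                           next->next == next ? "no" : "yes");
                    return 0;
            }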


            morrone Christopher Morrone (Inactive) added a comment -

            Yes, there is an ll_sa_13026 thread. There is also an ll_agl_13026 thread. Not too much in the bt:

            crash> bt 13027 -s -x
            PID: 13027  TASK: c000000d089271d0  CPU: 5   COMMAND: "ll_sa_13026"
             #0 [c000000053857a90] .schedule+0x3f8 at c0000000005bb0a8
             #1 [c000000053857d90] .cfs_waitq_wait+0x10 at d00000000c962020 [libcfs]
             #2 [c000000053857e00] .ll_statahead_thread+0x5ec at d00000001839dc1c [lustre]
             #3 [c000000053857f90] .kernel_thread+0x54 at c000000000032fd4
            crash> bt 13028 -s -x
            PID: 13028  TASK: c0000000536fda50  CPU: 30  COMMAND: "ll_agl_13026"
             #0 [c0000000387dbae0] .schedule+0x3f8 at c0000000005bb0a8
             #1 [c0000000387dbde0] .cfs_waitq_wait+0x10 at d00000000c962020 [libcfs]
             #2 [c0000000387dbe50] .ll_agl_thread+0x550 at d00000001839f130 [lustre]
             #3 [c0000000387dbf90] .kernel_thread+0x54 at c000000000032fd4
            

            The ll_sa_13026 thread appears to be in this l_wait_event:

            keep_it:
                                    l_wait_event(thread->t_ctl_waitq,
                                                 !sa_sent_full(sai) ||
                                                 !sa_received_empty(sai) ||
                                                 !agl_list_empty(sai) ||
                                                 !thread_is_running(thread),
                                                 &lwi);
            
            interpret_it:
            

            And ll_agl_13026 is in this l_wait_event:

                    while (1) {
                            l_wait_event(thread->t_ctl_waitq,
                                         !agl_list_empty(sai) ||
                                         !thread_is_running(thread),
                                         &lwi);
            
                            if (!thread_is_running(thread))
            
            laisiyao Lai Siyao added a comment -

            Chris, could you help analyze the crash dump to find whether 'll_sa_<ls pid>' thread exists, and if yes, use `bt <statahead_thread_pid>` to see its backtrace?

            laisiyao Lai Siyao added a comment -

            The analysis looks reasonable: it shows that an sa_entry's se_hash.next is NULL, but in the life of an sa_entry it is NULL only when the entry has just been allocated, and at that moment the entry cannot yet be found and used. I will review the related code later.


            morrone Christopher Morrone (Inactive) added a comment -

            I believe the particular ll_sai_unplug() call that we are in under do_statahead_enter() is this one on line 1672:

              1665                                  if ((bits & MDS_INODELOCK_LOOKUP) &&
              1666                                      d_lustre_invalid(*dentryp))
              1667                                          d_lustre_revalidate(*dentryp);
              1668                                  ll_intent_release(&it);
              1669                          }
              1670                  }
              1671  
              1672                  ll_sai_unplug(sai, entry);
              1673                  RETURN(rc);
              1674          }
            

            Further, the console shows:

            2013-03-13 19:34:20 NIP [d000000018395a50] .ll_sai_unplug+0x210/0x860 [lustre]
            2013-03-13 19:34:20 LR [d000000018395a40] .ll_sai_unplug+0x200/0x860 [lustre]
            

            I've already shown that seems to put us in __list_del. If I back the address up a bit to find out where we are, I get:

            (gdb) list *(ll_sai_unplug+0x198)
            0x859d8 is in ll_sai_unplug (/builddir/build/BUILD/lustre-2.3.58/lustre/llite/statahead.c:191).
            186     /builddir/build/BUILD/lustre-2.3.58/lustre/llite/statahead.c: No such file or directory.
                    in /builddir/build/BUILD/lustre-2.3.58/lustre/llite/statahead.c
            

            That turns out to be in the inline function is_omitted_entry(), which tells us that we are in ll_sa_entry_fini(), in either the second or the third call to do_sai_entry_fini(). So let's back up a bit more:

            (gdb) list *(ll_sai_unplug+0x192)
            0x859d2 is in ll_sai_unplug (/builddir/build/BUILD/lustre-2.3.58/lustre/llite/statahead.c:374).
            369     in /builddir/build/BUILD/lustre-2.3.58/lustre/llite/statahead.c
            

            That puts us here:

               373          /* drop old entry from sent list */
               374          cfs_list_for_each_entry_safe(pos, next, &sai->sai_entries_sent,
               375                                       se_list) {
               376                  if (is_omitted_entry(sai, pos->se_index))
               377                          do_sai_entry_fini(sai, pos);
               378                  else
               379                          break;
               380          }
            

            And then walking the address forward

            (gdb) list *(ll_sai_unplug+0x234)
            0x85a74 is in ll_sai_unplug (/builddir/build/BUILD/lustre-2.3.58/lustre/llite/statahead.c:353).
            348     /builddir/build/BUILD/lustre-2.3.58/lustre/llite/statahead.c: No such file or directory.
                    in /builddir/build/BUILD/lustre-2.3.58/lustre/llite/statahead.c
            

            puts us on line 353:

              346  static inline void
               347  do_sai_entry_fini(struct ll_statahead_info *sai, struct ll_sa_entry *entry)
               348  {
               349          struct ll_inode_info *lli = ll_i2info(sai->sai_inode);
               350  
               351          ll_sa_entry_unhash(sai, entry);
               352  
               353          spin_lock(&lli->lli_sa_lock);
               354          entry->se_stat = SA_ENTRY_DEST;
               355          if (likely(!ll_sa_entry_unlinked(entry)))
               356                  cfs_list_del_init(&entry->se_list);
               357          spin_unlock(&lli->lli_sa_lock);
               358  
               359          ll_sa_entry_put(sai, entry);
               360  }
            

            So I think that means we are in fact in the cfs_list_del_init() that is under the ll_sa_entry_unhash() on line 351, and not the later cfs_list_del_init() at line 356.

            So if the addresses can be trusted, everything seems to point to the stack I gave in the previous comment, and we're in the loop dropping entries from the sent list.


            morrone Christopher Morrone (Inactive) added a comment - edited

            Hit this Oops again. We got a crash dump this time!

            I found that we died on line 89 here:

            crash> gdb list *(0xd000000018395a50)
            0xd000000018395a50 is in ll_sai_unplug (/usr/src/kernels/2.6.32-348.1chaos.bgq62.ppc64/include/linux/list.h:89).
            84       * This is only for internal list manipulation where we know
            85       * the prev/next entries already!
            86       */
            87      static inline void __list_del(struct list_head * prev, struct list_head * next)
            88      {
            89              next->prev = prev;
            90              prev->next = next;
            91      }
            92
            93      /**
            

            Also, note that I see this on the console as part of the Oops output:

            2013-03-13 19:34:20 Unable to handle kernel paging request for data at address 0x00000008
            

            0x00000008 is probably the offset into the structure; NULL was passed to __list_del() in the "next" parameter.
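
            A quick user-space check (assuming the same two-pointer struct list_head layout as the kernel) shows why a NULL "next" faults at exactly offset 0x8:

            #include <stdio.h>
            #include <stddef.h>

            /* Same layout as the kernel's struct list_head: next first, prev
             * second, so prev sits at offset 8 on an LP64 target.  With
             * next == NULL, the store "next->prev = prev" in __list_del()
             * writes to address 0x8, matching the faulting address above. */
            struct list_head { struct list_head *next, *prev; };

            int main(void)
            {
                    printf("offsetof(struct list_head, prev) = %zu\n",
                           offsetof(struct list_head, prev));  /* prints 8 */
                    return 0;
            }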

            Matching the crash address to code and unwinding inlines, I believe this is the call stack where we are crashing:

            __list_del
            cfs_list_del_init
            ll_sa_entry_unhash
            do_sai_entry_fini
            ll_sa_entry_fini
            ll_sai_unplug
            do_statahead_enter
            ll_statahead_enter
            ll_revalidate_it
            ll_revalidate_nd
            do_lookup
            etc.
            

            morrone Christopher Morrone (Inactive) added a comment -

            Just got a report that a LAC node crashed on this yesterday. No crash dump.


    People

      Assignee: laisiyao Lai Siyao
      Reporter: morrone Christopher Morrone (Inactive)
      Votes: 0
      Watchers: 7
