Details

    • Bug
    • Resolution: Fixed
    • Blocker
    • Lustre 2.4.0
    • Lustre 2.4.0
    • Sequoia LAC node, 2.3.54-2chaos (github.com/lustre/chaos)
    • 3
    • 5663

    Description

      We have companion nodes to Sequoia called "LAC nodes". LAC stands for "large application compile", or something like that. These nodes are NOT normal BG/Q architecture; instead they are regular large Power 7 architecture nodes. But otherwise they are similar to Sequoia I/O Nodes in that they are ppc64, use 64K pages, and run RHEL for the OS.

      seqlac2 crashed on 2012-11-19 16:18:40, hitting the following Oops:

      2012-11-19 16:18:40 Faulting instruction address: 0xd0000000139b6710
      2012-11-19 16:18:40 Oops: Kernel access of bad area, sig: 11 [#1]
      2012-11-19 16:18:40 SMP NR_CPUS=1024 NUMA pSeries
      2012-11-19 16:18:40 Modules linked in: xt_owner nf_conntrack_ipv4 nf_defrag_ipv4 xt_state nf_conntrack lmv(U) mgc(U) lustre(U) mdc(U) fid(U) fld(U) lov(U) osc(U) ptlrpc(U) obdclass(U) lvfs(U) nfs fscache lockd nfs_acl auth_rpcgss ko2iblnd(U) lnet(U) sha512_generic sha256_generic libcfs(U) sunrpc ipt_LOG xt_multiport iptable_filter ip_tables ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_addr ipv6 ib_sa uinput raid1 ses enclosure sg mlx4_ib ib_mad ib_core mlx4_en mlx4_core e1000e ehea ext4 jbd2 mbcache raid456 async_pq async_xor xor async_raid6_recov raid6_pq async_memcpy async_tx sd_mod crc_t10dif ipr dm_mirror dm_region_hash dm_log dm_mod [last unloaded: scsi_wait_scan]
      2012-11-19 16:18:40 NIP: d0000000139b6710 LR: d0000000139b6700 CTR: c0000000005be960
      2012-11-19 16:18:40 REGS: c000000f54aab080 TRAP: 0300   Not tainted  (2.6.32-279.11.1.1chaos.bgq62.ppc64)
      2012-11-19 16:18:40 MSR: 8000000000009032 <EE,ME,IR,DR>  CR: 24008442  XER: 20000000
      2012-11-19 16:18:40 DAR: 0000000000000008, DSISR: 0000000042000000
      2012-11-19 16:18:40 TASK = c000000f539312d0[10079] 'ls' THREAD: c000000f54aa8000 CPU: 32
      2012-11-19 16:18:40 GPR00: c00000025f6e5d90 c000000f54aab300 d000000013a0d878 c000000ecf32d120 
      2012-11-19 16:18:40 GPR04: 5a5a5a5a5a5a5a5a 0000000000000004 c0000005ff4e506c 0000000000000002 
      2012-11-19 16:18:40 GPR08: c000000000ff7500 0000000000000000 c000000001099004 0000000000000000 
      2012-11-19 16:18:40 GPR12: d0000000139d2610 c000000000ff7500 0000000010022ff8 00000000000000e8 
      2012-11-19 16:18:40 GPR16: c000000739a8f200 0000000000000000 0000000000000001 0000000000000003 
      2012-11-19 16:18:40 GPR20: c000000ecf32cee0 0000000000000000 c00000025f6e5d80 0000000000000000 
      2012-11-19 16:18:40 GPR24: c000000f45ae0000 c00000025f6e5d80 d00000000cecfdc0 d00000000cecfdc4 
      2012-11-19 16:18:40 GPR28: 0000000000003300 c000000ecf32cd80 d000000013a0aca8 c000000f54aab300 
      2012-11-19 16:18:40 NIP [d0000000139b6710] .ll_sai_unplug+0x210/0x860 [lustre]
      2012-11-19 16:18:40 LR [d0000000139b6700] .ll_sai_unplug+0x200/0x860 [lustre]
      2012-11-19 16:18:40 Call Trace:
      2012-11-19 16:18:40 [c000000f54aab300] [d0000000139b6700] .ll_sai_unplug+0x200/0x860 [lustre] (unreliable)
      2012-11-19 16:18:40 [c000000f54aab430] [d0000000139b7bc8] .do_statahead_enter+0x218/0x22b0 [lustre]
      2012-11-19 16:18:40 [c000000f54aab5f0] [d000000013933dec] .ll_revalidate_it+0x82c/0x21f0 [lustre]
      2012-11-19 16:18:40 [c000000f54aab750] [d00000001393595c] .ll_revalidate_nd+0x1ac/0x560 [lustre]
      2012-11-19 16:18:40 [c000000f54aab810] [c0000000001d55b0] .do_lookup+0xa0/0x2d0
      2012-11-19 16:18:40 [c000000f54aab8e0] [c0000000001d8c50] .__link_path_walk+0x9b0/0x15a0
      2012-11-19 16:18:40 [c000000f54aaba10] [c0000000001d9bf8] .path_walk+0x98/0x180
      2012-11-19 16:18:40 [c000000f54aabab0] [c0000000001d9eec] .do_path_lookup+0x7c/0xf0
      2012-11-19 16:18:40 [c000000f54aabb40] [c0000000001dac50] .user_path_at+0x60/0xb0
      2012-11-19 16:18:40 [c000000f54aabc90] [c0000000001ce144] .vfs_fstatat+0x44/0xb0
      2012-11-19 16:18:40 [c000000f54aabd30] [c0000000001ce314] .SyS_newlstat+0x24/0x50
      2012-11-19 16:18:40 [c000000f54aabe30] [c000000000008564] syscall_exit+0x0/0x40
      2012-11-19 16:18:40 Instruction dump:
      2012-11-19 16:18:40 81f90058 ea1d0000 79ef06e0 3a10ff08 39ef00e8 79e31764 7c7d1a14 4801bf15 
      2012-11-19 16:18:40 e8410028 e9390018 e9790010 38190010 <f92b0008> f9690000 f8190018 f8190010 
      

      Note that this node was hitting many order:1 page allocation failures that day, with the most recent one less than a minute before the Oops in Lustre.

      The attached file "console.seqlac2.txt" contains a more complete console log.

      Using gdb to look up the line number of ll_sai_unplug+0x200 revealed that it was at lustre/llite/statahead.c:133, which puts it in ll_sa_entry_unhash() at the first cfs_spin_lock(). So if functions were not inlined, the end of the backtrace would probably look like this:

      ll_sa_entry_unhash
      do_sai_entry_fini
      ll_sa_entry_fini
      ll_sai_unplug
      do_statahead_enter
      ll_statahead_enter
      ll_revalidate_it
      ll_revalidate_nd
      

      Because of the inlining, I cannot be entirely certain which of the three do_sai_entry_fini() calls in ll_sa_entry_fini() this was.
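
      For reference, the offset-to-line lookup described above can be done with gdb's `list *(symbol+offset)` idiom (a sketch; the module path and debuginfo location vary by installation):

      ```
      $ gdb /path/to/lustre.ko          # module built with debuginfo
      (gdb) list *(ll_sai_unplug+0x200)
      ```

      gdb then prints the source file and line corresponding to that offset (here, lustre/llite/statahead.c:133) along with the surrounding lines.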

      Attachments

        Issue Links

          Activity

            [LU-2388] Oops in ll_sai_unplug
            pjones Peter Jones added a comment -

            Landed for 2.4

            laisiyao Lai Siyao added a comment - patch is on http://review.whamcloud.com/#change,5842
            laisiyao Lai Siyao added a comment -

            The culprit should be in ll_sa_entry_fini():

                    /* drop old entry from sent list */
                    cfs_list_for_each_entry_safe(pos, next, &sai->sai_entries_sent,
                                                 se_list) {
                            if (is_omitted_entry(sai, pos->se_index))
                                    do_sai_entry_fini(sai, pos);
                            else
                                    break;
                    }
            
                    /* drop old entry from stated list */
                    cfs_list_for_each_entry_safe(pos, next, &sai->sai_entries_stated,
                                                 se_list) {
                            if (is_omitted_entry(sai, pos->se_index))
                                    do_sai_entry_fini(sai, pos);
                            else
                                    break;
                    }
            

            Iterating these two lists is not protected by ll_sai_lock. Although no entry will be removed concurrently (this is the only place entries are removed), an entry might be unlinked from the list and left dangling. Thus, while iterating, next and next->next may end up being the same entry, and one entry may be removed twice, which may cause the crash above.

            I'll commit a fix later.
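
            The hazard described above can be reproduced in a few lines of userspace C. This is an illustrative sketch, not Lustre code: the list helpers below mimic the kernel/cfs list API, and the 0x5a poison value is chosen to echo the 5a5a5a5a5a5a5a5a pattern visible in GPR04 of the oops, which looks like freed, poisoned memory:

            ```c
            #include <stdio.h>

            /* minimal doubly-linked list mimicking the kernel list API */
            struct list_head { struct list_head *next, *prev; };

            #define POISON ((struct list_head *)0x5a5a5a5a)

            static void list_init(struct list_head *h) { h->next = h->prev = h; }

            static void list_add_tail(struct list_head *n, struct list_head *h)
            {
                    n->prev = h->prev;
                    n->next = h;
                    h->prev->next = n;
                    h->prev = n;
            }

            static void list_del(struct list_head *e)
            {
                    e->prev->next = e->next;
                    e->next->prev = e->prev;
                    e->next = POISON;   /* freed-memory poison */
                    e->prev = POISON;
            }

            int main(void)
            {
                    struct list_head head, a, b, c;

                    list_init(&head);
                    list_add_tail(&a, &head);
                    list_add_tail(&b, &head);
                    list_add_tail(&c, &head);

                    /* "safe" iteration in context 1: cache next before deleting pos */
                    struct list_head *pos  = head.next;  /* &a */
                    struct list_head *next = pos->next;  /* &b cached */

                    /* context 2, not holding ll_sai_lock, unlinks b meanwhile */
                    list_del(&b);

                    /* context 1 advances into the already-unlinked entry */
                    pos = next;
                    printf("pos is the unlinked entry: %d\n", pos == &b);
                    printf("pos->next is poison: %d\n", pos->next == POISON);
                    return 0;
            }
            ```

            Holding ll_sai_lock across both walks (or otherwise serializing them against entry removal) closes this window; presumably that is the shape of the fix in the change linked above.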


            Yes, there is an ll_sa_13026 thread. There is also an ll_agl_13026 thread. Not too much in the bt:

            crash> bt 13027 -s -x
            PID: 13027  TASK: c000000d089271d0  CPU: 5   COMMAND: "ll_sa_13026"
             #0 [c000000053857a90] .schedule+0x3f8 at c0000000005bb0a8
             #1 [c000000053857d90] .cfs_waitq_wait+0x10 at d00000000c962020 [libcfs]
             #2 [c000000053857e00] .ll_statahead_thread+0x5ec at d00000001839dc1c [lustre]
             #3 [c000000053857f90] .kernel_thread+0x54 at c000000000032fd4
            crash> bt 13028 -s -x
            PID: 13028  TASK: c0000000536fda50  CPU: 30  COMMAND: "ll_agl_13026"
             #0 [c0000000387dbae0] .schedule+0x3f8 at c0000000005bb0a8
             #1 [c0000000387dbde0] .cfs_waitq_wait+0x10 at d00000000c962020 [libcfs]
             #2 [c0000000387dbe50] .ll_agl_thread+0x550 at d00000001839f130 [lustre]
             #3 [c0000000387dbf90] .kernel_thread+0x54 at c000000000032fd4
            

            The ll_sa_13026 thread appears to be in this l_wait_event:

            keep_it:
                                    l_wait_event(thread->t_ctl_waitq,
                                                 !sa_sent_full(sai) ||
                                                 !sa_received_empty(sai) ||
                                                 !agl_list_empty(sai) ||
                                                 !thread_is_running(thread),
                                                 &lwi);
            
            interpret_it:
            

            And ll_agl_13026 is in this l_wait_event:

                    while (1) {
                            l_wait_event(thread->t_ctl_waitq,
                                         !agl_list_empty(sai) ||
                                         !thread_is_running(thread),
                                         &lwi);
            
                            if (!thread_is_running(thread))
            
            morrone Christopher Morrone (Inactive) added a comment -
            laisiyao Lai Siyao added a comment -

            Chris, could you help analyze the crash dump to find whether an 'll_sa_<ls pid>' thread exists, and if so, use `bt <statahead_thread_pid>` to see its backtrace?


            People

              laisiyao Lai Siyao
              morrone Christopher Morrone (Inactive)
              Votes: 0
              Watchers: 7
