  1. Lustre
  2. LU-2427

Hit "kernel BUG" when running on debug kernel during recovery

Details

    • Bug
    • Resolution: Fixed
    • Minor
    • None
    • None
    • None
    • kernel: 2.6.32-220.23.1.1chaos.ch5.x86_64.debug
      lustre: orion-2_3_49_54_1-55chaos + http://review.whamcloud.com/3355
    • 3
    • 3071

    Description

      I'm trying to run Lustre-Orion against a debug kernel on the MDS and hit this BUG twice yesterday. There are a couple of "[...] used greatest stack depth" messages, so I'm curious whether the stack was stomped on, causing the crash.

      zpool used greatest stack depth: 1552 bytes left
      Lustre: Lustre: Build Version: 2.0.59-llnl3-base-DEBUG--CHANGED-2.6.32-220.23.1.1chaos.ch5.x86_64.debug
      Lustre: MGS: Mounted grove-mds2/mgs
      mount.lustre used greatest stack depth: 1280 bytes left
      LustreError: 11-0: MGC172.20.5.2@o2ib500: Communicating with 0@lo, operation llog_origin_handle_create failed with -2
      LustreError: 20904:0:(mgc_request.c:248:do_config_log_add()) failed processing sptlrpc log: -2
      Lustre: 20909:0:(fld_index.c:354:fld_index_init()) srv-lstest-MDT0000: File "fld" doesn't support range lookup, using stub. DNE and FIDs on OST will not work with this backend
      ib0: no IPv6 routers present
      Lustre: lstest-MDT0000: Temporarily refusing client connection from 172.20.3.154@o2ib500
      Lustre: lstest-MDT0000: Temporarily refusing client connection from 172.20.3.191@o2ib500
      Lustre: lstest-MDT0000: Mounted grove-mds2/mdt0
      Lustre: lstest-MDT0000: Will be in recovery for at least 5:00, or until 256 clients reconnect.
      ------------[ cut here ]------------
      kernel BUG at /usr/src/kernels/2.6.32-220.23.1.1chaos.ch5.x86_64.debug/include/linux/scatterlist.h:65!
      invalid opcode: 0000 [#1] SMP 
      last sysfs file: /sys/module/ptlrpc/initstate
      CPU 20 
      Modules linked in: osp(U) mdt(U) mdd(U) lod(U) mgs(U) mgc(U) osd_zfs(U) lustre(U) lov(U) osc(U) mdc(U) fid(U) fld(U) ptlrpc(U) obdclass(U) lvfs(U) acpi_cpufreq freq_table mperf ko2iblnd(U) lnet(U) libcfs(U) ib_ipoib ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_addr ib_sa mlx4_ib ib_mad ib_core dm_mirror dm_region_hash dm_log dm_round_robin dm_multipath dm_mod vhost_net macvtap macvlan tun kvm_intel kvm zfs(P)(U) zcommon(P)(U) znvpair(P)(U) zavl(P)(U) zunicode(P)(U) spl(U) zlib_deflate ses enclosure sg sd_mod crc_t10dif mpt2sas scsi_transport_sas raid_class serio_raw i2c_i801 i2c_core iTCO_wdt iTCO_vendor_support ioatdma i7core_edac edac_core shpchp ipv6 nfs lockd fscache nfs_acl auth_rpcgss sunrpc mlx4_en mlx4_core igb dca [last unloaded: cpufreq_ondemand]
      
      Pid: 20891, comm: ll_mgs_02 Tainted: P        W  ----------------   2.6.32-220.23.1.1chaos.ch5.x86_64.debug #1 Supermicro X8DTH-i/6/iF/6F/X8DTH
      RIP: 0010:[<ffffffffa0693ca6>]  [<ffffffffa0693ca6>] kiblnd_setup_rd_iov+0x1f6/0x2f0 [ko2iblnd]
      RSP: 0018:ffff88178f87d960  EFLAGS: 00010293
      RAX: ffffea00a792e280 RBX: ffff882fdddbe408 RCX: 0000000000000000
      RDX: 00000000000020c0 RSI: 0000000087654321 RDI: ffff882fe0d30148
      RBP: ffff88178f87d9b0 R08: ffff882fdddbe408 R09: 0000000000000000
      R10: 0000000000000000 R11: 0000000000000000 R12: ffff882fe0d30148
      R13: ffffc9004ed47000 R14: 00000000000020c0 R15: 0000000000000000
      FS:  00007ffff7fdc700(0000) GS:ffff881895800000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
      CR2: 00000000006d3e70 CR3: 0000002ffa068000 CR4: 00000000000006e0
      DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
      Process ll_mgs_02 (pid: 20891, threadinfo ffff88178f87c000, task ffff88178f878ac0)
      Stack:
       ffffc9002b567748 ffff8817b7ee9818 ffff8817b5004000 00000001dddbc058
      <0> 00000000000020c0 ffff882fdddbc058 00000000000020c0 ffffc9002b567748
      <0> 0000000000000001 000501f4ac14043d ffff88178f87da50 ffffffffa069892a
      Call Trace:
       [<ffffffffa069892a>] kiblnd_send+0x59a/0x870 [ko2iblnd]
       [<ffffffffa062e359>] ? lnet_send+0x59/0x9f0 [lnet]
       [<ffffffffa062a14b>] lnet_ni_send+0x4b/0x110 [lnet]
       [<ffffffffa062e55b>] lnet_send+0x25b/0x9f0 [lnet]
       [<ffffffffa062f5bb>] LNetPut+0x2ab/0x670 [lnet]
       [<ffffffffa086a71e>] ptl_send_buf+0x18e/0x440 [ptlrpc]
       [<ffffffffa08875f0>] ? at_measured+0x1e0/0x320 [ptlrpc]
       [<ffffffffa08a2285>] ? null_authorize+0x75/0x110 [ptlrpc]
       [<ffffffffa086ac2f>] ptlrpc_send_reply+0x25f/0x770 [ptlrpc]
       [<ffffffffa08425e4>] target_send_reply_msg+0x54/0x160 [ptlrpc]
       [<ffffffffa0842a3e>] target_send_reply+0x34e/0x680 [ptlrpc]
       [<ffffffffa08868d3>] ? llog_origin_handle_read_header+0x193/0x520 [ptlrpc]
       [<ffffffffa0c8cd16>] mgs_handle+0xd6/0x1020 [mgs]
       [<ffffffffa0706a0f>] ? keys_fill+0x6f/0x1a0 [obdclass]
       [<ffffffffa08717f4>] ? lustre_msg_get_transno+0x54/0x90 [ptlrpc]
       [<ffffffffa087bc6c>] ptlrpc_server_handle_request+0x3fc/0xce0 [ptlrpc]
       [<ffffffffa059256e>] ? cfs_timer_arm+0xe/0x10 [libcfs]
       [<ffffffffa059ff09>] ? lc_watchdog_touch+0x79/0x110 [libcfs]
       [<ffffffffa0876e20>] ? ptlrpc_wait_event+0xb0/0x2b0 [ptlrpc]
       [<ffffffff810aeb6d>] ? trace_hardirqs_on+0xd/0x10
       [<ffffffff81055043>] ? __wake_up+0x53/0x70
       [<ffffffffa087df00>] ptlrpc_main+0x710/0x1190 [ptlrpc]
       [<ffffffff810aeb6d>] ? trace_hardirqs_on+0xd/0x10
       [<ffffffffa087d7f0>] ? ptlrpc_main+0x0/0x1190 [ptlrpc]
       [<ffffffff8100c20a>] child_rip+0xa/0x20
       [<ffffffff815231f0>] ? _spin_unlock_irq+0x30/0x40
       [<ffffffff8100bb50>] ? restore_args+0x0/0x30
       [<ffffffffa087d7f0>] ? ptlrpc_main+0x0/0x1190 [ptlrpc]
       [<ffffffff8100c200>] ? child_rip+0x0/0x20
      Code: 35 f2 01 00 00 00 04 00 e8 28 a8 f0 ff 48 c7 c7 60 2e 6b a0 c7 05 df f1 01 00 00 00 04 00 e8 02 e2 ef ff 0f 0b eb fe 0f 0b eb fe <0f> 0b 0f 1f 84 00 00 00 00 00 eb f6 48 c7 c7 20 2e 6b a0 48 c7 
      RIP  [<ffffffffa0693ca6>] kiblnd_setup_rd_iov+0x1f6/0x2f0 [ko2iblnd]
       RSP <ffff88178f87d960>
      
      crash> bt
      PID: 20891  TASK: ffff88178f878ac0  CPU: 20  COMMAND: "ll_mgs_02"
       #0 [ffff88178f87d620] machine_kexec at ffffffff81032ad0
       #1 [ffff88178f87d680] crash_kexec at ffffffff810cab52
       #2 [ffff88178f87d750] oops_end at ffffffff81524c20
       #3 [ffff88178f87d780] die at ffffffff8100f3bb
       #4 [ffff88178f87d7b0] do_trap at ffffffff81524334
       #5 [ffff88178f87d810] do_invalid_op at ffffffff8100cff5
       #6 [ffff88178f87d8b0] invalid_op at ffffffff8100bf9b
          [exception RIP: kiblnd_setup_rd_iov+502]
          RIP: ffffffffa0693ca6  RSP: ffff88178f87d960  RFLAGS: 00010293
          RAX: ffffea00a792e280  RBX: ffff882fdddbe408  RCX: 0000000000000000
          RDX: 00000000000020c0  RSI: 0000000087654321  RDI: ffff882fe0d30148
          RBP: ffff88178f87d9b0   R8: ffff882fdddbe408   R9: 0000000000000000
          R10: 0000000000000000  R11: 0000000000000000  R12: ffff882fe0d30148
          R13: ffffc9004ed47000  R14: 00000000000020c0  R15: 0000000000000000
          ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
       #7 [ffff88178f87d9b8] kiblnd_send at ffffffffa069892a [ko2iblnd]
       #8 [ffff88178f87da58] lnet_ni_send at ffffffffa062a14b [lnet]
       #9 [ffff88178f87da78] lnet_send at ffffffffa062e55b [lnet]
      #10 [ffff88178f87dae8] LNetPut at ffffffffa062f5bb [lnet]
      #11 [ffff88178f87db48] ptl_send_buf at ffffffffa086a71e [ptlrpc]
      #12 [ffff88178f87dbf8] ptlrpc_send_reply at ffffffffa086ac2f [ptlrpc]
      #13 [ffff88178f87dc78] target_send_reply_msg at ffffffffa08425e4 [ptlrpc]
      #14 [ffff88178f87dca8] target_send_reply at ffffffffa0842a3e [ptlrpc]
      #15 [ffff88178f87dd18] mgs_handle at ffffffffa0c8cd16 [mgs]
      #16 [ffff88178f87dda8] ptlrpc_server_handle_request at ffffffffa087bc6c [ptlrpc]
      #17 [ffff88178f87de98] ptlrpc_main at ffffffffa087df00 [ptlrpc]
      #18 [ffff88178f87df48] kernel_thread at ffffffff8100c20a
      

      From scatterlist.h:

      55 static inline void sg_assign_page(struct scatterlist *sg, struct page *page)
      56 {
      57         unsigned long page_link = sg->page_link & 0x3;
      58
      59         /*
      60          * In order for the low bit stealing approach to work, pages
      61          * must be aligned at a 32-bit boundary as a minimum.
      62          */
      63         BUG_ON((unsigned long) page & 0x03);
      64 #ifdef CONFIG_DEBUG_SG
      65         BUG_ON(sg->sg_magic != SG_MAGIC);
      66         BUG_ON(sg_is_chain(sg));
      67 #endif
      68         sg->page_link = page_link | (unsigned long) page;
      69 }
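      The BUG fires at scatterlist.h:65, which is the sg_magic check that is only compiled in with CONFIG_DEBUG_SG. That would explain why it shows up only on the debug kernel. RSI in the register dump holds 0x87654321, the value of SG_MAGIC, which is consistent with a comparison against that constant. In general this check fails when a scatterlist entry is used without first being initialized (for example via sg_init_table()), though the report does not confirm that this is the cause here. Below is a minimal user-space sketch of the three checks. It is not kernel code: struct model_sg, model_sg_init() and model_sg_assign_page() are hypothetical stand-ins made up for illustration, and the function returns the source line of the check that would have hit BUG_ON instead of crashing.

```c
#include <stdint.h>
#include <string.h>

/* Magic value the kernel stores in each entry under CONFIG_DEBUG_SG. */
#define SG_MAGIC 0x87654321UL

/* Hypothetical, simplified stand-in for struct scatterlist:
 * only the two fields that sg_assign_page() inspects. */
struct model_sg {
    unsigned long sg_magic;
    unsigned long page_link;
};

/* Simplified model of initializing one entry: zero it and stamp the
 * magic. (The real sg_init_table() also marks the last entry.) */
static void model_sg_init(struct model_sg *sg)
{
    memset(sg, 0, sizeof(*sg));
    sg->sg_magic = SG_MAGIC;
}

/* Model of sg_assign_page() with the debug checks enabled.
 * Returns 0 on success, otherwise the scatterlist.h line number
 * of the BUG_ON that would have fired. */
static int model_sg_assign_page(struct model_sg *sg, const void *page)
{
    /* Preserve the two low flag bits already stored in page_link. */
    unsigned long page_link = sg->page_link & 0x3;

    if ((uintptr_t)page & 0x03)     /* line 63: pointer not 4-byte aligned */
        return 63;
    if (sg->sg_magic != SG_MAGIC)   /* line 65: entry never initialized */
        return 65;
    if (sg->page_link & 0x01)       /* line 66: entry is a chain link */
        return 66;

    sg->page_link = page_link | (unsigned long)(uintptr_t)page;
    return 0;
}
```

      In this model, calling model_sg_assign_page() on a zero-filled entry returns 65, matching the line reported in the BUG message, while the same call after model_sg_init() returns 0.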
      
      

            People

              liang Liang Zhen (Inactive)
              prakash Prakash Surya (Inactive)