Details

    • Bug
    • Resolution: Cannot Reproduce
    • Major
    • None
    • Lustre 2.1.2
    • None
    • 3
    • 4211

    Description

      A production client crashed while running a user job, with the following LBUG. The log dump is quite large, so I have attached it as a file.

      LustreError: 3260:0:(events.c:419:ptlrpc_master_callback()) ASSERTION(callback == request_out_callback || callback == reply_in_callback || callback == client_bulk_callback || callback == request_in_callback || callback == reply_out_callback || callback == server_bulk_callback) failed
      LustreError: 3260:0:(events.c:419:ptlrpc_master_callback()) LBUG
      Pid: 3260, comm: kiblnd_sd_07

      Call Trace:
      [<ffffffffa044c855>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
      [<ffffffffa044ce95>] lbug_with_loc+0x75/0xe0 [libcfs]
      [<ffffffffa0457d86>] libcfs_assertion_failed+0x66/0x70 [libcfs]
      [<ffffffffa06473c6>] ptlrpc_master_callback+0xb6/0xc0 [ptlrpc]
      [<ffffffffa04c0a8c>] lnet_enq_event_locked+0x6c/0xc0 [lnet]
      [<ffffffffa04c0b7c>] lnet_finalize+0x9c/0x280 [lnet]
      [<ffffffffa07523ca>] kiblnd_recv+0x10a/0x580 [ko2iblnd]
      [<ffffffff8100bc0e>] ? apic_timer_interrupt+0xe/0x20
      [<ffffffffa04c4188>] lnet_ni_recv+0xd8/0x350 [lnet]
      [<ffffffffa04c44e6>] lnet_recv_put+0xe6/0x120 [lnet]
      [<ffffffffa04cae1f>] lnet_parse+0x135f/0x1a80 [lnet]
      [<ffffffffa0752afb>] kiblnd_handle_rx+0x2bb/0x5f0 [ko2iblnd]
      [<ffffffff8104da6d>] ? check_preempt_curr+0x6d/0x90
      [<ffffffff8105e89c>] ? try_to_wake_up+0x24c/0x3e0
      [<ffffffffa0753723>] kiblnd_rx_complete+0x2a3/0x3e0 [ko2iblnd]
      [<ffffffff8105ea42>] ? default_wake_function+0x12/0x20
      [<ffffffff8104cab9>] ? __wake_up_common+0x59/0x90
      [<ffffffffa07538c2>] kiblnd_complete+0x62/0xe0 [ko2iblnd]
      [<ffffffffa0753c3d>] kiblnd_scheduler+0x2fd/0x770 [ko2iblnd]
      [<ffffffff8105ea30>] ? default_wake_function+0x0/0x20
      [<ffffffffa0753940>] ? kiblnd_scheduler+0x0/0x770 [ko2iblnd]
      [<ffffffff8100c14a>] child_rip+0xa/0x20
      [<ffffffffa0753940>] ? kiblnd_scheduler+0x0/0x770 [ko2iblnd]
      [<ffffffffa0753940>] ? kiblnd_scheduler+0x0/0x770 [ko2iblnd]
      [<ffffffff8100c140>] ? child_rip+0x0/0x20

      LustreError: dumping log to /tmp/lustre-log.1344612108.3260
      Aug 10 16:21:47 sand-1-12 kernel: LustreError: 3260:0:(events.c:419:ptlrpc_master_callback()) ASSERTION(callback == request_out_callback || callback == reply_in_callback || callback == client_bulk_callback || callback == request_in_callback || callback == reply_out_callback || callback == server_bulk_callback) failed
      Aug 10 16:21:47 sand-1-12 kernel: LustreError: 3260:0:(events.c:419:ptlrpc_master_callback()) LBUG
      Aug 10 16:21:47 sand-1-12 kernel: Pid: 3260, comm: kiblnd_sd_07
      Aug 10 16:21:47 sand-1-12 kernel:
      Aug 10 16:21:47 sand-1-12 kernel: Call Trace:
      Aug 10 16:21:47 sand-1-12 kernel: [<ffffffffa044c855>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
      Aug 10 16:21:47 sand-1-12 kernel: [<ffffffffa044ce95>] lbug_with_loc+0x75/0xe0 [libcfs]
      Aug 10 16:21:47 sand-1-12 kernel: [<ffffffffa0457d86>] libcfs_assertion_failed+0x66/0x70 [libcfs]
      Aug 10 16:21:47 sand-1-12 kernel: [<ffffffffa06473c6>] ptlrpc_master_callback+0xb6/0xc0 [ptlrpc]
      Aug 10 16:21:47 sand-1-12 kernel: [<ffffffffa04c0a8c>] lnet_enq_event_locked+0x6c/0xc0 [lnet]
      Aug 10 16:21:47 sand-1-12 kernel: [<ffffffffa04c0b7c>] lnet_finalize+0x9c/0x280 [lnet]
      Aug 10 16:21:47 sand-1-12 kernel: [<ffffffffa07523ca>] kiblnd_recv+0x10a/0x580 [ko2iblnd]
      Aug 10 16:21:47 sand-1-12 kernel: [<ffffffff8100bc0e>] ? apic_timer_interrupt+0xe/0x20
      Aug 10 16:21:47 sand-1-12 kernel: [<ffffffffa04c4188>] lnet_ni_recv+0xd8/0x350 [lnet]
      Aug 10 16:21:47 sand-1-12 kernel: [<ffffffffa04c44e6>] lnet_recv_put+0xe6/0x120 [lnet]
      Aug 10 16:21:47 sand-1-12 kernel: [<ffffffffa0753723>] kiblnd_rx_complete+0x2a3/0x3e0 [ko2iblnd]
      Aug 10 16:21:48 sand-1-12 kernel: [<ffffffff8105ea42>] ? default_wake_function+0x12/0x20
      Aug 10 16:21:48 sand-1-12 kernel: [<ffffffff8104cab9>] ? __wake_up_common+0x59/0x90
      Aug 10 16:21:48 sand-1-12 kernel: [<ffffffffa07538c2>] kiblnd_complete+0x62/0xe0 [ko2iblnd]
      Aug 10 16:21:48 sand-1-12 kernel: [<ffffffffa0753c3d>] kiblnd_scheduler+0x2fd/0x770 [ko2iblnd]
      Aug 10 16:21:48 sand-1-12 kernel: [<ffffffff8105ea30>] ? default_wake_function+0x0/0x20
      Aug 10 16:21:48 sand-1-12 kernel: [<ffffffffa0753940>] ? kiblnd_scheduler+0x0/0x770 [ko2iblnd]
      Aug 10 16:21:48 sand-1-12 kernel: [<ffffffff8100c14a>] child_rip+0xa/0x20
      Aug 10 16:21:48 sand-1-12 kernel: [<ffffffffa0753940>] ? kiblnd_scheduler+0x0/0x770 [ko2iblnd]
      Aug 10 16:21:48 sand-1-12 kernel: [<ffffffffa0753940>] ? kiblnd_scheduler+0x0/0x770 [ko2iblnd]
      Aug 10 16:21:48 sand-1-12 kernel: [<ffffffff8100c140>] ? child_rip+0x0/0x20
      Aug 10 16:21:48 sand-1-12 kernel:
      Aug 10 16:21:48 sand-1-12 kernel: LustreError: dumping log to /tmp/lustre-log.1344612108.3260
      BUG: soft lockup - CPU#0 stuck for 67s! [kiblnd_sd_04:3257]
      Modules linked in: lmv(U) mgc(U) lustre(U) lov(U) osc(U) lquota(U) mdc(U) fid(U) fld(U) ko2iblnd(U) ptlrpc(U) obdclass(U) lnet(U) lvfs(U) libcfs(U) nfsd exportfs acpi_cpufreq
      BUG: soft lockup - CPU#1 stuck for 67s! [kiblnd_sd_08:3261]
      801 i2c_core iTCO_wdt iTCO_vendor_support shpchp ioatdma ipv6 sd_mod crc_t10dif ahci igb dca nfs lockd fscache nfs_acl auth_rpcgss sunrpc [last unloaded: scsi_wait_scan]
      CPU 1
      801 i2c_core iTCO_wdt iTCO_vendor_support shpchp ioatdma ipv6 sd_mod crc_t10dif ahci igb dca nfs lockd fscache nfs_acl auth_rpcgss sunrpc [last unloaded: scsi_wait_scan]

      Pid: 3261, comm: kiblnd_sd_08 Tainted: G W ---------------- 2.6.32-220.23.1.el6.x86_64 #1 Dell Inc. PowerEdge C6220/0WTH3T
      RIP: 0010:[<ffffffff814efb41>] [<ffffffff814efb41>] _spin_lock+0x21/0x30
      RSP: 0018:ffff88086a909be0 EFLAGS: 00000293
      RAX: 000000000000081f RBX: ffff88086a909be0 RCX: ffff8807bb5f14e0
      RDX: 000000000000081d RSI: 0000000000000050 RDI: ffffffffa04df340
      RBP: ffffffff8100bc0e R08: 0000000000000246 R09: 0000000000000012
      R10: 0000000000000000 R11: 0000000000000400 R12: 0000000000000000
      R13: ffff8806dcf599d0 R14: ffff88086a909cac R15: ffff88086a909ca8
      FS: 00007febefe5d700(0000) GS:ffff880044620000(0000) knlGS:0000000000000000
      CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
      CR2: 00007fba30b90008 CR3: 0000000952c82000 CR4: 00000000000406e0
      DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
      Process kiblnd_sd_08 (pid: 3261, threadinfo ffff88086a908000, task ffff88086a85aa80)
      Stack:
      ffff88086a909ce0 ffffffffa04c9ee6 ffff88086a909d00 0000000300000002
      <0> ffff88086a909dd8 0000000000000400 0000000000000001 0000000000000000
      <0> 0000000000000002 ffff880044675fe8 ffff880044676018 ffff880044675fe8

            People

              wc-triage WC Triage
              wjt27 Wojciech Turek