Uploaded image for project: 'Lustre'
  1. Lustre
  2. LU-7574

ASSERTION( crpc != ((void *)0) && rpc == crpc->crp_rpc ) failed

Details

    • Bug
    • Resolution: Unresolved
    • Minor
    • None
    • Lustre 2.8.0
    • 3
    • 9223372036854775807

    Description

      LBUG while using lst:

      Lustre: Lustre: Build Version: 2.7.61-gb399b36-CHANGED-2.6.32-504.16.2.el6.x86_64
      LNet: Added LNI 10.10.6.1@o2ib [54/1728/0/180]
      LustreError: 4003:0:(brw_test.c:363:brw_server_rpc_done()) Bulk transfer to 12345-10.10.7.1@o2ib has failed: -108
      LustreError: 4008:0:(brw_test.c:318:brw_client_done_rpc()) BRW RPC to 12345-10.10.7.5@o2ib failed with -4
      LNet: 3969:0:(o2iblnd_cb.c:395:kiblnd_handle_rx()) PUT_NACK from 10.10.3.12@o2ib
      LustreError: 4003:0:(brw_test.c:363:brw_server_rpc_done()) Skipped 1521 previous similar messages
      LustreError: 4013:0:(conrpc.c:58:lstcon_rpc_done()) ASSERTION( crpc != ((void *)0) && rpc == crpc->crp_rpc ) failed: 
      LustreError: 4013:0:(conrpc.c:58:lstcon_rpc_done()) LBUG
      Pid: 4013, comm: lst_t_03_02
      
      Call Trace:
       [<ffffffffa0695875>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
       [<ffffffffa0695e77>] lbug_with_loc+0x47/0xb0 [libcfs]
       [<ffffffffa0d18664>] lstcon_rpc_done+0x124/0x190 [lnet_selftest]
       [<ffffffffa0d28d85>] srpc_client_rpc_done+0xd5/0x300 [lnet_selftest]
       [<ffffffff81529e1e>] ? thread_return+0x4e/0x7d0
       [<ffffffffa0d29708>] srpc_send_rpc+0x408/0x850 [lnet_selftest]
       [<ffffffffa0d1dfb3>] swi_wi_action+0x13/0x20 [lnet_selftest]
       [<ffffffffa06abcf2>] cfs_wi_scheduler+0x132/0x460 [libcfs]
       [<ffffffff8109ebb0>] ? autoremove_wake_function+0x0/0x40
       [<ffffffffa06abbc0>] ? cfs_wi_scheduler+0x0/0x460 [libcfs]
       [<ffffffff8109e71e>] kthread+0x9e/0xc0
       [<ffffffff8100c20a>] child_rip+0xa/0x20
       [<ffffffff8109e680>] ? kthread+0x0/0xc0
       [<ffffffff8100c200>] ? child_rip+0x0/0x20
      
      LustreError: dumping log to /tmp/lustre-log.1450358264.4013
      ------------[ cut here ]------------
      WARNING: at lib/list_debug.c:30 __list_add+0x8f/0xa0() (Tainted: P           ---------------   )
      Hardware name: S5500WB
      list_add corruption. prev->next should be next (ffff88063dfb4d90), but was ffff88064d7fee68. (prev=ffff88064d7fee68).
      Modules linked in: lnet_selftest(U) lustre(U) lov(U) mdc(U) fid(U) lmv(U) fld(U) ko2iblnd(U) ptlrpc(U) obdclass(U) lnet(U) sha512_generic crc32c_intel libcfs(U) autofs4 nfs lockd fscache auth_rpcgss nfs_acl sunrpc cpufreq_ondemand acpi_cpufreq freq_table mperf ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm vhost_net macvtap macvlan tun kvm_intel kvm zfs(P)(U) zcommon(P)(U) znvpair(P)(U) zavl(P)(U) zunicode(P)(U) spl(U) zlib_deflate ipmi_devintf ipmi_si ipmi_msghandler microcode iTCO_wdt iTCO_vendor_support sg i2c_i801 lpc_ich mfd_core ioatdma i7core_edac edac_core igb dca i2c_algo_bit i2c_core ptp pps_core ext4 jbd2 mbcache sd_mod crc_t10dif ahci mlx4_ib ib_sa ib_mad ib_core ib_addr ipv6 mlx4_core dm_mirror dm_region_hash dm_log dm_mod [last unloaded: scsi_wait_scan]
      Pid: 4014, comm: st_timer Tainted: P           ---------------    2.6.32-504.16.2.el6.x86_64 #1
      Call Trace:
       [<ffffffff81074e27>] ? warn_slowpath_common+0x87/0xc0
       [<ffffffff81074f16>] ? warn_slowpath_fmt+0x46/0x50
       [<ffffffffa0d1e9ef>] ? sfw_create_rpc+0xef/0x320 [lnet_selftest]
       [<ffffffff8129ed7f>] ? __list_add+0x8f/0xa0
       [<ffffffffa0d16ecc>] ? lstcon_rpc_trans_addreq+0x2c/0x40 [lnet_selftest]
       [<ffffffffa0d183b6>] ? lstcon_rpc_pinger+0x3b6/0x540 [lnet_selftest]
       [<ffffffffa0d18540>] ? lstcon_rpc_done+0x0/0x190 [lnet_selftest]
       [<ffffffffa0d243ec>] ? stt_timer_main+0xcc/0x1d0 [lnet_selftest]
       [<ffffffff8109ebb0>] ? autoremove_wake_function+0x0/0x40
       [<ffffffffa0d24320>] ? stt_timer_main+0x0/0x1d0 [lnet_selftest]
       [<ffffffff8109e71e>] ? kthread+0x9e/0xc0
       [<ffffffff8100c20a>] ? child_rip+0xa/0x20
       [<ffffffff8109e680>] ? kthread+0x0/0xc0
       [<ffffffff8100c200>] ? child_rip+0x0/0x20
      ---[ end trace 3489aa37ec955e70 ]---
      ------------[ cut here ]------------
      WARNING: at lib/list_debug.c:30 __list_add+0x8f/0xa0() (Tainted: P        W  ---------------   )
      Hardware name: S5500WB
      list_add corruption. prev->next should be next (ffff88063dfb4d90), but was ffff88064d7fee68. (prev=ffff88064d7fee68).
      Modules linked in: lnet_selftest(U) lustre(U) lov(U) mdc(U) fid(U) lmv(U) fld(U) ko2iblnd(U) ptlrpc(U) obdclass(U) lnet(U) sha512_generic crc32c_intel libcfs(U) autofs4 nfs lockd fscache auth_rpcgss nfs_acl sunrpc cpufreq_ondemand acpi_cpufreq freq_table mperf ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm vhost_net macvtap macvlan tun kvm_intel kvm zfs(P)(U) zcommon(P)(U) znvpair(P)(U) zavl(P)(U) zunicode(P)(U) spl(U) zlib_deflate ipmi_devintf ipmi_si ipmi_msghandler microcode iTCO_wdt iTCO_vendor_support sg i2c_i801 lpc_ich mfd_core ioatdma i7core_edac edac_core igb dca i2c_algo_bit i2c_core ptp pps_core ext4 jbd2 mbcache sd_mod crc_t10dif ahci mlx4_ib ib_sa ib_mad ib_core ib_addr ipv6 mlx4_core dm_mirror dm_region_hash dm_log dm_mod [last unloaded: scsi_wait_scan]
      Pid: 4014, comm: st_timer Tainted: P        W  ---------------    2.6.32-504.16.2.el6.x86_64 #1
      Call Trace:
       [<ffffffff81074e27>] ? warn_slowpath_common+0x87/0xc0
       [<ffffffff81074f16>] ? warn_slowpath_fmt+0x46/0x50
       [<ffffffffa0d1e9ef>] ? sfw_create_rpc+0xef/0x320 [lnet_selftest]
       [<ffffffff8129ed7f>] ? __list_add+0x8f/0xa0
       [<ffffffffa0d16ecc>] ? lstcon_rpc_trans_addreq+0x2c/0x40 [lnet_selftest]
       [<ffffffffa0d183b6>] ? lstcon_rpc_pinger+0x3b6/0x540 [lnet_selftest]
       [<ffffffffa0d18540>] ? lstcon_rpc_done+0x0/0x190 [lnet_selftest]
       [<ffffffffa0d243ec>] ? stt_timer_main+0xcc/0x1d0 [lnet_selftest]
       [<ffffffff8109ebb0>] ? autoremove_wake_function+0x0/0x40
       [<ffffffffa0d24320>] ? stt_timer_main+0x0/0x1d0 [lnet_selftest]
       [<ffffffff8109e71e>] ? kthread+0x9e/0xc0
       [<ffffffff8100c20a>] ? child_rip+0xa/0x20
       [<ffffffff8109e680>] ? kthread+0x0/0xc0
       [<ffffffff8100c200>] ? child_rip+0x0/0x20
      ---[ end trace 3489aa37ec955e71 ]---
      LustreError: 4130:0:(conrpc.c:584:lstcon_rpc_trans_destroy()) ASSERTION( atomic_read(&trans->tas_remaining) == 0 ) failed: 
      LustreError: 4130:0:(conrpc.c:584:lstcon_rpc_trans_destroy()) LBUG
      Pid: 4130, comm: lst
      
      Call Trace:
       [<ffffffffa0695875>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
       [<ffffffffa0695e77>] lbug_with_loc+0x47/0xb0 [libcfs]
       [<ffffffffa0d18fbd>] lstcon_rpc_trans_destroy+0x2fd/0x300 [lnet_selftest]
       [<ffffffffa0d19b9e>] lstcon_rpc_pinger_stop+0x4e/0xb0 [lnet_selftest]
       [<ffffffffa0d122e1>] lstcon_session_end+0x51/0x230 [lnet_selftest]
       [<ffffffffa0d1d6e5>] lstcon_ioctl_entry+0xbb5/0x1470 [lnet_selftest]
       [<ffffffffa069ee77>] libcfs_ioctl+0x127/0x830 [libcfs]
       [<ffffffff811a1458>] ? do_filp_open+0x798/0xd20
       [<ffffffffa069c262>] libcfs_ioctl+0xa2/0x1a0 [libcfs]
       [<ffffffff811a3a42>] vfs_ioctl+0x22/0xa0
       [<ffffffff811a3be4>] do_vfs_ioctl+0x84/0x580
       [<ffffffff811a4161>] sys_ioctl+0x81/0xa0
       [<ffffffff810e5b6e>] ? __audit_syscall_exit+0x25e/0x290
       [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
      
      LustreError: dumping log to /tmp/lustre-log.1450358326.4130
      
      

      Attachments

        Activity

          [LU-7574] ASSERTION( crpc != ((void *)0) && rpc == crpc->crp_rpc ) failed

          This is a just a script that runs LNet selftest with different parameters:

          run_test()
          {
                  local test=$1
                  local concurrency=$2
                  local distribute=$3
                  local delay=$4
          
                  lst new_session run_test
                  lst add_group server $SERVER
                  lst add_group client $CLIENT
                  lst add_batch main_test
                  lst add_test  --batch main_test --from client --to server --distribute $distribute --concurrency $concurrency $test
                  lst run main_test
                  echo "Running test $test with concurrency:$concurrency distribute:$distribute delay:$delay"
                  lst stat --delay $delay server &
                  sleep $((delay + 2))
                  kill $!
                  lst end_session
          }
          

          Each test is stopped after 60 seconds, parameters changed and started back up. I've seen the LBUG several times now. This is unrelated to the shared key crypto.

          jfilizetti Jeremy Filizetti added a comment - This is a just a script that runs LNet selftest with different parameters: run_test() { local test=$1 local concurrency=$2 local distribute=$3 local delay=$4 lst new_session run_test lst add_group server $SERVER lst add_group client $CLIENT lst add_batch main_test lst add_test --batch main_test --from client --to server --distribute $distribute --concurrency $concurrency $test lst run main_test echo "Running test $test with concurrency:$concurrency distribute:$distribute delay:$delay" lst stat --delay $delay server & sleep $((delay + 2)) kill $! lst end_session } Each test is stopped after 60 seconds, parameters changed and started back up. I've seen the LBUG several times now. This is unrelated to the shared key crypto.

          What steps did you take before this problem was hit? Is this just the basic lnet-selftest.sh running, or is something else? It looks like there was a communication failure with the peer during the test, was there a network problem, or the peer stopped during the test?

          It shouldn't matter if Shared Key crypto was enabled, since LNet Selftest runs below the PtlRPC layer, but just in case was SK enabled during this test?

          adilger Andreas Dilger added a comment - What steps did you take before this problem was hit? Is this just the basic lnet-selftest.sh running, or is something else? It looks like there was a communication failure with the peer during the test, was there a network problem, or the peer stopped during the test? It shouldn't matter if Shared Key crypto was enabled, since LNet Selftest runs below the PtlRPC layer, but just in case was SK enabled during this test?

          People

            ashehata Amir Shehata (Inactive)
            jfilizetti Jeremy Filizetti
            Votes:
            0 Vote for this issue
            Watchers:
            3 Start watching this issue

            Dates

              Created:
              Updated: