  Lustre / LU-13760

client hit NMI watchdog: BUG: soft lockup

    Details

    • Type: Bug
    • Status: Open
    • Priority: Minor
    • Resolution: Unresolved
    • Affects Version/s: Lustre 2.14.0
    • Fix Version/s: Lustre 2.14.0
    • Labels:
    • Environment:
      lustre-master-ib #437. version=2.13.54_118_g2e813f3
    • Severity:
      3

      Description

      Two clients hit the following error:

      [1669499.346718] NMI watchdog: BUG: soft lockup - CPU#29 stuck for 22s! [ldlm_lock_repla:132092]
      [1669499.356235] Modules linked in: mgc(OE) lustre(OE) lmv(OE) mdc(OE) fid(OE) osc(OE) lov(OE) fld(OE) ko2iblnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) libcfs(OE) rpcsec_gss_krb5 nfsv4 dns_resolver nfs lockd grace fscache rdma_ucm(OE) ib_ucm(OE) rdma_cm(OE) iw_cm(OE) ib_ipoib(OE) ib_cm(OE) ib_umad(OE) mlx5_ib(OE) ib_uverbs(OE) mlx5_core(OE) mlxfw(OE) mlx4_en(OE) sb_edac intel_powerclamp coretemp intel_rapl iosf_mbi kvm irqbypass crc32_pclmul ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper ablk_helper iTCO_wdt cryptd iTCO_vendor_support ipmi_ssif joydev pcspkr sg i2c_i801 ipmi_si ipmi_devintf ipmi_msghandler pcc_cpufreq wmi lpc_ich mei_me mei ioatdma auth_rpcgss sunrpc ip_tables ext4 mbcache jbd2 sd_mod crc_t10dif crct10dif_generic mlx4_ib(OE) ib_core(OE) mgag200 drm_kms_helper syscopyarea sysfillrect isci igb sysimgblt fb_sys_fops ttm mlx4_core(OE) ahci libsas libahci scsi_transport_sas ptp drm devlink pps_core crct10dif_pclmul crct10dif_common crc32c_intel dca libata mlx_compat(OE) drm_panel_orientation_quirks i2c_algo_bit [last unloaded: libcfs]
      [1669499.461226] CPU: 29 PID: 132092 Comm: ldlm_lock_repla Kdump: loaded Tainted: G           OEL ------------   3.10.0-1062.18.1.el7.x86_64 #1
      [1669499.475297] Hardware name: Intel Corporation S2600GZ ........../S2600GZ, BIOS SE5C600.86B.01.08.0003.022620131521 02/26/2013
      [1669499.488011] task: ffff9f75181a3150 ti: ffff9f76836c4000 task.ti: ffff9f76836c4000
      [1669499.496556] RIP: 0010:[<ffffffff9f3176b6>]  [<ffffffff9f3176b6>] native_queued_spin_lock_slowpath+0x156/0x200
      [1669499.507827] RSP: 0018:ffff9f76836c7cc8  EFLAGS: 00000202
      [1669499.513946] RAX: 0000000000000101 RBX: 0000000000000000 RCX: 0000000000e90000
      [1669499.522103] RDX: 0000000000e90101 RSI: 0000000000000101 RDI: ffff9f6957afda5c
      [1669499.530259] RBP: ffff9f76836c7cc8 R08: ffff9f77de55b880 R09: 0000000000000000
      [1669499.538417] R10: 0000000019aefe01 R11: ffff9f6e19aee900 R12: 0000000000000000
      [1669499.546575] R13: ffffffff9f423c7d R14: ffff9f76836c7cc8 R15: ffff9f6e19aefec0
      [1669499.554734] FS:  0000000000000000(0000) GS:ffff9f77de540000(0000) knlGS:0000000000000000
      [1669499.563957] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [1669499.570563] CR2: 00007f5d7cec9b80 CR3: 0000000031010000 CR4: 00000000000607e0
      [1669499.578725] Call Trace:
      [1669499.581653]  [<ffffffff9f9754ee>] queued_spin_lock_slowpath+0xb/0xf
      [1669499.588832]  [<ffffffff9f983b20>] _raw_spin_lock+0x20/0x30
      [1669499.595172]  [<ffffffffc0fb03e2>] ldlm_resource_foreach+0x52/0x270 [ptlrpc]
      [1669499.603153]  [<ffffffffc0fb062f>] ldlm_res_iter_helper+0x2f/0x40 [ptlrpc]
      [1669499.610928]  [<ffffffffc0bdc460>] cfs_hash_for_each_relax+0x250/0x450 [libcfs]
      [1669499.619197]  [<ffffffffc0fb0600>] ? ldlm_resource_foreach+0x270/0x270 [ptlrpc]
      [1669499.627468]  [<ffffffffc0fb0600>] ? ldlm_resource_foreach+0x270/0x270 [ptlrpc]
      [1669499.635734]  [<ffffffffc0bdf6c5>] cfs_hash_for_each_nolock+0x75/0x1c0 [libcfs]
      [1669499.644003]  [<ffffffffc0fb0900>] __ldlm_replay_locks+0xe0/0x9e0 [ptlrpc]
      [1669499.651788]  [<ffffffffc0fa8bc0>] ? is_granted_or_cancelled_nolock+0x60/0x60 [ptlrpc]
      [1669499.660739]  [<ffffffffc0fb1200>] ? __ldlm_replay_locks+0x9e0/0x9e0 [ptlrpc]
      [1669499.668816]  [<ffffffffc0fb1231>] ldlm_lock_replay_thread+0x31/0xd0 [ptlrpc]
      [1669499.676877]  [<ffffffff9f2c6321>] kthread+0xd1/0xe0
      [1669499.682513]  [<ffffffff9f2c6250>] ? insert_kthread_work+0x40/0x40
      [1669499.689507]  [<ffffffff9f98dd37>] ret_from_fork_nospec_begin+0x21/0x21
      [1669499.696986]  [<ffffffff9f2c6250>] ? insert_kthread_work+0x40/0x40
      [1669499.703979] Code: 8b 08 4d 85 c9 74 04 41 0f 18 09 8b 17 0f b7 c2 85 c0 74 21 83 f8 03 75 10 eb 1a 66 2e 0f 1f 84 00 00 00 00 00 85 c0 74 0c f3 90 <8b> 17 0f b7 c2 83 f8 03 75 f0 be 01 00 00 00 eb 15 66 0f 1f 84 
      [1669519.188752] NMI watchdog: BUG: soft lockup - CPU#1 stuck for 22s! [ldlm_cb00_004:84044]
      [1669519.194752] NMI watchdog: BUG: soft lockup - CPU#2 stuck for 22s! [simul:128254]
      [1669519.194776] Modules linked in: mgc(OE) lustre(OE) lmv(OE) mdc(OE) fid(OE) osc(OE) lov(OE) fld(OE) ko2iblnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) libcfs(OE) rpcsec_gss_krb5 nfsv4 dns_resolver nfs lockd grace fscache rdma_ucm(OE) ib_ucm(OE) rdma_cm(OE) iw_cm(OE) ib_ipoib(OE) ib_cm(OE) ib_umad(OE) mlx5_ib(OE) ib_uverbs(OE) mlx5_core(OE) mlxfw(OE) mlx4_en(OE) sb_edac intel_powerclamp coretemp intel_rapl iosf_mbi kvm irqbypass crc32_pclmul ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper ablk_helper iTCO_wdt cryptd iTCO_vendor_support ipmi_ssif joydev pcspkr sg i2c_i801 ipmi_si ipmi_devintf ipmi_msghandler pcc_cpufreq wmi lpc_ich mei_me mei ioatdma auth_rpcgss sunrpc ip_tables ext4 mbcache jbd2 sd_mod crc_t10dif crct10dif_generic mlx4_ib(OE) ib_core(OE) mgag200 drm_kms_helper syscopyarea sysfillrect isci igb sysimgblt fb_sys_fops ttm mlx4_core(OE) ahci libsas libahci scsi_transport_sas ptp drm devlink pps_core crct10dif_pclmul crct10dif_common crc32c_intel dca libata mlx_compat(OE) drm_panel_orientation_quirks i2c_algo_bit [last unloaded: libcfs]
      [1669519.194785] CPU: 2 PID: 128254 Comm: simul Kdump: loaded Tainted: G           OEL ------------   3.10.0-1062.18.1.el7.x86_64 #1
      [1669519.194786] Hardware name: Intel Corporation S2600GZ ........../S2600GZ, BIOS SE5C600.86B.01.08.0003.022620131521 02/26/2013
      [1669519.194787] task: ffff9f6a933941c0 ti: ffff9f6ed9ea4000 task.ti: ffff9f6ed9ea4000
      [1669519.194789] RIP: 0010:[<ffffffff9f31772e>]  [<ffffffff9f31772e>] native_queued_spin_lock_slowpath+0x1ce/0x200
      [1669519.194790] RSP: 0018:ffff9f6ed9ea7248  EFLAGS: 00000202
      [1669519.194791] RAX: 0000000000000001 RBX: 00000000b863ca88 RCX: 0000000000000001
      [1669519.194792] RDX: 0000000000000101 RSI: 0000000000000001 RDI: ffff9f6957afda5c
      [1669519.194792] RBP: ffff9f6ed9ea7248 R08: 0000000000000101 R09: 0000000000000000
      [1669519.194793] R10: 0000000000000000 R11: ffff9f69290d6300 R12: ffff9f6ed9ea7200
      [1669519.194794] R13: ffffffffc0fa36b2 R14: ffff9f6ed9ea71b0 R15: 0000000000000000
      [1669519.194795] FS:  00007f746b6e3740(0000) GS:ffff9f6fdea80000(0000) knlGS:0000000000000000
      [1669519.194796] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [1669519.194796] CR2: 00007f7137d35c20 CR3: 00000003964da000 CR4: 00000000000607e0
      [1669519.194797] Call Trace:
      [1669519.194800]  [<ffffffff9f9754ee>] queued_spin_lock_slowpath+0xb/0xf
      [1669519.194801]  [<ffffffff9f983b20>] _raw_spin_lock+0x20/0x30
      [1669519.194821]  [<ffffffffc0f999f8>] ldlm_lock_change_resource+0xe8/0x350 [ptlrpc]
      [1669519.194836]  [<ffffffffc0fab55f>] ldlm_cli_enqueue_fini+0x3ff/0xe40 [ptlrpc]
      [1669519.194850]  [<ffffffffc0d726e1>] ? lprocfs_counter_sub+0xc1/0x130 [obdclass]
      [1669519.194866]  [<ffffffffc0faf051>] ldlm_cli_enqueue+0x441/0xa20 [ptlrpc]
      [1669519.194880]  [<ffffffffc0fac270>] ? ldlm_expired_completion_wait+0x2a0/0x2a0 [ptlrpc]
      [1669519.194890]  [<ffffffffc122e8a0>] ? ll_md_need_convert+0x180/0x180 [lustre]
      [1669519.194895]  [<ffffffffc0c878a0>] ? mdc_changelog_cdev_finish+0x210/0x210 [mdc]
      [1669519.194900]  [<ffffffffc0c81ee0>] mdc_enqueue_base+0x330/0x1d40 [mdc]
      [1669519.194904]  [<ffffffffc0c84055>] mdc_intent_lock+0x135/0x570 [mdc]
      [1669519.194917]  [<ffffffffc0d726e1>] ? lprocfs_counter_sub+0xc1/0x130 [obdclass]
      [1669519.194927]  [<ffffffffc122e8a0>] ? ll_md_need_convert+0x180/0x180 [lustre]
      [1669519.194942]  [<ffffffffc0fac270>] ? ldlm_expired_completion_wait+0x2a0/0x2a0 [ptlrpc]
      [1669519.194946]  [<ffffffffc0c878a0>] ? mdc_changelog_cdev_finish+0x210/0x210 [mdc]
      [1669519.194950]  [<ffffffffc11c2996>] lmv_revalidate_slaves+0x416/0xb30 [lmv]
      [1669519.194959]  [<ffffffffc122e8a0>] ? ll_md_need_convert+0x180/0x180 [lustre]
      [1669519.194962]  [<ffffffffc11ac366>] lmv_merge_attr+0x46/0x1b0 [lmv]
      [1669519.194975]  [<ffffffffc0d725b9>] ? lprocfs_counter_add+0xf9/0x160 [obdclass]
      [1669519.194984]  [<ffffffffc1217d85>] ll_update_lsm_md+0xe35/0x1020 [lustre]
      [1669519.195002]  [<ffffffffc0fd2c57>] ? lustre_msg_buf+0x17/0x60 [ptlrpc]
      [1669519.195011]  [<ffffffffc121b8ab>] ll_update_inode+0x36b/0x640 [lustre]
      [1669519.195013]  [<ffffffff9f468e68>] ? inode_insert5+0x128/0x190
      [1669519.195022]  [<ffffffffc122d600>] ? ll_test_inode_by_fid+0x30/0x30 [lustre]
      [1669519.195031]  [<ffffffffc122d600>] ? ll_test_inode_by_fid+0x30/0x30 [lustre]
      [1669519.195039]  [<ffffffffc121bbe7>] ll_read_inode2+0x67/0x420 [lustre]
      [1669519.195048]  [<ffffffffc122e4ab>] ll_iget+0xdb/0x350 [lustre]
      [1669519.195057]  [<ffffffffc12206b2>] ll_prep_inode+0x212/0x9b0 [lustre]
      [1669519.195074]  [<ffffffffc0fd2c00>] ? lustre_msg_buf_v2+0x1a0/0x1e0 [ptlrpc]
      [1669519.195092]  [<ffffffffc0ffb8e7>] ? __req_capsule_get+0x427/0x6b0 [ptlrpc]
      [1669519.195102]  [<ffffffffc122fb98>] ll_lookup_it.constprop.26+0xc08/0x1ec0 [lustre]
      [1669519.195115]  [<ffffffffc0f993e3>] ? ldlm_lock_add_to_lru+0x43/0x130 [ptlrpc]
      [1669519.195129]  [<ffffffffc0f9b326>] ? ldlm_lock_decref+0x36/0x80 [ptlrpc]
      [1669519.195135]  [<ffffffffc11e82ca>] ? ll_intent_drop_lock.part.15+0x4a/0x170 [lustre]
      [1669519.195152]  [<ffffffffc0fc26c0>] ? ptlrpc_req_finished+0x10/0x20 [ptlrpc]
      [1669519.195160]  [<ffffffffc11f9dee>] ? ll_inode_revalidate+0x18e/0x690 [lustre]
      [1669519.195167]  [<ffffffffc11f6a71>] ? ll_get_acl+0x31/0xf0 [lustre]
      [1669519.195180]  [<ffffffffc0d725b9>] ? lprocfs_counter_add+0xf9/0x160 [obdclass]
      [1669519.195189]  [<ffffffffc12299b8>] ? ll_stats_ops_tally+0x98/0x100 [lustre]
      [1669519.195198]  [<ffffffffc1230f0e>] ll_lookup_nd+0xbe/0x180 [lustre]
      [1669519.195200]  [<ffffffff9f455973>] lookup_real+0x23/0x60
      [1669519.195202]  [<ffffffff9f456392>] __lookup_hash+0x42/0x60
      [1669519.195203]  [<ffffffff9f978067>] lookup_slow+0x42/0xa7
      [1669519.195205]  [<ffffffff9f45b8a8>] path_lookupat+0x838/0x8b0
      [1669519.195206]  [<ffffffff9f4264a5>] ? kmem_cache_alloc+0x35/0x1f0
      [1669519.195208]  [<ffffffff9f45c6ff>] ? getname_flags+0x4f/0x1a0
      [1669519.195209]  [<ffffffff9f45b94b>] filename_lookup+0x2b/0xc0
      [1669519.195211]  [<ffffffff9f45d897>] user_path_at_empty+0x67/0xc0
      [1669519.195213]  [<ffffffff9f2e28c9>] ? pick_next_entity+0xa9/0x190
      [1669519.195214]  [<ffffffff9f45d901>] user_path_at+0x11/0x20
      [1669519.195216]  [<ffffffff9f4505e3>] vfs_fstatat+0x63/0xc0
      [1669519.195217]  [<ffffffff9f2de7f5>] ? sched_clock_cpu+0x85/0xc0
      [1669519.195218]  [<ffffffff9f45099e>] SYSC_newstat+0x2e/0x60
      [1669519.195220]  [<ffffffff9f98de21>] ? system_call_after_swapgs+0xae/0x146
      [1669519.195221]  [<ffffffff9f98de15>] ? system_call_after_swapgs+0xa2/0x146
      [1669519.195223]  [<ffffffff9f98de21>] ? system_call_after_swapgs+0xae/0x146
      [1669519.195224]  [<ffffffff9f98de15>] ? system_call_after_swapgs+0xa2/0x146
      [1669519.195225]  [<ffffffff9f98de21>] ? system_call_after_swapgs+0xae/0x146
      [1669519.195226]  [<ffffffff9f98de15>] ? system_call_after_swapgs+0xa2/0x146
      [1669519.195228]  [<ffffffff9f98de21>] ? system_call_after_swapgs+0xae/0x146
      [1669519.195229]  [<ffffffff9f98de15>] ? system_call_after_swapgs+0xa2/0x146
      [1669519.195230]  [<ffffffff9f98de21>] ? system_call_after_swapgs+0xae/0x146
      [1669519.195231]  [<ffffffff9f98de15>] ? system_call_after_swapgs+0xa2/0x146
      [1669519.195232]  [<ffffffff9f98de21>] ? system_call_after_swapgs+0xae/0x146
      [1669519.195234]  [<ffffffff9f450e5e>] SyS_newstat+0xe/0x10
      [1669519.195235]  [<ffffffff9f98dede>] system_call_fastpath+0x25/0x2a
      [1669519.195237]  [<ffffffff9f98de21>] ? system_call_after_swapgs+0xae/0x146
      [1669519.195251] Code: 37 81 fe 00 01 00 00 74 f4 e9 93 fe ff ff 0f 1f 80 00 00 00 00 83 fa 01 75 11 0f 1f 00 e9 68 fe ff ff 0f 1f 00 85 c0 74 0c f3 90 <8b> 07 0f b6 c0 83 f8 03 75 f0 b8 01 00 00 00 66 89 07 5d c3 66 
      [1669519.947805] Modules linked in: mgc(OE) lustre(OE) lmv(OE) mdc(OE) fid(OE) osc(OE) lov(OE) fld(OE) ko2iblnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) libcfs(OE) rpcsec_gss_krb5 nfsv4 dns_resolver nfs lockd grace fscache rdma_ucm(OE) ib_ucm(OE) rdma_cm(OE) iw_cm(OE) ib_ipoib(OE) ib_cm(OE) ib_umad(OE) mlx5_ib(OE) ib_uverbs(OE) mlx5_core(OE) mlxfw(OE) mlx4_en(OE) sb_edac intel_powerclamp coretemp intel_rapl iosf_mbi kvm irqbypass crc32_pclmul ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper ablk_helper iTCO_wdt cryptd iTCO_vendor_support ipmi_ssif joydev pcspkr sg i2c_i801 ipmi_si ipmi_devintf ipmi_msghandler pcc_cpufreq wmi lpc_ich mei_me mei ioatdma auth_rpcgss sunrpc ip_tables ext4 mbcache jbd2 sd_mod crc_t10dif crct10dif_generic mlx4_ib(OE) ib_core(OE) mgag200 drm_kms_helper syscopyarea sysfillrect isci igb sysimgblt fb_sys_fops ttm mlx4_core(OE) ahci libsas libahci scsi_transport_sas ptp drm devlink pps_core crct10dif_pclmul crct10dif_common crc32c_intel dca libata mlx_compat(OE) drm_panel_orientation_quirks i2c_algo_bit [last unloaded: libcfs]
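
In both complete traces above, RIP is inside native_queued_spin_lock_slowpath and RDI holds the same value, ffff9f6957afda5c. Assuming RDI still carries the first argument (the lock pointer) at the sampled instruction, this suggests that ldlm_lock_repla and simul were spinning on the same lock. The following is a minimal Python sketch for checking that pattern across a longer console log. The function name group_spinners_by_lock and the regular expressions are hypothetical helpers written for this ticket, not part of any Lustre or kernel tooling, and they assume the RHEL 7 (3.10) oops layout shown here.

```python
import re
from collections import defaultdict

# Patterns for the RHEL 7 (3.10) oops layout seen in this ticket (an assumption).
CPU_RE = re.compile(r"CPU: (\d+) PID: (\d+) Comm: (\S+)")
RIP_RE = re.compile(r"RIP: .*\]\s+(\w+)\+0x")
RDI_RE = re.compile(r"RDI: ([0-9a-f]{16})")


def group_spinners_by_lock(log_text):
    """Group tasks stuck in a spinlock slowpath by their RDI value.

    RDI is treated as the lock address, which is only an assumption:
    it holds if the register has not been reused since function entry.
    """
    spinners = defaultdict(list)
    current = None  # register dump currently being parsed
    for line in log_text.splitlines():
        m = CPU_RE.search(line)
        if m:
            # A "CPU: N PID: P Comm: C" line starts a new register dump.
            current = {"cpu": int(m.group(1)), "pid": int(m.group(2)),
                       "comm": m.group(3), "rip": None}
            continue
        if current is None:
            continue
        m = RIP_RE.search(line)
        if m:
            current["rip"] = m.group(1)
            continue
        m = RDI_RE.search(line)
        if m:
            # Only count dumps whose RIP symbol is a spinlock slowpath.
            if current["rip"] and "spin_lock_slowpath" in current["rip"]:
                spinners[m.group(1)].append(
                    (current["cpu"], current["comm"], current["pid"]))
            current = None
    return dict(spinners)


# Lines copied from the two complete traces in this ticket.
SAMPLE = """\
[1669499.461226] CPU: 29 PID: 132092 Comm: ldlm_lock_repla Kdump: loaded Tainted: G           OEL ------------   3.10.0-1062.18.1.el7.x86_64 #1
[1669499.496556] RIP: 0010:[<ffffffff9f3176b6>]  [<ffffffff9f3176b6>] native_queued_spin_lock_slowpath+0x156/0x200
[1669499.522103] RDX: 0000000000e90101 RSI: 0000000000000101 RDI: ffff9f6957afda5c
[1669519.194785] CPU: 2 PID: 128254 Comm: simul Kdump: loaded Tainted: G           OEL ------------   3.10.0-1062.18.1.el7.x86_64 #1
[1669519.194789] RIP: 0010:[<ffffffff9f31772e>]  [<ffffffff9f31772e>] native_queued_spin_lock_slowpath+0x1ce/0x200
[1669519.194792] RDX: 0000000000000101 RSI: 0000000000000001 RDI: ffff9f6957afda5c
"""

if __name__ == "__main__":
    for lock, tasks in group_spinners_by_lock(SAMPLE).items():
        print(lock, tasks)
```

On the sample lines, both tasks are grouped under ffff9f6957afda5c. This only shows which tasks contend for the lock. It does not identify the holder, which would need the kdump vmcore or traces from the remaining CPUs.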
      
      

            People

            • Assignee: Oleg Drokin (green)
            • Reporter: Sarah Liu (sarah)