Details
-
Bug
-
Resolution: Unresolved
-
Major
-
CentOS 7.5 patchfull and Lustre 2.11.55 on AMD EPYC servers
-
3
Description
While testing the master branch (tag 2.11.55), I hit soft lockups in ldlm_prepare_lru_list() (workqueue: ldlm_pools_recalc_task) on the client when running mdtest from the IO-500 benchmark on a single client.
[212288.213417] NMI watchdog: BUG: soft lockup - CPU#35 stuck for 22s! [kworker/35:1:600]
[212288.221336] Modules linked in: mgc(OE) lustre(OE) lmv(OE) mdc(OE) osc(OE) lov(OE) fid(OE) fld(OE) ko2iblnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) libcfs(OE) mpt3sas mpt2sas raid_class scsi_transport_sas mptctl mptbase rpcsec_gss_krb5 dell_rbu auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache ib_ucm rpcrdma rdma_ucm ib_uverbs ib_iser ib_umad rdma_cm iw_cm libiscsi ib_ipoib scsi_transport_iscsi ib_cm mlx5_ib ib_core sunrpc vfat fat amd64_edac_mod edac_mce_amd kvm_amd kvm irqbypass crc32_pclmul ghash_clmulni_intel dcdbas aesni_intel lrw gf128mul glue_helper ablk_helper cryptd sg dm_multipath ccp dm_mod pcspkr shpchp i2c_piix4 ipmi_si ipmi_devintf ipmi_msghandler acpi_power_meter ip_tables ext4 mbcache jbd2 sd_mod crc_t10dif crct10dif_generic i2c_algo_bit drm_kms_helper syscopyarea sysfillrect sysimgblt
[212288.294305] fb_sys_fops ttm mlx5_core crct10dif_pclmul drm ahci mlxfw crct10dif_common tg3 libahci crc32c_intel devlink megaraid_sas ptp libata i2c_core pps_core
[212288.307953] CPU: 35 PID: 600 Comm: kworker/35:1 Kdump: loaded Tainted: G OEL ------------ 3.10.0-862.9.1.el7_lustre.x86_64 #1
[212288.320378] Hardware name: Dell Inc. PowerEdge R7425/02MJ3T, BIOS 1.3.6 04/20/2018
[212288.328069] Workqueue: events ldlm_pools_recalc_task [ptlrpc]
[212288.333925] task: ffff908bfc470000 ti: ffff908bfc464000 task.ti: ffff908bfc464000
[212288.341491] RIP: 0010:[<ffffffff9fd08ff2>] [<ffffffff9fd08ff2>] native_queued_spin_lock_slowpath+0x122/0x200
[212288.351518] RSP: 0018:ffff908bfc467be8 EFLAGS: 00000246
[212288.356918] RAX: 0000000000000000 RBX: 0000000000002000 RCX: 0000000001190000
[212288.364139] RDX: ffff906bffb99740 RSI: 0000000001b10000 RDI: ffff90abfa32953c
[212288.371358] RBP: ffff908bfc467be8 R08: ffff908bffb19740 R09: 0000000000000000
[212288.378577] R10: 0000fbd0948bcb20 R11: 7fffffffffffffff R12: ffff907ca99fd018
[212288.385796] R13: 0000000000000000 R14: 0000000000018b40 R15: 0000000000018b40
[212288.393017] FS: 00007f96c2fbb740(0000) GS:ffff908bffb00000(0000) knlGS:0000000000000000
[212288.401190] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[212288.407023] CR2: 00007f8de6b4da88 CR3: 0000000aaf40e000 CR4: 00000000003407e0
[212288.414244] Call Trace:
[212288.416793] [<ffffffffa0309510>] queued_spin_lock_slowpath+0xb/0xf
[212288.423151] [<ffffffffa0316840>] _raw_spin_lock+0x20/0x30
[212288.428755] [<ffffffffc0fb9280>] ldlm_pool_set_clv+0x20/0x40 [ptlrpc]
[212288.435391] [<ffffffffc0f9c956>] ldlm_cancel_lrur_policy+0xd6/0x100 [ptlrpc]
[212288.442639] [<ffffffffc0f9e4ca>] ldlm_prepare_lru_list+0x1fa/0x4c0 [ptlrpc]
[212288.449797] [<ffffffffc0f9c880>] ? ldlm_iter_helper+0x20/0x20 [ptlrpc]
[212288.456522] [<ffffffffc0fa3e31>] ldlm_cancel_lru+0x61/0x170 [ptlrpc]
[212288.463076] [<ffffffffc0fb7741>] ldlm_cli_pool_recalc+0x231/0x240 [ptlrpc]
[212288.470148] [<ffffffffc0fb785c>] ldlm_pool_recalc+0x10c/0x1f0 [ptlrpc]
[212288.476874] [<ffffffffc0fb7abc>] ldlm_pools_recalc_delay+0x17c/0x1d0 [ptlrpc]
[212288.484208] [<ffffffffc0fb7cd3>] ldlm_pools_recalc_task+0x1c3/0x260 [ptlrpc]
[212288.491431] [<ffffffff9fcb35ef>] process_one_work+0x17f/0x440
[212288.497356] [<ffffffff9fcb4686>] worker_thread+0x126/0x3c0
[212288.503016] [<ffffffff9fcb4560>] ? manage_workers.isra.24+0x2a0/0x2a0
[212288.509629] [<ffffffff9fcbb621>] kthread+0xd1/0xe0
[212288.514594] [<ffffffff9fcbb550>] ? insert_kthread_work+0x40/0x40
[212288.520776] [<ffffffffa03205e4>] ret_from_fork_nospec_begin+0xe/0x21
[212288.527300] [<ffffffff9fcbb550>] ? insert_kthread_work+0x40/0x40
[212288.533479] Code: 13 48 c1 ea 0d 48 98 83 e2 30 48 81 c2 40 97 01 00 48 03 14 c5 a0 53 93 a0 4c 89 02 41 8b 40 08 85 c0 75 0f 0f 1f 44 00 00 f3 90 <41> 8b 40 08 85 c0 74 f6 4d 8b 08 4d 85 c9 74 04 41 0f 18 09 8b
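For triage, it can help to reduce a trace like the one above to its chain of functions and owning modules. The following is a minimal Python sketch and not part of the original report: `parse_call_trace`, `FRAME_RE`, and `SAMPLE` are hypothetical names, and the regular expression assumes the frame format seen in this log (`[<address>] symbol+0xoff/0xsize [module]`, with `?` marking frames the unwinder considers unreliable).

```python
import re

# One stack frame, e.g.
#   [<ffffffffc0fb9280>] ldlm_pool_set_clv+0x20/0x40 [ptlrpc]
# Group 1: optional "? " marker, group 2: symbol, group 3: optional module.
FRAME_RE = re.compile(
    r"\[<[0-9a-f]+>\]\s+(\?\s+)?([A-Za-z_][\w.]*)"
    r"\+0x[0-9a-f]+/0x[0-9a-f]+(?:[ \t]+\[(\w+)\])?"
)

def parse_call_trace(log_text):
    """Return (function, module, reliable) tuples for frames after 'Call Trace:'."""
    _, found, trace = log_text.partition("Call Trace:")
    if not found:
        return []
    # Ignore the instruction dump that follows the stack frames.
    trace = trace.split("Code:", 1)[0]
    frames = []
    for match in FRAME_RE.finditer(trace):
        unreliable, func, module = match.groups()
        # Frames without a [module] suffix belong to the core kernel image.
        frames.append((func, module or "kernel", unreliable is None))
    return frames

# A short excerpt of the trace from this report, used as sample input.
SAMPLE = """\
[212288.414244] Call Trace:
[212288.416793] [<ffffffffa0309510>] queued_spin_lock_slowpath+0xb/0xf
[212288.423151] [<ffffffffa0316840>] _raw_spin_lock+0x20/0x30
[212288.428755] [<ffffffffc0fb9280>] ldlm_pool_set_clv+0x20/0x40 [ptlrpc]
[212288.435391] [<ffffffffc0f9c956>] ldlm_cancel_lrur_policy+0xd6/0x100 [ptlrpc]
[212288.442639] [<ffffffffc0f9e4ca>] ldlm_prepare_lru_list+0x1fa/0x4c0 [ptlrpc]
[212288.449797] [<ffffffffc0f9c880>] ? ldlm_iter_helper+0x20/0x20 [ptlrpc]
"""

if __name__ == "__main__":
    for func, module, reliable in parse_call_trace(SAMPLE):
        marker = " " if reliable else "?"
        print(f"{marker} {module:8s} {func}")
```

Read this way, the reliable frames show the worker spinning on a lock taken in ldlm_pool_set_clv(), reached from ldlm_cancel_lrur_policy() while ldlm_prepare_lru_list() walks the LRU.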
I triggered a crash dump, which I can make available if anyone is interested; just let me know. I am attaching vmcore-dmest.txt and the output of foreach bt.
The client was running the following part of the IO-500 benchmark:
[Starting] mdtest_easy_stat
[Exec] mpirun -np 24 /home/sthiell/io-500-dev/bin/mdtest -T -F -d /firbench/nodom/datafiles/io500.2018.09.17-19.30.06/mdt_easy -n 200000 -u -L -x /firbench/nodom/datafiles/io500.2018.09.17-19.30.06/mdt_easy-stonewall
Best,
Stephane