  Lustre / LU-11391

soft lockup in ldlm_prepare_lru_list()

Details

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Major
    • Environment: CentOS 7.5 patchfull and Lustre 2.11.55 on AMD EPYC servers
    • Severity: 3

    Description

      While testing the master branch, tag 2.11.55, I hit soft lockups in ldlm_prepare_lru_list() (workqueue: ldlm_pools_recalc_task) on the client when running mdtest from the IO-500 benchmark with a single client.

      [212288.213417] NMI watchdog: BUG: soft lockup - CPU#35 stuck for 22s! [kworker/35:1:600]
      [212288.221336] Modules linked in: mgc(OE) lustre(OE) lmv(OE) mdc(OE) osc(OE) lov(OE) fid(OE) fld(OE) ko2iblnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) libcfs(OE) mpt3sas mpt2sas raid_class scsi_transport_sas mptctl mptbase rpcsec_gss_krb5 dell_rbu auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache ib_ucm rpcrdma rdma_ucm ib_uverbs ib_iser ib_umad rdma_cm iw_cm libiscsi ib_ipoib scsi_transport_iscsi ib_cm mlx5_ib ib_core sunrpc vfat fat amd64_edac_mod edac_mce_amd kvm_amd kvm irqbypass crc32_pclmul ghash_clmulni_intel dcdbas aesni_intel lrw gf128mul glue_helper ablk_helper cryptd sg dm_multipath ccp dm_mod pcspkr shpchp i2c_piix4 ipmi_si ipmi_devintf ipmi_msghandler acpi_power_meter ip_tables ext4 mbcache jbd2 sd_mod crc_t10dif crct10dif_generic i2c_algo_bit drm_kms_helper syscopyarea sysfillrect sysimgblt
      [212288.294305]  fb_sys_fops ttm mlx5_core crct10dif_pclmul drm ahci mlxfw crct10dif_common tg3 libahci crc32c_intel devlink megaraid_sas ptp libata i2c_core pps_core
      [212288.307953] CPU: 35 PID: 600 Comm: kworker/35:1 Kdump: loaded Tainted: G           OEL ------------   3.10.0-862.9.1.el7_lustre.x86_64 #1
      [212288.320378] Hardware name: Dell Inc. PowerEdge R7425/02MJ3T, BIOS 1.3.6 04/20/2018
      [212288.328069] Workqueue: events ldlm_pools_recalc_task [ptlrpc]
      [212288.333925] task: ffff908bfc470000 ti: ffff908bfc464000 task.ti: ffff908bfc464000
      [212288.341491] RIP: 0010:[<ffffffff9fd08ff2>]  [<ffffffff9fd08ff2>] native_queued_spin_lock_slowpath+0x122/0x200
      [212288.351518] RSP: 0018:ffff908bfc467be8  EFLAGS: 00000246
      [212288.356918] RAX: 0000000000000000 RBX: 0000000000002000 RCX: 0000000001190000
      [212288.364139] RDX: ffff906bffb99740 RSI: 0000000001b10000 RDI: ffff90abfa32953c
      [212288.371358] RBP: ffff908bfc467be8 R08: ffff908bffb19740 R09: 0000000000000000
      [212288.378577] R10: 0000fbd0948bcb20 R11: 7fffffffffffffff R12: ffff907ca99fd018
      [212288.385796] R13: 0000000000000000 R14: 0000000000018b40 R15: 0000000000018b40
      [212288.393017] FS:  00007f96c2fbb740(0000) GS:ffff908bffb00000(0000) knlGS:0000000000000000
      [212288.401190] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [212288.407023] CR2: 00007f8de6b4da88 CR3: 0000000aaf40e000 CR4: 00000000003407e0
      [212288.414244] Call Trace:
      [212288.416793]  [<ffffffffa0309510>] queued_spin_lock_slowpath+0xb/0xf
      [212288.423151]  [<ffffffffa0316840>] _raw_spin_lock+0x20/0x30
      [212288.428755]  [<ffffffffc0fb9280>] ldlm_pool_set_clv+0x20/0x40 [ptlrpc]
      [212288.435391]  [<ffffffffc0f9c956>] ldlm_cancel_lrur_policy+0xd6/0x100 [ptlrpc]
      [212288.442639]  [<ffffffffc0f9e4ca>] ldlm_prepare_lru_list+0x1fa/0x4c0 [ptlrpc]
      [212288.449797]  [<ffffffffc0f9c880>] ? ldlm_iter_helper+0x20/0x20 [ptlrpc]
      [212288.456522]  [<ffffffffc0fa3e31>] ldlm_cancel_lru+0x61/0x170 [ptlrpc]
      [212288.463076]  [<ffffffffc0fb7741>] ldlm_cli_pool_recalc+0x231/0x240 [ptlrpc]
      [212288.470148]  [<ffffffffc0fb785c>] ldlm_pool_recalc+0x10c/0x1f0 [ptlrpc]
      [212288.476874]  [<ffffffffc0fb7abc>] ldlm_pools_recalc_delay+0x17c/0x1d0 [ptlrpc]
      [212288.484208]  [<ffffffffc0fb7cd3>] ldlm_pools_recalc_task+0x1c3/0x260 [ptlrpc]
      [212288.491431]  [<ffffffff9fcb35ef>] process_one_work+0x17f/0x440
      [212288.497356]  [<ffffffff9fcb4686>] worker_thread+0x126/0x3c0
      [212288.503016]  [<ffffffff9fcb4560>] ? manage_workers.isra.24+0x2a0/0x2a0
      [212288.509629]  [<ffffffff9fcbb621>] kthread+0xd1/0xe0
      [212288.514594]  [<ffffffff9fcbb550>] ? insert_kthread_work+0x40/0x40
      [212288.520776]  [<ffffffffa03205e4>] ret_from_fork_nospec_begin+0xe/0x21
      [212288.527300]  [<ffffffff9fcbb550>] ? insert_kthread_work+0x40/0x40
      [212288.533479] Code: 13 48 c1 ea 0d 48 98 83 e2 30 48 81 c2 40 97 01 00 48 03 14 c5 a0 53 93 a0 4c 89 02 41 8b 40 08 85 c0 75 0f 0f 1f 44 00 00 f3 90 <41> 8b 40 08 85 c0 74 f6 4d 8b 08 4d 85 c9 74 04 41 0f 18 09 8b 
      

      I triggered a crash dump that can be made available if anyone is interested, just let me know. Attaching vmcore-dmesg.txt and the output of foreach bt.
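
      For reference, a dump like this is usually examined with the crash utility; the following is only an illustrative sketch (the vmlinux path is the standard RHEL debuginfo location for the 3.10.0-862.9.1.el7_lustre.x86_64 kernel shown in the trace, not necessarily the exact path used here):

      $ crash /usr/lib/debug/lib/modules/3.10.0-862.9.1.el7_lustre.x86_64/vmlinux vmcore
      crash> bt
      crash> foreach bt

      The foreach bt command dumps the backtrace of every task, which is what the attached output corresponds to.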

      Client was running the following part of the IO-500 benchmark:

      [Starting] mdtest_easy_stat
      [Exec] mpirun -np 24 /home/sthiell/io-500-dev/bin/mdtest -T -F -d /firbench/nodom/datafiles/io500.2018.09.17-19.30.06/mdt_easy -n 200000 -u -L -x /firbench/nodom/datafiles/io500.2018.09.17-19.30.06/mdt_easy-stonewall
      

       
      Best,
      Stephane

      Attachments

        Issue Links

          Activity

            [LU-11391] soft lockup in ldlm_prepare_lru_list()
            jpeyrard Johann Peyrard (Inactive) added a comment - edited

            We had the same issue last week.

            The only way we found to reduce these NMI messages to near silence was to tune these two parameters (their current values can be read back as shown in the sketch just after this comment):

            $ lctl set_param ldlm.namespaces.*.lru_size=10000

            $ lctl set_param ldlm.namespaces.*.lru_max_age=1000

            Regards,

            Johann
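
            For reference, the current values of both tunables can be read back with lctl get_param before and after changing them (a minimal sketch, assuming a standard client mount):

            $ lctl get_param ldlm.namespaces.*.lru_size ldlm.namespaces.*.lru_max_age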

            ys Yang Sheng added a comment -

            Hi, Stephane,

            I have investigated the vmcore. It looks like we lost the timing of the lockup. From the stack trace you attached, the thread was spinning on pl_lock. It looks like no one can hold this lock for a long time except on the server side, but this instance is a client. Anyway, I'll try to reproduce it on my side.

            Thanks,
            YangSheng


            adilger Andreas Dilger added a comment -

            Stephane, could you please try setting the LDLM LRU size to avoid the LRU getting too large:

            client$ lctl set_param ldlm.namespaces.*.lru_size=50000
            

            This might avoid the lockup that you are seeing. We are looking at making this the default for an upcoming release, since it seems to be a common problem.
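
            A minimal sketch of applying and checking this on a client (note that a value set via lctl set_param does not persist across a remount, so it would need to be reapplied, e.g. from a boot-time script); lock_unused_count gives a rough idea of how long the LRU that the recalc worker has to walk currently is:

            client$ lctl set_param ldlm.namespaces.*.lru_size=50000
            client$ lctl get_param ldlm.namespaces.*.lru_size
            client$ lctl get_param ldlm.namespaces.*.lock_unused_count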


            sthiell Stephane Thiell added a comment -

            Hi YangSheng,

            Done, uploaded as LU11391-vmcore-pack.tar with debuginfo rpms included. Hope that helps!

            Best,
            Stephane

            ys Yang Sheng added a comment -

            Hi, Stephane,

            Could you please upload the vmcore to our FTP site (ftp.whamcloud.com)? It would be best to pack it together with the debuginfo RPMs.

            Thanks,
            YangSheng


            People

              Assignee: ys Yang Sheng
              Reporter: sthiell Stephane Thiell
              Votes: 0
              Watchers: 7
