[LU-11092] NMI watchdog: BUG: soft lockup - CPU#12 stuck for 23s! [ptlrpcd_00_18:4222] Created: 20/Jun/18  Updated: 21/Jan/22

Status: Open
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.10.4
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: Julien Wallior Assignee: WC Triage
Resolution: Unresolved Votes: 0
Labels: None
Environment:

SLES12SP3 – 4.4.132-94.33-default client – lustre 2.10.4
OSS & MDS on CentOS 7.3 – lustre 2.10.0 (with a couple of hand-picked patches)


Attachments: client.txt, mds831.txt, mds832.txt
Issue Links:
Duplicate
Related
is related to LU-11189 OSC flow control Open
is related to LU-11079 Control concurrent statahead instances Resolved
is related to LU-13970 add an option to disable inode cache ... Resolved
is related to LU-11075 suspicious code in ldlm_prepare_lru_l... Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

I'm running a robinhood scan on a lustre filesystem with 80M files. As the system comes under memory pressure (80M * ~1kB per lustre_inode_cache is roughly 80GB) and tries to shrink the slabs, I get these errors in /var/log/messages:

2018-06-20T10:49:34.165875-04:00 btsotbal3000 kernel: [ 4889.674359] Lustre: lustre-MDT0000-mdc-ffff883fd53bb000: Connection to lustre-MDT0000 (at 10.11.201.11@o2ib) was lost; in progress operations using this service will wait for recovery to complete
2018-06-20T10:51:38.493202-04:00 btsotbal3000 kernel: [ 5014.015342] Lustre: 4222:0:(client.c:2114:ptlrpc_expire_one_request()) @@@ Request sent has timed out for sent delay: [sent 1529505581/real 0] req@ffff881f7b3513c0 x1603798909054112/t0(0) o103->lustre-OST008c-osc-ffff883fd53bb000@10.11.201.18@o2ib:17/18 lens 328/224 e 0 to 1 dl 1529505643 ref 2 fl Rpc:X/0/ffffffff rc 0/-1
2018-06-20T10:51:38.493217-04:00 btsotbal3000 kernel: [ 5014.017476] Lustre: lustre-OST008c-osc-ffff883fd53bb000: Connection to lustre-OST008c (at 10.11.201.18@o2ib) was lost; in progress operations using this service will wait for recovery to complete
2018-06-20T10:51:39.012492-04:00 btsotbal3000 kernel: [ 5014.527203] Lustre: 4222:0:(client.c:2114:ptlrpc_expire_one_request()) @@@ Request sent has timed out for sent delay: [sent 1529505581/real 0] req@ffff881e37d8f080 x1603798909214432/t0(0) o103->lustre-OST0084-osc-ffff883fd53bb000@10.11.201.17@o2ib:17/18 lens 328/224 e 0 to 1 dl 1529505643 ref 2 fl Rpc:X/0/ffffffff rc 0/-1
2018-06-20T10:51:39.012505-04:00 btsotbal3000 kernel: [ 5014.527207] Lustre: 4222:0:(client.c:2114:ptlrpc_expire_one_request()) Skipped 262 previous similar messages
2018-06-20T10:51:40.008987-04:00 btsotbal3000 kernel: [ 5015.533078] Lustre: 4222:0:(client.c:2114:ptlrpc_expire_one_request()) @@@ Request sent has timed out for sent delay: [sent 1529505582/real 0] req@ffff881f2e08acc0 x1603798909677472/t0(0) o103->lustre-OST0089-osc-ffff883fd53bb000@10.11.201.18@o2ib:17/18 lens 328/224 e 0 to 1 dl 1529505644 ref 2 fl Rpc:X/0/ffffffff rc 0/-1
2018-06-20T10:51:40.008998-04:00 btsotbal3000 kernel: [ 5015.533083] Lustre: 4222:0:(client.c:2114:ptlrpc_expire_one_request()) Skipped 553 previous similar messages
2018-06-20T10:51:42.022152-04:00 btsotbal3000 kernel: [ 5017.542234] Lustre: 4222:0:(client.c:2114:ptlrpc_expire_one_request()) @@@ Request sent has timed out for sent delay: [sent 1529505582/real 0] req@ffff8818937e8340 x1603798910485936/t0(0) o103->lustre-OST00a9-osc-ffff883fd53bb000@10.11.200.15@o2ib:17/18 lens 328/224 e 0 to 1 dl 1529505644 ref 2 fl Rpc:X/0/ffffffff rc 0/-1
2018-06-20T10:51:42.022165-04:00 btsotbal3000 kernel: [ 5017.542239] Lustre: 4222:0:(client.c:2114:ptlrpc_expire_one_request()) Skipped 1102 previous similar messages
2018-06-20T10:51:46.030748-04:00 btsotbal3000 kernel: [ 5021.548246] Lustre: 4222:0:(client.c:2114:ptlrpc_expire_one_request()) @@@ Request sent has timed out for sent delay: [sent 1529505584/real 0] req@ffff881e0945b640 x1603798911754992/t0(0) o103->lustre-OST00ab-osc-ffff883fd53bb000@10.11.200.15@o2ib:17/18 lens 328/224 e 0 to 1 dl 1529505648 ref 2 fl Rpc:X/0/ffffffff rc 0/-1
2018-06-20T10:51:46.030762-04:00 btsotbal3000 kernel: [ 5021.548249] Lustre: 4222:0:(client.c:2114:ptlrpc_expire_one_request()) Skipped 2104 previous similar messages
2018-06-20T10:51:54.040347-04:00 btsotbal3000 kernel: [ 5029.557301] Lustre: 4222:0:(client.c:2114:ptlrpc_expire_one_request()) @@@ Request sent has timed out for sent delay: [sent 1529505589/real 0] req@ffff881e054d39c0 x1603798914360720/t0(0) o103->lustre-OST009f-osc-ffff883fd53bb000@10.11.200.14@o2ib:17/18 lens 328/224 e 0 to 1 dl 1529505658 ref 2 fl Rpc:X/0/ffffffff rc 0/-1
2018-06-20T10:51:54.040360-04:00 btsotbal3000 kernel: [ 5029.557307] Lustre: 4222:0:(client.c:2114:ptlrpc_expire_one_request()) Skipped 4325 previous similar messages
2018-06-20T10:52:01.579715-04:00 btsotbal3000 kernel: [ 5037.097043] NMI watchdog: BUG: soft lockup - CPU#12 stuck for 23s! [ptlrpcd_00_18:4222]
2018-06-20T10:52:01.579728-04:00 btsotbal3000 kernel: [ 5037.106479] Modules linked in: nfsv3 nfs fscache mpt3sas raid_class scsi_transport_sas mptctl mptbase osc(OEN) mgc(OEN) lustre(OEN) lmv(OEN) fld(OEN) mdc(OEN) fid(OEN) lov(OEN) ko2iblnd(OEN) ptlrpc(OEN) obdclass(OEN) lnet(OEN) libcfs(OEN) af_packet iscsi_ibft iscsi_boot_sysfs rdma_ucm(OEX) ib_ucm(OEX) rdma_cm(OEX) iw_cm(OEX) configfs ib_ipoib(OEX) ib_cm(OEX) ib_uverbs(OEX) ib_umad(OEX) mlx4_en(OEX) mlx4_ib(OEX) mlx4_core(OEX) msr xfs libcrc32c intel_rapl sb_edac edac_core ipmi_ssif x86_pkg_temp_thermal intel_powerclamp coretemp ixgbe kvm irqbypass crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel mdio drbg ansi_cprng aesni_intel igb mgag200 ttm drm_kms_helper ipmi_si drm ipmi_devintf mei_me ipmi_msghandler pcspkr i2c_algo_bit fjes aes_x86_64 lrw gf128mul glue_helper ablk_helper cryptd syscopyarea sysfillrect sysimgblt fb_sys_fops iTCO_wdt iTCO_vendor_support dcdbas(X) dca shpchp lpc_ich mfd_core processor mei wmi button nfsd nfs_acl lockd auth_rpcgss grace sunrpc mlx5_ib(OEX) ext4 crc16 jbd2 mbcache ib_core(OEX) sd_mod mlx5_core(OEX) vxlan ip6_udp_tunnel udp_tunnel ptp pps_core devlink inet_lro ehci_pci mlx_compat(OEX) ehci_hcd ahci libahci usbcore libata megaraid_sas usb_common sg dm_multipath dm_mod scsi_dh_rdac scsi_dh_emc scsi_dh_alua scsi_mod autofs4 [last unloaded: dell_rbu]
2018-06-20T10:52:01.579731-04:00 btsotbal3000 kernel: [ 5037.106543] Supported: No, Unsupported modules are loaded
2018-06-20T10:52:01.579732-04:00 btsotbal3000 kernel: [ 5037.106546] CPU: 12 PID: 4222 Comm: ptlrpcd_00_18 Tainted: G OE NX 4.4.132-94.33-default #1
2018-06-20T10:52:01.579733-04:00 btsotbal3000 kernel: [ 5037.106548] Hardware name: Dell Inc. PowerEdge R630/02C2CP, BIOS 2.0.2 03/15/2016
2018-06-20T10:52:01.579734-04:00 btsotbal3000 kernel: [ 5037.106549] task: ffff883f99d95640 ti: ffff883f99d98000 task.ti: ffff883f99d98000
2018-06-20T10:52:01.579735-04:00 btsotbal3000 kernel: [ 5037.106550] RIP: 0010:[<ffffffff810cb01f>] [<ffffffff810cb01f>] native_queued_spin_lock_slowpath+0x10f/0x1a0
2018-06-20T10:52:01.579736-04:00 btsotbal3000 kernel: [ 5037.106557] RSP: 0018:ffff883f99d9bc00 EFLAGS: 00000246
2018-06-20T10:52:01.579736-04:00 btsotbal3000 kernel: [ 5037.106557] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000340000
2018-06-20T10:52:01.579737-04:00 btsotbal3000 kernel: [ 5037.106558] RDX: ffff883fff05af80 RSI: 0000000000100001 RDI: ffff881ffae04500
2018-06-20T10:52:01.579737-04:00 btsotbal3000 kernel: [ 5037.106559] RBP: ffff883f99d9bce0 R08: ffff881fff99af80 R09: 0000000000000000
2018-06-20T10:52:01.579738-04:00 btsotbal3000 kernel: [ 5037.106560] R10: 000000000000002d R11: ffff881efa09381b R12: 00000000147f4371
2018-06-20T10:52:01.579739-04:00 btsotbal3000 kernel: [ 5037.106561] R13: 00000000051fd0dc R14: 0000000000000000 R15: ffff883f99d95640
2018-06-20T10:52:01.579740-04:00 btsotbal3000 kernel: [ 5037.106562] FS: 0000000000000000(0000) GS:ffff881fff980000(0000) knlGS:0000000000000000
2018-06-20T10:52:01.579740-04:00 btsotbal3000 kernel: [ 5037.106563] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
2018-06-20T10:52:01.579740-04:00 btsotbal3000 kernel: [ 5037.106564] CR2: 00007f0e6c018898 CR3: 0000000001e0c000 CR4: 00000000003606f0
2018-06-20T10:52:01.579741-04:00 btsotbal3000 kernel: [ 5037.106565] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
2018-06-20T10:52:01.579742-04:00 btsotbal3000 kernel: [ 5037.106566] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
2018-06-20T10:52:01.579743-04:00 btsotbal3000 kernel: [ 5037.106567] Stack:
2018-06-20T10:52:01.579744-04:00 btsotbal3000 kernel: [ 5037.106568] ffffffff811915d2 ffffffffa0b98f15 ffffffffa0eee8db ffffffffa0eee8db
2018-06-20T10:52:01.579744-04:00 btsotbal3000 kernel: [ 5037.106569] ffffffffa0eee8db ffffffffa0eee8db 0000000000000000 00080000ffffffff
2018-06-20T10:52:01.579744-04:00 btsotbal3000 kernel: [ 5037.106571] 5b2a6a1100000000 00000000ffffffff ffff883f92821048 ffffffffa0eee8db
2018-06-20T10:52:01.579745-04:00 btsotbal3000 kernel: [ 5037.106573] Call Trace:
2018-06-20T10:52:01.579745-04:00 btsotbal3000 kernel: [ 5037.106581] [<ffffffff811915d2>] queued_spin_lock_slowpath+0x7/0xa
2018-06-20T10:52:01.579746-04:00 btsotbal3000 kernel: [ 5037.106605] [<ffffffffa0b98f15>] LNetMDUnlink+0x65/0x150 [lnet]
2018-06-20T10:52:01.579747-04:00 btsotbal3000 kernel: [ 5037.106677] [<ffffffffa0e854be>] ptlrpc_unregister_reply+0xee/0x6b0 [ptlrpc]
2018-06-20T10:52:01.579748-04:00 btsotbal3000 kernel: [ 5037.106703] [<ffffffffa0e89de3>] ptlrpc_expire_one_request+0xb3/0x420 [ptlrpc]
2018-06-20T10:52:01.579748-04:00 btsotbal3000 kernel: [ 5037.106726] [<ffffffffa0e8a1f9>] ptlrpc_expired_set+0xa9/0x180 [ptlrpc]
2018-06-20T10:52:01.579749-04:00 btsotbal3000 kernel: [ 5037.106752] [<ffffffffa0eb7679>] ptlrpcd+0x229/0x490 [ptlrpc]
2018-06-20T10:52:01.579749-04:00 btsotbal3000 kernel: [ 5037.106760] [<ffffffff8109ebc9>] kthread+0xc9/0xe0
2018-06-20T10:52:01.579750-04:00 btsotbal3000 kernel: [ 5037.106765] [<ffffffff816177ef>] ret_from_fork+0x3f/0x80
2018-06-20T10:52:01.579751-04:00 btsotbal3000 kernel: [ 5037.107943] DWARF2 unwinder stuck at ret_from_fork+0x3f/0x80
2018-06-20T10:52:01.579752-04:00 btsotbal3000 kernel: [ 5037.107944]
2018-06-20T10:52:01.579752-04:00 btsotbal3000 kernel: [ 5037.107945] Leftover inexact backtrace:
2018-06-20T10:52:01.579753-04:00 btsotbal3000 kernel: [ 5037.107945]
2018-06-20T10:52:01.579753-04:00 btsotbal3000 kernel: [ 5037.107947] [<ffffffff8109eb00>] ? kthread_park+0x50/0x50
2018-06-20T10:52:01.579754-04:00 btsotbal3000 kernel: [ 5037.107948] Code: c1 e8 12 48 c1 ea 0c 83 e8 01 83 e2 30 48 98 48 81 c2 80 af 01 00 48 03 14 c5 20 0d f5 81 4c 89 02 41 8b 40 08 85 c0 75 0a f3 90 <41> 8b 40 08 85 c0 74 f6 4d 8b 08 4d 85 c9 74 08 41 0f 0d 09 eb

At that point, all the CPUs go to 100% and it seems like we are not making progress anymore.

I can repro by running robinhood for a few minutes and dropping the cache (echo 2 > /proc/sys/vm/drop_caches).
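
For anyone trying to reproduce this, the steps boil down to the following sketch (the robinhood invocation and config path are site-specific assumptions, not the exact commands used here):

# run a metadata-heavy scan for a few minutes to build up cached inodes and DLM locks
robinhood --scan -f /etc/robinhood.d/myfs.conf
# then force slab reclaim, which runs the registered shrinkers (including Lustre's)
echo 2 > /proc/sys/vm/drop_caches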

I managed to fix it by limiting lru_max_age to 10s (ldlm.namespaces.*.lru_max_age).
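
Spelled out, the workaround was roughly the following, run on the client (assuming lru_max_age is in milliseconds, consistent with the 3900000 default mentioned further down):

lctl set_param ldlm.namespaces.*.lru_max_age=10000   # cancel unused DLM locks after ~10s
lctl get_param ldlm.namespaces.*.lru_max_age         # verify the new value took effect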


Overall, it feels like when the slab shrinker runs, if there are too many locks, reclaiming them takes a while, prevents the RPC stack from making progress, and then we hit a race condition in LNetMDUnlink.

 Comments   
Comment by Andreas Dilger [ 08/Aug/18 ]

There is a patch under development that might help with this - it reduces the number of client RPCs in flight when the client is not getting timely responses from the servers. This helps reduce the load on the servers and avoids timeouts on the clients.

Comment by Andreas Dilger [ 13/Aug/18 ]

There are patches for master and b2_10 in LU-11079 that may address this issue.

Comment by Julien Wallior [ 14/Aug/18 ]

We tried with the patch from LU-11079. Same issue (after putting lru_max_age back to the default).

Overall, since we lowered lru_max_age, we haven't had any issue related to this bug.

Comment by Andreas Dilger [ 15/Aug/18 ]

When you run without the reduced lru_max_age, how many locks accumulate on the clients/MDS? This can be seen on the clients and servers with lctl get_param ldlm.namespaces.*MDT*.lock_count.
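
For example, something like this (the *MDT* pattern matches the MDC namespaces on the client and the MDT namespaces on the MDS; the per-OSC variant is shown only for completeness):

# client: locks cached per MDC/OSC namespace
lctl get_param ldlm.namespaces.*.lock_count
# MDS: locks granted per MDT namespace
lctl get_param ldlm.namespaces.*MDT*.lock_count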

Comment by Andreas Dilger [ 15/Aug/18 ]

Is it the MDS or the client that is running short of memory? It looks like the client is getting timeouts on the MDS RPCs, so finding out where the memory pressure is will determine what to investigate/fix. In addition to the lock_count, please provide the output of /proc/meminfo and /proc/slabinfo on the RBH client and MDS.
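
Something like the following on the RBH client and each MDS would capture everything in one file per node (the output file name is arbitrary):

# gather lock counts plus memory/slab state on this node
lctl get_param ldlm.namespaces.*.lock_count  > /tmp/$(hostname).txt
cat /proc/meminfo                           >> /tmp/$(hostname).txt
cat /proc/slabinfo                          >> /tmp/$(hostname).txt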

Comment by Julien Wallior [ 15/Aug/18 ]

When I revert lru_max_age to the default (3900000), I only do it on the client, and I can repro. (This is a live production filesystem, so I can only break it so much.)

Regarding who's running out of memory, I think it's the client (I can either wait for it to run out of memory, or trigger the issue earlier by dropping the cache on the client). Now, the client has 256GB, so even for an 80M-file filesystem, I would hope that should be enough to run rbh.

Finally, regarding the timeouts, they say "@@@ Request sent has timed out for sent delay". I was thinking that meant the client was stuck on something (clearing the memory) and didn't get to send the request on time. But maybe it means the client can't put the request on the wire because the server told it to wait (I'm not sure if that's possible).

The client was freshly rebooted before starting the run. After running rbh --scan for ~20min, I got the following results.
mds831.txt: data on the mds hosting mdt0
mds832.txt: data on the mds hosting mdt1
client.txt: data on the client

I can get the data on the OSS too if you need them.

Comment by Andreas Dilger [ 16/Aug/18 ]

It might be that, with so much RAM on the client, it isn't trying very hard to cancel locks. Setting the lock max age to 10s is probably too short to be useful for other locks on the system (e.g. directories that are being accessed repeatedly). Another option would be to set a limit on the number of locks on the client and increase the max age to something more reasonable like 10 minutes, e.g. lctl set_param ldlm.namespaces.*.lru_size=100000 ldlm.namespaces.*.lru_max_age=600000 or similar. If we assume all of the RAM in use on the client is used by Lustre, the 1M inodes cached would be using about 13KB/inode (including 5M locks), which means it would consume at most 13MB per OST for the inode/lock cache. It may be that you are seeing more locks than inodes if, e.g., you have a default stripe count of 4 (one MDT lock and 4 OST locks per file).
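
As commands, that suggestion is roughly the following (same units as above, lru_max_age in ms; note that a non-zero lru_size makes the per-namespace limit static rather than dynamically resized):

lctl set_param ldlm.namespaces.*.lru_size=100000      # cap cached locks per namespace
lctl set_param ldlm.namespaces.*.lru_max_age=600000   # keep unused locks for at most 10 minutes
lctl get_param ldlm.namespaces.*.lru_size ldlm.namespaces.*.lru_max_age   # verify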

It probably makes sense to put some reasonable upper limit on the LRU size, since I've seen similar problems with too many cached locks on other systems with a lot of RAM. Even if there is enough free RAM, it just slows things down when many millions of files are cached for no real reason. We might also need to take a look at the kernel slab cache shrinker to see if it is working effectively to shrink the DLM cache size under memory pressure.

Comment by Julien Wallior [ 17/Aug/18 ]

Yes, we use stripe count > 1 in different directories.

I tried lru_size = 100k and lru_max_age = 600k and that works fine too.
