[LU-11895] CPU lockup in LNetMDUnlink Created: 28/Jan/19  Updated: 11/Jul/22

Status: Open
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.12.0
Fix Version/s: None

Type: Bug Priority: Major
Reporter: Stephane Thiell Assignee: Amir Shehata (Inactive)
Resolution: Unresolved Votes: 0
Labels: llnl
Environment:

CentOS 7.6, kernel 3.10.0-957.1.3.el7_lustre.x86_64, MOFED 4.5


Attachments: Text File fir-rbh01-vmcore-dmesg.txt    
Issue Links:
Related
is related to LU-12194 clients getting soft lockups on 2.10.7 Open
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

This just occurred on a Lustre 2.12.0 client running Robinhood, a many-core machine with 2 x AMD EPYC 7401 CPUs (96 CPU threads total).

[1488057.711176] CPU: 63 PID: 54246 Comm: ptlrpcd_07_06 Kdump: loaded Tainted: G           OEL ------------   3.10.0-957.1.3.el7_lustre.x86_64 #1
[1488057.711177] Hardware name: Dell Inc. PowerEdge R7425/02MJ3T, BIOS 1.3.6 04/20/2018
[1488057.711178] task: ffff8abafc522080 ti: ffff8abaf4f54000 task.ti: ffff8abaf4f54000
[1488057.711182] RIP: 0010:[<ffffffff897121e6>]  [<ffffffff897121e6>] native_queued_spin_lock_slowpath+0x126/0x200
[1488057.711183] RSP: 0018:ffff8abaf4f57b78  EFLAGS: 00000246
[1488057.711183] RAX: 0000000000000000 RBX: ffffffffc0ff5a97 RCX: 0000000001f90000
[1488057.711184] RDX: ffff8a9affa9b780 RSI: 0000000000910000 RDI: ffff8adafa7b1b00
[1488057.711185] RBP: ffff8abaf4f57b78 R08: ffff8afb3f9db780 R09: 0000000000000000
[1488057.711185] R10: 0000000000000000 R11: 000000000000000f R12: ffff8aed569e2a00
[1488057.711186] R13: 0005c4aa17c51860 R14: ffff8aed24f64b00 R15: 0000000000000007
[1488057.711187] FS:  00007ea1d389b700(0000) GS:ffff8afb3f9c0000(0000) knlGS:0000000000000000
[1488057.711188] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[1488057.711189] CR2: 00007ecf3e8c7ff6 CR3: 000000276ae10000 CR4: 00000000003407e0
[1488057.711190] Call Trace:
[1488057.711192]  [<ffffffff89d5bfcb>] queued_spin_lock_slowpath+0xb/0xf
[1488057.711193]  [<ffffffff89d6a480>] _raw_spin_lock+0x20/0x30
[1488057.711201]  [<ffffffffc0c23418>] cfs_percpt_lock+0x58/0x110 [libcfs]
[1488057.711211]  [<ffffffffc0c889d8>] LNetMDUnlink+0x78/0x180 [lnet]
[1488057.711250]  [<ffffffffc0f276bf>] ptlrpc_unregister_reply+0xbf/0x790 [ptlrpc]
[1488057.711287]  [<ffffffffc0f2c35e>] ptlrpc_expire_one_request+0xee/0x520 [ptlrpc]
[1488057.711324]  [<ffffffffc0f2c83f>] ptlrpc_expired_set+0xaf/0x1a0 [ptlrpc]
[1488057.711362]  [<ffffffffc0f5cc5c>] ptlrpcd+0x28c/0x550 [ptlrpc]
[1488057.711364]  [<ffffffff896d67b0>] ? wake_up_state+0x20/0x20
[1488057.711402]  [<ffffffffc0f5c9d0>] ? ptlrpcd_check+0x590/0x590 [ptlrpc]
[1488057.711404]  [<ffffffff896c1c31>] kthread+0xd1/0xe0
[1488057.711406]  [<ffffffff896c1b60>] ? insert_kthread_work+0x40/0x40
[1488057.711408]  [<ffffffff89d74c24>] ret_from_fork_nospec_begin+0xe/0x21
[1488057.711410]  [<ffffffff896c1b60>] ? insert_kthread_work+0x40/0x40

A lot of CPUs were stuck in LNetMDUnlink. The server crashed with a hard lockup at the end. A vmcore is available on demand; vmcore-dmesg.txt is attached.

I kept the default lru_size/lru_max_age values (0/3900000) on this server, so I'll try reducing them as follows:

  • lru_size=100
  • lru_max_age=1200

as on our Lustre 2.10 Robinhood server for Oak.
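For reference, these can be applied at runtime roughly as follows (a sketch: `ldlm.namespaces.*` is the standard lctl parameter path, but the values are just the ones proposed above, and the units of lru_max_age differ between Lustre versions — milliseconds in newer releases — so verify before applying):

```shell
# Cap the DLM lock LRU size and age on the client (illustrative values).
lctl set_param ldlm.namespaces.*.lru_size=100
lctl set_param ldlm.namespaces.*.lru_max_age=1200

# Verify the current settings:
lctl get_param ldlm.namespaces.*.lru_size ldlm.namespaces.*.lru_max_age
```

Note these set_param values are not persistent across remounts unless also configured via `lctl set_param -P` or local tuning scripts.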

Any other recommendations are welcome.

Thanks!
Stephane



 Comments   
Comment by Peter Jones [ 29/Jan/19 ]

Amir

Could you please investigate?

Thanks

Peter

Comment by Andreas Dilger [ 29/Jan/19 ]

How many CPTs on this system? lctl get_param cpu*

Comment by Stephane Thiell [ 29/Jan/19 ]
[root@fir-rbh01 ~]# lctl get_param cpu*
cpu_partition_distance=
0	: 0:10 1:16 2:16 3:16 4:28 5:28 6:22 7:28
1	: 0:16 1:10 2:16 3:16 4:28 5:28 6:28 7:22
2	: 0:16 1:16 2:10 3:16 4:22 5:28 6:28 7:28
3	: 0:16 1:16 2:16 3:10 4:28 5:22 6:28 7:28
4	: 0:28 1:28 2:22 3:28 4:10 5:16 6:16 7:16
5	: 0:28 1:28 2:28 3:22 4:16 5:10 6:16 7:16
6	: 0:22 1:28 2:28 3:28 4:16 5:16 6:10 7:16
7	: 0:28 1:22 2:28 3:28 4:16 5:16 6:16 7:10
cpu_partition_table=
0	: 0 8 16 24 32 40 48 56 64 72 80 88
1	: 2 10 18 26 34 42 50 58 66 74 82 90
2	: 4 12 20 28 36 44 52 60 68 76 84 92
3	: 6 14 22 30 38 46 54 62 70 78 86 94
4	: 1 9 17 25 33 41 49 57 65 73 81 89
5	: 3 11 19 27 35 43 51 59 67 75 83 91
6	: 5 13 21 29 37 45 53 61 69 77 85 93
7	: 7 15 23 31 39 47 55 63 71 79 87 95

I think that makes sense, as each EPYC CPU socket has 4 NUMA domains.

[root@fir-rbh01 ~]# numactl --hardware
available: 8 nodes (0-7)
node 0 cpus: 0 8 16 24 32 40 48 56 64 72 80 88
node 0 size: 65213 MB
node 0 free: 41206 MB
node 1 cpus: 2 10 18 26 34 42 50 58 66 74 82 90
node 1 size: 65535 MB
node 1 free: 34082 MB
node 2 cpus: 4 12 20 28 36 44 52 60 68 76 84 92
node 2 size: 65535 MB
node 2 free: 63939 MB
node 3 cpus: 6 14 22 30 38 46 54 62 70 78 86 94
node 3 size: 65535 MB
node 3 free: 63611 MB
node 4 cpus: 1 9 17 25 33 41 49 57 65 73 81 89
node 4 size: 65535 MB
node 4 free: 63923 MB
node 5 cpus: 3 11 19 27 35 43 51 59 67 75 83 91
node 5 size: 65535 MB
node 5 free: 63569 MB
node 6 cpus: 5 13 21 29 37 45 53 61 69 77 85 93
node 6 size: 65535 MB
node 6 free: 63449 MB
node 7 cpus: 7 15 23 31 39 47 55 63 71 79 87 95
node 7 size: 65535 MB
node 7 free: 63535 MB
node distances:
node   0   1   2   3   4   5   6   7 
  0:  10  16  16  16  28  28  22  28 
  1:  16  10  16  16  28  28  28  22 
  2:  16  16  10  16  22  28  28  28 
  3:  16  16  16  10  28  22  28  28 
  4:  28  28  22  28  10  16  16  16 
  5:  28  28  28  22  16  10  16  16 
  6:  22  28  28  28  16  16  10  16 
  7:  28  22  28  28  16  16  16  10
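As a side note on the CPT layout above: the number of CPU partitions libcfs/LNet creates can be pinned with module parameters. `cpu_npartitions` and `cpu_pattern` are standard libcfs options, but the values below are purely illustrative assumptions, not a recommendation from this ticket:

```shell
# /etc/modprobe.d/libcfs.conf (illustrative values only)
# Limit libcfs to 4 CPU partitions instead of one per NUMA node:
options libcfs cpu_npartitions=4
# Or spell out an explicit partition-to-CPU mapping instead:
# options libcfs cpu_pattern="0[0-23] 1[24-47] 2[48-71] 3[72-95]"
```

Changing either parameter requires reloading the libcfs module (i.e., unmounting Lustre and stopping LNet first).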
Comment by Amir Shehata (Inactive) [ 14/Feb/19 ]

This appears similar to: https://jira.whamcloud.com/browse/LU-11100

Would you be able to try the patches suggested there?

In the meantime, if you're able to make the vmcore and vmlinux available so I can take a further look, that would be helpful.

Comment by Stephane Thiell [ 08/Apr/19 ]

Hi Amir,

Oops, sorry for being so slow on this one; we had only seen the problem once so far. But today a login node started to do the same, though it recovered by itself!

Anyway, we had a lot of soft lockups, like this:

[1230992.435041] NMI watchdog: BUG: soft lockup - CPU#8 stuck for 23s! [ptlrpcd_00_10:42801]
[1230992.444173] Modules linked in: binfmt_misc squashfs overlay(T) rpcsec_gss_krb5 nfsv4 dns_resolver fuse mgc(OE) lustre(OE) lmv(OE) mdc(OE) fid(OE) osc(OE) lov(OE) fld(OE) ptlrpc(OE) ko2iblnd(OE) obdclass(OE)
 lnet(OE) libcfs(OE) xt_multiport ip_set_hash_ip nfsv3 nfs_acl nfs lockd grace ip6t_MASQUERADE nf_nat_masquerade_ipv6 ipt_MASQUERADE nf_nat_masquerade_ipv4 xt_set ip6t_rpfilter ipt_REJECT nf_reject_ipv4 ip6t_REJ
ECT nf_reject_ipv6 xt_conntrack ip_set_hash_net ip_set nfnetlink ebtable_nat ebtable_broute bridge stp llc ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 ip6table_mangle ip6table_security ip6table_raw
 iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack iptable_mangle iptable_security iptable_raw ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter
[1230992.524464]  rdma_ucm(OE) ib_ucm(OE) rdma_cm(OE) iw_cm(OE) ib_ipoib(OE) ib_cm(OE) ib_umad(OE) mlx5_fpga_tools(OE) mlx5_ib(OE) ib_uverbs(OE) mlx5_core(OE) mlxfw(OE) mlx4_en(OE) dell_rbu cachefiles fscache sb
_edac intel_powerclamp coretemp intel_rapl iosf_mbi kvm_intel kvm mgag200 i2c_algo_bit irqbypass ttm drm_kms_helper crc32_pclmul syscopyarea ghash_clmulni_intel sysfillrect sysimgblt aesni_intel lrw fb_sys_fops 
gf128mul drm iTCO_wdt iTCO_vendor_support glue_helper ablk_helper dcdbas cryptd cdc_ether usbnet mii drm_panel_orientation_quirks pcspkr sg ipmi_si joydev wmi lpc_ich ipmi_devintf mei_me mei ipmi_msghandler acpi
_pad acpi_power_meter ext4 mbcache jbd2 loop auth_rpcgss sunrpc ip_tables xfs mlx4_ib(OE) ib_core(OE) sr_mod cdrom sd_mod crc_t10dif crct10dif_generic bnx2x ahci crct10dif_pclmul
[1230992.602267]  mlx4_core(OE) crct10dif_common libahci mlx_compat(OE) crc32c_intel mdio devlink ptp libata megaraid_sas pps_core libcrc32c
[1230992.614517] CPU: 8 PID: 42801 Comm: ptlrpcd_00_10 Kdump: loaded Tainted: G           OEL ------------ T 3.10.0-957.10.1.el7.x86_64 #1
[1230992.628103] Hardware name: Dell Inc. PowerEdge R620/0GFKVD, BIOS 2.7.0 05/23/2018
[1230992.636647] task: ffff8eb9f3334100 ti: ffff8eb9f47c4000 task.ti: ffff8eb9f47c4000
[1230992.645192] RIP: 0010:[<ffffffff9cd12226>]  [<ffffffff9cd12226>] native_queued_spin_lock_slowpath+0x126/0x200
[1230992.656465] RSP: 0018:ffff8eb9f47c7b78  EFLAGS: 00000246
[1230992.662583] RAX: 0000000000000000 RBX: ffffffffc1014956 RCX: 0000000000410000
[1230992.670738] RDX: ffff8eba1f09b780 RSI: 0000000000290001 RDI: ffff8eb21a6e0a00
[1230992.678893] RBP: ffff8eb9f47c7b78 R08: ffff8eb21f91b780 R09: 0000000000000000
[1230992.687047] R10: 0000000000000000 R11: 000000000000000f R12: ffff8eb01a572a00
[1230992.695201] R13: 0005c991f29a0f30 R14: ffff8eb69a22b000 R15: 0000000000000007
[1230992.703356] FS:  0000000000000000(0000) GS:ffff8eb21f900000(0000) knlGS:0000000000000000
[1230992.712577] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[1230992.719182] CR2: 00007fd45e897090 CR3: 0000000fb8c10000 CR4: 00000000001607e0
[1230992.727339] Call Trace:
[1230992.730262]  [<ffffffff9d35cfcb>] queued_spin_lock_slowpath+0xb/0xf
[1230992.737453]  [<ffffffff9d36b480>] _raw_spin_lock+0x20/0x30
[1230992.743785]  [<ffffffffc0bbe418>] cfs_percpt_lock+0x58/0x110 [libcfs]
[1230992.751181]  [<ffffffffc0c3c9d8>] LNetMDUnlink+0x78/0x180 [lnet]
[1230992.758106]  [<ffffffffc0fa685f>] ptlrpc_unregister_reply+0xbf/0x790 [ptlrpc]
[1230992.766280]  [<ffffffffc0fab4fe>] ptlrpc_expire_one_request+0xee/0x520 [ptlrpc]
[1230992.774648]  [<ffffffffc0fab9df>] ptlrpc_expired_set+0xaf/0x1a0 [ptlrpc]
[1230992.782343]  [<ffffffffc0fdae8c>] ptlrpcd+0x28c/0x550 [ptlrpc]
[1230992.789047]  [<ffffffff9ccd67f0>] ? wake_up_state+0x20/0x20
[1230992.795477]  [<ffffffffc0fdac00>] ? ptlrpcd_check+0x590/0x590 [ptlrpc]
[1230992.802956]  [<ffffffff9ccc1c71>] kthread+0xd1/0xe0
[1230992.808590]  [<ffffffff9ccc1ba0>] ? insert_kthread_work+0x40/0x40
[1230992.815585]  [<ffffffff9d375c37>] ret_from_fork_nospec_begin+0x21/0x21
[1230992.823063]  [<ffffffff9ccc1ba0>] ? insert_kthread_work+0x40/0x40

When I checked at the time, the login node (a Lustre client) was completely loaded, with all 32 of its CPUs at 100% (in ptlrpcd).

Then, the client recovered with the following logs:

[1231021.312842] Lustre: 42815:0:(client.c:2132:ptlrpc_expire_one_request()) @@@ Request sent has timed out for sent delay: [sent 1554765376/real 1554764903]  req@ffff8eb33a496000 x1629003522849584/t0(0) o103->fir-OST0007-osc-ffff8eba05ae6000@10.0.10.102@o2ib7:17/18 lens 328/224 e 0 to 1 dl 1554765478 ref 2 fl Rpc:X/2/ffffffff rc 0/-1
[1231021.345646] Lustre: 42815:0:(client.c:2132:ptlrpc_expire_one_request()) Skipped 1744666 previous similar messages
[1231161.240794] LNetError: 42764:0:(lib-msg.c:811:lnet_is_health_check()) Msg is in inconsistent state, don't perform health checking (-125, 0)
[1231161.254982] LNetError: 42764:0:(lib-msg.c:811:lnet_is_health_check()) Skipped 1 previous similar message
[1231171.900676] LNetError: 42765:0:(lib-msg.c:811:lnet_is_health_check()) Msg is in inconsistent state, don't perform health checking (-125, 0)
[1231171.914863] LNetError: 42765:0:(lib-msg.c:811:lnet_is_health_check()) Skipped 1 previous similar message
[1231247.531671] Lustre: fir-OST000d-osc-ffff8eba05ae6000: Connection to fir-OST000d (at 10.0.10.104@o2ib7) was lost; in progress operations using this service will wait for recovery to complete
[1231247.550706] Lustre: Skipped 104 previous similar messages                  
[1231374.205074] LNetError: 42762:0:(lib-msg.c:811:lnet_is_health_check()) Msg is in inconsistent state, don't perform health checking (-125, 0)
[1231374.219259] LNetError: 42762:0:(lib-msg.c:811:lnet_is_health_check()) Skipped 1 previous similar message
[1231416.399571] LNetError: 42768:0:(lib-msg.c:811:lnet_is_health_check()) Msg is in inconsistent state, don't perform health checking (-125, 0)
[1231416.413765] LNetError: 42768:0:(lib-msg.c:811:lnet_is_health_check()) Skipped 1 previous similar message
[1231486.750358] LNetError: 42768:0:(lib-msg.c:811:lnet_is_health_check()) Msg is in inconsistent state, don't perform health checking (-125, 0)
[1231486.755230] LustreError: 167-0: fir-MDT0003-mdc-ffff8eba05ae6000: This client was evicted by fir-MDT0003; in progress operations using this service will fail.
[1231486.755365] LustreError: 34127:0:(file.c:4393:ll_inode_revalidate_fini()) fir: revalidate FID [0x28000f583:0x2cf0:0x0] error: rc = -5
[1231486.755368] LustreError: 34127:0:(file.c:4393:ll_inode_revalidate_fini()) Skipped 1 previous similar message
[1231486.805305] LNetError: 42768:0:(lib-msg.c:811:lnet_is_health_check()) Skipped 1 previous similar message
[1231486.844678] Lustre: fir-MDT0003-mdc-ffff8eba05ae6000: Connection restored to 10.0.10.52@o2ib7 (at 10.0.10.52@o2ib7)
[1231486.856531] Lustre: Skipped 152 previous similar messages                  
[1231555.033198] Lustre: Evicted from MGS (at 10.0.10.51@o2ib7) after server handle changed from 0x5c08f3702ce50dae to 0x5c08f37030f7e96c
[1231557.546491] Lustre: Evicted from fir-MDT0000_UUID (at 10.0.10.51@o2ib7) after server handle changed from 0x5c08f3702ce68e45 to 0x5c08f37030f7e957
[1231557.561265] LustreError: 167-0: fir-MDT0000-mdc-ffff8eba05ae6000: This client was evicted by fir-MDT0000; in progress operations using this service will fail.
[1231557.578742] LustreError: 31049:0:(file.c:4393:ll_inode_revalidate_fini()) fir: revalidate FID [0x240000406:0x138:0x0] error: rc = -5
[1231557.592241] LustreError: 31049:0:(file.c:4393:ll_inode_revalidate_fini()) Skipped 9 previous similar messages
[1231558.858554] LustreError: 32190:0:(file.c:216:ll_close_inode_openhandle()) fir-clilmv-ffff8eba05ae6000: inode [0x20000fd69:0x154bf:0x0] mdc close failed: rc = -5
[1231566.830744] LustreError: 32327:0:(file.c:4393:ll_inode_revalidate_fini()) fir: revalidate FID [0x240000406:0x138:0x0] error: rc = -5
[1231566.844246] LustreError: 32327:0:(file.c:4393:ll_inode_revalidate_fini()) Skipped 2 previous similar messages
[1231596.575109] LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail
[1231621.286801] Lustre: 42791:0:(client.c:2132:ptlrpc_expire_one_request()) @@@ Request sent has timed out for sent delay: [sent 1554766045/real 0]  req@ffff8eace510a900 x1629003644418896/t0(0) o103->fir-OST002b-osc-ffff8eba05ae6000@10.0.10.108@o2ib7:17/18 lens 328/224 e 0 to 1 dl 1554766077 ref 2 fl Rpc:X/0/ffffffff rc 0/-1
[1231621.318732] Lustre: 42791:0:(client.c:2132:ptlrpc_expire_one_request()) Skipped 1009312 previous similar messages
[1231723.524429] LustreError: 43001:0:(mgc_request.c:599:do_requeue()) failed processing log: -5
[1231873.637411] Lustre: DEBUG MARKER: Mon Apr  8 16:32:27 2019   

We are not running with the patches from LU-11100 for now. I'll see if I can find the original crash dump.

Comment by Stephane Thiell [ 09/Apr/19 ]

Uploaded original crash dump of fir-rbh01 to your FTP as vmcore-fir-rbh01-2019-01-28-15-10-41.gz

The kernel used is 3.10.0-957.1.3.el7_lustre.x86_64 (even though it's not a Lustre server).
kernel-debuginfo-3.10.0-957.1.3.el7_lustre.x86_64.rpm and -common should already be available on the FTP.

The issue that occurred today is on a server with less CPUs (dual Xeon CPU E5-2650 v2) and running kernel 3.10.0-957.10.1.el7.x86_64 (unpatched).

Generated at Sat Feb 10 02:47:52 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.