[LU-11100] Clients hang in LNetMDUnlink Created: 25/Jun/18 Updated: 03/Jun/21 Resolved: 06/Jan/21 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.10.3 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Critical |
| Reporter: | Mahmoud Hanafi | Assignee: | Amir Shehata (Inactive) |
| Resolution: | Incomplete | Votes: | 0 |
| Labels: | lnet | ||
| Environment: | client sles12sp2, Lustre 2.10.3 |
| Attachments: | |
| Issue Links: | |
| Severity: | 2 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
Clients hang in LNetMDUnlink. May be a dup of LU-11092 and LU-10669.
[166855.238376] CPU: 33 PID: 2938 Comm: ptlrpcd_01_02 Tainted: P W OEL NX 4.4.90-92.45.1.20171031-nasa #1
[166855.238378] Hardware name: SGI.COM SUMMIT/S2600GZ, BIOS SE5C600.86B.02.02.0002.122320131210 12/23/2013
[166855.238381] task: ffff8807db820bc0 ti: ffff8807db824000 task.ti: ffff8807db824000
[166855.238383] RIP: 0010:[<ffffffff810cc0a1>] [<ffffffff810cc0a1>] native_queued_spin_lock_slowpath+0x111/0x1a0
[166855.238392] RSP: 0018:ffff8807db827b98 EFLAGS: 00000246
[166855.238393] RAX: 0000000000000000 RBX: ffff880fe93574e0 RCX: 0000000000880000
[166855.238395] RDX: ffff88081e2567c0 RSI: 0000000000280001 RDI: ffff88101cdb6e00
[166855.238396] RBP: ffff8807db827b98 R08: ffff88101db567c0 R09: 0000000000000000
[166855.238398] R10: 0000000000000000 R11: ffff880ee98f8817 R12: 0000000000000008
[166855.238400] R13: 000000000a222d0f R14: 0000000000000001 R15: 0000000000000000
[166855.238402] FS: 0000000000000000(0000) GS:ffff88101db40000(0000) knlGS:0000000000000000
[166855.238403] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[166855.238405] CR2: 0000000000641038 CR3: 0000000001afe000 CR4: 00000000001406e0
[166855.238407] Stack:
[166855.238408] ffff8807db827ba8 ffffffff8119162a ffff8807db827bb8 ffffffff8161e640
[166855.238411] ffff8807db827be0 ffffffffa0a96683 ffffffffa1dc78e7 0000000000000001
[166855.238414] 000000002888b43d ffff8807db827cb8 ffffffffa0b254f5 ffffffffa1dc78d8
[166855.238417] Call Trace:
[166855.238431] [<ffffffff8119162a>] queued_spin_lock_slowpath+0xb/0xf
[166855.238439] [<ffffffff8161e640>] _raw_spin_lock+0x20/0x30
[166855.238467] [<ffffffffa0a96683>] cfs_percpt_lock+0x53/0x100 [libcfs]
[166855.238510] [<ffffffffa0b254f5>] LNetMDUnlink+0x65/0x150 [lnet]
[166855.238573] [<ffffffffa1d5cc88>] ptlrpc_unregister_reply+0xf8/0x6f0 [ptlrpc]
[166855.238636] [<ffffffffa1d616d8>] ptlrpc_expire_one_request+0xb8/0x430 [ptlrpc]
[166855.238674] [<ffffffffa1d61aff>] ptlrpc_expired_set+0xaf/0x190 [ptlrpc]
[166855.238719] [<ffffffffa1d8f998>] ptlrpcd+0x258/0x4e0 [ptlrpc]
[166855.238729] [<ffffffff8109f276>] kthread+0xd6/0xf0
[166855.238735] [<ffffffff8161ed3f>] ret_from_fork+0x3f/0x70
[166855.241341] DWARF2 unwinder stuck at ret_from_fork+0x3f/0x70
[166855.241342]
[166855.241343] Leftover inexact backtrace:
[166855.241348] [<ffffffff8109f1a0>] ? kthread_park+0x60/0x60
We will try to get a reproducer. |
| Comments |
| Comment by Joseph Gmitter (Inactive) [ 26/Jun/18 ] |
|
Hi Amir, Can you please have a look and see if you think it is a duplicate with a known fix or a new issue? Thanks. Joe |
| Comment by Mahmoud Hanafi [ 27/Jun/18 ] |
|
I got this to reproduce using the following steps.
You will see connections to some OSTs get lost and all ptlrpcd* threads go into a 100% run state. Then you will get soft lockup stack dumps.
|
| Comment by Amir Shehata (Inactive) [ 28/Jun/18 ] |
|
Did this problem start happening with 2.10.4 only? Would we be able to try the reproducer on 2.10.3? Can we also get a crash dump with the system in that state? |
| Comment by Mahmoud Hanafi [ 28/Jun/18 ] |
|
I am using 2.10.3 plus patches. This did start with 2.10.3 and SLES12 SP2. I will get a crash dump and upload it. |
| Comment by Amir Shehata (Inactive) [ 28/Jun/18 ] |
|
I was looking through the tickets and found a very similar problem on an internal ticket, with the same stack trace, hit on 2.5ish (2.5 + other patches). So this might not be new to 2.10.3. The crash dump might show us the state of the lock and whether it's being held by another process. Would you be able to also include the kernel and Lustre debug RPMs so I can get hold of the vmlinux and debug symbols? That will make it easier to look through the crash. Thanks. |
| Comment by Mahmoud Hanafi [ 05/Jul/18 ] |
|
Amir, Have you had a chance to look over the crash dumps? |
| Comment by Amir Shehata (Inactive) [ 05/Jul/18 ] |
|
I'm having trouble opening the crash file:

crash: ./usr/lib/debug/boot/vmlinux-4.4.126-94.22.2.20180514-nasa.debug: no text and data contents
crash: The namelist argument supplied in this case is a debuginfo file, which must be accompanied by the kernel file from which it was derived.

I'm wondering if I need to run crash on SLES. I'll try setting up a VM with SLES installed. In the meantime, are you able to start up crash with:

crash ../vmcore ./usr/lib/debug/boot/vmlinux-4.4.126-94.22.2.20180514-nasa.debug |
| Comment by Mahmoud Hanafi [ 05/Jul/18 ] |
|
Crash on SLES is a little different. This is how I start it:

crash vmcore vmlinux-4.4.126-94.22.2.20180514-nasa.gz

You may need a SLES system. |
| Comment by Amir Shehata (Inactive) [ 05/Jul/18 ] |
|
I'm setting up one right now. Will let you know how it goes. |
| Comment by Amir Shehata (Inactive) [ 06/Jul/18 ] |
|
Looking at the crash dump, it's not immediately obvious to me where or if there is a deadlock. I see that the res_lock for both CPTs in the system is taken, but the rest of the stack traces don't show who has it locked. My examination of the code so far doesn't reveal a place where we're not unlocking. I'll continue examining the dump and the code. In the meantime, would we be able to determine when this hang started happening? Can we try 2.10.2 and, if it happens there, then 2.10.1? I'm just trying to narrow down the set of changes that could've caused this issue. |
| Comment by Amir Shehata (Inactive) [ 06/Jul/18 ] |
|
Also, is it possible to get a link to the Lustre tree you're using, so I can make sure I'm matching the stack trace with the code properly? |
| Comment by Jay Lan (Inactive) [ 06/Jul/18 ] |
|
My 2.10.3 git repo is at git://github.com/jlan/lustre-nas. If you need my 2.7.3-fe git repo, please give me your GitHub ID so that I can grant you access. You can email me your GitHub ID. Thanks! |
| Comment by Amir Shehata (Inactive) [ 13/Jul/18 ] |
|
The crash dump shows that the MD/ME lists are growing extensively, in excess of 5 million entries. This could explain why we're seeing soft lockups. In LNetMDUnlink() we call lnet_res_lh_lookup(); although this goes through a hash, with that many entries the buckets still collide and the per-bucket lists grow long. The next step is to understand why the lists are growing. Would it be OK to provide a debug patch to confirm that this is the issue and to extract further information, if need be?
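For readers following along: the pattern being described is a cookie-to-object hash lookup performed while the per-CPT resource lock is held (the cfs_percpt_lock() visible in the stack traces above). The sketch below is only a simplified, hypothetical illustration of that pattern, not the actual Lustre code; the structure, field, and constant names are made up. It shows why bucket lists of millions of entries turn every LNetMDUnlink() into a long list walk under a spinlock that other ptlrpcd threads are spinning on.

```c
/* Hypothetical, simplified illustration of a handle-cookie hash lookup
 * done under a resource lock; names do not match the real Lustre code. */
#include <linux/list.h>
#include <linux/spinlock.h>
#include <linux/types.h>

#define RES_HASH_BITS  9
#define RES_HASH_SIZE  (1 << RES_HASH_BITS)        /* e.g. 512 buckets */

struct res_container {
	spinlock_t       rc_lock;                  /* stand-in for the per-CPT res_lock */
	struct list_head rc_hash[RES_HASH_SIZE];   /* cookie -> object hash */
};

struct res_handle {
	struct list_head rh_hash_chain;
	__u64            rh_cookie;
};

/*
 * Caller holds rc_lock.  With ~5 million handles spread over a few hundred
 * buckets, each bucket averages thousands of nodes, so this walk takes a
 * long time while other threads spin waiting for the same lock, which is
 * consistent with the time spent in native_queued_spin_lock_slowpath above.
 */
static struct res_handle *res_lh_lookup(struct res_container *rec, __u64 cookie)
{
	unsigned int       bucket = cookie & (RES_HASH_SIZE - 1);
	struct res_handle *lh;

	list_for_each_entry(lh, &rec->rc_hash[bucket], rh_hash_chain)
		if (lh->rh_cookie == cookie)
			return lh;

	return NULL;
}
```
 |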
| Comment by Mahmoud Hanafi [ 13/Jul/18 ] |
|
Yes a debug patch will work. |
| Comment by Amir Shehata (Inactive) [ 16/Jul/18 ] |
|
Attached is a debug patch based on the GitHub repo you pointed me to. It times lnet_res_lh_lookup() and prints an error message to the console if it took more than 5 seconds to finish. When you reproduce the problem, can we monitor dmesg and see if any of these messages pop up? They are tagged with "NASA_DEBUG". I'm still scrubbing through the core, printing out the length of each MD hash list to see if any one of them is of excessive length. It's taking a while.
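In spirit, the instrumentation described amounts to something like the sketch below. It is a hypothetical wrapper built on the illustrative res_lh_lookup() from the earlier sketch and standard kernel time APIs; the actual attached patch may differ in names and mechanism.

```c
/* Hypothetical timing wrapper: log to the console, tagged "NASA_DEBUG",
 * when the lookup takes 5 seconds or longer.  Not the attached patch. */
#include <linux/ktime.h>
#include <linux/printk.h>

static struct res_handle *res_lh_lookup_timed(struct res_container *rec, __u64 cookie)
{
	ktime_t            start = ktime_get();
	struct res_handle *lh    = res_lh_lookup(rec, cookie);
	s64                secs  = ktime_to_ns(ktime_sub(ktime_get(), start)) / NSEC_PER_SEC;

	if (secs >= 5)
		pr_err("NASA_DEBUG: lnet_res_lh_lookup() took %lld seconds\n",
		       (long long)secs);

	return lh;
}
```
 |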
| Comment by Amir Shehata (Inactive) [ 18/Jul/18 ] |
|
Oleg has also pointed me to this ticket which seems to be related to the issue you're running into. Would you be able to apply the patch and see if it resolves the hang? https://jira.whamcloud.com/browse/LU-11079
|
| Comment by Jay Lan (Inactive) [ 18/Jul/18 ] |
|
Hi Amir, |
| Comment by Amir Shehata (Inactive) [ 19/Jul/18 ] |
|
Hi Jay, I've attached the patch ported to the NASA repo. Let me know how it goes. |
| Comment by Jay Lan (Inactive) [ 19/Jul/18 ] |
|
Thank you Amir. |
| Comment by Mahmoud Hanafi [ 31/Jul/18 ] |
|
I tried the patch and was unable to reproduce the hang. We will try the patch on some production servers for additional testing. |
| Comment by Jay Lan (Inactive) [ 02/Aug/18 ] |
|
Hi, Could you also back port the Thanks,
|
| Comment by Jay Lan (Inactive) [ 03/Aug/18 ] |
|
Ah, I figured out the problem that caused the compilation errors. The nasa_LU_11079.patch in the attachments is not the same as the patch in the b2_10 reviews. Applying the patch from the attachments to nas-2.11.0 addressed the merge and compilation problem. A back port to b2_11 is not needed. Thanks. |
| Comment by Jay Lan (Inactive) [ 05/Sep/18 ] |
|
Amir, We still hit this problem with Jay |
| Comment by Julien Wallior [ 05/Sep/18 ] |
|
FYI: This sounds similar to an issue we hit: LU-11092. We fixed it by changing the ldlm lru_* settings. |
| Comment by Jay Lan (Inactive) [ 06/Sep/18 ] |
|
Hi Julien, Thank you for your info. I did not realize that Mahmoud had already done the ldlm settings at our site. I was trying to find out if I missed any patch that I need to cherry-pick. Thanks. |
| Comment by Mahmoud Hanafi [ 07/Sep/18 ] |
|
We tried the settings recommended in LU-11092. But it is making things worse because we are hitting |
| Comment by Jay Lan (Inactive) [ 10/Sep/18 ] |
|
For documentation purposes, backporting |
| Comment by Mahmoud Hanafi [ 22/Jan/20 ] |
|
We are hitting this with Lustre 2.12.3, so the above patches didn't fix the issue.

Jan 21 10:19:20 pfe24 kernel: [1579630760.775887] Modules linked in: iptable_nat(E) nf_nat_ipv4(E) nf_nat(E) binfmt_misc(E) fuse(E) mgc(OEN) lustre(OEN) lmv(OEN) mdc(OEN) fid(OEN) osc(OEN) rpcsec_gss_krb5(E) auth_rpcgss(E) lov(OEN) nfsv4(E) fld(OEN) dns_resolver(E) ko2iblnd(OEN) ptlrpc(OEN) obdclass(OEN) lnet(OEN) nfsv3(E) nfs_acl(E) nfs(E) lockd(E) grace(E) fscache(E) libcfs(OEN) rdma_ucm(OEX) ib_ucm(OEX) rdma_cm(OEX) iw_cm(OEX) configfs(E) ib_ipoib(OEX) ib_cm(OEX) ib_umad(OEX) bonding(E) iscsi_ibft(E) iscsi_boot_sysfs(E) nf_log_ipv6(E) nf_log_common(E) xt_LOG(E) nf_conntrack_ipv6(E) nf_defrag_ipv6(E) ip6table_filter(E) ip6_tables(E) xt_tcpudp(E) nf_conntrack_ipv4(E) nf_defrag_ipv4(E) xt_conntrack(E) iptable_filter(E) xt_CT(E) nf_conntrack(E) libcrc32c(E) iptable_raw(E) ip_tables(E) x_tables(E) mlx4_ib(OEX) ib_uverbs(OEX) ib_core(OEX)
Jan 21 10:19:20 pfe24 kernel: [1579630760.775887] tcp_bic(EN) intel_rapl(E) x86_pkg_temp_thermal(E) intel_powerclamp(E) coretemp(E) kvm_intel(E) kvm(E) irqbypass(E) ipmi_ssif(E) crc32_pclmul(E) ghash_clmulni_intel(E) pcbc(E) mlx4_core(OEX) aesni_intel(E) iTCO_wdt(E) iTCO_vendor_support(E) aes_x86_64(E) crypto_simd(E) glue_helper(E) mlx_compat(OEX) devlink(E) cryptd(E) ipmi_si(E) ioatdma(E) igb(E) mei_me(E) mei(E) lpc_ich(E) wmi(E) ipmi_devintf(E) ipmi_msghandler(E) i2c_i801(E) shpchp(E) mfd_core(E) pcspkr(E) dca(E) button(E) acpi_cpufreq(E) sunrpc(E) ext4(E) crc16(E) jbd2(E) mbcache(E) sd_mod(E) csiostor(E) sr_mod(E) cdrom(E) mgag200(E) i2c_algo_bit(E) drm_kms_helper(E) syscopyarea(E) sysfillrect(E) sysimgblt(E) fb_sys_fops(E) ttm(E) isci(EX) ahci(E) cxgb4(E) drm(E) libahci(E) libsas(E) ptp(E) crc32c_intel(E) serio_raw(E) drm_panel_orientation_quirks(E)
Jan 21 10:19:20 pfe24 kernel: [1579630760.775887] libata(E) scsi_transport_fc(E) pps_core(E) scsi_transport_sas(E) hwperf(OEX) numatools(OEX) xpmem(OEX) gru(E) sg(E) dm_multipath(E) dm_mod(E) scsi_dh_rdac(E) scsi_dh_emc(E) scsi_dh_alua(E) scsi_mod(E) autofs4(E)
Jan 21 10:19:20 pfe24 kernel: [1579630760.775887] Supported: No, Unreleased kernel
Jan 21 10:19:20 pfe24 kernel: [1579630760.775887] CPU: 10 PID: 4337 Comm: ptlrpcd_01_05 Tainted: G OEL 4.12.14-95.40.1.20191112-nasa #1 SLE12-SP4 (unreleased)
Jan 21 10:19:20 pfe24 kernel: [1579630760.775887] Hardware name: SGI.COM C1104-RP7/X9DRW-3LN4F+/X9DRW-3TF+, BIOS 3.00 09/12/2013
Jan 21 10:19:20 pfe24 kernel: [1579630760.775887] task: ffff91f29dfeca00 task.stack: ffffa1c3cb810000
Jan 21 10:19:20 pfe24 kernel: [1579630760.775887] RIP: 0010:native_queued_spin_lock_slowpath+0xda/0x1d0
Jan 21 10:19:20 pfe24 kernel: [1579630760.775887] RSP: 0018:ffffa1c3cb813c30 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff10
Jan 21 10:19:20 pfe24 kernel: [1579630760.775887] RAX: 0000000000000000 RBX: 0000000000000001 RCX: 00000000002c0000
Jan 21 10:19:20 pfe24 kernel: [1579630760.775887] RDX: ffff91fadf2a3a00 RSI: ffff91fadf3e3a00 RDI: ffff91facfa7d040
Jan 21 10:19:20 pfe24 kernel: [1579630760.775887] RBP: ffffa1c3cb813d10 R08: 0000000000000000 R09: 0000000000000150
Jan 21 10:19:20 pfe24 kernel: [1579630760.775887] R10: 000000000000002d R11: ffff91f982e8d817 R12: 000000007068765d
Jan 21 10:19:20 pfe24 kernel: [1579630760.775887] R13: 000000001c1a1d97 R14: 0000000000000000 R15: ffff91f29dfeca00
Jan 21 10:19:20 pfe24 kernel: [1579630760.775887] FS: 0000000000000000(0000) GS:ffff91fadf280000(0000) knlGS:0000000000000000
Jan 21 10:19:20 pfe24 kernel: [1579630760.775887] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Jan 21 10:19:20 pfe24 kernel: [1579630760.775887] CR2: 00002aaaaacc4000 CR3: 0000000d3700a001 CR4: 00000000000606e0
Jan 21 10:19:20 pfe24 kernel: [1579630760.775887] Call Trace:
Jan 21 10:19:20 pfe24 kernel: [1579630760.775887] queued_spin_lock_slowpath+0x7/0xa
Jan 21 10:19:20 pfe24 kernel: [1579630760.775887] LNetMDUnlink+0x65/0x150 [lnet]
Jan 21 10:19:20 pfe24 kernel: [1579630760.775887] ptlrpc_unregister_reply+0xf2/0x6f0 [ptlrpc]
Jan 21 10:19:20 pfe24 kernel: [1579630760.775887] ? ptlrpc_set_import_discon+0xf5/0x6e0 [ptlrpc]
Jan 21 10:19:20 pfe24 kernel: [1579630760.775887] ptlrpc_expire_one_request+0xe4/0x4d0 [ptlrpc]
Jan 21 10:19:20 pfe24 kernel: [1579630760.775887] ptlrpc_expired_set+0xa9/0x180 [ptlrpc]
Jan 21 10:19:20 pfe24 kernel: [1579630760.775887] ptlrpcd+0x22e/0x4a0 [ptlrpc]
Jan 21 10:19:20 pfe24 kernel: [1579630760.775887] ? wake_up_q+0x70/0x70
Jan 21 10:19:20 pfe24 kernel: [1579630760.775887] kthread+0xff/0x140
Jan 21 10:19:20 pfe24 kernel: [1579630760.775887] ? ptlrpcd_check+0x560/0x560 [ptlrpc]
Jan 21 10:19:20 pfe24 kernel: [1579630760.775887] ? __kthread_parkme+0x70/0x70
Jan 21 10:19:20 pfe24 kernel: [1579630760.775887] ret_from_fork+0x35/0x40 |
| Comment by Mahmoud Hanafi [ 22/Apr/20 ] |
|
We hit this issue again today. |
| Comment by Amir Shehata (Inactive) [ 23/Apr/20 ] |
|
Is it possible to grab the stack traces of all the tasks when we hit this issue:

echo t > /proc/sysrq-trigger

It would be useful to see who's holding the lock. Last I looked at the crash dump for this case, it looked like the MD/ME lists were growing, so I was suspecting that it takes a long time to go through them. |
| Comment by Mahmoud Hanafi [ 06/Jan/21 ] |
|
This can be closed |