[LU-10635] MGS kernel panic when configuring nodemaps and filesets Created: 08/Feb/18 Updated: 23/Oct/20 Resolved: 23/Oct/20 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.10.2 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Minor |
| Reporter: | Kim Sebo | Assignee: | Emoly Liu |
| Resolution: | Won't Fix | Votes: | 1 |
| Labels: | None |
| Environment: |
Lustre 2.10.2, DNE, ZFS 0.7.3, CentOS 7.4 |
||
| Attachments: |
|
| Issue Links: |
|
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
We (ANU-NCI) are attempting to configure the lustre nodemap and fileset features and have run into two problems.

1: Toggling the nodemap_activate flag from 0 to 1 appears to clear/reset the existing nodemap "fileset" property on all the MDS/OSS nodes. This appears to be similar to

This command sequence should reproduce the problems:

MGS> lctl nodemap_activate=0

This command hangs and is unkillable. At this point the MGS stops responding to clients, though the MDS running on the same host seems OK. Any new client attempt to mount the filesystem fails with:

mount.lustre: mount 10.112.1.41@o2ib8:10.112.1.42@o2ib8:/fs/sub1/sub2 at /mnt/sub2 failed: Input/output error

Attempting to unmount the MGT (either manually, or as part of HA failover) results in:

[ 2860.466515] BUG: unable to handle kernel NULL pointer dereference at 000000000000001c
[ 2860.501712] IP: [<ffffffffc11048ce>] ldlm_process_plain_lock+0x6e/0xb30 [ptlrpc]
[ 2860.534877] PGD 0
[ 2860.544669] Oops: 0000 [#1] SMP
[ 2860.561965] Modules linked in: osp(OE) mdd(OE) lod(OE) mdt(OE) lfsck(OE) mgs(OE) mgc(OE) osd_zfs(OE) lquota(OE) fid(OE) fld(OE) ptlrpc(OE) obdclass(OE) ko2iblnd(OE) lnet(OE) libcfs(OE) bonding rdma_ucm(OE) ib_ucm(OE) rdma_cm(OE) iw_cm(OE) ib_ipoib(OE) ib_cm(OE) ib_uverbs(OE) ib_umad(OE) mlx5_fpga_tools(OE) mlx4_en(OE) mlx4_ib(OE) mlx4_core(OE) dm_mirror dm_region_hash dm_log zfs(POE) zunicode(POE) zavl(POE) icp(POE) zcommon(POE) znvpair(POE) spl(OE) sb_edac edac_core intel_powerclamp coretemp intel_rapl dm_round_robin iosf_mbi kvm_intel kvm irqbypass crc32_pclmul ipmi_si ghash_clmulni_intel iTCO_wdt dm_multipath sg joydev iTCO_vendor_support aesni_intel ipmi_devintf hpwdt hpilo lrw gf128mul glue_helper ablk_helper cryptd ioatdma pcspkr i2c_i801 wmi dm_mod ipmi_msghandler lpc_ich shpchp dca acpi_cpufreq
[ 2860.882396] acpi_power_meter nfsd auth_rpcgss nfs_acl lockd grace knem(OE) sunrpc ip_tables xfs libcrc32c mlx5_ib(OE) ib_core(OE) sd_mod crc_t10dif crct10dif_generic mgag200 i2c_algo_bit drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm mlx5_core(OE) mlxfw(OE) mlx_compat(OE) tg3 devlink crct10dif_pclmul serio_raw crct10dif_common ptp drm crc32c_intel i2c_core pps_core hpsa(OE) scsi_transport_sas
[ 2861.040539] CPU: 27 PID: 4459 Comm: ldlm_bl_24 Tainted: P OE ------------ 3.10.0-693.5.2.el7.x86_64 #1
[ 2861.092592] Hardware name: HP ProLiant DL360 Gen9/ProLiant DL360 Gen9, BIOS P89 04/25/2017
[ 2861.129810] task: ffff88becf800fd0 ti: ffff88b24b3c0000 task.ti: ffff88b24b3c0000
[ 2861.163313] RIP: 0010:[<ffffffffc11048ce>] [<ffffffffc11048ce>] ldlm_process_plain_lock+0x6e/0xb30 [ptlrpc]
[ 2861.207431] RSP: 0018:ffff88b24b3c3be0 EFLAGS: 00010287
[ 2861.231433] RAX: 0000000000000000 RBX: ffff88b774f53800 RCX: ffff88b24b3c3c7c
[ 2861.263036] RDX: 0000000000000002 RSI: ffff88b24b3c3c80 RDI: ffff88b774f53800
[ 2861.295068] RBP: ffff88b24b3c3c58 R08: ffff88b24b3c3cd0 R09: ffff88beff257880
[ 2861.327085] R10: ffff88b774f53800 R11: 0000000000000005 R12: ffff885efdd5fcc0
[ 2861.359171] R13: 0000000000000002 R14: ffff88b24b3c3cd0 R15: ffff88b774f53860
[ 2861.391235] FS: 0000000000000000(0000) GS:ffff88befee40000(0000) knlGS:0000000000000000
[ 2861.428020] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 2861.454884] CR2: 000000000000001c CR3: 0000005e60ab7000 CR4: 00000000003407e0
[ 2861.487917] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 2861.520372] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 2861.552721] Stack:
[ 2861.562219] ffff88b24b3c3c7c ffff88b24b3c3cd0 ffff88b24b3c3c80 0000000000000000
[ 2861.599583] ffff885efdd5fca0 0000001000000001 ffff885e00000010 ffff88b24b3c3c18
[ 2861.632738] ffff88b24b3c3c18 0000000013b7eb66 0000000000000002 ffff885efdd5fcc0
[ 2861.666158] Call Trace:
[ 2861.677268] [<ffffffffc1104860>] ? ldlm_errno2error+0x60/0x60 [ptlrpc]
[ 2861.706930] [<ffffffffc10ef9db>] ldlm_reprocess_queue+0x13b/0x2a0 [ptlrpc]
[ 2861.738259] [<ffffffffc10f057d>] __ldlm_reprocess_all+0x14d/0x3a0 [ptlrpc]
[ 2861.769602] [<ffffffffc10f0b30>] ldlm_reprocess_res+0x20/0x30 [ptlrpc]
[ 2861.799258] [<ffffffffc0a36bef>] cfs_hash_for_each_relax+0x21f/0x400 [libcfs]
[ 2861.831735] [<ffffffffc10f0b10>] ? ldlm_lock_downgrade+0x320/0x320 [ptlrpc]
[ 2861.863592] [<ffffffffc10f0b10>] ? ldlm_lock_downgrade+0x320/0x320 [ptlrpc]
[ 2861.895390] [<ffffffffc0a39d95>] cfs_hash_for_each_nolock+0x75/0x1c0 [libcfs]
[ 2861.928211] [<ffffffffc10f0b7c>] ldlm_reprocess_recovery_done+0x3c/0x110 [ptlrpc]
[ 2861.962124] [<ffffffffc10f17bc>] ldlm_export_cancel_locks+0x11c/0x130 [ptlrpc]
[ 2861.994962] [<ffffffffc111ada8>] ldlm_bl_thread_main+0x4c8/0x700 [ptlrpc]
[ 2862.025858] [<ffffffff810c4820>] ? wake_up_state+0x20/0x20
[ 2862.050898] [<ffffffffc111a8e0>] ? ldlm_handle_bl_callback+0x410/0x410 [ptlrpc]
[ 2862.086852] [<ffffffff810b099f>] kthread+0xcf/0xe0
[ 2862.112252] [<ffffffff810b08d0>] ? insert_kthread_work+0x40/0x40
[ 2862.139704] [<ffffffff816b4fd8>] ret_from_fork+0x58/0x90
[ 2862.164047] [<ffffffff810b08d0>] ? insert_kthread_work+0x40/0x40
[ 2862.191049] Code: 89 45 a0 74 0d f6 05 b3 ac 94 ff 01 0f 85 34 06 00 00 8b 83 98 00 00 00 39 83 9c 00 00 00 89 45 b8 0f 84 57 09 00 00 48 8b 45 a0 <8b> 40 1c 85 c0 0f 84 7a 09 00 00 48 8b 4d a0 48 89 c8 48 83 c0
[ 2862.276308] RIP [<ffffffffc11048ce>] ldlm_process_plain_lock+0x6e/0xb30 [ptlrpc]
[ 2862.309889] RSP <ffff88b24b3c3be0>
[ 2862.325503] CR2: 000000000000001c |
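For context, the nodemap/fileset configuration being exercised here boils down to a short lctl sequence on the MGS. This is a minimal sketch only; the nodemap name testmgmt04 and the NID range are illustrative (matching the reproduction attempt in the comments below), not the actual names/ranges used on the production filesystem:

MGS> lctl nodemap_add testmgmt04
MGS> lctl nodemap_add_range --name testmgmt04 --range 10.112.[2-3].[1-255]@o2ib8   # illustrative client NID range
MGS> lctl set_param -P nodemap.testmgmt04.fileset='/fs/sub1/sub2'
MGS> lctl nodemap_activate 1    # toggling this from 0 to 1 is what appears to reset the fileset property
MGS> lctl nodemap_activate 0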
| Comments |
| Comment by Peter Jones [ 08/Feb/18 ] |
|
Emoly, could you please advise? Thanks, Peter |
| Comment by Emoly Liu [ 09/Feb/18 ] |
|
I ran your steps on 4 nodes (MDS1/MGS, MDS2, 1*OSS, 1*client) with Lustre 2.10.2 + ZFS, but didn't reproduce the kernel panic issue.

[root@onyx-23vm2 ~]# lctl nodemap_activate 0
[root@onyx-23vm2 ~]# lctl nodemap_add testmgmt04
[root@onyx-23vm2 ~]# lctl set_param -P nodemap.testmgmt04.fileset='/fs/sub1/sub2'
[root@onyx-23vm2 ~]# pdsh -S -Rrsh -w onyx-23vm[2-4] /sbin/lctl get_param nodemap.testmgmt04.fileset
onyx-23vm2: nodemap.testmgmt04.fileset=/fs/sub1/sub2
onyx-23vm4: nodemap.testmgmt04.fileset=/fs/sub1/sub2
onyx-23vm3: nodemap.testmgmt04.fileset=
onyx-23vm3:
[root@onyx-23vm2 ~]# lctl nodemap_activate 1
[root@onyx-23vm2 ~]# pdsh -S -Rrsh -w onyx-23vm[2-4] /sbin/lctl get_param nodemap.testmgmt04.fileset
onyx-23vm2: nodemap.testmgmt04.fileset=/fs/sub1/sub2
onyx-23vm3: nodemap.testmgmt04.fileset=
onyx-23vm3:
onyx-23vm4: nodemap.testmgmt04.fileset=
onyx-23vm4:
[root@onyx-23vm2 ~]# lctl set_param -P nodemap.testmgmt04.fileset='/fs/sub1/sub2'
[root@onyx-23vm2 ~]# pdsh -S -Rrsh -w onyx-23vm[2-4] /sbin/lctl get_param nodemap.testmgmt04.fileset
onyx-23vm4: nodemap.testmgmt04.fileset=/fs/sub1/sub2
onyx-23vm2: nodemap.testmgmt04.fileset=/fs/sub1/sub2
onyx-23vm3: nodemap.testmgmt04.fileset=/fs/sub1/sub2

Could you please collect some lustre logs on the MGS and upload them here by the following steps?
In case of the last command hanging, you can run "lctl dk" commands in a new terminal tab/window to collect the logs. |
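A typical debug-log capture on the MGS looks like the following (a sketch using standard lctl debug commands; the debug mask and output path are illustrative, not necessarily the exact steps intended):

MGS> lctl set_param debug=-1                      # enable all debug flags
MGS> lctl clear                                   # clear the existing kernel debug buffer
MGS> # reproduce the problem (e.g. the lctl nodemap commands above)
MGS> lctl dk > /tmp/lu-10635-mgs-debug.log        # dump the kernel debug buffer to a file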
| Comment by Kim Sebo [ 16/Feb/18 ] |
|
This is a production filesystem, so breaking it was not immediately possible, but I now have debug logs.

We've now suffered the problem several times again, even though nodemaps/filesets are apparently disabled (nodemap_activate=0) and removed via lctl nodemap_del. All the servers show only the "default" nodemap in /proc/fs/lustre/nodemap.

The problem can be provoked by running a dcp workload from 26 Lustre 2.5.x clients. 30-60 minutes into the transfer the MGS CPU load gets very high, with all the ll_mgs_NNN processes in the Run state and consuming CPU. Many migration/watchdog processes also become very busy. The filesystem keeps working for existing clients, but new clients cannot mount it because the MGS is unresponsive. Again, the MGS panics as soon as the MGT is unmounted.

I've attached "top" output, "lctl dk" debug output from when the MGS was in the busy/unresponsive state, and an ftrace of a busy ll_mgs_NNN process. We also have the MGS crash dumps - please advise how to get these to you. |
| Comment by Kim Sebo [ 16/Feb/18 ] |
|
This problem appears to be related to LU-10390. |
| Comment by Kim Sebo [ 27/Feb/18 ] |
|
Any progress on this ticket? This is blocking full deployment of the filesystem. |
| Comment by Emoly Liu [ 27/Feb/18 ] |
|
Kim, I am just back to work from my vacation. I will look into this issue. |
| Comment by Malcolm Haak - NCI (Inactive) [ 01/Mar/18 ] |
|
When looking at the running system with dbg we could see lots of generic LDLM locks in the process of being granted on the MGS, but none were actually being granted from what we could see. So I grabbed this:

# pwd

This looks very wrong, unless there has been some major change in the way these values are updated/reported JUST for the MGS in 2.10. Checking the mdt namespace for the mdt mounted on this server, it looks normal. I can provide it as well, though I don't believe it adds to the conversation.

We have crash dumps, and I believe it is vital that they are looked at. This is no longer a minor issue for this site.

@pjones This should be getting looked at under their support agreement... please escalate |
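For reference, lock statistics of the kind referred to above can be read through the standard ldlm namespace parameters; a minimal sketch (the exact namespace name exported for the MGS is an assumption):

MGS> lctl get_param ldlm.namespaces.MGS*.lock_count
MGS> lctl get_param ldlm.namespaces.MGS*.resource_count
MGS> lctl get_param ldlm.namespaces.MGS*.pool.granted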
| Comment by Emoly Liu [ 12/Mar/18 ] |
|
Hi kim.sebo, I created a patch at https://review.whamcloud.com/#/c/31450/ to fix the fileset issue. I'm not sure if this patch can help with your issue, so could you please give it a test if possible?

BTW, you can upload the crash dump to ftp.hpdd.intel.com, please mkdir

Thanks,
|
| Comment by Kim Sebo [ 14/Mar/18 ] |
|
mgs/mds crash dump has been uploaded to the ftp site. Will test the patch as soon as I can. |
| Comment by Malcolm Haak - NCI (Inactive) [ 21/Mar/18 ] |
|
Hi Emoly, we think we might have located a possible LU patch that could be related.
Considering the stack traces, and that the issues appeared to coincide with the addition of nodemaps, it looks like a good candidate for rollback or adjustment. As has been mentioned in both this ticket and the related ticket, this is blocking deployment of a filesystem into production. Also, ANU has paid Intel support; please escalate this ticket. |
| Comment by Emoly Liu [ 21/Mar/18 ] |
|
Thanks Malcolm. I am looking at the patch of |
| Comment by Emoly Liu [ 21/Mar/18 ] |
|
I think the patch of |
| Comment by Malcolm Haak - NCI (Inactive) [ 21/Mar/18 ] |
|
I'm sure the

The issue is that there are literally millions of pending lock grants on the MGS; this is slowing down ALL the DNE MDTs and causing stability issues. We have the MDS crashing under load during DCP runs, so we can't populate the filesystem, and we can't cleanly fail over because all the unnecessary lock load on the MGS causes the MDTs issues and causes lockups on unmount of the MGS. At this point having Pacemaker is pointless, as it can't cleanly do anything. If you look at the crash with GDB you can see they are all generic/plain locks that the MGS is dealing with, or not dealing with apparently.

I've done testing on a small cluster and it doesn't encounter issues. Unmount is quite slow compared with an IEEL3.0 system on the same VMs. I assume this means that it's seeing some of the issue, but due to the relatively small client count (1 client vs 4000) it's not enough to get the hung tasks or the null pointers.

This is seriously a huge issue. It's now been over a month and we are no closer to an answer. What can we do to speed this up? Can I get more crash dumps? Do you need a bigger test environment? I can see what I can do in that department, if it will help... |
| Comment by Gerrit Updater [ 22/Mar/18 ] |
|
Emoly Liu (emoly.liu@intel.com) uploaded a new patch: https://review.whamcloud.com/31722 |
| Comment by Emoly Liu [ 22/Mar/18 ] |
|
This is a "fortestonly" patch to fix the "kernel NULL pointer dereference at 000000000000001c in ldlm_process_plain_lock()". Obviously, the ldlm lock pointer for some MGS resource became null by the time it was accessed in ldlm_process_plain_lock(), but for now we don't know exactly where it is released. This patch only avoids accessing the null pointer and will probably cause other issues. If it's not convenient for you to apply the patches, including the patch of

What's more, the following call trace uploaded by Kim is different from this null pointer issue, but is the same as the one reported in LU-10390 recently:

[146854.840731] CPU: 23 PID: 52092 Comm: ldlm_bl_04 Tainted: P OE ------------ 3.10.0-693.5.2.el7.x86_64 #1
[146854.840732] Hardware name: HP ProLiant DL360 Gen9/ProLiant DL360 Gen9, BIOS P89 04/25/2017
[146854.840733] task: ffff88a5d8f5af70 ti: ffff887d17a9c000 task.ti: ffff887d17a9c000
[146854.840734] RIP: 0010:[<ffffffff810fa336>]
[146854.840743] [<ffffffff810fa336>] native_queued_spin_lock_slowpath+0x116/0x1e0
[146854.840744] RSP: 0018:ffff887d17a9fb70 EFLAGS: 00000246
[146854.840745] RAX: 0000000000000000 RBX: ffffffffc0e98bc7 RCX: 0000000000b90000
[146854.840745] RDX: ffff88beff417880 RSI: 0000000002210001 RDI: ffff88beb8be60dc
[146854.840746] RBP: ffff887d17a9fb70 R08: ffff88befed57880 R09: 0000000000000000
[146854.840747] R10: 0000000000000000 R11: 000000000000000f R12: ffff88beb8be60c0
[146854.840748] R13: ffff887d17a9fc18 R14: ffff887d17a9fc80 R15: ffff887d17a9fc18
[146854.840749] FS: 0000000000000000(0000) GS:ffff88befed40000(0000) knlGS:0000000000000000
[146854.840750] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[146854.840751] CR2: 00007feb348867b0 CR3: 000000bec0fc9000 CR4: 00000000003407e0
[146854.840752] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[146854.840753] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[146854.840753] Stack:
[146854.840754] ffff887d17a9fb80 ffffffff8169e6bf ffff887d17a9fb90 ffffffff816abc10
[146854.840756] ffff887d17a9fbd0 ffffffffc11591d8 0000000000000000 ffff887d2cbdc200
[146854.840758] ffff887d17a9fc18 ffff887d2cbdc260 ffff887d2cbdc200 ffff88a7f0741400
[146854.840760] Call Trace:
[146854.840768] [<ffffffff8169e6bf>] queued_spin_lock_slowpath+0xb/0xf
[146854.840772] [<ffffffff816abc10>] _raw_spin_lock+0x20/0x30
[146854.840840] [<ffffffffc11591d8>] ldlm_handle_conflict_lock+0xd8/0x330 [ptlrpc]
[146854.840870] [<ffffffffc116dc95>] ldlm_process_plain_lock+0x435/0xb30 [ptlrpc]
[146854.840895] [<ffffffffc116d860>] ? ldlm_errno2error+0x60/0x60 [ptlrpc]
[146854.840918] [<ffffffffc11589db>] ldlm_reprocess_queue+0x13b/0x2a0 [ptlrpc]
[146854.840940] [<ffffffffc115957d>] __ldlm_reprocess_all+0x14d/0x3a0 [ptlrpc]
[146854.840963] [<ffffffffc1159b30>] ldlm_reprocess_res+0x20/0x30 [ptlrpc]
[146854.840976] [<ffffffffc0e9ebef>] cfs_hash_for_each_relax+0x21f/0x400 [libcfs]
[146854.840999] [<ffffffffc1159b10>] ? ldlm_lock_downgrade+0x320/0x320 [ptlrpc]
[146854.841020] [<ffffffffc1159b10>] ? ldlm_lock_downgrade+0x320/0x320 [ptlrpc]
[146854.841029] [<ffffffffc0ea1d95>] cfs_hash_for_each_nolock+0x75/0x1c0 [libcfs]
[146854.841052] [<ffffffffc1159b7c>] ldlm_reprocess_recovery_done+0x3c/0x110 [ptlrpc]
[146854.841075] [<ffffffffc115a7bc>] ldlm_export_cancel_locks+0x11c/0x130 [ptlrpc]
[146854.841105] [<ffffffffc1183da8>] ldlm_bl_thread_main+0x4c8/0x700 [ptlrpc]
[146854.841108] [<ffffffff810c4820>] ? wake_up_state+0x20/0x20
[146854.841134] [<ffffffffc11838e0>] ? ldlm_handle_bl_callback+0x410/0x410 [ptlrpc]
[146854.841137] [<ffffffff810b099f>] kthread+0xcf/0xe0
[146854.841140] [<ffffffff8108ddfb>] ? do_exit+0x6bb/0xa40
[146854.841141] [<ffffffff810b08d0>] ? insert_kthread_work+0x40/0x40
[146854.841144] [<ffffffff816b4fd8>] ret_from_fork+0x58/0x90
[146854.841146] [<ffffffff810b08d0>] ? insert_kthread_work+0x40/0x40

BTW, did you have a try at disabling the nodemap feature to see if this issue still exists? |
| Comment by Malcolm Haak - NCI (Inactive) [ 22/Mar/18 ] |
|
The real issue is more about rebooting the MDSs. We are working on getting a properly sized test environment up that can be mounted by both our 37-node test cluster and the whole compute cluster if we need to stress it. Once we have that, we will be in a better spot to test this. As for patches, diffs are fine; we have a build system that spits out RPMs.

Yes, I thought we were actually seeing more than one different bug in our crashes. We also appear to be seeing

As for disabling nodemaps, what do you mean exactly? We tried setting nodemap_activate=0 and removing all custom nodemaps, but it has no effect. It is a bit like once you have activated it, you get into a state that you cannot revert from. All the clients have been rebooted since nodemaps were disabled and all custom nodemaps were deleted, as has the MDS/MGS.

The other interesting thing of note is that when the 'live lock' talked about above happens, the MGS server load goes to over 100, and it's not until unmount that we hit the null pointer. I believe there is probably some value in us getting it into this state and forcing a crash dump while it is mounted; if you agree, I can attempt to arrange that. |
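For reference, the disable/cleanup sequence described above corresponds to standard lctl commands along these lines (a minimal sketch; the nodemap name testmgmt04 is taken from the earlier reproduction and stands in for whatever custom nodemaps were defined):

MGS> lctl nodemap_activate 0
MGS> lctl nodemap_del testmgmt04
MGS> lctl get_param nodemap.active      # should report 0 once deactivation has propagated
MGS> ls /proc/fs/lustre/nodemap         # only the "default" nodemap should remain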
| Comment by Malcolm Haak - NCI (Inactive) [ 22/Mar/18 ] |
|
Also, since this issue started we have clients that sometimes fail to mount the filesystem, usually on the first attempt. Here are the logs from a client:

2018-03-21 12:39:08 [ 164.054046] Lustre: 8787:0:(client.c:1920:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1521596248/real 1521596248] req@ffff88104836bfc0 x1595505493821472/t0(0) o502->MGC10.112.1.41@o2ib8@10.112.1.41@o2ib8:26/25 lens 272/8472 e 0 to 1 dl 1521596348 ref 2 fl Rpc:X/0/ffffffff rc 0/-1
2018-03-21 12:39:08 [ 164.084367] LustreError: 166-1: MGC10.112.1.41@o2ib8: Connection to MGS (at 10.112.1.41@o2ib8) was lost; in progress operations using this service will fail
2018-03-21 12:39:08 [ 164.099869] LustreError: 15c-8: MGC10.112.1.41@o2ib8: The configuration from log 'gdata1b-client' failed (-5). This may be the result of communication errors between this node and the MGS, a bad configuration, or other errors. See the syslog for more information.
2018-03-21 12:39:08 [ 164.125470] LustreError: 8786:0:(llite_lib.c:1145:ll_fill_super()) Unable to process log: -5
2018-03-21 12:39:08 [ 164.137142] Lustre: Unmounted gdata1b-client
2018-03-21 12:39:08 [ 164.198698] Lustre: MGC10.112.1.41@o2ib8: Connection restored to MGS (at 10.112.1.41@o2ib8)
2018-03-21 12:39:08 [ 164.208862] LustreError: 8786:0:(obd_mount.c:1340:lustre_fill_super()) Unable to mount (-5)
2018-03-21 12:39:08 mount.lustre: mount 10.112.1.41@o2ib8:10.112.1.42@o2ib8:/gdata1b at /g/data1b failed: Input/output error
2018-03-21 12:39:09 Is the MGS running?

So it's also related somehow. Usually subsequent mount attempts work successfully. |
| Comment by javed shaikh (Inactive) [ 23/Oct/20 ] |
|
Since there has been no fix to this issue, NCI doesn't use nodemaps. |