
[LU-10635] MGS kernel panic when configuring nodemaps and filesets

Details

    • Type: Bug
    • Resolution: Won't Fix
    • Priority: Minor
    • Fix Version/s: None
    • Affects Version/s: Lustre 2.10.2
    • Labels: None
    • Environment: Lustre 2.10.2, DNE, ZFS 0.7.3, CentOS 7.4
    • Severity: 3

    Description

      We (ANU-NCI) are attempting to configure the Lustre nodemap and fileset features and have run into two problems.

      1: Toggling the nodemap_activate flag from 0 to 1 appears to clear/reset the existing nodemap "fileset" property on all the MDS/OSS nodes. This appears to be similar to LU-9154.
      2: Re-applying the nodemap fileset property in response to the above reliably causes the MGS to stop responding to clients, and recovery attempts result in an MGS kernel panic and reboot.

      This command sequence should reproduce the problems:

      MGS> lctl nodemap_activate 0
      MGS> lctl nodemap_add testmgmt04
      MGS> lctl set_param -P nodemap.testmgmt04.fileset='/fs/sub1/sub2'
      MGS> ### the above command takes a surprisingly long time to return (>60 sec), but seems to succeed.
      MGS> ### at this point, the fileset property has propagated and is visible on all MDS/OSS nodes.
      MGS> lctl nodemap_activate 1
      MGS> ### the fileset property has now been reset to empty on all nodes except the combined MGS/MDS1 node
      MGS> ### try to re-apply the fileset property:
      MGS> lctl set_param -P nodemap.testmgmt04.fileset='/fs/sub1/sub2'
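
      For reference, propagation of the fileset property can be verified on each MDS/OSS with a standard lctl get_param call (a generic sketch; the nodemap name and the expected value shown are simply the ones from the reproducer above):

      MDS/OSS> lctl get_param nodemap.testmgmt04.fileset
      MDS/OSS> ### expected output: nodemap.testmgmt04.fileset=/fs/sub1/sub2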

      This command hangs and is unkillable. At this point the MGS stops responding to clients, though the MDS running on the same host seems OK. Any new client attempt to mount the filesystem fails with:

      mount.lustre: mount 10.112.1.41@o2ib8:10.112.1.42@o2ib8:/fs/sub1/sub2 at /mnt/sub2 failed: Input/output error
      Is the MGS running?

      Attempting to unmount the MGT (either manually, or as part of HA failover) results in
      an immediate MGS kernel panic and reboot:

      [ 2860.466515] BUG: unable to handle kernel NULL pointer dereference at 000000000000001c
      [ 2860.501712] IP: [<ffffffffc11048ce>] ldlm_process_plain_lock+0x6e/0xb30 [ptlrpc]
      [ 2860.534877] PGD 0 
      [ 2860.544669] Oops: 0000 [#1] SMP 
      [ 2860.561965] Modules linked in: osp(OE) mdd(OE) lod(OE) mdt(OE) lfsck(OE) mgs(OE) mgc(OE) osd_zfs(OE) lquota(OE) fid(OE) fld(OE) ptlrpc
      (OE) obdclass(OE) ko2iblnd(OE) lnet(OE) libcfs(OE) bonding rdma_ucm(OE) ib_ucm(OE) rdma_cm(OE) iw_cm(OE) ib_ipoib(OE) ib_cm(OE) ib_uverbs(OE) ib_umad(OE) mlx5_fpga_tools(OE) mlx4_en(OE) mlx4_ib(OE) mlx4_core(OE) dm_mirror dm_region_hash dm_log zfs(POE) zunicode(POE) zavl(POE) icp(POE) zcommon(POE) znvpair(POE) spl(OE) sb_edac edac_core intel_powerclamp coretemp intel_rapl dm_round_robin iosf_mbi kvm_intel kvm irqbypass crc32_pclmul ipmi_si ghash_clmulni_intel iTCO_wdt dm_multipath sg joydev iTCO_vendor_support aesni_intel ipmi_devintf hpwdt hpilo lrw gf128mul glue_helper ablk_helper cryptd ioatdma pcspkr i2c_i801 wmi dm_mod ipmi_msghandler lpc_ich shpchp dca acpi_cpufreq
      [ 2860.882396]  acpi_power_meter nfsd auth_rpcgss nfs_acl lockd grace knem(OE) sunrpc ip_tables xfs libcrc32c mlx5_ib(OE) ib_core(OE) sd_mod crc_t10dif crct10dif_generic mgag200 i2c_algo_bit drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm mlx5_core(OE) mlxfw(OE) mlx_compat(OE) tg3 devlink crct10dif_pclmul serio_raw crct10dif_common ptp drm crc32c_intel i2c_core pps_core hpsa(OE) scsi_transport_sas
      [ 2861.040539] CPU: 27 PID: 4459 Comm: ldlm_bl_24 Tainted: P           OE  ------------   3.10.0-693.5.2.el7.x86_64 #1
      [ 2861.092592] Hardware name: HP ProLiant DL360 Gen9/ProLiant DL360 Gen9, BIOS P89 04/25/2017
      [ 2861.129810] task: ffff88becf800fd0 ti: ffff88b24b3c0000 task.ti: ffff88b24b3c0000
      [ 2861.163313] RIP: 0010:[<ffffffffc11048ce>]  [<ffffffffc11048ce>] ldlm_process_plain_lock+0x6e/0xb30 [ptlrpc]
      [ 2861.207431] RSP: 0018:ffff88b24b3c3be0  EFLAGS: 00010287
      [ 2861.231433] RAX: 0000000000000000 RBX: ffff88b774f53800 RCX: ffff88b24b3c3c7c
      [ 2861.263036] RDX: 0000000000000002 RSI: ffff88b24b3c3c80 RDI: ffff88b774f53800
      [ 2861.295068] RBP: ffff88b24b3c3c58 R08: ffff88b24b3c3cd0 R09: ffff88beff257880
      [ 2861.327085] R10: ffff88b774f53800 R11: 0000000000000005 R12: ffff885efdd5fcc0
      [ 2861.359171] R13: 0000000000000002 R14: ffff88b24b3c3cd0 R15: ffff88b774f53860
      [ 2861.391235] FS:  0000000000000000(0000) GS:ffff88befee40000(0000) knlGS:0000000000000000
      [ 2861.428020] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [ 2861.454884] CR2: 000000000000001c CR3: 0000005e60ab7000 CR4: 00000000003407e0
      [ 2861.487917] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      [ 2861.520372] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      [ 2861.552721] Stack:
      [ 2861.562219]  ffff88b24b3c3c7c ffff88b24b3c3cd0 ffff88b24b3c3c80 0000000000000000
      [ 2861.599583]  ffff885efdd5fca0 0000001000000001 ffff885e00000010 ffff88b24b3c3c18
      [ 2861.632738]  ffff88b24b3c3c18 0000000013b7eb66 0000000000000002 ffff885efdd5fcc0
      [ 2861.666158] Call Trace:
      [ 2861.677268]  [<ffffffffc1104860>] ? ldlm_errno2error+0x60/0x60 [ptlrpc]
      [ 2861.706930]  [<ffffffffc10ef9db>] ldlm_reprocess_queue+0x13b/0x2a0 [ptlrpc]
      [ 2861.738259]  [<ffffffffc10f057d>] __ldlm_reprocess_all+0x14d/0x3a0 [ptlrpc]
      [ 2861.769602]  [<ffffffffc10f0b30>] ldlm_reprocess_res+0x20/0x30 [ptlrpc]
      [ 2861.799258]  [<ffffffffc0a36bef>] cfs_hash_for_each_relax+0x21f/0x400 [libcfs]
      [ 2861.831735]  [<ffffffffc10f0b10>] ? ldlm_lock_downgrade+0x320/0x320 [ptlrpc]
      [ 2861.863592]  [<ffffffffc10f0b10>] ? ldlm_lock_downgrade+0x320/0x320 [ptlrpc]
      [ 2861.895390]  [<ffffffffc0a39d95>] cfs_hash_for_each_nolock+0x75/0x1c0 [libcfs]
      [ 2861.928211]  [<ffffffffc10f0b7c>] ldlm_reprocess_recovery_done+0x3c/0x110 [ptlrpc]
      [ 2861.962124]  [<ffffffffc10f17bc>] ldlm_export_cancel_locks+0x11c/0x130 [ptlrpc]
      [ 2861.994962]  [<ffffffffc111ada8>] ldlm_bl_thread_main+0x4c8/0x700 [ptlrpc]
      [ 2862.025858]  [<ffffffff810c4820>] ? wake_up_state+0x20/0x20
      [ 2862.050898]  [<ffffffffc111a8e0>] ? ldlm_handle_bl_callback+0x410/0x410 [ptlrpc]
      [ 2862.086852]  [<ffffffff810b099f>] kthread+0xcf/0xe0
      [ 2862.112252]  [<ffffffff810b08d0>] ? insert_kthread_work+0x40/0x40
      [ 2862.139704]  [<ffffffff816b4fd8>] ret_from_fork+0x58/0x90
      [ 2862.164047]  [<ffffffff810b08d0>] ? insert_kthread_work+0x40/0x40
      [ 2862.191049] Code: 89 45 a0 74 0d f6 05 b3 ac 94 ff 01 0f 85 34 06 00 00 8b 83 98 00 00 00 39 83 9c 00 00 00 89 45 b8 0f 84 57 09 00 00 48 8b 45 a0 <8b> 40 1c 85 c0 0f 84 7a 09 00 00 48 8b 4d a0 48 89 c8 48 83 c0 
      [ 2862.276308] RIP  [<ffffffffc11048ce>] ldlm_process_plain_lock+0x6e/0xb30 [ptlrpc]
      [ 2862.309889]  RSP <ffff88b24b3c3be0>
      [ 2862.325503] CR2: 000000000000001c
      

      Attachments

        1. ll_mgs_ftrace.log.gz
          197 kB
        2. mgs_busy_top.txt
          8 kB
        3. mgs_lctl_dk.log.gz
          2.43 MB
        4. mgs_panic.txt
          5 kB

        Issue Links

          Activity


            javed javed shaikh (Inactive) added a comment -

            Since there has been no fix for this issue, NCI doesn't use nodemaps.
            It is OK to close this ticket.


            mhaakddn Malcolm Haak - NCI (Inactive) added a comment - - edited

            Also, since this issue started, we have had clients that sometimes fail to mount the filesystem, usually on the first attempt.

            Here are the logs from a client:

            2018-03-21 12:39:08 [  164.054046] Lustre: 8787:0:(client.c:1920:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1521596248/real 1521596248]  req@ffff88104836bfc0 x1595505493821472/t0(0) o502->MGC10.112.1.41@o2ib8@10.112.1.41@o2ib8:26/25 lens 272/8472 e 0 to 1 dl 1521596348 ref 2 fl Rpc:X/0/ffffffff rc 0/-1
            2018-03-21 12:39:08 [  164.084367] LustreError: 166-1: MGC10.112.1.41@o2ib8: Connection to MGS (at 10.112.1.41@o2ib8) was lost; in progress operations using this service will fail
            2018-03-21 12:39:08 [  164.099869] LustreError: 15c-8: MGC10.112.1.41@o2ib8: The configuration from log 'gdata1b-client' failed (-5). This may be the result of communication errors between this node and the MGS, a bad configuration, or other errors. See the syslog for more information.
            2018-03-21 12:39:08 [  164.125470] LustreError: 8786:0:(llite_lib.c:1145:ll_fill_super()) Unable to process log: -5
            2018-03-21 12:39:08 [  164.137142] Lustre: Unmounted gdata1b-client
            2018-03-21 12:39:08 [  164.198698] Lustre: MGC10.112.1.41@o2ib8: Connection restored to MGS (at 10.112.1.41@o2ib8)
            2018-03-21 12:39:08 [  164.208862] LustreError: 8786:0:(obd_mount.c:1340:lustre_fill_super()) Unable to mount  (-5)
            2018-03-21 12:39:08 mount.lustre: mount 10.112.1.41@o2ib8:10.112.1.42@o2ib8:/gdata1b at /g/data1b failed: Input/output error
            2018-03-21 12:39:09 Is the MGS running?
            

            So it's also related somehow. Usually, subsequent mount attempts succeed.


            mhaakddn Malcolm Haak - NCI (Inactive) added a comment -

            The real issue is more the MDSes rebooting. We are working on getting a properly sized test environment up, one that can be mounted by both our 37-node test cluster and the whole compute cluster if we need to stress it. Once we have that, we will be in a better spot to test this. As for patches, diffs are fine; we have a build system that spits out RPMs.

            Yes, I think we are actually seeing more than one bug in our crashes. We also appear to be hitting LU-9193 or, based on what Bruno is suggesting, LU-10235/LU-10262 under specific workloads.

            As for disabling nodemaps, what do you mean exactly? We tried setting nodemap_activate=0 and removing all custom nodemaps, but it had no effect.

            It is a bit as though, once you have activated it, you get into a state you cannot revert from. All the clients have been rebooted since nodemaps were disabled and all custom nodemaps were deleted, as has the MDS/MGS.
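
            For reference, a disable sequence of this general shape (shown with the example nodemap name from the reproducer, purely as an illustration) looks like:

            MGS> lctl nodemap_activate 0
            MGS> lctl nodemap_del testmgmt04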

            The other interesting thing of note is that when the 'live lock' described above happens, the MGS server load goes to over 100, and it's not until unmount that we hit the NULL pointer.

            I believe there is probably some value in us getting it into this state and forcing a crash dump while it is mounted; if you agree, I can attempt to arrange that.
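
            For reference, a crash dump can be forced via sysrq once kdump is configured (a generic sketch, not a site-specific procedure):

            MGS> ### requires kdump to be configured so the dump is captured on reboot
            MGS> echo 1 > /proc/sys/kernel/sysrq
            MGS> echo c > /proc/sysrq-trigger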

            emoly.liu Emoly Liu added a comment -

            This is a "fortestonly" patch to fix the "kernel NULL pointer dereference at 000000000000001c in ldlm_process_plain_lock()". Apparently the ldlm lock's pointer to some MGS resource has become NULL by the time it is accessed in ldlm_process_plain_lock(), but for now we don't know exactly where it is released. This patch only avoids dereferencing the NULL pointer and will probably cause other issues.

            If it's not convenient for you to apply the patches, including the LU-10703 patch, we can provide a ptlrpc.ko module for your test. If you'd like, please let me know your detailed kernel and Lustre version information.
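
            For reference, that information can be gathered with standard commands, e.g.:

            MGS> uname -r
            MGS> lctl get_param version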

            What's more, the following call trace uploaded by Kim is different from this NULL pointer issue, but it is the same as the one reported recently in LU-10390.

            [146854.840731] CPU: 23 PID: 52092 Comm: ldlm_bl_04 Tainted: P           OE  ------------   3.10.0-693.5.2.el7.x86_64 #1
            [146854.840732] Hardware name: HP ProLiant DL360 Gen9/ProLiant DL360 Gen9, BIOS P89 04/25/2017
            [146854.840733] task: ffff88a5d8f5af70 ti: ffff887d17a9c000 task.ti: ffff887d17a9c000
            [146854.840734] RIP: 0010:[<ffffffff810fa336>]
            [146854.840743]  [<ffffffff810fa336>] native_queued_spin_lock_slowpath+0x116/0x1e0
            [146854.840744] RSP: 0018:ffff887d17a9fb70  EFLAGS: 00000246
            [146854.840745] RAX: 0000000000000000 RBX: ffffffffc0e98bc7 RCX: 0000000000b90000
            [146854.840745] RDX: ffff88beff417880 RSI: 0000000002210001 RDI: ffff88beb8be60dc
            [146854.840746] RBP: ffff887d17a9fb70 R08: ffff88befed57880 R09: 0000000000000000
            [146854.840747] R10: 0000000000000000 R11: 000000000000000f R12: ffff88beb8be60c0
            [146854.840748] R13: ffff887d17a9fc18 R14: ffff887d17a9fc80 R15: ffff887d17a9fc18
            [146854.840749] FS:  0000000000000000(0000) GS:ffff88befed40000(0000) knlGS:0000000000000000
            [146854.840750] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
            [146854.840751] CR2: 00007feb348867b0 CR3: 000000bec0fc9000 CR4: 00000000003407e0
            [146854.840752] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
            [146854.840753] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
            [146854.840753] Stack:
            [146854.840754]  ffff887d17a9fb80 ffffffff8169e6bf ffff887d17a9fb90 ffffffff816abc10
            [146854.840756]  ffff887d17a9fbd0 ffffffffc11591d8 0000000000000000 ffff887d2cbdc200
            [146854.840758]  ffff887d17a9fc18 ffff887d2cbdc260 ffff887d2cbdc200 ffff88a7f0741400
            [146854.840760] Call Trace:
            [146854.840768]  [<ffffffff8169e6bf>] queued_spin_lock_slowpath+0xb/0xf
            [146854.840772]  [<ffffffff816abc10>] _raw_spin_lock+0x20/0x30
            [146854.840840]  [<ffffffffc11591d8>] ldlm_handle_conflict_lock+0xd8/0x330 [ptlrpc]
            [146854.840870]  [<ffffffffc116dc95>] ldlm_process_plain_lock+0x435/0xb30 [ptlrpc]
            [146854.840895]  [<ffffffffc116d860>] ? ldlm_errno2error+0x60/0x60 [ptlrpc]
            [146854.840918]  [<ffffffffc11589db>] ldlm_reprocess_queue+0x13b/0x2a0 [ptlrpc]
            [146854.840940]  [<ffffffffc115957d>] __ldlm_reprocess_all+0x14d/0x3a0 [ptlrpc]
            [146854.840963]  [<ffffffffc1159b30>] ldlm_reprocess_res+0x20/0x30 [ptlrpc]
            [146854.840976]  [<ffffffffc0e9ebef>] cfs_hash_for_each_relax+0x21f/0x400 [libcfs]
            [146854.840999]  [<ffffffffc1159b10>] ? ldlm_lock_downgrade+0x320/0x320 [ptlrpc]
            [146854.841020]  [<ffffffffc1159b10>] ? ldlm_lock_downgrade+0x320/0x320 [ptlrpc]
            [146854.841029]  [<ffffffffc0ea1d95>] cfs_hash_for_each_nolock+0x75/0x1c0 [libcfs]
            [146854.841052]  [<ffffffffc1159b7c>] ldlm_reprocess_recovery_done+0x3c/0x110 [ptlrpc]
            [146854.841075]  [<ffffffffc115a7bc>] ldlm_export_cancel_locks+0x11c/0x130 [ptlrpc]
            [146854.841105]  [<ffffffffc1183da8>] ldlm_bl_thread_main+0x4c8/0x700 [ptlrpc]
            [146854.841108]  [<ffffffff810c4820>] ? wake_up_state+0x20/0x20
            [146854.841134]  [<ffffffffc11838e0>] ? ldlm_handle_bl_callback+0x410/0x410 [ptlrpc]
            [146854.841137]  [<ffffffff810b099f>] kthread+0xcf/0xe0
            [146854.841140]  [<ffffffff8108ddfb>] ? do_exit+0x6bb/0xa40
            [146854.841141]  [<ffffffff810b08d0>] ? insert_kthread_work+0x40/0x40
            [146854.841144]  [<ffffffff816b4fd8>] ret_from_fork+0x58/0x90
            [146854.841146]  [<ffffffff810b08d0>] ? insert_kthread_work+0x40/0x40
            

            BTW, did you try disabling the nodemap feature to see if this issue still exists?


            gerrit Gerrit Updater added a comment -

            Emoly Liu (emoly.liu@intel.com) uploaded a new patch: https://review.whamcloud.com/31722
            Subject: LU-10635 ldlm: check if the resource is available before using it
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: f10cb6c9a60ca5df342aa0b6595e2592cd59f17c


            mhaakddn Malcolm Haak - NCI (Inactive) added a comment -

            I'm sure the LU-10703 patch is fine; it looks sensible. But I'm not taking the filesystem down, as problematic as it is, because it's currently mounted in production (though not under full production usage). Taking it down would cause jitter on the compute nodes and, occasionally, the odd crash due to 2.5.x clients.

            The issue is that there are literally millions of pending lock grants on the MGS. This is slowing down ALL the DNE MDTs and causing stability issues. We have the MDS crashing under load during DCP runs, so we can't populate the filesystem; we can't cleanly fail over because all the unnecessary lock load on the MGS causes the MDTs issues and causes lockups on unmount of the MGS. At this point, having Pacemaker is pointless, as it can't cleanly do anything.
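
            For reference, the lock counts on the MGS can be sampled with standard lctl parameters (shown wildcarded, since the exact namespace name may vary by setup):

            MGS> lctl get_param ldlm.namespaces.*.lock_count
            MGS> lctl get_param ldlm.namespaces.*.pool.granted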

            If you look at the crash with GDB, you can see they are all generic/plain locks that the MGS is dealing with, or not dealing with, apparently. I've done testing on a small cluster and it doesn't encounter issues, although unmount is quite slow compared with an IEEL 3.0 system on the same VMs. I assume this means it is seeing some of the issue, but due to the relatively small client count (1 client vs 4000) it's not enough to trigger the hung tasks or the NULL pointers.

            This is seriously a huge issue. It's now been over a month and we are no closer to an answer. What can we do to speed this up? Can I get you more crash dumps? Do you need a bigger test environment? I can see what I can do in that department, if it will help...

            emoly.liu Emoly Liu added a comment - - edited

            I think the LU-8287 patch is good; it has been landed since Lustre 2.8.56.
            I will keep checking the code; please let me know if the LU-10703 patch works for you. Thanks.

            emoly.liu Emoly Liu added a comment - - edited

            Thanks, Malcolm. I am looking at the LU-8287 patch.


            mhaakddn Malcolm Haak - NCI (Inactive) added a comment - - edited

            Hi Emoly,

            We think we might have located a possible patch that could be related: LU-8287. It changes some behaviour; specifically, it changes lock revocation behaviour in relation to nodemaps.

            Considering the stack traces, and that the issues appeared to coincide with the addition of nodemaps, it looks like a good candidate for rollback or adjustment.

            As has been mentioned in both this ticket and the related ticket, this is blocking deployment of a filesystem into production. Also, ANU has paid Intel support; please escalate this ticket.

            kim.sebo Kim Sebo added a comment -

            An MGS/MDS crash dump has been uploaded to the FTP site. I will test the patch as soon as I can.


            People

              Assignee: emoly.liu Emoly Liu
              Reporter: kim.sebo Kim Sebo
              Votes: 1
              Watchers: 13

              Dates

                Created:
                Updated:
                Resolved: