  Lustre / LU-10635

MGS kernel panic when configuring nodemaps and filesets

Details

    • Type: Bug
    • Resolution: Won't Fix
    • Priority: Minor
    • Fix Version/s: None
    • Affects Version/s: Lustre 2.10.2
    • Labels: None
    • Environment: Lustre 2.10.2, DNE, ZFS 0.7.3, CentOS 7.4
    • Severity: 3
    • Rank (Obsolete): 9223372036854775807

    Description

      We (ANU-NCI) are attempting to configure the Lustre nodemap and fileset features and have run into two problems.

      1: Toggling the nodemap_activate flag from 0 to 1 appears to clear/reset the existing nodemap "fileset" property on every MDS/OSS node except the combined MGS/MDS1 node. This looks similar to LU-9154.
      2: Reapplying the nodemap fileset property in response to the above reliably causes the MGS to stop responding to clients, and recovery attempts result in an MGS kernel panic and reboot.
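One way to confirm the first problem is to read the fileset parameter on every server before and after toggling activation and list the nodes where it no longer matches. The following is a minimal bash sketch, not a tested tool: the function names `cleared_filesets` and `collect_filesets` are hypothetical, and it assumes passwordless ssh to each server with `lctl` in the remote PATH.

```shell
#!/usr/bin/env bash
# Sketch: list nodes whose nodemap fileset differs from the expected value.
# Function names are illustrative only; they are not part of Lustre.

# cleared_filesets EXPECTED
# Reads "host value" lines on stdin (the value may be missing) and prints
# each host whose value is not exactly EXPECTED.
cleared_filesets() {
    local expected=$1 host value
    while read -r host value; do
        [ -n "$host" ] || continue
        [ "$value" = "$expected" ] || printf '%s\n' "$host"
    done
}

# collect_filesets NODEMAP HOST...
# Prints "host value" for every host, querying the parameter over ssh.
# Assumption: passwordless ssh and lctl available on each server.
collect_filesets() {
    local nodemap=$1 host
    shift
    for host in "$@"; do
        printf '%s %s\n' "$host" \
            "$(ssh -n "$host" lctl get_param -n "nodemap.$nodemap.fileset" 2>/dev/null)"
    done
}
```

For example, with placeholder hostnames, `collect_filesets testmgmt04 mds1 mds2 oss1 | cleared_filesets /fs/sub1/sub2` would print the servers on which the fileset has been reset.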

      This command sequence should reproduce the problems:

      MGS> lctl nodemap_activate=0
      MGS> lctl nodemap_add testmgmt04
      MGS> lctl set_param -P nodemap.testmgmt04.fileset='/fs/sub1/sub2'
      MGS> ### the above command takes a surprisingly long time to return (>60 sec), but seems to succeed.
      MGS> ### at this point, the fileset property has propagated and is visible on all MDS/OSS nodes.
      MGS> lctl nodemap_activate=1
      MGS> ### the fileset property has now been reset to empty on all nodes except the combined MGS/MDS1 node
      MGS> ### try to re-apply the fileset property again:
      MGS> lctl set_param -P nodemap.testmgmt04.fileset='/fs/sub1/sub2'

      This command hangs and is unkillable. At this point the MGS stops responding to clients, though the MDS running on the same host seems OK. Any new client attempt to mount the filesystem fails with:

      mount.lustre: mount 10.112.1.41@o2ib8:10.112.1.42@o2ib8:/fs/sub1/sub2 at /mnt/sub2 failed: Input/output error
      Is the MGS running?
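To separate a network problem from a hung MGS service, it can help to check LNet reachability of each MGS NID named in the mount device. The sketch below only splits the device string into NIDs; the helper name `mgs_nids` is hypothetical, and it assumes the usual `nid[:nid...]:/fsname[/subdir]` device form shown in the error above.

```shell
#!/usr/bin/env bash
# Sketch: extract the MGS NIDs from a Lustre mount device string.
# The function name is illustrative only; it is not part of Lustre.

# mgs_nids DEVICE
# DEVICE has the form "nid[:nid...]:/fsname[/subdir]".
# Prints one colon-separated MGS NID entry per line.
mgs_nids() {
    local device=$1
    local nids=${device%%:/*}   # drop the ":/fsname[/subdir]" suffix
    printf '%s\n' "$nids" | tr ':' '\n'
}
```

Each NID can then be passed to `lctl ping`, for example `for nid in $(mgs_nids "$device"); do lctl ping "$nid"; done`. A successful ping only shows that LNet on the server answers; it says nothing about whether the MGS service itself is still processing requests.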

      Attempting to unmount the MGT (either manually or as part of HA failover) results in an immediate MGS kernel panic and reboot:

      [ 2860.466515] BUG: unable to handle kernel NULL pointer dereference at 000000000000001c
      [ 2860.501712] IP: [<ffffffffc11048ce>] ldlm_process_plain_lock+0x6e/0xb30 [ptlrpc]
      [ 2860.534877] PGD 0 
      [ 2860.544669] Oops: 0000 [#1] SMP 
      [ 2860.561965] Modules linked in: osp(OE) mdd(OE) lod(OE) mdt(OE) lfsck(OE) mgs(OE) mgc(OE) osd_zfs(OE) lquota(OE) fid(OE) fld(OE) ptlrpc(OE) obdclass(OE) ko2iblnd(OE) lnet(OE) libcfs(OE) bonding rdma_ucm(OE) ib_ucm(OE) rdma_cm(OE) iw_cm(OE) ib_ipoib(OE) ib_cm(OE) ib_uverbs(OE) ib_umad(OE) mlx5_fpga_tools(OE) mlx4_en(OE) mlx4_ib(OE) mlx4_core(OE) dm_mirror dm_region_hash dm_log zfs(POE) zunicode(POE) zavl(POE) icp(POE) zcommon(POE) znvpair(POE) spl(OE) sb_edac edac_core intel_powerclamp coretemp intel_rapl dm_round_robin iosf_mbi kvm_intel kvm irqbypass crc32_pclmul ipmi_si ghash_clmulni_intel iTCO_wdt dm_multipath sg joydev iTCO_vendor_support aesni_intel ipmi_devintf hpwdt hpilo lrw gf128mul glue_helper ablk_helper cryptd ioatdma pcspkr i2c_i801 wmi dm_mod ipmi_msghandler lpc_ich shpchp dca acpi_cpufreq
      [ 2860.882396]  acpi_power_meter nfsd auth_rpcgss nfs_acl lockd grace knem(OE) sunrpc ip_tables xfs libcrc32c mlx5_ib(OE) ib_core(OE) sd_mod crc_t10dif crct10dif_generic mgag200 i2c_algo_bit drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm mlx5_core(OE) mlxfw(OE) mlx_compat(OE) tg3 devlink crct10dif_pclmul serio_raw crct10dif_common ptp drm crc32c_intel i2c_core pps_core hpsa(OE) scsi_transport_sas
      [ 2861.040539] CPU: 27 PID: 4459 Comm: ldlm_bl_24 Tainted: P           OE  ------------   3.10.0-693.5.2.el7.x86_64 #1
      [ 2861.092592] Hardware name: HP ProLiant DL360 Gen9/ProLiant DL360 Gen9, BIOS P89 04/25/2017
      [ 2861.129810] task: ffff88becf800fd0 ti: ffff88b24b3c0000 task.ti: ffff88b24b3c0000
      [ 2861.163313] RIP: 0010:[<ffffffffc11048ce>]  [<ffffffffc11048ce>] ldlm_process_plain_lock+0x6e/0xb30 [ptlrpc]
      [ 2861.207431] RSP: 0018:ffff88b24b3c3be0  EFLAGS: 00010287
      [ 2861.231433] RAX: 0000000000000000 RBX: ffff88b774f53800 RCX: ffff88b24b3c3c7c
      [ 2861.263036] RDX: 0000000000000002 RSI: ffff88b24b3c3c80 RDI: ffff88b774f53800
      [ 2861.295068] RBP: ffff88b24b3c3c58 R08: ffff88b24b3c3cd0 R09: ffff88beff257880
      [ 2861.327085] R10: ffff88b774f53800 R11: 0000000000000005 R12: ffff885efdd5fcc0
      [ 2861.359171] R13: 0000000000000002 R14: ffff88b24b3c3cd0 R15: ffff88b774f53860
      [ 2861.391235] FS:  0000000000000000(0000) GS:ffff88befee40000(0000) knlGS:0000000000000000
      [ 2861.428020] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [ 2861.454884] CR2: 000000000000001c CR3: 0000005e60ab7000 CR4: 00000000003407e0
      [ 2861.487917] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      [ 2861.520372] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      [ 2861.552721] Stack:
      [ 2861.562219]  ffff88b24b3c3c7c ffff88b24b3c3cd0 ffff88b24b3c3c80 0000000000000000
      [ 2861.599583]  ffff885efdd5fca0 0000001000000001 ffff885e00000010 ffff88b24b3c3c18
      [ 2861.632738]  ffff88b24b3c3c18 0000000013b7eb66 0000000000000002 ffff885efdd5fcc0
      [ 2861.666158] Call Trace:
      [ 2861.677268]  [<ffffffffc1104860>] ? ldlm_errno2error+0x60/0x60 [ptlrpc]
      [ 2861.706930]  [<ffffffffc10ef9db>] ldlm_reprocess_queue+0x13b/0x2a0 [ptlrpc]
      [ 2861.738259]  [<ffffffffc10f057d>] __ldlm_reprocess_all+0x14d/0x3a0 [ptlrpc]
      [ 2861.769602]  [<ffffffffc10f0b30>] ldlm_reprocess_res+0x20/0x30 [ptlrpc]
      [ 2861.799258]  [<ffffffffc0a36bef>] cfs_hash_for_each_relax+0x21f/0x400 [libcfs]
      [ 2861.831735]  [<ffffffffc10f0b10>] ? ldlm_lock_downgrade+0x320/0x320 [ptlrpc]
      [ 2861.863592]  [<ffffffffc10f0b10>] ? ldlm_lock_downgrade+0x320/0x320 [ptlrpc]
      [ 2861.895390]  [<ffffffffc0a39d95>] cfs_hash_for_each_nolock+0x75/0x1c0 [libcfs]
      [ 2861.928211]  [<ffffffffc10f0b7c>] ldlm_reprocess_recovery_done+0x3c/0x110 [ptlrpc]
      [ 2861.962124]  [<ffffffffc10f17bc>] ldlm_export_cancel_locks+0x11c/0x130 [ptlrpc]
      [ 2861.994962]  [<ffffffffc111ada8>] ldlm_bl_thread_main+0x4c8/0x700 [ptlrpc]
      [ 2862.025858]  [<ffffffff810c4820>] ? wake_up_state+0x20/0x20
      [ 2862.050898]  [<ffffffffc111a8e0>] ? ldlm_handle_bl_callback+0x410/0x410 [ptlrpc]
      [ 2862.086852]  [<ffffffff810b099f>] kthread+0xcf/0xe0
      [ 2862.112252]  [<ffffffff810b08d0>] ? insert_kthread_work+0x40/0x40
      [ 2862.139704]  [<ffffffff816b4fd8>] ret_from_fork+0x58/0x90
      [ 2862.164047]  [<ffffffff810b08d0>] ? insert_kthread_work+0x40/0x40
      [ 2862.191049] Code: 89 45 a0 74 0d f6 05 b3 ac 94 ff 01 0f 85 34 06 00 00 8b 83 98 00 00 00 39 83 9c 00 00 00 89 45 b8 0f 84 57 09 00 00 48 8b 45 a0 <8b> 40 1c 85 c0 0f 84 7a 09 00 00 48 8b 4d a0 48 89 c8 48 83 c0 
      [ 2862.276308] RIP  [<ffffffffc11048ce>] ldlm_process_plain_lock+0x6e/0xb30 [ptlrpc]
      [ 2862.309889]  RSP <ffff88b24b3c3be0>
      [ 2862.325503] CR2: 000000000000001c
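The register dump is consistent with a NULL structure pointer being dereferenced. In the `Code:` line, the bytes just before the marked instruction, `48 8b 45 a0`, load RAX from `-0x60(%rbp)`, and the faulting instruction `<8b> 40 1c` is a 32-bit load from `0x1c(%rax)`. With RAX = 0 the access lands at address 0x1c, which matches CR2. Which structure was NULL cannot be determined from the trace alone. The arithmetic can be checked with a small bash sketch. The function name `fault_offset` is hypothetical, and it assumes an x86_64 oops in the format shown above.

```shell
#!/usr/bin/env bash
# Sketch: pull the faulting address (CR2) and RAX out of an x86_64 oops
# and print the offset between them. Function name is illustrative only.

# fault_offset
# Reads oops text on stdin and prints "cr2=0x.. rax=0x.. offset=0x..".
# Returns 1 if either register is missing from the input.
fault_offset() {
    local text cr2 rax
    text=$(cat)
    # Take the first "CR2: <16 hex digits>" and "RAX: <16 hex digits>" fields.
    cr2=$(printf '%s\n' "$text" | sed -n 's/.*CR2: \([0-9a-f]\{16\}\).*/\1/p' | head -n 1)
    rax=$(printf '%s\n' "$text" | sed -n 's/.*RAX: \([0-9a-f]\{16\}\).*/\1/p' | head -n 1)
    [ -n "$cr2" ] && [ -n "$rax" ] || return 1
    # 16#... tells bash arithmetic to parse the digits as hexadecimal.
    printf 'cr2=0x%x rax=0x%x offset=0x%x\n' \
        "$((16#$cr2))" "$((16#$rax))" "$((16#$cr2 - 16#$rax))"
}
```

Feeding the trace above to the function, for instance from a saved copy of the console output, prints `cr2=0x1c rax=0x0 offset=0x1c`.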
      

      Attachments

        1. ll_mgs_ftrace.log.gz (197 kB)
        2. mgs_busy_top.txt (8 kB)
        3. mgs_lctl_dk.log.gz (2.43 MB)
        4. mgs_panic.txt (5 kB)

            People

              Assignee: Emoly Liu (emoly.liu)
              Reporter: Kim Sebo (kim.sebo)
              Votes: 1
              Watchers: 13