Lustre / LU-11252

MDS kernel panic when trying to umount


Details

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Minor
    • Fix Version/s: None
    • Affects Version/s: Lustre 2.10.5, Lustre 2.10.7
    • Environment: 2.10.5-RC1 DNE
    • Severity: 3

    Description

      During an MDS failover test, MDS1 (soak-9) crashed while unmounting MDT0000.
      The problem seems related to LU-10390 and LU-10635.

      soak.log

      2018-08-14 12:43:14,259:fsmgmt.fsmgmt:INFO     soaked-MDT0000 in status 'RECOVERING'.
      2018-08-14 12:43:14,259:fsmgmt.fsmgmt:INFO     Next recovery check in 15s...
      2018-08-14 12:43:29,697:fsmgmt.fsmgmt:DEBUG    Recovery Result Record: {'soak-9': {'soaked-MDT0001': 'COMPLETE', 'soaked-MDT0000': 'COMPLETE'}}
      2018-08-14 12:43:29,697:fsmgmt.fsmgmt:INFO     Node soak-9: 'soaked-MDT0000' recovery completed
      2018-08-14 12:43:29,697:fsmgmt.fsmgmt:INFO     Failing back soaked-MDT0000 ...
      2018-08-14 12:43:29,697:fsmgmt.fsmgmt:INFO     Unmounting soaked-MDT0000 on soak-9 ...
      

      soak-9 console log

      [ 8931.176903] LustreError: Skipped 91 previous similar messages
      [ 8952.016266] LDISKFS-fs (dm-3): recovery complete
      [ 8952.024971] LDISKFS-fs (dm-3): mounted filesystem with ordered data mode. Opts: user_xattr,errors=remount-ro,user_xattr,no_mbcache,nodelalloc
      [ 8953.758389] Lustre: MGS: Connection restored to 192.168.1.109@o2ib (at 0@lo)
      [ 8954.257224] Lustre: soaked-MDT0000: Imperative Recovery not enabled, recovery window 300-900
      [ 8954.313828] Lustre: soaked-MDT0000: in recovery but waiting for the first client to connect
      [ 8955.666497] Lustre: soaked-MDT0000: Will be in recovery for at least 5:00, or until 29 clients reconnect
      [ 8979.238342] Lustre: Evicted from MGS (at 192.168.1.109@o2ib) after server handle changed from 0x5be05c011300c6fc to 0x33b9ca6b616b9f53
      [ 9019.048297] Lustre: MGS: Connection restored to 192.168.1.111@o2ib (at 192.168.1.111@o2ib)
      [ 9019.061014] Lustre: Skipped 72 previous similar messages
      [ 9029.239386] LustreError: 167-0: soaked-MDT0000-lwp-MDT0001: This client was evicted by soaked-MDT0000; in progress operations using this service will fail.
      [ 9056.462524] Lustre: soaked-MDT0000: Recovery over after 1:40, of 29 clients 29 recovered and 0 were evicted.
      [ 9057.179759] LustreError: 4059:0:(mdt_lvb.c:163:mdt_lvbo_fill()) soaked-MDT0000: expected 416 actual 344.
      [ 9057.191940] LustreError: 4059:0:(mdt_lvb.c:163:mdt_lvbo_fill()) Skipped 5 previous similar messages
      [ 9063.287830] Lustre: Failing over soaked-MDT0000
      [ 9066.966848] LustreError: 4060:0:(ldlm_lockd.c:1415:ldlm_handle_enqueue0()) ### lock on destroyed export ffff98fe2e643c00 ns: mdt-soaked-MDT0000_UUID lock: ffff9902823fa800/0x33b9ca6b6194c08e lrc: 1/0,0 mode: --/CR res: [0x20000c78a:0x11d30:0x0].0x0 bits 0x8 rrc: 6 type: IBT flags: 0x54801000000000 nid: 192.168.1.135@o2ib remote: 0x6ee64381e8b43dff expref: 7 pid: 4060 timeout: 0 lvb_type: 3
      [ 9066.966990] LustreError: 4067:0:(mdt_lvb.c:163:mdt_lvbo_fill()) soaked-MDT0000: expected 752 actual 416.
      [ 9066.966993] LustreError: 4067:0:(mdt_lvb.c:163:mdt_lvbo_fill()) Skipped 9 previous similar messages
      [ 9066.979183] LustreError: 4081:0:(client.c:1166:ptlrpc_import_delay_req()) @@@ IMP_CLOSED   req@ffff98fe00f51500 x1608769364838608/t0(0) o105->soaked-MDT0000@192.168.1.111@o2ib:15/16 lens 304/224 e 0 to 0 dl 0 ref 1 fl Rpc:/0/ffffffff rc 0/-1
      [ 9066.990686] LustreError: 11324:0:(osp_precreate.c:642:osp_precreate_send()) soaked-OST0003-osc-MDT0000: can't precreate: rc = -5
      [ 9066.990693] LustreError: 11324:0:(osp_precreate.c:1289:osp_precreate_thread()) soaked-OST0003-osc-MDT0000: cannot precreate objects: rc = -5
      [ 9067.005697] Lustre: soaked-MDT0000: Not available for connect from 192.168.1.144@o2ib (stopping)
      [ 9067.005698] Lustre: soaked-MDT0000: Not available for connect from 192.168.1.121@o2ib (stopping)
      [ 9067.130024] LustreError: 4060:0:(ldlm_lockd.c:1415:ldlm_handle_enqueue0()) Skipped 13 previous similar messages
      [ 9067.143747] BUG: unable to handle kernel NULL pointer dereference at 000000000000001c
      [ 9067.154313] IP: [<ffffffffc12a814d>] ldlm_handle_conflict_lock+0x3d/0x330 [ptlrpc]
      [ 9067.164557] PGD 0
      [ 9067.168351] Oops: 0000 [#1] SMP
      [ 9067.173431] Modules linked in: mgs(OE) osp(OE) mdd(OE) lod(OE) mdt(OE) lfsck(OE) mgc(OE) osd_ldiskfs(OE) ldiskfs(OE) lquota(OE) fid(OE) fld(OE) ko2iblnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) libcfs(OE) rpcsec_gss_krb5 nfsv4 dns_resolver nfs lockd grace fscache rdma_ucm(OE) ib_ucm(OE) rdma_cm(OE) iw_cm(OE) ib_ipoib(OE) ib_cm(OE) ib_uverbs(OE) ib_umad(OE) mlx5_ib(OE) mlx5_core(OE) mlxfw(OE) mlx4_en(OE) dm_round_robin zfs(POE) zunicode(POE) zavl(POE) icp(POE) sb_edac intel_powerclamp coretemp zcommon(POE) znvpair(POE) spl(OE) intel_rapl iosf_mbi kvm irqbypass crc32_pclmul ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper ablk_helper cryptd joydev ipmi_ssif ipmi_si ipmi_devintf iTCO_wdt iTCO_vendor_support sg ipmi_msghandler pcspkr mei_me lpc_ich mei ioatdma i2c_i801 wmi shpchp dm_multipath dm_mod auth_rpcgss sunrpc ip_tables ext4 mbcache jbd2 mlx4_ib(OE) ib_core(OE) sd_mod crc_t10dif crct10dif_generic mgag200 drm_kms_helper syscopyarea sysfillrect ahci igb isci sysimgblt fb_sys_fops ptp mlx4_core(OE) mpt2sas ttm libsas libahci pps_core crct10dif_pclmul drm dca crct10dif_common raid_class libata crc32c_intel i2c_algo_bit mlx_compat(OE) i2c_core scsi_transport_sas devlink
      [ 9067.305855] CPU: 0 PID: 11591 Comm: ldlm_bl_08 Tainted: P           OE  ------------   3.10.0-862.9.1.el7_lustre.x86_64 #1
      [ 9067.319633] Hardware name: Intel Corporation S2600GZ ........../S2600GZ, BIOS SE5C600.86B.01.08.0003.022620131521 02/26/2013
      [ 9067.333638] task: ffff98fe43e43f40 ti: ffff98fe2cf8c000 task.ti: ffff98fe2cf8c000
      [ 9067.343460] RIP: 0010:[<ffffffffc12a814d>]  [<ffffffffc12a814d>] ldlm_handle_conflict_lock+0x3d/0x330 [ptlrpc]
      [ 9067.356095] RSP: 0018:ffff98fe2cf8fbc0  EFLAGS: 00010246
      [ 9067.363466] RAX: 0000000000000001 RBX: ffff990230721600 RCX: 0000000000000000
      [ 9067.372905] RDX: ffff98fe2cf8fc18 RSI: ffff98fe2cf8fc80 RDI: ffff990230721600
      [ 9067.382288] RBP: ffff98fe2cf8fbf0 R08: ffff98fe2cf8fcd0 R09: ffff98feae059740
      [ 9067.391697] R10: ffff990230721600 R11: 000000020000c7a7 R12: ffff98fe2cf8fc18
      [ 9067.401046] R13: ffff98fe2cf8fc80 R14: ffff98fe2cf8fc18 R15: 0000000000000000
      [ 9067.410382] FS:  0000000000000000(0000) GS:ffff98feae000000(0000) knlGS:0000000000000000
      [ 9067.420839] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [ 9067.428646] CR2: 000000000000001c CR3: 00000006e360e000 CR4: 00000000000607f0
      [ 9067.437961] Call Trace:
      [ 9067.442072]  [<ffffffffc12d955d>] ldlm_process_inodebits_lock+0xfd/0x400 [ptlrpc]
      [ 9067.451841]  [<ffffffffc12d9460>] ? ldlm_inodebits_compat_queue+0x390/0x390 [ptlrpc]
      [ 9067.461888]  [<ffffffffc12a79ed>] ldlm_reprocess_queue+0x13d/0x2a0 [ptlrpc]
      [ 9067.471035]  [<ffffffffc12a858d>] __ldlm_reprocess_all+0x14d/0x3a0 [ptlrpc]
      [ 9067.480154]  [<ffffffffc12a8b46>] ldlm_reprocess_res+0x26/0x30 [ptlrpc]
      [ 9067.488864]  [<ffffffffc0aa0c50>] cfs_hash_for_each_relax+0x250/0x450 [libcfs]
      [ 9067.498240]  [<ffffffffc12a8b20>] ? ldlm_lock_downgrade+0x320/0x320 [ptlrpc]
      [ 9067.507403]  [<ffffffffc12a8b20>] ? ldlm_lock_downgrade+0x320/0x320 [ptlrpc]
      [ 9067.516516]  [<ffffffffc0aa3fe5>] cfs_hash_for_each_nolock+0x75/0x1c0 [libcfs]
      [ 9067.525834]  [<ffffffffc12a8b8c>] ldlm_reprocess_recovery_done+0x3c/0x110 [ptlrpc]
      [ 9067.535543]  [<ffffffffc12a983c>] ldlm_export_cancel_locks+0x11c/0x130 [ptlrpc]
      [ 9067.544910]  [<ffffffffc12d2c08>] ldlm_bl_thread_main+0x4c8/0x700 [ptlrpc]
      [ 9067.553787]  [<ffffffffc12d2740>] ? ldlm_handle_bl_callback+0x410/0x410 [ptlrpc]
      [ 9067.563206]  [<ffffffffac8bb621>] kthread+0xd1/0xe0
      [ 9067.569815]  [<ffffffffac8bb550>] ? insert_kthread_work+0x40/0x40
      [ 9067.577760]  [<ffffffffacf205f7>] ret_from_fork_nospec_begin+0x21/0x21
      [ 9067.586159]  [<ffffffffac8bb550>] ? insert_kthread_work+0x40/0x40
      [ 9067.594055] Code: 49 89 f5 41 54 53 48 89 fb 48 83 ec 08 f6 05 26 14 81 ff 01 48 89 4d d0 4c 8b 7f 48 74 0d f6 05 1b 14 81 ff 01 0f 85 63 01 00 00 <41> 8b 47 1c 85 c0 0f 84 6b 02 00 00 48 8d 43 60 48 39 43 60 0f
      [ 9067.618124] RIP  [<ffffffffc12a814d>] ldlm_handle_conflict_lock+0x3d/0x330 [ptlrpc]
      [ 9067.627815]  RSP <ffff98fe2cf8fbc0>
      [ 9067.632829] CR2: 000000000000001c
      [ 9067.640738] Lustre: soaked-MDT0000: Not available for connect from 192.168.1.107@o2ib (stopping)
      [ 9067.652220] Lustre: Skipped 12 previous similar messages
      [ 9067.725141] ---[ end trace 2c0ee3d783e754cb ]---
      [ 9067.800713] Kernel panic - not syncing: Fatal exception
      [ 9067.807687] Kernel Offset: 0x2b800000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
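      The oops is consistent with a load through a NULL pointer: CR2 is 000000000000001c, R15 is 0, and the faulting bytes in the Code line (`41 8b 47 1c`) decode to `mov 0x1c(%r15),%eax`, i.e. a 32-bit field read at offset 0x1c from a NULL base inside ldlm_handle_conflict_lock() while ldlm_export_cancel_locks() reprocesses queues during the umount. A minimal userspace sketch of this failure pattern (hypothetical struct and function names, not the real Lustre code):

      ```c
      #include <assert.h>
      #include <stddef.h>
      #include <stdio.h>

      /* Hypothetical stand-in for an LDLM lock: NOT the real struct
       * ldlm_lock, just a layout whose last field lands at offset 0x1c,
       * matching the faulting address in the oops (NULL base + 0x1c). */
      struct fake_lock {
          void *l_export;   /* offset 0  */
          long  l_refc;     /* offset 8  */
          int   l_flags;    /* offset 16 */
          int   l_req_mode; /* offset 20 */
          int   l_pad;      /* offset 24 */
          int   l_granted;  /* offset 28 == 0x1c */
      };

      /* The oops shows `mov 0x1c(%r15),%eax` with R15 == 0: a field read
       * through a NULL lock pointer. A guard like this is what the
       * reprocessing path appears to be missing when the export is torn
       * down mid-umount. */
      static int handle_conflict(struct fake_lock *lock)
      {
          if (lock == NULL)   /* stale entry: export already destroyed */
              return -1;
          return lock->l_granted;
      }

      int main(void)
      {
          struct fake_lock live = { .l_granted = 7 };

          assert(offsetof(struct fake_lock, l_granted) == 0x1c);
          assert(handle_conflict(&live) == 7);
          assert(handle_conflict(NULL) == -1);
          printf("field at offset 0x%zx, NULL deref guarded\n",
                 offsetof(struct fake_lock, l_granted));
          return 0;
      }
      ```

      This only illustrates why the crash address is a small constant (offset of a field in a struct reached through a NULL pointer); which actual field of struct ldlm_lock sits at 0x1c in this build would need to be confirmed against the 2.10.5 sources.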
      


            People

              Assignee: wc-triage (WC Triage)
              Reporter: sarah (Sarah Liu)
              Votes: 0
              Watchers: 5
