Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Major
    • Fix Version/s: Lustre 2.12.0
    • Affects Version/s: Lustre 2.11.0, Lustre 2.12.0
    • Labels: None
    • Environment: Soak stress cluster - lustre-master-ib build 64 version=2.10.58_139_g630cd49
    • Severity: 3
    • Rank: 9223372036854775807

    Description

      Attempting to re-mount the filesystem after the upgrade, we hit a hard crash on MDT0001.

      The crash is repeatable. I will leave the system in this state for examination, then re-format non-DNE.

      Crash dumps are available on soak

      [  451.170602] LDISKFS-fs warning (device dm-2): ldiskfs_multi_mount_protect:322: MMP interval 42 higher than expected, please wait.
      [  493.737484] LDISKFS-fs (dm-2): recovery complete
      [  493.793102] LDISKFS-fs (dm-2): mounted filesystem with ordered data mode. Opts: user_xattr,errors=remount-ro,user_xattr,no_mbcache,nodelalloc
      [  495.357987] LustreError: 2384:0:(tgt_lastrcvd.c:1533:tgt_clients_data_init()) soaked-MDT0001: duplicate export for client generation 11
      [  495.646489] LustreError: 2384:0:(obd_config.c:559:class_setup()) setup soaked-MDT0001 failed (-114)
      [  495.646493] LustreError: 2384:0:(obd_config.c:1822:class_config_llog_handler()) MGC192.168.1.108@o2ib: cfg command failed: rc = -114
      [  495.646497] Lustre:    cmd=cf003 0:soaked-MDT0001  1:soaked-MDT0001_UUID  2:1  3:soaked-MDT0001-mdtlov  4:f
      [  495.646570] LustreError: 15c-8: MGC192.168.1.108@o2ib: The configuration from log 'soaked-MDT0001' failed (-114). This may be the result of communication errors between this node and the MGS, a bad configuration, or other errors. See the syslog for more information.
      [  495.646587] LustreError: 2303:0:(obd_mount_server.c:1383:server_start_targets()) failed to start server soaked-MDT0001: -114
      [  495.646728] LustreError: 2303:0:(obd_mount_server.c:1936:server_fill_super()) Unable to start targets: -114
      [  495.646760] LustreError: 2303:0:(obd_config.c:610:class_cleanup()) Device 4 not setup
      [  495.899986] BUG: unable to handle kernel NULL pointer dereference at 0000000000000378
      [  495.899999] IP: [<ffffffff816b683c>] _raw_spin_lock+0xc/0x30
      [  495.900002] PGD 0
      [  495.900005] Oops: 0002 [#1] SMP
      [  495.900073] Modules linked in: mdd(OE) lod(OE) mdt(OE) lfsck(OE) mgc(OE) osd_ldiskfs(OE) ldiskfs(OE) lquota(OE) zfs(POE) zunicode(POE) zavl(POE) icp(POE) zcommon(POE) znvpair(POE) spl(OE) lustre(OE) lmv(OE) mdc(OE) osc(OE) lov(OE) fid(OE) fld(OE) ko2iblnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) libcfs(OE) rpcsec_gss_krb5 nfsv4 dns_resolver nfs fscache rdma_ucm(OE) ib_ucm(OE) rdma_cm(OE) iw_cm(OE) ib_ipoib(OE) ib_cm(OE) ib_uverbs(OE) ib_umad(OE) mlx5_ib(OE) mlx5_core(OE) mlx4_en(OE) sb_edac edac_core intel_powerclamp coretemp intel_rapl iosf_mbi kvm irqbypass crc32_pclmul ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper ablk_helper cryptd dm_round_robin iTCO_wdt iTCO_vendor_support ipmi_ssif sg joydev ipmi_si ipmi_devintf mei_me ioatdma ipmi_msghandler pcspkr wmi mei lpc_ich shpchp i2c_i801 dm_multipath
      [  495.900107]  dm_mod nfsd nfs_acl lockd grace auth_rpcgss sunrpc ip_tables ext4 mbcache jbd2 sd_mod crc_t10dif crct10dif_generic mlx4_ib(OE) ib_core(OE) mgag200 drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops isci ahci igb mpt2sas libsas ttm libahci ptp crct10dif_pclmul pps_core crct10dif_common mlx4_core(OE) raid_class drm libata crc32c_intel dca mlx_compat(OE) scsi_transport_sas i2c_algo_bit devlink i2c_core
      
      [  495.900113] CPU: 10 PID: 2167 Comm: obd_zombid Tainted: P           OE  ------------   3.10.0-693.21.1.el7_lustre.x86_64 #1
      [  495.900114] Hardware name: Intel Corporation S2600GZ ........../S2600GZ, BIOS SE5C600.86B.01.08.0003.022620131521 02/26/2013
      [  495.900117] task: ffff880036358fd0 ti: ffff8804176d0000 task.ti: ffff8804176d0000
      [  495.900122] RIP: 0010:[<ffffffff816b683c>]  [<ffffffff816b683c>] _raw_spin_lock+0xc/0x30
      [  495.900124] RSP: 0018:ffff8804176d3da8  EFLAGS: 00010246
      [  495.900126] RAX: 0000000000000000 RBX: ffff88081503c800 RCX: 000000018040003f
      [  495.900128] RDX: 0000000000000001 RSI: ffffea0020556b00 RDI: 0000000000000378
      [  495.900129] RBP: ffff8804176d3de0 R08: ffff8808155acf00 R09: 000000018040003f
      [  495.900131] R10: 0000000000000001 R11: ffffea0020556b00 R12: 0000000000000000
      [  495.900133] R13: 0000000000000378 R14: ffff880817131068 R15: ffff88081503c800
      [  495.900135] FS:  0000000000000000(0000) GS:ffff88082d880000(0000) knlGS:0000000000000000
      [  495.900137] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [  495.900139] CR2: 0000000000000378 CR3: 0000000001a02000 CR4: 00000000000607e0
      [  495.900141] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      [  495.900143] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
      [  495.900144] Call Trace:
      [  495.900246]  [<ffffffffc0d6b635>] ? tgt_grant_discard+0x35/0x190 [ptlrpc]
      [  495.900317]  [<ffffffffc0d3f74e>] ? tgt_client_free+0x17e/0x3b0 [ptlrpc]
      [  495.900354]  [<ffffffffc177c097>] mdt_destroy_export+0x87/0x200 [mdt]
      [  495.900410]  [<ffffffffc0a7b9be>] class_export_destroy+0xee/0x490 [obdclass]
      [  495.900448]  [<ffffffffc0a8434a>] obd_zombie_impexp_cull+0x39a/0x550 [obdclass]
      [  495.900479]  [<ffffffffc0a8456d>] obd_zombie_impexp_thread+0x6d/0x1c0 [obdclass]
      [  495.900489]  [<ffffffff810c7c70>] ? wake_up_state+0x20/0x20
      [  495.900519]  [<ffffffffc0a84500>] ? obd_zombie_impexp_cull+0x550/0x550 [obdclass]
      [  495.900526]  [<ffffffff810b4031>] kthread+0xd1/0xe0
      [  495.900530]  [<ffffffff810b3f60>] ? insert_kthread_work+0x40/0x40
      [  495.900537]  [<ffffffff816c0577>] ret_from_fork+0x77/0xb0
      [  495.900541]  [<ffffffff810b3f60>] ? insert_kthread_work+0x40/0x40
      [  495.900576] Code: 5d c3 0f 1f 44 00 00 85 d2 74 e4 0f 1f 40 00 eb ed 66 0f 1f 44 00 00 b8 01 00 00 00 5d c3 90 66 66 66 66 90 31 c0 ba 01 00 00 00 <f0> 0f b1 17 85 c0 75 01 c3 55 89 c6 48 89 e5 e8 99 27 ff ff 5d
      [  495.900580] RIP  [<ffffffff816b683c>] _raw_spin_lock+0xc/0x30
      [  495.900581]  RSP <ffff8804176d3da8>
      [  495.900582] CR2: 0000000000000378
      
      
      

      Attachments

        Issue Links

          Activity

            [LU-10806] Hard crash when mounting DNE MDT
            pjones Peter Jones added a comment -

            Landed for 2.12


            Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/33240/
            Subject: LU-10806 target: skip discard for a missing obt_lut
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 5ed65fd0594741e69999216db27d85f7f6f7f5d6


            The patch fixes the crash during mount caused by a duplicate generation in last_rcvd. But I think the real problem is two different records with the same generation in last_rcvd.


            Alexandr Boyko (c17825@cray.com) uploaded a new patch: https://review.whamcloud.com/33240
            Subject: LU-10806 target: skip discard for a missing obt_lut
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 9935a66421e5b109dc90a4caaa6d83ebac31cde0

            sarah Sarah Liu added a comment -

            soak hit this problem again after about 24h running on tag-2.11.55

            sarah Sarah Liu added a comment -

            Hit a similar error when running with lustre-master version=2.11.54_103_gdeb5aba for about two and a half days:

            MDS console

            soak-11 login: [  189.069605] LNet: HW NUMA nodes: 2, HW CPU cores: 32, npartitions: 2
            [  189.079399] alg: No test for adler32 (adler32-zlib)
            [  189.986576] Lustre: Lustre: Build Version: 2.11.54_103_gdeb5aba
            [  190.277235] LNet: Using FMR for registration
            [  190.294638] LNet: Added LNI 192.168.1.111@o2ib [8/256/0/180]
            [  190.473995] LDISKFS-fs warning (device dm-5): ldiskfs_multi_mount_protect:322: MMP interval 42 higher than expected, please wait.
            [  190.473995] 
            [  232.980161] LDISKFS-fs (dm-5): recovery complete
            [  232.985552] LDISKFS-fs (dm-5): mounted filesystem with ordered data mode. Opts: user_xattr,errors=remount-ro,user_xattr,no_mbcache,nodelalloc
            [  234.661896] LustreError: 4231:0:(tgt_lastrcvd.c:1540:tgt_clients_data_init()) soaked-MDT0003: duplicate export for client generation 5
            [  234.790360] LustreError: 4231:0:(obd_config.c:559:class_setup()) setup soaked-MDT0003 failed (-114)
            [  234.800544] LustreError: 4231:0:(obd_config.c:1835:class_config_llog_handler()) MGC192.168.1.108@o2ib: cfg command failed: rc = -114
            [  234.813900] Lustre:    cmd=cf003 0:soaked-MDT0003  1:soaked-MDT0003_UUID  2:3  3:soaked-MDT0003-mdtlov  4:f  
            [  234.813900] 
            [  234.826773] LustreError: 15c-8: MGC192.168.1.108@o2ib: The configuration from log 'soaked-MDT0003' failed (-114). This may be the result of communication errors betw
            een this node and the MGS, a bad configuration, or other errors. See the syslog for more information.
            [  234.853307] LustreError: 4077:0:(obd_mount_server.c:1386:server_start_targets()) failed to start server soaked-MDT0003: -114
            [  234.866025] LustreError: 4077:0:(obd_mount_server.c:1939:server_fill_super()) Unable to start targets: -114
            [  234.877033] LustreError: 4077:0:(obd_config.c:610:class_cleanup()) Device 4 not setup
            [  234.890447] BUG: unable to handle kernel NULL pointer dereference at 0000000000000380
            [  234.899277] IP: [<ffffffffb6d1682c>] _raw_spin_lock+0xc/0x30
            [  234.905658] PGD 0 
            [  234.907933] Oops: 0002 [#1] SMP 
            [  234.911578] Modules linked in: mdd(OE) lod(OE) mdt(OE) lfsck(OE) mgc(OE) osd_ldiskfs(OE) ldiskfs(OE) lquota(OE) fid(OE) fld(OE) ko2iblnd(OE) ptlrpc(OE) obdclass(OE) 
            lnet(OE) libcfs(OE) rpcsec_gss_krb5 nfsv4 dns_resolver nfs lockd grace fscache rdma_ucm(OE) ib_ucm(OE) rdma_cm(OE) iw_cm(OE) ib_ipoib(OE) ib_cm(OE) ib_uverbs(OE) ib_uma
            d(OE) mlx5_ib(OE) mlx5_core(OE) mlxfw(OE) mlx4_en(OE) dm_round_robin zfs(POE) zunicode(POE) zavl(POE) icp(POE) zcommon(POE) znvpair(POE) spl(OE) sb_edac intel_powerclam
            p coretemp intel_rapl iosf_mbi kvm irqbypass crc32_pclmul ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper ablk_helper cryptd ipmi_ssif iTCO_wdt iTCO_vendor_sup
            port i2c_i801 sg ipmi_si joydev mei_me mei lpc_ich ipmi_devintf ipmi_msghandler pcspkr shpchp ioatdma wmi dm_multipath dm_mod auth_rpcgss sunrpc ip_tables ext4 mbcache 
            jbd2 sd_mod crc_t10dif crct10dif_generic mlx4_ib(OE) ib_core(OE) mgag200 drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm igb isci ptp mlx4_core(OE) mpt
            3sas ahci pps_core drm libsas libahci devlink crct10dif_pclmul crct10dif_common dca crc32c_intel raid_class i2c_algo_bit libata mlx_compat(OE) i2c_core scsi_transport_s
            as
            [  235.028639] CPU: 19 PID: 230 Comm: kworker/19:1 Tainted: P           OE  ------------   3.10.0-862.9.1.el7_lustre.x86_64 #1
            [  235.041116] Hardware name: Intel Corporation SandyBridge Platform/To be filled by O.E.M., BIOS SE5C600.86B.01.08.0003.022620131521 02/26/2013
            [  235.055386] Workqueue: obd_zombid obd_zombie_exp_cull [obdclass]
            [  235.062133] task: ffff945dabe1cf10 ti: ffff945dabe3c000 task.ti: ffff945dabe3c000
            [  235.070513] RIP: 0010:[<ffffffffb6d1682c>]  [<ffffffffb6d1682c>] _raw_spin_lock+0xc/0x30
            [  235.081017] RSP: 0018:ffff945dabe3fd98  EFLAGS: 00010246
            [  235.088351] RAX: 0000000000000000 RBX: ffff945d7e112c00 RCX: 0000000000000956
            [  235.097753] RDX: 0000000000000001 RSI: 0000000000000002 RDI: 0000000000000380
            [  235.107136] RBP: ffff945dabe3fdd0 R08: 000000000001bac0 R09: ffffffffc15bf8ae
            [  235.116518] R10: ffff945dae2dbac0 R11: ffffde280ff93b00 R12: 0000000000000000
            [  235.125883] R13: 0000000000000380 R14: ffff945d7ffa1040 R15: 00000000000004c0
            [  235.135230] FS:  0000000000000000(0000) GS:ffff945dae2c0000(0000) knlGS:0000000000000000
            [  235.145653] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
            [  235.153454] CR2: 0000000000000380 CR3: 0000000462a0e000 CR4: 00000000000607e0
            [  235.162816] Call Trace:
            [  235.166982]  [<ffffffffc15ec0c5>] ? tgt_grant_discard+0x35/0x190 [ptlrpc]
            [  235.175956]  [<ffffffffc15bf8ae>] ? tgt_client_free+0x17e/0x3b0 [ptlrpc]
            [  235.185758]  [<ffffffffc1870097>] mdt_destroy_export+0x87/0x200 [mdt]
            [  235.195292]  [<ffffffffc134d4fe>] class_export_destroy+0xee/0x490 [obdclass]
            [  235.205422]  [<ffffffffc134d8b5>] obd_zombie_exp_cull+0x15/0x20 [obdclass]
            [  235.215313]  [<ffffffffb66b35ef>] process_one_work+0x17f/0x440
            [  235.223979]  [<ffffffffb66b4686>] worker_thread+0x126/0x3c0
            [  235.232344]  [<ffffffffb66b4560>] ? manage_workers.isra.24+0x2a0/0x2a0
            [  235.241768]  [<ffffffffb66bb621>] kthread+0xd1/0xe0
            [  235.249282]  [<ffffffffb66bb550>] ? insert_kthread_work+0x40/0x40
            [  235.258143]  [<ffffffffb6d205f7>] ret_from_fork_nospec_begin+0x21/0x21
            [  235.267438]  [<ffffffffb66bb550>] ? insert_kthread_work+0x40/0x40
            [  235.276219] Code: 5d c3 0f 1f 44 00 00 85 d2 74 e4 0f 1f 40 00 eb ed 66 0f 1f 44 00 00 b8 01 00 00 00 5d c3 90 66 66 66 66 90 31 c0 ba 01 00 00 00 <f0> 0f b1 17 85 c0 75 01 c3 55 89 c6 48 89 e5 e8 c5 2c ff ff 5d 
            [  235.302152] RIP  [<ffffffffb6d1682c>] _raw_spin_lock+0xc/0x30
            [  235.310516]  RSP <ffff945dabe3fd98>
            [  235.316349] CR2: 0000000000000380
            [  235.321927] ---[ end trace c992470b75e3279d ]---
            [  235.402279] Kernel panic - not syncing: Fatal exception
            [  235.410133] Kernel Offset: 0x35600000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
            [  235.494959] ------------[ cut here ]------------
            [  235.501354] WARNING: CPU: 19 PID: 230 at arch/x86/kernel/smp.c:127 native_smp_send_reschedule+0x65/0x70
            [  235.513010] Modules linked in: mdd(OE) lod(OE) mdt(OE) lfsck(OE) mgc(OE) osd_ldiskfs(OE) ldiskfs(OE) lquota(OE) fid(OE) fld(OE) ko2iblnd(OE) ptlrpc(OE) obdclass(OE) 
            lnet(OE) libcfs(OE) rpcsec_gss_krb5 nfsv4 dns_resolver nfs lockd grace fscache rdma_ucm(OE) ib_ucm(OE) rdma_cm(OE) iw_cm(OE) ib_ipoib(OE) ib_cm(OE) ib_uverbs(OE) ib_uma
            d(OE) mlx5_ib(OE) mlx5_core(OE) mlxfw(OE) mlx4_en(OE) dm_round_robin zfs(POE) zunicode(POE) zavl(POE) icp(POE) zcommon(POE) znvpair(POE) spl(OE) sb_edac intel_powerclam
            p coretemp intel_rapl iosf_mbi kvm irqbypass crc32_pclmul ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper ablk_helper cryptd ipmi_ssif iTCO_wdt iTCO_vendor_sup
            port i2c_i801 sg ipmi_si joydev mei_me mei lpc_ich ipmi_devintf ipmi_msghandler pcspkr shpchp ioatdma wmi dm_multipath dm_mod auth_rpcgss sunrpc ip_tables ext4 mbcache jbd2 sd_mod crc_t10dif crct10dif_generic mlx4_ib(OE) ib_core(OE) mgag200 drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm igb isci ptp mlx4_core(OE) mpt3sas ahci pps_core drm libsas libahci devlink crct10dif_pclmul crct10dif_common dca crc32c_intel raid_class i2c_algo_bit libata mlx_compat(OE) i2c_core scsi_transport_sas
            [  235.641969] CPU: 19 PID: 230 Comm: kworker/19:1 Tainted: P      D    OE  ------------   3.10.0-862.9.1.el7_lustre.x86_64 #1
            [  235.655571] Hardware name: Intel Corporation SandyBridge Platform/To be filled by O.E.M., BIOS SE5C600.86B.01.08.0003.022620131521 02/26/2013
            [  235.672103] Workqueue: obd_zombid obd_zombie_exp_cull [obdclass]
            [  235.679970] Call Trace:
            [  235.683848]  <IRQ>  [<ffffffffb6d0e84e>] dump_stack+0x19/0x1b
            [  235.691458]  [<ffffffffb6691e18>] __warn+0xd8/0x100
            [  235.698043]  [<ffffffffb6691f5d>] warn_slowpath_null+0x1d/0x20
            [  235.705703]  [<ffffffffb6654e95>] native_smp_send_reschedule+0x65/0x70
            [  235.714144]  [<ffffffffb66ddf81>] trigger_load_balance+0x191/0x280
            [  235.722184]  [<ffffffffb66cdc0a>] scheduler_tick+0x10a/0x150
            [  235.729649]  [<ffffffffb6701c10>] ? tick_sched_do_timer+0x50/0x50
            [  235.737599]  [<ffffffffb66a4f65>] update_process_times+0x65/0x80
            [  235.745437]  [<ffffffffb6701a10>] tick_sched_handle+0x30/0x70
            [  235.752977]  [<ffffffffb6701c49>] tick_sched_timer+0x39/0x80
            [  235.760423]  [<ffffffffb66bf7e6>] __hrtimer_run_queues+0xd6/0x260
            [  235.768326]  [<ffffffffb66bfd7f>] hrtimer_interrupt+0xaf/0x1d0
            [  235.775951]  [<ffffffffb665847b>] local_apic_timer_interrupt+0x3b/0x60
            [  235.784346]  [<ffffffffb6d25063>] smp_apic_timer_interrupt+0x43/0x60
            [  235.792545]  [<ffffffffb6d217b2>] apic_timer_interrupt+0x162/0x170
            [  235.800552]  <EOI>  [<ffffffffb6d08c3d>] ? panic+0x1d5/0x21f
            [  235.808001]  [<ffffffffb6d08ba1>] ? panic+0x139/0x21f
            [  235.814751]  [<ffffffffb6d18745>] oops_end+0xc5/0xe0
            [  235.821391]  [<ffffffffb6d0807e>] no_context+0x285/0x2a8
            [  235.828408]  [<ffffffffb6d08115>] __bad_area_nosemaphore+0x74/0x1d1
            [  235.836493]  [<ffffffffb6d08286>] bad_area_nosemaphore+0x14/0x16
            [  235.844301]  [<ffffffffb6d1b6e0>] __do_page_fault+0x330/0x4f0
            [  235.851818]  [<ffffffffb66db5e8>] ? enqueue_task_fair+0x208/0x6c0
            [  235.859687]  [<ffffffffb6d1b8d5>] do_page_fault+0x35/0x90
            [  235.866774]  [<ffffffffb6d17758>] page_fault+0x28/0x30
            [  235.873635]  [<ffffffffc15bf8ae>] ? tgt_client_free+0x17e/0x3b0 [ptlrpc]
            [  235.882169]  [<ffffffffb6d1682c>] ? _raw_spin_lock+0xc/0x30
            [  235.889469]  [<ffffffffc15ec0c5>] ? tgt_grant_discard+0x35/0x190 [ptlrpc]
            [  235.898128]  [<ffffffffc15bf8ae>] ? tgt_client_free+0x17e/0x3b0 [ptlrpc]
            [  235.906674]  [<ffffffffc1870097>] mdt_destroy_export+0x87/0x200 [mdt]
            [  235.914904]  [<ffffffffc134d4fe>] class_export_destroy+0xee/0x490 [obdclass]
            [  235.923792]  [<ffffffffc134d8b5>] obd_zombie_exp_cull+0x15/0x20 [obdclass]
            [  235.932481]  [<ffffffffb66b35ef>] process_one_work+0x17f/0x440
            [  235.939970]  [<ffffffffb66b4686>] worker_thread+0x126/0x3c0
            [  235.947140]  [<ffffffffb66b4560>] ? manage_workers.isra.24+0x2a0/0x2a0
            [  235.955369]  [<ffffffffb66bb621>] kthread+0xd1/0xe0
            [  235.961715]  [<ffffffffb66bb550>] ? insert_kthread_work+0x40/0x40
            [  235.969404]  [<ffffffffb6d205f7>] ret_from_fork_nospec_begin+0x21/0x21
            [  235.977546]  [<ffffffffb66bb550>] ? insert_kthread_work+0x40/0x40
            [  235.985195] ---[ end trace c992470b75e3279e ]---
            
            laisiyao Lai Siyao added a comment -

            Do the other MDTs mount successfully? It would also be good to know whether this is reproducible.


            cliffw Cliff White (Inactive) added a comment -

            The previous version was latest master, plus: https://review.whamcloud.com/#/c/31475/
            Lustre version=2.10.58_76_gbe9f2ee
            laisiyao Lai Siyao added a comment -
            [  495.357987] LustreError: 2384:0:(tgt_lastrcvd.c:1533:tgt_clients_data_init()) soaked-MDT0001: duplicate export for client generation 11

            shows that the last_rcvd file is corrupt: two client records have the same generation. This caused the mount failure, and the error-handling code then triggered the crash. So this is a bug in the error-handling code. I'll try to reproduce it and see how to fix it.

            I have a question about the upgrade: what is its original version?

            pjones Peter Jones added a comment -

            Lai

            Could you please investigate this issue?

            Peter


            People

              Assignee: laisiyao Lai Siyao
              Reporter: cliffw Cliff White (Inactive)
              Votes: 0
              Watchers: 8

              Dates

                Created:
                Updated:
                Resolved: