[LU-11457] osd_oi_insert(): the FID is used by two objects

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Critical
    • Fix Version/s: Lustre 2.16.0
    • Affects Version/s: Lustre 2.12.0, Lustre 2.12.2, Lustre 2.12.4, Lustre 2.12.5

    Description

      tag-2.11.55

      MDS crash

      [15650.670434] device-mapper: multipath: Failing path 8:96.
      [15650.765276] BUG: unable to handle kernel NULL pointer dereference at           (null)
      [15650.775741] IP: [<          (null)>]           (null)
      [15650.783081] PGD 0
      [15650.786948] Oops: 0010 [#1] SMP
      [15650.792218] Modules linked in: osp(OE) mdd(OE) lod(OE) mdt(OE) lfsck(OE) mgc(OE) osd_ldiskfs(OE) ldiskfs(OE) lquota(OE) lustre(OE) lmv(OE) mdc(OE) osc(OE) lov(OE) fid(OE) fld(OE) ko2iblnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) libcfs(OE) rpcsec_gss_krb5 nfsv4 dns_resolver nfs lockd grace fscache rdma_ucm(OE) ib_ucm(OE) rdma_cm(OE) iw_cm(OE) ib_ipoib(OE) ib_cm(OE) ib_uverbs(OE) ib_umad(OE) mlx5_ib(OE) mlx5_core(OE) mlxfw(OE) mlx4_en(OE) dm_round_robin zfs(POE) zunicode(POE) zavl(POE) icp(POE) zcommon(POE) znvpair(POE) spl(OE) sb_edac intel_powerclamp coretemp intel_rapl iosf_mbi kvm irqbypass crc32_pclmul ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper ablk_helper cryptd iTCO_wdt joydev pcspkr ipmi_ssif iTCO_vendor_support sg ipmi_si ipmi_devintf shpchp ipmi_msghandler i2c_i801 mei_me ioatdma mei lpc_ich wmi dm_multipath dm_mod auth_rpcgss sunrpc ip_tables ext4 mbcache jbd2 sd_mod crc_t10dif crct10dif_generic mlx4_ib(OE) ib_core(OE) mgag200 drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm drm igb isci ahci ptp mlx4_core(OE) mpt3sas libsas libahci pps_core dca crct10dif_pclmul devlink i2c_algo_bit crct10dif_common raid_class crc32c_intel libata i2c_core mlx_compat(OE) scsi_transport_sas
      [15650.934649] CPU: 14 PID: 9491 Comm: mdt_rdpg01_008 Tainted: P           OE  ------------   3.10.0-862.9.1.el7_lustre.x86_64 #1
      [15650.952002] Hardware name: Intel Corporation S2600GZ ........../S2600GZ, BIOS SE5C600.86B.01.08.0003.022620131521 02/26/2013
      [15650.966961] task: ffff8be6a1253f40 ti: ffff8be68a044000 task.ti: ffff8be68a044000
      [15650.977791] RIP: 0010:[<0000000000000000>]  [<          (null)>]           (null)
      [15650.988693] RSP: 0018:ffff8be68a047b58  EFLAGS: 00010246
      [15650.997167] RAX: 0000000000000000 RBX: ffff8be68b820000 RCX: 0000000000000002
      [15651.007733] RDX: ffffffffc164c7b0 RSI: ffff8be68a047b60 RDI: ffff8be68b820008
      [15651.018326] RBP: ffff8be68a047b98 R08: 0000000000000004 R09: 0000000000000000
      [15651.028930] R10: 0000000000000001 R11: 00000000007fffff R12: ffff8be26f9fab00
      [15651.039547] R13: ffff8be279a448a0 R14: ffff8be68a160000 R15: ffff8be68b820008
      [15651.050168] FS:  0000000000000000(0000) GS:ffff8be6ad980000(0000) knlGS:0000000000000000
      [15651.061880] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [15651.070980] CR2: 0000000000000000 CR3: 000000042c3b6000 CR4: 00000000000607e0
      [15651.081660] Call Trace:
      [15651.087091]  [<ffffffffc164ac3e>] ? osd_ldiskfs_it_fill+0xbe/0x260 [osd_ldiskfs]
      [15651.098058]  [<ffffffffc164ae17>] osd_it_ea_load+0x37/0x100 [osd_ldiskfs]
      [15651.108370]  [<ffffffffc188eb47>] lod_it_load+0x27/0x90 [lod]
      [15651.117554]  [<ffffffffc0f48808>] dt_index_walk+0xf8/0x430 [obdclass]
      [15651.127457]  [<ffffffffc1915080>] ? mdd_object_lock+0xe0/0xe0 [mdd]
      [15651.137132]  [<ffffffffc1916d9f>] mdd_readpage+0x25f/0x5a0 [mdd]
      [15651.146553]  [<ffffffffc1782bda>] mdt_readpage+0x63a/0x880 [mdt]
      [15651.155992]  [<ffffffffc11e82ca>] tgt_request_handle+0xaea/0x1580 [ptlrpc]
      [15651.166379]  [<ffffffffc11c02e1>] ? ptlrpc_nrs_req_get_nolock0+0xd1/0x170 [ptlrpc]
      [15651.177493]  [<ffffffffc0dfcbde>] ? ktime_get_real_seconds+0xe/0x10 [libcfs]
      [15651.188033]  [<ffffffffc118b48b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
      [15651.199251]  [<ffffffffc1188315>] ? ptlrpc_wait_event+0xa5/0x360 [ptlrpc]
      [15651.209399]  [<ffffffff83ccf682>] ? default_wake_function+0x12/0x20
      [15651.218931]  [<ffffffff83cc52ab>] ? __wake_up_common+0x5b/0x90
      [15651.228026]  [<ffffffffc118ecc4>] ptlrpc_main+0xb14/0x1fb0 [ptlrpc]
      [15651.237575]  [<ffffffffc118e1b0>] ? ptlrpc_register_service+0xe90/0xe90 [ptlrpc]
      [15651.248365]  [<ffffffff83cbb621>] kthread+0xd1/0xe0
      [15651.256344]  [<ffffffff83cbb550>] ? insert_kthread_work+0x40/0x40
      [15651.265688]  [<ffffffff843205f7>] ret_from_fork_nospec_begin+0x21/0x21
      [15651.275475]  [<ffffffff83cbb550>] ? insert_kthread_work+0x40/0x40
      [15651.284736] Code:  Bad RIP value.
      [15651.290946] RIP  [<          (null)>]           (null)
      [15651.299236]  RSP <ffff8be68a047b58>
      [15651.305543] CR2: 0000000000000000
      [15651.315778] ---[ end trace 4ae4238c00f9aeec ]---
      [15651.336386] Kernel panic - not syncing: Fatal exception
      [15651.344613] Kernel Offset: 0x2c00000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
      [15651.369289] ------------[ cut here ]------------
      [15651.376397] WARNING: CPU: 14 PID: 9491 at arch/x86/kernel/smp.c:127 native_smp_send_reschedule+0x65/0x70
      [15651.388915] Modules linked in: osp(OE) mdd(OE) lod(OE) mdt(OE) lfsck(OE) mgc(OE) osd_ldiskfs(OE) ldiskfs(OE) lquota(OE) lustre(OE) lmv(OE) mdc(OE) osc(OE) lov(OE) fid(OE) fld(OE) ko2iblnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) libcfs(OE) rpcsec_gss_krb5 nfsv4 dns_resolver nfs lockd grace fscache rdma_ucm(OE) ib_ucm(OE) rdma_cm(OE) iw_cm(OE) ib_ipoib(OE) ib_cm(OE) ib_uverbs(OE) ib_umad(OE) mlx5_ib(OE) mlx5_core(OE) mlxfw(OE) mlx4_en(OE) dm_round_robin zfs(POE) zunicode(POE) zavl(POE) icp(POE) zcommon(POE) znvpair(POE) spl(OE) sb_edac intel_powerclamp coretemp intel_rapl iosf_mbi kvm irqbypass crc32_pclmul ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper ablk_helper cryptd iTCO_wdt joydev pcspkr ipmi_ssif iTCO_vendor_support sg ipmi_si ipmi_devintf shpchp ipmi_msghandler i2c_i801 mei_me ioatdma mei lpc_ich wmi dm_multipath dm_mod auth_rpcgss sunrpc ip_tables ext4 mbcache jbd2 sd_mod crc_t10dif crct10dif_generic mlx4_ib(OE) ib_core(OE) mgag200 drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm drm igb isci ahci ptp mlx4_core(OE) mpt3sas libsas libahci pps_core dca crct10dif_pclmul devlink i2c_algo_bit crct10dif_common raid_class crc32c_intel libata i2c_core mlx_compat(OE) scsi_transport_sas
      [15651.529620] CPU: 14 PID: 9491 Comm: mdt_rdpg01_008 Tainted: P      D    OE  ------------   3.10.0-862.9.1.el7_lustre.x86_64 #1
      [15651.546472] Hardware name: Intel Corporation S2600GZ ........../S2600GZ, BIOS SE5C600.86B.01.08.0003.022620131521 02/26/2013
      [15651.561156] Call Trace:
      [15651.566023]  <IRQ>  [<ffffffff8430e84e>] dump_stack+0x19/0x1b
      [15651.574646]  [<ffffffff83c91e18>] __warn+0xd8/0x100
      [15651.582224]  [<ffffffff83c91f5d>] warn_slowpath_null+0x1d/0x20
      [15651.590851]  [<ffffffff83c54e95>] native_smp_send_reschedule+0x65/0x70
      [15651.600279]  [<ffffffff83cddf81>] trigger_load_balance+0x191/0x280
      [15651.609280]  [<ffffffff83ccdc0a>] scheduler_tick+0x10a/0x150
      [15651.617702]  [<ffffffff83d01c10>] ? tick_sched_do_timer+0x50/0x50
      [15651.626619]  [<ffffffff83ca4f65>] update_process_times+0x65/0x80
      [15651.635416]  [<ffffffff83d01a10>] tick_sched_handle+0x30/0x70
      [15651.643916]  [<ffffffff83d01c49>] tick_sched_timer+0x39/0x80
      [15651.652315]  [<ffffffff83cbf7e6>] __hrtimer_run_queues+0xd6/0x260
      [15651.661210]  [<ffffffff83cbfd7f>] hrtimer_interrupt+0xaf/0x1d0
      [15651.669814]  [<ffffffff83c5847b>] local_apic_timer_interrupt+0x3b/0x60
      [15651.679184]  [<ffffffff84325063>] smp_apic_timer_interrupt+0x43/0x60
      [15651.688352]  [<ffffffff843217b2>] apic_timer_interrupt+0x162/0x170
      [15651.697316]  <EOI>  [<ffffffff84308c3d>] ? panic+0x1d5/0x21f
      [15651.705715]  [<ffffffff84308ba1>] ? panic+0x139/0x21f
      [15651.713430]  [<ffffffff84318745>] oops_end+0xc5/0xe0
      [15651.721020]  [<ffffffff8430807e>] no_context+0x285/0x2a8
      [15651.728984]  [<ffffffff84308115>] __bad_area_nosemaphore+0x74/0x1d1
      [15651.738014]  [<ffffffff84308286>] bad_area_nosemaphore+0x14/0x16
      [15651.746760]  [<ffffffff8431b6e0>] __do_page_fault+0x330/0x4f0
      [15651.755199]  [<ffffffff8431b8d5>] do_page_fault+0x35/0x90
      [15651.763264]  [<ffffffff84317758>] page_fault+0x28/0x30
      [15651.771013]  [<ffffffffc164c7b0>] ? osd_object_alloc+0x360/0x360 [osd_ldiskfs]
      [15651.781105]  [<ffffffffc164ac3e>] ? osd_ldiskfs_it_fill+0xbe/0x260 [osd_ldiskfs]
      [15651.791402]  [<ffffffffc164ae17>] osd_it_ea_load+0x37/0x100 [osd_ldiskfs]
      [15651.801028]  [<ffffffffc188eb47>] lod_it_load+0x27/0x90 [lod]
      [15651.809517]  [<ffffffffc0f48808>] dt_index_walk+0xf8/0x430 [obdclass]
      [15651.818761]  [<ffffffffc1915080>] ? mdd_object_lock+0xe0/0xe0 [mdd]
      [15651.827808]  [<ffffffffc1916d9f>] mdd_readpage+0x25f/0x5a0 [mdd]
      [15651.836533]  [<ffffffffc1782bda>] mdt_readpage+0x63a/0x880 [mdt]
      [15651.845269]  [<ffffffffc11e82ca>] tgt_request_handle+0xaea/0x1580 [ptlrpc]
      [15651.854937]  [<ffffffffc11c02e1>] ? ptlrpc_nrs_req_get_nolock0+0xd1/0x170 [ptlrpc]
      [15651.865302]  [<ffffffffc0dfcbde>] ? ktime_get_real_seconds+0xe/0x10 [libcfs]
      [15651.875084]  [<ffffffffc118b48b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
      [15651.885522]  [<ffffffffc1188315>] ? ptlrpc_wait_event+0xa5/0x360 [ptlrpc]
      [15651.894881]  [<ffffffff83ccf682>] ? default_wake_function+0x12/0x20
      [15651.903621]  [<ffffffff83cc52ab>] ? __wake_up_common+0x5b/0x90
      [15651.911880]  [<ffffffffc118ecc4>] ptlrpc_main+0xb14/0x1fb0 [ptlrpc]
      [15651.920585]  [<ffffffffc118e1b0>] ? ptlrpc_register_service+0xe90/0xe90 [ptlrpc]
      [15651.930460]  [<ffffffff83cbb621>] kthread+0xd1/0xe0
      [15651.937468]  [<ffffffff83cbb550>] ? insert_kthread_work+0x40/0x40
      [15651.945794]  [<ffffffff843205f7>] ret_from_fork_nospec_begin+0x21/0x21
      [15651.954564]  [<ffffffff83cbb550>] ? insert_kthread_work+0x40/0x40
      [15651.962806] ---[ end trace 4ae4238c00f9aeed ]---
      
      

          Activity

            sarah Sarah Liu added a comment -

            I think we can close it as cannot reproduce for now.

            pjones Peter Jones added a comment -

            So can we close this ticket as cannot reproduce? Previously this had been happening every couple of hours, right?

            sarah Sarah Liu added a comment -

            Hi John, it was lustre-master-ib build 137. No, I haven't seen the failure recently; soak is running on lustre-lustre-ib build 142 right now.

            jhammond John Hammond added a comment -

            Which build was this? Was it lustre-master-patchless #137? Is there a more recent build (recent enough that we still have the RPMs) that has this issue?

            ys Yang Sheng added a comment -
            [15638.646875] blk_cloned_rq_check_limits: over max size limit.
            [15638.646879] blk_cloned_rq_check_limits: over max size limit.
            [15638.646888] blk_cloned_rq_check_limits: over max size limit.
            [15638.646890] blk_cloned_rq_check_limits: over max size limit.
            [15638.646892] blk_cloned_rq_check_limits: over max size limit.
            [15638.646895] blk_cloned_rq_check_limits: over max size limit.
            [15638.646916] device-mapper: multipath: Failing path 8:96.
            

            This issue should have been fixed by LU-9551, but it looks like it is still there. The good news is that we can reproduce it on our cluster.
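
            The pattern in the log excerpt above (a burst of blk_cloned_rq_check_limits errors immediately followed by a multipath path failure) can be checked for mechanically in a saved console log. The following is a minimal Python sketch under stated assumptions: the log is plain dmesg-style text, and the function name find_path_failures and the 1-second window are illustrative choices, not part of Lustre or any kernel tool.

```python
import re

# Hypothetical helper for triage; the names and the window size are assumptions.
LIMIT_RE = re.compile(
    r"\[\s*(\d+\.\d+)\] blk_cloned_rq_check_limits: over max size limit")
FAIL_RE = re.compile(
    r"\[\s*(\d+\.\d+)\] device-mapper: multipath: Failing path (\d+:\d+)")


def find_path_failures(log_text, window=1.0):
    """Return (path, fail_time, n_limit_errors) for every multipath path
    failure that was preceded by limit errors within `window` seconds."""
    limit_times = []
    results = []
    for line in log_text.splitlines():
        m = LIMIT_RE.search(line)
        if m:
            limit_times.append(float(m.group(1)))
            continue
        m = FAIL_RE.search(line)
        if m:
            fail_time = float(m.group(1))
            # Count limit errors logged shortly before this failure.
            n = sum(1 for t in limit_times if 0 <= fail_time - t <= window)
            if n:
                results.append((m.group(2), fail_time, n))
    return results


sample = """\
[15638.646875] blk_cloned_rq_check_limits: over max size limit.
[15638.646879] blk_cloned_rq_check_limits: over max size limit.
[15638.646916] device-mapper: multipath: Failing path 8:96.
"""

for path, when, count in find_path_failures(sample):
    print(f"path {path} failed at {when} after {count} limit errors")
```

            Running this over the full console log from each crash would show whether every oops is preceded by the same block-layer errors, which is the correlation the comment above points at.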

            ys Yang Sheng added a comment -

            Maybe this is an issue like LDEV-642. Let me investigate further.

            pjones Peter Jones added a comment -

            Yang Sheng

            This seems to be an ldiskfs issue. It has only appeared between the 2.11.54 and 2.11.55 tags. Any suggestions?

            Peter

            sarah Sarah Liu added a comment -

            Hi Lai,

            No, I didn't run dir migration. When I filed the ticket, it was the first time I had seen the problem, but I restarted soak on the tip of master on Monday with build #137 and hit the same issue again after about 2 hours of running. Here is the trace from the latest failure:

            on soak-10

            [ 9654.637324] blk_cloned_rq_check_limits: over max size limit.
            [ 9658.590523] BUG: unable to handle kernel NULL pointer dereference at           (null)
            [ 9658.600366] IP: [<          (null)>]           (null)
            [ 9658.606866] PGD 0 
            [ 9658.609984] Oops: 0010 [#1] SMP 
            [ 9658.614447] Modules linked in: osp(OE) mdd(OE) lod(OE) mdt(OE) lfsck(OE) mgc(OE) osd_ldiskfs(OE) ldiskfs(OE) lquota(OE) lustre(OE) lmv(OE) mdc(OE) osc(OE) lov(OE) fid(OE) fld(OE) ko2iblnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) libcfs(OE) rpcsec_gss_krb5 nfsv4 dns_resolver nfs lockd grace fscache rdma_ucm(OE) ib_ucm(OE) rdma_cm(OE) iw_cm(OE) ib_ipoib(OE) ib_cm(OE) ib_uverbs(OE) ib_umad(OE) mlx5_ib(OE) mlx5_core(OE) mlxfw(OE) mlx4_en(OE) dm_round_robin zfs(POE) zunicode(POE) zavl(POE) icp(POE) zcommon(POE) znvpair(POE) sb_edac intel_powerclamp coretemp spl(OE) intel_rapl iosf_mbi kvm irqbypass crc32_pclmul ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper ablk_helper cryptd ipmi_ssif pcspkr ipmi_si ipmi_devintf ipmi_msghandler iTCO_wdt iTCO_vendor_support joydev sg mei_me mei i2c_i801 lpc_ich ioatdma wmi shpchp auth_rpcgss dm_multipath dm_mod sunrpc ip_tables ext4 mbcache jbd2 sd_mod crc_t10dif crct10dif_generic mlx4_ib(OE) ib_core(OE) mgag200 drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm igb isci ahci ptp drm mlx4_core(OE) libsas libahci pps_core mpt3sas dca crct10dif_pclmul devlink i2c_algo_bit crct10dif_common crc32c_intel raid_class libata i2c_core mlx_compat(OE) scsi_transport_sas
            [ 9658.745591] CPU: 11 PID: 22180 Comm: mdt_out01_010 Tainted: P           OE  ------------   3.10.0-862.9.1.el7_lustre.x86_64 #1
            [ 9658.760612] Hardware name: Intel Corporation SandyBridge Platform/To be filled by O.E.M., BIOS SE5C600.86B.01.08.0003.022620131521 02/26/2013
            [ 9658.777200] task: ffff9e132d6f2f70 ti: ffff9e12e2b24000 task.ti: ffff9e12e2b24000
            [ 9658.786678] RIP: 0010:[<0000000000000000>]  [<          (null)>]           (null)
            [ 9658.796193] RSP: 0018:ffff9e12e2b27b60  EFLAGS: 00010246
            [ 9658.803358] RAX: 0000000000000000 RBX: ffff9e1715878000 RCX: 0000000000000002
            [ 9658.812569] RDX: ffffffffc16877b0 RSI: ffff9e12e2b27b68 RDI: ffff9e1715878008
            [ 9658.821879] RBP: ffff9e12e2b27ba0 R08: 0000000000000004 R09: 0000000000000000
            [ 9658.831212] R10: 0000000000000001 R11: 00000000007fffff R12: ffff9e1317b6ad00
            [ 9658.840559] R13: ffff9e16edd7c030 R14: ffff9e1714d4e800 R15: ffff9e1715878008
            [ 9658.849980] FS:  0000000000000000(0000) GS:ffff9e172d8c0000(0000) knlGS:0000000000000000
            [ 9658.860384] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
            [ 9658.868127] CR2: 0000000000000000 CR3: 000000073a60e000 CR4: 00000000000607e0
            [ 9658.877444] Call Trace:
            [ 9658.881511]  [<ffffffffc1685c3e>] ? osd_ldiskfs_it_fill+0xbe/0x260 [osd_ldiskfs]
            [ 9658.891221]  [<ffffffffc1685e17>] osd_it_ea_load+0x37/0x100 [osd_ldiskfs]
            [ 9658.900163]  [<ffffffffc0f47608>] dt_index_walk+0xf8/0x430 [obdclass]
            [ 9658.908835]  [<ffffffffc0f47940>] ? dt_index_walk+0x430/0x430 [obdclass]
            [ 9658.917808]  [<ffffffffc0f48a14>] dt_index_read+0x394/0x6a0 [obdclass]
            [ 9658.926634]  [<ffffffffc11dfd22>] tgt_obd_idx_read+0x612/0x860 [ptlrpc]
            [ 9658.935280]  [<ffffffffc11e2f3a>] tgt_request_handle+0xaea/0x1580 [ptlrpc]
            [ 9658.944303]  [<ffffffffc11bea61>] ? ptlrpc_nrs_req_get_nolock0+0xd1/0x170 [ptlrpc]
            [ 9658.954245]  [<ffffffffc0dc2bde>] ? ktime_get_real_seconds+0xe/0x10 [libcfs]
            [ 9658.963572]  [<ffffffffc1189acb>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
            [ 9658.973654]  [<ffffffffc1186955>] ? ptlrpc_wait_event+0xa5/0x360 [ptlrpc]
            [ 9658.982660]  [<ffffffffaaccf682>] ? default_wake_function+0x12/0x20
            [ 9658.991178]  [<ffffffffaacc52ab>] ? __wake_up_common+0x5b/0x90
            [ 9658.999219]  [<ffffffffc118d2ec>] ptlrpc_main+0xafc/0x1fb0 [ptlrpc]
            [ 9659.007754]  [<ffffffffc118c7f0>] ? ptlrpc_register_service+0xe90/0xe90 [ptlrpc]
            [ 9659.017507]  [<ffffffffaacbb621>] kthread+0xd1/0xe0
            [ 9659.024461]  [<ffffffffaacbb550>] ? insert_kthread_work+0x40/0x40
            [ 9659.032809]  [<ffffffffab3205f7>] ret_from_fork_nospec_begin+0x21/0x21
            [ 9659.041631]  [<ffffffffaacbb550>] ? insert_kthread_work+0x40/0x40
            [ 9659.049943] Code:  Bad RIP value.
            [ 9659.055214] RIP  [<          (null)>]           (null)
            [ 9659.062486]  RSP <ffff9e12e2b27b60>
            [ 9659.067869] CR2: 0000000000000000
            [ 9659.075547] ---[ end trace 8f1f37f93401bf0f ]---
            l Corporation.All rights reserved. 
            Version 2.00.1201.Copyright(c) 2010 - 2012 American Megatrends,Inc. 
            Installed BIOS: SE5C600.86B.01.08.0003
            
            ------system reboot, and after reboot--------
            
            CentOS Linux 7 (Core)
            Kernel 3.10.0-862.9.1.el7_lustre.x86_64 on an x86_64
            
            soak-10 login: [  175.721668] LNet: HW NUMA nodes: 2, HW CPU cores: 32, npartitions: 2
            [  175.731983] alg: No test for adler32 (adler32-zlib)
            [  176.631197] Lustre: Lustre: Build Version: 2.11.55_65_gec2e999
            [  176.925651] LNet: Using FMR for registration
            [  176.943513] LNet: Added LNI 192.168.1.110@o2ib [8/256/0/180]
            [  177.121204] LDISKFS-fs warning (device dm-6): ldiskfs_multi_mount_protect:321: MMP interval 42 higher than expected, please wait.
            [  177.121204] 
            [  223.131715] LDISKFS-fs (dm-6): recovery complete
            [  223.137107] LDISKFS-fs (dm-6): mounted filesystem with ordered data mode. Opts: user_xattr,errors=remount-ro,user_xattr,no_mbcache,nodelalloc
            [  227.640081] Lustre: soaked-MDT0002: Not available for connect from 192.168.1.142@o2ib (not set up)
            [  230.666682] Lustre: soaked-MDT0002: Imperative Recovery enabled, recovery window shrunk from 300-900 down to 150-900
            [  230.703152] LustreError: 4067:0:(llog_osd.c:1005:llog_osd_next_block()) soaked-MDT0000-osp-MDT0002: missed desired record? 3 > 1
            [  230.716141] LustreError: 4067:0:(lod_dev.c:428:lod_sub_recovery_thread()) soaked-MDT0000-osp-MDT0002 get update log failed: rc = -2
            [  234.808297] Lustre: soaked-MDT0002: Connection restored to 192.168.1.106@o2ib (at 192.168.1.106@o2ib)
            [  237.283751] Lustre: soaked-MDT0002: Connection restored to e3d1d543-7071-4f1d-f9c5-f174c66a3f7e (at 192.168.1.143@o2ib)
            [  237.295833] Lustre: Skipped 1 previous similar message
            [  237.308240] Lustre: 4070:0:(ldlm_lib.c:2048:target_recovery_overseer()) recovery is aborted, evict exports in recovery
            [  237.320897] Lustre: soaked-MDT0002: disconnecting 28 stale clients
            [  237.327887] LustreError: 4070:0:(tgt_grant.c:248:tgt_grant_sanity_check()) mdt_obd_disconnect: tot_granted 2097152 != fo_tot_granted 4194304
            [  238.302385] Lustre: soaked-MDT0002: Connection restored to 1d05ec18-7f6a-896f-de54-063cf1f8b51c (at 192.168.1.122@o2ib)
            [  238.314575] Lustre: Skipped 1 previous similar message
            [  240.471517] Lustre: soaked-MDT0002: Connection restored to a225c1a8-609c-d769-404b-3c25112913a0 (at 192.168.1.141@o2ib)
            [  240.483695] Lustre: Skipped 7 previous similar messages
            [  244.492077] Lustre: soaked-MDT0002: Connection restored to a6836049-d02e-27ab-c254-5e76eb9cef2b (at 192.168.1.136@o2ib)
            [  244.504273] Lustre: Skipped 4 previous similar messages
            [  254.539127] Lustre: soaked-MDT0002: Connection restored to e6fd2edd-0c64-011a-f170-b8c415906b8c (at 192.168.1.124@o2ib)
            [  254.551333] Lustre: Skipped 9 previous similar messages
            [  273.870109] Lustre: soaked-MDT0002: Connection restored to 192.168.1.104@o2ib (at 192.168.1.104@o2ib)
            [  273.880570] Lustre: Skipped 7 previous similar messages
            [  292.450898] LustreError: 168-f: soaked-MDT0002: BAD WRITE CHECKSUM: from 12345-192.168.1.135@o2ib inode [0x28001e862:0x214:0x0] object 0x28001e862:532 extent [36-39]: client csum 2ff01ff, server csum 2ed01f6
            [  293.574529] LustreError: 168-f: soaked-MDT0002: BAD WRITE CHECKSUM: from 12345-192.168.1.135@o2ib inode [0x28001e862:0x214:0x0] object 0x28001e862:532 extent [36-39]: client csum 2ff01ff, server csum 2ed01f6
            [  295.414655] LustreError: 168-f: soaked-MDT0002: BAD WRITE CHECKSUM: from 12345-192.168.1.135@o2ib inode [0x28001e862:0x214:0x0] object 0x28001e862:532 extent [36-39]: client csum 2ff01ff, server csum 2ed01f6
            [  296.317843] BUG: unable to handle kernel NULL pointer dereference at 00000000000005ab
            [  296.326718] IP: [<ffffffff8c3528c3>] rb_next+0x23/0x50
            [  296.332536] PGD 0 
            [  296.334823] Oops: 0000 [#1] SMP 
            [  296.341568] Modules linked in: osp(OE) mdd(OE) lod(OE) mdt(OE) lfsck(OE) mgc(OE) osd_ldiskfs(OE) ldiskfs(OE) lquota(OE) fid(OE) fld(OE) ko2iblnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) libcfs(OE) rpcsec_gss_krb5 nfsv4 dns_resolver nfs lockd grace fscache rdma_ucm(OE) ib_ucm(OE) rdma_cm(OE) iw_cm(OE) ib_ipoib(OE) ib_cm(OE) ib_uverbs(OE) ib_umad(OE) mlx5_ib(OE) mlx5_core(OE) mlxfw(OE) mlx4_en(OE) dm_round_robin zfs(POE) zunicode(POE) zavl(POE) icp(POE) zcommon(POE) znvpair(POE) spl(OE) sb_edac intel_powerclamp coretemp intel_rapl iosf_mbi kvm irqbypass crc32_pclmul ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper ablk_helper cryptd iTCO_wdt iTCO_vendor_support joydev ipmi_ssif pcspkr mei_me sg mei i2c_i801 lpc_ich ipmi_si ipmi_devintf ipmi_msghandler wmi ioatdma shpchp dm_multipath dm_mod auth_rpcgss sunrpc ip_tables ext4 mbcache jbd2 sd_mod crc_t10dif crct10dif_generic mlx4_ib(OE) ib_core(OE) mgag200 drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops igb ttm ptp mlx4_core(OE) drm isci ahci mpt3sas pps_core libsas crct10dif_pclmul libahci devlink dca crct10dif_common i2c_algo_bit raid_class crc32c_intel libata i2c_core mlx_compat(OE) scsi_transport_sas
            [  296.490014] CPU: 10 PID: 3968 Comm: mdt_out01_000 Tainted: P           OE  ------------   3.10.0-862.9.1.el7_lustre.x86_64 #1
            [  296.505712] Hardware name: Intel Corporation S2600GZ ........../S2600GZ, BIOS SE5C600.86B.01.08.0003.022620131521 02/26/2013
            [  296.522014] task: ffff9193e349af70 ti: ffff9190a81f4000 task.ti: ffff9190a81f4000
            [  296.534051] RIP: 0010:[<ffffffff8c3528c3>]  [<ffffffff8c3528c3>] rb_next+0x23/0x50
            [  296.545986] RSP: 0018:ffff9190a81f7a40  EFLAGS: 00010202
            [  296.555563] RAX: 000000000000059b RBX: ffff91981aea8d90 RCX: 0000000000000000
            [  296.567341] RDX: 000000000000059b RSI: ffff919815fb2d5e RDI: ffff919815fb2d28
            [  296.578836] RBP: ffff9190a81f7a40 R08: 000000000e800157 R09: 0000000000000004
            [  296.590438] R10: ffff91982ae9c500 R11: ffff91982ae9c500 R12: ffff91980d991900
            [  296.601802] R13: ffff91981aea8d90 R14: ffff91981f763000 R15: ffff91980df68008
            [  296.613303] FS:  0000000000000000(0000) GS:ffff91982d880000(0000) knlGS:0000000000000000
            [  296.625676] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
            [  296.634865] CR2: 00000000000005ab CR3: 00000004ebe0e000 CR4: 00000000000607e0
            [  296.645657] Call Trace:
            [  296.651676]  [<ffffffffc162cd44>] ldiskfs_readdir+0x5b4/0x850 [ldiskfs]
            [  296.662343]  [<ffffffffc0944ef2>] ? fld_local_lookup+0x62/0x270 [fld]
            [  296.672780]  [<ffffffffc16ae7b0>] ? osd_object_alloc+0x360/0x360 [osd_ldiskfs]
            [  296.684053]  [<ffffffffc16acc3e>] osd_ldiskfs_it_fill+0xbe/0x260 [osd_ldiskfs]
            [  296.695182]  [<ffffffffc16acfa6>] osd_it_ea_next+0xc6/0x150 [osd_ldiskfs]
            [  296.705729]  [<ffffffffc1184ae8>] dt_index_page_build+0x1a8/0x470 [obdclass]
            [  296.716658]  [<ffffffffc11846b0>] dt_index_walk+0x1a0/0x430 [obdclass]
            [  296.726761]  [<ffffffffc1184940>] ? dt_index_walk+0x430/0x430 [obdclass]
            [  296.737149]  [<ffffffffc1185a14>] dt_index_read+0x394/0x6a0 [obdclass]
            [  296.747492]  [<ffffffffc141cd22>] tgt_obd_idx_read+0x612/0x860 [ptlrpc]
            [  296.757860]  [<ffffffffc141ff3a>] tgt_request_handle+0xaea/0x1580 [ptlrpc]
            [  296.768476]  [<ffffffffc13fba61>] ? ptlrpc_nrs_req_get_nolock0+0xd1/0x170 [ptlrpc]
            [  296.779775]  [<ffffffffc1038bde>] ? ktime_get_real_seconds+0xe/0x10 [libcfs]
            [  296.790527]  [<ffffffffc13c6acb>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
            [  296.801935]  [<ffffffffc13c3955>] ? ptlrpc_wait_event+0xa5/0x360 [ptlrpc]
            [  296.812275]  [<ffffffff8c0cf682>] ? default_wake_function+0x12/0x20
            [  296.822010]  [<ffffffff8c0c52ab>] ? __wake_up_common+0x5b/0x90
            [  296.831226]  [<ffffffffc13ca2ec>] ptlrpc_main+0xafc/0x1fb0 [ptlrpc]
            [  296.840865]  [<ffffffffc13c97f0>] ? ptlrpc_register_service+0xe90/0xe90 [ptlrpc]
            [  296.851675]  [<ffffffff8c0bb621>] kthread+0xd1/0xe0
            [  296.859622]  [<ffffffff8c0bb550>] ? insert_kthread_work+0x40/0x40
            [  296.868941]  [<ffffffff8c7205f7>] ret_from_fork_nospec_begin+0x21/0x21
            [  296.878704]  [<ffffffff8c0bb550>] ? insert_kthread_work+0x40/0x40
            [  296.888023] Code: c0 5d c3 0f 1f 44 00 00 55 48 8b 17 48 89 e5 48 39 d7 74 3b 48 8b 47 08 48 85 c0 75 0e eb 25 66 0f 1f 84 00 00 00 00 00 48 89 d0 <48> 8b 50 10 48 85 d2 75 f4 5d c3 66 90 48 3b 78 08 75 f6 48 8b 
            [  296.915195] RIP  [<ffffffff8c3528c3>] rb_next+0x23/0x50
            [  296.923524]  RSP <ffff9190a81f7a40>
            [  296.929891] CR2: 00000000000005ab
            [  296.935844] ---[ end trace 7f7dc5e5140c0c8b ]---
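
            To compare the call chains of the traces above, it can help to reduce each oops to its list of frames. In these traces a leading "?" marks a frame the kernel unwinder found on the stack but could not confirm as part of the call chain. The following is a minimal Python sketch; the regular expression is written for the 3.10-era trace format shown here, and the names parse_frames and reliable_functions are illustrative, not an existing tool.

```python
import re

# Matches lines such as:
#   [  296.651676]  [<ffffffffc162cd44>] ldiskfs_readdir+0x5b4/0x850 [ldiskfs]
# Anchoring on the timestamp skips the "RIP:" and "IP:" lines.
FRAME_RE = re.compile(
    r"^\s*\[\s*\d+\.\d+\]\s+(?:<\w+>\s+)?\[<[0-9a-f]+>\]\s+"
    r"(\?\s+)?([\w.]+)\+0x[0-9a-f]+/0x[0-9a-f]+(?:\s+\[(\w+)\])?"
)


def parse_frames(log_text):
    """Return (function, module_or_None, reliable) for each trace frame."""
    frames = []
    for line in log_text.splitlines():
        m = FRAME_RE.match(line)
        if m:
            frames.append((m.group(2), m.group(3), m.group(1) is None))
    return frames


def reliable_functions(log_text):
    """Function names of the frames that are not marked with '?'."""
    return [name for name, _mod, ok in parse_frames(log_text) if ok]


sample = """\
[  296.534051] RIP: 0010:[<ffffffff8c3528c3>]  [<ffffffff8c3528c3>] rb_next+0x23/0x50
[  296.651676]  [<ffffffffc162cd44>] ldiskfs_readdir+0x5b4/0x850 [ldiskfs]
[  296.662343]  [<ffffffffc0944ef2>] ? fld_local_lookup+0x62/0x270 [fld]
[  296.695182]  [<ffffffffc16acfa6>] osd_it_ea_next+0xc6/0x150 [osd_ldiskfs]
[  296.859622]  [<ffffffff8c0bb621>] kthread+0xd1/0xe0
"""

print(reliable_functions(sample))
```

            Applied to the traces in this ticket, this makes it easier to see that the crashes share the osd_ldiskfs_it_fill path (reached via osd_it_ea_load or osd_it_ea_next) even though the request handlers above it differ.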
            
            
            
extent [36-39]: client csum 2ff01ff, server csum 2ed01f6 [ 295.414655] LustreError: 168-f: soaked-MDT0002: BAD WRITE CHECKSUM: from 12345-192.168.1.135@o2ib inode [0x28001e862:0x214:0x0] object 0x28001e862:532 extent [36-39]: client csum 2ff01ff, server csum 2ed01f6 [ 296.317843] BUG: unable to handle kernel NULL pointer dereference at 00000000000005ab [ 296.326718] IP: [<ffffffff8c3528c3>] rb_next+0x23/0x50 [ 296.332536] PGD 0 [ 296.334823] Oops: 0000 [#1] SMP [ 296.341568] Modules linked in: osp(OE) mdd(OE) lod(OE) mdt(OE) lfsck(OE) mgc(OE) osd_ldiskfs(OE) ldiskfs(OE) lquota(OE) fid(OE) fld(OE) ko2iblnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) libcfs(OE) rpcsec_gss_krb5 nfsv4 dns_resolver nfs lockd grace fscache rdma_ucm(OE) ib_ucm(OE) rdma_cm(OE) iw_cm(OE) ib_ipoib(OE) ib_cm(OE) ib_uverbs(OE) ib_umad(OE) mlx5_ib(OE) mlx5_core(OE) mlxfw(OE) mlx4_en(OE) dm_round_robin zfs(POE) zunicode(POE) zavl(POE) icp(POE) zcommon(POE) znvpair(POE) spl(OE) sb_edac intel_powerclamp coretemp intel_rapl iosf_mbi kvm irqbypass crc32_pclmul ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper ablk_helper cryptd iTCO_wdt iTCO_vendor_support joydev ipmi_ssif pcspkr mei_me sg mei i2c_i801 lpc_ich ipmi_si ipmi_devintf ipmi_msghandler wmi ioatdma shpchp dm_multipath dm_mod auth_rpcgss sunrpc ip_tables ext4 mbcache jbd2 sd_mod crc_t10dif crct10dif_generic mlx4_ib(OE) ib_core(OE) mgag200 drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops igb ttm ptp mlx4_core(OE) drm isci ahci mpt3sas pps_core libsas crct10dif_pclmul libahci devlink dca crct10dif_common i2c_algo_bit raid_class crc32c_intel libata i2c_core mlx_compat(OE) scsi_transport_sas [ 296.490014] CPU: 10 PID: 3968 Comm: mdt_out01_000 Tainted: P OE ------------ 3.10.0-862.9.1.el7_lustre.x86_64 #1 [ 296.505712] Hardware name: Intel Corporation S2600GZ ........../S2600GZ, BIOS SE5C600.86B.01.08.0003.022620131521 02/26/2013 [ 296.522014] task: ffff9193e349af70 ti: ffff9190a81f4000 task.ti: ffff9190a81f4000 [ 296.534051] RIP: 
0010:[<ffffffff8c3528c3>] [<ffffffff8c3528c3>] rb_next+0x23/0x50 [ 296.545986] RSP: 0018:ffff9190a81f7a40 EFLAGS: 00010202 [ 296.555563] RAX: 000000000000059b RBX: ffff91981aea8d90 RCX: 0000000000000000 [ 296.567341] RDX: 000000000000059b RSI: ffff919815fb2d5e RDI: ffff919815fb2d28 [ 296.578836] RBP: ffff9190a81f7a40 R08: 000000000e800157 R09: 0000000000000004 [ 296.590438] R10: ffff91982ae9c500 R11: ffff91982ae9c500 R12: ffff91980d991900 [ 296.601802] R13: ffff91981aea8d90 R14: ffff91981f763000 R15: ffff91980df68008 [ 296.613303] FS: 0000000000000000(0000) GS:ffff91982d880000(0000) knlGS:0000000000000000 [ 296.625676] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 296.634865] CR2: 00000000000005ab CR3: 00000004ebe0e000 CR4: 00000000000607e0 [ 296.645657] Call Trace: [ 296.651676] [<ffffffffc162cd44>] ldiskfs_readdir+0x5b4/0x850 [ldiskfs] [ 296.662343] [<ffffffffc0944ef2>] ? fld_local_lookup+0x62/0x270 [fld] [ 296.672780] [<ffffffffc16ae7b0>] ? osd_object_alloc+0x360/0x360 [osd_ldiskfs] [ 296.684053] [<ffffffffc16acc3e>] osd_ldiskfs_it_fill+0xbe/0x260 [osd_ldiskfs] [ 296.695182] [<ffffffffc16acfa6>] osd_it_ea_next+0xc6/0x150 [osd_ldiskfs] [ 296.705729] [<ffffffffc1184ae8>] dt_index_page_build+0x1a8/0x470 [obdclass] [ 296.716658] [<ffffffffc11846b0>] dt_index_walk+0x1a0/0x430 [obdclass] [ 296.726761] [<ffffffffc1184940>] ? dt_index_walk+0x430/0x430 [obdclass] [ 296.737149] [<ffffffffc1185a14>] dt_index_read+0x394/0x6a0 [obdclass] [ 296.747492] [<ffffffffc141cd22>] tgt_obd_idx_read+0x612/0x860 [ptlrpc] [ 296.757860] [<ffffffffc141ff3a>] tgt_request_handle+0xaea/0x1580 [ptlrpc] [ 296.768476] [<ffffffffc13fba61>] ? ptlrpc_nrs_req_get_nolock0+0xd1/0x170 [ptlrpc] [ 296.779775] [<ffffffffc1038bde>] ? ktime_get_real_seconds+0xe/0x10 [libcfs] [ 296.790527] [<ffffffffc13c6acb>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc] [ 296.801935] [<ffffffffc13c3955>] ? ptlrpc_wait_event+0xa5/0x360 [ptlrpc] [ 296.812275] [<ffffffff8c0cf682>] ? 
default_wake_function+0x12/0x20 [ 296.822010] [<ffffffff8c0c52ab>] ? __wake_up_common+0x5b/0x90 [ 296.831226] [<ffffffffc13ca2ec>] ptlrpc_main+0xafc/0x1fb0 [ptlrpc] [ 296.840865] [<ffffffffc13c97f0>] ? ptlrpc_register_service+0xe90/0xe90 [ptlrpc] [ 296.851675] [<ffffffff8c0bb621>] kthread+0xd1/0xe0 [ 296.859622] [<ffffffff8c0bb550>] ? insert_kthread_work+0x40/0x40 [ 296.868941] [<ffffffff8c7205f7>] ret_from_fork_nospec_begin+0x21/0x21 [ 296.878704] [<ffffffff8c0bb550>] ? insert_kthread_work+0x40/0x40 [ 296.888023] Code: c0 5d c3 0f 1f 44 00 00 55 48 8b 17 48 89 e5 48 39 d7 74 3b 48 8b 47 08 48 85 c0 75 0e eb 25 66 0f 1f 84 00 00 00 00 00 48 89 d0 <48> 8b 50 10 48 85 d2 75 f4 5d c3 66 90 48 3b 78 08 75 f6 48 8b [ 296.915195] RIP [<ffffffff8c3528c3>] rb_next+0x23/0x50 [ 296.923524] RSP <ffff9190a81f7a40> [ 296.929891] CR2: 00000000000005ab [ 296.935844] ---[ end trace 7f7dc5e5140c0c8b ]---
            laisiyao Lai Siyao added a comment -

            Sarah, did you run the dir migration test as part of the soak test? And how many times have you seen this crash?

            pjones Peter Jones added a comment -

            Lai

            Does this appear to be related to recent changes under LU-4684?

            Peter


            People

              ys Yang Sheng
              sarah Sarah Liu