Uploaded image for project: 'Lustre'
  1. Lustre
  2. LU-11457

osd_oi_insert(): the FID is used by two objects

Details

    • Bug
    • Resolution: Fixed
    • Critical
    • Lustre 2.16.0
    • Lustre 2.12.0, Lustre 2.12.2, Lustre 2.12.4, Lustre 2.12.5
    • 3
    • 9223372036854775807

    Description

      tag-2.11.55

      MDS crash

      [15650.670434] device-mapper: multipath: Failing path 8:96.^M
      [15650.765276] BUG: unable to handle kernel NULL pointer dereference at           (null)^M
      [15650.775741] IP: [<          (null)>]           (null)^M
      [15650.783081] PGD 0 ^M
      [15650.786948] Oops: 0010 [#1] SMP ^M
      [15650.792218] Modules linked in: osp(OE) mdd(OE) lod(OE) mdt(OE) lfsck(OE) mgc(OE) osd_ldiskfs(OE) ldiskfs(OE) lquota(OE) lustre(OE) lmv(OE) mdc(OE) osc(OE) lov(OE) fid(OE) fld(OE) ko2iblnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) libcfs(OE) rpcsec_gss_krb5 nfsv4 dns_resolver nfs lockd grace fscache rdma_ucm(OE) ib_ucm(OE) rdma_cm(OE) iw_cm(OE) ib_ipoib(OE) ib_cm(OE) ib_uverbs(OE) ib_umad(OE) mlx5_ib(OE) mlx5_core(OE) mlxfw(OE) mlx4_en(OE) dm_round_robin zfs(POE) zunicode(POE) zavl(POE) icp(POE) zcommon(POE) znvpair(POE) spl(OE) sb_edac intel_powerclamp coretemp intel_rapl iosf_mbi kvm irqbypass crc32_pclmul ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper ablk_helper cryptd iTCO_wdt joydev pcspkr ipmi_ssif iTCO_vendor_support sg ipmi_si ipmi_devintf shpchp ipmi_msghandler i2c_i801 mei_me ioatdma mei lpc_ich wmi dm_multipath dm_mod auth_rpcgss sunrpc ip_tables ext4 mbcache jbd2 sd_mod crc_t10dif crct10dif_generic mlx4_ib(OE) ib_core(OE) mgag200 drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm drm igb isci ahci ptp mlx4_core(OE) mpt3sas libsas libahci pps_core dca crct10dif_pclmul devlink i2c_algo_bit crct10dif_common raid_class crc32c_intel libata i2c_core mlx_compat(OE) scsi_transport_sas^M
      [15650.934649] CPU: 14 PID: 9491 Comm: mdt_rdpg01_008 Tainted: P           OE  ------------   3.10.0-862.9.1.el7_lustre.x86_64 #1^M
      [15650.952002] Hardware name: Intel Corporation S2600GZ ........../S2600GZ, BIOS SE5C600.86B.01.08.0003.022620131521 02/26/2013^M
      [15650.966961] task: ffff8be6a1253f40 ti: ffff8be68a044000 task.ti: ffff8be68a044000^M
      [15650.977791] RIP: 0010:[<0000000000000000>]  [<          (null)>]           (null)^M
      [15650.988693] RSP: 0018:ffff8be68a047b58  EFLAGS: 00010246^M
      [15650.997167] RAX: 0000000000000000 RBX: ffff8be68b820000 RCX: 0000000000000002^M
      [15651.007733] RDX: ffffffffc164c7b0 RSI: ffff8be68a047b60 RDI: ffff8be68b820008^M
      [15651.018326] RBP: ffff8be68a047b98 R08: 0000000000000004 R09: 0000000000000000^M
      [15651.028930] R10: 0000000000000001 R11: 00000000007fffff R12: ffff8be26f9fab00^M
      [15651.039547] R13: ffff8be279a448a0 R14: ffff8be68a160000 R15: ffff8be68b820008^M
      [15651.050168] FS:  0000000000000000(0000) GS:ffff8be6ad980000(0000) knlGS:0000000000000000^M
      [15651.061880] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033^M
      [15651.070980] CR2: 0000000000000000 CR3: 000000042c3b6000 CR4: 00000000000607e0^M
      [15651.081660] Call Trace:^M
      [15651.087091]  [<ffffffffc164ac3e>] ? osd_ldiskfs_it_fill+0xbe/0x260 [osd_ldiskfs]^M
      [15651.098058]  [<ffffffffc164ae17>] osd_it_ea_load+0x37/0x100 [osd_ldiskfs]^M
      [15651.108370]  [<ffffffffc188eb47>] lod_it_load+0x27/0x90 [lod]^M
      [15651.117554]  [<ffffffffc0f48808>] dt_index_walk+0xf8/0x430 [obdclass]^M
      [15651.127457]  [<ffffffffc1915080>] ? mdd_object_lock+0xe0/0xe0 [mdd]^M
      [15651.137132]  [<ffffffffc1916d9f>] mdd_readpage+0x25f/0x5a0 [mdd]^M
      [15651.146553]  [<ffffffffc1782bda>] mdt_readpage+0x63a/0x880 [mdt]^M
      [15651.155992]  [<ffffffffc11e82ca>] tgt_request_handle+0xaea/0x1580 [ptlrpc]^M
      [15651.166379]  [<ffffffffc11c02e1>] ? ptlrpc_nrs_req_get_nolock0+0xd1/0x170 [ptlrpc]^M
      [15651.177493]  [<ffffffffc0dfcbde>] ? ktime_get_real_seconds+0xe/0x10 [libcfs]^M
      [15651.188033]  [<ffffffffc118b48b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]^M
      [15651.199251]  [<ffffffffc1188315>] ? ptlrpc_wait_event+0xa5/0x360 [ptlrpc]^M
      [15651.209399]  [<ffffffff83ccf682>] ? default_wake_function+0x12/0x20^M
      [15651.218931]  [<ffffffff83cc52ab>] ? __wake_up_common+0x5b/0x90^M
      [15651.228026]  [<ffffffffc118ecc4>] ptlrpc_main+0xb14/0x1fb0 [ptlrpc]^M
      [15651.237575]  [<ffffffffc118e1b0>] ? ptlrpc_register_service+0xe90/0xe90 [ptlrpc]^M
      [15651.248365]  [<ffffffff83cbb621>] kthread+0xd1/0xe0^M
      [15651.256344]  [<ffffffff83cbb550>] ? insert_kthread_work+0x40/0x40^M
      [15651.265688]  [<ffffffff843205f7>] ret_from_fork_nospec_begin+0x21/0x21^M
      [15651.275475]  [<ffffffff83cbb550>] ? insert_kthread_work+0x40/0x40^M
      [15651.284736] Code:  Bad RIP value.^M
      [15651.290946] RIP  [<          (null)>]           (null)^M
      [15651.299236]  RSP <ffff8be68a047b58>^M
      [15651.305543] CR2: 0000000000000000^M
      [15651.315778] ---[ end trace 4ae4238c00f9aeec ]---^M
      [15651.336386] Kernel panic - not syncing: Fatal exception^M
      [15651.344613] Kernel Offset: 0x2c00000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)^M
      [15651.369289] ------------[ cut here ]------------^M
      [15651.376397] WARNING: CPU: 14 PID: 9491 at arch/x86/kernel/smp.c:127 native_smp_send_reschedule+0x65/0x70^M
      [15651.388915] Modules linked in: osp(OE) mdd(OE) lod(OE) mdt(OE) lfsck(OE) mgc(OE) osd_ldiskfs(OE) ldiskfs(OE) lquota(OE) lustre(OE) lmv(OE) mdc(OE) osc(OE) lov(OE) fid(OE) fld(OE) ko2iblnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) libcfs(OE) rpcsec_gss_krb5 nfsv4 dns_resolver nfs lockd grace fscache rdma_ucm(OE) ib_ucm(OE) rdma_cm(OE) iw_cm(OE) ib_ipoib(OE) ib_cm(OE) ib_uverbs(OE) ib_umad(OE) mlx5_ib(OE) mlx5_core(OE) mlxfw(OE) mlx4_en(OE) dm_round_robin zfs(POE) zunicode(POE) zavl(POE) icp(POE) zcommon(POE) znvpair(POE) spl(OE) sb_edac intel_powerclamp coretemp intel_rapl iosf_mbi kvm irqbypass crc32_pclmul ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper ablk_helper cryptd iTCO_wdt joydev pcspkr ipmi_ssif iTCO_vendor_support sg ipmi_si ipmi_devintf shpchp ipmi_msghandler i2c_i801 mei_me ioatdma mei lpc_ich wmi dm_multipath dm_mod auth_rpcgss sunrpc ip_tables ext4 mbcache jbd2 sd_mod crc_t10dif crct10dif_generic mlx4_ib(OE) ib_core(OE) mgag200 drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm drm igb isci ahci ptp mlx4_core(OE) mpt3sas libsas libahci pps_core dca crct10dif_pclmul devlink i2c_algo_bit crct10dif_common raid_class crc32c_intel libata i2c_core mlx_compat(OE) scsi_transport_sas^M
      [15651.529620] CPU: 14 PID: 9491 Comm: mdt_rdpg01_008 Tainted: P      D    OE  ------------   3.10.0-862.9.1.el7_lustre.x86_64 #1^M
      [15651.546472] Hardware name: Intel Corporation S2600GZ ........../S2600GZ, BIOS SE5C600.86B.01.08.0003.022620131521 02/26/2013^M
      [15651.561156] Call Trace:^M
      [15651.566023]  <IRQ>  [<ffffffff8430e84e>] dump_stack+0x19/0x1b^M
      [15651.574646]  [<ffffffff83c91e18>] __warn+0xd8/0x100^M
      [15651.582224]  [<ffffffff83c91f5d>] warn_slowpath_null+0x1d/0x20^M
      [15651.590851]  [<ffffffff83c54e95>] native_smp_send_reschedule+0x65/0x70^M
      [15651.600279]  [<ffffffff83cddf81>] trigger_load_balance+0x191/0x280^M
      [15651.609280]  [<ffffffff83ccdc0a>] scheduler_tick+0x10a/0x150^M
      [15651.617702]  [<ffffffff83d01c10>] ? tick_sched_do_timer+0x50/0x50^M
      [15651.626619]  [<ffffffff83ca4f65>] update_process_times+0x65/0x80^M
      [15651.635416]  [<ffffffff83d01a10>] tick_sched_handle+0x30/0x70^M
      [15651.643916]  [<ffffffff83d01c49>] tick_sched_timer+0x39/0x80^M
      [15651.652315]  [<ffffffff83cbf7e6>] __hrtimer_run_queues+0xd6/0x260^M
      [15651.661210]  [<ffffffff83cbfd7f>] hrtimer_interrupt+0xaf/0x1d0^M
      [15651.669814]  [<ffffffff83c5847b>] local_apic_timer_interrupt+0x3b/0x60^M
      [15651.679184]  [<ffffffff84325063>] smp_apic_timer_interrupt+0x43/0x60^M
      [15651.688352]  [<ffffffff843217b2>] apic_timer_interrupt+0x162/0x170^M
      [15651.697316]  <EOI>  [<ffffffff84308c3d>] ? panic+0x1d5/0x21f^M
      [15651.705715]  [<ffffffff84308ba1>] ? panic+0x139/0x21f^M
      [15651.713430]  [<ffffffff84318745>] oops_end+0xc5/0xe0^M
      [15651.721020]  [<ffffffff8430807e>] no_context+0x285/0x2a8^M
      [15651.728984]  [<ffffffff84308115>] __bad_area_nosemaphore+0x74/0x1d1^M
      [15651.738014]  [<ffffffff84308286>] bad_area_nosemaphore+0x14/0x16^M
      [15651.746760]  [<ffffffff8431b6e0>] __do_page_fault+0x330/0x4f0^M
      [15651.755199]  [<ffffffff8431b8d5>] do_page_fault+0x35/0x90^M
      [15651.763264]  [<ffffffff84317758>] page_fault+0x28/0x30^M
      [15651.771013]  [<ffffffffc164c7b0>] ? osd_object_alloc+0x360/0x360 [osd_ldiskfs]^M
      [15651.781105]  [<ffffffffc164ac3e>] ? osd_ldiskfs_it_fill+0xbe/0x260 [osd_ldiskfs]^M
      [15651.791402]  [<ffffffffc164ae17>] osd_it_ea_load+0x37/0x100 [osd_ldiskfs]^M
      [15651.801028]  [<ffffffffc188eb47>] lod_it_load+0x27/0x90 [lod]^M
      [15651.809517]  [<ffffffffc0f48808>] dt_index_walk+0xf8/0x430 [obdclass]^M
      [15651.818761]  [<ffffffffc1915080>] ? mdd_object_lock+0xe0/0xe0 [mdd]^M
      [15651.827808]  [<ffffffffc1916d9f>] mdd_readpage+0x25f/0x5a0 [mdd]^M
      [15651.836533]  [<ffffffffc1782bda>] mdt_readpage+0x63a/0x880 [mdt]^M
      [15651.845269]  [<ffffffffc11e82ca>] tgt_request_handle+0xaea/0x1580 [ptlrpc]^M
      [15651.854937]  [<ffffffffc11c02e1>] ? ptlrpc_nrs_req_get_nolock0+0xd1/0x170 [ptlrpc]^M
      [15651.865302]  [<ffffffffc0dfcbde>] ? ktime_get_real_seconds+0xe/0x10 [libcfs]^M
      [15651.875084]  [<ffffffffc118b48b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]^M
      [15651.885522]  [<ffffffffc1188315>] ? ptlrpc_wait_event+0xa5/0x360 [ptlrpc]^M
      [15651.894881]  [<ffffffff83ccf682>] ? default_wake_function+0x12/0x20^M
      [15651.903621]  [<ffffffff83cc52ab>] ? __wake_up_common+0x5b/0x90^M
      [15651.911880]  [<ffffffffc118ecc4>] ptlrpc_main+0xb14/0x1fb0 [ptlrpc]^M
      [15651.920585]  [<ffffffffc118e1b0>] ? ptlrpc_register_service+0xe90/0xe90 [ptlrpc]^M
      [15651.930460]  [<ffffffff83cbb621>] kthread+0xd1/0xe0^M
      [15651.937468]  [<ffffffff83cbb550>] ? insert_kthread_work+0x40/0x40^M
      [15651.945794]  [<ffffffff843205f7>] ret_from_fork_nospec_begin+0x21/0x21^M
      [15651.954564]  [<ffffffff83cbb550>] ? insert_kthread_work+0x40/0x40^M
      [15651.962806] ---[ end trace 4ae4238c00f9aeed ]---^M
      
      

      Attachments

        Issue Links

          Activity

            [LU-11457] osd_oi_insert(): the FID is used by two objects
            lixi_wc Li Xi added a comment -

            We checked the inodes via debugfs and they showed that the inodes did not have the same FID in the LMA, and in fact two of the inodes were completely unused, so it isn't clear why this error was being printed at all:

            I am not sure whether this related, but the osd_add_dot_dotdot_internal() seems save the wrong child fid to "..".

            lixi_wc Li Xi added a comment - We checked the inodes via debugfs and they showed that the inodes did not have the same FID in the LMA, and in fact two of the inodes were completely unused, so it isn't clear why this error was being printed at all: I am not sure whether this related, but the osd_add_dot_dotdot_internal() seems save the wrong child fid to "..".
            sarah Sarah Liu added a comment -

            soak hit the problem on 2.12.5 testing after running for 4 days.

            [18137.800259] Pid: 5007, comm: mdt01_017 3.10.0-1062.18.1.el7_lustre.x86_64 #1 SMP Wed May 27 23:31:53 UTC 2020
            [18137.811351] Call Trace:
            [18137.814107]  [<ffffffffc107f8fc>] top_trans_wait_result+0xa9/0x158 [ptlrpc]
            [18137.822008]  [<ffffffffc105fb79>] top_trans_stop+0x4e9/0xa70 [ptlrpc]
            [18137.829352]  [<ffffffffc1522dec>] lod_trans_stop+0x25c/0x340 [lod]
            [18137.836307]  [<ffffffffc15d56ee>] mdd_trans_stop+0x2e/0x174 [mdd]
            [18137.843150]  [<ffffffffc15bdc73>] mdd_unlink+0x3f3/0xaa0 [mdd]
            [18137.849700]  [<ffffffffc148139d>] mdo_unlink+0x46/0x48 [mdt]
            [18137.856087]  [<ffffffffc14452bd>] mdt_reint_unlink+0xbed/0x14b0 [mdt]
            [18137.863361]  [<ffffffffc1449963>] mdt_reint_rec+0x83/0x210 [mdt]
            [18137.870109]  [<ffffffffc1426273>] mdt_reint_internal+0x6e3/0xaf0 [mdt]
            [18137.877464]  [<ffffffffc14316e7>] mdt_reint+0x67/0x140 [mdt]
            [18137.883828]  [<ffffffffc104d9da>] tgt_request_handle+0xada/0x1570 [ptlrpc]
            [18137.891641]  [<ffffffffc0ff248b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
            [18137.900298]  [<ffffffffc0ff5df4>] ptlrpc_main+0xb34/0x1470 [ptlrpc]
            [18137.907411]  [<ffffffff9f2c6321>] kthread+0xd1/0xe0
            [18137.912887]  [<ffffffff9f98ed37>] ret_from_fork_nospec_end+0x0/0x39
            [18137.919936]  [<ffffffffffffffff>] 0xffffffffffffffff
            [18137.925528] LNet: Service thread pid 4804 was inactive for 202.30s. Watchdog stack traces are limited to 3 per 300 seconds, skipping this one.
            [18152.716405] LNet: 4610:0:(o2iblnd_cb.c:3397:kiblnd_check_conns()) Timed out tx for 192.168.1.111@o2ib: 0 seconds
            [18152.727780] LNet: 4610:0:(o2iblnd_cb.c:3397:kiblnd_check_conns()) Skipped 111 previous similar messages
            [18210.718946] Lustre: 4675:0:(client.c:2133:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1591370247/real 1591370247]  req@ffff8b87f443ec00 x1668654025156672/t0(0) o105->soaked-MDT0001@192.168.1.111@o2ib:15/16 lens 304/224 e 0 to 1 dl 1591370254 ref 2 fl Rpc:X/2/ffffffff rc 0/-1
            [18210.749830] Lustre: 4675:0:(client.c:2133:ptlrpc_expire_one_request()) Skipped 110 previous similar messages
            [18232.719942] LNet: 4610:0:(o2iblnd_cb.c:3397:kiblnd_check_conns()) Timed out tx for 192.168.1.111@o2ib: 0 seconds
            [18232.731320] LNet: 4610:0:(o2iblnd_cb.c:3397:kiblnd_check_conns()) Skipped 91 previous similar messages
            [18235.829192] Lustre: soaked-MDT0001: Received new LWP connection from 192.168.1.108@o2ib, removing former export from same NID
            [18235.841908] Lustre: soaked-MDT0001: Connection restored to soaked-MDT0000-mdtlov_UUID (at 192.168.1.108@o2ib)
            [18235.853001] Lustre: Skipped 1 previous similar message
            [18236.004566] LustreError: 4796:0:(osd_oi.c:762:osd_oi_insert()) dm-1: the FID [0x2400134c5:0x8001:0x0] is used by two objects: 477626398/123258462 477626399/123258463
            [18236.053964] LNet: 4615:0:(o2iblnd_cb.c:413:kiblnd_handle_rx()) PUT_NACK from 192.168.1.108@o2ib
            [18236.794778] BUG: unable to handle kernel NULL pointer dereference at           (null)
            [18236.803570] IP: [<          (null)>]           (null)
            [18236.809248] PGD 0 
            [18236.811519] Oops: 0010 [#1] SMP 
            [18236.815153] Modules linked in: osp(OE) mdd(OE) lod(OE) mdt(OE) lfsck(OE) mgc(OE) osd_ldiskfs(OE) ldiskfs(OE) lquota(OE) fid(OE) fld(OE) ko2iblnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) libcfs(OE) rpcsec_gss_krb5 nfsv4 dns_resolver nfs lockd grace fscache rdma_ucm(OE) ib_ucm(OE) rdma_cm(OE) iw_cm(OE) ib_ipoib(OE) ib_cm(OE) ib_umad(OE) mlx5_ib(OE) ib_uverbs(OE) mlx5_core(OE) mlxfw(OE) mlx4_en(OE) sb_edac intel_powerclamp coretemp intel_rapl iosf_mbi kvm irqbypass crc32_pclmul ghash_clmulni_intel iTCO_wdt iTCO_vendor_support aesni_intel lrw gf128mul glue_helper ipmi_ssif ablk_helper dm_round_robin mei_me cryptd mei sg ioatdma ipmi_si i2c_i801 lpc_ich ipmi_devintf joydev ipmi_msghandler pcspkr pcc_cpufreq wmi dm_multipath dm_mod auth_rpcgss sunrpc ip_tables ext4 mbcache jbd2 mlx4_ib(OE) ib_core(OE) sd_mod crc_t10dif crct10dif_generic mgag200 drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops isci ttm igb mlx4_core(OE) ahci libsas libahci ptp mpt2sas drm devlink pps_core crct10dif_pclmul raid_class crct10dif_common dca libata crc32c_intel mlx_compat(OE) scsi_transport_sas drm_panel_orientation_quirks i2c_algo_bit
            [18236.927648] CPU: 31 PID: 4694 Comm: mdt_rdpg01_001 Kdump: loaded Tainted: G           OE  ------------   3.10.0-1062.18.1.el7_lustre.x86_64 #1
            [18236.941923] Hardware name: Intel Corporation S2600GZ ........../S2600GZ, BIOS SE5C600.86B.01.08.0003.022620131521 02/26/2013
            [18236.954462] task: ffff8b8807dae2a0 ti: ffff8b88072f0000 task.ti: ffff8b88072f0000
            [18236.962824] RIP: 0010:[<0000000000000000>]  [<          (null)>]           (null)
            [18236.971207] RSP: 0018:ffff8b88072f3b68  EFLAGS: 00010246
            [18236.977142] RAX: 0000000000000000 RBX: ffff8b87ff710000 RCX: 0000000000000002
            [18236.985109] RDX: ffffffffc12e2ab0 RSI: ffff8b88072f3b70 RDI: ffff8b87ff710008
            [18236.993080] RBP: ffff8b88072f3ba8 R08: 0000000000000000 R09: 0000000000000001
            [18237.001053] R10: ffff8b842dbb5800 R11: 00000000007fffff R12: ffff8b87fef0f000
            [18237.009011] R13: ffff8b87eb34f628 R14: ffff8b88283e6800 R15: ffff8b87ff710008
            [18237.016993] FS:  0000000000000000(0000) GS:ffff8b882e5c0000(0000) knlGS:0000000000000000
            [18237.026029] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
            [18237.032449] CR2: 0000000000000000 CR3: 00000006c0a10000 CR4: 00000000000607e0
            [18237.040420] Call Trace:
            [18237.043201]  [<ffffffffc12e0eae>] ? osd_ldiskfs_it_fill+0xbe/0x260 [osd_ldiskfs]
            [18237.051478]  [<ffffffffc12e1087>] osd_it_ea_load+0x37/0x100 [osd_ldiskfs]
            [18237.059104]  [<ffffffffc1538e07>] lod_it_load+0x27/0x90 [lod]
            [18237.065587]  [<ffffffffc0d0e258>] dt_index_walk+0xf8/0x430 [obdclass]
            [18237.072807]  [<ffffffffc15c0c60>] ? mdd_object_lock+0xe0/0xe0 [mdd]
            [18237.079818]  [<ffffffffc15c25bf>] mdd_readpage+0x25f/0x5a0 [mdd]
            [18237.086558]  [<ffffffffc141f88a>] mdt_readpage+0x69a/0x8e0 [mdt]
            [18237.093384]  [<ffffffffc104d9da>] tgt_request_handle+0xada/0x1570 [ptlrpc]
            [18237.101137]  [<ffffffffc1027121>] ? ptlrpc_nrs_req_get_nolock0+0xd1/0x170 [ptlrpc]
            [18237.109611]  [<ffffffffc0bc0bde>] ? ktime_get_real_seconds+0xe/0x10 [libcfs]
            [18237.117558]  [<ffffffffc0ff248b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
            [18237.126192]  [<ffffffffc0fef2a5>] ? ptlrpc_wait_event+0xa5/0x360 [ptlrpc]
            [18237.133784]  [<ffffffff9f2d3a33>] ? __wake_up+0x13/0x20
            [18237.139689]  [<ffffffffc0ff5df4>] ptlrpc_main+0xb34/0x1470 [ptlrpc]
            [18237.146704]  [<ffffffff9f980e02>] ? __schedule+0x402/0x840
            [18237.152899]  [<ffffffffc0ff52c0>] ? ptlrpc_register_service+0xf80/0xf80 [ptlrpc]
            [18237.161175]  [<ffffffff9f2c6321>] kthread+0xd1/0xe0
            [18237.166629]  [<ffffffff9f2c6250>] ? insert_kthread_work+0x40/0x40
            [18237.173457]  [<ffffffff9f98ed37>] ret_from_fork_nospec_begin+0x21/0x21
            [18237.180748]  [<ffffffff9f2c6250>] ? insert_kthread_work+0x40/0x40
            [18237.187548] Code:  Bad RIP value.
            [18237.191284] RIP  [<          (null)>]           (null)
            [18237.197031]  RSP <ffff8b88072f3b68>
            [18237.200923] CR2: 0000000000000000
            [    0.000000] Initializing cgroup subsys cpuset
            [    0.000000] Initializing cgroup subsys cpu
            [    0.000000] Initializing cgroup subsys cpuacct
            [    0.000000] Linux version 3.10.0-1062.18.1.el7_lustre.x86_64 (jenkins@trevis-307-el7-x8664-1.trevis.whamcloud.com) (gcc version 4.8.5 20150623 (Red Hat 4.8.5-39) (GCC) ) #1 SMP Wed May 27 23:31:53 UTC 2020
            [    0.000000] Command line: BOOT_IMAGE=/boot/vmlinuz-3.10.0-1062.18.1.el7_lustre.x86_64 ro console=ttyS0,115200 irqpoll nr_cpus=1 reset_devices cgroup_disable=memory mce=off numa=off udev.children-max=2 panic=10 rootflags=nofail acpi_no_memhotplug transparent_hugepage=never nokaslr novmcoredd disable_cpu_apicid=0 elfcorehdr=869816K
            [    0.000000] e820: BIOS-provided physical RAM map:
            [    0.000000] BIOS-e820: [mem 0x0000000000000000-0x0000000000000fff] reserved
            [    0.000000] BIOS-e820: [mem 0x0000000000001000-0x000000000008efff] usable
            [    0.000000] BIOS-e820: [mem 0x000000000008f000-0x000000000009ffff] reserved
            [    0.000000] BIOS-e820: [mem 0x00000000000e0000-0x00000000000fffff] reserved
            [    0.000000] BIOS-e820: [mem 0x000000002b000000-0x000000003516dfff] usable
            [    0.000000] BIOS-e820: [mem 0x00000000bb3c7000-0x00000000bdd2efff] reserved
            
            
            sarah Sarah Liu added a comment - soak hit the problem on 2.12.5 testing after running for 4 days. [18137.800259] Pid: 5007, comm: mdt01_017 3.10.0-1062.18.1.el7_lustre.x86_64 #1 SMP Wed May 27 23:31:53 UTC 2020 [18137.811351] Call Trace: [18137.814107] [<ffffffffc107f8fc>] top_trans_wait_result+0xa9/0x158 [ptlrpc] [18137.822008] [<ffffffffc105fb79>] top_trans_stop+0x4e9/0xa70 [ptlrpc] [18137.829352] [<ffffffffc1522dec>] lod_trans_stop+0x25c/0x340 [lod] [18137.836307] [<ffffffffc15d56ee>] mdd_trans_stop+0x2e/0x174 [mdd] [18137.843150] [<ffffffffc15bdc73>] mdd_unlink+0x3f3/0xaa0 [mdd] [18137.849700] [<ffffffffc148139d>] mdo_unlink+0x46/0x48 [mdt] [18137.856087] [<ffffffffc14452bd>] mdt_reint_unlink+0xbed/0x14b0 [mdt] [18137.863361] [<ffffffffc1449963>] mdt_reint_rec+0x83/0x210 [mdt] [18137.870109] [<ffffffffc1426273>] mdt_reint_internal+0x6e3/0xaf0 [mdt] [18137.877464] [<ffffffffc14316e7>] mdt_reint+0x67/0x140 [mdt] [18137.883828] [<ffffffffc104d9da>] tgt_request_handle+0xada/0x1570 [ptlrpc] [18137.891641] [<ffffffffc0ff248b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc] [18137.900298] [<ffffffffc0ff5df4>] ptlrpc_main+0xb34/0x1470 [ptlrpc] [18137.907411] [<ffffffff9f2c6321>] kthread+0xd1/0xe0 [18137.912887] [<ffffffff9f98ed37>] ret_from_fork_nospec_end+0x0/0x39 [18137.919936] [<ffffffffffffffff>] 0xffffffffffffffff [18137.925528] LNet: Service thread pid 4804 was inactive for 202.30s. Watchdog stack traces are limited to 3 per 300 seconds, skipping this one. [18152.716405] LNet: 4610:0:(o2iblnd_cb.c:3397:kiblnd_check_conns()) Timed out tx for 192.168.1.111@o2ib: 0 seconds [18152.727780] LNet: 4610:0:(o2iblnd_cb.c:3397:kiblnd_check_conns()) Skipped 111 previous similar messages [18210.718946] Lustre: 4675:0:(client.c:2133:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1591370247/real 1591370247] req@ffff8b87f443ec00 x1668654025156672/t0(0) o105->soaked-MDT0001@192.168.1.111@o2ib:15/16 lens 304/224 e 0 to 1 dl 1591370254 ref 2 fl Rpc:X/2/ffffffff rc 0/-1 [18210.749830] Lustre: 4675:0:(client.c:2133:ptlrpc_expire_one_request()) Skipped 110 previous similar messages [18232.719942] LNet: 4610:0:(o2iblnd_cb.c:3397:kiblnd_check_conns()) Timed out tx for 192.168.1.111@o2ib: 0 seconds [18232.731320] LNet: 4610:0:(o2iblnd_cb.c:3397:kiblnd_check_conns()) Skipped 91 previous similar messages [18235.829192] Lustre: soaked-MDT0001: Received new LWP connection from 192.168.1.108@o2ib, removing former export from same NID [18235.841908] Lustre: soaked-MDT0001: Connection restored to soaked-MDT0000-mdtlov_UUID (at 192.168.1.108@o2ib) [18235.853001] Lustre: Skipped 1 previous similar message [18236.004566] LustreError: 4796:0:(osd_oi.c:762:osd_oi_insert()) dm-1: the FID [0x2400134c5:0x8001:0x0] is used by two objects: 477626398/123258462 477626399/123258463 [18236.053964] LNet: 4615:0:(o2iblnd_cb.c:413:kiblnd_handle_rx()) PUT_NACK from 192.168.1.108@o2ib [18236.794778] BUG: unable to handle kernel NULL pointer dereference at (null) [18236.803570] IP: [< (null)>] (null) [18236.809248] PGD 0 [18236.811519] Oops: 0010 [#1] SMP [18236.815153] Modules linked in: osp(OE) mdd(OE) lod(OE) mdt(OE) lfsck(OE) mgc(OE) osd_ldiskfs(OE) ldiskfs(OE) lquota(OE) fid(OE) fld(OE) ko2iblnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) libcfs(OE) rpcsec_gss_krb5 nfsv4 dns_resolver nfs lockd grace fscache rdma_ucm(OE) ib_ucm(OE) rdma_cm(OE) iw_cm(OE) ib_ipoib(OE) ib_cm(OE) ib_umad(OE) mlx5_ib(OE) ib_uverbs(OE) mlx5_core(OE) mlxfw(OE) mlx4_en(OE) sb_edac intel_powerclamp coretemp intel_rapl iosf_mbi kvm irqbypass crc32_pclmul ghash_clmulni_intel iTCO_wdt iTCO_vendor_support aesni_intel lrw gf128mul glue_helper ipmi_ssif ablk_helper dm_round_robin mei_me cryptd mei sg ioatdma ipmi_si i2c_i801 lpc_ich ipmi_devintf joydev ipmi_msghandler pcspkr pcc_cpufreq wmi dm_multipath dm_mod auth_rpcgss sunrpc ip_tables ext4 mbcache jbd2 mlx4_ib(OE) ib_core(OE) sd_mod crc_t10dif crct10dif_generic mgag200 drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops isci ttm igb mlx4_core(OE) ahci libsas libahci ptp mpt2sas drm devlink pps_core crct10dif_pclmul raid_class crct10dif_common dca libata crc32c_intel mlx_compat(OE) scsi_transport_sas drm_panel_orientation_quirks i2c_algo_bit [18236.927648] CPU: 31 PID: 4694 Comm: mdt_rdpg01_001 Kdump: loaded Tainted: G OE ------------ 3.10.0-1062.18.1.el7_lustre.x86_64 #1 [18236.941923] Hardware name: Intel Corporation S2600GZ ........../S2600GZ, BIOS SE5C600.86B.01.08.0003.022620131521 02/26/2013 [18236.954462] task: ffff8b8807dae2a0 ti: ffff8b88072f0000 task.ti: ffff8b88072f0000 [18236.962824] RIP: 0010:[<0000000000000000>] [< (null)>] (null) [18236.971207] RSP: 0018:ffff8b88072f3b68 EFLAGS: 00010246 [18236.977142] RAX: 0000000000000000 RBX: ffff8b87ff710000 RCX: 0000000000000002 [18236.985109] RDX: ffffffffc12e2ab0 RSI: ffff8b88072f3b70 RDI: ffff8b87ff710008 [18236.993080] RBP: ffff8b88072f3ba8 R08: 0000000000000000 R09: 0000000000000001 [18237.001053] R10: ffff8b842dbb5800 R11: 00000000007fffff R12: ffff8b87fef0f000 [18237.009011] R13: ffff8b87eb34f628 R14: ffff8b88283e6800 R15: ffff8b87ff710008 [18237.016993] FS: 0000000000000000(0000) GS:ffff8b882e5c0000(0000) knlGS:0000000000000000 [18237.026029] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [18237.032449] CR2: 0000000000000000 CR3: 00000006c0a10000 CR4: 00000000000607e0 [18237.040420] Call Trace: [18237.043201] [<ffffffffc12e0eae>] ? osd_ldiskfs_it_fill+0xbe/0x260 [osd_ldiskfs] [18237.051478] [<ffffffffc12e1087>] osd_it_ea_load+0x37/0x100 [osd_ldiskfs] [18237.059104] [<ffffffffc1538e07>] lod_it_load+0x27/0x90 [lod] [18237.065587] [<ffffffffc0d0e258>] dt_index_walk+0xf8/0x430 [obdclass] [18237.072807] [<ffffffffc15c0c60>] ? mdd_object_lock+0xe0/0xe0 [mdd] [18237.079818] [<ffffffffc15c25bf>] mdd_readpage+0x25f/0x5a0 [mdd] [18237.086558] [<ffffffffc141f88a>] mdt_readpage+0x69a/0x8e0 [mdt] [18237.093384] [<ffffffffc104d9da>] tgt_request_handle+0xada/0x1570 [ptlrpc] [18237.101137] [<ffffffffc1027121>] ? ptlrpc_nrs_req_get_nolock0+0xd1/0x170 [ptlrpc] [18237.109611] [<ffffffffc0bc0bde>] ? ktime_get_real_seconds+0xe/0x10 [libcfs] [18237.117558] [<ffffffffc0ff248b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc] [18237.126192] [<ffffffffc0fef2a5>] ? ptlrpc_wait_event+0xa5/0x360 [ptlrpc] [18237.133784] [<ffffffff9f2d3a33>] ? __wake_up+0x13/0x20 [18237.139689] [<ffffffffc0ff5df4>] ptlrpc_main+0xb34/0x1470 [ptlrpc] [18237.146704] [<ffffffff9f980e02>] ? __schedule+0x402/0x840 [18237.152899] [<ffffffffc0ff52c0>] ? ptlrpc_register_service+0xf80/0xf80 [ptlrpc] [18237.161175] [<ffffffff9f2c6321>] kthread+0xd1/0xe0 [18237.166629] [<ffffffff9f2c6250>] ? insert_kthread_work+0x40/0x40 [18237.173457] [<ffffffff9f98ed37>] ret_from_fork_nospec_begin+0x21/0x21 [18237.180748] [<ffffffff9f2c6250>] ? insert_kthread_work+0x40/0x40 [18237.187548] Code: Bad RIP value. [18237.191284] RIP [< (null)>] (null) [18237.197031] RSP <ffff8b88072f3b68> [18237.200923] CR2: 0000000000000000 [ 0.000000] Initializing cgroup subsys cpuset [ 0.000000] Initializing cgroup subsys cpu [ 0.000000] Initializing cgroup subsys cpuacct [ 0.000000] Linux version 3.10.0-1062.18.1.el7_lustre.x86_64 (jenkins@trevis-307-el7-x8664-1.trevis.whamcloud.com) (gcc version 4.8.5 20150623 (Red Hat 4.8.5-39) (GCC) ) #1 SMP Wed May 27 23:31:53 UTC 2020 [ 0.000000] Command line: BOOT_IMAGE=/boot/vmlinuz-3.10.0-1062.18.1.el7_lustre.x86_64 ro console=ttyS0,115200 irqpoll nr_cpus=1 reset_devices cgroup_disable=memory mce=off numa=off udev.children-max=2 panic=10 rootflags=nofail acpi_no_memhotplug transparent_hugepage=never nokaslr novmcoredd disable_cpu_apicid=0 elfcorehdr=869816K [ 0.000000] e820: BIOS-provided physical RAM map: [ 0.000000] BIOS-e820: [mem 0x0000000000000000-0x0000000000000fff] reserved [ 0.000000] BIOS-e820: [mem 0x0000000000001000-0x000000000008efff] usable [ 0.000000] BIOS-e820: [mem 0x000000000008f000-0x000000000009ffff] reserved [ 0.000000] BIOS-e820: [mem 0x00000000000e0000-0x00000000000fffff] reserved [ 0.000000] BIOS-e820: [mem 0x000000002b000000-0x000000003516dfff] usable [ 0.000000] BIOS-e820: [mem 0x00000000bb3c7000-0x00000000bdd2efff] reserved
            sarah Sarah Liu added a comment -

            Hit this issue again on lustre-b2_12-next #57
            soak-11 console:

            [16422.066032] Lustre: 79110:0:(service.c:1322:ptlrpc_at_send_early_reply()) Skipped 425 previous similar messages
            [16469.920777] Lustre: 79107:0:(service.c:2011:ptlrpc_server_handle_req_in()) @@@ Slow req_in handling 225s  req@ffff89052d9cf980 x1656840860166528/t0(0) o103->s
            oaked-MDT0002-mdtlov_UUID@192.168.1.110@o2ib:0/0 lens 328/0 e 0 to 0 dl 0 ref 1 fl New:/2/ffffffff rc 0/-1
            [16469.946879] Lustre: 79107:0:(service.c:2011:ptlrpc_server_handle_req_in()) Skipped 370 previous similar messages
            [16511.443090] Lustre: ldlm_canceld: This server is not able to keep up with request traffic (cpu-bound).
            [16511.453496] Lustre: 79109:0:(service.c:1541:ptlrpc_at_check_timed()) earlyQ=3796 reqQ=17207 recA=0, svcEst=600, delay=85403
            [16511.466032] Lustre: 79109:0:(service.c:1322:ptlrpc_at_send_early_reply()) @@@ Already past deadline (-2s), not sending early reply. Consider increasing at_ear
            ly_margin (5)?  req@ffff890983671200 x1656840826067648/t0(0) o103->soaked-MDT0000-mdtlov_UUID@192.168.1.108@o2ib:154/0 lens 328/0 e 0 to 0 dl 1580100394 ref 2 fl
             New:/0/ffffffff rc 0/-1
            [16511.499803] Lustre: 79109:0:(service.c:1322:ptlrpc_at_send_early_reply()) Skipped 109 previous similar messages
            [16589.022832] Lustre: soaked-MDT0003: Client 58fa13bb-1d12-d81e-9977-655c7b0a359d (at 192.168.1.117@o2ib) reconnecting
            [16665.657814] Lustre: soaked-MDT0003: Client 2d96e5f6-af44-1f55-5648-3c5026213dae (at 192.168.1.130@o2ib) reconnecting
            [16751.557870] Lustre: soaked-MDT0003: Client d23f5b34-b590-b62f-81cf-55bba7beeb62 (at 192.168.1.142@o2ib) reconnecting
            [16756.583622] Lustre: soaked-MDT0003: Client bd3af862-653c-bd66-2072-7b69d8812dd7 (at 192.168.1.140@o2ib) reconnecting
            [16756.595384] Lustre: Skipped 1 previous similar message
            [16781.901025] Lustre: soaked-MDT0003: Client 55eecb6a-6cc1-b7f2-e17b-d2ce51f4ff57 (at 192.168.1.141@o2ib) reconnecting
            [16785.356557] perf: interrupt took too long (3961 > 3942), lowering kernel.perf_event_max_sample_rate to 50000
            [16840.006883] Lustre: 78276:0:(client.c:2133:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1580100124/real 1580100124]  req@
            ffff890592e0d580 x1656840776727360/t0(0) o103->soaked-MDT0000-osp-MDT0003@192.168.1.108@o2ib:17/18 lens 328/224 e 7 to 1 dl 1580100725 ref 1 fl Rpc:X/2/ffffffff 
            rc -11/-1
            [16840.012891] Lustre: soaked-MDT0000-osp-MDT0003: Connection to soaked-MDT0000 (at 192.168.1.108@o2ib) was lost; in progress operations using this service will 
            wait for recovery to complete
            [16840.057838] Lustre: 78276:0:(client.c:2133:ptlrpc_expire_one_request()) Skipped 1124 previous similar messages
            [16840.104230] Lustre: soaked-MDT0000-osp-MDT0003: Connection restored to 192.168.1.108@o2ib (at 192.168.1.108@o2ib)
            [16840.115694] Lustre: Skipped 8 previous similar messages
            [16844.361410] Lustre: soaked-MDT0003: Received new LWP connection from 192.168.1.110@o2ib, removing former export from same NID
            [16844.374044] Lustre: Skipped 1 previous similar message
            [16879.242891] Lustre: 79107:0:(service.c:1372:ptlrpc_at_send_early_reply()) @@@ Couldn't add any time (5/5), not sending early reply
            [16879.242891]   req@ffff890969d8e300 x1656840800674880/t0(0) o103->soaked-MDT0000-mdtlov_UUID@192.168.1.108@o2ib:529/0 lens 328/0 e 0 to 0 dl 1580100769 ref 2 f
            l New:/2/ffffffff rc 0/-1
            [16879.274271] Lustre: 79107:0:(service.c:1372:ptlrpc_at_send_early_reply()) Skipped 34712 previous similar messages
            [16884.947622] LustreError: 79136:0:(osd_oi.c:762:osd_oi_insert()) dm-5: the FID [0x2c00543c1:0xe1d:0x0] is used by two objects: 212861283/3086025257 233832815/3
            086025258
            [16884.964548] LNet: 78257:0:(o2iblnd_cb.c:413:kiblnd_handle_rx()) PUT_NACK from 192.168.1.108@o2ib
            [16887.446546] BUG: unable to handle kernel NULL pointer dereference at           (null)
            [16887.455308] IP: [<          (null)>]           (null)
            [16887.460965] PGD 0
            [16887.463234] Oops: 0010 [#1] SMP
            [16887.466862] Modules linked in: osp(OE) mdd(OE) lod(OE) mdt(OE) lfsck(OE) mgc(OE) osd_ldiskfs(OE) lquota(OE) ldiskfs(OE) lustre(OE) lmv(OE) mdc(OE) osc(OE) lov
            (OE) fid(OE) fld(OE) ko2iblnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) libcfs(OE) rpcsec_gss_krb5 nfsv4 dns_resolver nfs lockd grace fscache rdma_ucm(OE) ib_ucm(OE)
            rdma_cm(OE) iw_cm(OE) ib_ipoib(OE) ib_cm(OE) ib_umad(OE) mlx5_ib(OE) mlx5_core(OE) mlxfw(OE) mlx4_en(OE) mlx4_ib(OE) ib_uverbs(OE) ib_core(OE) sb_edac intel_powe
            rclamp coretemp intel_rapl iosf_mbi kvm iTCO_wdt irqbypass iTCO_vendor_support crc32_pclmul ghash_clmulni_intel aesni_intel lrw ipmi_ssif gf128mul glue_helper me
            i_me dm_round_robin ablk_helper i2c_i801 mei ipmi_si lpc_ich cryptd ioatdma joydev pcspkr sg ipmi_devintf ipmi_msghandler wmi pcc_cpufreq dm_multipath dm_mod aut
            h_rpcgss sunrpc ip_tables ext4 mbcache jbd2 sd_mod crc_t10dif crct10dif_generic mgag200 drm_kms_helper syscopyarea sysfillrect mlx4_core(OE) igb sysimgblt fb_sys
            _fops isci mpt3sas ttm ahci libsas libahci ptp crct10dif_pclmul drm devlink pps_core raid_class crct10dif_common dca libata crc32c_intel mlx_compat(OE) scsi_tran
            sport_sas drm_panel_orientation_quirks i2c_algo_bit
            [16887.582758] CPU: 8 PID: 83890 Comm: mdt_out01_011 Kdump: loaded Tainted: G           OE  ------------   3.10.0-1062.9.1.el7_lustre.x86_64 #1
            [16887.596833] Hardware name: Intel Corporation S2600GZ ........../S2600GZ, BIOS SE5C600.86B.01.08.0003.022620131521 02/26/2013
            [16887.609356] task: ffff8905a85d41c0 ti: ffff8905765b0000 task.ti: ffff8905765b0000
            [16887.617710] RIP: 0010:[<0000000000000000>]  [<          (null)>]           (null)
            [16887.626075] RSP: 0018:ffff8905765b3b68  EFLAGS: 00010246
            [16887.631996] RAX: 0000000000000000 RBX: ffff89097aff8000 RCX: 0000000000000002
            [16887.639962] RDX: ffffffffc1564ab0 RSI: ffff8905765b3b70 RDI: ffff89097aff8008
            [16887.647926] RBP: ffff8905765b3ba8 R08: 0000000000000000 R09: 0000000000000001
            [16887.655891] R10: 0000000200000008 R11: 00000000007fffff R12: ffff8909a93ded00
            [16887.663863] R13: ffff89096ddb15d8 R14: ffff890981779a00 R15: ffff89097aff8008
            [16887.671829] FS:  0000000000000000(0000) GS:ffff8909ae200000(0000) knlGS:0000000000000000
            [16887.680870] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
            [16887.687285] CR2: 0000000000000000 CR3: 0000000604210000 CR4: 00000000000607e0
            [16887.695251] Call Trace:
            [16887.698006]  [<ffffffffc1562eae>] ? osd_ldiskfs_it_fill+0xbe/0x260 [osd_ldiskfs]
            [16887.706274]  [<ffffffffc1563087>] osd_it_ea_load+0x37/0x100 [osd_ldiskfs]
            [16887.713907]  [<ffffffffc0dba208>] dt_index_walk+0xf8/0x430 [obdclass]
            [16887.721135]  [<ffffffffc0dba540>] ? dt_index_walk+0x430/0x430 [obdclass]
            [16887.728649]  [<ffffffffc0dbae34>] dt_index_read+0x394/0x6a0 [obdclass]
            [16887.736047]  [<ffffffffc10f4e42>] tgt_obd_idx_read+0x612/0x860 [ptlrpc]
            [16887.743503]  [<ffffffffc10f99ca>] tgt_request_handle+0xada/0x1570 [ptlrpc]
            [16887.751242]  [<ffffffffc10d3101>] ? ptlrpc_nrs_req_get_nolock0+0xd1/0x170 [ptlrpc]
            [16887.759709]  [<ffffffffc0c6ebde>] ? ktime_get_real_seconds+0xe/0x10 [libcfs]
            [16887.767639]  [<ffffffffc109e47b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
            [16887.776256]  [<ffffffffc109b295>] ? ptlrpc_wait_event+0xa5/0x360 [ptlrpc]
            [16887.783833]  [<ffffffff892d3903>] ? __wake_up+0x13/0x20
            [16887.789723]  [<ffffffffc10a1de4>] ptlrpc_main+0xb34/0x1470 [ptlrpc]
            [16887.796776]  [<ffffffffc10a12b0>] ? ptlrpc_register_service+0xf80/0xf80 [ptlrpc]
            [16887.805038]  [<ffffffff892c61f1>] kthread+0xd1/0xe0
            [16887.810487]  [<ffffffff892c6120>] ? insert_kthread_work+0x40/0x40
            [16887.817293]  [<ffffffff8998dd37>] ret_from_fork_nospec_begin+0x21/0x21
            [16887.824581]  [<ffffffff892c6120>] ? insert_kthread_work+0x40/0x40
            [16887.831383] Code:  Bad RIP value.
            [16887.835102] RIP  [<          (null)>]           (null)
            [16887.835102] RIP  [<          (null)>]           (null)
            [16887.840852]  RSP <ffff8905765b3b68>
            [16887.844734] CR2: 0000000000000000
            [    0.000000] Initializing cgroup subsys cpuset
            [    0.000000] Initializing cgroup subsys cpu
            [    0.000000] Initializing cgroup subsys cpuacct
            [    0.000000] Linux version 3.10.0-1062.9.1.el7_lustre.x86_64 (jenkins@trevis-310-el7-x8664-6.trevis.whamcloud.com) (gcc version 4.8.5 20150623 (Red Hat 4.8.5-39) (GCC) ) #1 SMP Thu Jan 23 14:10:41 UTC 2020
            [    0.000000] Command line: BOOT_IMAGE=/boot/vmlinuz-3.10.0-1062.9.1.el7_lustre.x86_64 ro console=ttyS0,115200 irqpoll nr_cpus=1 reset_devices cgroup_disable=memory mce=off numa=off udev.children-max=2 panic=10 rootflags=nofail acpi_no_memhotplug transparent_hugepage=never nokaslr novmcoredd disable_cpu_apicid=0 elfcorehdr=869812K
            [    0.000000] e820: BIOS-provided physical RAM map:
            [    0.000000] BIOS-e820: [mem 0x0000000000000000-0x0000000000000fff] reserved
            [    0.000000] BIOS-e820: [mem 0x0000000000001000-0x000000000008ffff] usable
            [    0.000000] BIOS-e820: [mem 0x0000000000090000-0x000000000009ffff] reserved
            

            core dump can be found under /scratch/dumps/soak-11.spirit.whamcloud.com/10.10.1.111-2020-01-27-04\:53\:21/

            sarah Sarah Liu added a comment - Hit this issue again on lustre-b2_12-next #57 soak-11 console: [16422.066032] Lustre: 79110:0:(service.c:1322:ptlrpc_at_send_early_reply()) Skipped 425 previous similar messages [16469.920777] Lustre: 79107:0:(service.c:2011:ptlrpc_server_handle_req_in()) @@@ Slow req_in handling 225s req@ffff89052d9cf980 x1656840860166528/t0(0) o103->s oaked-MDT0002-mdtlov_UUID@192.168.1.110@o2ib:0/0 lens 328/0 e 0 to 0 dl 0 ref 1 fl New:/2/ffffffff rc 0/-1 [16469.946879] Lustre: 79107:0:(service.c:2011:ptlrpc_server_handle_req_in()) Skipped 370 previous similar messages [16511.443090] Lustre: ldlm_canceld: This server is not able to keep up with request traffic (cpu-bound). [16511.453496] Lustre: 79109:0:(service.c:1541:ptlrpc_at_check_timed()) earlyQ=3796 reqQ=17207 recA=0, svcEst=600, delay=85403 [16511.466032] Lustre: 79109:0:(service.c:1322:ptlrpc_at_send_early_reply()) @@@ Already past deadline (-2s), not sending early reply. Consider increasing at_ear ly_margin (5)? req@ffff890983671200 x1656840826067648/t0(0) o103->soaked-MDT0000-mdtlov_UUID@192.168.1.108@o2ib:154/0 lens 328/0 e 0 to 0 dl 1580100394 ref 2 fl New:/0/ffffffff rc 0/-1 [16511.499803] Lustre: 79109:0:(service.c:1322:ptlrpc_at_send_early_reply()) Skipped 109 previous similar messages [16589.022832] Lustre: soaked-MDT0003: Client 58fa13bb-1d12-d81e-9977-655c7b0a359d (at 192.168.1.117@o2ib) reconnecting [16665.657814] Lustre: soaked-MDT0003: Client 2d96e5f6-af44-1f55-5648-3c5026213dae (at 192.168.1.130@o2ib) reconnecting [16751.557870] Lustre: soaked-MDT0003: Client d23f5b34-b590-b62f-81cf-55bba7beeb62 (at 192.168.1.142@o2ib) reconnecting [16756.583622] Lustre: soaked-MDT0003: Client bd3af862-653c-bd66-2072-7b69d8812dd7 (at 192.168.1.140@o2ib) reconnecting [16756.595384] Lustre: Skipped 1 previous similar message [16781.901025] Lustre: soaked-MDT0003: Client 55eecb6a-6cc1-b7f2-e17b-d2ce51f4ff57 (at 192.168.1.141@o2ib) reconnecting [16785.356557] perf: interrupt took too long (3961 > 3942), lowering kernel.perf_event_max_sample_rate to 50000 [16840.006883] Lustre: 78276:0:(client.c:2133:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1580100124/real 1580100124] req@ ffff890592e0d580 x1656840776727360/t0(0) o103->soaked-MDT0000-osp-MDT0003@192.168.1.108@o2ib:17/18 lens 328/224 e 7 to 1 dl 1580100725 ref 1 fl Rpc:X/2/ffffffff rc -11/-1 [16840.012891] Lustre: soaked-MDT0000-osp-MDT0003: Connection to soaked-MDT0000 (at 192.168.1.108@o2ib) was lost; in progress operations using this service will wait for recovery to complete [16840.057838] Lustre: 78276:0:(client.c:2133:ptlrpc_expire_one_request()) Skipped 1124 previous similar messages [16840.104230] Lustre: soaked-MDT0000-osp-MDT0003: Connection restored to 192.168.1.108@o2ib (at 192.168.1.108@o2ib) [16840.115694] Lustre: Skipped 8 previous similar messages [16844.361410] Lustre: soaked-MDT0003: Received new LWP connection from 192.168.1.110@o2ib, removing former export from same NID [16844.374044] Lustre: Skipped 1 previous similar message [16879.242891] Lustre: 79107:0:(service.c:1372:ptlrpc_at_send_early_reply()) @@@ Couldn't add any time (5/5), not sending early reply [16879.242891] req@ffff890969d8e300 x1656840800674880/t0(0) o103->soaked-MDT0000-mdtlov_UUID@192.168.1.108@o2ib:529/0 lens 328/0 e 0 to 0 dl 1580100769 ref 2 f l New:/2/ffffffff rc 0/-1 [16879.274271] Lustre: 79107:0:(service.c:1372:ptlrpc_at_send_early_reply()) Skipped 34712 previous similar messages [16884.947622] LustreError: 79136:0:(osd_oi.c:762:osd_oi_insert()) dm-5: the FID [0x2c00543c1:0xe1d:0x0] is used by two objects: 212861283/3086025257 233832815/3 086025258 [16884.964548] LNet: 78257:0:(o2iblnd_cb.c:413:kiblnd_handle_rx()) PUT_NACK from 192.168.1.108@o2ib [16887.446546] BUG: unable to handle kernel NULL pointer dereference at (null) [16887.455308] IP: [< (null)>] (null) [16887.460965] PGD 0 [16887.463234] Oops: 0010 [#1] SMP [16887.466862] Modules linked in: osp(OE) mdd(OE) lod(OE) mdt(OE) lfsck(OE) mgc(OE) osd_ldiskfs(OE) lquota(OE) ldiskfs(OE) lustre(OE) lmv(OE) mdc(OE) osc(OE) lov (OE) fid(OE) fld(OE) ko2iblnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) libcfs(OE) rpcsec_gss_krb5 nfsv4 dns_resolver nfs lockd grace fscache rdma_ucm(OE) ib_ucm(OE) rdma_cm(OE) iw_cm(OE) ib_ipoib(OE) ib_cm(OE) ib_umad(OE) mlx5_ib(OE) mlx5_core(OE) mlxfw(OE) mlx4_en(OE) mlx4_ib(OE) ib_uverbs(OE) ib_core(OE) sb_edac intel_powe rclamp coretemp intel_rapl iosf_mbi kvm iTCO_wdt irqbypass iTCO_vendor_support crc32_pclmul ghash_clmulni_intel aesni_intel lrw ipmi_ssif gf128mul glue_helper me i_me dm_round_robin ablk_helper i2c_i801 mei ipmi_si lpc_ich cryptd ioatdma joydev pcspkr sg ipmi_devintf ipmi_msghandler wmi pcc_cpufreq dm_multipath dm_mod aut h_rpcgss sunrpc ip_tables ext4 mbcache jbd2 sd_mod crc_t10dif crct10dif_generic mgag200 drm_kms_helper syscopyarea sysfillrect mlx4_core(OE) igb sysimgblt fb_sys _fops isci mpt3sas ttm ahci libsas libahci ptp crct10dif_pclmul drm devlink pps_core raid_class crct10dif_common dca libata crc32c_intel mlx_compat(OE) scsi_tran sport_sas drm_panel_orientation_quirks i2c_algo_bit [16887.582758] CPU: 8 PID: 83890 Comm: mdt_out01_011 Kdump: loaded Tainted: G OE ------------ 3.10.0-1062.9.1.el7_lustre.x86_64 #1 [16887.596833] Hardware name: Intel Corporation S2600GZ ........../S2600GZ, BIOS SE5C600.86B.01.08.0003.022620131521 02/26/2013 [16887.609356] task: ffff8905a85d41c0 ti: ffff8905765b0000 task.ti: ffff8905765b0000 [16887.617710] RIP: 0010:[<0000000000000000>] [< (null)>] (null) [16887.626075] RSP: 0018:ffff8905765b3b68 EFLAGS: 00010246 [16887.631996] RAX: 0000000000000000 RBX: ffff89097aff8000 RCX: 0000000000000002 [16887.639962] RDX: ffffffffc1564ab0 RSI: ffff8905765b3b70 RDI: ffff89097aff8008 [16887.647926] RBP: ffff8905765b3ba8 R08: 0000000000000000 R09: 0000000000000001 [16887.655891] R10: 0000000200000008 R11: 00000000007fffff R12: ffff8909a93ded00 [16887.663863] R13: ffff89096ddb15d8 R14: ffff890981779a00 R15: ffff89097aff8008 [16887.671829] FS: 0000000000000000(0000) GS:ffff8909ae200000(0000) knlGS:0000000000000000 [16887.680870] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [16887.687285] CR2: 0000000000000000 CR3: 0000000604210000 CR4: 00000000000607e0 [16887.695251] Call Trace: [16887.698006] [<ffffffffc1562eae>] ? osd_ldiskfs_it_fill+0xbe/0x260 [osd_ldiskfs] [16887.706274] [<ffffffffc1563087>] osd_it_ea_load+0x37/0x100 [osd_ldiskfs] [16887.713907] [<ffffffffc0dba208>] dt_index_walk+0xf8/0x430 [obdclass] [16887.721135] [<ffffffffc0dba540>] ? dt_index_walk+0x430/0x430 [obdclass] [16887.728649] [<ffffffffc0dbae34>] dt_index_read+0x394/0x6a0 [obdclass] [16887.736047] [<ffffffffc10f4e42>] tgt_obd_idx_read+0x612/0x860 [ptlrpc] [16887.743503] [<ffffffffc10f99ca>] tgt_request_handle+0xada/0x1570 [ptlrpc] [16887.751242] [<ffffffffc10d3101>] ? ptlrpc_nrs_req_get_nolock0+0xd1/0x170 [ptlrpc] [16887.759709] [<ffffffffc0c6ebde>] ? ktime_get_real_seconds+0xe/0x10 [libcfs] [16887.767639] [<ffffffffc109e47b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc] [16887.776256] [<ffffffffc109b295>] ? ptlrpc_wait_event+0xa5/0x360 [ptlrpc] [16887.783833] [<ffffffff892d3903>] ? __wake_up+0x13/0x20 [16887.789723] [<ffffffffc10a1de4>] ptlrpc_main+0xb34/0x1470 [ptlrpc] [16887.796776] [<ffffffffc10a12b0>] ? ptlrpc_register_service+0xf80/0xf80 [ptlrpc] [16887.805038] [<ffffffff892c61f1>] kthread+0xd1/0xe0 [16887.810487] [<ffffffff892c6120>] ? insert_kthread_work+0x40/0x40 [16887.817293] [<ffffffff8998dd37>] ret_from_fork_nospec_begin+0x21/0x21 [16887.824581] [<ffffffff892c6120>] ? insert_kthread_work+0x40/0x40 [16887.831383] Code: Bad RIP value. [16887.835102] RIP [< (null)>] (null) [16887.835102] RIP [< (null)>] (null) [16887.840852] RSP <ffff8905765b3b68> [16887.844734] CR2: 0000000000000000 [ 0.000000] Initializing cgroup subsys cpuset [ 0.000000] Initializing cgroup subsys cpu [ 0.000000] Initializing cgroup subsys cpuacct [ 0.000000] Linux version 3.10.0-1062.9.1.el7_lustre.x86_64 (jenkins@trevis-310-el7-x8664-6.trevis.whamcloud.com) (gcc version 4.8.5 20150623 (Red Hat 4.8.5-39) (GCC) ) #1 SMP Thu Jan 23 14:10:41 UTC 2020 [ 0.000000] Command line: BOOT_IMAGE=/boot/vmlinuz-3.10.0-1062.9.1.el7_lustre.x86_64 ro console=ttyS0,115200 irqpoll nr_cpus=1 reset_devices cgroup_disable=memory mce=off numa=off udev.children-max=2 panic=10 rootflags=nofail acpi_no_memhotplug transparent_hugepage=never nokaslr novmcoredd disable_cpu_apicid=0 elfcorehdr=869812K [ 0.000000] e820: BIOS-provided physical RAM map: [ 0.000000] BIOS-e820: [mem 0x0000000000000000-0x0000000000000fff] reserved [ 0.000000] BIOS-e820: [mem 0x0000000000001000-0x000000000008ffff] usable [ 0.000000] BIOS-e820: [mem 0x0000000000090000-0x000000000009ffff] reserved core dump can be found under /scratch/dumps/soak-11.spirit.whamcloud.com/10.10.1.111-2020-01-27-04\:53\:21/
            ys Yang Sheng added a comment -

            From vmcore, looks like some thing really wrong:

            .....
            crash> inode ffff8da86e8e5d08 -x
            struct inode {
              i_mode = 0x41ed,
              i_opflags = 0x0,
              i_uid = {
                val = 0x21b
              },
              i_gid = {
                val = 0x21b
              },
              i_flags = 0x80,
              i_acl = 0x0,
              i_default_acl = 0xffffffffffffffff,
              i_op = 0xffffffffc1178d00 <ldiskfs_file_inode_operations>,
              i_sb = 0xffff8da5797df800,
              i_mapping = 0xffff8da86e8e5e58,
              i_security = 0x0,
              i_ino = 0x1c78042b,
              {
                i_nlink = 0x1,
                __i_nlink = 0x1
              },
              i_rdev = 0x0,
              i_size = 0x0,
              i_atime = {
                tv_sec = 0x5cd8196d,
                tv_nsec = 0x0
              },
              i_mtime = {
                tv_sec = 0x5cd8196d,
                tv_nsec = 0x0
              },
              i_ctime = {
                tv_sec = 0x5cd8196d,
                tv_nsec = 0x0
              },
              i_lock = {
                {
                  rlock = {
                    raw_lock = {
                      val = {
                        counter = 0x0
                      }
                    }
                  }
                }
              },
              i_bytes = 0x0,
              i_blkbits = 0xc,
              i_blocks = 0x0,
              i_state = 0x0,
              i_mutex = {
                count = {
                  counter = 0x1
                },
                wait_lock = {
                  {
                    rlock = {
                      raw_lock = {
                        val = {
                          counter = 0x0
                        }
                      }
                    }
                  }
                },
                wait_list = {
                  next = 0xffff8da86e8e5db8,
                  prev = 0xffff8da86e8e5db8
                },
                owner = 0x0,
                {
                  osq = {
                    tail = {
                      counter = 0x0
                    }
                  },
                  __UNIQUE_ID_rh_kabi_hide3 = {
                    spin_mlock = 0x0
                  },
                  {<No data fields>}
                }
              },
              dirtied_when = 0x0,
              i_hash = {
                next = 0x0,
                pprev = 0xffffab0282849760
              },
              i_io_list = {
                next = 0xffff8da86e8e5df0,
                prev = 0xffff8da86e8e5df0
              },
              i_lru = {
                next = 0xffff8da86e8e5e00,
                prev = 0xffff8da86e8e5e00
              },
              i_sb_list = {
                next = 0xffff8da8555f27a0,
                prev = 0xffff8da86ea655b0
              },
              {
                i_dentry = {
                  first = 0x0
                },
                i_rcu = {
                  next = 0x0,
                  func = 0x0
                }
              },
              i_version = 0x1,
              i_count = {
                counter = 0x1
              },
              i_dio_count = {
                counter = 0x0
              },
              i_writecount = {
                counter = 0x0
              },
              i_fop = 0xffffffffc1178dc0 <ldiskfs_file_operations>,
              i_flock = 0x0,
              i_data = {
                host = 0xffff8da86e8e5d08,
                page_tree = {
                  rh_reserved_height = 0x0,
                  gfp_mask = 0x20,
                  rnode = 0x0
                },
                tree_lock = {
                  {
                    rlock = {
                      raw_lock = {
                        val = {
                          counter = 0x0
                        }
                      }
                    }
                  }
                },
                {
                  i_mmap_writable = {
                    counter = 0x0
                  },
                  __UNIQUE_ID_rh_kabi_hide17 = {
                    i_mmap_writable = 0x0
                  },
                  {<No data fields>}
                },
                i_mmap = {
                  rb_node = 0x0
                },
                i_mmap_nonlinear = {
                  next = 0xffff8da86e8e5e80,
                  prev = 0xffff8da86e8e5e80
                },
                i_mmap_mutex = {
                  count = {
                    counter = 0x1
                  },
                  wait_lock = {
                    {
                      rlock = {
                        raw_lock = {
                          val = {
                            counter = 0x0
                          }
                        }
                      }
                    }
                  },
                  wait_list = {
                    next = 0xffff8da86e8e5e98,
                    prev = 0xffff8da86e8e5e98
                  },
                  owner = 0x0,
                  {
                    osq = {
                      tail = {
                        counter = 0x0
                      }
                    },
                    __UNIQUE_ID_rh_kabi_hide3 = {
                      spin_mlock = 0x0
                    },
                    {<No data fields>}
                  }
                },
                nrpages = 0x0,
                nrexceptional = 0x0,
                writeback_index = 0x0,
                a_ops = 0xffffffffc1178b20 <ldiskfs_journalled_aops>,
                flags = 0x200da,
                backing_dev_info = 0xffff8da8e3673b20,
                private_lock = {
                  {
                    rlock = {
                      raw_lock = {
                        val = {
                          counter = 0x0
                        }
                      }
                    }
                  }
                },
                private_list = {
                  next = 0xffff8da86e8e5ef0,
                  prev = 0xffff8da86e8e5ef0
                },
                private_data = 0x0
              },
              i_dquot = {0xffff8da8e5124100, 0xffff8da8e5124200, 0x0},
              i_devices = {
                next = 0xffff8da86e8e5f20,
                prev = 0xffff8da86e8e5f20
              },
              {
                i_pipe = 0x0,
                i_bdev = 0x0,
                i_cdev = 0x0
              },
              i_generation = 0x3f58114f,
              i_fsnotify_mask = 0x0,
              {
                i_fsnotify_marks = 0x0,
                __UNIQUE_ID_rh_kabi_hide18 = {
                  i_fsnotify_marks = {
                    first = 0x0
                  }
                },
                {<No data fields>}
              },
              i_readcount = {
                counter = 0x0
              },
              i_private = 0x0
            }
            ......
            

            The i_mode indicated this inode is a directory(0x4000). But the i_op & i_fop set to REG operations. Looks like some corner case causes such situation?

            Thanks,
            YangSheng

            ys Yang Sheng added a comment - From vmcore, looks like some thing really wrong: ..... crash> inode ffff8da86e8e5d08 -x struct inode { i_mode = 0x41ed, i_opflags = 0x0, i_uid = { val = 0x21b }, i_gid = { val = 0x21b }, i_flags = 0x80, i_acl = 0x0, i_default_acl = 0xffffffffffffffff, i_op = 0xffffffffc1178d00 <ldiskfs_file_inode_operations>, i_sb = 0xffff8da5797df800, i_mapping = 0xffff8da86e8e5e58, i_security = 0x0, i_ino = 0x1c78042b, { i_nlink = 0x1, __i_nlink = 0x1 }, i_rdev = 0x0, i_size = 0x0, i_atime = { tv_sec = 0x5cd8196d, tv_nsec = 0x0 }, i_mtime = { tv_sec = 0x5cd8196d, tv_nsec = 0x0 }, i_ctime = { tv_sec = 0x5cd8196d, tv_nsec = 0x0 }, i_lock = { { rlock = { raw_lock = { val = { counter = 0x0 } } } } }, i_bytes = 0x0, i_blkbits = 0xc, i_blocks = 0x0, i_state = 0x0, i_mutex = { count = { counter = 0x1 }, wait_lock = { { rlock = { raw_lock = { val = { counter = 0x0 } } } } }, wait_list = { next = 0xffff8da86e8e5db8, prev = 0xffff8da86e8e5db8 }, owner = 0x0, { osq = { tail = { counter = 0x0 } }, __UNIQUE_ID_rh_kabi_hide3 = { spin_mlock = 0x0 }, {<No data fields>} } }, dirtied_when = 0x0, i_hash = { next = 0x0, pprev = 0xffffab0282849760 }, i_io_list = { next = 0xffff8da86e8e5df0, prev = 0xffff8da86e8e5df0 }, i_lru = { next = 0xffff8da86e8e5e00, prev = 0xffff8da86e8e5e00 }, i_sb_list = { next = 0xffff8da8555f27a0, prev = 0xffff8da86ea655b0 }, { i_dentry = { first = 0x0 }, i_rcu = { next = 0x0, func = 0x0 } }, i_version = 0x1, i_count = { counter = 0x1 }, i_dio_count = { counter = 0x0 }, i_writecount = { counter = 0x0 }, i_fop = 0xffffffffc1178dc0 <ldiskfs_file_operations>, i_flock = 0x0, i_data = { host = 0xffff8da86e8e5d08, page_tree = { rh_reserved_height = 0x0, gfp_mask = 0x20, rnode = 0x0 }, tree_lock = { { rlock = { raw_lock = { val = { counter = 0x0 } } } } }, { i_mmap_writable = { counter = 0x0 }, __UNIQUE_ID_rh_kabi_hide17 = { i_mmap_writable = 0x0 }, {<No data fields>} }, i_mmap = { rb_node = 0x0 }, i_mmap_nonlinear = { next = 0xffff8da86e8e5e80, prev = 0xffff8da86e8e5e80 }, i_mmap_mutex = { count = { counter = 0x1 }, wait_lock = { { rlock = { raw_lock = { val = { counter = 0x0 } } } } }, wait_list = { next = 0xffff8da86e8e5e98, prev = 0xffff8da86e8e5e98 }, owner = 0x0, { osq = { tail = { counter = 0x0 } }, __UNIQUE_ID_rh_kabi_hide3 = { spin_mlock = 0x0 }, {<No data fields>} } }, nrpages = 0x0, nrexceptional = 0x0, writeback_index = 0x0, a_ops = 0xffffffffc1178b20 <ldiskfs_journalled_aops>, flags = 0x200da, backing_dev_info = 0xffff8da8e3673b20, private_lock = { { rlock = { raw_lock = { val = { counter = 0x0 } } } } }, private_list = { next = 0xffff8da86e8e5ef0, prev = 0xffff8da86e8e5ef0 }, private_data = 0x0 }, i_dquot = {0xffff8da8e5124100, 0xffff8da8e5124200, 0x0}, i_devices = { next = 0xffff8da86e8e5f20, prev = 0xffff8da86e8e5f20 }, { i_pipe = 0x0, i_bdev = 0x0, i_cdev = 0x0 }, i_generation = 0x3f58114f, i_fsnotify_mask = 0x0, { i_fsnotify_marks = 0x0, __UNIQUE_ID_rh_kabi_hide18 = { i_fsnotify_marks = { first = 0x0 } }, {<No data fields>} }, i_readcount = { counter = 0x0 }, i_private = 0x0 } ...... The i_mode indicated this inode is a directory(0x4000). But the i_op & i_fop set to REG operations. Looks like some corner case causes such situation? Thanks, YangSheng
            sarah Sarah Liu added a comment -

            I uploaded the vmcore again, please try if it works. Or if you can access spirit, the dumps are located under
            /scratch/dumps/soak-10.spirit.whamcloud.com/10.10.1.110-2019-05-12-13:06:18

            I was using the mlnx bulid
            https://build.whamcloud.com/job/lustre-b2_12-ib/16/arch=x86_64,build_type=server,distro=el7.6,ib_stack=mlx/

            sarah Sarah Liu added a comment - I uploaded the vmcore again, please try if it works. Or if you can access spirit, the dumps are located under /scratch/dumps/soak-10.spirit.whamcloud.com/10.10.1.110-2019-05-12-13:06:18 I was using the mlnx bulid https://build.whamcloud.com/job/lustre-b2_12-ib/16/arch=x86_64,build_type=server,distro=el7.6,ib_stack=mlx/
            ys Yang Sheng added a comment -

            Hi, Sarah,

            Which build do you test? I use

            https://build.whamcloud.com/job/lustre-b2_12/arch=x86_64,build_type=server,distro=el7.6,ib_stack=inkernel/15/artifact/artifacts/RPMS/x86_64/

            cannot decode the vmcore.

            Thanks,
            YangSheng

            ys Yang Sheng added a comment - Hi, Sarah, Which build do you test? I use https://build.whamcloud.com/job/lustre-b2_12/arch=x86_64,build_type=server,distro=el7.6,ib_stack=inkernel/15/artifact/artifacts/RPMS/x86_64/ cannot decode the vmcore. Thanks, YangSheng
            sarah Sarah Liu added a comment -

            Hi Yang Sheng,

            I uploaded the vmcore to my home dir on trevis, can you please see if you can access it. Thank you

            /home/w3liu/test_logs/lu-11457

            sarah Sarah Liu added a comment - Hi Yang Sheng, I uploaded the vmcore to my home dir on trevis, can you please see if you can access it. Thank you /home/w3liu/test_logs/lu-11457
            ys Yang Sheng added a comment -

            Hi, Sarah,

            Do you have crash dump for further analysis?

            Thanks,
            Yangsheng

            ys Yang Sheng added a comment - Hi, Sarah, Do you have crash dump for further analysis? Thanks, Yangsheng
            sarah Sarah Liu added a comment -

            Hit this again on 2.12.2-rc1 testing

            sarah Sarah Liu added a comment - Hit this again on 2.12.2-rc1 testing
            sarah Sarah Liu added a comment -

            Hit the same issue again when testing 2.12.2-rc1, 1 MDS crash after running about 34 hours

            soak-10 console log

            [10077.630267] device-mapper: multipath: Failing path 8:96.
            [10080.630817] device-mapper: multipath: Reinstating path 8:128.
            [10080.637462] device-mapper: multipath: Failing path 8:128.
            [10081.939589] INFO: task kmmpd-dm-6:12825 blocked for more than 120 seconds.
            [10081.947277] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
            [10081.956040] kmmpd-dm-6      D ffff8da0f6120000     0 12825      2 0x00000080
            [10081.964033] Call Trace:
            [10081.966797]  [<ffffffff8e366d60>] ? bit_wait+0x50/0x50
            [10081.972543]  [<ffffffff8e368c49>] schedule+0x29/0x70
            [10081.978120]  [<ffffffff8e366721>] schedule_timeout+0x221/0x2d0
            [10081.984704]  [<ffffffffc05040bd>] ? dm_old_request_fn+0x18d/0x210 [dm_mod]
            [10081.992416]  [<ffffffff8df680de>] ? deadline_add_request+0x6e/0x90
            [10081.999361]  [<ffffffff8dd01092>] ? ktime_get_ts64+0x52/0xf0
            [10082.005679]  [<ffffffff8e366d60>] ? bit_wait+0x50/0x50
            [10082.011450]  [<ffffffff8e3682ed>] io_schedule_timeout+0xad/0x130
            [10082.018164]  [<ffffffff8e368388>] io_schedule+0x18/0x20
            [10082.024022]  [<ffffffff8e366d71>] bit_wait_io+0x11/0x50
            [10082.029881]  [<ffffffff8e366897>] __wait_on_bit+0x67/0x90
            [10082.035921]  [<ffffffff8e366d60>] ? bit_wait+0x50/0x50
            [10082.041680]  [<ffffffff8e366a01>] out_of_line_wait_on_bit+0x81/0xb0
            [10082.048707]  [<ffffffff8dcc2e00>] ? wake_bit_function+0x40/0x40
            [10082.055333]  [<ffffffff8de7931a>] __wait_on_buffer+0x2a/0x30
            [10082.061693]  [<ffffffffc11429ed>] write_mmp_block+0x11d/0x160 [ldiskfs]
            [10082.069108]  [<ffffffffc1142c48>] kmmpd+0x1a8/0x430 [ldiskfs]
            [10082.075554]  [<ffffffffc1142aa0>] ? __dump_mmp_msg+0x70/0x70 [ldiskfs]
            [10082.082842]  [<ffffffff8dcc1c71>] kthread+0xd1/0xe0
            [10082.088321]  [<ffffffff8dcc1ba0>] ? insert_kthread_work+0x40/0x40
            [10082.095153]  [<ffffffff8e375c37>] ret_from_fork_nospec_begin+0x21/0x21
            [10082.102444]  [<ffffffff8dcc1ba0>] ? insert_kthread_work+0x40/0x40
            [10082.109292] INFO: task jbd2/dm-6-8:12826 blocked for more than 120 seconds.
            [10082.117092] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
            [10082.125846] jbd2/dm-6-8     D ffff8da8ea112080     0 12826      2 0x00000080
            [10082.133794] Call Trace:
            [10082.136532]  [<ffffffff8e368c49>] schedule+0x29/0x70
            [10082.142108]  [<ffffffffc047b2bc>] jbd2_journal_commit_transaction+0x23c/0x19b0 [jbd2]
            [10082.150883]  [<ffffffff8dcd9f15>] ? sched_clock_cpu+0x85/0xc0
            [10082.157323]  [<ffffffff8dc2a59e>] ? __switch_to+0xce/0x580
            [10082.163459]  [<ffffffff8dcc2d40>] ? wake_up_atomic_t+0x30/0x30
            [10082.169993]  [<ffffffff8e36878f>] ? __schedule+0x3ff/0x890
            [10082.176140]  [<ffffffff8dcaa11e>] ? try_to_del_timer_sync+0x5e/0x90
            [10082.183156]  [<ffffffffc0481e89>] kjournald2+0xc9/0x260 [jbd2]
            [10082.189686]  [<ffffffff8dcc2d40>] ? wake_up_atomic_t+0x30/0x30
            [10082.196208]  [<ffffffffc0481dc0>] ? commit_timeout+0x10/0x10 [jbd2]
            [10082.203227]  [<ffffffff8dcc1c71>] kthread+0xd1/0xe0
            [10082.208710]  [<ffffffff8dcc1ba0>] ? insert_kthread_work+0x40/0x40
            [10082.215521]  [<ffffffff8e375c37>] ret_from_fork_nospec_begin+0x21/0x21
            [10082.222812]  [<ffffffff8dcc1ba0>] ? insert_kthread_work+0x40/0x40
            [10082.229643] INFO: task mdt_out01_012:13070 blocked for more than 120 seconds.
            [10082.237624] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
            [10082.246372] mdt_out01_012   D ffff8da8d67f9040     0 13070      2 0x00000080
            [10082.254354] Call Trace:
            [10082.257115]  [<ffffffff8e366d60>] ? bit_wait+0x50/0x50
            [10082.262863]  [<ffffffff8e368c49>] schedule+0x29/0x70
            [10082.268446]  [<ffffffff8e366721>] schedule_timeout+0x221/0x2d0
            [10082.274989]  [<ffffffff8de78c3d>] ? __brelse+0x3d/0x50
            [10082.280755]  [<ffffffff8de78f2a>] ? bh_lru_install+0x18a/0x1e0
            [10082.287296]  [<ffffffff8dd01092>] ? ktime_get_ts64+0x52/0xf0
            [10082.293630]  [<ffffffff8e366d60>] ? bit_wait+0x50/0x50
            [10082.299394]  [<ffffffff8e3682ed>] io_schedule_timeout+0xad/0x130
            [10082.306106]  [<ffffffff8e368388>] io_schedule+0x18/0x20
            [10082.311963]  [<ffffffff8e366d71>] bit_wait_io+0x11/0x50
            [10082.317820]  [<ffffffff8e366921>] __wait_on_bit_lock+0x61/0xc0
            [10082.324361]  [<ffffffff8e366d60>] ? bit_wait+0x50/0x50
            [10082.330114]  [<ffffffff8e366b71>] out_of_line_wait_on_bit_lock+0x81/0xb0
            [10082.337614]  [<ffffffff8dcc2e00>] ? wake_bit_function+0x40/0x40
            [10082.344252]  [<ffffffff8de79352>] __lock_buffer+0x32/0x40
            [10082.350326]  [<ffffffffc0479ea2>] do_get_write_access+0x432/0x4d0 [jbd2]
            [10082.357831]  [<ffffffff8de7959d>] ? __getblk+0x2d/0x300
            [10082.363681]  [<ffffffffc0479f67>] jbd2_journal_get_write_access+0x27/0x40 [jbd2]
            [10082.371965]  [<ffffffffc1159dab>] __ldiskfs_journal_get_write_access+0x3b/0x80 [ldiskfs]
            [10082.381027]  [<ffffffffc112f180>] add_dirent_to_buf+0xd0/0x2c0 [ldiskfs]
            [10082.388530]  [<ffffffffc112f55e>] ldiskfs_dx_add_entry+0x1ee/0xc60 [ldiskfs]
            [10082.396428]  [<ffffffffc047a2d1>] ? jbd2_journal_dirty_metadata+0xd1/0x260 [jbd2]
            [10082.404806]  [<ffffffff8dcc2d05>] ? wake_up_bit+0x25/0x30
            [10082.410868]  [<ffffffffc1130a54>] __ldiskfs_add_entry+0x254/0x6e0 [ldiskfs]
            [10082.418671]  [<ffffffffc1132ae3>] ? ldiskfs_add_dot_dotdot+0x83/0x90 [ldiskfs]
            [10082.426755]  [<ffffffffc11e9404>] osd_ldiskfs_add_entry+0x34/0x260 [osd_ldiskfs]
            [10082.435042]  [<ffffffffc11e9c77>] __osd_ea_add_rec+0x197/0x360 [osd_ldiskfs]
            [10082.442941]  [<ffffffffc11f4ad1>] osd_index_ea_insert+0x491/0x12f0 [osd_ldiskfs]
            [10082.451206]  [<ffffffff8e367f32>] ? down_write+0x12/0x3d
            [10082.457246]  [<ffffffffc0f43350>] out_tx_index_insert_exec+0x1e0/0x490 [ptlrpc]
            [10082.465464]  [<ffffffffc0f41521>] ? out_obj_ref_add+0xc1/0x220 [ptlrpc]
            [10082.472895]  [<ffffffffc0f3c391>] out_tx_end+0xe1/0x5c0 [ptlrpc]
            [10082.479666]  [<ffffffffc0f405b2>] out_handle+0x1452/0x1bc0 [ptlrpc]
            [10082.486717]  [<ffffffffc0e87890>] ? target_send_reply_msg+0x170/0x170 [ptlrpc]
            [10082.494854]  [<ffffffffc0f351da>] tgt_request_handle+0xaea/0x1580 [ptlrpc]
            [10082.502588]  [<ffffffffc0f10c11>] ? ptlrpc_nrs_req_get_nolock0+0xd1/0x170 [ptlrpc]
            [10082.511061]  [<ffffffffc0b2abde>] ? ktime_get_real_seconds+0xe/0x10 [libcfs]
            [10082.519002]  [<ffffffffc0eda80b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
            [10082.527624]  [<ffffffffc0ed7695>] ? ptlrpc_wait_event+0xa5/0x360 [ptlrpc]
            [10082.535212]  [<ffffffff8dcd6802>] ? default_wake_function+0x12/0x20
            [10082.542246]  [<ffffffff8dccbadb>] ? __wake_up_common+0x5b/0x90
            [10082.548808]  [<ffffffffc0ede13c>] ptlrpc_main+0xafc/0x1fc0 [ptlrpc]
            [10082.555853]  [<ffffffffc0edd640>] ? ptlrpc_register_service+0xf80/0xf80 [ptlrpc]
            [10082.564140]  [<ffffffff8dcc1c71>] kthread+0xd1/0xe0
            [10082.569609]  [<ffffffff8dcc1ba0>] ? insert_kthread_work+0x40/0x40
            [10082.576435]  [<ffffffff8e375c37>] ret_from_fork_nospec_begin+0x21/0x21
            [10082.583733]  [<ffffffff8dcc1ba0>] ? insert_kthread_work+0x40/0x40
            [10082.590580] INFO: task debugfs:27640 blocked for more than 120 seconds.
            [10082.597980] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
            [10082.606728] debugfs         D ffff8da16f7c4100     0 27640  27639 0x00000080
            [10082.614668] Call Trace:
            [10082.617431]  [<ffffffff8df7d1b2>] ? radix_tree_lookup_slot+0x22/0x50
            [10082.624532]  [<ffffffff8e366d60>] ? bit_wait+0x50/0x50
            [10082.630298]  [<ffffffff8e368c49>] schedule+0x29/0x70
            [10082.635855]  [<ffffffff8e366721>] schedule_timeout+0x221/0x2d0
            [10082.637668] device-mapper: multipath: Reinstating path 8:96.
            [10082.637830] device-mapper: multipath: Failing path 8:96.
            [10082.654638]  [<ffffffff8ddb6d2d>] ? find_get_pages_tag+0xfd/0x240
            [10082.661459]  [<ffffffff8e366d60>] ? bit_wait+0x50/0x50
            [10082.667204]  [<ffffffff8e3682ed>] io_schedule_timeout+0xad/0x130
            [10082.673927]  [<ffffffff8e368388>] io_schedule+0x18/0x20
            [10082.679768]  [<ffffffff8e366d71>] bit_wait_io+0x11/0x50
            [10082.685610]  [<ffffffff8e366897>] __wait_on_bit+0x67/0x90
            [10082.691649]  [<ffffffff8ddb5881>] wait_on_page_bit+0x81/0xa0
            [10082.697995]  [<ffffffff8dcc2e00>] ? wake_bit_function+0x40/0x40
            [10082.704613]  [<ffffffff8ddb59b1>] __filemap_fdatawait_range+0x111/0x190
            [10082.712004]  [<ffffffff8ddb5a44>] filemap_fdatawait_range+0x14/0x30
            [10082.719033]  [<ffffffff8ddb7ea6>] filemap_write_and_wait_range+0x56/0x90
            [10082.726530]  [<ffffffff8de803eb>] blkdev_fsync+0x1b/0x50
            [10082.732468]  [<ffffffff8de762f7>] do_fsync+0x67/0xb0
            [10082.738041]  [<ffffffff8e375d15>] ? system_call_after_swapgs+0xa2/0x146
            [10082.745435]  [<ffffffff8de765e0>] SyS_fsync+0x10/0x20
            [10082.751082]  [<ffffffff8e375ddb>] system_call_fastpath+0x22/0x27
            [10082.757818]  [<ffffffff8e375d21>] ? system_call_after_swapgs+0xae/0x146
            [10085.638108] device-mapper: multipath: Reinstating path 8:128.
            [10085.644767] device-mapper: multipath: Failing path 8:128.
            [10087.645110] device-mapper: multipath: Reinstating path 8:96.
            [10087.651810] sd 0:0:0:44: rdac: array soak-netapp5660-1, ctlr 0, queueing MODE_SELECT command
            [10088.301259] sd 0:0:0:44: rdac: array soak-netapp5660-1, ctlr 0, MODE_SELECT completed
            [10088.676785] LustreError: 13053:0:(osd_oi.c:762:osd_oi_insert()) dm-6: the FID [0x2800cae3d:0xe3ec:0x0] is used by two objects: 30409754/1062736205 477627435/1062736207
            [10088.693877] LNet: 12726:0:(o2iblnd_cb.c:413:kiblnd_handle_rx()) PUT_NACK from 192.168.1.109@o2ib
            [10090.652279] device-mapper: multipath: Reinstating path 8:128.
            [10090.658823] device-mapper: multipath: Failing path 8:128.
            [10095.663302] device-mapper: multipath: Reinstating path 8:128.
            [10095.669968] device-mapper: multipath: Failing path 8:128.
            [10100.670909] device-mapper: multipath: Reinstating path 8:128.
            [10100.677556] device-mapper: multipath: Failing path 8:128.
            [10105.677908] device-mapper: multipath: Reinstating path 8:128.
            [10105.684607] device-mapper: multipath: Failing path 8:128.
            [10110.686120] device-mapper: multipath: Reinstating path 8:128.
            [10110.692756] device-mapper: multipath: Failing path 8:128.
            [10115.694525] device-mapper: multipath: Reinstating path 8:128.
            [10115.701149] device-mapper: multipath: Failing path 8:128.
            [10119.333857] BUG: unable to handle kernel NULL pointer dereference at           (null)
            [10119.342631] IP: [<          (null)>]           (null)
            [10119.348281] PGD 0 
            [10119.350546] Oops: 0010 [#1] SMP 
            [10119.354188] Modules linked in: osp(OE) mdd(OE) lod(OE) mdt(OE) lfsck(OE) mgc(OE) osd_ldiskfs(OE) ldiskfs(OE) lquota(OE) fid(OE) fld(OE) ko2iblnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) libcfs(OE) rpcsec_gss_krb5 nfsv4 dns_resolver nfs lockd grace fscache rdma_ucm(OE) ib_ucm(OE) rdma_cm(OE) iw_cm(OE) ib_ipoib(OE) ib_cm(OE) ib_umad(OE) mlx5_ib(OE) mlx5_core(OE) mlxfw(OE) mlx4_en(OE) sb_edac intel_powerclamp coretemp intel_rapl iosf_mbi kvm irqbypass crc32_pclmul ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper ipmi_ssif ablk_helper dm_round_robin cryptd iTCO_wdt iTCO_vendor_support ipmi_si joydev pcspkr ipmi_devintf mei_me ioatdma ipmi_msghandler i2c_i801 mei sg lpc_ich wmi dm_multipath dm_mod auth_rpcgss sunrpc ip_tables ext4 mbcache jbd2 sd_mod crc_t10dif crct10dif_generic mlx4_ib(OE) ib_uverbs(OE) ib_core(OE) mgag200 drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops mlx4_core(OE) igb ttm isci ahci devlink crct10dif_pclmul crct10dif_common libsas mpt3sas libahci crc32c_intel ptp drm pps_core raid_class libata mlx_compat(OE) dca scsi_transport_sas drm_panel_orientation_quirks i2c_algo_bit
            [10119.465397] CPU: 13 PID: 13069 Comm: mdt_out01_011 Kdump: loaded Tainted: G           OE  ------------   3.10.0-957.10.1.el7_lustre.x86_64 #1
            [10119.479569] Hardware name: Intel Corporation S2600GZ ........../S2600GZ, BIOS SE5C600.86B.01.08.0003.022620131521 02/26/2013
            [10119.492080] task: ffff8da8e85fa080 ti: ffff8da8ccfb0000 task.ti: ffff8da8ccfb0000
            [10119.500439] RIP: 0010:[<0000000000000000>]  [<          (null)>]           (null)
            [10119.508809] RSP: 0018:ffff8da8ccfb3b60  EFLAGS: 00010246
            [10119.514738] RAX: 0000000000000000 RBX: ffff8da8d64fa580 RCX: 0000000000000002
            [10119.522705] RDX: ffffffffc11e1a50 RSI: ffff8da8ccfb3b68 RDI: ffff8da8d64fa588
            [10119.530669] RBP: ffff8da8ccfb3ba0 R08: 0000000000000004 R09: 0000000000000000
            [10119.538639] R10: 0000000000000001 R11: 00000000007fffff R12: ffff8da85672a300
            [10119.546601] R13: ffff8da86e8e5d08 R14: ffff8da4b1756200 R15: ffff8da8d64fa588
            [10119.554566] FS:  0000000000000000(0000) GS:ffff8da8ed940000(0000) knlGS:0000000000000000
            [10119.563610] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
            [10119.570023] CR2: 0000000000000000 CR3: 000000004c210000 CR4: 00000000000607e0
            [10119.578001] Call Trace:
            [10119.580767]  [<ffffffffc11dfe4e>] ? osd_ldiskfs_it_fill+0xbe/0x260 [osd_ldiskfs]
            [10119.589030]  [<ffffffffc11e0027>] osd_it_ea_load+0x37/0x100 [osd_ldiskfs]
            [10119.596673]  [<ffffffffc0c987d8>] dt_index_walk+0xf8/0x430 [obdclass]
            [10119.603889]  [<ffffffffc0c98b10>] ? dt_index_walk+0x430/0x430 [obdclass]
            [10119.611386]  [<ffffffffc0c99404>] dt_index_read+0x394/0x6a0 [obdclass]
            [10119.618775]  [<ffffffffc0f32162>] tgt_obd_idx_read+0x612/0x860 [ptlrpc]
            [10119.626211]  [<ffffffffc0f351da>] tgt_request_handle+0xaea/0x1580 [ptlrpc]
            [10119.633936]  [<ffffffffc0f10c11>] ? ptlrpc_nrs_req_get_nolock0+0xd1/0x170 [ptlrpc]
            [10119.642404]  [<ffffffffc0b2abde>] ? ktime_get_real_seconds+0xe/0x10 [libcfs]
            [10119.650319]  [<ffffffffc0eda80b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
            [10119.658910]  [<ffffffffc0ed7695>] ? ptlrpc_wait_event+0xa5/0x360 [ptlrpc]
            [10119.666489]  [<ffffffff8dcd6802>] ? default_wake_function+0x12/0x20
            [10119.673492]  [<ffffffff8dccbadb>] ? __wake_up_common+0x5b/0x90
            [10119.680047]  [<ffffffffc0ede13c>] ptlrpc_main+0xafc/0x1fc0 [ptlrpc]
            [10119.687087]  [<ffffffffc0edd640>] ? ptlrpc_register_service+0xf80/0xf80 [ptlrpc]
            [10119.695351]  [<ffffffff8dcc1c71>] kthread+0xd1/0xe0
            [10119.700795]  [<ffffffff8dcc1ba0>] ? insert_kthread_work+0x40/0x40
            [10119.707625]  [<ffffffff8e375c37>] ret_from_fork_nospec_begin+0x21/0x21
            [10119.714910]  [<ffffffff8dcc1ba0>] ? insert_kthread_work+0x40/0x40
            [10119.721711] Code:  Bad RIP value.
            [10119.725449] RIP  [<          (null)>]           (null)
            [10119.731192]  RSP <ffff8da8ccfb3b60>
            [10119.735084] CR2: 0000000000000000
            [    0.000000] Initializing cgroup subsys cpuset
            [    0.000000] Initializing cgroup subsys cpu
            [    0.000000] Initializing cgroup subsys cpuacct
            
            sarah Sarah Liu added a comment - Hit the same issue again when testing 2.12.2-rc1, 1 MDS crash after running about 34 hours soak-10 console log [10077.630267] device-mapper: multipath: Failing path 8:96. [10080.630817] device-mapper: multipath: Reinstating path 8:128. [10080.637462] device-mapper: multipath: Failing path 8:128. [10081.939589] INFO: task kmmpd-dm-6:12825 blocked for more than 120 seconds. [10081.947277] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [10081.956040] kmmpd-dm-6 D ffff8da0f6120000 0 12825 2 0x00000080 [10081.964033] Call Trace: [10081.966797] [<ffffffff8e366d60>] ? bit_wait+0x50/0x50 [10081.972543] [<ffffffff8e368c49>] schedule+0x29/0x70 [10081.978120] [<ffffffff8e366721>] schedule_timeout+0x221/0x2d0 [10081.984704] [<ffffffffc05040bd>] ? dm_old_request_fn+0x18d/0x210 [dm_mod] [10081.992416] [<ffffffff8df680de>] ? deadline_add_request+0x6e/0x90 [10081.999361] [<ffffffff8dd01092>] ? ktime_get_ts64+0x52/0xf0 [10082.005679] [<ffffffff8e366d60>] ? bit_wait+0x50/0x50 [10082.011450] [<ffffffff8e3682ed>] io_schedule_timeout+0xad/0x130 [10082.018164] [<ffffffff8e368388>] io_schedule+0x18/0x20 [10082.024022] [<ffffffff8e366d71>] bit_wait_io+0x11/0x50 [10082.029881] [<ffffffff8e366897>] __wait_on_bit+0x67/0x90 [10082.035921] [<ffffffff8e366d60>] ? bit_wait+0x50/0x50 [10082.041680] [<ffffffff8e366a01>] out_of_line_wait_on_bit+0x81/0xb0 [10082.048707] [<ffffffff8dcc2e00>] ? wake_bit_function+0x40/0x40 [10082.055333] [<ffffffff8de7931a>] __wait_on_buffer+0x2a/0x30 [10082.061693] [<ffffffffc11429ed>] write_mmp_block+0x11d/0x160 [ldiskfs] [10082.069108] [<ffffffffc1142c48>] kmmpd+0x1a8/0x430 [ldiskfs] [10082.075554] [<ffffffffc1142aa0>] ? __dump_mmp_msg+0x70/0x70 [ldiskfs] [10082.082842] [<ffffffff8dcc1c71>] kthread+0xd1/0xe0 [10082.088321] [<ffffffff8dcc1ba0>] ? insert_kthread_work+0x40/0x40 [10082.095153] [<ffffffff8e375c37>] ret_from_fork_nospec_begin+0x21/0x21 [10082.102444] [<ffffffff8dcc1ba0>] ? insert_kthread_work+0x40/0x40 [10082.109292] INFO: task jbd2/dm-6-8:12826 blocked for more than 120 seconds. [10082.117092] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [10082.125846] jbd2/dm-6-8 D ffff8da8ea112080 0 12826 2 0x00000080 [10082.133794] Call Trace: [10082.136532] [<ffffffff8e368c49>] schedule+0x29/0x70 [10082.142108] [<ffffffffc047b2bc>] jbd2_journal_commit_transaction+0x23c/0x19b0 [jbd2] [10082.150883] [<ffffffff8dcd9f15>] ? sched_clock_cpu+0x85/0xc0 [10082.157323] [<ffffffff8dc2a59e>] ? __switch_to+0xce/0x580 [10082.163459] [<ffffffff8dcc2d40>] ? wake_up_atomic_t+0x30/0x30 [10082.169993] [<ffffffff8e36878f>] ? __schedule+0x3ff/0x890 [10082.176140] [<ffffffff8dcaa11e>] ? try_to_del_timer_sync+0x5e/0x90 [10082.183156] [<ffffffffc0481e89>] kjournald2+0xc9/0x260 [jbd2] [10082.189686] [<ffffffff8dcc2d40>] ? wake_up_atomic_t+0x30/0x30 [10082.196208] [<ffffffffc0481dc0>] ? commit_timeout+0x10/0x10 [jbd2] [10082.203227] [<ffffffff8dcc1c71>] kthread+0xd1/0xe0 [10082.208710] [<ffffffff8dcc1ba0>] ? insert_kthread_work+0x40/0x40 [10082.215521] [<ffffffff8e375c37>] ret_from_fork_nospec_begin+0x21/0x21 [10082.222812] [<ffffffff8dcc1ba0>] ? insert_kthread_work+0x40/0x40 [10082.229643] INFO: task mdt_out01_012:13070 blocked for more than 120 seconds. [10082.237624] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [10082.246372] mdt_out01_012 D ffff8da8d67f9040 0 13070 2 0x00000080 [10082.254354] Call Trace: [10082.257115] [<ffffffff8e366d60>] ? bit_wait+0x50/0x50 [10082.262863] [<ffffffff8e368c49>] schedule+0x29/0x70 [10082.268446] [<ffffffff8e366721>] schedule_timeout+0x221/0x2d0 [10082.274989] [<ffffffff8de78c3d>] ? __brelse+0x3d/0x50 [10082.280755] [<ffffffff8de78f2a>] ? bh_lru_install+0x18a/0x1e0 [10082.287296] [<ffffffff8dd01092>] ? ktime_get_ts64+0x52/0xf0 [10082.293630] [<ffffffff8e366d60>] ? bit_wait+0x50/0x50 [10082.299394] [<ffffffff8e3682ed>] io_schedule_timeout+0xad/0x130 [10082.306106] [<ffffffff8e368388>] io_schedule+0x18/0x20 [10082.311963] [<ffffffff8e366d71>] bit_wait_io+0x11/0x50 [10082.317820] [<ffffffff8e366921>] __wait_on_bit_lock+0x61/0xc0 [10082.324361] [<ffffffff8e366d60>] ? bit_wait+0x50/0x50 [10082.330114] [<ffffffff8e366b71>] out_of_line_wait_on_bit_lock+0x81/0xb0 [10082.337614] [<ffffffff8dcc2e00>] ? wake_bit_function+0x40/0x40 [10082.344252] [<ffffffff8de79352>] __lock_buffer+0x32/0x40 [10082.350326] [<ffffffffc0479ea2>] do_get_write_access+0x432/0x4d0 [jbd2] [10082.357831] [<ffffffff8de7959d>] ? __getblk+0x2d/0x300 [10082.363681] [<ffffffffc0479f67>] jbd2_journal_get_write_access+0x27/0x40 [jbd2] [10082.371965] [<ffffffffc1159dab>] __ldiskfs_journal_get_write_access+0x3b/0x80 [ldiskfs] [10082.381027] [<ffffffffc112f180>] add_dirent_to_buf+0xd0/0x2c0 [ldiskfs] [10082.388530] [<ffffffffc112f55e>] ldiskfs_dx_add_entry+0x1ee/0xc60 [ldiskfs] [10082.396428] [<ffffffffc047a2d1>] ? jbd2_journal_dirty_metadata+0xd1/0x260 [jbd2] [10082.404806] [<ffffffff8dcc2d05>] ? wake_up_bit+0x25/0x30 [10082.410868] [<ffffffffc1130a54>] __ldiskfs_add_entry+0x254/0x6e0 [ldiskfs] [10082.418671] [<ffffffffc1132ae3>] ? ldiskfs_add_dot_dotdot+0x83/0x90 [ldiskfs] [10082.426755] [<ffffffffc11e9404>] osd_ldiskfs_add_entry+0x34/0x260 [osd_ldiskfs] [10082.435042] [<ffffffffc11e9c77>] __osd_ea_add_rec+0x197/0x360 [osd_ldiskfs] [10082.442941] [<ffffffffc11f4ad1>] osd_index_ea_insert+0x491/0x12f0 [osd_ldiskfs] [10082.451206] [<ffffffff8e367f32>] ? down_write+0x12/0x3d [10082.457246] [<ffffffffc0f43350>] out_tx_index_insert_exec+0x1e0/0x490 [ptlrpc] [10082.465464] [<ffffffffc0f41521>] ? out_obj_ref_add+0xc1/0x220 [ptlrpc] [10082.472895] [<ffffffffc0f3c391>] out_tx_end+0xe1/0x5c0 [ptlrpc] [10082.479666] [<ffffffffc0f405b2>] out_handle+0x1452/0x1bc0 [ptlrpc] [10082.486717] [<ffffffffc0e87890>] ? target_send_reply_msg+0x170/0x170 [ptlrpc] [10082.494854] [<ffffffffc0f351da>] tgt_request_handle+0xaea/0x1580 [ptlrpc] [10082.502588] [<ffffffffc0f10c11>] ? ptlrpc_nrs_req_get_nolock0+0xd1/0x170 [ptlrpc] [10082.511061] [<ffffffffc0b2abde>] ? ktime_get_real_seconds+0xe/0x10 [libcfs] [10082.519002] [<ffffffffc0eda80b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc] [10082.527624] [<ffffffffc0ed7695>] ? ptlrpc_wait_event+0xa5/0x360 [ptlrpc] [10082.535212] [<ffffffff8dcd6802>] ? default_wake_function+0x12/0x20 [10082.542246] [<ffffffff8dccbadb>] ? __wake_up_common+0x5b/0x90 [10082.548808] [<ffffffffc0ede13c>] ptlrpc_main+0xafc/0x1fc0 [ptlrpc] [10082.555853] [<ffffffffc0edd640>] ? ptlrpc_register_service+0xf80/0xf80 [ptlrpc] [10082.564140] [<ffffffff8dcc1c71>] kthread+0xd1/0xe0 [10082.569609] [<ffffffff8dcc1ba0>] ? insert_kthread_work+0x40/0x40 [10082.576435] [<ffffffff8e375c37>] ret_from_fork_nospec_begin+0x21/0x21 [10082.583733] [<ffffffff8dcc1ba0>] ? insert_kthread_work+0x40/0x40 [10082.590580] INFO: task debugfs:27640 blocked for more than 120 seconds. [10082.597980] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [10082.606728] debugfs D ffff8da16f7c4100 0 27640 27639 0x00000080 [10082.614668] Call Trace: [10082.617431] [<ffffffff8df7d1b2>] ? radix_tree_lookup_slot+0x22/0x50 [10082.624532] [<ffffffff8e366d60>] ? bit_wait+0x50/0x50 [10082.630298] [<ffffffff8e368c49>] schedule+0x29/0x70 [10082.635855] [<ffffffff8e366721>] schedule_timeout+0x221/0x2d0 [10082.637668] device-mapper: multipath: Reinstating path 8:96. [10082.637830] device-mapper: multipath: Failing path 8:96. [10082.654638] [<ffffffff8ddb6d2d>] ? find_get_pages_tag+0xfd/0x240 [10082.661459] [<ffffffff8e366d60>] ? bit_wait+0x50/0x50 [10082.667204] [<ffffffff8e3682ed>] io_schedule_timeout+0xad/0x130 [10082.673927] [<ffffffff8e368388>] io_schedule+0x18/0x20 [10082.679768] [<ffffffff8e366d71>] bit_wait_io+0x11/0x50 [10082.685610] [<ffffffff8e366897>] __wait_on_bit+0x67/0x90 [10082.691649] [<ffffffff8ddb5881>] wait_on_page_bit+0x81/0xa0 [10082.697995] [<ffffffff8dcc2e00>] ? wake_bit_function+0x40/0x40 [10082.704613] [<ffffffff8ddb59b1>] __filemap_fdatawait_range+0x111/0x190 [10082.712004] [<ffffffff8ddb5a44>] filemap_fdatawait_range+0x14/0x30 [10082.719033] [<ffffffff8ddb7ea6>] filemap_write_and_wait_range+0x56/0x90 [10082.726530] [<ffffffff8de803eb>] blkdev_fsync+0x1b/0x50 [10082.732468] [<ffffffff8de762f7>] do_fsync+0x67/0xb0 [10082.738041] [<ffffffff8e375d15>] ? system_call_after_swapgs+0xa2/0x146 [10082.745435] [<ffffffff8de765e0>] SyS_fsync+0x10/0x20 [10082.751082] [<ffffffff8e375ddb>] system_call_fastpath+0x22/0x27 [10082.757818] [<ffffffff8e375d21>] ? system_call_after_swapgs+0xae/0x146 [10085.638108] device-mapper: multipath: Reinstating path 8:128. [10085.644767] device-mapper: multipath: Failing path 8:128. [10087.645110] device-mapper: multipath: Reinstating path 8:96. [10087.651810] sd 0:0:0:44: rdac: array soak-netapp5660-1, ctlr 0, queueing MODE_SELECT command [10088.301259] sd 0:0:0:44: rdac: array soak-netapp5660-1, ctlr 0, MODE_SELECT completed [10088.676785] LustreError: 13053:0:(osd_oi.c:762:osd_oi_insert()) dm-6: the FID [0x2800cae3d:0xe3ec:0x0] is used by two objects: 30409754/1062736205 477627435/1062736207 [10088.693877] LNet: 12726:0:(o2iblnd_cb.c:413:kiblnd_handle_rx()) PUT_NACK from 192.168.1.109@o2ib [10090.652279] device-mapper: multipath: Reinstating path 8:128. [10090.658823] device-mapper: multipath: Failing path 8:128. [10095.663302] device-mapper: multipath: Reinstating path 8:128. [10095.669968] device-mapper: multipath: Failing path 8:128. [10100.670909] device-mapper: multipath: Reinstating path 8:128. [10100.677556] device-mapper: multipath: Failing path 8:128. [10105.677908] device-mapper: multipath: Reinstating path 8:128. [10105.684607] device-mapper: multipath: Failing path 8:128. [10110.686120] device-mapper: multipath: Reinstating path 8:128. [10110.692756] device-mapper: multipath: Failing path 8:128. [10115.694525] device-mapper: multipath: Reinstating path 8:128. [10115.701149] device-mapper: multipath: Failing path 8:128. [10119.333857] BUG: unable to handle kernel NULL pointer dereference at (null) [10119.342631] IP: [< (null)>] (null) [10119.348281] PGD 0 [10119.350546] Oops: 0010 [#1] SMP [10119.354188] Modules linked in: osp(OE) mdd(OE) lod(OE) mdt(OE) lfsck(OE) mgc(OE) osd_ldiskfs(OE) ldiskfs(OE) lquota(OE) fid(OE) fld(OE) ko2iblnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) libcfs(OE) rpcsec_gss_krb5 nfsv4 dns_resolver nfs lockd grace fscache rdma_ucm(OE) ib_ucm(OE) rdma_cm(OE) iw_cm(OE) ib_ipoib(OE) ib_cm(OE) ib_umad(OE) mlx5_ib(OE) mlx5_core(OE) mlxfw(OE) mlx4_en(OE) sb_edac intel_powerclamp coretemp intel_rapl iosf_mbi kvm irqbypass crc32_pclmul ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper ipmi_ssif ablk_helper dm_round_robin cryptd iTCO_wdt iTCO_vendor_support ipmi_si joydev pcspkr ipmi_devintf mei_me ioatdma ipmi_msghandler i2c_i801 mei sg lpc_ich wmi dm_multipath dm_mod auth_rpcgss sunrpc ip_tables ext4 mbcache jbd2 sd_mod crc_t10dif crct10dif_generic mlx4_ib(OE) ib_uverbs(OE) ib_core(OE) mgag200 drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops mlx4_core(OE) igb ttm isci ahci devlink crct10dif_pclmul crct10dif_common libsas mpt3sas libahci crc32c_intel ptp drm pps_core raid_class libata mlx_compat(OE) dca scsi_transport_sas drm_panel_orientation_quirks i2c_algo_bit [10119.465397] CPU: 13 PID: 13069 Comm: mdt_out01_011 Kdump: loaded Tainted: G OE ------------ 3.10.0-957.10.1.el7_lustre.x86_64 #1 [10119.479569] Hardware name: Intel Corporation S2600GZ ........../S2600GZ, BIOS SE5C600.86B.01.08.0003.022620131521 02/26/2013 [10119.492080] task: ffff8da8e85fa080 ti: ffff8da8ccfb0000 task.ti: ffff8da8ccfb0000 [10119.500439] RIP: 0010:[<0000000000000000>] [< (null)>] (null) [10119.508809] RSP: 0018:ffff8da8ccfb3b60 EFLAGS: 00010246 [10119.514738] RAX: 0000000000000000 RBX: ffff8da8d64fa580 RCX: 0000000000000002 [10119.522705] RDX: ffffffffc11e1a50 RSI: ffff8da8ccfb3b68 RDI: ffff8da8d64fa588 [10119.530669] RBP: ffff8da8ccfb3ba0 R08: 0000000000000004 R09: 0000000000000000 [10119.538639] R10: 0000000000000001 R11: 00000000007fffff R12: ffff8da85672a300 [10119.546601] R13: ffff8da86e8e5d08 R14: ffff8da4b1756200 R15: ffff8da8d64fa588 [10119.554566] FS: 0000000000000000(0000) GS:ffff8da8ed940000(0000) knlGS:0000000000000000 [10119.563610] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [10119.570023] CR2: 0000000000000000 CR3: 000000004c210000 CR4: 00000000000607e0 [10119.578001] Call Trace: [10119.580767] [<ffffffffc11dfe4e>] ? osd_ldiskfs_it_fill+0xbe/0x260 [osd_ldiskfs] [10119.589030] [<ffffffffc11e0027>] osd_it_ea_load+0x37/0x100 [osd_ldiskfs] [10119.596673] [<ffffffffc0c987d8>] dt_index_walk+0xf8/0x430 [obdclass] [10119.603889] [<ffffffffc0c98b10>] ? dt_index_walk+0x430/0x430 [obdclass] [10119.611386] [<ffffffffc0c99404>] dt_index_read+0x394/0x6a0 [obdclass] [10119.618775] [<ffffffffc0f32162>] tgt_obd_idx_read+0x612/0x860 [ptlrpc] [10119.626211] [<ffffffffc0f351da>] tgt_request_handle+0xaea/0x1580 [ptlrpc] [10119.633936] [<ffffffffc0f10c11>] ? ptlrpc_nrs_req_get_nolock0+0xd1/0x170 [ptlrpc] [10119.642404] [<ffffffffc0b2abde>] ? ktime_get_real_seconds+0xe/0x10 [libcfs] [10119.650319] [<ffffffffc0eda80b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc] [10119.658910] [<ffffffffc0ed7695>] ? ptlrpc_wait_event+0xa5/0x360 [ptlrpc] [10119.666489] [<ffffffff8dcd6802>] ? default_wake_function+0x12/0x20 [10119.673492] [<ffffffff8dccbadb>] ? __wake_up_common+0x5b/0x90 [10119.680047] [<ffffffffc0ede13c>] ptlrpc_main+0xafc/0x1fc0 [ptlrpc] [10119.687087] [<ffffffffc0edd640>] ? ptlrpc_register_service+0xf80/0xf80 [ptlrpc] [10119.695351] [<ffffffff8dcc1c71>] kthread+0xd1/0xe0 [10119.700795] [<ffffffff8dcc1ba0>] ? insert_kthread_work+0x40/0x40 [10119.707625] [<ffffffff8e375c37>] ret_from_fork_nospec_begin+0x21/0x21 [10119.714910] [<ffffffff8dcc1ba0>] ? insert_kthread_work+0x40/0x40 [10119.721711] Code: Bad RIP value. [10119.725449] RIP [< (null)>] (null) [10119.731192] RSP <ffff8da8ccfb3b60> [10119.735084] CR2: 0000000000000000 [ 0.000000] Initializing cgroup subsys cpuset [ 0.000000] Initializing cgroup subsys cpu [ 0.000000] Initializing cgroup subsys cpuacct
            pjones Peter Jones added a comment -

            ok - thanks Sarah!

            pjones Peter Jones added a comment - ok - thanks Sarah!

            People

              ys Yang Sheng
              sarah Sarah Liu
              Votes:
              0 Vote for this issue
              Watchers:
              15 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: