Uploaded image for project: 'Lustre'
  1. Lustre
  2. LU-11457

osd_oi_insert(): the FID is used by two objects

Details

    • Bug
    • Resolution: Fixed
    • Critical
    • Lustre 2.16.0
    • Lustre 2.12.0, Lustre 2.12.2, Lustre 2.12.4, Lustre 2.12.5
    • 3
    • 9223372036854775807

    Description

      tag-2.11.55

      MDS crash

      [15650.670434] device-mapper: multipath: Failing path 8:96.^M
      [15650.765276] BUG: unable to handle kernel NULL pointer dereference at           (null)^M
      [15650.775741] IP: [<          (null)>]           (null)^M
      [15650.783081] PGD 0 ^M
      [15650.786948] Oops: 0010 [#1] SMP ^M
      [15650.792218] Modules linked in: osp(OE) mdd(OE) lod(OE) mdt(OE) lfsck(OE) mgc(OE) osd_ldiskfs(OE) ldiskfs(OE) lquota(OE) lustre(OE) lmv(OE) mdc(OE) osc(OE) lov(OE) fid(OE) fld(OE) ko2iblnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) libcfs(OE) rpcsec_gss_krb5 nfsv4 dns_resolver nfs lockd grace fscache rdma_ucm(OE) ib_ucm(OE) rdma_cm(OE) iw_cm(OE) ib_ipoib(OE) ib_cm(OE) ib_uverbs(OE) ib_umad(OE) mlx5_ib(OE) mlx5_core(OE) mlxfw(OE) mlx4_en(OE) dm_round_robin zfs(POE) zunicode(POE) zavl(POE) icp(POE) zcommon(POE) znvpair(POE) spl(OE) sb_edac intel_powerclamp coretemp intel_rapl iosf_mbi kvm irqbypass crc32_pclmul ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper ablk_helper cryptd iTCO_wdt joydev pcspkr ipmi_ssif iTCO_vendor_support sg ipmi_si ipmi_devintf shpchp ipmi_msghandler i2c_i801 mei_me ioatdma mei lpc_ich wmi dm_multipath dm_mod auth_rpcgss sunrpc ip_tables ext4 mbcache jbd2 sd_mod crc_t10dif crct10dif_generic mlx4_ib(OE) ib_core(OE) mgag200 drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm drm igb isci ahci ptp mlx4_core(OE) mpt3sas libsas libahci pps_core dca crct10dif_pclmul devlink i2c_algo_bit crct10dif_common raid_class crc32c_intel libata i2c_core mlx_compat(OE) scsi_transport_sas^M
      [15650.934649] CPU: 14 PID: 9491 Comm: mdt_rdpg01_008 Tainted: P           OE  ------------   3.10.0-862.9.1.el7_lustre.x86_64 #1^M
      [15650.952002] Hardware name: Intel Corporation S2600GZ ........../S2600GZ, BIOS SE5C600.86B.01.08.0003.022620131521 02/26/2013^M
      [15650.966961] task: ffff8be6a1253f40 ti: ffff8be68a044000 task.ti: ffff8be68a044000^M
      [15650.977791] RIP: 0010:[<0000000000000000>]  [<          (null)>]           (null)^M
      [15650.988693] RSP: 0018:ffff8be68a047b58  EFLAGS: 00010246^M
      [15650.997167] RAX: 0000000000000000 RBX: ffff8be68b820000 RCX: 0000000000000002^M
      [15651.007733] RDX: ffffffffc164c7b0 RSI: ffff8be68a047b60 RDI: ffff8be68b820008^M
      [15651.018326] RBP: ffff8be68a047b98 R08: 0000000000000004 R09: 0000000000000000^M
      [15651.028930] R10: 0000000000000001 R11: 00000000007fffff R12: ffff8be26f9fab00^M
      [15651.039547] R13: ffff8be279a448a0 R14: ffff8be68a160000 R15: ffff8be68b820008^M
      [15651.050168] FS:  0000000000000000(0000) GS:ffff8be6ad980000(0000) knlGS:0000000000000000^M
      [15651.061880] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033^M
      [15651.070980] CR2: 0000000000000000 CR3: 000000042c3b6000 CR4: 00000000000607e0^M
      [15651.081660] Call Trace:^M
      [15651.087091]  [<ffffffffc164ac3e>] ? osd_ldiskfs_it_fill+0xbe/0x260 [osd_ldiskfs]^M
      [15651.098058]  [<ffffffffc164ae17>] osd_it_ea_load+0x37/0x100 [osd_ldiskfs]^M
      [15651.108370]  [<ffffffffc188eb47>] lod_it_load+0x27/0x90 [lod]^M
      [15651.117554]  [<ffffffffc0f48808>] dt_index_walk+0xf8/0x430 [obdclass]^M
      [15651.127457]  [<ffffffffc1915080>] ? mdd_object_lock+0xe0/0xe0 [mdd]^M
      [15651.137132]  [<ffffffffc1916d9f>] mdd_readpage+0x25f/0x5a0 [mdd]^M
      [15651.146553]  [<ffffffffc1782bda>] mdt_readpage+0x63a/0x880 [mdt]^M
      [15651.155992]  [<ffffffffc11e82ca>] tgt_request_handle+0xaea/0x1580 [ptlrpc]^M
      [15651.166379]  [<ffffffffc11c02e1>] ? ptlrpc_nrs_req_get_nolock0+0xd1/0x170 [ptlrpc]^M
      [15651.177493]  [<ffffffffc0dfcbde>] ? ktime_get_real_seconds+0xe/0x10 [libcfs]^M
      [15651.188033]  [<ffffffffc118b48b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]^M
      [15651.199251]  [<ffffffffc1188315>] ? ptlrpc_wait_event+0xa5/0x360 [ptlrpc]^M
      [15651.209399]  [<ffffffff83ccf682>] ? default_wake_function+0x12/0x20^M
      [15651.218931]  [<ffffffff83cc52ab>] ? __wake_up_common+0x5b/0x90^M
      [15651.228026]  [<ffffffffc118ecc4>] ptlrpc_main+0xb14/0x1fb0 [ptlrpc]^M
      [15651.237575]  [<ffffffffc118e1b0>] ? ptlrpc_register_service+0xe90/0xe90 [ptlrpc]^M
      [15651.248365]  [<ffffffff83cbb621>] kthread+0xd1/0xe0^M
      [15651.256344]  [<ffffffff83cbb550>] ? insert_kthread_work+0x40/0x40^M
      [15651.265688]  [<ffffffff843205f7>] ret_from_fork_nospec_begin+0x21/0x21^M
      [15651.275475]  [<ffffffff83cbb550>] ? insert_kthread_work+0x40/0x40^M
      [15651.284736] Code:  Bad RIP value.^M
      [15651.290946] RIP  [<          (null)>]           (null)^M
      [15651.299236]  RSP <ffff8be68a047b58>^M
      [15651.305543] CR2: 0000000000000000^M
      [15651.315778] ---[ end trace 4ae4238c00f9aeec ]---^M
      [15651.336386] Kernel panic - not syncing: Fatal exception^M
      [15651.344613] Kernel Offset: 0x2c00000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)^M
      [15651.369289] ------------[ cut here ]------------^M
      [15651.376397] WARNING: CPU: 14 PID: 9491 at arch/x86/kernel/smp.c:127 native_smp_send_reschedule+0x65/0x70^M
      [15651.388915] Modules linked in: osp(OE) mdd(OE) lod(OE) mdt(OE) lfsck(OE) mgc(OE) osd_ldiskfs(OE) ldiskfs(OE) lquota(OE) lustre(OE) lmv(OE) mdc(OE) osc(OE) lov(OE) fid(OE) fld(OE) ko2iblnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) libcfs(OE) rpcsec_gss_krb5 nfsv4 dns_resolver nfs lockd grace fscache rdma_ucm(OE) ib_ucm(OE) rdma_cm(OE) iw_cm(OE) ib_ipoib(OE) ib_cm(OE) ib_uverbs(OE) ib_umad(OE) mlx5_ib(OE) mlx5_core(OE) mlxfw(OE) mlx4_en(OE) dm_round_robin zfs(POE) zunicode(POE) zavl(POE) icp(POE) zcommon(POE) znvpair(POE) spl(OE) sb_edac intel_powerclamp coretemp intel_rapl iosf_mbi kvm irqbypass crc32_pclmul ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper ablk_helper cryptd iTCO_wdt joydev pcspkr ipmi_ssif iTCO_vendor_support sg ipmi_si ipmi_devintf shpchp ipmi_msghandler i2c_i801 mei_me ioatdma mei lpc_ich wmi dm_multipath dm_mod auth_rpcgss sunrpc ip_tables ext4 mbcache jbd2 sd_mod crc_t10dif crct10dif_generic mlx4_ib(OE) ib_core(OE) mgag200 drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm drm igb isci ahci ptp mlx4_core(OE) mpt3sas libsas libahci pps_core dca crct10dif_pclmul devlink i2c_algo_bit crct10dif_common raid_class crc32c_intel libata i2c_core mlx_compat(OE) scsi_transport_sas^M
      [15651.529620] CPU: 14 PID: 9491 Comm: mdt_rdpg01_008 Tainted: P      D    OE  ------------   3.10.0-862.9.1.el7_lustre.x86_64 #1^M
      [15651.546472] Hardware name: Intel Corporation S2600GZ ........../S2600GZ, BIOS SE5C600.86B.01.08.0003.022620131521 02/26/2013^M
      [15651.561156] Call Trace:^M
      [15651.566023]  <IRQ>  [<ffffffff8430e84e>] dump_stack+0x19/0x1b^M
      [15651.574646]  [<ffffffff83c91e18>] __warn+0xd8/0x100^M
      [15651.582224]  [<ffffffff83c91f5d>] warn_slowpath_null+0x1d/0x20^M
      [15651.590851]  [<ffffffff83c54e95>] native_smp_send_reschedule+0x65/0x70^M
      [15651.600279]  [<ffffffff83cddf81>] trigger_load_balance+0x191/0x280^M
      [15651.609280]  [<ffffffff83ccdc0a>] scheduler_tick+0x10a/0x150^M
      [15651.617702]  [<ffffffff83d01c10>] ? tick_sched_do_timer+0x50/0x50^M
      [15651.626619]  [<ffffffff83ca4f65>] update_process_times+0x65/0x80^M
      [15651.635416]  [<ffffffff83d01a10>] tick_sched_handle+0x30/0x70^M
      [15651.643916]  [<ffffffff83d01c49>] tick_sched_timer+0x39/0x80^M
      [15651.652315]  [<ffffffff83cbf7e6>] __hrtimer_run_queues+0xd6/0x260^M
      [15651.661210]  [<ffffffff83cbfd7f>] hrtimer_interrupt+0xaf/0x1d0^M
      [15651.669814]  [<ffffffff83c5847b>] local_apic_timer_interrupt+0x3b/0x60^M
      [15651.679184]  [<ffffffff84325063>] smp_apic_timer_interrupt+0x43/0x60^M
      [15651.688352]  [<ffffffff843217b2>] apic_timer_interrupt+0x162/0x170^M
      [15651.697316]  <EOI>  [<ffffffff84308c3d>] ? panic+0x1d5/0x21f^M
      [15651.705715]  [<ffffffff84308ba1>] ? panic+0x139/0x21f^M
      [15651.713430]  [<ffffffff84318745>] oops_end+0xc5/0xe0^M
      [15651.721020]  [<ffffffff8430807e>] no_context+0x285/0x2a8^M
      [15651.728984]  [<ffffffff84308115>] __bad_area_nosemaphore+0x74/0x1d1^M
      [15651.738014]  [<ffffffff84308286>] bad_area_nosemaphore+0x14/0x16^M
      [15651.746760]  [<ffffffff8431b6e0>] __do_page_fault+0x330/0x4f0^M
      [15651.755199]  [<ffffffff8431b8d5>] do_page_fault+0x35/0x90^M
      [15651.763264]  [<ffffffff84317758>] page_fault+0x28/0x30^M
      [15651.771013]  [<ffffffffc164c7b0>] ? osd_object_alloc+0x360/0x360 [osd_ldiskfs]^M
      [15651.781105]  [<ffffffffc164ac3e>] ? osd_ldiskfs_it_fill+0xbe/0x260 [osd_ldiskfs]^M
      [15651.791402]  [<ffffffffc164ae17>] osd_it_ea_load+0x37/0x100 [osd_ldiskfs]^M
      [15651.801028]  [<ffffffffc188eb47>] lod_it_load+0x27/0x90 [lod]^M
      [15651.809517]  [<ffffffffc0f48808>] dt_index_walk+0xf8/0x430 [obdclass]^M
      [15651.818761]  [<ffffffffc1915080>] ? mdd_object_lock+0xe0/0xe0 [mdd]^M
      [15651.827808]  [<ffffffffc1916d9f>] mdd_readpage+0x25f/0x5a0 [mdd]^M
      [15651.836533]  [<ffffffffc1782bda>] mdt_readpage+0x63a/0x880 [mdt]^M
      [15651.845269]  [<ffffffffc11e82ca>] tgt_request_handle+0xaea/0x1580 [ptlrpc]^M
      [15651.854937]  [<ffffffffc11c02e1>] ? ptlrpc_nrs_req_get_nolock0+0xd1/0x170 [ptlrpc]^M
      [15651.865302]  [<ffffffffc0dfcbde>] ? ktime_get_real_seconds+0xe/0x10 [libcfs]^M
      [15651.875084]  [<ffffffffc118b48b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]^M
      [15651.885522]  [<ffffffffc1188315>] ? ptlrpc_wait_event+0xa5/0x360 [ptlrpc]^M
      [15651.894881]  [<ffffffff83ccf682>] ? default_wake_function+0x12/0x20^M
      [15651.903621]  [<ffffffff83cc52ab>] ? __wake_up_common+0x5b/0x90^M
      [15651.911880]  [<ffffffffc118ecc4>] ptlrpc_main+0xb14/0x1fb0 [ptlrpc]^M
      [15651.920585]  [<ffffffffc118e1b0>] ? ptlrpc_register_service+0xe90/0xe90 [ptlrpc]^M
      [15651.930460]  [<ffffffff83cbb621>] kthread+0xd1/0xe0^M
      [15651.937468]  [<ffffffff83cbb550>] ? insert_kthread_work+0x40/0x40^M
      [15651.945794]  [<ffffffff843205f7>] ret_from_fork_nospec_begin+0x21/0x21^M
      [15651.954564]  [<ffffffff83cbb550>] ? insert_kthread_work+0x40/0x40^M
      [15651.962806] ---[ end trace 4ae4238c00f9aeed ]---^M
      
      

      Attachments

        Issue Links

          Activity

            [LU-11457] osd_oi_insert(): the FID is used by two objects
            ys Yang Sheng added a comment -

            Hi, Sarah,

            Which build do you test? I use

            https://build.whamcloud.com/job/lustre-b2_12/arch=x86_64,build_type=server,distro=el7.6,ib_stack=inkernel/15/artifact/artifacts/RPMS/x86_64/

            cannot decode the vmcore.

            Thanks,
            YangSheng

            ys Yang Sheng added a comment - Hi, Sarah, Which build do you test? I use https://build.whamcloud.com/job/lustre-b2_12/arch=x86_64,build_type=server,distro=el7.6,ib_stack=inkernel/15/artifact/artifacts/RPMS/x86_64/ cannot decode the vmcore. Thanks, YangSheng
            sarah Sarah Liu added a comment -

            Hi Yang Sheng,

            I uploaded the vmcore to my home dir on trevis, can you please see if you can access it. Thank you

            /home/w3liu/test_logs/lu-11457

            sarah Sarah Liu added a comment - Hi Yang Sheng, I uploaded the vmcore to my home dir on trevis, can you please see if you can access it. Thank you /home/w3liu/test_logs/lu-11457
            ys Yang Sheng added a comment -

            Hi, Sarah,

            Do you have crash dump for further analysis?

            Thanks,
            Yangsheng

            ys Yang Sheng added a comment - Hi, Sarah, Do you have crash dump for further analysis? Thanks, Yangsheng
            sarah Sarah Liu added a comment -

            Hit this again on 2.12.2-rc1 testing

            sarah Sarah Liu added a comment - Hit this again on 2.12.2-rc1 testing
            sarah Sarah Liu added a comment -

            Hit the same issue again when testing 2.12.2-rc1, 1 MDS crash after running about 34 hours

            soak-10 console log

            [10077.630267] device-mapper: multipath: Failing path 8:96.
            [10080.630817] device-mapper: multipath: Reinstating path 8:128.
            [10080.637462] device-mapper: multipath: Failing path 8:128.
            [10081.939589] INFO: task kmmpd-dm-6:12825 blocked for more than 120 seconds.
            [10081.947277] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
            [10081.956040] kmmpd-dm-6      D ffff8da0f6120000     0 12825      2 0x00000080
            [10081.964033] Call Trace:
            [10081.966797]  [<ffffffff8e366d60>] ? bit_wait+0x50/0x50
            [10081.972543]  [<ffffffff8e368c49>] schedule+0x29/0x70
            [10081.978120]  [<ffffffff8e366721>] schedule_timeout+0x221/0x2d0
            [10081.984704]  [<ffffffffc05040bd>] ? dm_old_request_fn+0x18d/0x210 [dm_mod]
            [10081.992416]  [<ffffffff8df680de>] ? deadline_add_request+0x6e/0x90
            [10081.999361]  [<ffffffff8dd01092>] ? ktime_get_ts64+0x52/0xf0
            [10082.005679]  [<ffffffff8e366d60>] ? bit_wait+0x50/0x50
            [10082.011450]  [<ffffffff8e3682ed>] io_schedule_timeout+0xad/0x130
            [10082.018164]  [<ffffffff8e368388>] io_schedule+0x18/0x20
            [10082.024022]  [<ffffffff8e366d71>] bit_wait_io+0x11/0x50
            [10082.029881]  [<ffffffff8e366897>] __wait_on_bit+0x67/0x90
            [10082.035921]  [<ffffffff8e366d60>] ? bit_wait+0x50/0x50
            [10082.041680]  [<ffffffff8e366a01>] out_of_line_wait_on_bit+0x81/0xb0
            [10082.048707]  [<ffffffff8dcc2e00>] ? wake_bit_function+0x40/0x40
            [10082.055333]  [<ffffffff8de7931a>] __wait_on_buffer+0x2a/0x30
            [10082.061693]  [<ffffffffc11429ed>] write_mmp_block+0x11d/0x160 [ldiskfs]
            [10082.069108]  [<ffffffffc1142c48>] kmmpd+0x1a8/0x430 [ldiskfs]
            [10082.075554]  [<ffffffffc1142aa0>] ? __dump_mmp_msg+0x70/0x70 [ldiskfs]
            [10082.082842]  [<ffffffff8dcc1c71>] kthread+0xd1/0xe0
            [10082.088321]  [<ffffffff8dcc1ba0>] ? insert_kthread_work+0x40/0x40
            [10082.095153]  [<ffffffff8e375c37>] ret_from_fork_nospec_begin+0x21/0x21
            [10082.102444]  [<ffffffff8dcc1ba0>] ? insert_kthread_work+0x40/0x40
            [10082.109292] INFO: task jbd2/dm-6-8:12826 blocked for more than 120 seconds.
            [10082.117092] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
            [10082.125846] jbd2/dm-6-8     D ffff8da8ea112080     0 12826      2 0x00000080
            [10082.133794] Call Trace:
            [10082.136532]  [<ffffffff8e368c49>] schedule+0x29/0x70
            [10082.142108]  [<ffffffffc047b2bc>] jbd2_journal_commit_transaction+0x23c/0x19b0 [jbd2]
            [10082.150883]  [<ffffffff8dcd9f15>] ? sched_clock_cpu+0x85/0xc0
            [10082.157323]  [<ffffffff8dc2a59e>] ? __switch_to+0xce/0x580
            [10082.163459]  [<ffffffff8dcc2d40>] ? wake_up_atomic_t+0x30/0x30
            [10082.169993]  [<ffffffff8e36878f>] ? __schedule+0x3ff/0x890
            [10082.176140]  [<ffffffff8dcaa11e>] ? try_to_del_timer_sync+0x5e/0x90
            [10082.183156]  [<ffffffffc0481e89>] kjournald2+0xc9/0x260 [jbd2]
            [10082.189686]  [<ffffffff8dcc2d40>] ? wake_up_atomic_t+0x30/0x30
            [10082.196208]  [<ffffffffc0481dc0>] ? commit_timeout+0x10/0x10 [jbd2]
            [10082.203227]  [<ffffffff8dcc1c71>] kthread+0xd1/0xe0
            [10082.208710]  [<ffffffff8dcc1ba0>] ? insert_kthread_work+0x40/0x40
            [10082.215521]  [<ffffffff8e375c37>] ret_from_fork_nospec_begin+0x21/0x21
            [10082.222812]  [<ffffffff8dcc1ba0>] ? insert_kthread_work+0x40/0x40
            [10082.229643] INFO: task mdt_out01_012:13070 blocked for more than 120 seconds.
            [10082.237624] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
            [10082.246372] mdt_out01_012   D ffff8da8d67f9040     0 13070      2 0x00000080
            [10082.254354] Call Trace:
            [10082.257115]  [<ffffffff8e366d60>] ? bit_wait+0x50/0x50
            [10082.262863]  [<ffffffff8e368c49>] schedule+0x29/0x70
            [10082.268446]  [<ffffffff8e366721>] schedule_timeout+0x221/0x2d0
            [10082.274989]  [<ffffffff8de78c3d>] ? __brelse+0x3d/0x50
            [10082.280755]  [<ffffffff8de78f2a>] ? bh_lru_install+0x18a/0x1e0
            [10082.287296]  [<ffffffff8dd01092>] ? ktime_get_ts64+0x52/0xf0
            [10082.293630]  [<ffffffff8e366d60>] ? bit_wait+0x50/0x50
            [10082.299394]  [<ffffffff8e3682ed>] io_schedule_timeout+0xad/0x130
            [10082.306106]  [<ffffffff8e368388>] io_schedule+0x18/0x20
            [10082.311963]  [<ffffffff8e366d71>] bit_wait_io+0x11/0x50
            [10082.317820]  [<ffffffff8e366921>] __wait_on_bit_lock+0x61/0xc0
            [10082.324361]  [<ffffffff8e366d60>] ? bit_wait+0x50/0x50
            [10082.330114]  [<ffffffff8e366b71>] out_of_line_wait_on_bit_lock+0x81/0xb0
            [10082.337614]  [<ffffffff8dcc2e00>] ? wake_bit_function+0x40/0x40
            [10082.344252]  [<ffffffff8de79352>] __lock_buffer+0x32/0x40
            [10082.350326]  [<ffffffffc0479ea2>] do_get_write_access+0x432/0x4d0 [jbd2]
            [10082.357831]  [<ffffffff8de7959d>] ? __getblk+0x2d/0x300
            [10082.363681]  [<ffffffffc0479f67>] jbd2_journal_get_write_access+0x27/0x40 [jbd2]
            [10082.371965]  [<ffffffffc1159dab>] __ldiskfs_journal_get_write_access+0x3b/0x80 [ldiskfs]
            [10082.381027]  [<ffffffffc112f180>] add_dirent_to_buf+0xd0/0x2c0 [ldiskfs]
            [10082.388530]  [<ffffffffc112f55e>] ldiskfs_dx_add_entry+0x1ee/0xc60 [ldiskfs]
            [10082.396428]  [<ffffffffc047a2d1>] ? jbd2_journal_dirty_metadata+0xd1/0x260 [jbd2]
            [10082.404806]  [<ffffffff8dcc2d05>] ? wake_up_bit+0x25/0x30
            [10082.410868]  [<ffffffffc1130a54>] __ldiskfs_add_entry+0x254/0x6e0 [ldiskfs]
            [10082.418671]  [<ffffffffc1132ae3>] ? ldiskfs_add_dot_dotdot+0x83/0x90 [ldiskfs]
            [10082.426755]  [<ffffffffc11e9404>] osd_ldiskfs_add_entry+0x34/0x260 [osd_ldiskfs]
            [10082.435042]  [<ffffffffc11e9c77>] __osd_ea_add_rec+0x197/0x360 [osd_ldiskfs]
            [10082.442941]  [<ffffffffc11f4ad1>] osd_index_ea_insert+0x491/0x12f0 [osd_ldiskfs]
            [10082.451206]  [<ffffffff8e367f32>] ? down_write+0x12/0x3d
            [10082.457246]  [<ffffffffc0f43350>] out_tx_index_insert_exec+0x1e0/0x490 [ptlrpc]
            [10082.465464]  [<ffffffffc0f41521>] ? out_obj_ref_add+0xc1/0x220 [ptlrpc]
            [10082.472895]  [<ffffffffc0f3c391>] out_tx_end+0xe1/0x5c0 [ptlrpc]
            [10082.479666]  [<ffffffffc0f405b2>] out_handle+0x1452/0x1bc0 [ptlrpc]
            [10082.486717]  [<ffffffffc0e87890>] ? target_send_reply_msg+0x170/0x170 [ptlrpc]
            [10082.494854]  [<ffffffffc0f351da>] tgt_request_handle+0xaea/0x1580 [ptlrpc]
            [10082.502588]  [<ffffffffc0f10c11>] ? ptlrpc_nrs_req_get_nolock0+0xd1/0x170 [ptlrpc]
            [10082.511061]  [<ffffffffc0b2abde>] ? ktime_get_real_seconds+0xe/0x10 [libcfs]
            [10082.519002]  [<ffffffffc0eda80b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
            [10082.527624]  [<ffffffffc0ed7695>] ? ptlrpc_wait_event+0xa5/0x360 [ptlrpc]
            [10082.535212]  [<ffffffff8dcd6802>] ? default_wake_function+0x12/0x20
            [10082.542246]  [<ffffffff8dccbadb>] ? __wake_up_common+0x5b/0x90
            [10082.548808]  [<ffffffffc0ede13c>] ptlrpc_main+0xafc/0x1fc0 [ptlrpc]
            [10082.555853]  [<ffffffffc0edd640>] ? ptlrpc_register_service+0xf80/0xf80 [ptlrpc]
            [10082.564140]  [<ffffffff8dcc1c71>] kthread+0xd1/0xe0
            [10082.569609]  [<ffffffff8dcc1ba0>] ? insert_kthread_work+0x40/0x40
            [10082.576435]  [<ffffffff8e375c37>] ret_from_fork_nospec_begin+0x21/0x21
            [10082.583733]  [<ffffffff8dcc1ba0>] ? insert_kthread_work+0x40/0x40
            [10082.590580] INFO: task debugfs:27640 blocked for more than 120 seconds.
            [10082.597980] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
            [10082.606728] debugfs         D ffff8da16f7c4100     0 27640  27639 0x00000080
            [10082.614668] Call Trace:
            [10082.617431]  [<ffffffff8df7d1b2>] ? radix_tree_lookup_slot+0x22/0x50
            [10082.624532]  [<ffffffff8e366d60>] ? bit_wait+0x50/0x50
            [10082.630298]  [<ffffffff8e368c49>] schedule+0x29/0x70
            [10082.635855]  [<ffffffff8e366721>] schedule_timeout+0x221/0x2d0
            [10082.637668] device-mapper: multipath: Reinstating path 8:96.
            [10082.637830] device-mapper: multipath: Failing path 8:96.
            [10082.654638]  [<ffffffff8ddb6d2d>] ? find_get_pages_tag+0xfd/0x240
            [10082.661459]  [<ffffffff8e366d60>] ? bit_wait+0x50/0x50
            [10082.667204]  [<ffffffff8e3682ed>] io_schedule_timeout+0xad/0x130
            [10082.673927]  [<ffffffff8e368388>] io_schedule+0x18/0x20
            [10082.679768]  [<ffffffff8e366d71>] bit_wait_io+0x11/0x50
            [10082.685610]  [<ffffffff8e366897>] __wait_on_bit+0x67/0x90
            [10082.691649]  [<ffffffff8ddb5881>] wait_on_page_bit+0x81/0xa0
            [10082.697995]  [<ffffffff8dcc2e00>] ? wake_bit_function+0x40/0x40
            [10082.704613]  [<ffffffff8ddb59b1>] __filemap_fdatawait_range+0x111/0x190
            [10082.712004]  [<ffffffff8ddb5a44>] filemap_fdatawait_range+0x14/0x30
            [10082.719033]  [<ffffffff8ddb7ea6>] filemap_write_and_wait_range+0x56/0x90
            [10082.726530]  [<ffffffff8de803eb>] blkdev_fsync+0x1b/0x50
            [10082.732468]  [<ffffffff8de762f7>] do_fsync+0x67/0xb0
            [10082.738041]  [<ffffffff8e375d15>] ? system_call_after_swapgs+0xa2/0x146
            [10082.745435]  [<ffffffff8de765e0>] SyS_fsync+0x10/0x20
            [10082.751082]  [<ffffffff8e375ddb>] system_call_fastpath+0x22/0x27
            [10082.757818]  [<ffffffff8e375d21>] ? system_call_after_swapgs+0xae/0x146
            [10085.638108] device-mapper: multipath: Reinstating path 8:128.
            [10085.644767] device-mapper: multipath: Failing path 8:128.
            [10087.645110] device-mapper: multipath: Reinstating path 8:96.
            [10087.651810] sd 0:0:0:44: rdac: array soak-netapp5660-1, ctlr 0, queueing MODE_SELECT command
            [10088.301259] sd 0:0:0:44: rdac: array soak-netapp5660-1, ctlr 0, MODE_SELECT completed
            [10088.676785] LustreError: 13053:0:(osd_oi.c:762:osd_oi_insert()) dm-6: the FID [0x2800cae3d:0xe3ec:0x0] is used by two objects: 30409754/1062736205 477627435/1062736207
            [10088.693877] LNet: 12726:0:(o2iblnd_cb.c:413:kiblnd_handle_rx()) PUT_NACK from 192.168.1.109@o2ib
            [10090.652279] device-mapper: multipath: Reinstating path 8:128.
            [10090.658823] device-mapper: multipath: Failing path 8:128.
            [10095.663302] device-mapper: multipath: Reinstating path 8:128.
            [10095.669968] device-mapper: multipath: Failing path 8:128.
            [10100.670909] device-mapper: multipath: Reinstating path 8:128.
            [10100.677556] device-mapper: multipath: Failing path 8:128.
            [10105.677908] device-mapper: multipath: Reinstating path 8:128.
            [10105.684607] device-mapper: multipath: Failing path 8:128.
            [10110.686120] device-mapper: multipath: Reinstating path 8:128.
            [10110.692756] device-mapper: multipath: Failing path 8:128.
            [10115.694525] device-mapper: multipath: Reinstating path 8:128.
            [10115.701149] device-mapper: multipath: Failing path 8:128.
            [10119.333857] BUG: unable to handle kernel NULL pointer dereference at           (null)
            [10119.342631] IP: [<          (null)>]           (null)
            [10119.348281] PGD 0 
            [10119.350546] Oops: 0010 [#1] SMP 
            [10119.354188] Modules linked in: osp(OE) mdd(OE) lod(OE) mdt(OE) lfsck(OE) mgc(OE) osd_ldiskfs(OE) ldiskfs(OE) lquota(OE) fid(OE) fld(OE) ko2iblnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) libcfs(OE) rpcsec_gss_krb5 nfsv4 dns_resolver nfs lockd grace fscache rdma_ucm(OE) ib_ucm(OE) rdma_cm(OE) iw_cm(OE) ib_ipoib(OE) ib_cm(OE) ib_umad(OE) mlx5_ib(OE) mlx5_core(OE) mlxfw(OE) mlx4_en(OE) sb_edac intel_powerclamp coretemp intel_rapl iosf_mbi kvm irqbypass crc32_pclmul ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper ipmi_ssif ablk_helper dm_round_robin cryptd iTCO_wdt iTCO_vendor_support ipmi_si joydev pcspkr ipmi_devintf mei_me ioatdma ipmi_msghandler i2c_i801 mei sg lpc_ich wmi dm_multipath dm_mod auth_rpcgss sunrpc ip_tables ext4 mbcache jbd2 sd_mod crc_t10dif crct10dif_generic mlx4_ib(OE) ib_uverbs(OE) ib_core(OE) mgag200 drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops mlx4_core(OE) igb ttm isci ahci devlink crct10dif_pclmul crct10dif_common libsas mpt3sas libahci crc32c_intel ptp drm pps_core raid_class libata mlx_compat(OE) dca scsi_transport_sas drm_panel_orientation_quirks i2c_algo_bit
            [10119.465397] CPU: 13 PID: 13069 Comm: mdt_out01_011 Kdump: loaded Tainted: G           OE  ------------   3.10.0-957.10.1.el7_lustre.x86_64 #1
            [10119.479569] Hardware name: Intel Corporation S2600GZ ........../S2600GZ, BIOS SE5C600.86B.01.08.0003.022620131521 02/26/2013
            [10119.492080] task: ffff8da8e85fa080 ti: ffff8da8ccfb0000 task.ti: ffff8da8ccfb0000
            [10119.500439] RIP: 0010:[<0000000000000000>]  [<          (null)>]           (null)
            [10119.508809] RSP: 0018:ffff8da8ccfb3b60  EFLAGS: 00010246
            [10119.514738] RAX: 0000000000000000 RBX: ffff8da8d64fa580 RCX: 0000000000000002
            [10119.522705] RDX: ffffffffc11e1a50 RSI: ffff8da8ccfb3b68 RDI: ffff8da8d64fa588
            [10119.530669] RBP: ffff8da8ccfb3ba0 R08: 0000000000000004 R09: 0000000000000000
            [10119.538639] R10: 0000000000000001 R11: 00000000007fffff R12: ffff8da85672a300
            [10119.546601] R13: ffff8da86e8e5d08 R14: ffff8da4b1756200 R15: ffff8da8d64fa588
            [10119.554566] FS:  0000000000000000(0000) GS:ffff8da8ed940000(0000) knlGS:0000000000000000
            [10119.563610] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
            [10119.570023] CR2: 0000000000000000 CR3: 000000004c210000 CR4: 00000000000607e0
            [10119.578001] Call Trace:
            [10119.580767]  [<ffffffffc11dfe4e>] ? osd_ldiskfs_it_fill+0xbe/0x260 [osd_ldiskfs]
            [10119.589030]  [<ffffffffc11e0027>] osd_it_ea_load+0x37/0x100 [osd_ldiskfs]
            [10119.596673]  [<ffffffffc0c987d8>] dt_index_walk+0xf8/0x430 [obdclass]
            [10119.603889]  [<ffffffffc0c98b10>] ? dt_index_walk+0x430/0x430 [obdclass]
            [10119.611386]  [<ffffffffc0c99404>] dt_index_read+0x394/0x6a0 [obdclass]
            [10119.618775]  [<ffffffffc0f32162>] tgt_obd_idx_read+0x612/0x860 [ptlrpc]
            [10119.626211]  [<ffffffffc0f351da>] tgt_request_handle+0xaea/0x1580 [ptlrpc]
            [10119.633936]  [<ffffffffc0f10c11>] ? ptlrpc_nrs_req_get_nolock0+0xd1/0x170 [ptlrpc]
            [10119.642404]  [<ffffffffc0b2abde>] ? ktime_get_real_seconds+0xe/0x10 [libcfs]
            [10119.650319]  [<ffffffffc0eda80b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
            [10119.658910]  [<ffffffffc0ed7695>] ? ptlrpc_wait_event+0xa5/0x360 [ptlrpc]
            [10119.666489]  [<ffffffff8dcd6802>] ? default_wake_function+0x12/0x20
            [10119.673492]  [<ffffffff8dccbadb>] ? __wake_up_common+0x5b/0x90
            [10119.680047]  [<ffffffffc0ede13c>] ptlrpc_main+0xafc/0x1fc0 [ptlrpc]
            [10119.687087]  [<ffffffffc0edd640>] ? ptlrpc_register_service+0xf80/0xf80 [ptlrpc]
            [10119.695351]  [<ffffffff8dcc1c71>] kthread+0xd1/0xe0
            [10119.700795]  [<ffffffff8dcc1ba0>] ? insert_kthread_work+0x40/0x40
            [10119.707625]  [<ffffffff8e375c37>] ret_from_fork_nospec_begin+0x21/0x21
            [10119.714910]  [<ffffffff8dcc1ba0>] ? insert_kthread_work+0x40/0x40
            [10119.721711] Code:  Bad RIP value.
            [10119.725449] RIP  [<          (null)>]           (null)
            [10119.731192]  RSP <ffff8da8ccfb3b60>
            [10119.735084] CR2: 0000000000000000
            [    0.000000] Initializing cgroup subsys cpuset
            [    0.000000] Initializing cgroup subsys cpu
            [    0.000000] Initializing cgroup subsys cpuacct
            
            sarah Sarah Liu added a comment - Hit the same issue again when testing 2.12.2-rc1, 1 MDS crash after running about 34 hours soak-10 console log [10077.630267] device-mapper: multipath: Failing path 8:96. [10080.630817] device-mapper: multipath: Reinstating path 8:128. [10080.637462] device-mapper: multipath: Failing path 8:128. [10081.939589] INFO: task kmmpd-dm-6:12825 blocked for more than 120 seconds. [10081.947277] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [10081.956040] kmmpd-dm-6 D ffff8da0f6120000 0 12825 2 0x00000080 [10081.964033] Call Trace: [10081.966797] [<ffffffff8e366d60>] ? bit_wait+0x50/0x50 [10081.972543] [<ffffffff8e368c49>] schedule+0x29/0x70 [10081.978120] [<ffffffff8e366721>] schedule_timeout+0x221/0x2d0 [10081.984704] [<ffffffffc05040bd>] ? dm_old_request_fn+0x18d/0x210 [dm_mod] [10081.992416] [<ffffffff8df680de>] ? deadline_add_request+0x6e/0x90 [10081.999361] [<ffffffff8dd01092>] ? ktime_get_ts64+0x52/0xf0 [10082.005679] [<ffffffff8e366d60>] ? bit_wait+0x50/0x50 [10082.011450] [<ffffffff8e3682ed>] io_schedule_timeout+0xad/0x130 [10082.018164] [<ffffffff8e368388>] io_schedule+0x18/0x20 [10082.024022] [<ffffffff8e366d71>] bit_wait_io+0x11/0x50 [10082.029881] [<ffffffff8e366897>] __wait_on_bit+0x67/0x90 [10082.035921] [<ffffffff8e366d60>] ? bit_wait+0x50/0x50 [10082.041680] [<ffffffff8e366a01>] out_of_line_wait_on_bit+0x81/0xb0 [10082.048707] [<ffffffff8dcc2e00>] ? wake_bit_function+0x40/0x40 [10082.055333] [<ffffffff8de7931a>] __wait_on_buffer+0x2a/0x30 [10082.061693] [<ffffffffc11429ed>] write_mmp_block+0x11d/0x160 [ldiskfs] [10082.069108] [<ffffffffc1142c48>] kmmpd+0x1a8/0x430 [ldiskfs] [10082.075554] [<ffffffffc1142aa0>] ? __dump_mmp_msg+0x70/0x70 [ldiskfs] [10082.082842] [<ffffffff8dcc1c71>] kthread+0xd1/0xe0 [10082.088321] [<ffffffff8dcc1ba0>] ? insert_kthread_work+0x40/0x40 [10082.095153] [<ffffffff8e375c37>] ret_from_fork_nospec_begin+0x21/0x21 [10082.102444] [<ffffffff8dcc1ba0>] ? insert_kthread_work+0x40/0x40 [10082.109292] INFO: task jbd2/dm-6-8:12826 blocked for more than 120 seconds. [10082.117092] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [10082.125846] jbd2/dm-6-8 D ffff8da8ea112080 0 12826 2 0x00000080 [10082.133794] Call Trace: [10082.136532] [<ffffffff8e368c49>] schedule+0x29/0x70 [10082.142108] [<ffffffffc047b2bc>] jbd2_journal_commit_transaction+0x23c/0x19b0 [jbd2] [10082.150883] [<ffffffff8dcd9f15>] ? sched_clock_cpu+0x85/0xc0 [10082.157323] [<ffffffff8dc2a59e>] ? __switch_to+0xce/0x580 [10082.163459] [<ffffffff8dcc2d40>] ? wake_up_atomic_t+0x30/0x30 [10082.169993] [<ffffffff8e36878f>] ? __schedule+0x3ff/0x890 [10082.176140] [<ffffffff8dcaa11e>] ? try_to_del_timer_sync+0x5e/0x90 [10082.183156] [<ffffffffc0481e89>] kjournald2+0xc9/0x260 [jbd2] [10082.189686] [<ffffffff8dcc2d40>] ? wake_up_atomic_t+0x30/0x30 [10082.196208] [<ffffffffc0481dc0>] ? commit_timeout+0x10/0x10 [jbd2] [10082.203227] [<ffffffff8dcc1c71>] kthread+0xd1/0xe0 [10082.208710] [<ffffffff8dcc1ba0>] ? insert_kthread_work+0x40/0x40 [10082.215521] [<ffffffff8e375c37>] ret_from_fork_nospec_begin+0x21/0x21 [10082.222812] [<ffffffff8dcc1ba0>] ? insert_kthread_work+0x40/0x40 [10082.229643] INFO: task mdt_out01_012:13070 blocked for more than 120 seconds. [10082.237624] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [10082.246372] mdt_out01_012 D ffff8da8d67f9040 0 13070 2 0x00000080 [10082.254354] Call Trace: [10082.257115] [<ffffffff8e366d60>] ? bit_wait+0x50/0x50 [10082.262863] [<ffffffff8e368c49>] schedule+0x29/0x70 [10082.268446] [<ffffffff8e366721>] schedule_timeout+0x221/0x2d0 [10082.274989] [<ffffffff8de78c3d>] ? __brelse+0x3d/0x50 [10082.280755] [<ffffffff8de78f2a>] ? bh_lru_install+0x18a/0x1e0 [10082.287296] [<ffffffff8dd01092>] ? ktime_get_ts64+0x52/0xf0 [10082.293630] [<ffffffff8e366d60>] ? bit_wait+0x50/0x50 [10082.299394] [<ffffffff8e3682ed>] io_schedule_timeout+0xad/0x130 [10082.306106] [<ffffffff8e368388>] io_schedule+0x18/0x20 [10082.311963] [<ffffffff8e366d71>] bit_wait_io+0x11/0x50 [10082.317820] [<ffffffff8e366921>] __wait_on_bit_lock+0x61/0xc0 [10082.324361] [<ffffffff8e366d60>] ? bit_wait+0x50/0x50 [10082.330114] [<ffffffff8e366b71>] out_of_line_wait_on_bit_lock+0x81/0xb0 [10082.337614] [<ffffffff8dcc2e00>] ? wake_bit_function+0x40/0x40 [10082.344252] [<ffffffff8de79352>] __lock_buffer+0x32/0x40 [10082.350326] [<ffffffffc0479ea2>] do_get_write_access+0x432/0x4d0 [jbd2] [10082.357831] [<ffffffff8de7959d>] ? __getblk+0x2d/0x300 [10082.363681] [<ffffffffc0479f67>] jbd2_journal_get_write_access+0x27/0x40 [jbd2] [10082.371965] [<ffffffffc1159dab>] __ldiskfs_journal_get_write_access+0x3b/0x80 [ldiskfs] [10082.381027] [<ffffffffc112f180>] add_dirent_to_buf+0xd0/0x2c0 [ldiskfs] [10082.388530] [<ffffffffc112f55e>] ldiskfs_dx_add_entry+0x1ee/0xc60 [ldiskfs] [10082.396428] [<ffffffffc047a2d1>] ? jbd2_journal_dirty_metadata+0xd1/0x260 [jbd2] [10082.404806] [<ffffffff8dcc2d05>] ? wake_up_bit+0x25/0x30 [10082.410868] [<ffffffffc1130a54>] __ldiskfs_add_entry+0x254/0x6e0 [ldiskfs] [10082.418671] [<ffffffffc1132ae3>] ? ldiskfs_add_dot_dotdot+0x83/0x90 [ldiskfs] [10082.426755] [<ffffffffc11e9404>] osd_ldiskfs_add_entry+0x34/0x260 [osd_ldiskfs] [10082.435042] [<ffffffffc11e9c77>] __osd_ea_add_rec+0x197/0x360 [osd_ldiskfs] [10082.442941] [<ffffffffc11f4ad1>] osd_index_ea_insert+0x491/0x12f0 [osd_ldiskfs] [10082.451206] [<ffffffff8e367f32>] ? down_write+0x12/0x3d [10082.457246] [<ffffffffc0f43350>] out_tx_index_insert_exec+0x1e0/0x490 [ptlrpc] [10082.465464] [<ffffffffc0f41521>] ? out_obj_ref_add+0xc1/0x220 [ptlrpc] [10082.472895] [<ffffffffc0f3c391>] out_tx_end+0xe1/0x5c0 [ptlrpc] [10082.479666] [<ffffffffc0f405b2>] out_handle+0x1452/0x1bc0 [ptlrpc] [10082.486717] [<ffffffffc0e87890>] ? target_send_reply_msg+0x170/0x170 [ptlrpc] [10082.494854] [<ffffffffc0f351da>] tgt_request_handle+0xaea/0x1580 [ptlrpc] [10082.502588] [<ffffffffc0f10c11>] ? ptlrpc_nrs_req_get_nolock0+0xd1/0x170 [ptlrpc] [10082.511061] [<ffffffffc0b2abde>] ? ktime_get_real_seconds+0xe/0x10 [libcfs] [10082.519002] [<ffffffffc0eda80b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc] [10082.527624] [<ffffffffc0ed7695>] ? ptlrpc_wait_event+0xa5/0x360 [ptlrpc] [10082.535212] [<ffffffff8dcd6802>] ? default_wake_function+0x12/0x20 [10082.542246] [<ffffffff8dccbadb>] ? __wake_up_common+0x5b/0x90 [10082.548808] [<ffffffffc0ede13c>] ptlrpc_main+0xafc/0x1fc0 [ptlrpc] [10082.555853] [<ffffffffc0edd640>] ? ptlrpc_register_service+0xf80/0xf80 [ptlrpc] [10082.564140] [<ffffffff8dcc1c71>] kthread+0xd1/0xe0 [10082.569609] [<ffffffff8dcc1ba0>] ? insert_kthread_work+0x40/0x40 [10082.576435] [<ffffffff8e375c37>] ret_from_fork_nospec_begin+0x21/0x21 [10082.583733] [<ffffffff8dcc1ba0>] ? insert_kthread_work+0x40/0x40 [10082.590580] INFO: task debugfs:27640 blocked for more than 120 seconds. [10082.597980] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [10082.606728] debugfs D ffff8da16f7c4100 0 27640 27639 0x00000080 [10082.614668] Call Trace: [10082.617431] [<ffffffff8df7d1b2>] ? radix_tree_lookup_slot+0x22/0x50 [10082.624532] [<ffffffff8e366d60>] ? bit_wait+0x50/0x50 [10082.630298] [<ffffffff8e368c49>] schedule+0x29/0x70 [10082.635855] [<ffffffff8e366721>] schedule_timeout+0x221/0x2d0 [10082.637668] device-mapper: multipath: Reinstating path 8:96. [10082.637830] device-mapper: multipath: Failing path 8:96. [10082.654638] [<ffffffff8ddb6d2d>] ? find_get_pages_tag+0xfd/0x240 [10082.661459] [<ffffffff8e366d60>] ? bit_wait+0x50/0x50 [10082.667204] [<ffffffff8e3682ed>] io_schedule_timeout+0xad/0x130 [10082.673927] [<ffffffff8e368388>] io_schedule+0x18/0x20 [10082.679768] [<ffffffff8e366d71>] bit_wait_io+0x11/0x50 [10082.685610] [<ffffffff8e366897>] __wait_on_bit+0x67/0x90 [10082.691649] [<ffffffff8ddb5881>] wait_on_page_bit+0x81/0xa0 [10082.697995] [<ffffffff8dcc2e00>] ? wake_bit_function+0x40/0x40 [10082.704613] [<ffffffff8ddb59b1>] __filemap_fdatawait_range+0x111/0x190 [10082.712004] [<ffffffff8ddb5a44>] filemap_fdatawait_range+0x14/0x30 [10082.719033] [<ffffffff8ddb7ea6>] filemap_write_and_wait_range+0x56/0x90 [10082.726530] [<ffffffff8de803eb>] blkdev_fsync+0x1b/0x50 [10082.732468] [<ffffffff8de762f7>] do_fsync+0x67/0xb0 [10082.738041] [<ffffffff8e375d15>] ? system_call_after_swapgs+0xa2/0x146 [10082.745435] [<ffffffff8de765e0>] SyS_fsync+0x10/0x20 [10082.751082] [<ffffffff8e375ddb>] system_call_fastpath+0x22/0x27 [10082.757818] [<ffffffff8e375d21>] ? system_call_after_swapgs+0xae/0x146 [10085.638108] device-mapper: multipath: Reinstating path 8:128. [10085.644767] device-mapper: multipath: Failing path 8:128. [10087.645110] device-mapper: multipath: Reinstating path 8:96. [10087.651810] sd 0:0:0:44: rdac: array soak-netapp5660-1, ctlr 0, queueing MODE_SELECT command [10088.301259] sd 0:0:0:44: rdac: array soak-netapp5660-1, ctlr 0, MODE_SELECT completed [10088.676785] LustreError: 13053:0:(osd_oi.c:762:osd_oi_insert()) dm-6: the FID [0x2800cae3d:0xe3ec:0x0] is used by two objects: 30409754/1062736205 477627435/1062736207 [10088.693877] LNet: 12726:0:(o2iblnd_cb.c:413:kiblnd_handle_rx()) PUT_NACK from 192.168.1.109@o2ib [10090.652279] device-mapper: multipath: Reinstating path 8:128. [10090.658823] device-mapper: multipath: Failing path 8:128. [10095.663302] device-mapper: multipath: Reinstating path 8:128. [10095.669968] device-mapper: multipath: Failing path 8:128. [10100.670909] device-mapper: multipath: Reinstating path 8:128. [10100.677556] device-mapper: multipath: Failing path 8:128. [10105.677908] device-mapper: multipath: Reinstating path 8:128. [10105.684607] device-mapper: multipath: Failing path 8:128. [10110.686120] device-mapper: multipath: Reinstating path 8:128. [10110.692756] device-mapper: multipath: Failing path 8:128. [10115.694525] device-mapper: multipath: Reinstating path 8:128. [10115.701149] device-mapper: multipath: Failing path 8:128. [10119.333857] BUG: unable to handle kernel NULL pointer dereference at (null) [10119.342631] IP: [< (null)>] (null) [10119.348281] PGD 0 [10119.350546] Oops: 0010 [#1] SMP [10119.354188] Modules linked in: osp(OE) mdd(OE) lod(OE) mdt(OE) lfsck(OE) mgc(OE) osd_ldiskfs(OE) ldiskfs(OE) lquota(OE) fid(OE) fld(OE) ko2iblnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) libcfs(OE) rpcsec_gss_krb5 nfsv4 dns_resolver nfs lockd grace fscache rdma_ucm(OE) ib_ucm(OE) rdma_cm(OE) iw_cm(OE) ib_ipoib(OE) ib_cm(OE) ib_umad(OE) mlx5_ib(OE) mlx5_core(OE) mlxfw(OE) mlx4_en(OE) sb_edac intel_powerclamp coretemp intel_rapl iosf_mbi kvm irqbypass crc32_pclmul ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper ipmi_ssif ablk_helper dm_round_robin cryptd iTCO_wdt iTCO_vendor_support ipmi_si joydev pcspkr ipmi_devintf mei_me ioatdma ipmi_msghandler i2c_i801 mei sg lpc_ich wmi dm_multipath dm_mod auth_rpcgss sunrpc ip_tables ext4 mbcache jbd2 sd_mod crc_t10dif crct10dif_generic mlx4_ib(OE) ib_uverbs(OE) ib_core(OE) mgag200 drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops mlx4_core(OE) igb ttm isci ahci devlink crct10dif_pclmul crct10dif_common libsas mpt3sas libahci crc32c_intel ptp drm pps_core raid_class libata mlx_compat(OE) dca scsi_transport_sas drm_panel_orientation_quirks i2c_algo_bit [10119.465397] CPU: 13 PID: 13069 Comm: mdt_out01_011 Kdump: loaded Tainted: G OE ------------ 3.10.0-957.10.1.el7_lustre.x86_64 #1 [10119.479569] Hardware name: Intel Corporation S2600GZ ........../S2600GZ, BIOS SE5C600.86B.01.08.0003.022620131521 02/26/2013 [10119.492080] task: ffff8da8e85fa080 ti: ffff8da8ccfb0000 task.ti: ffff8da8ccfb0000 [10119.500439] RIP: 0010:[<0000000000000000>] [< (null)>] (null) [10119.508809] RSP: 0018:ffff8da8ccfb3b60 EFLAGS: 00010246 [10119.514738] RAX: 0000000000000000 RBX: ffff8da8d64fa580 RCX: 0000000000000002 [10119.522705] RDX: ffffffffc11e1a50 RSI: ffff8da8ccfb3b68 RDI: ffff8da8d64fa588 [10119.530669] RBP: ffff8da8ccfb3ba0 R08: 0000000000000004 R09: 0000000000000000 [10119.538639] R10: 0000000000000001 R11: 00000000007fffff R12: ffff8da85672a300 [10119.546601] R13: ffff8da86e8e5d08 R14: ffff8da4b1756200 R15: ffff8da8d64fa588 [10119.554566] FS: 0000000000000000(0000) GS:ffff8da8ed940000(0000) knlGS:0000000000000000 [10119.563610] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [10119.570023] CR2: 0000000000000000 CR3: 000000004c210000 CR4: 00000000000607e0 [10119.578001] Call Trace: [10119.580767] [<ffffffffc11dfe4e>] ? osd_ldiskfs_it_fill+0xbe/0x260 [osd_ldiskfs] [10119.589030] [<ffffffffc11e0027>] osd_it_ea_load+0x37/0x100 [osd_ldiskfs] [10119.596673] [<ffffffffc0c987d8>] dt_index_walk+0xf8/0x430 [obdclass] [10119.603889] [<ffffffffc0c98b10>] ? dt_index_walk+0x430/0x430 [obdclass] [10119.611386] [<ffffffffc0c99404>] dt_index_read+0x394/0x6a0 [obdclass] [10119.618775] [<ffffffffc0f32162>] tgt_obd_idx_read+0x612/0x860 [ptlrpc] [10119.626211] [<ffffffffc0f351da>] tgt_request_handle+0xaea/0x1580 [ptlrpc] [10119.633936] [<ffffffffc0f10c11>] ? ptlrpc_nrs_req_get_nolock0+0xd1/0x170 [ptlrpc] [10119.642404] [<ffffffffc0b2abde>] ? ktime_get_real_seconds+0xe/0x10 [libcfs] [10119.650319] [<ffffffffc0eda80b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc] [10119.658910] [<ffffffffc0ed7695>] ? ptlrpc_wait_event+0xa5/0x360 [ptlrpc] [10119.666489] [<ffffffff8dcd6802>] ? default_wake_function+0x12/0x20 [10119.673492] [<ffffffff8dccbadb>] ? __wake_up_common+0x5b/0x90 [10119.680047] [<ffffffffc0ede13c>] ptlrpc_main+0xafc/0x1fc0 [ptlrpc] [10119.687087] [<ffffffffc0edd640>] ? ptlrpc_register_service+0xf80/0xf80 [ptlrpc] [10119.695351] [<ffffffff8dcc1c71>] kthread+0xd1/0xe0 [10119.700795] [<ffffffff8dcc1ba0>] ? insert_kthread_work+0x40/0x40 [10119.707625] [<ffffffff8e375c37>] ret_from_fork_nospec_begin+0x21/0x21 [10119.714910] [<ffffffff8dcc1ba0>] ? insert_kthread_work+0x40/0x40 [10119.721711] Code: Bad RIP value. [10119.725449] RIP [< (null)>] (null) [10119.731192] RSP <ffff8da8ccfb3b60> [10119.735084] CR2: 0000000000000000 [ 0.000000] Initializing cgroup subsys cpuset [ 0.000000] Initializing cgroup subsys cpu [ 0.000000] Initializing cgroup subsys cpuacct
            pjones Peter Jones added a comment -

            ok - thanks Sarah!

            pjones Peter Jones added a comment - ok - thanks Sarah!
            sarah Sarah Liu added a comment -

            I think we can close it as cannot reproduce for now.

            sarah Sarah Liu added a comment - I think we can close it as cannot reproduce for now.
            pjones Peter Jones added a comment -

            So can we close this ticket as cannot repo? Previously this had been happening every couple of hours, right?

            pjones Peter Jones added a comment - So can we close this ticket as cannot repo? Previously this had been happening every couple of hours, right?
            sarah Sarah Liu added a comment -

            hi John, it is lustre-master-ib build 137. No, I don't see the failure recently, soak is running on lustre-lustre-ib build 142 right now

            sarah Sarah Liu added a comment - hi John, it is lustre-master-ib build 137. No, I don't see the failure recently, soak is running on lustre-lustre-ib build 142 right now
            jhammond John Hammond added a comment -

            Which build was this? Was it lustre-master-patchless #137? Is there a more recent build (recent enough that we still have the RPMs) that has this issue?

            jhammond John Hammond added a comment - Which build was this? Was it lustre-master-patchless #137? Is there a more recent build (recent enough that we still have the RPMs) that has this issue?
            ys Yang Sheng added a comment -
            [15638.646875] blk_cloned_rq_check_limits: over max size limit.
            [15638.646879] blk_cloned_rq_check_limits: over max size limit.
            [15638.646888] blk_cloned_rq_check_limits: over max size limit.
            [15638.646890] blk_cloned_rq_check_limits: over max size limit.
            [15638.646892] blk_cloned_rq_check_limits: over max size limit.
            [15638.646895] blk_cloned_rq_check_limits: over max size limit.
            [15638.646916] device-mapper: multipath: Failing path 8:96.
            

            This issue should be fixed by LU-9551. Looks like still there? But so great we can reproduce it on our cluster.

            ys Yang Sheng added a comment - [15638.646875] blk_cloned_rq_check_limits: over max size limit. [15638.646879] blk_cloned_rq_check_limits: over max size limit. [15638.646888] blk_cloned_rq_check_limits: over max size limit. [15638.646890] blk_cloned_rq_check_limits: over max size limit. [15638.646892] blk_cloned_rq_check_limits: over max size limit. [15638.646895] blk_cloned_rq_check_limits: over max size limit. [15638.646916] device-mapper: multipath: Failing path 8:96. This issue should be fixed by LU-9551 . Looks like still there? But so great we can reproduce it on our cluster.

            People

              ys Yang Sheng
              sarah Sarah Liu
              Votes:
              0 Vote for this issue
              Watchers:
              15 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: