Uploaded image for project: 'Lustre'
  1. Lustre
  2. LU-11457

osd_oi_insert(): the FID is used by two objects

Details

    • Bug
    • Resolution: Fixed
    • Critical
    • Lustre 2.16.0
    • Lustre 2.12.0, Lustre 2.12.2, Lustre 2.12.4, Lustre 2.12.5
    • 3
    • 9223372036854775807

    Description

      tag-2.11.55

      MDS crash

      [15650.670434] device-mapper: multipath: Failing path 8:96.^M
      [15650.765276] BUG: unable to handle kernel NULL pointer dereference at           (null)^M
      [15650.775741] IP: [<          (null)>]           (null)^M
      [15650.783081] PGD 0 ^M
      [15650.786948] Oops: 0010 [#1] SMP ^M
      [15650.792218] Modules linked in: osp(OE) mdd(OE) lod(OE) mdt(OE) lfsck(OE) mgc(OE) osd_ldiskfs(OE) ldiskfs(OE) lquota(OE) lustre(OE) lmv(OE) mdc(OE) osc(OE) lov(OE) fid(OE) fld(OE) ko2iblnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) libcfs(OE) rpcsec_gss_krb5 nfsv4 dns_resolver nfs lockd grace fscache rdma_ucm(OE) ib_ucm(OE) rdma_cm(OE) iw_cm(OE) ib_ipoib(OE) ib_cm(OE) ib_uverbs(OE) ib_umad(OE) mlx5_ib(OE) mlx5_core(OE) mlxfw(OE) mlx4_en(OE) dm_round_robin zfs(POE) zunicode(POE) zavl(POE) icp(POE) zcommon(POE) znvpair(POE) spl(OE) sb_edac intel_powerclamp coretemp intel_rapl iosf_mbi kvm irqbypass crc32_pclmul ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper ablk_helper cryptd iTCO_wdt joydev pcspkr ipmi_ssif iTCO_vendor_support sg ipmi_si ipmi_devintf shpchp ipmi_msghandler i2c_i801 mei_me ioatdma mei lpc_ich wmi dm_multipath dm_mod auth_rpcgss sunrpc ip_tables ext4 mbcache jbd2 sd_mod crc_t10dif crct10dif_generic mlx4_ib(OE) ib_core(OE) mgag200 drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm drm igb isci ahci ptp mlx4_core(OE) mpt3sas libsas libahci pps_core dca crct10dif_pclmul devlink i2c_algo_bit crct10dif_common raid_class crc32c_intel libata i2c_core mlx_compat(OE) scsi_transport_sas^M
      [15650.934649] CPU: 14 PID: 9491 Comm: mdt_rdpg01_008 Tainted: P           OE  ------------   3.10.0-862.9.1.el7_lustre.x86_64 #1^M
      [15650.952002] Hardware name: Intel Corporation S2600GZ ........../S2600GZ, BIOS SE5C600.86B.01.08.0003.022620131521 02/26/2013^M
      [15650.966961] task: ffff8be6a1253f40 ti: ffff8be68a044000 task.ti: ffff8be68a044000^M
      [15650.977791] RIP: 0010:[<0000000000000000>]  [<          (null)>]           (null)^M
      [15650.988693] RSP: 0018:ffff8be68a047b58  EFLAGS: 00010246^M
      [15650.997167] RAX: 0000000000000000 RBX: ffff8be68b820000 RCX: 0000000000000002^M
      [15651.007733] RDX: ffffffffc164c7b0 RSI: ffff8be68a047b60 RDI: ffff8be68b820008^M
      [15651.018326] RBP: ffff8be68a047b98 R08: 0000000000000004 R09: 0000000000000000^M
      [15651.028930] R10: 0000000000000001 R11: 00000000007fffff R12: ffff8be26f9fab00^M
      [15651.039547] R13: ffff8be279a448a0 R14: ffff8be68a160000 R15: ffff8be68b820008^M
      [15651.050168] FS:  0000000000000000(0000) GS:ffff8be6ad980000(0000) knlGS:0000000000000000^M
      [15651.061880] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033^M
      [15651.070980] CR2: 0000000000000000 CR3: 000000042c3b6000 CR4: 00000000000607e0^M
      [15651.081660] Call Trace:^M
      [15651.087091]  [<ffffffffc164ac3e>] ? osd_ldiskfs_it_fill+0xbe/0x260 [osd_ldiskfs]^M
      [15651.098058]  [<ffffffffc164ae17>] osd_it_ea_load+0x37/0x100 [osd_ldiskfs]^M
      [15651.108370]  [<ffffffffc188eb47>] lod_it_load+0x27/0x90 [lod]^M
      [15651.117554]  [<ffffffffc0f48808>] dt_index_walk+0xf8/0x430 [obdclass]^M
      [15651.127457]  [<ffffffffc1915080>] ? mdd_object_lock+0xe0/0xe0 [mdd]^M
      [15651.137132]  [<ffffffffc1916d9f>] mdd_readpage+0x25f/0x5a0 [mdd]^M
      [15651.146553]  [<ffffffffc1782bda>] mdt_readpage+0x63a/0x880 [mdt]^M
      [15651.155992]  [<ffffffffc11e82ca>] tgt_request_handle+0xaea/0x1580 [ptlrpc]^M
      [15651.166379]  [<ffffffffc11c02e1>] ? ptlrpc_nrs_req_get_nolock0+0xd1/0x170 [ptlrpc]^M
      [15651.177493]  [<ffffffffc0dfcbde>] ? ktime_get_real_seconds+0xe/0x10 [libcfs]^M
      [15651.188033]  [<ffffffffc118b48b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]^M
      [15651.199251]  [<ffffffffc1188315>] ? ptlrpc_wait_event+0xa5/0x360 [ptlrpc]^M
      [15651.209399]  [<ffffffff83ccf682>] ? default_wake_function+0x12/0x20^M
      [15651.218931]  [<ffffffff83cc52ab>] ? __wake_up_common+0x5b/0x90^M
      [15651.228026]  [<ffffffffc118ecc4>] ptlrpc_main+0xb14/0x1fb0 [ptlrpc]^M
      [15651.237575]  [<ffffffffc118e1b0>] ? ptlrpc_register_service+0xe90/0xe90 [ptlrpc]^M
      [15651.248365]  [<ffffffff83cbb621>] kthread+0xd1/0xe0^M
      [15651.256344]  [<ffffffff83cbb550>] ? insert_kthread_work+0x40/0x40^M
      [15651.265688]  [<ffffffff843205f7>] ret_from_fork_nospec_begin+0x21/0x21^M
      [15651.275475]  [<ffffffff83cbb550>] ? insert_kthread_work+0x40/0x40^M
      [15651.284736] Code:  Bad RIP value.^M
      [15651.290946] RIP  [<          (null)>]           (null)^M
      [15651.299236]  RSP <ffff8be68a047b58>^M
      [15651.305543] CR2: 0000000000000000^M
      [15651.315778] ---[ end trace 4ae4238c00f9aeec ]---^M
      [15651.336386] Kernel panic - not syncing: Fatal exception^M
      [15651.344613] Kernel Offset: 0x2c00000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)^M
      [15651.369289] ------------[ cut here ]------------^M
      [15651.376397] WARNING: CPU: 14 PID: 9491 at arch/x86/kernel/smp.c:127 native_smp_send_reschedule+0x65/0x70^M
      [15651.388915] Modules linked in: osp(OE) mdd(OE) lod(OE) mdt(OE) lfsck(OE) mgc(OE) osd_ldiskfs(OE) ldiskfs(OE) lquota(OE) lustre(OE) lmv(OE) mdc(OE) osc(OE) lov(OE) fid(OE) fld(OE) ko2iblnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) libcfs(OE) rpcsec_gss_krb5 nfsv4 dns_resolver nfs lockd grace fscache rdma_ucm(OE) ib_ucm(OE) rdma_cm(OE) iw_cm(OE) ib_ipoib(OE) ib_cm(OE) ib_uverbs(OE) ib_umad(OE) mlx5_ib(OE) mlx5_core(OE) mlxfw(OE) mlx4_en(OE) dm_round_robin zfs(POE) zunicode(POE) zavl(POE) icp(POE) zcommon(POE) znvpair(POE) spl(OE) sb_edac intel_powerclamp coretemp intel_rapl iosf_mbi kvm irqbypass crc32_pclmul ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper ablk_helper cryptd iTCO_wdt joydev pcspkr ipmi_ssif iTCO_vendor_support sg ipmi_si ipmi_devintf shpchp ipmi_msghandler i2c_i801 mei_me ioatdma mei lpc_ich wmi dm_multipath dm_mod auth_rpcgss sunrpc ip_tables ext4 mbcache jbd2 sd_mod crc_t10dif crct10dif_generic mlx4_ib(OE) ib_core(OE) mgag200 drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm drm igb isci ahci ptp mlx4_core(OE) mpt3sas libsas libahci pps_core dca crct10dif_pclmul devlink i2c_algo_bit crct10dif_common raid_class crc32c_intel libata i2c_core mlx_compat(OE) scsi_transport_sas^M
      [15651.529620] CPU: 14 PID: 9491 Comm: mdt_rdpg01_008 Tainted: P      D    OE  ------------   3.10.0-862.9.1.el7_lustre.x86_64 #1^M
      [15651.546472] Hardware name: Intel Corporation S2600GZ ........../S2600GZ, BIOS SE5C600.86B.01.08.0003.022620131521 02/26/2013^M
      [15651.561156] Call Trace:^M
      [15651.566023]  <IRQ>  [<ffffffff8430e84e>] dump_stack+0x19/0x1b^M
      [15651.574646]  [<ffffffff83c91e18>] __warn+0xd8/0x100^M
      [15651.582224]  [<ffffffff83c91f5d>] warn_slowpath_null+0x1d/0x20^M
      [15651.590851]  [<ffffffff83c54e95>] native_smp_send_reschedule+0x65/0x70^M
      [15651.600279]  [<ffffffff83cddf81>] trigger_load_balance+0x191/0x280^M
      [15651.609280]  [<ffffffff83ccdc0a>] scheduler_tick+0x10a/0x150^M
      [15651.617702]  [<ffffffff83d01c10>] ? tick_sched_do_timer+0x50/0x50^M
      [15651.626619]  [<ffffffff83ca4f65>] update_process_times+0x65/0x80^M
      [15651.635416]  [<ffffffff83d01a10>] tick_sched_handle+0x30/0x70^M
      [15651.643916]  [<ffffffff83d01c49>] tick_sched_timer+0x39/0x80^M
      [15651.652315]  [<ffffffff83cbf7e6>] __hrtimer_run_queues+0xd6/0x260^M
      [15651.661210]  [<ffffffff83cbfd7f>] hrtimer_interrupt+0xaf/0x1d0^M
      [15651.669814]  [<ffffffff83c5847b>] local_apic_timer_interrupt+0x3b/0x60^M
      [15651.679184]  [<ffffffff84325063>] smp_apic_timer_interrupt+0x43/0x60^M
      [15651.688352]  [<ffffffff843217b2>] apic_timer_interrupt+0x162/0x170^M
      [15651.697316]  <EOI>  [<ffffffff84308c3d>] ? panic+0x1d5/0x21f^M
      [15651.705715]  [<ffffffff84308ba1>] ? panic+0x139/0x21f^M
      [15651.713430]  [<ffffffff84318745>] oops_end+0xc5/0xe0^M
      [15651.721020]  [<ffffffff8430807e>] no_context+0x285/0x2a8^M
      [15651.728984]  [<ffffffff84308115>] __bad_area_nosemaphore+0x74/0x1d1^M
      [15651.738014]  [<ffffffff84308286>] bad_area_nosemaphore+0x14/0x16^M
      [15651.746760]  [<ffffffff8431b6e0>] __do_page_fault+0x330/0x4f0^M
      [15651.755199]  [<ffffffff8431b8d5>] do_page_fault+0x35/0x90^M
      [15651.763264]  [<ffffffff84317758>] page_fault+0x28/0x30^M
      [15651.771013]  [<ffffffffc164c7b0>] ? osd_object_alloc+0x360/0x360 [osd_ldiskfs]^M
      [15651.781105]  [<ffffffffc164ac3e>] ? osd_ldiskfs_it_fill+0xbe/0x260 [osd_ldiskfs]^M
      [15651.791402]  [<ffffffffc164ae17>] osd_it_ea_load+0x37/0x100 [osd_ldiskfs]^M
      [15651.801028]  [<ffffffffc188eb47>] lod_it_load+0x27/0x90 [lod]^M
      [15651.809517]  [<ffffffffc0f48808>] dt_index_walk+0xf8/0x430 [obdclass]^M
      [15651.818761]  [<ffffffffc1915080>] ? mdd_object_lock+0xe0/0xe0 [mdd]^M
      [15651.827808]  [<ffffffffc1916d9f>] mdd_readpage+0x25f/0x5a0 [mdd]^M
      [15651.836533]  [<ffffffffc1782bda>] mdt_readpage+0x63a/0x880 [mdt]^M
      [15651.845269]  [<ffffffffc11e82ca>] tgt_request_handle+0xaea/0x1580 [ptlrpc]^M
      [15651.854937]  [<ffffffffc11c02e1>] ? ptlrpc_nrs_req_get_nolock0+0xd1/0x170 [ptlrpc]^M
      [15651.865302]  [<ffffffffc0dfcbde>] ? ktime_get_real_seconds+0xe/0x10 [libcfs]^M
      [15651.875084]  [<ffffffffc118b48b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]^M
      [15651.885522]  [<ffffffffc1188315>] ? ptlrpc_wait_event+0xa5/0x360 [ptlrpc]^M
      [15651.894881]  [<ffffffff83ccf682>] ? default_wake_function+0x12/0x20^M
      [15651.903621]  [<ffffffff83cc52ab>] ? __wake_up_common+0x5b/0x90^M
      [15651.911880]  [<ffffffffc118ecc4>] ptlrpc_main+0xb14/0x1fb0 [ptlrpc]^M
      [15651.920585]  [<ffffffffc118e1b0>] ? ptlrpc_register_service+0xe90/0xe90 [ptlrpc]^M
      [15651.930460]  [<ffffffff83cbb621>] kthread+0xd1/0xe0^M
      [15651.937468]  [<ffffffff83cbb550>] ? insert_kthread_work+0x40/0x40^M
      [15651.945794]  [<ffffffff843205f7>] ret_from_fork_nospec_begin+0x21/0x21^M
      [15651.954564]  [<ffffffff83cbb550>] ? insert_kthread_work+0x40/0x40^M
      [15651.962806] ---[ end trace 4ae4238c00f9aeed ]---^M
      
      

      Attachments

        Issue Links

          Activity

            [LU-11457] osd_oi_insert(): the FID is used by two objects
            sarah Sarah Liu added a comment -

            Hit this again on 2.12.2-rc1 testing

            sarah Sarah Liu added a comment - Hit this again on 2.12.2-rc1 testing
            sarah Sarah Liu added a comment -

            Hit the same issue again when testing 2.12.2-rc1, 1 MDS crash after running about 34 hours

            soak-10 console log

            [10077.630267] device-mapper: multipath: Failing path 8:96.
            [10080.630817] device-mapper: multipath: Reinstating path 8:128.
            [10080.637462] device-mapper: multipath: Failing path 8:128.
            [10081.939589] INFO: task kmmpd-dm-6:12825 blocked for more than 120 seconds.
            [10081.947277] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
            [10081.956040] kmmpd-dm-6      D ffff8da0f6120000     0 12825      2 0x00000080
            [10081.964033] Call Trace:
            [10081.966797]  [<ffffffff8e366d60>] ? bit_wait+0x50/0x50
            [10081.972543]  [<ffffffff8e368c49>] schedule+0x29/0x70
            [10081.978120]  [<ffffffff8e366721>] schedule_timeout+0x221/0x2d0
            [10081.984704]  [<ffffffffc05040bd>] ? dm_old_request_fn+0x18d/0x210 [dm_mod]
            [10081.992416]  [<ffffffff8df680de>] ? deadline_add_request+0x6e/0x90
            [10081.999361]  [<ffffffff8dd01092>] ? ktime_get_ts64+0x52/0xf0
            [10082.005679]  [<ffffffff8e366d60>] ? bit_wait+0x50/0x50
            [10082.011450]  [<ffffffff8e3682ed>] io_schedule_timeout+0xad/0x130
            [10082.018164]  [<ffffffff8e368388>] io_schedule+0x18/0x20
            [10082.024022]  [<ffffffff8e366d71>] bit_wait_io+0x11/0x50
            [10082.029881]  [<ffffffff8e366897>] __wait_on_bit+0x67/0x90
            [10082.035921]  [<ffffffff8e366d60>] ? bit_wait+0x50/0x50
            [10082.041680]  [<ffffffff8e366a01>] out_of_line_wait_on_bit+0x81/0xb0
            [10082.048707]  [<ffffffff8dcc2e00>] ? wake_bit_function+0x40/0x40
            [10082.055333]  [<ffffffff8de7931a>] __wait_on_buffer+0x2a/0x30
            [10082.061693]  [<ffffffffc11429ed>] write_mmp_block+0x11d/0x160 [ldiskfs]
            [10082.069108]  [<ffffffffc1142c48>] kmmpd+0x1a8/0x430 [ldiskfs]
            [10082.075554]  [<ffffffffc1142aa0>] ? __dump_mmp_msg+0x70/0x70 [ldiskfs]
            [10082.082842]  [<ffffffff8dcc1c71>] kthread+0xd1/0xe0
            [10082.088321]  [<ffffffff8dcc1ba0>] ? insert_kthread_work+0x40/0x40
            [10082.095153]  [<ffffffff8e375c37>] ret_from_fork_nospec_begin+0x21/0x21
            [10082.102444]  [<ffffffff8dcc1ba0>] ? insert_kthread_work+0x40/0x40
            [10082.109292] INFO: task jbd2/dm-6-8:12826 blocked for more than 120 seconds.
            [10082.117092] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
            [10082.125846] jbd2/dm-6-8     D ffff8da8ea112080     0 12826      2 0x00000080
            [10082.133794] Call Trace:
            [10082.136532]  [<ffffffff8e368c49>] schedule+0x29/0x70
            [10082.142108]  [<ffffffffc047b2bc>] jbd2_journal_commit_transaction+0x23c/0x19b0 [jbd2]
            [10082.150883]  [<ffffffff8dcd9f15>] ? sched_clock_cpu+0x85/0xc0
            [10082.157323]  [<ffffffff8dc2a59e>] ? __switch_to+0xce/0x580
            [10082.163459]  [<ffffffff8dcc2d40>] ? wake_up_atomic_t+0x30/0x30
            [10082.169993]  [<ffffffff8e36878f>] ? __schedule+0x3ff/0x890
            [10082.176140]  [<ffffffff8dcaa11e>] ? try_to_del_timer_sync+0x5e/0x90
            [10082.183156]  [<ffffffffc0481e89>] kjournald2+0xc9/0x260 [jbd2]
            [10082.189686]  [<ffffffff8dcc2d40>] ? wake_up_atomic_t+0x30/0x30
            [10082.196208]  [<ffffffffc0481dc0>] ? commit_timeout+0x10/0x10 [jbd2]
            [10082.203227]  [<ffffffff8dcc1c71>] kthread+0xd1/0xe0
            [10082.208710]  [<ffffffff8dcc1ba0>] ? insert_kthread_work+0x40/0x40
            [10082.215521]  [<ffffffff8e375c37>] ret_from_fork_nospec_begin+0x21/0x21
            [10082.222812]  [<ffffffff8dcc1ba0>] ? insert_kthread_work+0x40/0x40
            [10082.229643] INFO: task mdt_out01_012:13070 blocked for more than 120 seconds.
            [10082.237624] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
            [10082.246372] mdt_out01_012   D ffff8da8d67f9040     0 13070      2 0x00000080
            [10082.254354] Call Trace:
            [10082.257115]  [<ffffffff8e366d60>] ? bit_wait+0x50/0x50
            [10082.262863]  [<ffffffff8e368c49>] schedule+0x29/0x70
            [10082.268446]  [<ffffffff8e366721>] schedule_timeout+0x221/0x2d0
            [10082.274989]  [<ffffffff8de78c3d>] ? __brelse+0x3d/0x50
            [10082.280755]  [<ffffffff8de78f2a>] ? bh_lru_install+0x18a/0x1e0
            [10082.287296]  [<ffffffff8dd01092>] ? ktime_get_ts64+0x52/0xf0
            [10082.293630]  [<ffffffff8e366d60>] ? bit_wait+0x50/0x50
            [10082.299394]  [<ffffffff8e3682ed>] io_schedule_timeout+0xad/0x130
            [10082.306106]  [<ffffffff8e368388>] io_schedule+0x18/0x20
            [10082.311963]  [<ffffffff8e366d71>] bit_wait_io+0x11/0x50
            [10082.317820]  [<ffffffff8e366921>] __wait_on_bit_lock+0x61/0xc0
            [10082.324361]  [<ffffffff8e366d60>] ? bit_wait+0x50/0x50
            [10082.330114]  [<ffffffff8e366b71>] out_of_line_wait_on_bit_lock+0x81/0xb0
            [10082.337614]  [<ffffffff8dcc2e00>] ? wake_bit_function+0x40/0x40
            [10082.344252]  [<ffffffff8de79352>] __lock_buffer+0x32/0x40
            [10082.350326]  [<ffffffffc0479ea2>] do_get_write_access+0x432/0x4d0 [jbd2]
            [10082.357831]  [<ffffffff8de7959d>] ? __getblk+0x2d/0x300
            [10082.363681]  [<ffffffffc0479f67>] jbd2_journal_get_write_access+0x27/0x40 [jbd2]
            [10082.371965]  [<ffffffffc1159dab>] __ldiskfs_journal_get_write_access+0x3b/0x80 [ldiskfs]
            [10082.381027]  [<ffffffffc112f180>] add_dirent_to_buf+0xd0/0x2c0 [ldiskfs]
            [10082.388530]  [<ffffffffc112f55e>] ldiskfs_dx_add_entry+0x1ee/0xc60 [ldiskfs]
            [10082.396428]  [<ffffffffc047a2d1>] ? jbd2_journal_dirty_metadata+0xd1/0x260 [jbd2]
            [10082.404806]  [<ffffffff8dcc2d05>] ? wake_up_bit+0x25/0x30
            [10082.410868]  [<ffffffffc1130a54>] __ldiskfs_add_entry+0x254/0x6e0 [ldiskfs]
            [10082.418671]  [<ffffffffc1132ae3>] ? ldiskfs_add_dot_dotdot+0x83/0x90 [ldiskfs]
            [10082.426755]  [<ffffffffc11e9404>] osd_ldiskfs_add_entry+0x34/0x260 [osd_ldiskfs]
            [10082.435042]  [<ffffffffc11e9c77>] __osd_ea_add_rec+0x197/0x360 [osd_ldiskfs]
            [10082.442941]  [<ffffffffc11f4ad1>] osd_index_ea_insert+0x491/0x12f0 [osd_ldiskfs]
            [10082.451206]  [<ffffffff8e367f32>] ? down_write+0x12/0x3d
            [10082.457246]  [<ffffffffc0f43350>] out_tx_index_insert_exec+0x1e0/0x490 [ptlrpc]
            [10082.465464]  [<ffffffffc0f41521>] ? out_obj_ref_add+0xc1/0x220 [ptlrpc]
            [10082.472895]  [<ffffffffc0f3c391>] out_tx_end+0xe1/0x5c0 [ptlrpc]
            [10082.479666]  [<ffffffffc0f405b2>] out_handle+0x1452/0x1bc0 [ptlrpc]
            [10082.486717]  [<ffffffffc0e87890>] ? target_send_reply_msg+0x170/0x170 [ptlrpc]
            [10082.494854]  [<ffffffffc0f351da>] tgt_request_handle+0xaea/0x1580 [ptlrpc]
            [10082.502588]  [<ffffffffc0f10c11>] ? ptlrpc_nrs_req_get_nolock0+0xd1/0x170 [ptlrpc]
            [10082.511061]  [<ffffffffc0b2abde>] ? ktime_get_real_seconds+0xe/0x10 [libcfs]
            [10082.519002]  [<ffffffffc0eda80b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
            [10082.527624]  [<ffffffffc0ed7695>] ? ptlrpc_wait_event+0xa5/0x360 [ptlrpc]
            [10082.535212]  [<ffffffff8dcd6802>] ? default_wake_function+0x12/0x20
            [10082.542246]  [<ffffffff8dccbadb>] ? __wake_up_common+0x5b/0x90
            [10082.548808]  [<ffffffffc0ede13c>] ptlrpc_main+0xafc/0x1fc0 [ptlrpc]
            [10082.555853]  [<ffffffffc0edd640>] ? ptlrpc_register_service+0xf80/0xf80 [ptlrpc]
            [10082.564140]  [<ffffffff8dcc1c71>] kthread+0xd1/0xe0
            [10082.569609]  [<ffffffff8dcc1ba0>] ? insert_kthread_work+0x40/0x40
            [10082.576435]  [<ffffffff8e375c37>] ret_from_fork_nospec_begin+0x21/0x21
            [10082.583733]  [<ffffffff8dcc1ba0>] ? insert_kthread_work+0x40/0x40
            [10082.590580] INFO: task debugfs:27640 blocked for more than 120 seconds.
            [10082.597980] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
            [10082.606728] debugfs         D ffff8da16f7c4100     0 27640  27639 0x00000080
            [10082.614668] Call Trace:
            [10082.617431]  [<ffffffff8df7d1b2>] ? radix_tree_lookup_slot+0x22/0x50
            [10082.624532]  [<ffffffff8e366d60>] ? bit_wait+0x50/0x50
            [10082.630298]  [<ffffffff8e368c49>] schedule+0x29/0x70
            [10082.635855]  [<ffffffff8e366721>] schedule_timeout+0x221/0x2d0
            [10082.637668] device-mapper: multipath: Reinstating path 8:96.
            [10082.637830] device-mapper: multipath: Failing path 8:96.
            [10082.654638]  [<ffffffff8ddb6d2d>] ? find_get_pages_tag+0xfd/0x240
            [10082.661459]  [<ffffffff8e366d60>] ? bit_wait+0x50/0x50
            [10082.667204]  [<ffffffff8e3682ed>] io_schedule_timeout+0xad/0x130
            [10082.673927]  [<ffffffff8e368388>] io_schedule+0x18/0x20
            [10082.679768]  [<ffffffff8e366d71>] bit_wait_io+0x11/0x50
            [10082.685610]  [<ffffffff8e366897>] __wait_on_bit+0x67/0x90
            [10082.691649]  [<ffffffff8ddb5881>] wait_on_page_bit+0x81/0xa0
            [10082.697995]  [<ffffffff8dcc2e00>] ? wake_bit_function+0x40/0x40
            [10082.704613]  [<ffffffff8ddb59b1>] __filemap_fdatawait_range+0x111/0x190
            [10082.712004]  [<ffffffff8ddb5a44>] filemap_fdatawait_range+0x14/0x30
            [10082.719033]  [<ffffffff8ddb7ea6>] filemap_write_and_wait_range+0x56/0x90
            [10082.726530]  [<ffffffff8de803eb>] blkdev_fsync+0x1b/0x50
            [10082.732468]  [<ffffffff8de762f7>] do_fsync+0x67/0xb0
            [10082.738041]  [<ffffffff8e375d15>] ? system_call_after_swapgs+0xa2/0x146
            [10082.745435]  [<ffffffff8de765e0>] SyS_fsync+0x10/0x20
            [10082.751082]  [<ffffffff8e375ddb>] system_call_fastpath+0x22/0x27
            [10082.757818]  [<ffffffff8e375d21>] ? system_call_after_swapgs+0xae/0x146
            [10085.638108] device-mapper: multipath: Reinstating path 8:128.
            [10085.644767] device-mapper: multipath: Failing path 8:128.
            [10087.645110] device-mapper: multipath: Reinstating path 8:96.
            [10087.651810] sd 0:0:0:44: rdac: array soak-netapp5660-1, ctlr 0, queueing MODE_SELECT command
            [10088.301259] sd 0:0:0:44: rdac: array soak-netapp5660-1, ctlr 0, MODE_SELECT completed
            [10088.676785] LustreError: 13053:0:(osd_oi.c:762:osd_oi_insert()) dm-6: the FID [0x2800cae3d:0xe3ec:0x0] is used by two objects: 30409754/1062736205 477627435/1062736207
            [10088.693877] LNet: 12726:0:(o2iblnd_cb.c:413:kiblnd_handle_rx()) PUT_NACK from 192.168.1.109@o2ib
            [10090.652279] device-mapper: multipath: Reinstating path 8:128.
            [10090.658823] device-mapper: multipath: Failing path 8:128.
            [10095.663302] device-mapper: multipath: Reinstating path 8:128.
            [10095.669968] device-mapper: multipath: Failing path 8:128.
            [10100.670909] device-mapper: multipath: Reinstating path 8:128.
            [10100.677556] device-mapper: multipath: Failing path 8:128.
            [10105.677908] device-mapper: multipath: Reinstating path 8:128.
            [10105.684607] device-mapper: multipath: Failing path 8:128.
            [10110.686120] device-mapper: multipath: Reinstating path 8:128.
            [10110.692756] device-mapper: multipath: Failing path 8:128.
            [10115.694525] device-mapper: multipath: Reinstating path 8:128.
            [10115.701149] device-mapper: multipath: Failing path 8:128.
            [10119.333857] BUG: unable to handle kernel NULL pointer dereference at           (null)
            [10119.342631] IP: [<          (null)>]           (null)
            [10119.348281] PGD 0 
            [10119.350546] Oops: 0010 [#1] SMP 
            [10119.354188] Modules linked in: osp(OE) mdd(OE) lod(OE) mdt(OE) lfsck(OE) mgc(OE) osd_ldiskfs(OE) ldiskfs(OE) lquota(OE) fid(OE) fld(OE) ko2iblnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) libcfs(OE) rpcsec_gss_krb5 nfsv4 dns_resolver nfs lockd grace fscache rdma_ucm(OE) ib_ucm(OE) rdma_cm(OE) iw_cm(OE) ib_ipoib(OE) ib_cm(OE) ib_umad(OE) mlx5_ib(OE) mlx5_core(OE) mlxfw(OE) mlx4_en(OE) sb_edac intel_powerclamp coretemp intel_rapl iosf_mbi kvm irqbypass crc32_pclmul ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper ipmi_ssif ablk_helper dm_round_robin cryptd iTCO_wdt iTCO_vendor_support ipmi_si joydev pcspkr ipmi_devintf mei_me ioatdma ipmi_msghandler i2c_i801 mei sg lpc_ich wmi dm_multipath dm_mod auth_rpcgss sunrpc ip_tables ext4 mbcache jbd2 sd_mod crc_t10dif crct10dif_generic mlx4_ib(OE) ib_uverbs(OE) ib_core(OE) mgag200 drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops mlx4_core(OE) igb ttm isci ahci devlink crct10dif_pclmul crct10dif_common libsas mpt3sas libahci crc32c_intel ptp drm pps_core raid_class libata mlx_compat(OE) dca scsi_transport_sas drm_panel_orientation_quirks i2c_algo_bit
            [10119.465397] CPU: 13 PID: 13069 Comm: mdt_out01_011 Kdump: loaded Tainted: G           OE  ------------   3.10.0-957.10.1.el7_lustre.x86_64 #1
            [10119.479569] Hardware name: Intel Corporation S2600GZ ........../S2600GZ, BIOS SE5C600.86B.01.08.0003.022620131521 02/26/2013
            [10119.492080] task: ffff8da8e85fa080 ti: ffff8da8ccfb0000 task.ti: ffff8da8ccfb0000
            [10119.500439] RIP: 0010:[<0000000000000000>]  [<          (null)>]           (null)
            [10119.508809] RSP: 0018:ffff8da8ccfb3b60  EFLAGS: 00010246
            [10119.514738] RAX: 0000000000000000 RBX: ffff8da8d64fa580 RCX: 0000000000000002
            [10119.522705] RDX: ffffffffc11e1a50 RSI: ffff8da8ccfb3b68 RDI: ffff8da8d64fa588
            [10119.530669] RBP: ffff8da8ccfb3ba0 R08: 0000000000000004 R09: 0000000000000000
            [10119.538639] R10: 0000000000000001 R11: 00000000007fffff R12: ffff8da85672a300
            [10119.546601] R13: ffff8da86e8e5d08 R14: ffff8da4b1756200 R15: ffff8da8d64fa588
            [10119.554566] FS:  0000000000000000(0000) GS:ffff8da8ed940000(0000) knlGS:0000000000000000
            [10119.563610] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
            [10119.570023] CR2: 0000000000000000 CR3: 000000004c210000 CR4: 00000000000607e0
            [10119.578001] Call Trace:
            [10119.580767]  [<ffffffffc11dfe4e>] ? osd_ldiskfs_it_fill+0xbe/0x260 [osd_ldiskfs]
            [10119.589030]  [<ffffffffc11e0027>] osd_it_ea_load+0x37/0x100 [osd_ldiskfs]
            [10119.596673]  [<ffffffffc0c987d8>] dt_index_walk+0xf8/0x430 [obdclass]
            [10119.603889]  [<ffffffffc0c98b10>] ? dt_index_walk+0x430/0x430 [obdclass]
            [10119.611386]  [<ffffffffc0c99404>] dt_index_read+0x394/0x6a0 [obdclass]
            [10119.618775]  [<ffffffffc0f32162>] tgt_obd_idx_read+0x612/0x860 [ptlrpc]
            [10119.626211]  [<ffffffffc0f351da>] tgt_request_handle+0xaea/0x1580 [ptlrpc]
            [10119.633936]  [<ffffffffc0f10c11>] ? ptlrpc_nrs_req_get_nolock0+0xd1/0x170 [ptlrpc]
            [10119.642404]  [<ffffffffc0b2abde>] ? ktime_get_real_seconds+0xe/0x10 [libcfs]
            [10119.650319]  [<ffffffffc0eda80b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
            [10119.658910]  [<ffffffffc0ed7695>] ? ptlrpc_wait_event+0xa5/0x360 [ptlrpc]
            [10119.666489]  [<ffffffff8dcd6802>] ? default_wake_function+0x12/0x20
            [10119.673492]  [<ffffffff8dccbadb>] ? __wake_up_common+0x5b/0x90
            [10119.680047]  [<ffffffffc0ede13c>] ptlrpc_main+0xafc/0x1fc0 [ptlrpc]
            [10119.687087]  [<ffffffffc0edd640>] ? ptlrpc_register_service+0xf80/0xf80 [ptlrpc]
            [10119.695351]  [<ffffffff8dcc1c71>] kthread+0xd1/0xe0
            [10119.700795]  [<ffffffff8dcc1ba0>] ? insert_kthread_work+0x40/0x40
            [10119.707625]  [<ffffffff8e375c37>] ret_from_fork_nospec_begin+0x21/0x21
            [10119.714910]  [<ffffffff8dcc1ba0>] ? insert_kthread_work+0x40/0x40
            [10119.721711] Code:  Bad RIP value.
            [10119.725449] RIP  [<          (null)>]           (null)
            [10119.731192]  RSP <ffff8da8ccfb3b60>
            [10119.735084] CR2: 0000000000000000
            [    0.000000] Initializing cgroup subsys cpuset
            [    0.000000] Initializing cgroup subsys cpu
            [    0.000000] Initializing cgroup subsys cpuacct
            
            sarah Sarah Liu added a comment - Hit the same issue again when testing 2.12.2-rc1, 1 MDS crash after running about 34 hours soak-10 console log [10077.630267] device-mapper: multipath: Failing path 8:96. [10080.630817] device-mapper: multipath: Reinstating path 8:128. [10080.637462] device-mapper: multipath: Failing path 8:128. [10081.939589] INFO: task kmmpd-dm-6:12825 blocked for more than 120 seconds. [10081.947277] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [10081.956040] kmmpd-dm-6 D ffff8da0f6120000 0 12825 2 0x00000080 [10081.964033] Call Trace: [10081.966797] [<ffffffff8e366d60>] ? bit_wait+0x50/0x50 [10081.972543] [<ffffffff8e368c49>] schedule+0x29/0x70 [10081.978120] [<ffffffff8e366721>] schedule_timeout+0x221/0x2d0 [10081.984704] [<ffffffffc05040bd>] ? dm_old_request_fn+0x18d/0x210 [dm_mod] [10081.992416] [<ffffffff8df680de>] ? deadline_add_request+0x6e/0x90 [10081.999361] [<ffffffff8dd01092>] ? ktime_get_ts64+0x52/0xf0 [10082.005679] [<ffffffff8e366d60>] ? bit_wait+0x50/0x50 [10082.011450] [<ffffffff8e3682ed>] io_schedule_timeout+0xad/0x130 [10082.018164] [<ffffffff8e368388>] io_schedule+0x18/0x20 [10082.024022] [<ffffffff8e366d71>] bit_wait_io+0x11/0x50 [10082.029881] [<ffffffff8e366897>] __wait_on_bit+0x67/0x90 [10082.035921] [<ffffffff8e366d60>] ? bit_wait+0x50/0x50 [10082.041680] [<ffffffff8e366a01>] out_of_line_wait_on_bit+0x81/0xb0 [10082.048707] [<ffffffff8dcc2e00>] ? wake_bit_function+0x40/0x40 [10082.055333] [<ffffffff8de7931a>] __wait_on_buffer+0x2a/0x30 [10082.061693] [<ffffffffc11429ed>] write_mmp_block+0x11d/0x160 [ldiskfs] [10082.069108] [<ffffffffc1142c48>] kmmpd+0x1a8/0x430 [ldiskfs] [10082.075554] [<ffffffffc1142aa0>] ? __dump_mmp_msg+0x70/0x70 [ldiskfs] [10082.082842] [<ffffffff8dcc1c71>] kthread+0xd1/0xe0 [10082.088321] [<ffffffff8dcc1ba0>] ? insert_kthread_work+0x40/0x40 [10082.095153] [<ffffffff8e375c37>] ret_from_fork_nospec_begin+0x21/0x21 [10082.102444] [<ffffffff8dcc1ba0>] ? insert_kthread_work+0x40/0x40 [10082.109292] INFO: task jbd2/dm-6-8:12826 blocked for more than 120 seconds. [10082.117092] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [10082.125846] jbd2/dm-6-8 D ffff8da8ea112080 0 12826 2 0x00000080 [10082.133794] Call Trace: [10082.136532] [<ffffffff8e368c49>] schedule+0x29/0x70 [10082.142108] [<ffffffffc047b2bc>] jbd2_journal_commit_transaction+0x23c/0x19b0 [jbd2] [10082.150883] [<ffffffff8dcd9f15>] ? sched_clock_cpu+0x85/0xc0 [10082.157323] [<ffffffff8dc2a59e>] ? __switch_to+0xce/0x580 [10082.163459] [<ffffffff8dcc2d40>] ? wake_up_atomic_t+0x30/0x30 [10082.169993] [<ffffffff8e36878f>] ? __schedule+0x3ff/0x890 [10082.176140] [<ffffffff8dcaa11e>] ? try_to_del_timer_sync+0x5e/0x90 [10082.183156] [<ffffffffc0481e89>] kjournald2+0xc9/0x260 [jbd2] [10082.189686] [<ffffffff8dcc2d40>] ? wake_up_atomic_t+0x30/0x30 [10082.196208] [<ffffffffc0481dc0>] ? commit_timeout+0x10/0x10 [jbd2] [10082.203227] [<ffffffff8dcc1c71>] kthread+0xd1/0xe0 [10082.208710] [<ffffffff8dcc1ba0>] ? insert_kthread_work+0x40/0x40 [10082.215521] [<ffffffff8e375c37>] ret_from_fork_nospec_begin+0x21/0x21 [10082.222812] [<ffffffff8dcc1ba0>] ? insert_kthread_work+0x40/0x40 [10082.229643] INFO: task mdt_out01_012:13070 blocked for more than 120 seconds. [10082.237624] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [10082.246372] mdt_out01_012 D ffff8da8d67f9040 0 13070 2 0x00000080 [10082.254354] Call Trace: [10082.257115] [<ffffffff8e366d60>] ? bit_wait+0x50/0x50 [10082.262863] [<ffffffff8e368c49>] schedule+0x29/0x70 [10082.268446] [<ffffffff8e366721>] schedule_timeout+0x221/0x2d0 [10082.274989] [<ffffffff8de78c3d>] ? __brelse+0x3d/0x50 [10082.280755] [<ffffffff8de78f2a>] ? bh_lru_install+0x18a/0x1e0 [10082.287296] [<ffffffff8dd01092>] ? ktime_get_ts64+0x52/0xf0 [10082.293630] [<ffffffff8e366d60>] ? bit_wait+0x50/0x50 [10082.299394] [<ffffffff8e3682ed>] io_schedule_timeout+0xad/0x130 [10082.306106] [<ffffffff8e368388>] io_schedule+0x18/0x20 [10082.311963] [<ffffffff8e366d71>] bit_wait_io+0x11/0x50 [10082.317820] [<ffffffff8e366921>] __wait_on_bit_lock+0x61/0xc0 [10082.324361] [<ffffffff8e366d60>] ? bit_wait+0x50/0x50 [10082.330114] [<ffffffff8e366b71>] out_of_line_wait_on_bit_lock+0x81/0xb0 [10082.337614] [<ffffffff8dcc2e00>] ? wake_bit_function+0x40/0x40 [10082.344252] [<ffffffff8de79352>] __lock_buffer+0x32/0x40 [10082.350326] [<ffffffffc0479ea2>] do_get_write_access+0x432/0x4d0 [jbd2] [10082.357831] [<ffffffff8de7959d>] ? __getblk+0x2d/0x300 [10082.363681] [<ffffffffc0479f67>] jbd2_journal_get_write_access+0x27/0x40 [jbd2] [10082.371965] [<ffffffffc1159dab>] __ldiskfs_journal_get_write_access+0x3b/0x80 [ldiskfs] [10082.381027] [<ffffffffc112f180>] add_dirent_to_buf+0xd0/0x2c0 [ldiskfs] [10082.388530] [<ffffffffc112f55e>] ldiskfs_dx_add_entry+0x1ee/0xc60 [ldiskfs] [10082.396428] [<ffffffffc047a2d1>] ? jbd2_journal_dirty_metadata+0xd1/0x260 [jbd2] [10082.404806] [<ffffffff8dcc2d05>] ? wake_up_bit+0x25/0x30 [10082.410868] [<ffffffffc1130a54>] __ldiskfs_add_entry+0x254/0x6e0 [ldiskfs] [10082.418671] [<ffffffffc1132ae3>] ? ldiskfs_add_dot_dotdot+0x83/0x90 [ldiskfs] [10082.426755] [<ffffffffc11e9404>] osd_ldiskfs_add_entry+0x34/0x260 [osd_ldiskfs] [10082.435042] [<ffffffffc11e9c77>] __osd_ea_add_rec+0x197/0x360 [osd_ldiskfs] [10082.442941] [<ffffffffc11f4ad1>] osd_index_ea_insert+0x491/0x12f0 [osd_ldiskfs] [10082.451206] [<ffffffff8e367f32>] ? down_write+0x12/0x3d [10082.457246] [<ffffffffc0f43350>] out_tx_index_insert_exec+0x1e0/0x490 [ptlrpc] [10082.465464] [<ffffffffc0f41521>] ? out_obj_ref_add+0xc1/0x220 [ptlrpc] [10082.472895] [<ffffffffc0f3c391>] out_tx_end+0xe1/0x5c0 [ptlrpc] [10082.479666] [<ffffffffc0f405b2>] out_handle+0x1452/0x1bc0 [ptlrpc] [10082.486717] [<ffffffffc0e87890>] ? target_send_reply_msg+0x170/0x170 [ptlrpc] [10082.494854] [<ffffffffc0f351da>] tgt_request_handle+0xaea/0x1580 [ptlrpc] [10082.502588] [<ffffffffc0f10c11>] ? ptlrpc_nrs_req_get_nolock0+0xd1/0x170 [ptlrpc] [10082.511061] [<ffffffffc0b2abde>] ? ktime_get_real_seconds+0xe/0x10 [libcfs] [10082.519002] [<ffffffffc0eda80b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc] [10082.527624] [<ffffffffc0ed7695>] ? ptlrpc_wait_event+0xa5/0x360 [ptlrpc] [10082.535212] [<ffffffff8dcd6802>] ? default_wake_function+0x12/0x20 [10082.542246] [<ffffffff8dccbadb>] ? __wake_up_common+0x5b/0x90 [10082.548808] [<ffffffffc0ede13c>] ptlrpc_main+0xafc/0x1fc0 [ptlrpc] [10082.555853] [<ffffffffc0edd640>] ? ptlrpc_register_service+0xf80/0xf80 [ptlrpc] [10082.564140] [<ffffffff8dcc1c71>] kthread+0xd1/0xe0 [10082.569609] [<ffffffff8dcc1ba0>] ? insert_kthread_work+0x40/0x40 [10082.576435] [<ffffffff8e375c37>] ret_from_fork_nospec_begin+0x21/0x21 [10082.583733] [<ffffffff8dcc1ba0>] ? insert_kthread_work+0x40/0x40 [10082.590580] INFO: task debugfs:27640 blocked for more than 120 seconds. [10082.597980] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [10082.606728] debugfs D ffff8da16f7c4100 0 27640 27639 0x00000080 [10082.614668] Call Trace: [10082.617431] [<ffffffff8df7d1b2>] ? radix_tree_lookup_slot+0x22/0x50 [10082.624532] [<ffffffff8e366d60>] ? bit_wait+0x50/0x50 [10082.630298] [<ffffffff8e368c49>] schedule+0x29/0x70 [10082.635855] [<ffffffff8e366721>] schedule_timeout+0x221/0x2d0 [10082.637668] device-mapper: multipath: Reinstating path 8:96. [10082.637830] device-mapper: multipath: Failing path 8:96. [10082.654638] [<ffffffff8ddb6d2d>] ? find_get_pages_tag+0xfd/0x240 [10082.661459] [<ffffffff8e366d60>] ? bit_wait+0x50/0x50 [10082.667204] [<ffffffff8e3682ed>] io_schedule_timeout+0xad/0x130 [10082.673927] [<ffffffff8e368388>] io_schedule+0x18/0x20 [10082.679768] [<ffffffff8e366d71>] bit_wait_io+0x11/0x50 [10082.685610] [<ffffffff8e366897>] __wait_on_bit+0x67/0x90 [10082.691649] [<ffffffff8ddb5881>] wait_on_page_bit+0x81/0xa0 [10082.697995] [<ffffffff8dcc2e00>] ? wake_bit_function+0x40/0x40 [10082.704613] [<ffffffff8ddb59b1>] __filemap_fdatawait_range+0x111/0x190 [10082.712004] [<ffffffff8ddb5a44>] filemap_fdatawait_range+0x14/0x30 [10082.719033] [<ffffffff8ddb7ea6>] filemap_write_and_wait_range+0x56/0x90 [10082.726530] [<ffffffff8de803eb>] blkdev_fsync+0x1b/0x50 [10082.732468] [<ffffffff8de762f7>] do_fsync+0x67/0xb0 [10082.738041] [<ffffffff8e375d15>] ? system_call_after_swapgs+0xa2/0x146 [10082.745435] [<ffffffff8de765e0>] SyS_fsync+0x10/0x20 [10082.751082] [<ffffffff8e375ddb>] system_call_fastpath+0x22/0x27 [10082.757818] [<ffffffff8e375d21>] ? system_call_after_swapgs+0xae/0x146 [10085.638108] device-mapper: multipath: Reinstating path 8:128. [10085.644767] device-mapper: multipath: Failing path 8:128. [10087.645110] device-mapper: multipath: Reinstating path 8:96. [10087.651810] sd 0:0:0:44: rdac: array soak-netapp5660-1, ctlr 0, queueing MODE_SELECT command [10088.301259] sd 0:0:0:44: rdac: array soak-netapp5660-1, ctlr 0, MODE_SELECT completed [10088.676785] LustreError: 13053:0:(osd_oi.c:762:osd_oi_insert()) dm-6: the FID [0x2800cae3d:0xe3ec:0x0] is used by two objects: 30409754/1062736205 477627435/1062736207 [10088.693877] LNet: 12726:0:(o2iblnd_cb.c:413:kiblnd_handle_rx()) PUT_NACK from 192.168.1.109@o2ib [10090.652279] device-mapper: multipath: Reinstating path 8:128. [10090.658823] device-mapper: multipath: Failing path 8:128. [10095.663302] device-mapper: multipath: Reinstating path 8:128. [10095.669968] device-mapper: multipath: Failing path 8:128. [10100.670909] device-mapper: multipath: Reinstating path 8:128. [10100.677556] device-mapper: multipath: Failing path 8:128. [10105.677908] device-mapper: multipath: Reinstating path 8:128. [10105.684607] device-mapper: multipath: Failing path 8:128. [10110.686120] device-mapper: multipath: Reinstating path 8:128. [10110.692756] device-mapper: multipath: Failing path 8:128. [10115.694525] device-mapper: multipath: Reinstating path 8:128. [10115.701149] device-mapper: multipath: Failing path 8:128. [10119.333857] BUG: unable to handle kernel NULL pointer dereference at (null) [10119.342631] IP: [< (null)>] (null) [10119.348281] PGD 0 [10119.350546] Oops: 0010 [#1] SMP [10119.354188] Modules linked in: osp(OE) mdd(OE) lod(OE) mdt(OE) lfsck(OE) mgc(OE) osd_ldiskfs(OE) ldiskfs(OE) lquota(OE) fid(OE) fld(OE) ko2iblnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) libcfs(OE) rpcsec_gss_krb5 nfsv4 dns_resolver nfs lockd grace fscache rdma_ucm(OE) ib_ucm(OE) rdma_cm(OE) iw_cm(OE) ib_ipoib(OE) ib_cm(OE) ib_umad(OE) mlx5_ib(OE) mlx5_core(OE) mlxfw(OE) mlx4_en(OE) sb_edac intel_powerclamp coretemp intel_rapl iosf_mbi kvm irqbypass crc32_pclmul ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper ipmi_ssif ablk_helper dm_round_robin cryptd iTCO_wdt iTCO_vendor_support ipmi_si joydev pcspkr ipmi_devintf mei_me ioatdma ipmi_msghandler i2c_i801 mei sg lpc_ich wmi dm_multipath dm_mod auth_rpcgss sunrpc ip_tables ext4 mbcache jbd2 sd_mod crc_t10dif crct10dif_generic mlx4_ib(OE) ib_uverbs(OE) ib_core(OE) mgag200 drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops mlx4_core(OE) igb ttm isci ahci devlink crct10dif_pclmul crct10dif_common libsas mpt3sas libahci crc32c_intel ptp drm pps_core raid_class libata mlx_compat(OE) dca scsi_transport_sas drm_panel_orientation_quirks i2c_algo_bit [10119.465397] CPU: 13 PID: 13069 Comm: mdt_out01_011 Kdump: loaded Tainted: G OE ------------ 3.10.0-957.10.1.el7_lustre.x86_64 #1 [10119.479569] Hardware name: Intel Corporation S2600GZ ........../S2600GZ, BIOS SE5C600.86B.01.08.0003.022620131521 02/26/2013 [10119.492080] task: ffff8da8e85fa080 ti: ffff8da8ccfb0000 task.ti: ffff8da8ccfb0000 [10119.500439] RIP: 0010:[<0000000000000000>] [< (null)>] (null) [10119.508809] RSP: 0018:ffff8da8ccfb3b60 EFLAGS: 00010246 [10119.514738] RAX: 0000000000000000 RBX: ffff8da8d64fa580 RCX: 0000000000000002 [10119.522705] RDX: ffffffffc11e1a50 RSI: ffff8da8ccfb3b68 RDI: ffff8da8d64fa588 [10119.530669] RBP: ffff8da8ccfb3ba0 R08: 0000000000000004 R09: 0000000000000000 [10119.538639] R10: 0000000000000001 R11: 00000000007fffff R12: ffff8da85672a300 [10119.546601] R13: ffff8da86e8e5d08 R14: ffff8da4b1756200 R15: ffff8da8d64fa588 [10119.554566] FS: 0000000000000000(0000) GS:ffff8da8ed940000(0000) knlGS:0000000000000000 [10119.563610] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [10119.570023] CR2: 0000000000000000 CR3: 000000004c210000 CR4: 00000000000607e0 [10119.578001] Call Trace: [10119.580767] [<ffffffffc11dfe4e>] ? osd_ldiskfs_it_fill+0xbe/0x260 [osd_ldiskfs] [10119.589030] [<ffffffffc11e0027>] osd_it_ea_load+0x37/0x100 [osd_ldiskfs] [10119.596673] [<ffffffffc0c987d8>] dt_index_walk+0xf8/0x430 [obdclass] [10119.603889] [<ffffffffc0c98b10>] ? dt_index_walk+0x430/0x430 [obdclass] [10119.611386] [<ffffffffc0c99404>] dt_index_read+0x394/0x6a0 [obdclass] [10119.618775] [<ffffffffc0f32162>] tgt_obd_idx_read+0x612/0x860 [ptlrpc] [10119.626211] [<ffffffffc0f351da>] tgt_request_handle+0xaea/0x1580 [ptlrpc] [10119.633936] [<ffffffffc0f10c11>] ? ptlrpc_nrs_req_get_nolock0+0xd1/0x170 [ptlrpc] [10119.642404] [<ffffffffc0b2abde>] ? ktime_get_real_seconds+0xe/0x10 [libcfs] [10119.650319] [<ffffffffc0eda80b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc] [10119.658910] [<ffffffffc0ed7695>] ? ptlrpc_wait_event+0xa5/0x360 [ptlrpc] [10119.666489] [<ffffffff8dcd6802>] ? default_wake_function+0x12/0x20 [10119.673492] [<ffffffff8dccbadb>] ? __wake_up_common+0x5b/0x90 [10119.680047] [<ffffffffc0ede13c>] ptlrpc_main+0xafc/0x1fc0 [ptlrpc] [10119.687087] [<ffffffffc0edd640>] ? ptlrpc_register_service+0xf80/0xf80 [ptlrpc] [10119.695351] [<ffffffff8dcc1c71>] kthread+0xd1/0xe0 [10119.700795] [<ffffffff8dcc1ba0>] ? insert_kthread_work+0x40/0x40 [10119.707625] [<ffffffff8e375c37>] ret_from_fork_nospec_begin+0x21/0x21 [10119.714910] [<ffffffff8dcc1ba0>] ? insert_kthread_work+0x40/0x40 [10119.721711] Code: Bad RIP value. [10119.725449] RIP [< (null)>] (null) [10119.731192] RSP <ffff8da8ccfb3b60> [10119.735084] CR2: 0000000000000000 [ 0.000000] Initializing cgroup subsys cpuset [ 0.000000] Initializing cgroup subsys cpu [ 0.000000] Initializing cgroup subsys cpuacct
            pjones Peter Jones added a comment -

            ok - thanks Sarah!

            pjones Peter Jones added a comment - ok - thanks Sarah!
            sarah Sarah Liu added a comment -

            I think we can close it as cannot reproduce for now.

            sarah Sarah Liu added a comment - I think we can close it as cannot reproduce for now.
            pjones Peter Jones added a comment -

            So can we close this ticket as cannot repo? Previously this had been happening every couple of hours, right?

            pjones Peter Jones added a comment - So can we close this ticket as cannot repo? Previously this had been happening every couple of hours, right?
            sarah Sarah Liu added a comment -

            hi John, it is lustre-master-ib build 137. No, I don't see the failure recently, soak is running on lustre-lustre-ib build 142 right now

            sarah Sarah Liu added a comment - hi John, it is lustre-master-ib build 137. No, I don't see the failure recently, soak is running on lustre-lustre-ib build 142 right now
            jhammond John Hammond added a comment -

            Which build was this? Was it lustre-master-patchless #137? Is there a more recent build (recent enough that we still have the RPMs) that has this issue?

            jhammond John Hammond added a comment - Which build was this? Was it lustre-master-patchless #137? Is there a more recent build (recent enough that we still have the RPMs) that has this issue?
            ys Yang Sheng added a comment -
            [15638.646875] blk_cloned_rq_check_limits: over max size limit.
            [15638.646879] blk_cloned_rq_check_limits: over max size limit.
            [15638.646888] blk_cloned_rq_check_limits: over max size limit.
            [15638.646890] blk_cloned_rq_check_limits: over max size limit.
            [15638.646892] blk_cloned_rq_check_limits: over max size limit.
            [15638.646895] blk_cloned_rq_check_limits: over max size limit.
            [15638.646916] device-mapper: multipath: Failing path 8:96.
            

            This issue should be fixed by LU-9551. Looks like still there? But so great we can reproduce it on our cluster.

            ys Yang Sheng added a comment - [15638.646875] blk_cloned_rq_check_limits: over max size limit. [15638.646879] blk_cloned_rq_check_limits: over max size limit. [15638.646888] blk_cloned_rq_check_limits: over max size limit. [15638.646890] blk_cloned_rq_check_limits: over max size limit. [15638.646892] blk_cloned_rq_check_limits: over max size limit. [15638.646895] blk_cloned_rq_check_limits: over max size limit. [15638.646916] device-mapper: multipath: Failing path 8:96. This issue should be fixed by LU-9551 . Looks like still there? But so great we can reproduce it on our cluster.
            ys Yang Sheng added a comment -

            Maybe some issue like LDEV-642, Let me do further investigation.

            ys Yang Sheng added a comment - Maybe some issue like LDEV-642, Let me do further investigation.
            pjones Peter Jones added a comment -

            Yang Sheng

            This seems to be an ldiskfs issue. It has only appeared between the 2.11.54 and 2.11.55 tags. Any suggestions?

            Peter

            pjones Peter Jones added a comment - Yang Sheng This seems to be an ldiskfs issue. It has only appeared between the 2.11.54 and 2.11.55 tags. Any suggestions? Peter
            sarah Sarah Liu added a comment -

            Hi Lai,

            No, I didn't run dir migration. When I filed the ticket, it was the first time seen the problem, but I restarted soak on tip of master on Monday with build #137, hit the same issue again in about 2 hours run. Here is the trace from the latest failure:

            on soak-10

            [ 9654.637324] blk_cloned_rq_check_limits: over max size limit.
            [ 9658.590523] BUG: unable to handle kernel NULL pointer dereference at           (null)
            [ 9658.600366] IP: [<          (null)>]           (null)
            [ 9658.606866] PGD 0 
            [ 9658.609984] Oops: 0010 [#1] SMP 
            [ 9658.614447] Modules linked in: osp(OE) mdd(OE) lod(OE) mdt(OE) lfsck(OE) mgc(OE) osd_ldiskfs(OE) ldiskfs(OE) lquota(OE) lustre(OE) lmv(OE) mdc(OE) osc(OE) lov(OE) fid(OE) fld(OE) ko2iblnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) libcfs(OE) rpcsec_gss_krb5 nfsv4 dns_resolver nfs lockd grace fscache rdma_ucm(OE) ib_ucm(OE) rdma_cm(OE) iw_cm(OE) ib_ipoib(OE) ib_cm(OE) ib_uverbs(OE) ib_umad(OE) mlx5_ib(OE) mlx5_core(OE) mlxfw(OE) mlx4_en(OE) dm_round_robin zfs(POE) zunicode(POE) zavl(POE) icp(POE) zcommon(POE) znvpair(POE) sb_edac intel_powerclamp coretemp spl(OE) intel_rapl iosf_mbi kvm irqbypass crc32_pclmul ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper ablk_helper cryptd ipmi_ssif pcspkr ipmi_si ipmi_devintf ipmi_msghandler iTCO_wdt iTCO_vendor_support joydev sg mei_me mei i2c_i801 lpc_ich ioatdma wmi shpchp auth_rpcgss dm_multipath dm_mod sunrpc ip_tables ext4 mbcache jbd2 sd_mod crc_t10dif crct10dif_generic mlx4_ib(OE) ib_core(OE) mgag200 drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm igb isci ahci ptp drm mlx4_core(OE) libsas libahci pps_core mpt3sas dca crct10dif_pclmul devlink i2c_algo_bit crct10dif_common crc32c_intel raid_class libata i2c_core mlx_compat(OE) scsi_transport_sas
            [ 9658.745591] CPU: 11 PID: 22180 Comm: mdt_out01_010 Tainted: P           OE  ------------   3.10.0-862.9.1.el7_lustre.x86_64 #1
            [ 9658.760612] Hardware name: Intel Corporation SandyBridge Platform/To be filled by O.E.M., BIOS SE5C600.86B.01.08.0003.022620131521 02/26/2013
            [ 9658.777200] task: ffff9e132d6f2f70 ti: ffff9e12e2b24000 task.ti: ffff9e12e2b24000
            [ 9658.786678] RIP: 0010:[<0000000000000000>]  [<          (null)>]           (null)
            [ 9658.796193] RSP: 0018:ffff9e12e2b27b60  EFLAGS: 00010246
            [ 9658.803358] RAX: 0000000000000000 RBX: ffff9e1715878000 RCX: 0000000000000002
            [ 9658.812569] RDX: ffffffffc16877b0 RSI: ffff9e12e2b27b68 RDI: ffff9e1715878008
            [ 9658.821879] RBP: ffff9e12e2b27ba0 R08: 0000000000000004 R09: 0000000000000000
            [ 9658.831212] R10: 0000000000000001 R11: 00000000007fffff R12: ffff9e1317b6ad00
            [ 9658.840559] R13: ffff9e16edd7c030 R14: ffff9e1714d4e800 R15: ffff9e1715878008
            [ 9658.849980] FS:  0000000000000000(0000) GS:ffff9e172d8c0000(0000) knlGS:0000000000000000
            [ 9658.860384] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
            [ 9658.868127] CR2: 0000000000000000 CR3: 000000073a60e000 CR4: 00000000000607e0
            [ 9658.877444] Call Trace:
            [ 9658.881511]  [<ffffffffc1685c3e>] ? osd_ldiskfs_it_fill+0xbe/0x260 [osd_ldiskfs]
            [ 9658.891221]  [<ffffffffc1685e17>] osd_it_ea_load+0x37/0x100 [osd_ldiskfs]
            [ 9658.900163]  [<ffffffffc0f47608>] dt_index_walk+0xf8/0x430 [obdclass]
            [ 9658.908835]  [<ffffffffc0f47940>] ? dt_index_walk+0x430/0x430 [obdclass]
            [ 9658.917808]  [<ffffffffc0f48a14>] dt_index_read+0x394/0x6a0 [obdclass]
            [ 9658.926634]  [<ffffffffc11dfd22>] tgt_obd_idx_read+0x612/0x860 [ptlrpc]
            [ 9658.935280]  [<ffffffffc11e2f3a>] tgt_request_handle+0xaea/0x1580 [ptlrpc]
            [ 9658.944303]  [<ffffffffc11bea61>] ? ptlrpc_nrs_req_get_nolock0+0xd1/0x170 [ptlrpc]
            [ 9658.954245]  [<ffffffffc0dc2bde>] ? ktime_get_real_seconds+0xe/0x10 [libcfs]
            [ 9658.963572]  [<ffffffffc1189acb>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
            [ 9658.973654]  [<ffffffffc1186955>] ? ptlrpc_wait_event+0xa5/0x360 [ptlrpc]
            [ 9658.982660]  [<ffffffffaaccf682>] ? default_wake_function+0x12/0x20
            [ 9658.991178]  [<ffffffffaacc52ab>] ? __wake_up_common+0x5b/0x90
            [ 9658.999219]  [<ffffffffc118d2ec>] ptlrpc_main+0xafc/0x1fb0 [ptlrpc]
            [ 9659.007754]  [<ffffffffc118c7f0>] ? ptlrpc_register_service+0xe90/0xe90 [ptlrpc]
            [ 9659.017507]  [<ffffffffaacbb621>] kthread+0xd1/0xe0
            [ 9659.024461]  [<ffffffffaacbb550>] ? insert_kthread_work+0x40/0x40
            [ 9659.032809]  [<ffffffffab3205f7>] ret_from_fork_nospec_begin+0x21/0x21
            [ 9659.041631]  [<ffffffffaacbb550>] ? insert_kthread_work+0x40/0x40
            [ 9659.049943] Code:  Bad RIP value.
            [ 9659.055214] RIP  [<          (null)>]           (null)
            [ 9659.062486]  RSP <ffff9e12e2b27b60>
            [ 9659.067869] CR2: 0000000000000000
            [ 9659.075547] ---[ end trace 8f1f37f93401bf0f ]---
            l Corporation.All rights reserved. 
            Version 2.00.1201.Copyright(c) 2010 - 2012 American Megatrends,Inc. 
            Installed BIOS: SE5C600.86B.01.08.0003
            
            ------system reboot, and after reboot--------
            
            CentOS Linux 7 (Core)
            Kernel 3.10.0-862.9.1.el7_lustre.x86_64 on an x86_64
            
            soak-10 login: [  175.721668] LNet: HW NUMA nodes: 2, HW CPU cores: 32, npartitions: 2
            [  175.731983] alg: No test for adler32 (adler32-zlib)
            [  176.631197] Lustre: Lustre: Build Version: 2.11.55_65_gec2e999
            [  176.925651] LNet: Using FMR for registration
            [  176.943513] LNet: Added LNI 192.168.1.110@o2ib [8/256/0/180]
            [  177.121204] LDISKFS-fs warning (device dm-6): ldiskfs_multi_mount_protect:321: MMP interval 42 higher than expected, please wait.
            [  177.121204] 
            [  223.131715] LDISKFS-fs (dm-6): recovery complete
            [  223.137107] LDISKFS-fs (dm-6): mounted filesystem with ordered data mode. Opts: user_xattr,errors=remount-ro,user_xattr,no_mbcache,nodelalloc
            [  227.640081] Lustre: soaked-MDT0002: Not available for connect from 192.168.1.142@o2ib (not set up)
            [  230.666682] Lustre: soaked-MDT0002: Imperative Recovery enabled, recovery window shrunk from 300-900 down to 150-900
            [  230.703152] LustreError: 4067:0:(llog_osd.c:1005:llog_osd_next_block()) soaked-MDT0000-osp-MDT0002: missed desired record? 3 > 1
            [  230.716141] LustreError: 4067:0:(lod_dev.c:428:lod_sub_recovery_thread()) soaked-MDT0000-osp-MDT0002 get update log failed: rc = -2
            [  234.808297] Lustre: soaked-MDT0002: Connection restored to 192.168.1.106@o2ib (at 192.168.1.106@o2ib)
            [  237.283751] Lustre: soaked-MDT0002: Connection restored to e3d1d543-7071-4f1d-f9c5-f174c66a3f7e (at 192.168.1.143@o2ib)
            [  237.295833] Lustre: Skipped 1 previous similar message
            [  237.308240] Lustre: 4070:0:(ldlm_lib.c:2048:target_recovery_overseer()) recovery is aborted, evict exports in recovery
            [  237.320897] Lustre: soaked-MDT0002: disconnecting 28 stale clients
            [  237.327887] LustreError: 4070:0:(tgt_grant.c:248:tgt_grant_sanity_check()) mdt_obd_disconnect: tot_granted 2097152 != fo_tot_granted 4194304
            [  238.302385] Lustre: soaked-MDT0002: Connection restored to 1d05ec18-7f6a-896f-de54-063cf1f8b51c (at 192.168.1.122@o2ib)
            [  238.314575] Lustre: Skipped 1 previous similar message
            [  240.471517] Lustre: soaked-MDT0002: Connection restored to a225c1a8-609c-d769-404b-3c25112913a0 (at 192.168.1.141@o2ib)
            [  240.483695] Lustre: Skipped 7 previous similar messages
            [  244.492077] Lustre: soaked-MDT0002: Connection restored to a6836049-d02e-27ab-c254-5e76eb9cef2b (at 192.168.1.136@o2ib)
            [  244.504273] Lustre: Skipped 4 previous similar messages
            [  254.539127] Lustre: soaked-MDT0002: Connection restored to e6fd2edd-0c64-011a-f170-b8c415906b8c (at 192.168.1.124@o2ib)
            [  254.551333] Lustre: Skipped 9 previous similar messages
            [  273.870109] Lustre: soaked-MDT0002: Connection restored to 192.168.1.104@o2ib (at 192.168.1.104@o2ib)
            [  273.880570] Lustre: Skipped 7 previous similar messages
            [  292.450898] LustreError: 168-f: soaked-MDT0002: BAD WRITE CHECKSUM: from 12345-192.168.1.135@o2ib inode [0x28001e862:0x214:0x0] object 0x28001e862:532 extent [36-39]: client csum 2ff01ff, server csum 2ed01f6
            [  293.574529] LustreError: 168-f: soaked-MDT0002: BAD WRITE CHECKSUM: from 12345-192.168.1.135@o2ib inode [0x28001e862:0x214:0x0] object 0x28001e862:532 extent [36-39]: client csum 2ff01ff, server csum 2ed01f6
            [  295.414655] LustreError: 168-f: soaked-MDT0002: BAD WRITE CHECKSUM: from 12345-192.168.1.135@o2ib inode [0x28001e862:0x214:0x0] object 0x28001e862:532 extent [36-39]: client csum 2ff01ff, server csum 2ed01f6
            [  296.317843] BUG: unable to handle kernel NULL pointer dereference at 00000000000005ab
            [  296.326718] IP: [<ffffffff8c3528c3>] rb_next+0x23/0x50
            [  296.332536] PGD 0 
            [  296.334823] Oops: 0000 [#1] SMP 
            [  296.341568] Modules linked in: osp(OE) mdd(OE) lod(OE) mdt(OE) lfsck(OE) mgc(OE) osd_ldiskfs(OE) ldiskfs(OE) lquota(OE) fid(OE) fld(OE) ko2iblnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) libcfs(OE) rpcsec_gss_krb5 nfsv4 dns_resolver nfs lockd grace fscache rdma_ucm(OE) ib_ucm(OE) rdma_cm(OE) iw_cm(OE) ib_ipoib(OE) ib_cm(OE) ib_uverbs(OE) ib_umad(OE) mlx5_ib(OE) mlx5_core(OE) mlxfw(OE) mlx4_en(OE) dm_round_robin zfs(POE) zunicode(POE) zavl(POE) icp(POE) zcommon(POE) znvpair(POE) spl(OE) sb_edac intel_powerclamp coretemp intel_rapl iosf_mbi kvm irqbypass crc32_pclmul ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper ablk_helper cryptd iTCO_wdt iTCO_vendor_support joydev ipmi_ssif pcspkr mei_me sg mei i2c_i801 lpc_ich ipmi_si ipmi_devintf ipmi_msghandler wmi ioatdma shpchp dm_multipath dm_mod auth_rpcgss sunrpc ip_tables ext4 mbcache jbd2 sd_mod crc_t10dif crct10dif_generic mlx4_ib(OE) ib_core(OE) mgag200 drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops igb ttm ptp mlx4_core(OE) drm isci ahci mpt3sas pps_core libsas crct10dif_pclmul libahci devlink dca crct10dif_common i2c_algo_bit raid_class crc32c_intel libata i2c_core mlx_compat(OE) scsi_transport_sas
            [  296.490014] CPU: 10 PID: 3968 Comm: mdt_out01_000 Tainted: P           OE  ------------   3.10.0-862.9.1.el7_lustre.x86_64 #1
            [  296.505712] Hardware name: Intel Corporation S2600GZ ........../S2600GZ, BIOS SE5C600.86B.01.08.0003.022620131521 02/26/2013
            [  296.522014] task: ffff9193e349af70 ti: ffff9190a81f4000 task.ti: ffff9190a81f4000
            [  296.534051] RIP: 0010:[<ffffffff8c3528c3>]  [<ffffffff8c3528c3>] rb_next+0x23/0x50
            [  296.545986] RSP: 0018:ffff9190a81f7a40  EFLAGS: 00010202
            [  296.555563] RAX: 000000000000059b RBX: ffff91981aea8d90 RCX: 0000000000000000
            [  296.567341] RDX: 000000000000059b RSI: ffff919815fb2d5e RDI: ffff919815fb2d28
            [  296.578836] RBP: ffff9190a81f7a40 R08: 000000000e800157 R09: 0000000000000004
            [  296.590438] R10: ffff91982ae9c500 R11: ffff91982ae9c500 R12: ffff91980d991900
            [  296.601802] R13: ffff91981aea8d90 R14: ffff91981f763000 R15: ffff91980df68008
            [  296.613303] FS:  0000000000000000(0000) GS:ffff91982d880000(0000) knlGS:0000000000000000
            [  296.625676] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
            [  296.634865] CR2: 00000000000005ab CR3: 00000004ebe0e000 CR4: 00000000000607e0
            [  296.645657] Call Trace:
            [  296.651676]  [<ffffffffc162cd44>] ldiskfs_readdir+0x5b4/0x850 [ldiskfs]
            [  296.662343]  [<ffffffffc0944ef2>] ? fld_local_lookup+0x62/0x270 [fld]
            [  296.672780]  [<ffffffffc16ae7b0>] ? osd_object_alloc+0x360/0x360 [osd_ldiskfs]
            [  296.684053]  [<ffffffffc16acc3e>] osd_ldiskfs_it_fill+0xbe/0x260 [osd_ldiskfs]
            [  296.695182]  [<ffffffffc16acfa6>] osd_it_ea_next+0xc6/0x150 [osd_ldiskfs]
            [  296.705729]  [<ffffffffc1184ae8>] dt_index_page_build+0x1a8/0x470 [obdclass]
            [  296.716658]  [<ffffffffc11846b0>] dt_index_walk+0x1a0/0x430 [obdclass]
            [  296.726761]  [<ffffffffc1184940>] ? dt_index_walk+0x430/0x430 [obdclass]
            [  296.737149]  [<ffffffffc1185a14>] dt_index_read+0x394/0x6a0 [obdclass]
            [  296.747492]  [<ffffffffc141cd22>] tgt_obd_idx_read+0x612/0x860 [ptlrpc]
            [  296.757860]  [<ffffffffc141ff3a>] tgt_request_handle+0xaea/0x1580 [ptlrpc]
            [  296.768476]  [<ffffffffc13fba61>] ? ptlrpc_nrs_req_get_nolock0+0xd1/0x170 [ptlrpc]
            [  296.779775]  [<ffffffffc1038bde>] ? ktime_get_real_seconds+0xe/0x10 [libcfs]
            [  296.790527]  [<ffffffffc13c6acb>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
            [  296.801935]  [<ffffffffc13c3955>] ? ptlrpc_wait_event+0xa5/0x360 [ptlrpc]
            [  296.812275]  [<ffffffff8c0cf682>] ? default_wake_function+0x12/0x20
            [  296.822010]  [<ffffffff8c0c52ab>] ? __wake_up_common+0x5b/0x90
            [  296.831226]  [<ffffffffc13ca2ec>] ptlrpc_main+0xafc/0x1fb0 [ptlrpc]
            [  296.840865]  [<ffffffffc13c97f0>] ? ptlrpc_register_service+0xe90/0xe90 [ptlrpc]
            [  296.851675]  [<ffffffff8c0bb621>] kthread+0xd1/0xe0
            [  296.859622]  [<ffffffff8c0bb550>] ? insert_kthread_work+0x40/0x40
            [  296.868941]  [<ffffffff8c7205f7>] ret_from_fork_nospec_begin+0x21/0x21
            [  296.878704]  [<ffffffff8c0bb550>] ? insert_kthread_work+0x40/0x40
            [  296.888023] Code: c0 5d c3 0f 1f 44 00 00 55 48 8b 17 48 89 e5 48 39 d7 74 3b 48 8b 47 08 48 85 c0 75 0e eb 25 66 0f 1f 84 00 00 00 00 00 48 89 d0 <48> 8b 50 10 48 85 d2 75 f4 5d c3 66 90 48 3b 78 08 75 f6 48 8b 
            [  296.915195] RIP  [<ffffffff8c3528c3>] rb_next+0x23/0x50
            [  296.923524]  RSP <ffff9190a81f7a40>
            [  296.929891] CR2: 00000000000005ab
            [  296.935844] ---[ end trace 7f7dc5e5140c0c8b ]---
            
            
            
            sarah Sarah Liu added a comment - Hi Lai, No, I didn't run dir migration. When I filed the ticket, it was the first time seen the problem, but I restarted soak on tip of master on Monday with build #137, hit the same issue again in about 2 hours run. Here is the trace from the latest failure: on soak-10 [ 9654.637324] blk_cloned_rq_check_limits: over max size limit. [ 9658.590523] BUG: unable to handle kernel NULL pointer dereference at (null) [ 9658.600366] IP: [< (null)>] (null) [ 9658.606866] PGD 0 [ 9658.609984] Oops: 0010 [#1] SMP [ 9658.614447] Modules linked in: osp(OE) mdd(OE) lod(OE) mdt(OE) lfsck(OE) mgc(OE) osd_ldiskfs(OE) ldiskfs(OE) lquota(OE) lustre(OE) lmv(OE) mdc(OE) osc(OE) lov(OE) fid(OE) fld(OE) ko2iblnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) libcfs(OE) rpcsec_gss_krb5 nfsv4 dns_resolver nfs lockd grace fscache rdma_ucm(OE) ib_ucm(OE) rdma_cm(OE) iw_cm(OE) ib_ipoib(OE) ib_cm(OE) ib_uverbs(OE) ib_umad(OE) mlx5_ib(OE) mlx5_core(OE) mlxfw(OE) mlx4_en(OE) dm_round_robin zfs(POE) zunicode(POE) zavl(POE) icp(POE) zcommon(POE) znvpair(POE) sb_edac intel_powerclamp coretemp spl(OE) intel_rapl iosf_mbi kvm irqbypass crc32_pclmul ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper ablk_helper cryptd ipmi_ssif pcspkr ipmi_si ipmi_devintf ipmi_msghandler iTCO_wdt iTCO_vendor_support joydev sg mei_me mei i2c_i801 lpc_ich ioatdma wmi shpchp auth_rpcgss dm_multipath dm_mod sunrpc ip_tables ext4 mbcache jbd2 sd_mod crc_t10dif crct10dif_generic mlx4_ib(OE) ib_core(OE) mgag200 drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm igb isci ahci ptp drm mlx4_core(OE) libsas libahci pps_core mpt3sas dca crct10dif_pclmul devlink i2c_algo_bit crct10dif_common crc32c_intel raid_class libata i2c_core mlx_compat(OE) scsi_transport_sas [ 9658.745591] CPU: 11 PID: 22180 Comm: mdt_out01_010 Tainted: P OE ------------ 3.10.0-862.9.1.el7_lustre.x86_64 #1 [ 9658.760612] Hardware name: Intel Corporation SandyBridge Platform/To be filled by O.E.M., BIOS SE5C600.86B.01.08.0003.022620131521 02/26/2013 [ 9658.777200] task: ffff9e132d6f2f70 ti: ffff9e12e2b24000 task.ti: ffff9e12e2b24000 [ 9658.786678] RIP: 0010:[<0000000000000000>] [< (null)>] (null) [ 9658.796193] RSP: 0018:ffff9e12e2b27b60 EFLAGS: 00010246 [ 9658.803358] RAX: 0000000000000000 RBX: ffff9e1715878000 RCX: 0000000000000002 [ 9658.812569] RDX: ffffffffc16877b0 RSI: ffff9e12e2b27b68 RDI: ffff9e1715878008 [ 9658.821879] RBP: ffff9e12e2b27ba0 R08: 0000000000000004 R09: 0000000000000000 [ 9658.831212] R10: 0000000000000001 R11: 00000000007fffff R12: ffff9e1317b6ad00 [ 9658.840559] R13: ffff9e16edd7c030 R14: ffff9e1714d4e800 R15: ffff9e1715878008 [ 9658.849980] FS: 0000000000000000(0000) GS:ffff9e172d8c0000(0000) knlGS:0000000000000000 [ 9658.860384] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 9658.868127] CR2: 0000000000000000 CR3: 000000073a60e000 CR4: 00000000000607e0 [ 9658.877444] Call Trace: [ 9658.881511] [<ffffffffc1685c3e>] ? osd_ldiskfs_it_fill+0xbe/0x260 [osd_ldiskfs] [ 9658.891221] [<ffffffffc1685e17>] osd_it_ea_load+0x37/0x100 [osd_ldiskfs] [ 9658.900163] [<ffffffffc0f47608>] dt_index_walk+0xf8/0x430 [obdclass] [ 9658.908835] [<ffffffffc0f47940>] ? dt_index_walk+0x430/0x430 [obdclass] [ 9658.917808] [<ffffffffc0f48a14>] dt_index_read+0x394/0x6a0 [obdclass] [ 9658.926634] [<ffffffffc11dfd22>] tgt_obd_idx_read+0x612/0x860 [ptlrpc] [ 9658.935280] [<ffffffffc11e2f3a>] tgt_request_handle+0xaea/0x1580 [ptlrpc] [ 9658.944303] [<ffffffffc11bea61>] ? ptlrpc_nrs_req_get_nolock0+0xd1/0x170 [ptlrpc] [ 9658.954245] [<ffffffffc0dc2bde>] ? ktime_get_real_seconds+0xe/0x10 [libcfs] [ 9658.963572] [<ffffffffc1189acb>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc] [ 9658.973654] [<ffffffffc1186955>] ? ptlrpc_wait_event+0xa5/0x360 [ptlrpc] [ 9658.982660] [<ffffffffaaccf682>] ? default_wake_function+0x12/0x20 [ 9658.991178] [<ffffffffaacc52ab>] ? __wake_up_common+0x5b/0x90 [ 9658.999219] [<ffffffffc118d2ec>] ptlrpc_main+0xafc/0x1fb0 [ptlrpc] [ 9659.007754] [<ffffffffc118c7f0>] ? ptlrpc_register_service+0xe90/0xe90 [ptlrpc] [ 9659.017507] [<ffffffffaacbb621>] kthread+0xd1/0xe0 [ 9659.024461] [<ffffffffaacbb550>] ? insert_kthread_work+0x40/0x40 [ 9659.032809] [<ffffffffab3205f7>] ret_from_fork_nospec_begin+0x21/0x21 [ 9659.041631] [<ffffffffaacbb550>] ? insert_kthread_work+0x40/0x40 [ 9659.049943] Code: Bad RIP value. [ 9659.055214] RIP [< (null)>] (null) [ 9659.062486] RSP <ffff9e12e2b27b60> [ 9659.067869] CR2: 0000000000000000 [ 9659.075547] ---[ end trace 8f1f37f93401bf0f ]--- l Corporation.All rights reserved. Version 2.00.1201.Copyright(c) 2010 - 2012 American Megatrends,Inc. Installed BIOS: SE5C600.86B.01.08.0003 ------system reboot, and after reboot-------- CentOS Linux 7 (Core) Kernel 3.10.0-862.9.1.el7_lustre.x86_64 on an x86_64 soak-10 login: [ 175.721668] LNet: HW NUMA nodes: 2, HW CPU cores: 32, npartitions: 2 [ 175.731983] alg: No test for adler32 (adler32-zlib) [ 176.631197] Lustre: Lustre: Build Version: 2.11.55_65_gec2e999 [ 176.925651] LNet: Using FMR for registration [ 176.943513] LNet: Added LNI 192.168.1.110@o2ib [8/256/0/180] [ 177.121204] LDISKFS-fs warning (device dm-6): ldiskfs_multi_mount_protect:321: MMP interval 42 higher than expected, please wait. [ 177.121204] [ 223.131715] LDISKFS-fs (dm-6): recovery complete [ 223.137107] LDISKFS-fs (dm-6): mounted filesystem with ordered data mode. Opts: user_xattr,errors=remount-ro,user_xattr,no_mbcache,nodelalloc [ 227.640081] Lustre: soaked-MDT0002: Not available for connect from 192.168.1.142@o2ib (not set up) [ 230.666682] Lustre: soaked-MDT0002: Imperative Recovery enabled, recovery window shrunk from 300-900 down to 150-900 [ 230.703152] LustreError: 4067:0:(llog_osd.c:1005:llog_osd_next_block()) soaked-MDT0000-osp-MDT0002: missed desired record? 3 > 1 [ 230.716141] LustreError: 4067:0:(lod_dev.c:428:lod_sub_recovery_thread()) soaked-MDT0000-osp-MDT0002 get update log failed: rc = -2 [ 234.808297] Lustre: soaked-MDT0002: Connection restored to 192.168.1.106@o2ib (at 192.168.1.106@o2ib) [ 237.283751] Lustre: soaked-MDT0002: Connection restored to e3d1d543-7071-4f1d-f9c5-f174c66a3f7e (at 192.168.1.143@o2ib) [ 237.295833] Lustre: Skipped 1 previous similar message [ 237.308240] Lustre: 4070:0:(ldlm_lib.c:2048:target_recovery_overseer()) recovery is aborted, evict exports in recovery [ 237.320897] Lustre: soaked-MDT0002: disconnecting 28 stale clients [ 237.327887] LustreError: 4070:0:(tgt_grant.c:248:tgt_grant_sanity_check()) mdt_obd_disconnect: tot_granted 2097152 != fo_tot_granted 4194304 [ 238.302385] Lustre: soaked-MDT0002: Connection restored to 1d05ec18-7f6a-896f-de54-063cf1f8b51c (at 192.168.1.122@o2ib) [ 238.314575] Lustre: Skipped 1 previous similar message [ 240.471517] Lustre: soaked-MDT0002: Connection restored to a225c1a8-609c-d769-404b-3c25112913a0 (at 192.168.1.141@o2ib) [ 240.483695] Lustre: Skipped 7 previous similar messages [ 244.492077] Lustre: soaked-MDT0002: Connection restored to a6836049-d02e-27ab-c254-5e76eb9cef2b (at 192.168.1.136@o2ib) [ 244.504273] Lustre: Skipped 4 previous similar messages [ 254.539127] Lustre: soaked-MDT0002: Connection restored to e6fd2edd-0c64-011a-f170-b8c415906b8c (at 192.168.1.124@o2ib) [ 254.551333] Lustre: Skipped 9 previous similar messages [ 273.870109] Lustre: soaked-MDT0002: Connection restored to 192.168.1.104@o2ib (at 192.168.1.104@o2ib) [ 273.880570] Lustre: Skipped 7 previous similar messages [ 292.450898] LustreError: 168-f: soaked-MDT0002: BAD WRITE CHECKSUM: from 12345-192.168.1.135@o2ib inode [0x28001e862:0x214:0x0] object 0x28001e862:532 extent [36-39]: client csum 2ff01ff, server csum 2ed01f6 [ 293.574529] LustreError: 168-f: soaked-MDT0002: BAD WRITE CHECKSUM: from 12345-192.168.1.135@o2ib inode [0x28001e862:0x214:0x0] object 0x28001e862:532 extent [36-39]: client csum 2ff01ff, server csum 2ed01f6 [ 295.414655] LustreError: 168-f: soaked-MDT0002: BAD WRITE CHECKSUM: from 12345-192.168.1.135@o2ib inode [0x28001e862:0x214:0x0] object 0x28001e862:532 extent [36-39]: client csum 2ff01ff, server csum 2ed01f6 [ 296.317843] BUG: unable to handle kernel NULL pointer dereference at 00000000000005ab [ 296.326718] IP: [<ffffffff8c3528c3>] rb_next+0x23/0x50 [ 296.332536] PGD 0 [ 296.334823] Oops: 0000 [#1] SMP [ 296.341568] Modules linked in: osp(OE) mdd(OE) lod(OE) mdt(OE) lfsck(OE) mgc(OE) osd_ldiskfs(OE) ldiskfs(OE) lquota(OE) fid(OE) fld(OE) ko2iblnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) libcfs(OE) rpcsec_gss_krb5 nfsv4 dns_resolver nfs lockd grace fscache rdma_ucm(OE) ib_ucm(OE) rdma_cm(OE) iw_cm(OE) ib_ipoib(OE) ib_cm(OE) ib_uverbs(OE) ib_umad(OE) mlx5_ib(OE) mlx5_core(OE) mlxfw(OE) mlx4_en(OE) dm_round_robin zfs(POE) zunicode(POE) zavl(POE) icp(POE) zcommon(POE) znvpair(POE) spl(OE) sb_edac intel_powerclamp coretemp intel_rapl iosf_mbi kvm irqbypass crc32_pclmul ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper ablk_helper cryptd iTCO_wdt iTCO_vendor_support joydev ipmi_ssif pcspkr mei_me sg mei i2c_i801 lpc_ich ipmi_si ipmi_devintf ipmi_msghandler wmi ioatdma shpchp dm_multipath dm_mod auth_rpcgss sunrpc ip_tables ext4 mbcache jbd2 sd_mod crc_t10dif crct10dif_generic mlx4_ib(OE) ib_core(OE) mgag200 drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops igb ttm ptp mlx4_core(OE) drm isci ahci mpt3sas pps_core libsas crct10dif_pclmul libahci devlink dca crct10dif_common i2c_algo_bit raid_class crc32c_intel libata i2c_core mlx_compat(OE) scsi_transport_sas [ 296.490014] CPU: 10 PID: 3968 Comm: mdt_out01_000 Tainted: P OE ------------ 3.10.0-862.9.1.el7_lustre.x86_64 #1 [ 296.505712] Hardware name: Intel Corporation S2600GZ ........../S2600GZ, BIOS SE5C600.86B.01.08.0003.022620131521 02/26/2013 [ 296.522014] task: ffff9193e349af70 ti: ffff9190a81f4000 task.ti: ffff9190a81f4000 [ 296.534051] RIP: 0010:[<ffffffff8c3528c3>] [<ffffffff8c3528c3>] rb_next+0x23/0x50 [ 296.545986] RSP: 0018:ffff9190a81f7a40 EFLAGS: 00010202 [ 296.555563] RAX: 000000000000059b RBX: ffff91981aea8d90 RCX: 0000000000000000 [ 296.567341] RDX: 000000000000059b RSI: ffff919815fb2d5e RDI: ffff919815fb2d28 [ 296.578836] RBP: ffff9190a81f7a40 R08: 000000000e800157 R09: 0000000000000004 [ 296.590438] R10: ffff91982ae9c500 R11: ffff91982ae9c500 R12: ffff91980d991900 [ 296.601802] R13: ffff91981aea8d90 R14: ffff91981f763000 R15: ffff91980df68008 [ 296.613303] FS: 0000000000000000(0000) GS:ffff91982d880000(0000) knlGS:0000000000000000 [ 296.625676] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 296.634865] CR2: 00000000000005ab CR3: 00000004ebe0e000 CR4: 00000000000607e0 [ 296.645657] Call Trace: [ 296.651676] [<ffffffffc162cd44>] ldiskfs_readdir+0x5b4/0x850 [ldiskfs] [ 296.662343] [<ffffffffc0944ef2>] ? fld_local_lookup+0x62/0x270 [fld] [ 296.672780] [<ffffffffc16ae7b0>] ? osd_object_alloc+0x360/0x360 [osd_ldiskfs] [ 296.684053] [<ffffffffc16acc3e>] osd_ldiskfs_it_fill+0xbe/0x260 [osd_ldiskfs] [ 296.695182] [<ffffffffc16acfa6>] osd_it_ea_next+0xc6/0x150 [osd_ldiskfs] [ 296.705729] [<ffffffffc1184ae8>] dt_index_page_build+0x1a8/0x470 [obdclass] [ 296.716658] [<ffffffffc11846b0>] dt_index_walk+0x1a0/0x430 [obdclass] [ 296.726761] [<ffffffffc1184940>] ? dt_index_walk+0x430/0x430 [obdclass] [ 296.737149] [<ffffffffc1185a14>] dt_index_read+0x394/0x6a0 [obdclass] [ 296.747492] [<ffffffffc141cd22>] tgt_obd_idx_read+0x612/0x860 [ptlrpc] [ 296.757860] [<ffffffffc141ff3a>] tgt_request_handle+0xaea/0x1580 [ptlrpc] [ 296.768476] [<ffffffffc13fba61>] ? ptlrpc_nrs_req_get_nolock0+0xd1/0x170 [ptlrpc] [ 296.779775] [<ffffffffc1038bde>] ? ktime_get_real_seconds+0xe/0x10 [libcfs] [ 296.790527] [<ffffffffc13c6acb>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc] [ 296.801935] [<ffffffffc13c3955>] ? ptlrpc_wait_event+0xa5/0x360 [ptlrpc] [ 296.812275] [<ffffffff8c0cf682>] ? default_wake_function+0x12/0x20 [ 296.822010] [<ffffffff8c0c52ab>] ? __wake_up_common+0x5b/0x90 [ 296.831226] [<ffffffffc13ca2ec>] ptlrpc_main+0xafc/0x1fb0 [ptlrpc] [ 296.840865] [<ffffffffc13c97f0>] ? ptlrpc_register_service+0xe90/0xe90 [ptlrpc] [ 296.851675] [<ffffffff8c0bb621>] kthread+0xd1/0xe0 [ 296.859622] [<ffffffff8c0bb550>] ? insert_kthread_work+0x40/0x40 [ 296.868941] [<ffffffff8c7205f7>] ret_from_fork_nospec_begin+0x21/0x21 [ 296.878704] [<ffffffff8c0bb550>] ? insert_kthread_work+0x40/0x40 [ 296.888023] Code: c0 5d c3 0f 1f 44 00 00 55 48 8b 17 48 89 e5 48 39 d7 74 3b 48 8b 47 08 48 85 c0 75 0e eb 25 66 0f 1f 84 00 00 00 00 00 48 89 d0 <48> 8b 50 10 48 85 d2 75 f4 5d c3 66 90 48 3b 78 08 75 f6 48 8b [ 296.915195] RIP [<ffffffff8c3528c3>] rb_next+0x23/0x50 [ 296.923524] RSP <ffff9190a81f7a40> [ 296.929891] CR2: 00000000000005ab [ 296.935844] ---[ end trace 7f7dc5e5140c0c8b ]---

            People

              ys Yang Sheng
              sarah Sarah Liu
              Votes:
              0 Vote for this issue
              Watchers:
              15 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: