[LU-11457] osd_oi_insert(): the FID is used by two objects Created: 02/Oct/18  Updated: 02/Oct/23  Resolved: 31/Aug/23

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.12.0, Lustre 2.12.2, Lustre 2.12.4, Lustre 2.12.5
Fix Version/s: Lustre 2.16.0

Type: Bug Priority: Critical
Reporter: Sarah Liu Assignee: Yang Sheng
Resolution: Fixed Votes: 0
Labels: soak

Attachments: File soak-11.log-20180930.gz     Text File vmcore-dmesg.txt    
Issue Links:
Related
is related to LU-5314 Lustre 2.4.2 MDS hit LBUG and crash Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

tag-2.11.55

MDS crash

[15650.670434] device-mapper: multipath: Failing path 8:96.^M
[15650.765276] BUG: unable to handle kernel NULL pointer dereference at           (null)^M
[15650.775741] IP: [<          (null)>]           (null)^M
[15650.783081] PGD 0 ^M
[15650.786948] Oops: 0010 [#1] SMP ^M
[15650.792218] Modules linked in: osp(OE) mdd(OE) lod(OE) mdt(OE) lfsck(OE) mgc(OE) osd_ldiskfs(OE) ldiskfs(OE) lquota(OE) lustre(OE) lmv(OE) mdc(OE) osc(OE) lov(OE) fid(OE) fld(OE) ko2iblnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) libcfs(OE) rpcsec_gss_krb5 nfsv4 dns_resolver nfs lockd grace fscache rdma_ucm(OE) ib_ucm(OE) rdma_cm(OE) iw_cm(OE) ib_ipoib(OE) ib_cm(OE) ib_uverbs(OE) ib_umad(OE) mlx5_ib(OE) mlx5_core(OE) mlxfw(OE) mlx4_en(OE) dm_round_robin zfs(POE) zunicode(POE) zavl(POE) icp(POE) zcommon(POE) znvpair(POE) spl(OE) sb_edac intel_powerclamp coretemp intel_rapl iosf_mbi kvm irqbypass crc32_pclmul ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper ablk_helper cryptd iTCO_wdt joydev pcspkr ipmi_ssif iTCO_vendor_support sg ipmi_si ipmi_devintf shpchp ipmi_msghandler i2c_i801 mei_me ioatdma mei lpc_ich wmi dm_multipath dm_mod auth_rpcgss sunrpc ip_tables ext4 mbcache jbd2 sd_mod crc_t10dif crct10dif_generic mlx4_ib(OE) ib_core(OE) mgag200 drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm drm igb isci ahci ptp mlx4_core(OE) mpt3sas libsas libahci pps_core dca crct10dif_pclmul devlink i2c_algo_bit crct10dif_common raid_class crc32c_intel libata i2c_core mlx_compat(OE) scsi_transport_sas^M
[15650.934649] CPU: 14 PID: 9491 Comm: mdt_rdpg01_008 Tainted: P           OE  ------------   3.10.0-862.9.1.el7_lustre.x86_64 #1^M
[15650.952002] Hardware name: Intel Corporation S2600GZ ........../S2600GZ, BIOS SE5C600.86B.01.08.0003.022620131521 02/26/2013^M
[15650.966961] task: ffff8be6a1253f40 ti: ffff8be68a044000 task.ti: ffff8be68a044000^M
[15650.977791] RIP: 0010:[<0000000000000000>]  [<          (null)>]           (null)^M
[15650.988693] RSP: 0018:ffff8be68a047b58  EFLAGS: 00010246^M
[15650.997167] RAX: 0000000000000000 RBX: ffff8be68b820000 RCX: 0000000000000002^M
[15651.007733] RDX: ffffffffc164c7b0 RSI: ffff8be68a047b60 RDI: ffff8be68b820008^M
[15651.018326] RBP: ffff8be68a047b98 R08: 0000000000000004 R09: 0000000000000000^M
[15651.028930] R10: 0000000000000001 R11: 00000000007fffff R12: ffff8be26f9fab00^M
[15651.039547] R13: ffff8be279a448a0 R14: ffff8be68a160000 R15: ffff8be68b820008^M
[15651.050168] FS:  0000000000000000(0000) GS:ffff8be6ad980000(0000) knlGS:0000000000000000^M
[15651.061880] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033^M
[15651.070980] CR2: 0000000000000000 CR3: 000000042c3b6000 CR4: 00000000000607e0^M
[15651.081660] Call Trace:^M
[15651.087091]  [<ffffffffc164ac3e>] ? osd_ldiskfs_it_fill+0xbe/0x260 [osd_ldiskfs]^M
[15651.098058]  [<ffffffffc164ae17>] osd_it_ea_load+0x37/0x100 [osd_ldiskfs]^M
[15651.108370]  [<ffffffffc188eb47>] lod_it_load+0x27/0x90 [lod]^M
[15651.117554]  [<ffffffffc0f48808>] dt_index_walk+0xf8/0x430 [obdclass]^M
[15651.127457]  [<ffffffffc1915080>] ? mdd_object_lock+0xe0/0xe0 [mdd]^M
[15651.137132]  [<ffffffffc1916d9f>] mdd_readpage+0x25f/0x5a0 [mdd]^M
[15651.146553]  [<ffffffffc1782bda>] mdt_readpage+0x63a/0x880 [mdt]^M
[15651.155992]  [<ffffffffc11e82ca>] tgt_request_handle+0xaea/0x1580 [ptlrpc]^M
[15651.166379]  [<ffffffffc11c02e1>] ? ptlrpc_nrs_req_get_nolock0+0xd1/0x170 [ptlrpc]^M
[15651.177493]  [<ffffffffc0dfcbde>] ? ktime_get_real_seconds+0xe/0x10 [libcfs]^M
[15651.188033]  [<ffffffffc118b48b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]^M
[15651.199251]  [<ffffffffc1188315>] ? ptlrpc_wait_event+0xa5/0x360 [ptlrpc]^M
[15651.209399]  [<ffffffff83ccf682>] ? default_wake_function+0x12/0x20^M
[15651.218931]  [<ffffffff83cc52ab>] ? __wake_up_common+0x5b/0x90^M
[15651.228026]  [<ffffffffc118ecc4>] ptlrpc_main+0xb14/0x1fb0 [ptlrpc]^M
[15651.237575]  [<ffffffffc118e1b0>] ? ptlrpc_register_service+0xe90/0xe90 [ptlrpc]^M
[15651.248365]  [<ffffffff83cbb621>] kthread+0xd1/0xe0^M
[15651.256344]  [<ffffffff83cbb550>] ? insert_kthread_work+0x40/0x40^M
[15651.265688]  [<ffffffff843205f7>] ret_from_fork_nospec_begin+0x21/0x21^M
[15651.275475]  [<ffffffff83cbb550>] ? insert_kthread_work+0x40/0x40^M
[15651.284736] Code:  Bad RIP value.^M
[15651.290946] RIP  [<          (null)>]           (null)^M
[15651.299236]  RSP <ffff8be68a047b58>^M
[15651.305543] CR2: 0000000000000000^M
[15651.315778] ---[ end trace 4ae4238c00f9aeec ]---^M
[15651.336386] Kernel panic - not syncing: Fatal exception^M
[15651.344613] Kernel Offset: 0x2c00000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)^M
[15651.369289] ------------[ cut here ]------------^M
[15651.376397] WARNING: CPU: 14 PID: 9491 at arch/x86/kernel/smp.c:127 native_smp_send_reschedule+0x65/0x70^M
[15651.388915] Modules linked in: osp(OE) mdd(OE) lod(OE) mdt(OE) lfsck(OE) mgc(OE) osd_ldiskfs(OE) ldiskfs(OE) lquota(OE) lustre(OE) lmv(OE) mdc(OE) osc(OE) lov(OE) fid(OE) fld(OE) ko2iblnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) libcfs(OE) rpcsec_gss_krb5 nfsv4 dns_resolver nfs lockd grace fscache rdma_ucm(OE) ib_ucm(OE) rdma_cm(OE) iw_cm(OE) ib_ipoib(OE) ib_cm(OE) ib_uverbs(OE) ib_umad(OE) mlx5_ib(OE) mlx5_core(OE) mlxfw(OE) mlx4_en(OE) dm_round_robin zfs(POE) zunicode(POE) zavl(POE) icp(POE) zcommon(POE) znvpair(POE) spl(OE) sb_edac intel_powerclamp coretemp intel_rapl iosf_mbi kvm irqbypass crc32_pclmul ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper ablk_helper cryptd iTCO_wdt joydev pcspkr ipmi_ssif iTCO_vendor_support sg ipmi_si ipmi_devintf shpchp ipmi_msghandler i2c_i801 mei_me ioatdma mei lpc_ich wmi dm_multipath dm_mod auth_rpcgss sunrpc ip_tables ext4 mbcache jbd2 sd_mod crc_t10dif crct10dif_generic mlx4_ib(OE) ib_core(OE) mgag200 drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm drm igb isci ahci ptp mlx4_core(OE) mpt3sas libsas libahci pps_core dca crct10dif_pclmul devlink i2c_algo_bit crct10dif_common raid_class crc32c_intel libata i2c_core mlx_compat(OE) scsi_transport_sas^M
[15651.529620] CPU: 14 PID: 9491 Comm: mdt_rdpg01_008 Tainted: P      D    OE  ------------   3.10.0-862.9.1.el7_lustre.x86_64 #1^M
[15651.546472] Hardware name: Intel Corporation S2600GZ ........../S2600GZ, BIOS SE5C600.86B.01.08.0003.022620131521 02/26/2013^M
[15651.561156] Call Trace:^M
[15651.566023]  <IRQ>  [<ffffffff8430e84e>] dump_stack+0x19/0x1b^M
[15651.574646]  [<ffffffff83c91e18>] __warn+0xd8/0x100^M
[15651.582224]  [<ffffffff83c91f5d>] warn_slowpath_null+0x1d/0x20^M
[15651.590851]  [<ffffffff83c54e95>] native_smp_send_reschedule+0x65/0x70^M
[15651.600279]  [<ffffffff83cddf81>] trigger_load_balance+0x191/0x280^M
[15651.609280]  [<ffffffff83ccdc0a>] scheduler_tick+0x10a/0x150^M
[15651.617702]  [<ffffffff83d01c10>] ? tick_sched_do_timer+0x50/0x50^M
[15651.626619]  [<ffffffff83ca4f65>] update_process_times+0x65/0x80^M
[15651.635416]  [<ffffffff83d01a10>] tick_sched_handle+0x30/0x70^M
[15651.643916]  [<ffffffff83d01c49>] tick_sched_timer+0x39/0x80^M
[15651.652315]  [<ffffffff83cbf7e6>] __hrtimer_run_queues+0xd6/0x260^M
[15651.661210]  [<ffffffff83cbfd7f>] hrtimer_interrupt+0xaf/0x1d0^M
[15651.669814]  [<ffffffff83c5847b>] local_apic_timer_interrupt+0x3b/0x60^M
[15651.679184]  [<ffffffff84325063>] smp_apic_timer_interrupt+0x43/0x60^M
[15651.688352]  [<ffffffff843217b2>] apic_timer_interrupt+0x162/0x170^M
[15651.697316]  <EOI>  [<ffffffff84308c3d>] ? panic+0x1d5/0x21f^M
[15651.705715]  [<ffffffff84308ba1>] ? panic+0x139/0x21f^M
[15651.713430]  [<ffffffff84318745>] oops_end+0xc5/0xe0^M
[15651.721020]  [<ffffffff8430807e>] no_context+0x285/0x2a8^M
[15651.728984]  [<ffffffff84308115>] __bad_area_nosemaphore+0x74/0x1d1^M
[15651.738014]  [<ffffffff84308286>] bad_area_nosemaphore+0x14/0x16^M
[15651.746760]  [<ffffffff8431b6e0>] __do_page_fault+0x330/0x4f0^M
[15651.755199]  [<ffffffff8431b8d5>] do_page_fault+0x35/0x90^M
[15651.763264]  [<ffffffff84317758>] page_fault+0x28/0x30^M
[15651.771013]  [<ffffffffc164c7b0>] ? osd_object_alloc+0x360/0x360 [osd_ldiskfs]^M
[15651.781105]  [<ffffffffc164ac3e>] ? osd_ldiskfs_it_fill+0xbe/0x260 [osd_ldiskfs]^M
[15651.791402]  [<ffffffffc164ae17>] osd_it_ea_load+0x37/0x100 [osd_ldiskfs]^M
[15651.801028]  [<ffffffffc188eb47>] lod_it_load+0x27/0x90 [lod]^M
[15651.809517]  [<ffffffffc0f48808>] dt_index_walk+0xf8/0x430 [obdclass]^M
[15651.818761]  [<ffffffffc1915080>] ? mdd_object_lock+0xe0/0xe0 [mdd]^M
[15651.827808]  [<ffffffffc1916d9f>] mdd_readpage+0x25f/0x5a0 [mdd]^M
[15651.836533]  [<ffffffffc1782bda>] mdt_readpage+0x63a/0x880 [mdt]^M
[15651.845269]  [<ffffffffc11e82ca>] tgt_request_handle+0xaea/0x1580 [ptlrpc]^M
[15651.854937]  [<ffffffffc11c02e1>] ? ptlrpc_nrs_req_get_nolock0+0xd1/0x170 [ptlrpc]^M
[15651.865302]  [<ffffffffc0dfcbde>] ? ktime_get_real_seconds+0xe/0x10 [libcfs]^M
[15651.875084]  [<ffffffffc118b48b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]^M
[15651.885522]  [<ffffffffc1188315>] ? ptlrpc_wait_event+0xa5/0x360 [ptlrpc]^M
[15651.894881]  [<ffffffff83ccf682>] ? default_wake_function+0x12/0x20^M
[15651.903621]  [<ffffffff83cc52ab>] ? __wake_up_common+0x5b/0x90^M
[15651.911880]  [<ffffffffc118ecc4>] ptlrpc_main+0xb14/0x1fb0 [ptlrpc]^M
[15651.920585]  [<ffffffffc118e1b0>] ? ptlrpc_register_service+0xe90/0xe90 [ptlrpc]^M
[15651.930460]  [<ffffffff83cbb621>] kthread+0xd1/0xe0^M
[15651.937468]  [<ffffffff83cbb550>] ? insert_kthread_work+0x40/0x40^M
[15651.945794]  [<ffffffff843205f7>] ret_from_fork_nospec_begin+0x21/0x21^M
[15651.954564]  [<ffffffff83cbb550>] ? insert_kthread_work+0x40/0x40^M
[15651.962806] ---[ end trace 4ae4238c00f9aeed ]---^M



 Comments   
Comment by Peter Jones [ 02/Oct/18 ]

Lai

Does this appear to be related to recent changes under LU-4684?

Peter

Comment by Lai Siyao [ 10/Oct/18 ]

Sarah, did you run dir migration test in soak test? And how many times have you seen this crash?

Comment by Sarah Liu [ 10/Oct/18 ]

Hi Lai,

No, I didn't run dir migration. When I filed the ticket, it was the first time seen the problem, but I restarted soak on tip of master on Monday with build #137, hit the same issue again in about 2 hours run. Here is the trace from the latest failure:

on soak-10

[ 9654.637324] blk_cloned_rq_check_limits: over max size limit.
[ 9658.590523] BUG: unable to handle kernel NULL pointer dereference at           (null)
[ 9658.600366] IP: [<          (null)>]           (null)
[ 9658.606866] PGD 0 
[ 9658.609984] Oops: 0010 [#1] SMP 
[ 9658.614447] Modules linked in: osp(OE) mdd(OE) lod(OE) mdt(OE) lfsck(OE) mgc(OE) osd_ldiskfs(OE) ldiskfs(OE) lquota(OE) lustre(OE) lmv(OE) mdc(OE) osc(OE) lov(OE) fid(OE) fld(OE) ko2iblnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) libcfs(OE) rpcsec_gss_krb5 nfsv4 dns_resolver nfs lockd grace fscache rdma_ucm(OE) ib_ucm(OE) rdma_cm(OE) iw_cm(OE) ib_ipoib(OE) ib_cm(OE) ib_uverbs(OE) ib_umad(OE) mlx5_ib(OE) mlx5_core(OE) mlxfw(OE) mlx4_en(OE) dm_round_robin zfs(POE) zunicode(POE) zavl(POE) icp(POE) zcommon(POE) znvpair(POE) sb_edac intel_powerclamp coretemp spl(OE) intel_rapl iosf_mbi kvm irqbypass crc32_pclmul ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper ablk_helper cryptd ipmi_ssif pcspkr ipmi_si ipmi_devintf ipmi_msghandler iTCO_wdt iTCO_vendor_support joydev sg mei_me mei i2c_i801 lpc_ich ioatdma wmi shpchp auth_rpcgss dm_multipath dm_mod sunrpc ip_tables ext4 mbcache jbd2 sd_mod crc_t10dif crct10dif_generic mlx4_ib(OE) ib_core(OE) mgag200 drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm igb isci ahci ptp drm mlx4_core(OE) libsas libahci pps_core mpt3sas dca crct10dif_pclmul devlink i2c_algo_bit crct10dif_common crc32c_intel raid_class libata i2c_core mlx_compat(OE) scsi_transport_sas
[ 9658.745591] CPU: 11 PID: 22180 Comm: mdt_out01_010 Tainted: P           OE  ------------   3.10.0-862.9.1.el7_lustre.x86_64 #1
[ 9658.760612] Hardware name: Intel Corporation SandyBridge Platform/To be filled by O.E.M., BIOS SE5C600.86B.01.08.0003.022620131521 02/26/2013
[ 9658.777200] task: ffff9e132d6f2f70 ti: ffff9e12e2b24000 task.ti: ffff9e12e2b24000
[ 9658.786678] RIP: 0010:[<0000000000000000>]  [<          (null)>]           (null)
[ 9658.796193] RSP: 0018:ffff9e12e2b27b60  EFLAGS: 00010246
[ 9658.803358] RAX: 0000000000000000 RBX: ffff9e1715878000 RCX: 0000000000000002
[ 9658.812569] RDX: ffffffffc16877b0 RSI: ffff9e12e2b27b68 RDI: ffff9e1715878008
[ 9658.821879] RBP: ffff9e12e2b27ba0 R08: 0000000000000004 R09: 0000000000000000
[ 9658.831212] R10: 0000000000000001 R11: 00000000007fffff R12: ffff9e1317b6ad00
[ 9658.840559] R13: ffff9e16edd7c030 R14: ffff9e1714d4e800 R15: ffff9e1715878008
[ 9658.849980] FS:  0000000000000000(0000) GS:ffff9e172d8c0000(0000) knlGS:0000000000000000
[ 9658.860384] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 9658.868127] CR2: 0000000000000000 CR3: 000000073a60e000 CR4: 00000000000607e0
[ 9658.877444] Call Trace:
[ 9658.881511]  [<ffffffffc1685c3e>] ? osd_ldiskfs_it_fill+0xbe/0x260 [osd_ldiskfs]
[ 9658.891221]  [<ffffffffc1685e17>] osd_it_ea_load+0x37/0x100 [osd_ldiskfs]
[ 9658.900163]  [<ffffffffc0f47608>] dt_index_walk+0xf8/0x430 [obdclass]
[ 9658.908835]  [<ffffffffc0f47940>] ? dt_index_walk+0x430/0x430 [obdclass]
[ 9658.917808]  [<ffffffffc0f48a14>] dt_index_read+0x394/0x6a0 [obdclass]
[ 9658.926634]  [<ffffffffc11dfd22>] tgt_obd_idx_read+0x612/0x860 [ptlrpc]
[ 9658.935280]  [<ffffffffc11e2f3a>] tgt_request_handle+0xaea/0x1580 [ptlrpc]
[ 9658.944303]  [<ffffffffc11bea61>] ? ptlrpc_nrs_req_get_nolock0+0xd1/0x170 [ptlrpc]
[ 9658.954245]  [<ffffffffc0dc2bde>] ? ktime_get_real_seconds+0xe/0x10 [libcfs]
[ 9658.963572]  [<ffffffffc1189acb>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
[ 9658.973654]  [<ffffffffc1186955>] ? ptlrpc_wait_event+0xa5/0x360 [ptlrpc]
[ 9658.982660]  [<ffffffffaaccf682>] ? default_wake_function+0x12/0x20
[ 9658.991178]  [<ffffffffaacc52ab>] ? __wake_up_common+0x5b/0x90
[ 9658.999219]  [<ffffffffc118d2ec>] ptlrpc_main+0xafc/0x1fb0 [ptlrpc]
[ 9659.007754]  [<ffffffffc118c7f0>] ? ptlrpc_register_service+0xe90/0xe90 [ptlrpc]
[ 9659.017507]  [<ffffffffaacbb621>] kthread+0xd1/0xe0
[ 9659.024461]  [<ffffffffaacbb550>] ? insert_kthread_work+0x40/0x40
[ 9659.032809]  [<ffffffffab3205f7>] ret_from_fork_nospec_begin+0x21/0x21
[ 9659.041631]  [<ffffffffaacbb550>] ? insert_kthread_work+0x40/0x40
[ 9659.049943] Code:  Bad RIP value.
[ 9659.055214] RIP  [<          (null)>]           (null)
[ 9659.062486]  RSP <ffff9e12e2b27b60>
[ 9659.067869] CR2: 0000000000000000
[ 9659.075547] ---[ end trace 8f1f37f93401bf0f ]---
l Corporation.All rights reserved. 
Version 2.00.1201.Copyright(c) 2010 - 2012 American Megatrends,Inc. 
Installed BIOS: SE5C600.86B.01.08.0003

------system reboot, and after reboot--------

CentOS Linux 7 (Core)
Kernel 3.10.0-862.9.1.el7_lustre.x86_64 on an x86_64

soak-10 login: [  175.721668] LNet: HW NUMA nodes: 2, HW CPU cores: 32, npartitions: 2
[  175.731983] alg: No test for adler32 (adler32-zlib)
[  176.631197] Lustre: Lustre: Build Version: 2.11.55_65_gec2e999
[  176.925651] LNet: Using FMR for registration
[  176.943513] LNet: Added LNI 192.168.1.110@o2ib [8/256/0/180]
[  177.121204] LDISKFS-fs warning (device dm-6): ldiskfs_multi_mount_protect:321: MMP interval 42 higher than expected, please wait.
[  177.121204] 
[  223.131715] LDISKFS-fs (dm-6): recovery complete
[  223.137107] LDISKFS-fs (dm-6): mounted filesystem with ordered data mode. Opts: user_xattr,errors=remount-ro,user_xattr,no_mbcache,nodelalloc
[  227.640081] Lustre: soaked-MDT0002: Not available for connect from 192.168.1.142@o2ib (not set up)
[  230.666682] Lustre: soaked-MDT0002: Imperative Recovery enabled, recovery window shrunk from 300-900 down to 150-900
[  230.703152] LustreError: 4067:0:(llog_osd.c:1005:llog_osd_next_block()) soaked-MDT0000-osp-MDT0002: missed desired record? 3 > 1
[  230.716141] LustreError: 4067:0:(lod_dev.c:428:lod_sub_recovery_thread()) soaked-MDT0000-osp-MDT0002 get update log failed: rc = -2
[  234.808297] Lustre: soaked-MDT0002: Connection restored to 192.168.1.106@o2ib (at 192.168.1.106@o2ib)
[  237.283751] Lustre: soaked-MDT0002: Connection restored to e3d1d543-7071-4f1d-f9c5-f174c66a3f7e (at 192.168.1.143@o2ib)
[  237.295833] Lustre: Skipped 1 previous similar message
[  237.308240] Lustre: 4070:0:(ldlm_lib.c:2048:target_recovery_overseer()) recovery is aborted, evict exports in recovery
[  237.320897] Lustre: soaked-MDT0002: disconnecting 28 stale clients
[  237.327887] LustreError: 4070:0:(tgt_grant.c:248:tgt_grant_sanity_check()) mdt_obd_disconnect: tot_granted 2097152 != fo_tot_granted 4194304
[  238.302385] Lustre: soaked-MDT0002: Connection restored to 1d05ec18-7f6a-896f-de54-063cf1f8b51c (at 192.168.1.122@o2ib)
[  238.314575] Lustre: Skipped 1 previous similar message
[  240.471517] Lustre: soaked-MDT0002: Connection restored to a225c1a8-609c-d769-404b-3c25112913a0 (at 192.168.1.141@o2ib)
[  240.483695] Lustre: Skipped 7 previous similar messages
[  244.492077] Lustre: soaked-MDT0002: Connection restored to a6836049-d02e-27ab-c254-5e76eb9cef2b (at 192.168.1.136@o2ib)
[  244.504273] Lustre: Skipped 4 previous similar messages
[  254.539127] Lustre: soaked-MDT0002: Connection restored to e6fd2edd-0c64-011a-f170-b8c415906b8c (at 192.168.1.124@o2ib)
[  254.551333] Lustre: Skipped 9 previous similar messages
[  273.870109] Lustre: soaked-MDT0002: Connection restored to 192.168.1.104@o2ib (at 192.168.1.104@o2ib)
[  273.880570] Lustre: Skipped 7 previous similar messages
[  292.450898] LustreError: 168-f: soaked-MDT0002: BAD WRITE CHECKSUM: from 12345-192.168.1.135@o2ib inode [0x28001e862:0x214:0x0] object 0x28001e862:532 extent [36-39]: client csum 2ff01ff, server csum 2ed01f6
[  293.574529] LustreError: 168-f: soaked-MDT0002: BAD WRITE CHECKSUM: from 12345-192.168.1.135@o2ib inode [0x28001e862:0x214:0x0] object 0x28001e862:532 extent [36-39]: client csum 2ff01ff, server csum 2ed01f6
[  295.414655] LustreError: 168-f: soaked-MDT0002: BAD WRITE CHECKSUM: from 12345-192.168.1.135@o2ib inode [0x28001e862:0x214:0x0] object 0x28001e862:532 extent [36-39]: client csum 2ff01ff, server csum 2ed01f6
[  296.317843] BUG: unable to handle kernel NULL pointer dereference at 00000000000005ab
[  296.326718] IP: [<ffffffff8c3528c3>] rb_next+0x23/0x50
[  296.332536] PGD 0 
[  296.334823] Oops: 0000 [#1] SMP 
[  296.341568] Modules linked in: osp(OE) mdd(OE) lod(OE) mdt(OE) lfsck(OE) mgc(OE) osd_ldiskfs(OE) ldiskfs(OE) lquota(OE) fid(OE) fld(OE) ko2iblnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) libcfs(OE) rpcsec_gss_krb5 nfsv4 dns_resolver nfs lockd grace fscache rdma_ucm(OE) ib_ucm(OE) rdma_cm(OE) iw_cm(OE) ib_ipoib(OE) ib_cm(OE) ib_uverbs(OE) ib_umad(OE) mlx5_ib(OE) mlx5_core(OE) mlxfw(OE) mlx4_en(OE) dm_round_robin zfs(POE) zunicode(POE) zavl(POE) icp(POE) zcommon(POE) znvpair(POE) spl(OE) sb_edac intel_powerclamp coretemp intel_rapl iosf_mbi kvm irqbypass crc32_pclmul ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper ablk_helper cryptd iTCO_wdt iTCO_vendor_support joydev ipmi_ssif pcspkr mei_me sg mei i2c_i801 lpc_ich ipmi_si ipmi_devintf ipmi_msghandler wmi ioatdma shpchp dm_multipath dm_mod auth_rpcgss sunrpc ip_tables ext4 mbcache jbd2 sd_mod crc_t10dif crct10dif_generic mlx4_ib(OE) ib_core(OE) mgag200 drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops igb ttm ptp mlx4_core(OE) drm isci ahci mpt3sas pps_core libsas crct10dif_pclmul libahci devlink dca crct10dif_common i2c_algo_bit raid_class crc32c_intel libata i2c_core mlx_compat(OE) scsi_transport_sas
[  296.490014] CPU: 10 PID: 3968 Comm: mdt_out01_000 Tainted: P           OE  ------------   3.10.0-862.9.1.el7_lustre.x86_64 #1
[  296.505712] Hardware name: Intel Corporation S2600GZ ........../S2600GZ, BIOS SE5C600.86B.01.08.0003.022620131521 02/26/2013
[  296.522014] task: ffff9193e349af70 ti: ffff9190a81f4000 task.ti: ffff9190a81f4000
[  296.534051] RIP: 0010:[<ffffffff8c3528c3>]  [<ffffffff8c3528c3>] rb_next+0x23/0x50
[  296.545986] RSP: 0018:ffff9190a81f7a40  EFLAGS: 00010202
[  296.555563] RAX: 000000000000059b RBX: ffff91981aea8d90 RCX: 0000000000000000
[  296.567341] RDX: 000000000000059b RSI: ffff919815fb2d5e RDI: ffff919815fb2d28
[  296.578836] RBP: ffff9190a81f7a40 R08: 000000000e800157 R09: 0000000000000004
[  296.590438] R10: ffff91982ae9c500 R11: ffff91982ae9c500 R12: ffff91980d991900
[  296.601802] R13: ffff91981aea8d90 R14: ffff91981f763000 R15: ffff91980df68008
[  296.613303] FS:  0000000000000000(0000) GS:ffff91982d880000(0000) knlGS:0000000000000000
[  296.625676] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  296.634865] CR2: 00000000000005ab CR3: 00000004ebe0e000 CR4: 00000000000607e0
[  296.645657] Call Trace:
[  296.651676]  [<ffffffffc162cd44>] ldiskfs_readdir+0x5b4/0x850 [ldiskfs]
[  296.662343]  [<ffffffffc0944ef2>] ? fld_local_lookup+0x62/0x270 [fld]
[  296.672780]  [<ffffffffc16ae7b0>] ? osd_object_alloc+0x360/0x360 [osd_ldiskfs]
[  296.684053]  [<ffffffffc16acc3e>] osd_ldiskfs_it_fill+0xbe/0x260 [osd_ldiskfs]
[  296.695182]  [<ffffffffc16acfa6>] osd_it_ea_next+0xc6/0x150 [osd_ldiskfs]
[  296.705729]  [<ffffffffc1184ae8>] dt_index_page_build+0x1a8/0x470 [obdclass]
[  296.716658]  [<ffffffffc11846b0>] dt_index_walk+0x1a0/0x430 [obdclass]
[  296.726761]  [<ffffffffc1184940>] ? dt_index_walk+0x430/0x430 [obdclass]
[  296.737149]  [<ffffffffc1185a14>] dt_index_read+0x394/0x6a0 [obdclass]
[  296.747492]  [<ffffffffc141cd22>] tgt_obd_idx_read+0x612/0x860 [ptlrpc]
[  296.757860]  [<ffffffffc141ff3a>] tgt_request_handle+0xaea/0x1580 [ptlrpc]
[  296.768476]  [<ffffffffc13fba61>] ? ptlrpc_nrs_req_get_nolock0+0xd1/0x170 [ptlrpc]
[  296.779775]  [<ffffffffc1038bde>] ? ktime_get_real_seconds+0xe/0x10 [libcfs]
[  296.790527]  [<ffffffffc13c6acb>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
[  296.801935]  [<ffffffffc13c3955>] ? ptlrpc_wait_event+0xa5/0x360 [ptlrpc]
[  296.812275]  [<ffffffff8c0cf682>] ? default_wake_function+0x12/0x20
[  296.822010]  [<ffffffff8c0c52ab>] ? __wake_up_common+0x5b/0x90
[  296.831226]  [<ffffffffc13ca2ec>] ptlrpc_main+0xafc/0x1fb0 [ptlrpc]
[  296.840865]  [<ffffffffc13c97f0>] ? ptlrpc_register_service+0xe90/0xe90 [ptlrpc]
[  296.851675]  [<ffffffff8c0bb621>] kthread+0xd1/0xe0
[  296.859622]  [<ffffffff8c0bb550>] ? insert_kthread_work+0x40/0x40
[  296.868941]  [<ffffffff8c7205f7>] ret_from_fork_nospec_begin+0x21/0x21
[  296.878704]  [<ffffffff8c0bb550>] ? insert_kthread_work+0x40/0x40
[  296.888023] Code: c0 5d c3 0f 1f 44 00 00 55 48 8b 17 48 89 e5 48 39 d7 74 3b 48 8b 47 08 48 85 c0 75 0e eb 25 66 0f 1f 84 00 00 00 00 00 48 89 d0 <48> 8b 50 10 48 85 d2 75 f4 5d c3 66 90 48 3b 78 08 75 f6 48 8b 
[  296.915195] RIP  [<ffffffff8c3528c3>] rb_next+0x23/0x50
[  296.923524]  RSP <ffff9190a81f7a40>
[  296.929891] CR2: 00000000000005ab
[  296.935844] ---[ end trace 7f7dc5e5140c0c8b ]---


Comment by Peter Jones [ 12/Oct/18 ]

Yang Sheng

This seems to be an ldiskfs issue. It has only appeared between the 2.11.54 and 2.11.55 tags. Any suggestions?

Peter

Comment by Yang Sheng [ 12/Oct/18 ]

Maybe some issue like LDEV-642, Let me do further investigation.

Comment by Yang Sheng [ 12/Oct/18 ]
[15638.646875] blk_cloned_rq_check_limits: over max size limit.
[15638.646879] blk_cloned_rq_check_limits: over max size limit.
[15638.646888] blk_cloned_rq_check_limits: over max size limit.
[15638.646890] blk_cloned_rq_check_limits: over max size limit.
[15638.646892] blk_cloned_rq_check_limits: over max size limit.
[15638.646895] blk_cloned_rq_check_limits: over max size limit.
[15638.646916] device-mapper: multipath: Failing path 8:96.

This issue should be fixed by LU-9551. Looks like still there? But so great we can reproduce it on our cluster.

Comment by John Hammond [ 30/Oct/18 ]

Which build was this? Was it lustre-master-patchless #137? Is there a more recent build (recent enough that we still have the RPMs) that has this issue?

Comment by Sarah Liu [ 30/Oct/18 ]

hi John, it is lustre-master-ib build 137. No, I don't see the failure recently, soak is running on lustre-lustre-ib build 142 right now

Comment by Peter Jones [ 30/Oct/18 ]

So can we close this ticket as cannot repo? Previously this had been happening every couple of hours, right?

Comment by Sarah Liu [ 30/Oct/18 ]

I think we can close it as cannot reproduce for now.

Comment by Peter Jones [ 30/Oct/18 ]

ok - thanks Sarah!

Comment by Sarah Liu [ 13/May/19 ]

Hit the same issue again when testing 2.12.2-rc1, 1 MDS crash after running about 34 hours

soak-10 console log

[10077.630267] device-mapper: multipath: Failing path 8:96.
[10080.630817] device-mapper: multipath: Reinstating path 8:128.
[10080.637462] device-mapper: multipath: Failing path 8:128.
[10081.939589] INFO: task kmmpd-dm-6:12825 blocked for more than 120 seconds.
[10081.947277] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[10081.956040] kmmpd-dm-6      D ffff8da0f6120000     0 12825      2 0x00000080
[10081.964033] Call Trace:
[10081.966797]  [<ffffffff8e366d60>] ? bit_wait+0x50/0x50
[10081.972543]  [<ffffffff8e368c49>] schedule+0x29/0x70
[10081.978120]  [<ffffffff8e366721>] schedule_timeout+0x221/0x2d0
[10081.984704]  [<ffffffffc05040bd>] ? dm_old_request_fn+0x18d/0x210 [dm_mod]
[10081.992416]  [<ffffffff8df680de>] ? deadline_add_request+0x6e/0x90
[10081.999361]  [<ffffffff8dd01092>] ? ktime_get_ts64+0x52/0xf0
[10082.005679]  [<ffffffff8e366d60>] ? bit_wait+0x50/0x50
[10082.011450]  [<ffffffff8e3682ed>] io_schedule_timeout+0xad/0x130
[10082.018164]  [<ffffffff8e368388>] io_schedule+0x18/0x20
[10082.024022]  [<ffffffff8e366d71>] bit_wait_io+0x11/0x50
[10082.029881]  [<ffffffff8e366897>] __wait_on_bit+0x67/0x90
[10082.035921]  [<ffffffff8e366d60>] ? bit_wait+0x50/0x50
[10082.041680]  [<ffffffff8e366a01>] out_of_line_wait_on_bit+0x81/0xb0
[10082.048707]  [<ffffffff8dcc2e00>] ? wake_bit_function+0x40/0x40
[10082.055333]  [<ffffffff8de7931a>] __wait_on_buffer+0x2a/0x30
[10082.061693]  [<ffffffffc11429ed>] write_mmp_block+0x11d/0x160 [ldiskfs]
[10082.069108]  [<ffffffffc1142c48>] kmmpd+0x1a8/0x430 [ldiskfs]
[10082.075554]  [<ffffffffc1142aa0>] ? __dump_mmp_msg+0x70/0x70 [ldiskfs]
[10082.082842]  [<ffffffff8dcc1c71>] kthread+0xd1/0xe0
[10082.088321]  [<ffffffff8dcc1ba0>] ? insert_kthread_work+0x40/0x40
[10082.095153]  [<ffffffff8e375c37>] ret_from_fork_nospec_begin+0x21/0x21
[10082.102444]  [<ffffffff8dcc1ba0>] ? insert_kthread_work+0x40/0x40
[10082.109292] INFO: task jbd2/dm-6-8:12826 blocked for more than 120 seconds.
[10082.117092] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[10082.125846] jbd2/dm-6-8     D ffff8da8ea112080     0 12826      2 0x00000080
[10082.133794] Call Trace:
[10082.136532]  [<ffffffff8e368c49>] schedule+0x29/0x70
[10082.142108]  [<ffffffffc047b2bc>] jbd2_journal_commit_transaction+0x23c/0x19b0 [jbd2]
[10082.150883]  [<ffffffff8dcd9f15>] ? sched_clock_cpu+0x85/0xc0
[10082.157323]  [<ffffffff8dc2a59e>] ? __switch_to+0xce/0x580
[10082.163459]  [<ffffffff8dcc2d40>] ? wake_up_atomic_t+0x30/0x30
[10082.169993]  [<ffffffff8e36878f>] ? __schedule+0x3ff/0x890
[10082.176140]  [<ffffffff8dcaa11e>] ? try_to_del_timer_sync+0x5e/0x90
[10082.183156]  [<ffffffffc0481e89>] kjournald2+0xc9/0x260 [jbd2]
[10082.189686]  [<ffffffff8dcc2d40>] ? wake_up_atomic_t+0x30/0x30
[10082.196208]  [<ffffffffc0481dc0>] ? commit_timeout+0x10/0x10 [jbd2]
[10082.203227]  [<ffffffff8dcc1c71>] kthread+0xd1/0xe0
[10082.208710]  [<ffffffff8dcc1ba0>] ? insert_kthread_work+0x40/0x40
[10082.215521]  [<ffffffff8e375c37>] ret_from_fork_nospec_begin+0x21/0x21
[10082.222812]  [<ffffffff8dcc1ba0>] ? insert_kthread_work+0x40/0x40
[10082.229643] INFO: task mdt_out01_012:13070 blocked for more than 120 seconds.
[10082.237624] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[10082.246372] mdt_out01_012   D ffff8da8d67f9040     0 13070      2 0x00000080
[10082.254354] Call Trace:
[10082.257115]  [<ffffffff8e366d60>] ? bit_wait+0x50/0x50
[10082.262863]  [<ffffffff8e368c49>] schedule+0x29/0x70
[10082.268446]  [<ffffffff8e366721>] schedule_timeout+0x221/0x2d0
[10082.274989]  [<ffffffff8de78c3d>] ? __brelse+0x3d/0x50
[10082.280755]  [<ffffffff8de78f2a>] ? bh_lru_install+0x18a/0x1e0
[10082.287296]  [<ffffffff8dd01092>] ? ktime_get_ts64+0x52/0xf0
[10082.293630]  [<ffffffff8e366d60>] ? bit_wait+0x50/0x50
[10082.299394]  [<ffffffff8e3682ed>] io_schedule_timeout+0xad/0x130
[10082.306106]  [<ffffffff8e368388>] io_schedule+0x18/0x20
[10082.311963]  [<ffffffff8e366d71>] bit_wait_io+0x11/0x50
[10082.317820]  [<ffffffff8e366921>] __wait_on_bit_lock+0x61/0xc0
[10082.324361]  [<ffffffff8e366d60>] ? bit_wait+0x50/0x50
[10082.330114]  [<ffffffff8e366b71>] out_of_line_wait_on_bit_lock+0x81/0xb0
[10082.337614]  [<ffffffff8dcc2e00>] ? wake_bit_function+0x40/0x40
[10082.344252]  [<ffffffff8de79352>] __lock_buffer+0x32/0x40
[10082.350326]  [<ffffffffc0479ea2>] do_get_write_access+0x432/0x4d0 [jbd2]
[10082.357831]  [<ffffffff8de7959d>] ? __getblk+0x2d/0x300
[10082.363681]  [<ffffffffc0479f67>] jbd2_journal_get_write_access+0x27/0x40 [jbd2]
[10082.371965]  [<ffffffffc1159dab>] __ldiskfs_journal_get_write_access+0x3b/0x80 [ldiskfs]
[10082.381027]  [<ffffffffc112f180>] add_dirent_to_buf+0xd0/0x2c0 [ldiskfs]
[10082.388530]  [<ffffffffc112f55e>] ldiskfs_dx_add_entry+0x1ee/0xc60 [ldiskfs]
[10082.396428]  [<ffffffffc047a2d1>] ? jbd2_journal_dirty_metadata+0xd1/0x260 [jbd2]
[10082.404806]  [<ffffffff8dcc2d05>] ? wake_up_bit+0x25/0x30
[10082.410868]  [<ffffffffc1130a54>] __ldiskfs_add_entry+0x254/0x6e0 [ldiskfs]
[10082.418671]  [<ffffffffc1132ae3>] ? ldiskfs_add_dot_dotdot+0x83/0x90 [ldiskfs]
[10082.426755]  [<ffffffffc11e9404>] osd_ldiskfs_add_entry+0x34/0x260 [osd_ldiskfs]
[10082.435042]  [<ffffffffc11e9c77>] __osd_ea_add_rec+0x197/0x360 [osd_ldiskfs]
[10082.442941]  [<ffffffffc11f4ad1>] osd_index_ea_insert+0x491/0x12f0 [osd_ldiskfs]
[10082.451206]  [<ffffffff8e367f32>] ? down_write+0x12/0x3d
[10082.457246]  [<ffffffffc0f43350>] out_tx_index_insert_exec+0x1e0/0x490 [ptlrpc]
[10082.465464]  [<ffffffffc0f41521>] ? out_obj_ref_add+0xc1/0x220 [ptlrpc]
[10082.472895]  [<ffffffffc0f3c391>] out_tx_end+0xe1/0x5c0 [ptlrpc]
[10082.479666]  [<ffffffffc0f405b2>] out_handle+0x1452/0x1bc0 [ptlrpc]
[10082.486717]  [<ffffffffc0e87890>] ? target_send_reply_msg+0x170/0x170 [ptlrpc]
[10082.494854]  [<ffffffffc0f351da>] tgt_request_handle+0xaea/0x1580 [ptlrpc]
[10082.502588]  [<ffffffffc0f10c11>] ? ptlrpc_nrs_req_get_nolock0+0xd1/0x170 [ptlrpc]
[10082.511061]  [<ffffffffc0b2abde>] ? ktime_get_real_seconds+0xe/0x10 [libcfs]
[10082.519002]  [<ffffffffc0eda80b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
[10082.527624]  [<ffffffffc0ed7695>] ? ptlrpc_wait_event+0xa5/0x360 [ptlrpc]
[10082.535212]  [<ffffffff8dcd6802>] ? default_wake_function+0x12/0x20
[10082.542246]  [<ffffffff8dccbadb>] ? __wake_up_common+0x5b/0x90
[10082.548808]  [<ffffffffc0ede13c>] ptlrpc_main+0xafc/0x1fc0 [ptlrpc]
[10082.555853]  [<ffffffffc0edd640>] ? ptlrpc_register_service+0xf80/0xf80 [ptlrpc]
[10082.564140]  [<ffffffff8dcc1c71>] kthread+0xd1/0xe0
[10082.569609]  [<ffffffff8dcc1ba0>] ? insert_kthread_work+0x40/0x40
[10082.576435]  [<ffffffff8e375c37>] ret_from_fork_nospec_begin+0x21/0x21
[10082.583733]  [<ffffffff8dcc1ba0>] ? insert_kthread_work+0x40/0x40
[10082.590580] INFO: task debugfs:27640 blocked for more than 120 seconds.
[10082.597980] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[10082.606728] debugfs         D ffff8da16f7c4100     0 27640  27639 0x00000080
[10082.614668] Call Trace:
[10082.617431]  [<ffffffff8df7d1b2>] ? radix_tree_lookup_slot+0x22/0x50
[10082.624532]  [<ffffffff8e366d60>] ? bit_wait+0x50/0x50
[10082.630298]  [<ffffffff8e368c49>] schedule+0x29/0x70
[10082.635855]  [<ffffffff8e366721>] schedule_timeout+0x221/0x2d0
[10082.637668] device-mapper: multipath: Reinstating path 8:96.
[10082.637830] device-mapper: multipath: Failing path 8:96.
[10082.654638]  [<ffffffff8ddb6d2d>] ? find_get_pages_tag+0xfd/0x240
[10082.661459]  [<ffffffff8e366d60>] ? bit_wait+0x50/0x50
[10082.667204]  [<ffffffff8e3682ed>] io_schedule_timeout+0xad/0x130
[10082.673927]  [<ffffffff8e368388>] io_schedule+0x18/0x20
[10082.679768]  [<ffffffff8e366d71>] bit_wait_io+0x11/0x50
[10082.685610]  [<ffffffff8e366897>] __wait_on_bit+0x67/0x90
[10082.691649]  [<ffffffff8ddb5881>] wait_on_page_bit+0x81/0xa0
[10082.697995]  [<ffffffff8dcc2e00>] ? wake_bit_function+0x40/0x40
[10082.704613]  [<ffffffff8ddb59b1>] __filemap_fdatawait_range+0x111/0x190
[10082.712004]  [<ffffffff8ddb5a44>] filemap_fdatawait_range+0x14/0x30
[10082.719033]  [<ffffffff8ddb7ea6>] filemap_write_and_wait_range+0x56/0x90
[10082.726530]  [<ffffffff8de803eb>] blkdev_fsync+0x1b/0x50
[10082.732468]  [<ffffffff8de762f7>] do_fsync+0x67/0xb0
[10082.738041]  [<ffffffff8e375d15>] ? system_call_after_swapgs+0xa2/0x146
[10082.745435]  [<ffffffff8de765e0>] SyS_fsync+0x10/0x20
[10082.751082]  [<ffffffff8e375ddb>] system_call_fastpath+0x22/0x27
[10082.757818]  [<ffffffff8e375d21>] ? system_call_after_swapgs+0xae/0x146
[10085.638108] device-mapper: multipath: Reinstating path 8:128.
[10085.644767] device-mapper: multipath: Failing path 8:128.
[10087.645110] device-mapper: multipath: Reinstating path 8:96.
[10087.651810] sd 0:0:0:44: rdac: array soak-netapp5660-1, ctlr 0, queueing MODE_SELECT command
[10088.301259] sd 0:0:0:44: rdac: array soak-netapp5660-1, ctlr 0, MODE_SELECT completed
[10088.676785] LustreError: 13053:0:(osd_oi.c:762:osd_oi_insert()) dm-6: the FID [0x2800cae3d:0xe3ec:0x0] is used by two objects: 30409754/1062736205 477627435/1062736207
[10088.693877] LNet: 12726:0:(o2iblnd_cb.c:413:kiblnd_handle_rx()) PUT_NACK from 192.168.1.109@o2ib
[10090.652279] device-mapper: multipath: Reinstating path 8:128.
[10090.658823] device-mapper: multipath: Failing path 8:128.
[10095.663302] device-mapper: multipath: Reinstating path 8:128.
[10095.669968] device-mapper: multipath: Failing path 8:128.
[10100.670909] device-mapper: multipath: Reinstating path 8:128.
[10100.677556] device-mapper: multipath: Failing path 8:128.
[10105.677908] device-mapper: multipath: Reinstating path 8:128.
[10105.684607] device-mapper: multipath: Failing path 8:128.
[10110.686120] device-mapper: multipath: Reinstating path 8:128.
[10110.692756] device-mapper: multipath: Failing path 8:128.
[10115.694525] device-mapper: multipath: Reinstating path 8:128.
[10115.701149] device-mapper: multipath: Failing path 8:128.
[10119.333857] BUG: unable to handle kernel NULL pointer dereference at           (null)
[10119.342631] IP: [<          (null)>]           (null)
[10119.348281] PGD 0 
[10119.350546] Oops: 0010 [#1] SMP 
[10119.354188] Modules linked in: osp(OE) mdd(OE) lod(OE) mdt(OE) lfsck(OE) mgc(OE) osd_ldiskfs(OE) ldiskfs(OE) lquota(OE) fid(OE) fld(OE) ko2iblnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) libcfs(OE) rpcsec_gss_krb5 nfsv4 dns_resolver nfs lockd grace fscache rdma_ucm(OE) ib_ucm(OE) rdma_cm(OE) iw_cm(OE) ib_ipoib(OE) ib_cm(OE) ib_umad(OE) mlx5_ib(OE) mlx5_core(OE) mlxfw(OE) mlx4_en(OE) sb_edac intel_powerclamp coretemp intel_rapl iosf_mbi kvm irqbypass crc32_pclmul ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper ipmi_ssif ablk_helper dm_round_robin cryptd iTCO_wdt iTCO_vendor_support ipmi_si joydev pcspkr ipmi_devintf mei_me ioatdma ipmi_msghandler i2c_i801 mei sg lpc_ich wmi dm_multipath dm_mod auth_rpcgss sunrpc ip_tables ext4 mbcache jbd2 sd_mod crc_t10dif crct10dif_generic mlx4_ib(OE) ib_uverbs(OE) ib_core(OE) mgag200 drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops mlx4_core(OE) igb ttm isci ahci devlink crct10dif_pclmul crct10dif_common libsas mpt3sas libahci crc32c_intel ptp drm pps_core raid_class libata mlx_compat(OE) dca scsi_transport_sas drm_panel_orientation_quirks i2c_algo_bit
[10119.465397] CPU: 13 PID: 13069 Comm: mdt_out01_011 Kdump: loaded Tainted: G           OE  ------------   3.10.0-957.10.1.el7_lustre.x86_64 #1
[10119.479569] Hardware name: Intel Corporation S2600GZ ........../S2600GZ, BIOS SE5C600.86B.01.08.0003.022620131521 02/26/2013
[10119.492080] task: ffff8da8e85fa080 ti: ffff8da8ccfb0000 task.ti: ffff8da8ccfb0000
[10119.500439] RIP: 0010:[<0000000000000000>]  [<          (null)>]           (null)
[10119.508809] RSP: 0018:ffff8da8ccfb3b60  EFLAGS: 00010246
[10119.514738] RAX: 0000000000000000 RBX: ffff8da8d64fa580 RCX: 0000000000000002
[10119.522705] RDX: ffffffffc11e1a50 RSI: ffff8da8ccfb3b68 RDI: ffff8da8d64fa588
[10119.530669] RBP: ffff8da8ccfb3ba0 R08: 0000000000000004 R09: 0000000000000000
[10119.538639] R10: 0000000000000001 R11: 00000000007fffff R12: ffff8da85672a300
[10119.546601] R13: ffff8da86e8e5d08 R14: ffff8da4b1756200 R15: ffff8da8d64fa588
[10119.554566] FS:  0000000000000000(0000) GS:ffff8da8ed940000(0000) knlGS:0000000000000000
[10119.563610] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[10119.570023] CR2: 0000000000000000 CR3: 000000004c210000 CR4: 00000000000607e0
[10119.578001] Call Trace:
[10119.580767]  [<ffffffffc11dfe4e>] ? osd_ldiskfs_it_fill+0xbe/0x260 [osd_ldiskfs]
[10119.589030]  [<ffffffffc11e0027>] osd_it_ea_load+0x37/0x100 [osd_ldiskfs]
[10119.596673]  [<ffffffffc0c987d8>] dt_index_walk+0xf8/0x430 [obdclass]
[10119.603889]  [<ffffffffc0c98b10>] ? dt_index_walk+0x430/0x430 [obdclass]
[10119.611386]  [<ffffffffc0c99404>] dt_index_read+0x394/0x6a0 [obdclass]
[10119.618775]  [<ffffffffc0f32162>] tgt_obd_idx_read+0x612/0x860 [ptlrpc]
[10119.626211]  [<ffffffffc0f351da>] tgt_request_handle+0xaea/0x1580 [ptlrpc]
[10119.633936]  [<ffffffffc0f10c11>] ? ptlrpc_nrs_req_get_nolock0+0xd1/0x170 [ptlrpc]
[10119.642404]  [<ffffffffc0b2abde>] ? ktime_get_real_seconds+0xe/0x10 [libcfs]
[10119.650319]  [<ffffffffc0eda80b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
[10119.658910]  [<ffffffffc0ed7695>] ? ptlrpc_wait_event+0xa5/0x360 [ptlrpc]
[10119.666489]  [<ffffffff8dcd6802>] ? default_wake_function+0x12/0x20
[10119.673492]  [<ffffffff8dccbadb>] ? __wake_up_common+0x5b/0x90
[10119.680047]  [<ffffffffc0ede13c>] ptlrpc_main+0xafc/0x1fc0 [ptlrpc]
[10119.687087]  [<ffffffffc0edd640>] ? ptlrpc_register_service+0xf80/0xf80 [ptlrpc]
[10119.695351]  [<ffffffff8dcc1c71>] kthread+0xd1/0xe0
[10119.700795]  [<ffffffff8dcc1ba0>] ? insert_kthread_work+0x40/0x40
[10119.707625]  [<ffffffff8e375c37>] ret_from_fork_nospec_begin+0x21/0x21
[10119.714910]  [<ffffffff8dcc1ba0>] ? insert_kthread_work+0x40/0x40
[10119.721711] Code:  Bad RIP value.
[10119.725449] RIP  [<          (null)>]           (null)
[10119.731192]  RSP <ffff8da8ccfb3b60>
[10119.735084] CR2: 0000000000000000
[    0.000000] Initializing cgroup subsys cpuset
[    0.000000] Initializing cgroup subsys cpu
[    0.000000] Initializing cgroup subsys cpuacct
Comment by Sarah Liu [ 13/May/19 ]

Hit this again on 2.12.2-rc1 testing

Comment by Yang Sheng [ 14/May/19 ]

Hi, Sarah,

Do you have crash dump for further analysis?

Thanks,
Yangsheng

Comment by Sarah Liu [ 15/May/19 ]

Hi Yang Sheng,

I uploaded the vmcore to my home dir on trevis, can you please see if you can access it. Thank you

/home/w3liu/test_logs/lu-11457

Comment by Yang Sheng [ 16/May/19 ]

Hi, Sarah,

Which build do you test? I use

https://build.whamcloud.com/job/lustre-b2_12/arch=x86_64,build_type=server,distro=el7.6,ib_stack=inkernel/15/artifact/artifacts/RPMS/x86_64/

cannot decode the vmcore.

Thanks,
YangSheng

Comment by Sarah Liu [ 16/May/19 ]

I uploaded the vmcore again, please try if it works. Or if you can access spirit, the dumps are located under
/scratch/dumps/soak-10.spirit.whamcloud.com/10.10.1.110-2019-05-12-13:06:18

I was using the mlnx bulid
https://build.whamcloud.com/job/lustre-b2_12-ib/16/arch=x86_64,build_type=server,distro=el7.6,ib_stack=mlx/

Comment by Yang Sheng [ 21/May/19 ]

From vmcore, looks like some thing really wrong:

.....
crash> inode ffff8da86e8e5d08 -x
struct inode {
  i_mode = 0x41ed,
  i_opflags = 0x0,
  i_uid = {
    val = 0x21b
  },
  i_gid = {
    val = 0x21b
  },
  i_flags = 0x80,
  i_acl = 0x0,
  i_default_acl = 0xffffffffffffffff,
  i_op = 0xffffffffc1178d00 <ldiskfs_file_inode_operations>,
  i_sb = 0xffff8da5797df800,
  i_mapping = 0xffff8da86e8e5e58,
  i_security = 0x0,
  i_ino = 0x1c78042b,
  {
    i_nlink = 0x1,
    __i_nlink = 0x1
  },
  i_rdev = 0x0,
  i_size = 0x0,
  i_atime = {
    tv_sec = 0x5cd8196d,
    tv_nsec = 0x0
  },
  i_mtime = {
    tv_sec = 0x5cd8196d,
    tv_nsec = 0x0
  },
  i_ctime = {
    tv_sec = 0x5cd8196d,
    tv_nsec = 0x0
  },
  i_lock = {
    {
      rlock = {
        raw_lock = {
          val = {
            counter = 0x0
          }
        }
      }
    }
  },
  i_bytes = 0x0,
  i_blkbits = 0xc,
  i_blocks = 0x0,
  i_state = 0x0,
  i_mutex = {
    count = {
      counter = 0x1
    },
    wait_lock = {
      {
        rlock = {
          raw_lock = {
            val = {
              counter = 0x0
            }
          }
        }
      }
    },
    wait_list = {
      next = 0xffff8da86e8e5db8,
      prev = 0xffff8da86e8e5db8
    },
    owner = 0x0,
    {
      osq = {
        tail = {
          counter = 0x0
        }
      },
      __UNIQUE_ID_rh_kabi_hide3 = {
        spin_mlock = 0x0
      },
      {<No data fields>}
    }
  },
  dirtied_when = 0x0,
  i_hash = {
    next = 0x0,
    pprev = 0xffffab0282849760
  },
  i_io_list = {
    next = 0xffff8da86e8e5df0,
    prev = 0xffff8da86e8e5df0
  },
  i_lru = {
    next = 0xffff8da86e8e5e00,
    prev = 0xffff8da86e8e5e00
  },
  i_sb_list = {
    next = 0xffff8da8555f27a0,
    prev = 0xffff8da86ea655b0
  },
  {
    i_dentry = {
      first = 0x0
    },
    i_rcu = {
      next = 0x0,
      func = 0x0
    }
  },
  i_version = 0x1,
  i_count = {
    counter = 0x1
  },
  i_dio_count = {
    counter = 0x0
  },
  i_writecount = {
    counter = 0x0
  },
  i_fop = 0xffffffffc1178dc0 <ldiskfs_file_operations>,
  i_flock = 0x0,
  i_data = {
    host = 0xffff8da86e8e5d08,
    page_tree = {
      rh_reserved_height = 0x0,
      gfp_mask = 0x20,
      rnode = 0x0
    },
    tree_lock = {
      {
        rlock = {
          raw_lock = {
            val = {
              counter = 0x0
            }
          }
        }
      }
    },
    {
      i_mmap_writable = {
        counter = 0x0
      },
      __UNIQUE_ID_rh_kabi_hide17 = {
        i_mmap_writable = 0x0
      },
      {<No data fields>}
    },
    i_mmap = {
      rb_node = 0x0
    },
    i_mmap_nonlinear = {
      next = 0xffff8da86e8e5e80,
      prev = 0xffff8da86e8e5e80
    },
    i_mmap_mutex = {
      count = {
        counter = 0x1
      },
      wait_lock = {
        {
          rlock = {
            raw_lock = {
              val = {
                counter = 0x0
              }
            }
          }
        }
      },
      wait_list = {
        next = 0xffff8da86e8e5e98,
        prev = 0xffff8da86e8e5e98
      },
      owner = 0x0,
      {
        osq = {
          tail = {
            counter = 0x0
          }
        },
        __UNIQUE_ID_rh_kabi_hide3 = {
          spin_mlock = 0x0
        },
        {<No data fields>}
      }
    },
    nrpages = 0x0,
    nrexceptional = 0x0,
    writeback_index = 0x0,
    a_ops = 0xffffffffc1178b20 <ldiskfs_journalled_aops>,
    flags = 0x200da,
    backing_dev_info = 0xffff8da8e3673b20,
    private_lock = {
      {
        rlock = {
          raw_lock = {
            val = {
              counter = 0x0
            }
          }
        }
      }
    },
    private_list = {
      next = 0xffff8da86e8e5ef0,
      prev = 0xffff8da86e8e5ef0
    },
    private_data = 0x0
  },
  i_dquot = {0xffff8da8e5124100, 0xffff8da8e5124200, 0x0},
  i_devices = {
    next = 0xffff8da86e8e5f20,
    prev = 0xffff8da86e8e5f20
  },
  {
    i_pipe = 0x0,
    i_bdev = 0x0,
    i_cdev = 0x0
  },
  i_generation = 0x3f58114f,
  i_fsnotify_mask = 0x0,
  {
    i_fsnotify_marks = 0x0,
    __UNIQUE_ID_rh_kabi_hide18 = {
      i_fsnotify_marks = {
        first = 0x0
      }
    },
    {<No data fields>}
  },
  i_readcount = {
    counter = 0x0
  },
  i_private = 0x0
}
......

The i_mode indicated this inode is a directory(0x4000). But the i_op & i_fop set to REG operations. Looks like some corner case causes such situation?

Thanks,
YangSheng

Comment by Sarah Liu [ 27/Jan/20 ]

Hit this issue again on lustre-b2_12-next #57
soak-11 console:

[16422.066032] Lustre: 79110:0:(service.c:1322:ptlrpc_at_send_early_reply()) Skipped 425 previous similar messages
[16469.920777] Lustre: 79107:0:(service.c:2011:ptlrpc_server_handle_req_in()) @@@ Slow req_in handling 225s  req@ffff89052d9cf980 x1656840860166528/t0(0) o103->s
oaked-MDT0002-mdtlov_UUID@192.168.1.110@o2ib:0/0 lens 328/0 e 0 to 0 dl 0 ref 1 fl New:/2/ffffffff rc 0/-1
[16469.946879] Lustre: 79107:0:(service.c:2011:ptlrpc_server_handle_req_in()) Skipped 370 previous similar messages
[16511.443090] Lustre: ldlm_canceld: This server is not able to keep up with request traffic (cpu-bound).
[16511.453496] Lustre: 79109:0:(service.c:1541:ptlrpc_at_check_timed()) earlyQ=3796 reqQ=17207 recA=0, svcEst=600, delay=85403
[16511.466032] Lustre: 79109:0:(service.c:1322:ptlrpc_at_send_early_reply()) @@@ Already past deadline (-2s), not sending early reply. Consider increasing at_ear
ly_margin (5)?  req@ffff890983671200 x1656840826067648/t0(0) o103->soaked-MDT0000-mdtlov_UUID@192.168.1.108@o2ib:154/0 lens 328/0 e 0 to 0 dl 1580100394 ref 2 fl
 New:/0/ffffffff rc 0/-1
[16511.499803] Lustre: 79109:0:(service.c:1322:ptlrpc_at_send_early_reply()) Skipped 109 previous similar messages
[16589.022832] Lustre: soaked-MDT0003: Client 58fa13bb-1d12-d81e-9977-655c7b0a359d (at 192.168.1.117@o2ib) reconnecting
[16665.657814] Lustre: soaked-MDT0003: Client 2d96e5f6-af44-1f55-5648-3c5026213dae (at 192.168.1.130@o2ib) reconnecting
[16751.557870] Lustre: soaked-MDT0003: Client d23f5b34-b590-b62f-81cf-55bba7beeb62 (at 192.168.1.142@o2ib) reconnecting
[16756.583622] Lustre: soaked-MDT0003: Client bd3af862-653c-bd66-2072-7b69d8812dd7 (at 192.168.1.140@o2ib) reconnecting
[16756.595384] Lustre: Skipped 1 previous similar message
[16781.901025] Lustre: soaked-MDT0003: Client 55eecb6a-6cc1-b7f2-e17b-d2ce51f4ff57 (at 192.168.1.141@o2ib) reconnecting
[16785.356557] perf: interrupt took too long (3961 > 3942), lowering kernel.perf_event_max_sample_rate to 50000
[16840.006883] Lustre: 78276:0:(client.c:2133:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1580100124/real 1580100124]  req@
ffff890592e0d580 x1656840776727360/t0(0) o103->soaked-MDT0000-osp-MDT0003@192.168.1.108@o2ib:17/18 lens 328/224 e 7 to 1 dl 1580100725 ref 1 fl Rpc:X/2/ffffffff 
rc -11/-1
[16840.012891] Lustre: soaked-MDT0000-osp-MDT0003: Connection to soaked-MDT0000 (at 192.168.1.108@o2ib) was lost; in progress operations using this service will 
wait for recovery to complete
[16840.057838] Lustre: 78276:0:(client.c:2133:ptlrpc_expire_one_request()) Skipped 1124 previous similar messages
[16840.104230] Lustre: soaked-MDT0000-osp-MDT0003: Connection restored to 192.168.1.108@o2ib (at 192.168.1.108@o2ib)
[16840.115694] Lustre: Skipped 8 previous similar messages
[16844.361410] Lustre: soaked-MDT0003: Received new LWP connection from 192.168.1.110@o2ib, removing former export from same NID
[16844.374044] Lustre: Skipped 1 previous similar message
[16879.242891] Lustre: 79107:0:(service.c:1372:ptlrpc_at_send_early_reply()) @@@ Couldn't add any time (5/5), not sending early reply
[16879.242891]   req@ffff890969d8e300 x1656840800674880/t0(0) o103->soaked-MDT0000-mdtlov_UUID@192.168.1.108@o2ib:529/0 lens 328/0 e 0 to 0 dl 1580100769 ref 2 f
l New:/2/ffffffff rc 0/-1
[16879.274271] Lustre: 79107:0:(service.c:1372:ptlrpc_at_send_early_reply()) Skipped 34712 previous similar messages
[16884.947622] LustreError: 79136:0:(osd_oi.c:762:osd_oi_insert()) dm-5: the FID [0x2c00543c1:0xe1d:0x0] is used by two objects: 212861283/3086025257 233832815/3
086025258
[16884.964548] LNet: 78257:0:(o2iblnd_cb.c:413:kiblnd_handle_rx()) PUT_NACK from 192.168.1.108@o2ib
[16887.446546] BUG: unable to handle kernel NULL pointer dereference at           (null)
[16887.455308] IP: [<          (null)>]           (null)
[16887.460965] PGD 0
[16887.463234] Oops: 0010 [#1] SMP
[16887.466862] Modules linked in: osp(OE) mdd(OE) lod(OE) mdt(OE) lfsck(OE) mgc(OE) osd_ldiskfs(OE) lquota(OE) ldiskfs(OE) lustre(OE) lmv(OE) mdc(OE) osc(OE) lov
(OE) fid(OE) fld(OE) ko2iblnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) libcfs(OE) rpcsec_gss_krb5 nfsv4 dns_resolver nfs lockd grace fscache rdma_ucm(OE) ib_ucm(OE)
rdma_cm(OE) iw_cm(OE) ib_ipoib(OE) ib_cm(OE) ib_umad(OE) mlx5_ib(OE) mlx5_core(OE) mlxfw(OE) mlx4_en(OE) mlx4_ib(OE) ib_uverbs(OE) ib_core(OE) sb_edac intel_powe
rclamp coretemp intel_rapl iosf_mbi kvm iTCO_wdt irqbypass iTCO_vendor_support crc32_pclmul ghash_clmulni_intel aesni_intel lrw ipmi_ssif gf128mul glue_helper me
i_me dm_round_robin ablk_helper i2c_i801 mei ipmi_si lpc_ich cryptd ioatdma joydev pcspkr sg ipmi_devintf ipmi_msghandler wmi pcc_cpufreq dm_multipath dm_mod aut
h_rpcgss sunrpc ip_tables ext4 mbcache jbd2 sd_mod crc_t10dif crct10dif_generic mgag200 drm_kms_helper syscopyarea sysfillrect mlx4_core(OE) igb sysimgblt fb_sys
_fops isci mpt3sas ttm ahci libsas libahci ptp crct10dif_pclmul drm devlink pps_core raid_class crct10dif_common dca libata crc32c_intel mlx_compat(OE) scsi_tran
sport_sas drm_panel_orientation_quirks i2c_algo_bit
[16887.582758] CPU: 8 PID: 83890 Comm: mdt_out01_011 Kdump: loaded Tainted: G           OE  ------------   3.10.0-1062.9.1.el7_lustre.x86_64 #1
[16887.596833] Hardware name: Intel Corporation S2600GZ ........../S2600GZ, BIOS SE5C600.86B.01.08.0003.022620131521 02/26/2013
[16887.609356] task: ffff8905a85d41c0 ti: ffff8905765b0000 task.ti: ffff8905765b0000
[16887.617710] RIP: 0010:[<0000000000000000>]  [<          (null)>]           (null)
[16887.626075] RSP: 0018:ffff8905765b3b68  EFLAGS: 00010246
[16887.631996] RAX: 0000000000000000 RBX: ffff89097aff8000 RCX: 0000000000000002
[16887.639962] RDX: ffffffffc1564ab0 RSI: ffff8905765b3b70 RDI: ffff89097aff8008
[16887.647926] RBP: ffff8905765b3ba8 R08: 0000000000000000 R09: 0000000000000001
[16887.655891] R10: 0000000200000008 R11: 00000000007fffff R12: ffff8909a93ded00
[16887.663863] R13: ffff89096ddb15d8 R14: ffff890981779a00 R15: ffff89097aff8008
[16887.671829] FS:  0000000000000000(0000) GS:ffff8909ae200000(0000) knlGS:0000000000000000
[16887.680870] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[16887.687285] CR2: 0000000000000000 CR3: 0000000604210000 CR4: 00000000000607e0
[16887.695251] Call Trace:
[16887.698006]  [<ffffffffc1562eae>] ? osd_ldiskfs_it_fill+0xbe/0x260 [osd_ldiskfs]
[16887.706274]  [<ffffffffc1563087>] osd_it_ea_load+0x37/0x100 [osd_ldiskfs]
[16887.713907]  [<ffffffffc0dba208>] dt_index_walk+0xf8/0x430 [obdclass]
[16887.721135]  [<ffffffffc0dba540>] ? dt_index_walk+0x430/0x430 [obdclass]
[16887.728649]  [<ffffffffc0dbae34>] dt_index_read+0x394/0x6a0 [obdclass]
[16887.736047]  [<ffffffffc10f4e42>] tgt_obd_idx_read+0x612/0x860 [ptlrpc]
[16887.743503]  [<ffffffffc10f99ca>] tgt_request_handle+0xada/0x1570 [ptlrpc]
[16887.751242]  [<ffffffffc10d3101>] ? ptlrpc_nrs_req_get_nolock0+0xd1/0x170 [ptlrpc]
[16887.759709]  [<ffffffffc0c6ebde>] ? ktime_get_real_seconds+0xe/0x10 [libcfs]
[16887.767639]  [<ffffffffc109e47b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
[16887.776256]  [<ffffffffc109b295>] ? ptlrpc_wait_event+0xa5/0x360 [ptlrpc]
[16887.783833]  [<ffffffff892d3903>] ? __wake_up+0x13/0x20
[16887.789723]  [<ffffffffc10a1de4>] ptlrpc_main+0xb34/0x1470 [ptlrpc]
[16887.796776]  [<ffffffffc10a12b0>] ? ptlrpc_register_service+0xf80/0xf80 [ptlrpc]
[16887.805038]  [<ffffffff892c61f1>] kthread+0xd1/0xe0
[16887.810487]  [<ffffffff892c6120>] ? insert_kthread_work+0x40/0x40
[16887.817293]  [<ffffffff8998dd37>] ret_from_fork_nospec_begin+0x21/0x21
[16887.824581]  [<ffffffff892c6120>] ? insert_kthread_work+0x40/0x40
[16887.831383] Code:  Bad RIP value.
[16887.835102] RIP  [<          (null)>]           (null)
[16887.835102] RIP  [<          (null)>]           (null)
[16887.840852]  RSP <ffff8905765b3b68>
[16887.844734] CR2: 0000000000000000
[    0.000000] Initializing cgroup subsys cpuset
[    0.000000] Initializing cgroup subsys cpu
[    0.000000] Initializing cgroup subsys cpuacct
[    0.000000] Linux version 3.10.0-1062.9.1.el7_lustre.x86_64 (jenkins@trevis-310-el7-x8664-6.trevis.whamcloud.com) (gcc version 4.8.5 20150623 (Red Hat 4.8.5-39) (GCC) ) #1 SMP Thu Jan 23 14:10:41 UTC 2020
[    0.000000] Command line: BOOT_IMAGE=/boot/vmlinuz-3.10.0-1062.9.1.el7_lustre.x86_64 ro console=ttyS0,115200 irqpoll nr_cpus=1 reset_devices cgroup_disable=memory mce=off numa=off udev.children-max=2 panic=10 rootflags=nofail acpi_no_memhotplug transparent_hugepage=never nokaslr novmcoredd disable_cpu_apicid=0 elfcorehdr=869812K
[    0.000000] e820: BIOS-provided physical RAM map:
[    0.000000] BIOS-e820: [mem 0x0000000000000000-0x0000000000000fff] reserved
[    0.000000] BIOS-e820: [mem 0x0000000000001000-0x000000000008ffff] usable
[    0.000000] BIOS-e820: [mem 0x0000000000090000-0x000000000009ffff] reserved

core dump can be found under /scratch/dumps/soak-11.spirit.whamcloud.com/10.10.1.111-2020-01-27-04\:53\:21/

Comment by Sarah Liu [ 05/Jun/20 ]

soak hit the problem on 2.12.5 testing after running for 4 days.

[18137.800259] Pid: 5007, comm: mdt01_017 3.10.0-1062.18.1.el7_lustre.x86_64 #1 SMP Wed May 27 23:31:53 UTC 2020
[18137.811351] Call Trace:
[18137.814107]  [<ffffffffc107f8fc>] top_trans_wait_result+0xa9/0x158 [ptlrpc]
[18137.822008]  [<ffffffffc105fb79>] top_trans_stop+0x4e9/0xa70 [ptlrpc]
[18137.829352]  [<ffffffffc1522dec>] lod_trans_stop+0x25c/0x340 [lod]
[18137.836307]  [<ffffffffc15d56ee>] mdd_trans_stop+0x2e/0x174 [mdd]
[18137.843150]  [<ffffffffc15bdc73>] mdd_unlink+0x3f3/0xaa0 [mdd]
[18137.849700]  [<ffffffffc148139d>] mdo_unlink+0x46/0x48 [mdt]
[18137.856087]  [<ffffffffc14452bd>] mdt_reint_unlink+0xbed/0x14b0 [mdt]
[18137.863361]  [<ffffffffc1449963>] mdt_reint_rec+0x83/0x210 [mdt]
[18137.870109]  [<ffffffffc1426273>] mdt_reint_internal+0x6e3/0xaf0 [mdt]
[18137.877464]  [<ffffffffc14316e7>] mdt_reint+0x67/0x140 [mdt]
[18137.883828]  [<ffffffffc104d9da>] tgt_request_handle+0xada/0x1570 [ptlrpc]
[18137.891641]  [<ffffffffc0ff248b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
[18137.900298]  [<ffffffffc0ff5df4>] ptlrpc_main+0xb34/0x1470 [ptlrpc]
[18137.907411]  [<ffffffff9f2c6321>] kthread+0xd1/0xe0
[18137.912887]  [<ffffffff9f98ed37>] ret_from_fork_nospec_end+0x0/0x39
[18137.919936]  [<ffffffffffffffff>] 0xffffffffffffffff
[18137.925528] LNet: Service thread pid 4804 was inactive for 202.30s. Watchdog stack traces are limited to 3 per 300 seconds, skipping this one.
[18152.716405] LNet: 4610:0:(o2iblnd_cb.c:3397:kiblnd_check_conns()) Timed out tx for 192.168.1.111@o2ib: 0 seconds
[18152.727780] LNet: 4610:0:(o2iblnd_cb.c:3397:kiblnd_check_conns()) Skipped 111 previous similar messages
[18210.718946] Lustre: 4675:0:(client.c:2133:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1591370247/real 1591370247]  req@ffff8b87f443ec00 x1668654025156672/t0(0) o105->soaked-MDT0001@192.168.1.111@o2ib:15/16 lens 304/224 e 0 to 1 dl 1591370254 ref 2 fl Rpc:X/2/ffffffff rc 0/-1
[18210.749830] Lustre: 4675:0:(client.c:2133:ptlrpc_expire_one_request()) Skipped 110 previous similar messages
[18232.719942] LNet: 4610:0:(o2iblnd_cb.c:3397:kiblnd_check_conns()) Timed out tx for 192.168.1.111@o2ib: 0 seconds
[18232.731320] LNet: 4610:0:(o2iblnd_cb.c:3397:kiblnd_check_conns()) Skipped 91 previous similar messages
[18235.829192] Lustre: soaked-MDT0001: Received new LWP connection from 192.168.1.108@o2ib, removing former export from same NID
[18235.841908] Lustre: soaked-MDT0001: Connection restored to soaked-MDT0000-mdtlov_UUID (at 192.168.1.108@o2ib)
[18235.853001] Lustre: Skipped 1 previous similar message
[18236.004566] LustreError: 4796:0:(osd_oi.c:762:osd_oi_insert()) dm-1: the FID [0x2400134c5:0x8001:0x0] is used by two objects: 477626398/123258462 477626399/123258463
[18236.053964] LNet: 4615:0:(o2iblnd_cb.c:413:kiblnd_handle_rx()) PUT_NACK from 192.168.1.108@o2ib
[18236.794778] BUG: unable to handle kernel NULL pointer dereference at           (null)
[18236.803570] IP: [<          (null)>]           (null)
[18236.809248] PGD 0 
[18236.811519] Oops: 0010 [#1] SMP 
[18236.815153] Modules linked in: osp(OE) mdd(OE) lod(OE) mdt(OE) lfsck(OE) mgc(OE) osd_ldiskfs(OE) ldiskfs(OE) lquota(OE) fid(OE) fld(OE) ko2iblnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) libcfs(OE) rpcsec_gss_krb5 nfsv4 dns_resolver nfs lockd grace fscache rdma_ucm(OE) ib_ucm(OE) rdma_cm(OE) iw_cm(OE) ib_ipoib(OE) ib_cm(OE) ib_umad(OE) mlx5_ib(OE) ib_uverbs(OE) mlx5_core(OE) mlxfw(OE) mlx4_en(OE) sb_edac intel_powerclamp coretemp intel_rapl iosf_mbi kvm irqbypass crc32_pclmul ghash_clmulni_intel iTCO_wdt iTCO_vendor_support aesni_intel lrw gf128mul glue_helper ipmi_ssif ablk_helper dm_round_robin mei_me cryptd mei sg ioatdma ipmi_si i2c_i801 lpc_ich ipmi_devintf joydev ipmi_msghandler pcspkr pcc_cpufreq wmi dm_multipath dm_mod auth_rpcgss sunrpc ip_tables ext4 mbcache jbd2 mlx4_ib(OE) ib_core(OE) sd_mod crc_t10dif crct10dif_generic mgag200 drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops isci ttm igb mlx4_core(OE) ahci libsas libahci ptp mpt2sas drm devlink pps_core crct10dif_pclmul raid_class crct10dif_common dca libata crc32c_intel mlx_compat(OE) scsi_transport_sas drm_panel_orientation_quirks i2c_algo_bit
[18236.927648] CPU: 31 PID: 4694 Comm: mdt_rdpg01_001 Kdump: loaded Tainted: G           OE  ------------   3.10.0-1062.18.1.el7_lustre.x86_64 #1
[18236.941923] Hardware name: Intel Corporation S2600GZ ........../S2600GZ, BIOS SE5C600.86B.01.08.0003.022620131521 02/26/2013
[18236.954462] task: ffff8b8807dae2a0 ti: ffff8b88072f0000 task.ti: ffff8b88072f0000
[18236.962824] RIP: 0010:[<0000000000000000>]  [<          (null)>]           (null)
[18236.971207] RSP: 0018:ffff8b88072f3b68  EFLAGS: 00010246
[18236.977142] RAX: 0000000000000000 RBX: ffff8b87ff710000 RCX: 0000000000000002
[18236.985109] RDX: ffffffffc12e2ab0 RSI: ffff8b88072f3b70 RDI: ffff8b87ff710008
[18236.993080] RBP: ffff8b88072f3ba8 R08: 0000000000000000 R09: 0000000000000001
[18237.001053] R10: ffff8b842dbb5800 R11: 00000000007fffff R12: ffff8b87fef0f000
[18237.009011] R13: ffff8b87eb34f628 R14: ffff8b88283e6800 R15: ffff8b87ff710008
[18237.016993] FS:  0000000000000000(0000) GS:ffff8b882e5c0000(0000) knlGS:0000000000000000
[18237.026029] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[18237.032449] CR2: 0000000000000000 CR3: 00000006c0a10000 CR4: 00000000000607e0
[18237.040420] Call Trace:
[18237.043201]  [<ffffffffc12e0eae>] ? osd_ldiskfs_it_fill+0xbe/0x260 [osd_ldiskfs]
[18237.051478]  [<ffffffffc12e1087>] osd_it_ea_load+0x37/0x100 [osd_ldiskfs]
[18237.059104]  [<ffffffffc1538e07>] lod_it_load+0x27/0x90 [lod]
[18237.065587]  [<ffffffffc0d0e258>] dt_index_walk+0xf8/0x430 [obdclass]
[18237.072807]  [<ffffffffc15c0c60>] ? mdd_object_lock+0xe0/0xe0 [mdd]
[18237.079818]  [<ffffffffc15c25bf>] mdd_readpage+0x25f/0x5a0 [mdd]
[18237.086558]  [<ffffffffc141f88a>] mdt_readpage+0x69a/0x8e0 [mdt]
[18237.093384]  [<ffffffffc104d9da>] tgt_request_handle+0xada/0x1570 [ptlrpc]
[18237.101137]  [<ffffffffc1027121>] ? ptlrpc_nrs_req_get_nolock0+0xd1/0x170 [ptlrpc]
[18237.109611]  [<ffffffffc0bc0bde>] ? ktime_get_real_seconds+0xe/0x10 [libcfs]
[18237.117558]  [<ffffffffc0ff248b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
[18237.126192]  [<ffffffffc0fef2a5>] ? ptlrpc_wait_event+0xa5/0x360 [ptlrpc]
[18237.133784]  [<ffffffff9f2d3a33>] ? __wake_up+0x13/0x20
[18237.139689]  [<ffffffffc0ff5df4>] ptlrpc_main+0xb34/0x1470 [ptlrpc]
[18237.146704]  [<ffffffff9f980e02>] ? __schedule+0x402/0x840
[18237.152899]  [<ffffffffc0ff52c0>] ? ptlrpc_register_service+0xf80/0xf80 [ptlrpc]
[18237.161175]  [<ffffffff9f2c6321>] kthread+0xd1/0xe0
[18237.166629]  [<ffffffff9f2c6250>] ? insert_kthread_work+0x40/0x40
[18237.173457]  [<ffffffff9f98ed37>] ret_from_fork_nospec_begin+0x21/0x21
[18237.180748]  [<ffffffff9f2c6250>] ? insert_kthread_work+0x40/0x40
[18237.187548] Code:  Bad RIP value.
[18237.191284] RIP  [<          (null)>]           (null)
[18237.197031]  RSP <ffff8b88072f3b68>
[18237.200923] CR2: 0000000000000000
[    0.000000] Initializing cgroup subsys cpuset
[    0.000000] Initializing cgroup subsys cpu
[    0.000000] Initializing cgroup subsys cpuacct
[    0.000000] Linux version 3.10.0-1062.18.1.el7_lustre.x86_64 (jenkins@trevis-307-el7-x8664-1.trevis.whamcloud.com) (gcc version 4.8.5 20150623 (Red Hat 4.8.5-39) (GCC) ) #1 SMP Wed May 27 23:31:53 UTC 2020
[    0.000000] Command line: BOOT_IMAGE=/boot/vmlinuz-3.10.0-1062.18.1.el7_lustre.x86_64 ro console=ttyS0,115200 irqpoll nr_cpus=1 reset_devices cgroup_disable=memory mce=off numa=off udev.children-max=2 panic=10 rootflags=nofail acpi_no_memhotplug transparent_hugepage=never nokaslr novmcoredd disable_cpu_apicid=0 elfcorehdr=869816K
[    0.000000] e820: BIOS-provided physical RAM map:
[    0.000000] BIOS-e820: [mem 0x0000000000000000-0x0000000000000fff] reserved
[    0.000000] BIOS-e820: [mem 0x0000000000001000-0x000000000008efff] usable
[    0.000000] BIOS-e820: [mem 0x000000000008f000-0x000000000009ffff] reserved
[    0.000000] BIOS-e820: [mem 0x00000000000e0000-0x00000000000fffff] reserved
[    0.000000] BIOS-e820: [mem 0x000000002b000000-0x000000003516dfff] usable
[    0.000000] BIOS-e820: [mem 0x00000000bb3c7000-0x00000000bdd2efff] reserved

Comment by Li Xi [ 21/Dec/21 ]

We checked the inodes via debugfs and they showed that the inodes did not have the same FID in the LMA, and in fact two of the inodes were completely unused, so it isn't clear why this error was being printed at all:

I am not sure whether this related, but the osd_add_dot_dotdot_internal() seems save the wrong child fid to "..".

Comment by Andreas Dilger [ 23/Sep/22 ]

This was hit on a customer OST. It looks like this can happen if a block in the inode table overwrites another in-use block in the inode table, and then there are two inodes that have the same content (lma FID, blocks), but different O/<seq>/dX/<oid> pathnames. Alternately, it is possible that the original object was cleared by e2fsck (maybe due to duplicate block processing) and a new empty object was created in its place with the same FID (it looks like both happened). OI Scrub was modified to do a full O/ tree walk in LU-13124 and check multiply-linked objects, but in this case it should detect the path/FID mismatch and either:

  • replace the "new" object if it has no blocks, but "bad" object has blocks
  • remove the "bad" object has no blocks or the same blocks, it should be removed
Comment by Gerrit Updater [ 07/Jul/23 ]

"Lai Siyao <lai.siyao@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/51601
Subject: LU-11457 osd-ldiskfs: scrub FID reuse
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: a2627b8c26d5b0d5580971f227d1cb11d8ae8669

Comment by Gerrit Updater [ 31/Aug/23 ]

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/51601/
Subject: LU-11457 osd-ldiskfs: scrub FID reuse
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 5c59f8551a96ec0b3bbb935a3173c75e878d9ff0

Comment by Peter Jones [ 31/Aug/23 ]

Landed for 2.16

Generated at Sat Feb 10 02:44:02 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.