[LU-11457] osd_oi_insert(): the FID is used by two objects Created: 02/Oct/18 Updated: 02/Oct/23 Resolved: 31/Aug/23 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.12.0, Lustre 2.12.2, Lustre 2.12.4, Lustre 2.12.5 |
| Fix Version/s: | Lustre 2.16.0 |
| Type: | Bug | Priority: | Critical |
| Reporter: | Sarah Liu | Assignee: | Yang Sheng |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | soak | ||
| Attachments: |
|
||||||||
| Issue Links: |
|
||||||||
| Severity: | 3 | ||||||||
| Rank (Obsolete): | 9223372036854775807 | ||||||||
| Description |
|
tag-2.11.55 MDS crash [15650.670434] device-mapper: multipath: Failing path 8:96.^M [15650.765276] BUG: unable to handle kernel NULL pointer dereference at (null)^M [15650.775741] IP: [< (null)>] (null)^M [15650.783081] PGD 0 ^M [15650.786948] Oops: 0010 [#1] SMP ^M [15650.792218] Modules linked in: osp(OE) mdd(OE) lod(OE) mdt(OE) lfsck(OE) mgc(OE) osd_ldiskfs(OE) ldiskfs(OE) lquota(OE) lustre(OE) lmv(OE) mdc(OE) osc(OE) lov(OE) fid(OE) fld(OE) ko2iblnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) libcfs(OE) rpcsec_gss_krb5 nfsv4 dns_resolver nfs lockd grace fscache rdma_ucm(OE) ib_ucm(OE) rdma_cm(OE) iw_cm(OE) ib_ipoib(OE) ib_cm(OE) ib_uverbs(OE) ib_umad(OE) mlx5_ib(OE) mlx5_core(OE) mlxfw(OE) mlx4_en(OE) dm_round_robin zfs(POE) zunicode(POE) zavl(POE) icp(POE) zcommon(POE) znvpair(POE) spl(OE) sb_edac intel_powerclamp coretemp intel_rapl iosf_mbi kvm irqbypass crc32_pclmul ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper ablk_helper cryptd iTCO_wdt joydev pcspkr ipmi_ssif iTCO_vendor_support sg ipmi_si ipmi_devintf shpchp ipmi_msghandler i2c_i801 mei_me ioatdma mei lpc_ich wmi dm_multipath dm_mod auth_rpcgss sunrpc ip_tables ext4 mbcache jbd2 sd_mod crc_t10dif crct10dif_generic mlx4_ib(OE) ib_core(OE) mgag200 drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm drm igb isci ahci ptp mlx4_core(OE) mpt3sas libsas libahci pps_core dca crct10dif_pclmul devlink i2c_algo_bit crct10dif_common raid_class crc32c_intel libata i2c_core mlx_compat(OE) scsi_transport_sas^M [15650.934649] CPU: 14 PID: 9491 Comm: mdt_rdpg01_008 Tainted: P OE ------------ 3.10.0-862.9.1.el7_lustre.x86_64 #1^M [15650.952002] Hardware name: Intel Corporation S2600GZ ........../S2600GZ, BIOS SE5C600.86B.01.08.0003.022620131521 02/26/2013^M [15650.966961] task: ffff8be6a1253f40 ti: ffff8be68a044000 task.ti: ffff8be68a044000^M [15650.977791] RIP: 0010:[<0000000000000000>] [< (null)>] (null)^M [15650.988693] RSP: 0018:ffff8be68a047b58 EFLAGS: 00010246^M [15650.997167] RAX: 0000000000000000 RBX: ffff8be68b820000 RCX: 0000000000000002^M [15651.007733] RDX: ffffffffc164c7b0 RSI: ffff8be68a047b60 RDI: ffff8be68b820008^M [15651.018326] RBP: ffff8be68a047b98 R08: 0000000000000004 R09: 0000000000000000^M [15651.028930] R10: 0000000000000001 R11: 00000000007fffff R12: ffff8be26f9fab00^M [15651.039547] R13: ffff8be279a448a0 R14: ffff8be68a160000 R15: ffff8be68b820008^M [15651.050168] FS: 0000000000000000(0000) GS:ffff8be6ad980000(0000) knlGS:0000000000000000^M [15651.061880] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033^M [15651.070980] CR2: 0000000000000000 CR3: 000000042c3b6000 CR4: 00000000000607e0^M [15651.081660] Call Trace:^M [15651.087091] [<ffffffffc164ac3e>] ? osd_ldiskfs_it_fill+0xbe/0x260 [osd_ldiskfs]^M [15651.098058] [<ffffffffc164ae17>] osd_it_ea_load+0x37/0x100 [osd_ldiskfs]^M [15651.108370] [<ffffffffc188eb47>] lod_it_load+0x27/0x90 [lod]^M [15651.117554] [<ffffffffc0f48808>] dt_index_walk+0xf8/0x430 [obdclass]^M [15651.127457] [<ffffffffc1915080>] ? mdd_object_lock+0xe0/0xe0 [mdd]^M [15651.137132] [<ffffffffc1916d9f>] mdd_readpage+0x25f/0x5a0 [mdd]^M [15651.146553] [<ffffffffc1782bda>] mdt_readpage+0x63a/0x880 [mdt]^M [15651.155992] [<ffffffffc11e82ca>] tgt_request_handle+0xaea/0x1580 [ptlrpc]^M [15651.166379] [<ffffffffc11c02e1>] ? ptlrpc_nrs_req_get_nolock0+0xd1/0x170 [ptlrpc]^M [15651.177493] [<ffffffffc0dfcbde>] ? ktime_get_real_seconds+0xe/0x10 [libcfs]^M [15651.188033] [<ffffffffc118b48b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]^M [15651.199251] [<ffffffffc1188315>] ? ptlrpc_wait_event+0xa5/0x360 [ptlrpc]^M [15651.209399] [<ffffffff83ccf682>] ? default_wake_function+0x12/0x20^M [15651.218931] [<ffffffff83cc52ab>] ? __wake_up_common+0x5b/0x90^M [15651.228026] [<ffffffffc118ecc4>] ptlrpc_main+0xb14/0x1fb0 [ptlrpc]^M [15651.237575] [<ffffffffc118e1b0>] ? ptlrpc_register_service+0xe90/0xe90 [ptlrpc]^M [15651.248365] [<ffffffff83cbb621>] kthread+0xd1/0xe0^M [15651.256344] [<ffffffff83cbb550>] ? insert_kthread_work+0x40/0x40^M [15651.265688] [<ffffffff843205f7>] ret_from_fork_nospec_begin+0x21/0x21^M [15651.275475] [<ffffffff83cbb550>] ? insert_kthread_work+0x40/0x40^M [15651.284736] Code: Bad RIP value.^M [15651.290946] RIP [< (null)>] (null)^M [15651.299236] RSP <ffff8be68a047b58>^M [15651.305543] CR2: 0000000000000000^M [15651.315778] ---[ end trace 4ae4238c00f9aeec ]---^M [15651.336386] Kernel panic - not syncing: Fatal exception^M [15651.344613] Kernel Offset: 0x2c00000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)^M [15651.369289] ------------[ cut here ]------------^M [15651.376397] WARNING: CPU: 14 PID: 9491 at arch/x86/kernel/smp.c:127 native_smp_send_reschedule+0x65/0x70^M [15651.388915] Modules linked in: osp(OE) mdd(OE) lod(OE) mdt(OE) lfsck(OE) mgc(OE) osd_ldiskfs(OE) ldiskfs(OE) lquota(OE) lustre(OE) lmv(OE) mdc(OE) osc(OE) lov(OE) fid(OE) fld(OE) ko2iblnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) libcfs(OE) rpcsec_gss_krb5 nfsv4 dns_resolver nfs lockd grace fscache rdma_ucm(OE) ib_ucm(OE) rdma_cm(OE) iw_cm(OE) ib_ipoib(OE) ib_cm(OE) ib_uverbs(OE) ib_umad(OE) mlx5_ib(OE) mlx5_core(OE) mlxfw(OE) mlx4_en(OE) dm_round_robin zfs(POE) zunicode(POE) zavl(POE) icp(POE) zcommon(POE) znvpair(POE) spl(OE) sb_edac intel_powerclamp coretemp intel_rapl iosf_mbi kvm irqbypass crc32_pclmul ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper ablk_helper cryptd iTCO_wdt joydev pcspkr ipmi_ssif iTCO_vendor_support sg ipmi_si ipmi_devintf shpchp ipmi_msghandler i2c_i801 mei_me ioatdma mei lpc_ich wmi dm_multipath dm_mod auth_rpcgss sunrpc ip_tables ext4 mbcache jbd2 sd_mod crc_t10dif crct10dif_generic mlx4_ib(OE) ib_core(OE) mgag200 drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm drm igb isci ahci ptp mlx4_core(OE) mpt3sas libsas libahci pps_core dca crct10dif_pclmul devlink i2c_algo_bit crct10dif_common raid_class crc32c_intel libata i2c_core mlx_compat(OE) scsi_transport_sas^M [15651.529620] CPU: 14 PID: 9491 Comm: mdt_rdpg01_008 Tainted: P D OE ------------ 3.10.0-862.9.1.el7_lustre.x86_64 #1^M [15651.546472] Hardware name: Intel Corporation S2600GZ ........../S2600GZ, BIOS SE5C600.86B.01.08.0003.022620131521 02/26/2013^M [15651.561156] Call Trace:^M [15651.566023] <IRQ> [<ffffffff8430e84e>] dump_stack+0x19/0x1b^M [15651.574646] [<ffffffff83c91e18>] __warn+0xd8/0x100^M [15651.582224] [<ffffffff83c91f5d>] warn_slowpath_null+0x1d/0x20^M [15651.590851] [<ffffffff83c54e95>] native_smp_send_reschedule+0x65/0x70^M [15651.600279] [<ffffffff83cddf81>] trigger_load_balance+0x191/0x280^M [15651.609280] [<ffffffff83ccdc0a>] scheduler_tick+0x10a/0x150^M [15651.617702] [<ffffffff83d01c10>] ? tick_sched_do_timer+0x50/0x50^M [15651.626619] [<ffffffff83ca4f65>] update_process_times+0x65/0x80^M [15651.635416] [<ffffffff83d01a10>] tick_sched_handle+0x30/0x70^M [15651.643916] [<ffffffff83d01c49>] tick_sched_timer+0x39/0x80^M [15651.652315] [<ffffffff83cbf7e6>] __hrtimer_run_queues+0xd6/0x260^M [15651.661210] [<ffffffff83cbfd7f>] hrtimer_interrupt+0xaf/0x1d0^M [15651.669814] [<ffffffff83c5847b>] local_apic_timer_interrupt+0x3b/0x60^M [15651.679184] [<ffffffff84325063>] smp_apic_timer_interrupt+0x43/0x60^M [15651.688352] [<ffffffff843217b2>] apic_timer_interrupt+0x162/0x170^M [15651.697316] <EOI> [<ffffffff84308c3d>] ? panic+0x1d5/0x21f^M [15651.705715] [<ffffffff84308ba1>] ? panic+0x139/0x21f^M [15651.713430] [<ffffffff84318745>] oops_end+0xc5/0xe0^M [15651.721020] [<ffffffff8430807e>] no_context+0x285/0x2a8^M [15651.728984] [<ffffffff84308115>] __bad_area_nosemaphore+0x74/0x1d1^M [15651.738014] [<ffffffff84308286>] bad_area_nosemaphore+0x14/0x16^M [15651.746760] [<ffffffff8431b6e0>] __do_page_fault+0x330/0x4f0^M [15651.755199] [<ffffffff8431b8d5>] do_page_fault+0x35/0x90^M [15651.763264] [<ffffffff84317758>] page_fault+0x28/0x30^M [15651.771013] [<ffffffffc164c7b0>] ? osd_object_alloc+0x360/0x360 [osd_ldiskfs]^M [15651.781105] [<ffffffffc164ac3e>] ? osd_ldiskfs_it_fill+0xbe/0x260 [osd_ldiskfs]^M [15651.791402] [<ffffffffc164ae17>] osd_it_ea_load+0x37/0x100 [osd_ldiskfs]^M [15651.801028] [<ffffffffc188eb47>] lod_it_load+0x27/0x90 [lod]^M [15651.809517] [<ffffffffc0f48808>] dt_index_walk+0xf8/0x430 [obdclass]^M [15651.818761] [<ffffffffc1915080>] ? mdd_object_lock+0xe0/0xe0 [mdd]^M [15651.827808] [<ffffffffc1916d9f>] mdd_readpage+0x25f/0x5a0 [mdd]^M [15651.836533] [<ffffffffc1782bda>] mdt_readpage+0x63a/0x880 [mdt]^M [15651.845269] [<ffffffffc11e82ca>] tgt_request_handle+0xaea/0x1580 [ptlrpc]^M [15651.854937] [<ffffffffc11c02e1>] ? ptlrpc_nrs_req_get_nolock0+0xd1/0x170 [ptlrpc]^M [15651.865302] [<ffffffffc0dfcbde>] ? ktime_get_real_seconds+0xe/0x10 [libcfs]^M [15651.875084] [<ffffffffc118b48b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]^M [15651.885522] [<ffffffffc1188315>] ? ptlrpc_wait_event+0xa5/0x360 [ptlrpc]^M [15651.894881] [<ffffffff83ccf682>] ? default_wake_function+0x12/0x20^M [15651.903621] [<ffffffff83cc52ab>] ? __wake_up_common+0x5b/0x90^M [15651.911880] [<ffffffffc118ecc4>] ptlrpc_main+0xb14/0x1fb0 [ptlrpc]^M [15651.920585] [<ffffffffc118e1b0>] ? ptlrpc_register_service+0xe90/0xe90 [ptlrpc]^M [15651.930460] [<ffffffff83cbb621>] kthread+0xd1/0xe0^M [15651.937468] [<ffffffff83cbb550>] ? insert_kthread_work+0x40/0x40^M [15651.945794] [<ffffffff843205f7>] ret_from_fork_nospec_begin+0x21/0x21^M [15651.954564] [<ffffffff83cbb550>] ? insert_kthread_work+0x40/0x40^M [15651.962806] ---[ end trace 4ae4238c00f9aeed ]---^M |
| Comments |
| Comment by Peter Jones [ 02/Oct/18 ] |
|
Lai Does this appear to be related to recent changes under Peter |
| Comment by Lai Siyao [ 10/Oct/18 ] |
|
Sarah, did you run dir migration test in soak test? And how many times have you seen this crash? |
| Comment by Sarah Liu [ 10/Oct/18 ] |
|
Hi Lai, No, I didn't run dir migration. When I filed the ticket, it was the first time seen the problem, but I restarted soak on tip of master on Monday with build #137, hit the same issue again in about 2 hours run. Here is the trace from the latest failure: on soak-10 [ 9654.637324] blk_cloned_rq_check_limits: over max size limit. [ 9658.590523] BUG: unable to handle kernel NULL pointer dereference at (null) [ 9658.600366] IP: [< (null)>] (null) [ 9658.606866] PGD 0 [ 9658.609984] Oops: 0010 [#1] SMP [ 9658.614447] Modules linked in: osp(OE) mdd(OE) lod(OE) mdt(OE) lfsck(OE) mgc(OE) osd_ldiskfs(OE) ldiskfs(OE) lquota(OE) lustre(OE) lmv(OE) mdc(OE) osc(OE) lov(OE) fid(OE) fld(OE) ko2iblnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) libcfs(OE) rpcsec_gss_krb5 nfsv4 dns_resolver nfs lockd grace fscache rdma_ucm(OE) ib_ucm(OE) rdma_cm(OE) iw_cm(OE) ib_ipoib(OE) ib_cm(OE) ib_uverbs(OE) ib_umad(OE) mlx5_ib(OE) mlx5_core(OE) mlxfw(OE) mlx4_en(OE) dm_round_robin zfs(POE) zunicode(POE) zavl(POE) icp(POE) zcommon(POE) znvpair(POE) sb_edac intel_powerclamp coretemp spl(OE) intel_rapl iosf_mbi kvm irqbypass crc32_pclmul ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper ablk_helper cryptd ipmi_ssif pcspkr ipmi_si ipmi_devintf ipmi_msghandler iTCO_wdt iTCO_vendor_support joydev sg mei_me mei i2c_i801 lpc_ich ioatdma wmi shpchp auth_rpcgss dm_multipath dm_mod sunrpc ip_tables ext4 mbcache jbd2 sd_mod crc_t10dif crct10dif_generic mlx4_ib(OE) ib_core(OE) mgag200 drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm igb isci ahci ptp drm mlx4_core(OE) libsas libahci pps_core mpt3sas dca crct10dif_pclmul devlink i2c_algo_bit crct10dif_common crc32c_intel raid_class libata i2c_core mlx_compat(OE) scsi_transport_sas [ 9658.745591] CPU: 11 PID: 22180 Comm: mdt_out01_010 Tainted: P OE ------------ 3.10.0-862.9.1.el7_lustre.x86_64 #1 [ 9658.760612] Hardware name: Intel Corporation SandyBridge Platform/To be filled by O.E.M., BIOS SE5C600.86B.01.08.0003.022620131521 02/26/2013 [ 9658.777200] task: ffff9e132d6f2f70 ti: ffff9e12e2b24000 task.ti: ffff9e12e2b24000 [ 9658.786678] RIP: 0010:[<0000000000000000>] [< (null)>] (null) [ 9658.796193] RSP: 0018:ffff9e12e2b27b60 EFLAGS: 00010246 [ 9658.803358] RAX: 0000000000000000 RBX: ffff9e1715878000 RCX: 0000000000000002 [ 9658.812569] RDX: ffffffffc16877b0 RSI: ffff9e12e2b27b68 RDI: ffff9e1715878008 [ 9658.821879] RBP: ffff9e12e2b27ba0 R08: 0000000000000004 R09: 0000000000000000 [ 9658.831212] R10: 0000000000000001 R11: 00000000007fffff R12: ffff9e1317b6ad00 [ 9658.840559] R13: ffff9e16edd7c030 R14: ffff9e1714d4e800 R15: ffff9e1715878008 [ 9658.849980] FS: 0000000000000000(0000) GS:ffff9e172d8c0000(0000) knlGS:0000000000000000 [ 9658.860384] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 9658.868127] CR2: 0000000000000000 CR3: 000000073a60e000 CR4: 00000000000607e0 [ 9658.877444] Call Trace: [ 9658.881511] [<ffffffffc1685c3e>] ? osd_ldiskfs_it_fill+0xbe/0x260 [osd_ldiskfs] [ 9658.891221] [<ffffffffc1685e17>] osd_it_ea_load+0x37/0x100 [osd_ldiskfs] [ 9658.900163] [<ffffffffc0f47608>] dt_index_walk+0xf8/0x430 [obdclass] [ 9658.908835] [<ffffffffc0f47940>] ? dt_index_walk+0x430/0x430 [obdclass] [ 9658.917808] [<ffffffffc0f48a14>] dt_index_read+0x394/0x6a0 [obdclass] [ 9658.926634] [<ffffffffc11dfd22>] tgt_obd_idx_read+0x612/0x860 [ptlrpc] [ 9658.935280] [<ffffffffc11e2f3a>] tgt_request_handle+0xaea/0x1580 [ptlrpc] [ 9658.944303] [<ffffffffc11bea61>] ? ptlrpc_nrs_req_get_nolock0+0xd1/0x170 [ptlrpc] [ 9658.954245] [<ffffffffc0dc2bde>] ? ktime_get_real_seconds+0xe/0x10 [libcfs] [ 9658.963572] [<ffffffffc1189acb>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc] [ 9658.973654] [<ffffffffc1186955>] ? ptlrpc_wait_event+0xa5/0x360 [ptlrpc] [ 9658.982660] [<ffffffffaaccf682>] ? default_wake_function+0x12/0x20 [ 9658.991178] [<ffffffffaacc52ab>] ? __wake_up_common+0x5b/0x90 [ 9658.999219] [<ffffffffc118d2ec>] ptlrpc_main+0xafc/0x1fb0 [ptlrpc] [ 9659.007754] [<ffffffffc118c7f0>] ? ptlrpc_register_service+0xe90/0xe90 [ptlrpc] [ 9659.017507] [<ffffffffaacbb621>] kthread+0xd1/0xe0 [ 9659.024461] [<ffffffffaacbb550>] ? insert_kthread_work+0x40/0x40 [ 9659.032809] [<ffffffffab3205f7>] ret_from_fork_nospec_begin+0x21/0x21 [ 9659.041631] [<ffffffffaacbb550>] ? insert_kthread_work+0x40/0x40 [ 9659.049943] Code: Bad RIP value. [ 9659.055214] RIP [< (null)>] (null) [ 9659.062486] RSP <ffff9e12e2b27b60> [ 9659.067869] CR2: 0000000000000000 [ 9659.075547] ---[ end trace 8f1f37f93401bf0f ]--- l Corporation.All rights reserved. Version 2.00.1201.Copyright(c) 2010 - 2012 American Megatrends,Inc. Installed BIOS: SE5C600.86B.01.08.0003 ------system reboot, and after reboot-------- CentOS Linux 7 (Core) Kernel 3.10.0-862.9.1.el7_lustre.x86_64 on an x86_64 soak-10 login: [ 175.721668] LNet: HW NUMA nodes: 2, HW CPU cores: 32, npartitions: 2 [ 175.731983] alg: No test for adler32 (adler32-zlib) [ 176.631197] Lustre: Lustre: Build Version: 2.11.55_65_gec2e999 [ 176.925651] LNet: Using FMR for registration [ 176.943513] LNet: Added LNI 192.168.1.110@o2ib [8/256/0/180] [ 177.121204] LDISKFS-fs warning (device dm-6): ldiskfs_multi_mount_protect:321: MMP interval 42 higher than expected, please wait. [ 177.121204] [ 223.131715] LDISKFS-fs (dm-6): recovery complete [ 223.137107] LDISKFS-fs (dm-6): mounted filesystem with ordered data mode. Opts: user_xattr,errors=remount-ro,user_xattr,no_mbcache,nodelalloc [ 227.640081] Lustre: soaked-MDT0002: Not available for connect from 192.168.1.142@o2ib (not set up) [ 230.666682] Lustre: soaked-MDT0002: Imperative Recovery enabled, recovery window shrunk from 300-900 down to 150-900 [ 230.703152] LustreError: 4067:0:(llog_osd.c:1005:llog_osd_next_block()) soaked-MDT0000-osp-MDT0002: missed desired record? 3 > 1 [ 230.716141] LustreError: 4067:0:(lod_dev.c:428:lod_sub_recovery_thread()) soaked-MDT0000-osp-MDT0002 get update log failed: rc = -2 [ 234.808297] Lustre: soaked-MDT0002: Connection restored to 192.168.1.106@o2ib (at 192.168.1.106@o2ib) [ 237.283751] Lustre: soaked-MDT0002: Connection restored to e3d1d543-7071-4f1d-f9c5-f174c66a3f7e (at 192.168.1.143@o2ib) [ 237.295833] Lustre: Skipped 1 previous similar message [ 237.308240] Lustre: 4070:0:(ldlm_lib.c:2048:target_recovery_overseer()) recovery is aborted, evict exports in recovery [ 237.320897] Lustre: soaked-MDT0002: disconnecting 28 stale clients [ 237.327887] LustreError: 4070:0:(tgt_grant.c:248:tgt_grant_sanity_check()) mdt_obd_disconnect: tot_granted 2097152 != fo_tot_granted 4194304 [ 238.302385] Lustre: soaked-MDT0002: Connection restored to 1d05ec18-7f6a-896f-de54-063cf1f8b51c (at 192.168.1.122@o2ib) [ 238.314575] Lustre: Skipped 1 previous similar message [ 240.471517] Lustre: soaked-MDT0002: Connection restored to a225c1a8-609c-d769-404b-3c25112913a0 (at 192.168.1.141@o2ib) [ 240.483695] Lustre: Skipped 7 previous similar messages [ 244.492077] Lustre: soaked-MDT0002: Connection restored to a6836049-d02e-27ab-c254-5e76eb9cef2b (at 192.168.1.136@o2ib) [ 244.504273] Lustre: Skipped 4 previous similar messages [ 254.539127] Lustre: soaked-MDT0002: Connection restored to e6fd2edd-0c64-011a-f170-b8c415906b8c (at 192.168.1.124@o2ib) [ 254.551333] Lustre: Skipped 9 previous similar messages [ 273.870109] Lustre: soaked-MDT0002: Connection restored to 192.168.1.104@o2ib (at 192.168.1.104@o2ib) [ 273.880570] Lustre: Skipped 7 previous similar messages [ 292.450898] LustreError: 168-f: soaked-MDT0002: BAD WRITE CHECKSUM: from 12345-192.168.1.135@o2ib inode [0x28001e862:0x214:0x0] object 0x28001e862:532 extent [36-39]: client csum 2ff01ff, server csum 2ed01f6 [ 293.574529] LustreError: 168-f: soaked-MDT0002: BAD WRITE CHECKSUM: from 12345-192.168.1.135@o2ib inode [0x28001e862:0x214:0x0] object 0x28001e862:532 extent [36-39]: client csum 2ff01ff, server csum 2ed01f6 [ 295.414655] LustreError: 168-f: soaked-MDT0002: BAD WRITE CHECKSUM: from 12345-192.168.1.135@o2ib inode [0x28001e862:0x214:0x0] object 0x28001e862:532 extent [36-39]: client csum 2ff01ff, server csum 2ed01f6 [ 296.317843] BUG: unable to handle kernel NULL pointer dereference at 00000000000005ab [ 296.326718] IP: [<ffffffff8c3528c3>] rb_next+0x23/0x50 [ 296.332536] PGD 0 [ 296.334823] Oops: 0000 [#1] SMP [ 296.341568] Modules linked in: osp(OE) mdd(OE) lod(OE) mdt(OE) lfsck(OE) mgc(OE) osd_ldiskfs(OE) ldiskfs(OE) lquota(OE) fid(OE) fld(OE) ko2iblnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) libcfs(OE) rpcsec_gss_krb5 nfsv4 dns_resolver nfs lockd grace fscache rdma_ucm(OE) ib_ucm(OE) rdma_cm(OE) iw_cm(OE) ib_ipoib(OE) ib_cm(OE) ib_uverbs(OE) ib_umad(OE) mlx5_ib(OE) mlx5_core(OE) mlxfw(OE) mlx4_en(OE) dm_round_robin zfs(POE) zunicode(POE) zavl(POE) icp(POE) zcommon(POE) znvpair(POE) spl(OE) sb_edac intel_powerclamp coretemp intel_rapl iosf_mbi kvm irqbypass crc32_pclmul ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper ablk_helper cryptd iTCO_wdt iTCO_vendor_support joydev ipmi_ssif pcspkr mei_me sg mei i2c_i801 lpc_ich ipmi_si ipmi_devintf ipmi_msghandler wmi ioatdma shpchp dm_multipath dm_mod auth_rpcgss sunrpc ip_tables ext4 mbcache jbd2 sd_mod crc_t10dif crct10dif_generic mlx4_ib(OE) ib_core(OE) mgag200 drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops igb ttm ptp mlx4_core(OE) drm isci ahci mpt3sas pps_core libsas crct10dif_pclmul libahci devlink dca crct10dif_common i2c_algo_bit raid_class crc32c_intel libata i2c_core mlx_compat(OE) scsi_transport_sas [ 296.490014] CPU: 10 PID: 3968 Comm: mdt_out01_000 Tainted: P OE ------------ 3.10.0-862.9.1.el7_lustre.x86_64 #1 [ 296.505712] Hardware name: Intel Corporation S2600GZ ........../S2600GZ, BIOS SE5C600.86B.01.08.0003.022620131521 02/26/2013 [ 296.522014] task: ffff9193e349af70 ti: ffff9190a81f4000 task.ti: ffff9190a81f4000 [ 296.534051] RIP: 0010:[<ffffffff8c3528c3>] [<ffffffff8c3528c3>] rb_next+0x23/0x50 [ 296.545986] RSP: 0018:ffff9190a81f7a40 EFLAGS: 00010202 [ 296.555563] RAX: 000000000000059b RBX: ffff91981aea8d90 RCX: 0000000000000000 [ 296.567341] RDX: 000000000000059b RSI: ffff919815fb2d5e RDI: ffff919815fb2d28 [ 296.578836] RBP: ffff9190a81f7a40 R08: 000000000e800157 R09: 0000000000000004 [ 296.590438] R10: ffff91982ae9c500 R11: ffff91982ae9c500 R12: ffff91980d991900 [ 296.601802] R13: ffff91981aea8d90 R14: ffff91981f763000 R15: ffff91980df68008 [ 296.613303] FS: 0000000000000000(0000) GS:ffff91982d880000(0000) knlGS:0000000000000000 [ 296.625676] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 296.634865] CR2: 00000000000005ab CR3: 00000004ebe0e000 CR4: 00000000000607e0 [ 296.645657] Call Trace: [ 296.651676] [<ffffffffc162cd44>] ldiskfs_readdir+0x5b4/0x850 [ldiskfs] [ 296.662343] [<ffffffffc0944ef2>] ? fld_local_lookup+0x62/0x270 [fld] [ 296.672780] [<ffffffffc16ae7b0>] ? osd_object_alloc+0x360/0x360 [osd_ldiskfs] [ 296.684053] [<ffffffffc16acc3e>] osd_ldiskfs_it_fill+0xbe/0x260 [osd_ldiskfs] [ 296.695182] [<ffffffffc16acfa6>] osd_it_ea_next+0xc6/0x150 [osd_ldiskfs] [ 296.705729] [<ffffffffc1184ae8>] dt_index_page_build+0x1a8/0x470 [obdclass] [ 296.716658] [<ffffffffc11846b0>] dt_index_walk+0x1a0/0x430 [obdclass] [ 296.726761] [<ffffffffc1184940>] ? dt_index_walk+0x430/0x430 [obdclass] [ 296.737149] [<ffffffffc1185a14>] dt_index_read+0x394/0x6a0 [obdclass] [ 296.747492] [<ffffffffc141cd22>] tgt_obd_idx_read+0x612/0x860 [ptlrpc] [ 296.757860] [<ffffffffc141ff3a>] tgt_request_handle+0xaea/0x1580 [ptlrpc] [ 296.768476] [<ffffffffc13fba61>] ? ptlrpc_nrs_req_get_nolock0+0xd1/0x170 [ptlrpc] [ 296.779775] [<ffffffffc1038bde>] ? ktime_get_real_seconds+0xe/0x10 [libcfs] [ 296.790527] [<ffffffffc13c6acb>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc] [ 296.801935] [<ffffffffc13c3955>] ? ptlrpc_wait_event+0xa5/0x360 [ptlrpc] [ 296.812275] [<ffffffff8c0cf682>] ? default_wake_function+0x12/0x20 [ 296.822010] [<ffffffff8c0c52ab>] ? __wake_up_common+0x5b/0x90 [ 296.831226] [<ffffffffc13ca2ec>] ptlrpc_main+0xafc/0x1fb0 [ptlrpc] [ 296.840865] [<ffffffffc13c97f0>] ? ptlrpc_register_service+0xe90/0xe90 [ptlrpc] [ 296.851675] [<ffffffff8c0bb621>] kthread+0xd1/0xe0 [ 296.859622] [<ffffffff8c0bb550>] ? insert_kthread_work+0x40/0x40 [ 296.868941] [<ffffffff8c7205f7>] ret_from_fork_nospec_begin+0x21/0x21 [ 296.878704] [<ffffffff8c0bb550>] ? insert_kthread_work+0x40/0x40 [ 296.888023] Code: c0 5d c3 0f 1f 44 00 00 55 48 8b 17 48 89 e5 48 39 d7 74 3b 48 8b 47 08 48 85 c0 75 0e eb 25 66 0f 1f 84 00 00 00 00 00 48 89 d0 <48> 8b 50 10 48 85 d2 75 f4 5d c3 66 90 48 3b 78 08 75 f6 48 8b [ 296.915195] RIP [<ffffffff8c3528c3>] rb_next+0x23/0x50 [ 296.923524] RSP <ffff9190a81f7a40> [ 296.929891] CR2: 00000000000005ab [ 296.935844] ---[ end trace 7f7dc5e5140c0c8b ]--- |
| Comment by Peter Jones [ 12/Oct/18 ] |
|
Yang Sheng This seems to be an ldiskfs issue. It has only appeared between the 2.11.54 and 2.11.55 tags. Any suggestions? Peter |
| Comment by Yang Sheng [ 12/Oct/18 ] |
|
Maybe some issue like LDEV-642, Let me do further investigation. |
| Comment by Yang Sheng [ 12/Oct/18 ] |
[15638.646875] blk_cloned_rq_check_limits: over max size limit. [15638.646879] blk_cloned_rq_check_limits: over max size limit. [15638.646888] blk_cloned_rq_check_limits: over max size limit. [15638.646890] blk_cloned_rq_check_limits: over max size limit. [15638.646892] blk_cloned_rq_check_limits: over max size limit. [15638.646895] blk_cloned_rq_check_limits: over max size limit. [15638.646916] device-mapper: multipath: Failing path 8:96. This issue should be fixed by |
| Comment by John Hammond [ 30/Oct/18 ] |
|
Which build was this? Was it lustre-master-patchless #137? Is there a more recent build (recent enough that we still have the RPMs) that has this issue? |
| Comment by Sarah Liu [ 30/Oct/18 ] |
|
hi John, it is lustre-master-ib build 137. No, I don't see the failure recently, soak is running on lustre-lustre-ib build 142 right now |
| Comment by Peter Jones [ 30/Oct/18 ] |
|
So can we close this ticket as cannot repo? Previously this had been happening every couple of hours, right? |
| Comment by Sarah Liu [ 30/Oct/18 ] |
|
I think we can close it as cannot reproduce for now. |
| Comment by Peter Jones [ 30/Oct/18 ] |
|
ok - thanks Sarah! |
| Comment by Sarah Liu [ 13/May/19 ] |
|
Hit the same issue again when testing 2.12.2-rc1, 1 MDS crash after running about 34 hours soak-10 console log [10077.630267] device-mapper: multipath: Failing path 8:96. [10080.630817] device-mapper: multipath: Reinstating path 8:128. [10080.637462] device-mapper: multipath: Failing path 8:128. [10081.939589] INFO: task kmmpd-dm-6:12825 blocked for more than 120 seconds. [10081.947277] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [10081.956040] kmmpd-dm-6 D ffff8da0f6120000 0 12825 2 0x00000080 [10081.964033] Call Trace: [10081.966797] [<ffffffff8e366d60>] ? bit_wait+0x50/0x50 [10081.972543] [<ffffffff8e368c49>] schedule+0x29/0x70 [10081.978120] [<ffffffff8e366721>] schedule_timeout+0x221/0x2d0 [10081.984704] [<ffffffffc05040bd>] ? dm_old_request_fn+0x18d/0x210 [dm_mod] [10081.992416] [<ffffffff8df680de>] ? deadline_add_request+0x6e/0x90 [10081.999361] [<ffffffff8dd01092>] ? ktime_get_ts64+0x52/0xf0 [10082.005679] [<ffffffff8e366d60>] ? bit_wait+0x50/0x50 [10082.011450] [<ffffffff8e3682ed>] io_schedule_timeout+0xad/0x130 [10082.018164] [<ffffffff8e368388>] io_schedule+0x18/0x20 [10082.024022] [<ffffffff8e366d71>] bit_wait_io+0x11/0x50 [10082.029881] [<ffffffff8e366897>] __wait_on_bit+0x67/0x90 [10082.035921] [<ffffffff8e366d60>] ? bit_wait+0x50/0x50 [10082.041680] [<ffffffff8e366a01>] out_of_line_wait_on_bit+0x81/0xb0 [10082.048707] [<ffffffff8dcc2e00>] ? wake_bit_function+0x40/0x40 [10082.055333] [<ffffffff8de7931a>] __wait_on_buffer+0x2a/0x30 [10082.061693] [<ffffffffc11429ed>] write_mmp_block+0x11d/0x160 [ldiskfs] [10082.069108] [<ffffffffc1142c48>] kmmpd+0x1a8/0x430 [ldiskfs] [10082.075554] [<ffffffffc1142aa0>] ? __dump_mmp_msg+0x70/0x70 [ldiskfs] [10082.082842] [<ffffffff8dcc1c71>] kthread+0xd1/0xe0 [10082.088321] [<ffffffff8dcc1ba0>] ? insert_kthread_work+0x40/0x40 [10082.095153] [<ffffffff8e375c37>] ret_from_fork_nospec_begin+0x21/0x21 [10082.102444] [<ffffffff8dcc1ba0>] ? insert_kthread_work+0x40/0x40 [10082.109292] INFO: task jbd2/dm-6-8:12826 blocked for more than 120 seconds. [10082.117092] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [10082.125846] jbd2/dm-6-8 D ffff8da8ea112080 0 12826 2 0x00000080 [10082.133794] Call Trace: [10082.136532] [<ffffffff8e368c49>] schedule+0x29/0x70 [10082.142108] [<ffffffffc047b2bc>] jbd2_journal_commit_transaction+0x23c/0x19b0 [jbd2] [10082.150883] [<ffffffff8dcd9f15>] ? sched_clock_cpu+0x85/0xc0 [10082.157323] [<ffffffff8dc2a59e>] ? __switch_to+0xce/0x580 [10082.163459] [<ffffffff8dcc2d40>] ? wake_up_atomic_t+0x30/0x30 [10082.169993] [<ffffffff8e36878f>] ? __schedule+0x3ff/0x890 [10082.176140] [<ffffffff8dcaa11e>] ? try_to_del_timer_sync+0x5e/0x90 [10082.183156] [<ffffffffc0481e89>] kjournald2+0xc9/0x260 [jbd2] [10082.189686] [<ffffffff8dcc2d40>] ? wake_up_atomic_t+0x30/0x30 [10082.196208] [<ffffffffc0481dc0>] ? commit_timeout+0x10/0x10 [jbd2] [10082.203227] [<ffffffff8dcc1c71>] kthread+0xd1/0xe0 [10082.208710] [<ffffffff8dcc1ba0>] ? insert_kthread_work+0x40/0x40 [10082.215521] [<ffffffff8e375c37>] ret_from_fork_nospec_begin+0x21/0x21 [10082.222812] [<ffffffff8dcc1ba0>] ? insert_kthread_work+0x40/0x40 [10082.229643] INFO: task mdt_out01_012:13070 blocked for more than 120 seconds. [10082.237624] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [10082.246372] mdt_out01_012 D ffff8da8d67f9040 0 13070 2 0x00000080 [10082.254354] Call Trace: [10082.257115] [<ffffffff8e366d60>] ? bit_wait+0x50/0x50 [10082.262863] [<ffffffff8e368c49>] schedule+0x29/0x70 [10082.268446] [<ffffffff8e366721>] schedule_timeout+0x221/0x2d0 [10082.274989] [<ffffffff8de78c3d>] ? __brelse+0x3d/0x50 [10082.280755] [<ffffffff8de78f2a>] ? bh_lru_install+0x18a/0x1e0 [10082.287296] [<ffffffff8dd01092>] ? ktime_get_ts64+0x52/0xf0 [10082.293630] [<ffffffff8e366d60>] ? bit_wait+0x50/0x50 [10082.299394] [<ffffffff8e3682ed>] io_schedule_timeout+0xad/0x130 [10082.306106] [<ffffffff8e368388>] io_schedule+0x18/0x20 [10082.311963] [<ffffffff8e366d71>] bit_wait_io+0x11/0x50 [10082.317820] [<ffffffff8e366921>] __wait_on_bit_lock+0x61/0xc0 [10082.324361] [<ffffffff8e366d60>] ? bit_wait+0x50/0x50 [10082.330114] [<ffffffff8e366b71>] out_of_line_wait_on_bit_lock+0x81/0xb0 [10082.337614] [<ffffffff8dcc2e00>] ? wake_bit_function+0x40/0x40 [10082.344252] [<ffffffff8de79352>] __lock_buffer+0x32/0x40 [10082.350326] [<ffffffffc0479ea2>] do_get_write_access+0x432/0x4d0 [jbd2] [10082.357831] [<ffffffff8de7959d>] ? __getblk+0x2d/0x300 [10082.363681] [<ffffffffc0479f67>] jbd2_journal_get_write_access+0x27/0x40 [jbd2] [10082.371965] [<ffffffffc1159dab>] __ldiskfs_journal_get_write_access+0x3b/0x80 [ldiskfs] [10082.381027] [<ffffffffc112f180>] add_dirent_to_buf+0xd0/0x2c0 [ldiskfs] [10082.388530] [<ffffffffc112f55e>] ldiskfs_dx_add_entry+0x1ee/0xc60 [ldiskfs] [10082.396428] [<ffffffffc047a2d1>] ? jbd2_journal_dirty_metadata+0xd1/0x260 [jbd2] [10082.404806] [<ffffffff8dcc2d05>] ? wake_up_bit+0x25/0x30 [10082.410868] [<ffffffffc1130a54>] __ldiskfs_add_entry+0x254/0x6e0 [ldiskfs] [10082.418671] [<ffffffffc1132ae3>] ? ldiskfs_add_dot_dotdot+0x83/0x90 [ldiskfs] [10082.426755] [<ffffffffc11e9404>] osd_ldiskfs_add_entry+0x34/0x260 [osd_ldiskfs] [10082.435042] [<ffffffffc11e9c77>] __osd_ea_add_rec+0x197/0x360 [osd_ldiskfs] [10082.442941] [<ffffffffc11f4ad1>] osd_index_ea_insert+0x491/0x12f0 [osd_ldiskfs] [10082.451206] [<ffffffff8e367f32>] ? down_write+0x12/0x3d [10082.457246] [<ffffffffc0f43350>] out_tx_index_insert_exec+0x1e0/0x490 [ptlrpc] [10082.465464] [<ffffffffc0f41521>] ? out_obj_ref_add+0xc1/0x220 [ptlrpc] [10082.472895] [<ffffffffc0f3c391>] out_tx_end+0xe1/0x5c0 [ptlrpc] [10082.479666] [<ffffffffc0f405b2>] out_handle+0x1452/0x1bc0 [ptlrpc] [10082.486717] [<ffffffffc0e87890>] ? target_send_reply_msg+0x170/0x170 [ptlrpc] [10082.494854] [<ffffffffc0f351da>] tgt_request_handle+0xaea/0x1580 [ptlrpc] [10082.502588] [<ffffffffc0f10c11>] ? ptlrpc_nrs_req_get_nolock0+0xd1/0x170 [ptlrpc] [10082.511061] [<ffffffffc0b2abde>] ? ktime_get_real_seconds+0xe/0x10 [libcfs] [10082.519002] [<ffffffffc0eda80b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc] [10082.527624] [<ffffffffc0ed7695>] ? ptlrpc_wait_event+0xa5/0x360 [ptlrpc] [10082.535212] [<ffffffff8dcd6802>] ? default_wake_function+0x12/0x20 [10082.542246] [<ffffffff8dccbadb>] ? __wake_up_common+0x5b/0x90 [10082.548808] [<ffffffffc0ede13c>] ptlrpc_main+0xafc/0x1fc0 [ptlrpc] [10082.555853] [<ffffffffc0edd640>] ? ptlrpc_register_service+0xf80/0xf80 [ptlrpc] [10082.564140] [<ffffffff8dcc1c71>] kthread+0xd1/0xe0 [10082.569609] [<ffffffff8dcc1ba0>] ? insert_kthread_work+0x40/0x40 [10082.576435] [<ffffffff8e375c37>] ret_from_fork_nospec_begin+0x21/0x21 [10082.583733] [<ffffffff8dcc1ba0>] ? insert_kthread_work+0x40/0x40 [10082.590580] INFO: task debugfs:27640 blocked for more than 120 seconds. [10082.597980] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [10082.606728] debugfs D ffff8da16f7c4100 0 27640 27639 0x00000080 [10082.614668] Call Trace: [10082.617431] [<ffffffff8df7d1b2>] ? radix_tree_lookup_slot+0x22/0x50 [10082.624532] [<ffffffff8e366d60>] ? bit_wait+0x50/0x50 [10082.630298] [<ffffffff8e368c49>] schedule+0x29/0x70 [10082.635855] [<ffffffff8e366721>] schedule_timeout+0x221/0x2d0 [10082.637668] device-mapper: multipath: Reinstating path 8:96. [10082.637830] device-mapper: multipath: Failing path 8:96. [10082.654638] [<ffffffff8ddb6d2d>] ? find_get_pages_tag+0xfd/0x240 [10082.661459] [<ffffffff8e366d60>] ? bit_wait+0x50/0x50 [10082.667204] [<ffffffff8e3682ed>] io_schedule_timeout+0xad/0x130 [10082.673927] [<ffffffff8e368388>] io_schedule+0x18/0x20 [10082.679768] [<ffffffff8e366d71>] bit_wait_io+0x11/0x50 [10082.685610] [<ffffffff8e366897>] __wait_on_bit+0x67/0x90 [10082.691649] [<ffffffff8ddb5881>] wait_on_page_bit+0x81/0xa0 [10082.697995] [<ffffffff8dcc2e00>] ? wake_bit_function+0x40/0x40 [10082.704613] [<ffffffff8ddb59b1>] __filemap_fdatawait_range+0x111/0x190 [10082.712004] [<ffffffff8ddb5a44>] filemap_fdatawait_range+0x14/0x30 [10082.719033] [<ffffffff8ddb7ea6>] filemap_write_and_wait_range+0x56/0x90 [10082.726530] [<ffffffff8de803eb>] blkdev_fsync+0x1b/0x50 [10082.732468] [<ffffffff8de762f7>] do_fsync+0x67/0xb0 [10082.738041] [<ffffffff8e375d15>] ? system_call_after_swapgs+0xa2/0x146 [10082.745435] [<ffffffff8de765e0>] SyS_fsync+0x10/0x20 [10082.751082] [<ffffffff8e375ddb>] system_call_fastpath+0x22/0x27 [10082.757818] [<ffffffff8e375d21>] ? system_call_after_swapgs+0xae/0x146 [10085.638108] device-mapper: multipath: Reinstating path 8:128. [10085.644767] device-mapper: multipath: Failing path 8:128. [10087.645110] device-mapper: multipath: Reinstating path 8:96. [10087.651810] sd 0:0:0:44: rdac: array soak-netapp5660-1, ctlr 0, queueing MODE_SELECT command [10088.301259] sd 0:0:0:44: rdac: array soak-netapp5660-1, ctlr 0, MODE_SELECT completed [10088.676785] LustreError: 13053:0:(osd_oi.c:762:osd_oi_insert()) dm-6: the FID [0x2800cae3d:0xe3ec:0x0] is used by two objects: 30409754/1062736205 477627435/1062736207 [10088.693877] LNet: 12726:0:(o2iblnd_cb.c:413:kiblnd_handle_rx()) PUT_NACK from 192.168.1.109@o2ib [10090.652279] device-mapper: multipath: Reinstating path 8:128. [10090.658823] device-mapper: multipath: Failing path 8:128. [10095.663302] device-mapper: multipath: Reinstating path 8:128. [10095.669968] device-mapper: multipath: Failing path 8:128. [10100.670909] device-mapper: multipath: Reinstating path 8:128. [10100.677556] device-mapper: multipath: Failing path 8:128. [10105.677908] device-mapper: multipath: Reinstating path 8:128. [10105.684607] device-mapper: multipath: Failing path 8:128. [10110.686120] device-mapper: multipath: Reinstating path 8:128. [10110.692756] device-mapper: multipath: Failing path 8:128. [10115.694525] device-mapper: multipath: Reinstating path 8:128. [10115.701149] device-mapper: multipath: Failing path 8:128. [10119.333857] BUG: unable to handle kernel NULL pointer dereference at (null) [10119.342631] IP: [< (null)>] (null) [10119.348281] PGD 0 [10119.350546] Oops: 0010 [#1] SMP [10119.354188] Modules linked in: osp(OE) mdd(OE) lod(OE) mdt(OE) lfsck(OE) mgc(OE) osd_ldiskfs(OE) ldiskfs(OE) lquota(OE) fid(OE) fld(OE) ko2iblnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) libcfs(OE) rpcsec_gss_krb5 nfsv4 dns_resolver nfs lockd grace fscache rdma_ucm(OE) ib_ucm(OE) rdma_cm(OE) iw_cm(OE) ib_ipoib(OE) ib_cm(OE) ib_umad(OE) mlx5_ib(OE) mlx5_core(OE) mlxfw(OE) mlx4_en(OE) sb_edac intel_powerclamp coretemp intel_rapl iosf_mbi kvm irqbypass crc32_pclmul ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper ipmi_ssif ablk_helper dm_round_robin cryptd iTCO_wdt iTCO_vendor_support ipmi_si joydev pcspkr ipmi_devintf mei_me ioatdma ipmi_msghandler i2c_i801 mei sg lpc_ich wmi dm_multipath dm_mod auth_rpcgss sunrpc ip_tables ext4 mbcache jbd2 sd_mod crc_t10dif crct10dif_generic mlx4_ib(OE) ib_uverbs(OE) ib_core(OE) mgag200 drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops mlx4_core(OE) igb ttm isci ahci devlink crct10dif_pclmul crct10dif_common libsas mpt3sas libahci crc32c_intel ptp drm pps_core raid_class libata mlx_compat(OE) dca scsi_transport_sas drm_panel_orientation_quirks i2c_algo_bit [10119.465397] CPU: 13 PID: 13069 Comm: mdt_out01_011 Kdump: loaded Tainted: G OE ------------ 3.10.0-957.10.1.el7_lustre.x86_64 #1 [10119.479569] Hardware name: Intel Corporation S2600GZ ........../S2600GZ, BIOS SE5C600.86B.01.08.0003.022620131521 02/26/2013 [10119.492080] task: ffff8da8e85fa080 ti: ffff8da8ccfb0000 task.ti: ffff8da8ccfb0000 [10119.500439] RIP: 0010:[<0000000000000000>] [< (null)>] (null) [10119.508809] RSP: 0018:ffff8da8ccfb3b60 EFLAGS: 00010246 [10119.514738] RAX: 0000000000000000 RBX: ffff8da8d64fa580 RCX: 0000000000000002 [10119.522705] RDX: ffffffffc11e1a50 RSI: ffff8da8ccfb3b68 RDI: ffff8da8d64fa588 [10119.530669] RBP: ffff8da8ccfb3ba0 R08: 0000000000000004 R09: 0000000000000000 [10119.538639] R10: 0000000000000001 R11: 00000000007fffff R12: ffff8da85672a300 [10119.546601] R13: ffff8da86e8e5d08 R14: ffff8da4b1756200 R15: ffff8da8d64fa588 [10119.554566] FS: 0000000000000000(0000) GS:ffff8da8ed940000(0000) knlGS:0000000000000000 [10119.563610] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [10119.570023] CR2: 0000000000000000 CR3: 000000004c210000 CR4: 00000000000607e0 [10119.578001] Call Trace: [10119.580767] [<ffffffffc11dfe4e>] ? osd_ldiskfs_it_fill+0xbe/0x260 [osd_ldiskfs] [10119.589030] [<ffffffffc11e0027>] osd_it_ea_load+0x37/0x100 [osd_ldiskfs] [10119.596673] [<ffffffffc0c987d8>] dt_index_walk+0xf8/0x430 [obdclass] [10119.603889] [<ffffffffc0c98b10>] ? dt_index_walk+0x430/0x430 [obdclass] [10119.611386] [<ffffffffc0c99404>] dt_index_read+0x394/0x6a0 [obdclass] [10119.618775] [<ffffffffc0f32162>] tgt_obd_idx_read+0x612/0x860 [ptlrpc] [10119.626211] [<ffffffffc0f351da>] tgt_request_handle+0xaea/0x1580 [ptlrpc] [10119.633936] [<ffffffffc0f10c11>] ? ptlrpc_nrs_req_get_nolock0+0xd1/0x170 [ptlrpc] [10119.642404] [<ffffffffc0b2abde>] ? ktime_get_real_seconds+0xe/0x10 [libcfs] [10119.650319] [<ffffffffc0eda80b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc] [10119.658910] [<ffffffffc0ed7695>] ? ptlrpc_wait_event+0xa5/0x360 [ptlrpc] [10119.666489] [<ffffffff8dcd6802>] ? default_wake_function+0x12/0x20 [10119.673492] [<ffffffff8dccbadb>] ? __wake_up_common+0x5b/0x90 [10119.680047] [<ffffffffc0ede13c>] ptlrpc_main+0xafc/0x1fc0 [ptlrpc] [10119.687087] [<ffffffffc0edd640>] ? ptlrpc_register_service+0xf80/0xf80 [ptlrpc] [10119.695351] [<ffffffff8dcc1c71>] kthread+0xd1/0xe0 [10119.700795] [<ffffffff8dcc1ba0>] ? insert_kthread_work+0x40/0x40 [10119.707625] [<ffffffff8e375c37>] ret_from_fork_nospec_begin+0x21/0x21 [10119.714910] [<ffffffff8dcc1ba0>] ? insert_kthread_work+0x40/0x40 [10119.721711] Code: Bad RIP value. [10119.725449] RIP [< (null)>] (null) [10119.731192] RSP <ffff8da8ccfb3b60> [10119.735084] CR2: 0000000000000000 [ 0.000000] Initializing cgroup subsys cpuset [ 0.000000] Initializing cgroup subsys cpu [ 0.000000] Initializing cgroup subsys cpuacct |
| Comment by Sarah Liu [ 13/May/19 ] |
|
Hit this again on 2.12.2-rc1 testing |
| Comment by Yang Sheng [ 14/May/19 ] |
|
Hi, Sarah, Do you have crash dump for further analysis? Thanks, |
| Comment by Sarah Liu [ 15/May/19 ] |
|
Hi Yang Sheng, I uploaded the vmcore to my home dir on trevis, can you please see if you can access it. Thank you /home/w3liu/test_logs/lu-11457 |
| Comment by Yang Sheng [ 16/May/19 ] |
|
Hi, Sarah, Which build do you test? I use https://build.whamcloud.com/job/lustre-b2_12/arch=x86_64,build_type=server,distro=el7.6,ib_stack=inkernel/15/artifact/artifacts/RPMS/x86_64/ cannot decode the vmcore. Thanks, |
| Comment by Sarah Liu [ 16/May/19 ] |
|
I uploaded the vmcore again, please try if it works. Or if you can access spirit, the dumps are located under I was using the mlnx bulid |
| Comment by Yang Sheng [ 21/May/19 ] |
|
From vmcore, looks like some thing really wrong: .....
crash> inode ffff8da86e8e5d08 -x
struct inode {
i_mode = 0x41ed,
i_opflags = 0x0,
i_uid = {
val = 0x21b
},
i_gid = {
val = 0x21b
},
i_flags = 0x80,
i_acl = 0x0,
i_default_acl = 0xffffffffffffffff,
i_op = 0xffffffffc1178d00 <ldiskfs_file_inode_operations>,
i_sb = 0xffff8da5797df800,
i_mapping = 0xffff8da86e8e5e58,
i_security = 0x0,
i_ino = 0x1c78042b,
{
i_nlink = 0x1,
__i_nlink = 0x1
},
i_rdev = 0x0,
i_size = 0x0,
i_atime = {
tv_sec = 0x5cd8196d,
tv_nsec = 0x0
},
i_mtime = {
tv_sec = 0x5cd8196d,
tv_nsec = 0x0
},
i_ctime = {
tv_sec = 0x5cd8196d,
tv_nsec = 0x0
},
i_lock = {
{
rlock = {
raw_lock = {
val = {
counter = 0x0
}
}
}
}
},
i_bytes = 0x0,
i_blkbits = 0xc,
i_blocks = 0x0,
i_state = 0x0,
i_mutex = {
count = {
counter = 0x1
},
wait_lock = {
{
rlock = {
raw_lock = {
val = {
counter = 0x0
}
}
}
}
},
wait_list = {
next = 0xffff8da86e8e5db8,
prev = 0xffff8da86e8e5db8
},
owner = 0x0,
{
osq = {
tail = {
counter = 0x0
}
},
__UNIQUE_ID_rh_kabi_hide3 = {
spin_mlock = 0x0
},
{<No data fields>}
}
},
dirtied_when = 0x0,
i_hash = {
next = 0x0,
pprev = 0xffffab0282849760
},
i_io_list = {
next = 0xffff8da86e8e5df0,
prev = 0xffff8da86e8e5df0
},
i_lru = {
next = 0xffff8da86e8e5e00,
prev = 0xffff8da86e8e5e00
},
i_sb_list = {
next = 0xffff8da8555f27a0,
prev = 0xffff8da86ea655b0
},
{
i_dentry = {
first = 0x0
},
i_rcu = {
next = 0x0,
func = 0x0
}
},
i_version = 0x1,
i_count = {
counter = 0x1
},
i_dio_count = {
counter = 0x0
},
i_writecount = {
counter = 0x0
},
i_fop = 0xffffffffc1178dc0 <ldiskfs_file_operations>,
i_flock = 0x0,
i_data = {
host = 0xffff8da86e8e5d08,
page_tree = {
rh_reserved_height = 0x0,
gfp_mask = 0x20,
rnode = 0x0
},
tree_lock = {
{
rlock = {
raw_lock = {
val = {
counter = 0x0
}
}
}
}
},
{
i_mmap_writable = {
counter = 0x0
},
__UNIQUE_ID_rh_kabi_hide17 = {
i_mmap_writable = 0x0
},
{<No data fields>}
},
i_mmap = {
rb_node = 0x0
},
i_mmap_nonlinear = {
next = 0xffff8da86e8e5e80,
prev = 0xffff8da86e8e5e80
},
i_mmap_mutex = {
count = {
counter = 0x1
},
wait_lock = {
{
rlock = {
raw_lock = {
val = {
counter = 0x0
}
}
}
}
},
wait_list = {
next = 0xffff8da86e8e5e98,
prev = 0xffff8da86e8e5e98
},
owner = 0x0,
{
osq = {
tail = {
counter = 0x0
}
},
__UNIQUE_ID_rh_kabi_hide3 = {
spin_mlock = 0x0
},
{<No data fields>}
}
},
nrpages = 0x0,
nrexceptional = 0x0,
writeback_index = 0x0,
a_ops = 0xffffffffc1178b20 <ldiskfs_journalled_aops>,
flags = 0x200da,
backing_dev_info = 0xffff8da8e3673b20,
private_lock = {
{
rlock = {
raw_lock = {
val = {
counter = 0x0
}
}
}
}
},
private_list = {
next = 0xffff8da86e8e5ef0,
prev = 0xffff8da86e8e5ef0
},
private_data = 0x0
},
i_dquot = {0xffff8da8e5124100, 0xffff8da8e5124200, 0x0},
i_devices = {
next = 0xffff8da86e8e5f20,
prev = 0xffff8da86e8e5f20
},
{
i_pipe = 0x0,
i_bdev = 0x0,
i_cdev = 0x0
},
i_generation = 0x3f58114f,
i_fsnotify_mask = 0x0,
{
i_fsnotify_marks = 0x0,
__UNIQUE_ID_rh_kabi_hide18 = {
i_fsnotify_marks = {
first = 0x0
}
},
{<No data fields>}
},
i_readcount = {
counter = 0x0
},
i_private = 0x0
}
......
The i_mode indicated this inode is a directory(0x4000). But the i_op & i_fop set to REG operations. Looks like some corner case causes such situation? Thanks, |
| Comment by Sarah Liu [ 27/Jan/20 ] |
|
Hit this issue again on lustre-b2_12-next #57 [16422.066032] Lustre: 79110:0:(service.c:1322:ptlrpc_at_send_early_reply()) Skipped 425 previous similar messages [16469.920777] Lustre: 79107:0:(service.c:2011:ptlrpc_server_handle_req_in()) @@@ Slow req_in handling 225s req@ffff89052d9cf980 x1656840860166528/t0(0) o103->s oaked-MDT0002-mdtlov_UUID@192.168.1.110@o2ib:0/0 lens 328/0 e 0 to 0 dl 0 ref 1 fl New:/2/ffffffff rc 0/-1 [16469.946879] Lustre: 79107:0:(service.c:2011:ptlrpc_server_handle_req_in()) Skipped 370 previous similar messages [16511.443090] Lustre: ldlm_canceld: This server is not able to keep up with request traffic (cpu-bound). [16511.453496] Lustre: 79109:0:(service.c:1541:ptlrpc_at_check_timed()) earlyQ=3796 reqQ=17207 recA=0, svcEst=600, delay=85403 [16511.466032] Lustre: 79109:0:(service.c:1322:ptlrpc_at_send_early_reply()) @@@ Already past deadline (-2s), not sending early reply. Consider increasing at_ear ly_margin (5)? req@ffff890983671200 x1656840826067648/t0(0) o103->soaked-MDT0000-mdtlov_UUID@192.168.1.108@o2ib:154/0 lens 328/0 e 0 to 0 dl 1580100394 ref 2 fl New:/0/ffffffff rc 0/-1 [16511.499803] Lustre: 79109:0:(service.c:1322:ptlrpc_at_send_early_reply()) Skipped 109 previous similar messages [16589.022832] Lustre: soaked-MDT0003: Client 58fa13bb-1d12-d81e-9977-655c7b0a359d (at 192.168.1.117@o2ib) reconnecting [16665.657814] Lustre: soaked-MDT0003: Client 2d96e5f6-af44-1f55-5648-3c5026213dae (at 192.168.1.130@o2ib) reconnecting [16751.557870] Lustre: soaked-MDT0003: Client d23f5b34-b590-b62f-81cf-55bba7beeb62 (at 192.168.1.142@o2ib) reconnecting [16756.583622] Lustre: soaked-MDT0003: Client bd3af862-653c-bd66-2072-7b69d8812dd7 (at 192.168.1.140@o2ib) reconnecting [16756.595384] Lustre: Skipped 1 previous similar message [16781.901025] Lustre: soaked-MDT0003: Client 55eecb6a-6cc1-b7f2-e17b-d2ce51f4ff57 (at 192.168.1.141@o2ib) reconnecting [16785.356557] perf: interrupt took too long (3961 > 3942), lowering kernel.perf_event_max_sample_rate to 50000 [16840.006883] Lustre: 78276:0:(client.c:2133:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1580100124/real 1580100124] req@ ffff890592e0d580 x1656840776727360/t0(0) o103->soaked-MDT0000-osp-MDT0003@192.168.1.108@o2ib:17/18 lens 328/224 e 7 to 1 dl 1580100725 ref 1 fl Rpc:X/2/ffffffff rc -11/-1 [16840.012891] Lustre: soaked-MDT0000-osp-MDT0003: Connection to soaked-MDT0000 (at 192.168.1.108@o2ib) was lost; in progress operations using this service will wait for recovery to complete [16840.057838] Lustre: 78276:0:(client.c:2133:ptlrpc_expire_one_request()) Skipped 1124 previous similar messages [16840.104230] Lustre: soaked-MDT0000-osp-MDT0003: Connection restored to 192.168.1.108@o2ib (at 192.168.1.108@o2ib) [16840.115694] Lustre: Skipped 8 previous similar messages [16844.361410] Lustre: soaked-MDT0003: Received new LWP connection from 192.168.1.110@o2ib, removing former export from same NID [16844.374044] Lustre: Skipped 1 previous similar message [16879.242891] Lustre: 79107:0:(service.c:1372:ptlrpc_at_send_early_reply()) @@@ Couldn't add any time (5/5), not sending early reply [16879.242891] req@ffff890969d8e300 x1656840800674880/t0(0) o103->soaked-MDT0000-mdtlov_UUID@192.168.1.108@o2ib:529/0 lens 328/0 e 0 to 0 dl 1580100769 ref 2 f l New:/2/ffffffff rc 0/-1 [16879.274271] Lustre: 79107:0:(service.c:1372:ptlrpc_at_send_early_reply()) Skipped 34712 previous similar messages [16884.947622] LustreError: 79136:0:(osd_oi.c:762:osd_oi_insert()) dm-5: the FID [0x2c00543c1:0xe1d:0x0] is used by two objects: 212861283/3086025257 233832815/3 086025258 [16884.964548] LNet: 78257:0:(o2iblnd_cb.c:413:kiblnd_handle_rx()) PUT_NACK from 192.168.1.108@o2ib [16887.446546] BUG: unable to handle kernel NULL pointer dereference at (null) [16887.455308] IP: [< (null)>] (null) [16887.460965] PGD 0 [16887.463234] Oops: 0010 [#1] SMP [16887.466862] Modules linked in: osp(OE) mdd(OE) lod(OE) mdt(OE) lfsck(OE) mgc(OE) osd_ldiskfs(OE) lquota(OE) ldiskfs(OE) lustre(OE) lmv(OE) mdc(OE) osc(OE) lov (OE) fid(OE) fld(OE) ko2iblnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) libcfs(OE) rpcsec_gss_krb5 nfsv4 dns_resolver nfs lockd grace fscache rdma_ucm(OE) ib_ucm(OE) rdma_cm(OE) iw_cm(OE) ib_ipoib(OE) ib_cm(OE) ib_umad(OE) mlx5_ib(OE) mlx5_core(OE) mlxfw(OE) mlx4_en(OE) mlx4_ib(OE) ib_uverbs(OE) ib_core(OE) sb_edac intel_powe rclamp coretemp intel_rapl iosf_mbi kvm iTCO_wdt irqbypass iTCO_vendor_support crc32_pclmul ghash_clmulni_intel aesni_intel lrw ipmi_ssif gf128mul glue_helper me i_me dm_round_robin ablk_helper i2c_i801 mei ipmi_si lpc_ich cryptd ioatdma joydev pcspkr sg ipmi_devintf ipmi_msghandler wmi pcc_cpufreq dm_multipath dm_mod aut h_rpcgss sunrpc ip_tables ext4 mbcache jbd2 sd_mod crc_t10dif crct10dif_generic mgag200 drm_kms_helper syscopyarea sysfillrect mlx4_core(OE) igb sysimgblt fb_sys _fops isci mpt3sas ttm ahci libsas libahci ptp crct10dif_pclmul drm devlink pps_core raid_class crct10dif_common dca libata crc32c_intel mlx_compat(OE) scsi_tran sport_sas drm_panel_orientation_quirks i2c_algo_bit [16887.582758] CPU: 8 PID: 83890 Comm: mdt_out01_011 Kdump: loaded Tainted: G OE ------------ 3.10.0-1062.9.1.el7_lustre.x86_64 #1 [16887.596833] Hardware name: Intel Corporation S2600GZ ........../S2600GZ, BIOS SE5C600.86B.01.08.0003.022620131521 02/26/2013 [16887.609356] task: ffff8905a85d41c0 ti: ffff8905765b0000 task.ti: ffff8905765b0000 [16887.617710] RIP: 0010:[<0000000000000000>] [< (null)>] (null) [16887.626075] RSP: 0018:ffff8905765b3b68 EFLAGS: 00010246 [16887.631996] RAX: 0000000000000000 RBX: ffff89097aff8000 RCX: 0000000000000002 [16887.639962] RDX: ffffffffc1564ab0 RSI: ffff8905765b3b70 RDI: ffff89097aff8008 [16887.647926] RBP: ffff8905765b3ba8 R08: 0000000000000000 R09: 0000000000000001 [16887.655891] R10: 0000000200000008 R11: 00000000007fffff R12: ffff8909a93ded00 [16887.663863] R13: ffff89096ddb15d8 R14: ffff890981779a00 R15: ffff89097aff8008 [16887.671829] FS: 0000000000000000(0000) GS:ffff8909ae200000(0000) knlGS:0000000000000000 [16887.680870] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [16887.687285] CR2: 0000000000000000 CR3: 0000000604210000 CR4: 00000000000607e0 [16887.695251] Call Trace: [16887.698006] [<ffffffffc1562eae>] ? osd_ldiskfs_it_fill+0xbe/0x260 [osd_ldiskfs] [16887.706274] [<ffffffffc1563087>] osd_it_ea_load+0x37/0x100 [osd_ldiskfs] [16887.713907] [<ffffffffc0dba208>] dt_index_walk+0xf8/0x430 [obdclass] [16887.721135] [<ffffffffc0dba540>] ? dt_index_walk+0x430/0x430 [obdclass] [16887.728649] [<ffffffffc0dbae34>] dt_index_read+0x394/0x6a0 [obdclass] [16887.736047] [<ffffffffc10f4e42>] tgt_obd_idx_read+0x612/0x860 [ptlrpc] [16887.743503] [<ffffffffc10f99ca>] tgt_request_handle+0xada/0x1570 [ptlrpc] [16887.751242] [<ffffffffc10d3101>] ? ptlrpc_nrs_req_get_nolock0+0xd1/0x170 [ptlrpc] [16887.759709] [<ffffffffc0c6ebde>] ? ktime_get_real_seconds+0xe/0x10 [libcfs] [16887.767639] [<ffffffffc109e47b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc] [16887.776256] [<ffffffffc109b295>] ? ptlrpc_wait_event+0xa5/0x360 [ptlrpc] [16887.783833] [<ffffffff892d3903>] ? __wake_up+0x13/0x20 [16887.789723] [<ffffffffc10a1de4>] ptlrpc_main+0xb34/0x1470 [ptlrpc] [16887.796776] [<ffffffffc10a12b0>] ? ptlrpc_register_service+0xf80/0xf80 [ptlrpc] [16887.805038] [<ffffffff892c61f1>] kthread+0xd1/0xe0 [16887.810487] [<ffffffff892c6120>] ? insert_kthread_work+0x40/0x40 [16887.817293] [<ffffffff8998dd37>] ret_from_fork_nospec_begin+0x21/0x21 [16887.824581] [<ffffffff892c6120>] ? insert_kthread_work+0x40/0x40 [16887.831383] Code: Bad RIP value. [16887.835102] RIP [< (null)>] (null) [16887.835102] RIP [< (null)>] (null) [16887.840852] RSP <ffff8905765b3b68> [16887.844734] CR2: 0000000000000000 [ 0.000000] Initializing cgroup subsys cpuset [ 0.000000] Initializing cgroup subsys cpu [ 0.000000] Initializing cgroup subsys cpuacct [ 0.000000] Linux version 3.10.0-1062.9.1.el7_lustre.x86_64 (jenkins@trevis-310-el7-x8664-6.trevis.whamcloud.com) (gcc version 4.8.5 20150623 (Red Hat 4.8.5-39) (GCC) ) #1 SMP Thu Jan 23 14:10:41 UTC 2020 [ 0.000000] Command line: BOOT_IMAGE=/boot/vmlinuz-3.10.0-1062.9.1.el7_lustre.x86_64 ro console=ttyS0,115200 irqpoll nr_cpus=1 reset_devices cgroup_disable=memory mce=off numa=off udev.children-max=2 panic=10 rootflags=nofail acpi_no_memhotplug transparent_hugepage=never nokaslr novmcoredd disable_cpu_apicid=0 elfcorehdr=869812K [ 0.000000] e820: BIOS-provided physical RAM map: [ 0.000000] BIOS-e820: [mem 0x0000000000000000-0x0000000000000fff] reserved [ 0.000000] BIOS-e820: [mem 0x0000000000001000-0x000000000008ffff] usable [ 0.000000] BIOS-e820: [mem 0x0000000000090000-0x000000000009ffff] reserved core dump can be found under /scratch/dumps/soak-11.spirit.whamcloud.com/10.10.1.111-2020-01-27-04\:53\:21/ |
| Comment by Sarah Liu [ 05/Jun/20 ] |
|
soak hit the problem on 2.12.5 testing after running for 4 days. [18137.800259] Pid: 5007, comm: mdt01_017 3.10.0-1062.18.1.el7_lustre.x86_64 #1 SMP Wed May 27 23:31:53 UTC 2020 [18137.811351] Call Trace: [18137.814107] [<ffffffffc107f8fc>] top_trans_wait_result+0xa9/0x158 [ptlrpc] [18137.822008] [<ffffffffc105fb79>] top_trans_stop+0x4e9/0xa70 [ptlrpc] [18137.829352] [<ffffffffc1522dec>] lod_trans_stop+0x25c/0x340 [lod] [18137.836307] [<ffffffffc15d56ee>] mdd_trans_stop+0x2e/0x174 [mdd] [18137.843150] [<ffffffffc15bdc73>] mdd_unlink+0x3f3/0xaa0 [mdd] [18137.849700] [<ffffffffc148139d>] mdo_unlink+0x46/0x48 [mdt] [18137.856087] [<ffffffffc14452bd>] mdt_reint_unlink+0xbed/0x14b0 [mdt] [18137.863361] [<ffffffffc1449963>] mdt_reint_rec+0x83/0x210 [mdt] [18137.870109] [<ffffffffc1426273>] mdt_reint_internal+0x6e3/0xaf0 [mdt] [18137.877464] [<ffffffffc14316e7>] mdt_reint+0x67/0x140 [mdt] [18137.883828] [<ffffffffc104d9da>] tgt_request_handle+0xada/0x1570 [ptlrpc] [18137.891641] [<ffffffffc0ff248b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc] [18137.900298] [<ffffffffc0ff5df4>] ptlrpc_main+0xb34/0x1470 [ptlrpc] [18137.907411] [<ffffffff9f2c6321>] kthread+0xd1/0xe0 [18137.912887] [<ffffffff9f98ed37>] ret_from_fork_nospec_end+0x0/0x39 [18137.919936] [<ffffffffffffffff>] 0xffffffffffffffff [18137.925528] LNet: Service thread pid 4804 was inactive for 202.30s. Watchdog stack traces are limited to 3 per 300 seconds, skipping this one. [18152.716405] LNet: 4610:0:(o2iblnd_cb.c:3397:kiblnd_check_conns()) Timed out tx for 192.168.1.111@o2ib: 0 seconds [18152.727780] LNet: 4610:0:(o2iblnd_cb.c:3397:kiblnd_check_conns()) Skipped 111 previous similar messages [18210.718946] Lustre: 4675:0:(client.c:2133:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1591370247/real 1591370247] req@ffff8b87f443ec00 x1668654025156672/t0(0) o105->soaked-MDT0001@192.168.1.111@o2ib:15/16 lens 304/224 e 0 to 1 dl 1591370254 ref 2 fl Rpc:X/2/ffffffff rc 0/-1 [18210.749830] Lustre: 4675:0:(client.c:2133:ptlrpc_expire_one_request()) Skipped 110 previous similar messages [18232.719942] LNet: 4610:0:(o2iblnd_cb.c:3397:kiblnd_check_conns()) Timed out tx for 192.168.1.111@o2ib: 0 seconds [18232.731320] LNet: 4610:0:(o2iblnd_cb.c:3397:kiblnd_check_conns()) Skipped 91 previous similar messages [18235.829192] Lustre: soaked-MDT0001: Received new LWP connection from 192.168.1.108@o2ib, removing former export from same NID [18235.841908] Lustre: soaked-MDT0001: Connection restored to soaked-MDT0000-mdtlov_UUID (at 192.168.1.108@o2ib) [18235.853001] Lustre: Skipped 1 previous similar message [18236.004566] LustreError: 4796:0:(osd_oi.c:762:osd_oi_insert()) dm-1: the FID [0x2400134c5:0x8001:0x0] is used by two objects: 477626398/123258462 477626399/123258463 [18236.053964] LNet: 4615:0:(o2iblnd_cb.c:413:kiblnd_handle_rx()) PUT_NACK from 192.168.1.108@o2ib [18236.794778] BUG: unable to handle kernel NULL pointer dereference at (null) [18236.803570] IP: [< (null)>] (null) [18236.809248] PGD 0 [18236.811519] Oops: 0010 [#1] SMP [18236.815153] Modules linked in: osp(OE) mdd(OE) lod(OE) mdt(OE) lfsck(OE) mgc(OE) osd_ldiskfs(OE) ldiskfs(OE) lquota(OE) fid(OE) fld(OE) ko2iblnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) libcfs(OE) rpcsec_gss_krb5 nfsv4 dns_resolver nfs lockd grace fscache rdma_ucm(OE) ib_ucm(OE) rdma_cm(OE) iw_cm(OE) ib_ipoib(OE) ib_cm(OE) ib_umad(OE) mlx5_ib(OE) ib_uverbs(OE) mlx5_core(OE) mlxfw(OE) mlx4_en(OE) sb_edac intel_powerclamp coretemp intel_rapl iosf_mbi kvm irqbypass crc32_pclmul ghash_clmulni_intel iTCO_wdt iTCO_vendor_support aesni_intel lrw gf128mul glue_helper ipmi_ssif ablk_helper dm_round_robin mei_me cryptd mei sg ioatdma ipmi_si i2c_i801 lpc_ich ipmi_devintf joydev ipmi_msghandler pcspkr pcc_cpufreq wmi dm_multipath dm_mod auth_rpcgss sunrpc ip_tables ext4 mbcache jbd2 mlx4_ib(OE) ib_core(OE) sd_mod crc_t10dif crct10dif_generic mgag200 drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops isci ttm igb mlx4_core(OE) ahci libsas libahci ptp mpt2sas drm devlink pps_core crct10dif_pclmul raid_class crct10dif_common dca libata crc32c_intel mlx_compat(OE) scsi_transport_sas drm_panel_orientation_quirks i2c_algo_bit [18236.927648] CPU: 31 PID: 4694 Comm: mdt_rdpg01_001 Kdump: loaded Tainted: G OE ------------ 3.10.0-1062.18.1.el7_lustre.x86_64 #1 [18236.941923] Hardware name: Intel Corporation S2600GZ ........../S2600GZ, BIOS SE5C600.86B.01.08.0003.022620131521 02/26/2013 [18236.954462] task: ffff8b8807dae2a0 ti: ffff8b88072f0000 task.ti: ffff8b88072f0000 [18236.962824] RIP: 0010:[<0000000000000000>] [< (null)>] (null) [18236.971207] RSP: 0018:ffff8b88072f3b68 EFLAGS: 00010246 [18236.977142] RAX: 0000000000000000 RBX: ffff8b87ff710000 RCX: 0000000000000002 [18236.985109] RDX: ffffffffc12e2ab0 RSI: ffff8b88072f3b70 RDI: ffff8b87ff710008 [18236.993080] RBP: ffff8b88072f3ba8 R08: 0000000000000000 R09: 0000000000000001 [18237.001053] R10: ffff8b842dbb5800 R11: 00000000007fffff R12: ffff8b87fef0f000 [18237.009011] R13: ffff8b87eb34f628 R14: ffff8b88283e6800 R15: ffff8b87ff710008 [18237.016993] FS: 0000000000000000(0000) GS:ffff8b882e5c0000(0000) knlGS:0000000000000000 [18237.026029] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [18237.032449] CR2: 0000000000000000 CR3: 00000006c0a10000 CR4: 00000000000607e0 [18237.040420] Call Trace: [18237.043201] [<ffffffffc12e0eae>] ? osd_ldiskfs_it_fill+0xbe/0x260 [osd_ldiskfs] [18237.051478] [<ffffffffc12e1087>] osd_it_ea_load+0x37/0x100 [osd_ldiskfs] [18237.059104] [<ffffffffc1538e07>] lod_it_load+0x27/0x90 [lod] [18237.065587] [<ffffffffc0d0e258>] dt_index_walk+0xf8/0x430 [obdclass] [18237.072807] [<ffffffffc15c0c60>] ? mdd_object_lock+0xe0/0xe0 [mdd] [18237.079818] [<ffffffffc15c25bf>] mdd_readpage+0x25f/0x5a0 [mdd] [18237.086558] [<ffffffffc141f88a>] mdt_readpage+0x69a/0x8e0 [mdt] [18237.093384] [<ffffffffc104d9da>] tgt_request_handle+0xada/0x1570 [ptlrpc] [18237.101137] [<ffffffffc1027121>] ? ptlrpc_nrs_req_get_nolock0+0xd1/0x170 [ptlrpc] [18237.109611] [<ffffffffc0bc0bde>] ? ktime_get_real_seconds+0xe/0x10 [libcfs] [18237.117558] [<ffffffffc0ff248b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc] [18237.126192] [<ffffffffc0fef2a5>] ? ptlrpc_wait_event+0xa5/0x360 [ptlrpc] [18237.133784] [<ffffffff9f2d3a33>] ? __wake_up+0x13/0x20 [18237.139689] [<ffffffffc0ff5df4>] ptlrpc_main+0xb34/0x1470 [ptlrpc] [18237.146704] [<ffffffff9f980e02>] ? __schedule+0x402/0x840 [18237.152899] [<ffffffffc0ff52c0>] ? ptlrpc_register_service+0xf80/0xf80 [ptlrpc] [18237.161175] [<ffffffff9f2c6321>] kthread+0xd1/0xe0 [18237.166629] [<ffffffff9f2c6250>] ? insert_kthread_work+0x40/0x40 [18237.173457] [<ffffffff9f98ed37>] ret_from_fork_nospec_begin+0x21/0x21 [18237.180748] [<ffffffff9f2c6250>] ? insert_kthread_work+0x40/0x40 [18237.187548] Code: Bad RIP value. [18237.191284] RIP [< (null)>] (null) [18237.197031] RSP <ffff8b88072f3b68> [18237.200923] CR2: 0000000000000000 [ 0.000000] Initializing cgroup subsys cpuset [ 0.000000] Initializing cgroup subsys cpu [ 0.000000] Initializing cgroup subsys cpuacct [ 0.000000] Linux version 3.10.0-1062.18.1.el7_lustre.x86_64 (jenkins@trevis-307-el7-x8664-1.trevis.whamcloud.com) (gcc version 4.8.5 20150623 (Red Hat 4.8.5-39) (GCC) ) #1 SMP Wed May 27 23:31:53 UTC 2020 [ 0.000000] Command line: BOOT_IMAGE=/boot/vmlinuz-3.10.0-1062.18.1.el7_lustre.x86_64 ro console=ttyS0,115200 irqpoll nr_cpus=1 reset_devices cgroup_disable=memory mce=off numa=off udev.children-max=2 panic=10 rootflags=nofail acpi_no_memhotplug transparent_hugepage=never nokaslr novmcoredd disable_cpu_apicid=0 elfcorehdr=869816K [ 0.000000] e820: BIOS-provided physical RAM map: [ 0.000000] BIOS-e820: [mem 0x0000000000000000-0x0000000000000fff] reserved [ 0.000000] BIOS-e820: [mem 0x0000000000001000-0x000000000008efff] usable [ 0.000000] BIOS-e820: [mem 0x000000000008f000-0x000000000009ffff] reserved [ 0.000000] BIOS-e820: [mem 0x00000000000e0000-0x00000000000fffff] reserved [ 0.000000] BIOS-e820: [mem 0x000000002b000000-0x000000003516dfff] usable [ 0.000000] BIOS-e820: [mem 0x00000000bb3c7000-0x00000000bdd2efff] reserved |
| Comment by Li Xi [ 21/Dec/21 ] |
I am not sure whether this related, but the osd_add_dot_dotdot_internal() seems save the wrong child fid to "..". |
| Comment by Andreas Dilger [ 23/Sep/22 ] |
|
This was hit on a customer OST. It looks like this can happen if a block in the inode table overwrites another in-use block in the inode table, and then there are two inodes that have the same content (lma FID, blocks), but different O/<seq>/dX/<oid> pathnames. Alternately, it is possible that the original object was cleared by e2fsck (maybe due to duplicate block processing) and a new empty object was created in its place with the same FID (it looks like both happened). OI Scrub was modified to do a full O/ tree walk in
|
| Comment by Gerrit Updater [ 07/Jul/23 ] |
|
"Lai Siyao <lai.siyao@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/51601 |
| Comment by Gerrit Updater [ 31/Aug/23 ] |
|
"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/51601/ |
| Comment by Peter Jones [ 31/Aug/23 ] |
|
Landed for 2.16 |