[LU-9274] LBUG: (recover.c:157:ptlrpc_replay_next()) ASSERTION( !list_empty(&req->rq_cli.cr_unreplied_list) ) failed: Created: 30/Mar/17 Updated: 03/Sep/18 Resolved: 18/Jul/17

| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.10.0 |
| Fix Version/s: | Lustre 2.10.0, Lustre 2.11.0 |
| Type: | Bug | Priority: | Major |
| Reporter: | Cliff White (Inactive) | Assignee: | Niu Yawei (Inactive) |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | soak |
| Environment: | Soak cluster version=lustre: 2.9.52_73_gb5c4f03 |
| Attachments: |
| Issue Links: |
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
Soak is running, performing successful OSS failover, partitions are being recovered:

2017-03-30 00:47:10,761:fsmgmt.fsmgmt:INFO Next recovery check in 15s...
2017-03-30 00:47:32,586:fsmgmt.fsmgmt:DEBUG Recovery Result Record: {'soak-4': {'soaked-OST000f': 'RECOVERING', 'soaked-OST000e': 'INACTIVE', 'soaked-OST0008': 'INACTIVE', 'soaked-OST0009': 'RECOVERING', 'soaked-OST0002': 'INACTIVE', 'soaked-OST0003': 'RECOVERING', 'soaked-OST0015': 'COMPLETE', 'soaked-OST0014': 'INACTIVE'}}

A single client hit an LBUG after recovering some partitions:

Mar 30 00:47:33 soak-36 kernel: Lustre: soaked-OST0009-osc-ffff88085b72c000: Connection restored to 192.168.1.104@o2ib10 (at 192.168.1.104@o2ib10)
Mar 30 00:48:31 soak-36 kernel: LustreError: 4753:0:(recover.c:157:ptlrpc_replay_next()) ASSERTION( !list_empty(&req->rq_cli.cr_unreplied_list) ) failed:
Mar 30 00:48:31 soak-36 kernel: LustreError: 4753:0:(recover.c:157:ptlrpc_replay_next()) LBUG
Mar 30 00:48:31 soak-36 kernel: Pid: 4753, comm: ptlrpcd_rcv
Mar 30 00:48:31 soak-36 kernel: Call Trace:
Mar 30 00:48:31 soak-36 kernel: [<ffffffffa092c7f3>] libcfs_debug_dumpstack+0x53/0x80 [libcfs]
Mar 30 00:48:31 soak-36 kernel: [<ffffffffa092c861>] lbug_with_loc+0x41/0xb0 [libcfs]
Mar 30 00:48:31 soak-36 kernel: [<ffffffffa0d31c87>] ptlrpc_replay_next+0x447/0x450 [ptlrpc]
Mar 30 00:48:31 soak-36 kernel: [<ffffffffa0d55682>] ptlrpc_import_recovery_state_machine+0x1d2/0xbc0 [ptlrpc]
Mar 30 00:48:31 soak-36 kernel: [<ffffffffa0d2a2ff>] ptlrpc_replay_interpret+0x17f/0x7d0 [ptlrpc]
Mar 30 00:48:31 soak-36 kernel: [<ffffffffa0d2e0b5>] ptlrpc_check_set.part.23+0x425/0x1dd0 [ptlrpc]
Mar 30 00:48:31 soak-36 kernel: [<ffffffffa0d2fabb>] ptlrpc_check_set+0x5b/0xe0 [ptlrpc]
Mar 30 00:48:31 soak-36 kernel: [<ffffffffa0d5bb8b>] ptlrpcd_check+0x4db/0x5d0 [ptlrpc]
Mar 30 00:48:31 soak-36 kernel: [<ffffffffa0d5bf3b>] ptlrpcd+0x2bb/0x560 [ptlrpc]
Mar 30 00:48:31 soak-36 kernel: [<ffffffff810c4fd0>] ? default_wake_function+0x0/0x20
Mar 30 00:48:31 soak-36 kernel: [<ffffffffa0d5bc80>] ? ptlrpcd+0x0/0x560 [ptlrpc]
Mar 30 00:48:32 soak-36 kernel: [<ffffffff810b064f>] kthread+0xcf/0xe0
Mar 30 00:48:32 soak-36 kernel: [<ffffffff810b0580>] ? kthread+0x0/0xe0
Mar 30 00:48:32 soak-36 kernel: [<ffffffff81696958>] ret_from_fork+0x58/0x90
Mar 30 00:48:32 soak-36 kernel: [<ffffffff810b0580>] ? kthread+0x0/0xe0
Mar 30 00:48:32 soak-36 kernel: Kernel panic - not syncing: LBUG

vmcore-dmesg.txt attached. Full crash dump is available on soak-36. |
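For context, the assertion encodes a replay invariant: any request still linked on the import's committed/replay list is expected to also still be linked on the import's unreplied list (req->rq_cli.cr_unreplied_list). The following is a self-contained userspace toy model of that invariant, not Lustre source; the helper names mimic the kernel list API and the field names merely echo the ones in the trace. Running it aborts on the assert, the userspace analogue of the LBUG, once a "racing" path has unlinked the unreplied entry early:

/* Toy model of the failing invariant -- NOT Lustre source. */
#include <assert.h>
#include <stdio.h>

struct list_head { struct list_head *next, *prev; };

static void init_list_head(struct list_head *h) { h->next = h->prev = h; }

static void list_add_tail(struct list_head *n, struct list_head *h)
{
        n->prev = h->prev; n->next = h;
        h->prev->next = n; h->prev = n;
}

static void list_del_init(struct list_head *n)
{
        n->prev->next = n->next; n->next->prev = n->prev;
        init_list_head(n);
}

static int list_empty(const struct list_head *h) { return h->next == h; }

struct toy_request {
        struct list_head rq_replay_list;    /* linked on imp_committed_list */
        struct list_head cr_unreplied_list; /* linked on imp_unreplied_list */
};

int main(void)
{
        struct list_head imp_committed_list, imp_unreplied_list;
        struct toy_request req;

        init_list_head(&imp_committed_list);
        init_list_head(&imp_unreplied_list);
        init_list_head(&req.rq_replay_list);
        init_list_head(&req.cr_unreplied_list);

        /* A replayable request sits on both lists at once. */
        list_add_tail(&req.rq_replay_list, &imp_committed_list);
        list_add_tail(&req.cr_unreplied_list, &imp_unreplied_list);

        /* A racing path drops the unreplied linkage early... */
        list_del_init(&req.cr_unreplied_list);

        /* ...so the replay walker's invariant check now fails, the
         * analogue of the LASSERT at recover.c:157. */
        printf("still on unreplied list: %d\n", !list_empty(&req.cr_unreplied_list));
        assert(!list_empty(&req.cr_unreplied_list));
        return 0;
}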
| Comments |
| Comment by Joseph Gmitter (Inactive) [ 30/Mar/17 ] |
|
Hi Niu, would you be able to look into this ptlrpc issue? It looks like you added this assertion. Thanks. |
| Comment by Cliff White (Inactive) [ 21/Jun/17 ] |
|
Hit this again on the latest tip of master:

Jun 21 20:12:45 soak-23 kernel: LustreError: 2921:0:(recover.c:157:ptlrpc_replay_next()) ASSERTION( !list_empty(&req->rq_cli.cr_unreplied_list) ) failed:
Jun 21 20:12:45 soak-23 kernel: LustreError: 2921:0:(recover.c:157:ptlrpc_replay_next()) LBUG
Jun 21 20:12:45 soak-23 kernel: Pid: 2921, comm: ptlrpcd_rcv
Jun 21 20:12:45 soak-23 kernel: Call Trace:
Jun 21 20:12:45 soak-23 kernel: [<ffffffffa08057ee>] libcfs_call_trace+0x4e/0x60 [libcfs]
Jun 21 20:12:45 soak-23 kernel: [<ffffffffa080587c>] lbug_with_loc+0x4c/0xb0 [libcfs]
Jun 21 20:12:45 soak-23 kernel: [<ffffffffa0b376b7>] ptlrpc_replay_next+0x447/0x450 [ptlrpc]
Jun 21 20:12:45 soak-23 kernel: [<ffffffffa0b5ac92>] ptlrpc_import_recovery_state_machine+0x1d2/0xbc0 [ptlrpc]
Jun 21 20:12:45 soak-23 kernel: [<ffffffffa0b2fdbf>] ptlrpc_replay_interpret+0x17f/0x7d0 [ptlrpc]
Jun 21 20:12:45 soak-23 kernel: [<ffffffffa0b33b1c>] ptlrpc_check_set.part.23+0x42c/0x1da0 [ptlrpc]
Jun 21 20:12:45 soak-23 kernel: [<ffffffffa0b354eb>] ptlrpc_check_set+0x5b/0xe0 [ptlrpc]
Jun 21 20:12:45 soak-23 kernel: [<ffffffffa0b6116b>] ptlrpcd_check+0x4db/0x5d0 [ptlrpc]
Jun 21 20:12:45 soak-23 kernel: [<ffffffffa0b6151b>] ptlrpcd+0x2bb/0x560 [ptlrpc]
Jun 21 20:12:45 soak-23 kernel: [<ffffffff810c5080>] ? default_wake_function+0x0/0x20
Jun 21 20:12:45 soak-23 kernel: [<ffffffffa0b61260>] ? ptlrpcd+0x0/0x560 [ptlrpc]
Jun 21 20:12:45 soak-23 kernel: [<ffffffff810b06ff>] kthread+0xcf/0xe0
Jun 21 20:12:45 soak-23 kernel: [<ffffffff810b0630>] ? kthread+0x0/0xe0
Jun 21 20:12:45 soak-23 kernel: [<ffffffff81696a58>] ret_from_fork+0x58/0x90
Jun 21 20:12:45 soak-23 kernel: [<ffffffff810b0630>] ? kthread+0x0/0xe0
Jun 21 20:12:45 soak-23 kernel: Kernel panic - not syncing: LBUG |
| Comment by Cliff White (Inactive) [ 21/Jun/17 ] |
[88503.634729] LNet: 2911:0:(o2iblnd_cb.c:3207:kiblnd_check_conns()) Timed out tx for 192.168.1.102@o2ib: 11 seconds
[88503.647523] LNet: 2911:0:(o2iblnd_cb.c:3207:kiblnd_check_conns()) Skipped 1 previous similar message
[88529.635079] LNet: 2911:0:(o2iblnd_cb.c:3207:kiblnd_check_conns()) Timed out tx for 192.168.1.102@o2ib: 12 seconds
[88567.330684] Lustre: soaked-OST0000-osc-ffff88101b206800: Connection restored to 192.168.1.102@o2ib (at 192.168.1.102@o2ib)
[88578.129087] LustreError: 2921:0:(recover.c:157:ptlrpc_replay_next()) ASSERTION( !list_empty(&req->rq_cli.cr_unreplied_list) ) failed:
[88578.146490] LustreError: 2921:0:(recover.c:157:ptlrpc_replay_next()) LBUG
[88578.156002] Pid: 2921, comm: ptlrpcd_rcv
[88578.162345] Call Trace:
[88578.170516] [<ffffffffa08057ee>] libcfs_call_trace+0x4e/0x60 [libcfs]
[88578.179661] [<ffffffffa080587c>] lbug_with_loc+0x4c/0xb0 [libcfs]
[88578.188506] [<ffffffffa0b376b7>] ptlrpc_replay_next+0x447/0x450 [ptlrpc]
[88578.197244] [<ffffffffa0b5ac92>] ptlrpc_import_recovery_state_machine+0x1d2/0xbc0 [ptlrpc]
[88578.207713] [<ffffffffa0b2fdbf>] ptlrpc_replay_interpret+0x17f/0x7d0 [ptlrpc]
[88578.216902] [<ffffffffa0b33b1c>] ptlrpc_check_set.part.23+0x42c/0x1da0 [ptlrpc]
[88578.226271] [<ffffffffa0b354eb>] ptlrpc_check_set+0x5b/0xe0 [ptlrpc]
[88578.234573] [<ffffffffa0b6116b>] ptlrpcd_check+0x4db/0x5d0 [ptlrpc]
[88578.242748] [<ffffffffa0b6151b>] ptlrpcd+0x2bb/0x560 [ptlrpc]
[88578.250303] [<ffffffff810c5080>] ? default_wake_function+0x0/0x20
[88578.258243] [<ffffffffa0b61260>] ? ptlrpcd+0x0/0x560 [ptlrpc]
[88578.265777] [<ffffffff810b06ff>] kthread+0xcf/0xe0
[88578.272224] [<ffffffff810b0630>] ? kthread+0x0/0xe0
[88578.278746] [<ffffffff81696a58>] ret_from_fork+0x58/0x90
[88578.285737] [<ffffffff810b0630>] ? kthread+0x0/0xe0
[88578.294807] Kernel panic - not syncing: LBUG
[88578.300515] CPU: 22 PID: 2921 Comm: ptlrpcd_rcv Tainted: G OE ------------ 3.10.0-514.10.2.el7.x86_64 #1
[88578.313346] Hardware name: Intel Corporation S2600GZ ........../S2600GZ, BIOS SE5C600.86B.01.08.0003.022620131521 02/26/2013
[88578.326859] ffffffffa0823e8b 0000000059ebae47 ffff880805e53b30 ffffffff816864ef
[88578.336126] ffff880805e53bb0 ffffffff8167f8f6 ffffffff00000008 ffff880805e53bc0
[88578.345376] ffff880805e53b60 0000000059ebae47 0000000059ebae47 ffff88081e38f838
[88578.354607] Call Trace:
[88578.358250] [<ffffffff816864ef>] dump_stack+0x19/0x1b
[88578.364899] [<ffffffff8167f8f6>] panic+0xe3/0x1f2
[88578.371145] [<ffffffffa0805894>] lbug_with_loc+0x64/0xb0 [libcfs]
[88578.378944] [<ffffffffa0b376b7>] ptlrpc_replay_next+0x447/0x450 [ptlrpc]
[88578.387403] [<ffffffffa0b5ac92>] ptlrpc_import_recovery_state_machine+0x1d2/0xbc0 [ptlrpc]
[88578.397595] [<ffffffffa0b2fdbf>] ptlrpc_replay_interpret+0x17f/0x7d0 [ptlrpc]
[88578.406511] [<ffffffffa0b33b1c>] ptlrpc_check_set.part.23+0x42c/0x1da0 [ptlrpc]
[88578.415621] [<ffffffffa0b354eb>] ptlrpc_check_set+0x5b/0xe0 [ptlrpc]
[88578.423637] [<ffffffffa0b6116b>] ptlrpcd_check+0x4db/0x5d0 [ptlrpc]
[88578.431535] [<ffffffffa0b6151b>] ptlrpcd+0x2bb/0x560 [ptlrpc]
[88578.438818] [<ffffffff810c5080>] ? wake_up_state+0x20/0x20
[88578.445879] [<ffffffffa0b61260>] ? ptlrpcd_check+0x5d0/0x5d0 [ptlrpc]
[88578.453914] [<ffffffff810b06ff>] kthread+0xcf/0xe0
[88578.460085] [<ffffffff810b0630>] ? kthread_create_on_node+0x140/0x140
[88578.468086] [<ffffffff81696a58>] ret_from_fork+0x58/0x90
[88578.474806] [<ffffffff810b0630>] ? kthread_create_on_node+0x140/0x140

crash dump is available on soak-23 |
| Comment by Niu Yawei (Inactive) [ 22/Jun/17 ] |
|
I'll make a debug patch to collect more information about the request. |
| Comment by Gerrit Updater [ 22/Jun/17 ] |
|
Niu Yawei (yawei.niu@intel.com) uploaded a new patch: https://review.whamcloud.com/27776 |
| Comment by Cliff White (Inactive) [ 23/Jun/17 ] |
|
Peter requested a run of master. Hit this again; will see about the debug patch. System log attached. |
| Comment by Cliff White (Inactive) [ 23/Jun/17 ] |
|
Crash dump is available on soak-16 |
| Comment by Niu Yawei (Inactive) [ 26/Jun/17 ] |
|
Cliff, if it's easy to reproduce, could you apply the debug patch and run again? Thanks. |
| Comment by Cliff White (Inactive) [ 26/Jun/17 ] |
|
Ran the patch.

Jun 24 01:36:21 soak-11 kernel: LustreError: 4559:0:(service.c:2228:ptlrpc_handle_rs()) ASSERTION( lock != ((void *)0) ) failed:
Jun 24 01:36:21 soak-11 kernel: LustreError: 4559:0:(service.c:2228:ptlrpc_handle_rs()) LBUG
Jun 24 01:36:21 soak-11 kernel: Pid: 4559, comm: ptlrpc_hr00_001
Jun 24 01:36:21 soak-11 kernel: Call Trace:
Jun 24 01:36:21 soak-11 kernel: [<ffffffffa0bbd7ee>] libcfs_call_trace+0x4e/0x60 [libcfs]
Jun 24 01:36:21 soak-11 kernel: [<ffffffffa0bbd87c>] lbug_with_loc+0x4c/0xb0 [libcfs]
Jun 24 01:36:21 soak-11 kernel: [<ffffffffa0f3282d>] ptlrpc_hr_main+0x83d/0x8f0 [ptlrpc]
Jun 24 01:36:21 soak-11 kernel: [<ffffffff810c8395>] ? sched_clock_cpu+0x85/0xc0
Jun 24 01:36:21 soak-11 kernel: [<ffffffff810c54e0>] ? default_wake_function+0x0/0x20
Jun 24 01:36:21 soak-11 kernel: [<ffffffffa0f31ff0>] ? ptlrpc_hr_main+0x0/0x8f0 [ptlrpc]
Jun 24 01:36:21 soak-11 kernel: [<ffffffff810b0a4f>] kthread+0xcf/0xe0
Jun 24 01:36:21 soak-11 kernel: [<ffffffff810b0980>] ? kthread+0x0/0xe0
Jun 24 01:36:21 soak-11 kernel: [<ffffffff81697798>] ret_from_fork+0x58/0x90
Jun 24 01:36:21 soak-11 kernel: [<ffffffff810b0980>] ? kthread+0x0/0xe0
Jun 24 01:36:21 soak-11 kernel: Kernel panic - not syncing: LBUG
Jun 24 01:36:22 soak-11 kernel: CPU: 1 PID: 4559 Comm: ptlrpc_hr00_001 Tainted: P OE ------------ 3.10.0-514.21.1.el7_lustre.x86_64 #1
Jun 24 01:36:22 soak-11 kernel: Hardware name: Intel Corporation S2600GZ ........../S2600GZ, BIOS SE5C600.86B.01.08.0003.022620131521 02/26/2013
Jun 24 01:36:22 soak-11 kernel: ffffffffa0bdbe8b 00000000de1e9805 ffff88081bdd7d30 ffffffff8168717f
Jun 24 01:36:22 soak-11 kernel: ffff88081bdd7db0 ffffffff816805aa ffffffff00000008 ffff88081bdd7dc0
Jun 24 01:36:22 soak-11 kernel: ffff88081bdd7d60 00000000de1e9805 00000000de1e9805 ffff88042e04f838
Jun 24 01:36:22 soak-11 kernel: Call Trace:
Jun 24 01:36:22 soak-11 kernel: [<ffffffff8168717f>] dump_stack+0x19/0x1b
Jun 24 01:36:22 soak-11 kernel: [<ffffffff816805aa>] panic+0xe3/0x1f2
Jun 24 01:36:22 soak-11 kernel: [<ffffffffa0bbd894>] lbug_with_loc+0x64/0xb0 [libcfs]
Jun 24 01:36:22 soak-11 kernel: [<ffffffffa0f3282d>] ptlrpc_hr_main+0x83d/0x8f0 [ptlrpc]
Jun 24 01:36:22 soak-11 kernel: [<ffffffff810c8395>] ? sched_clock_cpu+0x85/0xc0
Jun 24 01:36:22 soak-11 kernel: [<ffffffff810c54e0>] ? wake_up_state+0x20/0x20
Jun 24 01:36:22 soak-11 kernel: [<ffffffffa0f31ff0>] ? ptlrpc_svcpt_stop_threads+0x590/0x590 [ptlrpc]
Jun 24 01:36:22 soak-11 kernel: [<ffffffff810b0a4f>] kthread+0xcf/0xe0
Jun 24 01:36:22 soak-11 kernel: [<ffffffff810b0980>] ? kthread_create_on_node+0x140/0x140
Jun 24 01:36:22 soak-11 kernel: [<ffffffff81697798>] ret_from_fork+0x58/0x90
Jun 24 01:36:22 soak-11 kernel: [<ffffffff810b0980>] ? kthread_create_on_node+0x140/0x140

[10535.721248] WARNING: at arch/x86/kernel/smp.c:125 native_smp_send_reschedule+0x5f/0x70()
[10535.731298] Modules linked in: osp(OE) mdd(OE) lod(OE) mdt(OE) lfsck(OE) mgc(OE) osd_ldiskfs(OE) ldiskfs(OE) lquota(OE) fid(OE) fld(OE) ko2iblnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) libcfs(OE) zfs(POE) zunicode(POE) zavl(POE) zcommon(POE) znvpair(POE) spl(OE) zlib_deflate 8021q garp mrp stp llc rpcrdma ib_isert iscsi_target_mod ib_iser libiscsi scsi_transport_iscsi ib_srpt target_core_mod ib_srp scsi_transport_srp scsi_tgt ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm mlx4_ib ib_core intel_powerclamp coretemp intel_rapl iosf_mbi kvm dm_round_robin irqbypass crc32_pclmul ghash_clmulni_intel ntb aesni_intel sg lrw gf128mul glue_helper ablk_helper cryptd sb_edac shpchp iTCO_wdt ipmi_devintf ipmi_ssif iTCO_vendor_support wmi ipmi_si ioatdma edac_core i2c_i801 lpc_ich pcspkr mei_me ipmi_msghandler mei dm_multipath dm_mod nfsd nfs_acl lockd grace auth_rpcgss sunrpc ip_tables ext4 mbcache jbd2 mlx4_en sd_mod crc_t10dif crct10dif_generic mgag200 drm_kms_helper syscopyarea igb sysfillrect sysimgblt isci fb_sys_fops ahci crct10dif_pclmul ttm crct10dif_common libsas ptp crc32c_intel libahci pps_core drm mlx4_core mpt3sas libata dca raid_class i2c_algo_bit scsi_transport_sas devlink i2c_core fjes
[10535.863515] CPU: 1 PID: 13 Comm: migration/1 Tainted: P OE ------------ 3.10.0-514.21.1.el7_lustre.x86_64 #1
[10535.876807] Hardware name: Intel Corporation S2600GZ ........../S2600GZ, BIOS SE5C600.86B.01.08.0003.022620131521 02/26/2013
[10535.890410] 0000000000000000 00000000073b2e60 ffff880179ccbc70 ffffffff8168717f
[10535.899786] ffff880179ccbca8 ffffffff81085cb0 0000000000000011 ffff880179ccbd38
[10535.909179] ffff88042e256c40 ffff88017a20edd0 ffff88081be6de20 ffff880179ccbcb8
[10535.918571] Call Trace:
[10535.922389] [<ffffffff8168717f>] dump_stack+0x19/0x1b
[10535.929237] [<ffffffff81085cb0>] warn_slowpath_common+0x70/0xb0
[10535.937041] [<ffffffff81085dfa>] warn_slowpath_null+0x1a/0x20
[10535.944645] [<ffffffff8104e10f>] native_smp_send_reschedule+0x5f/0x70
[10535.953022] [<ffffffff810c1128>] resched_task+0xb8/0xd0
[10535.960003] [<ffffffff810c1f55>] check_preempt_curr+0x75/0xa0
[10535.967565] [<ffffffff810c9d5b>] move_task+0x4b/0x50
[10535.974246] [<ffffffff810cccbf>] active_load_balance_cpu_stop+0x1ff/0x290
[10535.982984] [<ffffffff810ccac0>] ? can_migrate_task+0x500/0x500
[10535.990751] [<ffffffff81116a56>] cpu_stopper_thread+0x96/0x170
[10535.998423] [<ffffffff8168c240>] ? __schedule+0x3b0/0x990
[10536.005608] [<ffffffff810b95bf>] smpboot_thread_fn+0x12f/0x180
[10536.013263] [<ffffffff810b9490>] ? lg_double_unlock+0x90/0x90
[10536.020798] [<ffffffff810b0a4f>] kthread+0xcf/0xe0
[10536.027233] [<ffffffff810b0980>] ? kthread_create_on_node+0x140/0x140
[10536.035501] [<ffffffff81697798>] ret_from_fork+0x58/0x90
[10536.042505] [<ffffffff810b0980>] ? kthread_create_on_node+0x140/0x140
[10536.050768] ---[ end trace 163e77165449f3fa ]--- |
| Comment by Niu Yawei (Inactive) [ 26/Jun/17 ] |
|
This is |
| Comment by Cliff White (Inactive) [ 26/Jun/17 ] |
|
Can you rebase this on latest master? |
| Comment by Niu Yawei (Inactive) [ 27/Jun/17 ] |
Rebase done. |
| Comment by Cliff White (Inactive) [ 28/Jun/17 ] |
|
The system ran for 24 hours with no LBUGs, but the soak-29 client had a hard crash with NMI watchdog errors. |
| Comment by Niu Yawei (Inactive) [ 29/Jun/17 ] |
|
Looks like we hit another bug: lots of CPUs are stuck on the net lock.

[50631.252941] CPU: 13 PID: 43744 Comm: simul Tainted: G IOE ------------ 3.10.0-514.10.2.el7.x86_64 #1
[50631.252942] Hardware name: Intel Corporation S2600WTT/S2600WTT, BIOS SE5C610.86B.01.01.0008.021120151325 02/11/2015
[50631.252943] task: ffff880856d15e20 ti: ffff880564080000 task.ti: ffff880564080000
[50631.252943] RIP: 0010:[<ffffffff8168de5a>] [<ffffffff8168de5a>] _raw_spin_lock+0x3a/0x50
[50631.252948] RSP: 0018:ffff8805640831e8 EFLAGS: 00000206
[50631.252949] RAX: 0000000000006893 RBX: ffff880564083fd8 RCX: 0000000000008478
[50631.252949] RDX: 00000000000084c6 RSI: 00000000000084c6 RDI: ffff88105e021fc0
[50631.252950] RBP: ffff8805640831e8 R08: 0000000000000001 R09: 0001fd60baec016f
[50631.252950] R10: ffff88084073c9ad R11: ffff880564083186 R12: ffffffffa09c0200
[50631.252951] R13: ffffffffa08386d8 R14: ffff880564083290 R15: ffffffffa09c6c60
[50631.252951] FS: 00007f590af73740(0000) GS:ffff88085ef40000(0000) knlGS:0000000000000000
[50631.252952] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[50631.252952] CR2: 00007f2f1b6cb6a8 CR3: 0000000350e07000 CR4: 00000000001407e0
[50631.252953] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[50631.252953] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[50631.252954] Stack:
[50631.252954] ffff880564083210 ffffffffa08447c8 ffffffffa08448b6 0000000000000001
[50631.252955] ffff88104fb88400 ffff8805640832d0 ffffffffa09869b5 00000add00000000
[50631.252956] 0000000007a55a10 ffffffffff0a0004 0000000007a55a10 ffffffffffffffff
[50631.252957] Call Trace:
[50631.252959] [<ffffffffa08447c8>] cfs_percpt_lock+0x58/0x110 [libcfs]
[50631.252970] [<ffffffffa08448b6>] ? cfs_percpt_unlock+0x36/0xc0 [libcfs]
[50631.252976] [<ffffffffa09869b5>] lnet_select_pathway+0x235/0x1140 [lnet]
[50631.252988] [<ffffffffa0988fb1>] lnet_send+0x51/0x180 [lnet]
[50631.252994] [<ffffffffa0989325>] LNetPut+0x245/0x7a0 [lnet]
[50631.253001] [<ffffffffa0bc1996>] ptl_send_buf+0x146/0x530 [ptlrpc]
[50631.253028] [<ffffffffa0bc358d>] ptl_send_rpc+0x67d/0xe60 [ptlrpc]
[50631.253046] [<ffffffffa0bb8ed0>] ptlrpc_send_new_req+0x460/0xa60 [ptlrpc]
[50631.253064] [<ffffffffa0bbdaa1>] ptlrpc_set_wait+0x3d1/0x900 [ptlrpc]
[50631.253082] [<ffffffffa09fd1e5>] ? lustre_get_jobid+0x215/0x4d0 [obdclass]
[50631.253103] [<ffffffffa0bc8fe5>] ? lustre_msg_set_jobid+0x95/0x100 [ptlrpc]

Could you open another ticket for this? I think we'd ask some LNet expert to take a look. Thanks. |
| Comment by Cliff White (Inactive) [ 29/Jun/17 ] |
|
I switched to b2.10 build 2 last night, and hit this again.

[52683.724559] LustreError: 2896:0:(recover.c:157:ptlrpc_replay_next()) ASSERTION( !list_empty(&req->rq_cli.cr_unreplied_list) ) failed:
[52683.740443] LustreError: 2896:0:(recover.c:157:ptlrpc_replay_next()) LBUG
[52683.749161] Pid: 2896, comm: ptlrpcd_rcv
[52683.754657] Call Trace:
[52683.761275] [<ffffffffa07d77ee>] libcfs_call_trace+0x4e/0x60 [libcfs]
[52683.769670] [<ffffffffa07d787c>] lbug_with_loc+0x4c/0xb0 [libcfs]
[52683.777689] [<ffffffffa0b1b817>] ptlrpc_replay_next+0x447/0x450 [ptlrpc]
[52683.786387] [<ffffffffa0b3ede2>] ptlrpc_import_recovery_state_machine+0x1d2/0xbc0 [ptlrpc]
[52683.796819] [<ffffffffa0b13f1f>] ptlrpc_replay_interpret+0x17f/0x7d0 [ptlrpc]
[52683.805973] [<ffffffffa0b17c7c>] ptlrpc_check_set.part.23+0x42c/0x1da0 [ptlrpc]
[52683.815312] [<ffffffffa0b1964b>] ptlrpc_check_set+0x5b/0xe0 [ptlrpc]
[52683.823568] [<ffffffffa0b452bb>] ptlrpcd_check+0x4db/0x5d0 [ptlrpc]
[52683.831704] [<ffffffffa0b4566b>] ptlrpcd+0x2bb/0x560 [ptlrpc]
[52683.839210] [<ffffffff810c5080>] ? default_wake_function+0x0/0x20
[52683.847110] [<ffffffffa0b453b0>] ? ptlrpcd+0x0/0x560 [ptlrpc]
[52683.854587] [<ffffffff810b06ff>] kthread+0xcf/0xe0
[52683.860982] [<ffffffff810b0630>] ? kthread+0x0/0xe0
[52683.867455] [<ffffffff81696a58>] ret_from_fork+0x58/0x90
[52683.874426] [<ffffffff810b0630>] ? kthread+0x0/0xe0 |
| Comment by Andreas Dilger [ 29/Jun/17 ] |
|
Niu, I just noticed that ptlrpc_connect_interpret() changes imp_replay_cursor without imp_lock held. There may be other places where this is modified; could multiple threads changing this pointer, or imp_committed_list or rq_replay_list, cause the LASSERT() to trip? |
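A minimal sketch of the kind of serialization Andreas is describing; imp_lock, imp_replay_cursor and imp_committed_list are the fields named in his comment, while the surrounding context is assumed for illustration and this is not the landed patch:

        /* Illustrative only: take imp_lock around cursor updates so that
         * ptlrpc_replay_next()'s walk of imp_committed_list cannot race
         * with another thread repositioning the cursor. */
        spin_lock(&imp->imp_lock);
        imp->imp_replay_cursor = &imp->imp_committed_list;  /* reset cursor */
        spin_unlock(&imp->imp_lock);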
| Comment by Oleg Drokin [ 29/Jun/17 ] |
|
The debug that seems useful in this case is:

for i in ha net info rpctrace ; do echo +$i > /proc/sys/lnet/debug ; done |
| Comment by Niu Yawei (Inactive) [ 30/Jun/17 ] |
The debug patch only changed the LASSERT to an 'if' check; I don't think it would prevent the error from being hit.
As for ptlrpc_connect_interpret(): no replay request is in flight at that point, so I think it's OK to change imp_replay_cursor without holding the lock. Given this bug is rarely hit, I think we can just land the debug patch so that we can collect debug information the next time we hit it in soak testing. |
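For reference, a plausible shape of that debug change, inferred from the later console output at recover.c:158-159 ("@@@ Not on unreplied list!" followed by LBUG) rather than from the patch itself:

        /* Inferred sketch, not the verbatim patch: report the offending
         * request with DEBUG_REQ() before crashing so its state lands in
         * the logs and the crash dump. */
        if (list_empty(&req->rq_cli.cr_unreplied_list)) {
                DEBUG_REQ(D_ERROR, req, "Not on unreplied list!");
                LBUG();
        }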
| Comment by Cliff White (Inactive) [ 30/Jun/17 ] |
|
The LBUG does not trigger with the debug patch, which may be significant.

Jun 30 19:07:41 soak-16 kernel: Lustre: 2972:0:(client.c:2114:ptlrpc_expire_one_request()) @@@ Request sent has timed out for sent delay: [sent 1498849630/real 0] req@ffff8806cf0b4200 x1571647094223296/t0(0) o8->soaked-OST0016-osc-ffff88082f4ec800@192.168.1.106@o2ib:28/4 lens 520/544 e 0 to 1 dl 1498849661 ref 2 fl Rpc:XN/0/ffffffff rc 0/-1
Jun 30 19:07:41 soak-16 kernel: Lustre: 2972:0:(client.c:2114:ptlrpc_expire_one_request()) Skipped 7 previous similar messages
Jun 30 19:08:14 soak-16 kernel: LustreError: 2972:0:(recover.c:157:ptlrpc_replay_next()) ASSERTION( !list_empty(&req->rq_cli.cr_unreplied_list) ) failed:
Jun 30 19:08:14 soak-16 kernel: LustreError: 2972:0:(recover.c:157:ptlrpc_replay_next()) LBUG
Jun 30 19:08:14 soak-16 kernel: Pid: 2972, comm: ptlrpcd_rcv
Jun 30 19:08:14 soak-16 kernel: Call Trace:
Jun 30 19:08:14 soak-16 kernel: [<ffffffffa082a7ee>] libcfs_call_trace+0x4e/0x60 [libcfs]
Jun 30 19:08:14 soak-16 kernel: [<ffffffffa082a87c>] lbug_with_loc+0x4c/0xb0 [libcfs]
Jun 30 19:08:14 soak-16 kernel: [<ffffffffa0b5c817>] ptlrpc_replay_next+0x447/0x450 [ptlrpc]
Jun 30 19:08:14 soak-16 kernel: [<ffffffffa0b7fde2>] ptlrpc_import_recovery_state_machine+0x1d2/0xbc0 [ptlrpc]
Jun 30 19:08:14 soak-16 kernel: [<ffffffffa0b54f1f>] ptlrpc_replay_interpret+0x17f/0x7d0 [ptlrpc]
Jun 30 19:08:14 soak-16 kernel: [<ffffffffa0b58c7c>] ptlrpc_check_set.part.23+0x42c/0x1da0 [ptlrpc]
Jun 30 19:08:14 soak-16 kernel: [<ffffffffa0b5a64b>] ptlrpc_check_set+0x5b/0xe0 [ptlrpc]
Jun 30 19:08:14 soak-16 kernel: [<ffffffffa0b862bb>] ptlrpcd_check+0x4db/0x5d0 [ptlrpc]
Jun 30 19:08:14 soak-16 kernel: [<ffffffffa0b8666b>] ptlrpcd+0x2bb/0x560 [ptlrpc]
Jun 30 19:08:14 soak-16 kernel: [<ffffffff810c5080>] ? default_wake_function+0x0/0x20
Jun 30 19:08:14 soak-16 kernel: [<ffffffffa0b863b0>] ? ptlrpcd+0x0/0x560 [ptlrpc]
Jun 30 19:08:14 soak-16 kernel: [<ffffffff810b06ff>] kthread+0xcf/0xe0
Jun 30 19:08:14 soak-16 kernel: [<ffffffff810b0630>] ? kthread+0x0/0xe0
Jun 30 19:08:14 soak-16 kernel: [<ffffffff81696a58>] ret_from_fork+0x58/0x90
Jun 30 19:08:14 soak-16 kernel: [<ffffffff810b0630>] ? kthread+0x0/0xe0
Jun 30 19:08:14 soak-16 kernel:
|
| Comment by Cliff White (Inactive) [ 30/Jun/17 ] |
|
Crash dump is available on soak-16 |
| Comment by Cliff White (Inactive) [ 03/Jul/17 ] |
|
Ran with debug=0.

Jul 1 12:01:11 soak-16 kernel: LustreError: 2905:0:(recover.c:158:ptlrpc_replay_next()) @@@ Not on unreplied list!
Jul 1 12:01:11 soak-16 kernel:  req@ffff88060fac0f00 x1571667100760512/t738743039794(738743039794) o4->soaked-OST0012-osc-ffff880829008000@192.168.1.102@o2ib:6/4 lens 608/416 e 0 to 0 dl 1498910140 ref 1 fl Complete:R/4/0 rc 0/0
Jul 1 12:01:11 soak-16 kernel: LustreError: 2905:0:(recover.c:159:ptlrpc_replay_next()) LBUG
Jul 1 12:01:11 soak-16 kernel: Pid: 2905, comm: ptlrpcd_rcv
Jul 1 12:01:11 soak-16 kernel: Call Trace:
Jul 1 12:01:11 soak-16 kernel: [<ffffffffa08077ee>] libcfs_call_trace+0x4e/0x60 [libcfs]
Jul 1 12:01:11 soak-16 kernel: [<ffffffffa080787c>] lbug_with_loc+0x4c/0xb0 [libcfs]
Jul 1 12:01:11 soak-16 kernel: [<ffffffffa0b4d81d>] ptlrpc_replay_next+0x44d/0x450 [ptlrpc]
Jul 1 12:01:11 soak-16 kernel: [<ffffffffa0b70de2>] ptlrpc_import_recovery_state_machine+0x1d2/0xbc0 [ptlrpc]
Jul 1 12:01:11 soak-16 kernel: [<ffffffffa0b45f1f>] ptlrpc_replay_interpret+0x17f/0x7d0 [ptlrpc]
Jul 1 12:01:11 soak-16 kernel: [<ffffffffa0b49c7c>] ptlrpc_check_set.part.23+0x42c/0x1da0 [ptlrpc]
Jul 1 12:01:11 soak-16 kernel: [<ffffffffa0b4b64b>] ptlrpc_check_set+0x5b/0xe0 [ptlrpc]
Jul 1 12:01:11 soak-16 kernel: [<ffffffffa0b772bb>] ptlrpcd_check+0x4db/0x5d0 [ptlrpc]
Jul 1 12:01:11 soak-16 kernel: [<ffffffffa0b7766b>] ptlrpcd+0x2bb/0x560 [ptlrpc]
Jul 1 12:01:11 soak-16 kernel: [<ffffffff810c5080>] ? default_wake_function+0x0/0x20
Jul 1 12:01:11 soak-16 kernel: [<ffffffffa0b773b0>] ? ptlrpcd+0x0/0x560 [ptlrpc]
Jul 1 12:01:11 soak-16 kernel: [<ffffffff810b06ff>] kthread+0xcf/0xe0
Jul 1 12:01:11 soak-16 kernel: [<ffffffff810b0630>] ? kthread+0x0/0xe0
Jul 1 12:01:11 soak-16 kernel: [<ffffffff81696a58>] ret_from_fork+0x58/0x90
Jul 1 12:01:11 soak-16 kernel: [<ffffffff810b0630>] ? kthread+0x0/0xe0
Jul 1 12:01:11 soak-16 kernel: LustreError: dumping log to /tmp/lustre-log.1498910471.2905 |
| Comment by Cliff White (Inactive) [ 03/Jul/17 ] |
|
lustre-log attached. |
| Comment by Gerrit Updater [ 04/Jul/17 ] |
|
Niu Yawei (yawei.niu@intel.com) uploaded a new patch: https://review.whamcloud.com/27920 |
| Comment by Andreas Dilger [ 04/Jul/17 ] |
|
Cliff, can you please test on soak with this patch? |
| Comment by Cliff White (Inactive) [ 05/Jul/17 ] |
|
I am not seeing LBUGs so far, but I am seeing a consistent pattern where, after a failover, most/all clients are evicted and recovery is aborted. Some sample messages:

Jul 5 15:05:21 soak-8 kernel: Lustre: 4763:0:(ldlm_lib.c:2059:target_recovery_overseer()) recovery is aborted, evict exports in recovery
Jul 5 15:05:21 soak-8 kernel: Lustre: soaked-MDT0000: Recovery over after 2:18, of 37 clients 3 recovered and 34 were evicted.
Jul 5 15:37:02 soak-8 kernel: Lustre: MGS: haven't heard from client 1b0cd471-e969-acd0-bd39-45b87774c0c3 (at 192.168.1.107@o2ib) in 227 seconds. I think it's dead, and I am evicting it. exp ffff8804b978ac00, cur 1499269022 expire 1499268872 last 1499268795
Jul 5 16:47:07 soak-8 kernel: Lustre: MGS: haven't heard from client adfa67c6-0880-1a84-5ecf-172c556a5dd6 (at 192.168.1.111@o2ib) in 227 seconds. I think it's dead, and I am evicting it. exp ffff8807dd474c00, cur 1499273227 expire 1499273077 last 1499273000
Jul 5 16:47:16 soak-8 kernel: Lustre: soaked-MDT0000: haven't heard from client soaked-MDT0003-mdtlov_UUID (at 192.168.1.111@o2ib) in 228 seconds. I think it's dead, and I am evicting it. exp ffff8807dea70400, cur 1499273236 expire 1499273086 last 1499273008
Jul 5 16:50:06 soak-8 kernel: LustreError: 167-0: soaked-MDT0003-osp-MDT0000: This client was evicted by soaked-MDT0003; in progress operations using this service will fail.
Jul 5 17:18:19 soak-8 kernel: Lustre: MGS: haven't heard from client 8fe08516-dbc0-85e3-6fb4-93c94095a877 (at 192.168.1.106@o2ib) in 227 seconds. I think it's dead, and I am evicting it. exp ffff8807dd475800, cur 1499275099 expire 1499274949 last 1499274872
Jul 5 18:14:24 soak-8 kernel: Lustre: 4771:0:(ldlm_lib.c:2059:target_recovery_overseer()) recovery is aborted, evict exports in recovery
Jul 5 18:14:24 soak-8 kernel: Lustre: soaked-MDT0000: Denying connection for new client a1e375f2-7cfb-7e06-98b7-7ea8a7f2eca4(at 192.168.1.130@o2ib), waiting for 37 known clients (13 recovered, 0 in progress, and 24 evicted) to recover in 4:02
Jul 5 18:14:24 soak-8 kernel: Lustre: soaked-MDT0000: Recovery over after 11:00, of 37 clients 13 recovered and 24 were evicted.
Jul 5 18:44:53 soak-8 kernel: Lustre: MGS: haven't heard from client e13056ec-cbfc-e019-c014-c1a0cfd67815 (at 192.168.1.103@o2ib) in 229 seconds. I think it's dead, and I am evicting it. exp ffff88017917cc00, cur 1499280293 expire 1499280143 last 1499280064
Jul 5 19:07:31 soak-8 kernel: Lustre: MGS: haven't heard from client 9b21a845-8a0f-5a86-e846-32a309c2b726 (at 192.168.1.111@o2ib) in 233 seconds. I think it's dead, and I am evicting it. exp ffff88017976dc00, cur 1499281651 expire 1499281501 last 1499281418
Jul 5 19:10:18 soak-8 kernel: LustreError: 167-0: soaked-MDT0003-osp-MDT0000: This client was evicted by soaked-MDT0003; in progress operations using this service will fail.
Jul 5 19:35:10 soak-8 kernel: Lustre: MGS: haven't heard from client 687666be-1c95-4869-bffb-261e5c2ec0e7 (at 192.168.1.111@o2ib) in 228 seconds. I think it's dead, and I am evicting it. exp ffff8807dbcfe000, cur 1499283310 expire 1499283160 last 1499283082
Jul 5 19:37:55 soak-8 kernel: LustreError: 167-0: soaked-MDT0003-osp-MDT0000: This client was evicted by soaked-MDT0003; in progress operations using this service will fail.
Jul 5 20:06:07 soak-8 kernel: Lustre: MGS: haven't heard from client a43f28ec-dc91-d483-5a19-102f4ecfe255 (at 192.168.1.109@o2ib) in 228 seconds. I think it's dead, and I am evicting it. exp ffff880404926c00, cur 1499285167 expire 1499285017 last 1499284939
Jul 5 20:07:15 soak-8 kernel: Lustre: 15754:0:(ldlm_lib.c:2059:target_recovery_overseer()) recovery is aborted, evict exports in recovery
Jul 5 20:08:00 soak-8 kernel: LustreError: 167-0: soaked-MDT0001-osp-MDT0000: This client was evicted by soaked-MDT0001; in progress operations using this service will fail.

It would be good if someone could log in and have a look. |
| Comment by Niu Yawei (Inactive) [ 06/Jul/17 ] |
|
All the evicted clients are MDT-MDT clients; it seems like a DNE recovery problem, not related to this patch. Cliff, did you ever observe similar recovery failures during past soak testing? Are the clients evicted by the MGS (192.168.1.103@o2ib, 106, 109 and 111) MDTs or Lustre clients? Did you observe any hung MDT during recovery? Could you attach the logs from all MDTs to the ticket? It's quite slow from here to log in and operate on the soak nodes. (If any MDT hangs during recovery, could you dump the |
| Comment by Cliff White (Inactive) [ 06/Jul/17 ] |
|
I am seeing two things constantly with this build:

1. After failover, recovery takes forever/quite a long time on the MDS:

soak-10
----------------
mdt.soaked-MDT0002.recovery_status=
status: WAITING
non-ready MDTs: 0000 0001 0003
recovery_start: 1499361565
time_waited: 3037

2. When LFSCK is started, it times out and the abort command is issued by the framework. The abort takes a very long time:

2017-07-06 17:29:30,838:fsmgmt.fsmgmt:INFO executing cmd: lctl lfsck_stop -M soaked-MDT0000
2017-07-06 18:13:57,169:fsmgmt.fsmgmt:INFO lfsck stopped on soak-8
2017-07-06 18:13:57,170:fsmgmt.fsmgmt:INFO mds_restart completed, lfsck completed

Lustre-log dumps from all MDS and further stack traces are attached.

[ 3113.596481] Lustre: soaked-MDT0002: Recovery already passed deadline 30:28, It is most likely due to DNE recovery is failed or stuck, please wait a few more minutes or abort the recovery. |
| Comment by Niu Yawei (Inactive) [ 07/Jul/17 ] |
|
Thank you, Cliff. Seems there is something wrong with the DNE recovery. I see there is an OI scrub thread hung waiting for I/O completion on soak-8, but I'm not sure if it's related to the recovery failure problem:

[19587.652446] OI_scrub D ffff666666663838 0 28932 2 0x00000080
[19587.660994] ffff8807d14c7890 0000000000000046 ffff88082c3f9f60 ffff8807d14c7fd8
[19587.669968] ffff8803d0049898 ffff8807d14c7fd8 ffff88082c3f9f60 ffff88042e056c40
[19587.679285] ffff88042c8a1f78 7fffffffffffffff ffff88039a765000 ffffffff8168a8a0
[19587.688282] Call Trace:
[19587.691650] [<ffffffff8168a8a0>] ? bit_wait+0x50/0x50
[19587.698046] [<ffffffff8168c849>] schedule+0x29/0x70
[19587.704230] [<ffffffff8168a289>] schedule_timeout+0x239/0x2c0
[19587.711411] [<ffffffffa0272f19>] ? dm_old_request_fn+0xc9/0x1f0 [dm_mod]
[19587.719649] [<ffffffff810eb0dc>] ? ktime_get_ts64+0x4c/0xf0
[19587.726616] [<ffffffff8168a8a0>] ? bit_wait+0x50/0x50
[19587.732999] [<ffffffff8168bdee>] io_schedule_timeout+0xae/0x130
[19587.740348] [<ffffffff813252b1>] ? vsnprintf+0x441/0x6a0
[19587.747378] [<ffffffff8168be88>] ? io_schedule+0x18/0x20
[19587.754050] [<ffffffffa1167b3c>] ? ldiskfs_map_blocks+0x5c/0x5e0 [ldiskfs]
[19587.762459] [<ffffffff8168a3d5>] ? __wait_on_bit+0x65/0x90
[19587.769313] [<ffffffff8168a8a0>] ? bit_wait+0x50/0x50
[19587.775671] [<ffffffff8168a481>] ? out_of_line_wait_on_bit+0x81/0xb0
[19587.783486] [<ffffffff810b1be0>] ? wake_bit_function+0x40/0x40
[19587.790716] [<ffffffff8123341a>] ? __wait_on_buffer+0x2a/0x30
[19587.797859] [<ffffffffa116833c>] ? ldiskfs_bread+0x7c/0xc0 [ldiskfs]
[19587.806016] [<ffffffffa122f0a0>] ? iam_node_read+0x60/0xe0 [osd_ldiskfs]
[19587.814213] [<ffffffffa122a6c7>] ? fid_is_on_ost+0x1c7/0x420 [osd_ldiskfs]
[19587.822600] [<ffffffffa122a9c4>] ? osd_oi_lookup+0xa4/0x160 [osd_ldiskfs]
[19587.830894] [<ffffffffa0bc43b3>] ? libcfs_log_goto+0x23/0x30 [libcfs]
[19587.838792] [<ffffffffa123de09>] ? osd_scrub_get_fid+0x1f9/0x290 [osd_ldiskfs]
[19587.847565] [<ffffffffa1242a62>] ? osd_scrub_exec+0x62/0x5b0 [osd_ldiskfs]
[19587.855967] [<ffffffffa11631fe>] ? ldiskfs_read_inode_bitmap+0x1fe/0x5b0 [ldiskfs]
[19587.865156] [<ffffffffa1244225>] ? osd_inode_iteration+0x475/0xc70 [osd_ldiskfs]
[19587.874494] [<ffffffffa1242a00>] ? osd_ios_ROOT_scan+0x300/0x300 [osd_ldiskfs]
[19587.883303] [<ffffffffa123eab0>] ? osd_preload_next+0xa0/0xa0 [osd_ldiskfs]
[19587.891811] [<ffffffffa1245360>] ? osd_scrub_main+0x940/0xf00 [osd_ldiskfs]
[19587.900318] [<ffffffff810c54e0>] ? wake_up_state+0x20/0x20
[19587.907177] [<ffffffffa1244a20>] ? osd_inode_iteration+0xc70/0xc70 [osd_ldiskfs]
[19587.916170] [<ffffffff810b0a4f>] ? kthread+0xcf/0xe0
[19587.922453] [<ffffffff810b0980>] ? kthread_create_on_node+0x140/0x140
[19587.930372] [<ffffffff81697798>] ? ret_from_fork+0x58/0x90
[19587.937217] [<ffffffff810b0980>] ? kthread_create_on_node+0x140/0x140

Lai, could you take a look at the logs and stack traces to see if there is anything suspicious? Thank you. |
| Comment by Lai Siyao [ 07/Jul/17 ] |
|
This is found in soak-10.2.log:

soak-10.2.log:00010000:02000400:9.0:1499351737.374320:0:3602:0:(ldlm_lib.c:2080:target_recovery_overseer()) soaked-MDT0002: recovery is timed out, evict stale exports
soak-10.2.log:00010000:00080000:9.0:1499351737.427959:0:3602:0:(ldlm_lib.c:2044:target_recovery_overseer()) soaked-MDT0002: there are still update replay (0x1120005fc96)in the queue.
soak-10.2.log:00010000:00000040:9.0:1499351737.427965:0:3602:0:(ldlm_lib.c:2504:target_recovery_thread()) 1: request replay stage - 1 clients from t1176821431446
soak-10.2.log:00010000:00080000:9.0:1499351737.427983:0:3602:0:(ldlm_lib.c:2044:target_recovery_overseer()) soaked-MDT0002: there are still update replay (0x1120005fc96)in the queue.

It looks like the DNE recovery did not finish, which may cause the system to hang or other failures. I'll check how this happened. |
| Comment by Cliff White (Inactive) [ 07/Jul/17 ] |
|
I have created |
| Comment by Cliff White (Inactive) [ 10/Jul/17 ] |
|
Hit this again on lustre-master, build 3606.

[ 4931.461826] LustreError: 3604:0:(recover.c:157:ptlrpc_replay_next()) ASSERTION( !list_empty(&req->rq_cli.cr_unreplied_list) ) failed:
[ 4931.475358] LustreError: 3604:0:(recover.c:157:ptlrpc_replay_next()) LBUG
[ 4931.482951] Pid: 3604, comm: ptlrpcd_rcv
[ 4931.487341] Call Trace:
[ 4931.491756] [<ffffffffa07807ee>] libcfs_call_trace+0x4e/0x60 [libcfs]
[ 4931.499055] [<ffffffffa078087c>] lbug_with_loc+0x4c/0xb0 [libcfs]
[ 4931.506024] [<ffffffffa0b19817>] ptlrpc_replay_next+0x447/0x450 [ptlrpc]
[ 4931.513639] [<ffffffffa0b3cde2>] ptlrpc_import_recovery_state_machine+0x1d2/0xbc0 [ptlrpc]
[ 4931.522999] [<ffffffffa0b11f1f>] ptlrpc_replay_interpret+0x17f/0x7d0 [ptlrpc]
[ 4931.531098] [<ffffffffa0b15c7c>] ptlrpc_check_set.part.23+0x42c/0x1da0 [ptlrpc]
[ 4931.539393] [<ffffffffa0b1764b>] ptlrpc_check_set+0x5b/0xe0 [ptlrpc]
[ 4931.546623] [<ffffffffa0b432bb>] ptlrpcd_check+0x4db/0x5d0 [ptlrpc]
[ 4931.553754] [<ffffffffa0b4366b>] ptlrpcd+0x2bb/0x560 [ptlrpc]
[ 4931.560279] [<ffffffff810c5080>] ? default_wake_function+0x0/0x20
[ 4931.567215] [<ffffffffa0b433b0>] ? ptlrpcd+0x0/0x560 [ptlrpc]
[ 4931.573730] [<ffffffff810b06ff>] kthread+0xcf/0xe0
[ 4931.579187] [<ffffffff810b0630>] ? kthread+0x0/0xe0
[ 4931.584740] [<ffffffff81696a58>] ret_from_fork+0x58/0x90
[ 4931.590778] [<ffffffff810b0630>] ? kthread+0x0/0xe0
[ 4931.597997] Kernel panic - not syncing: LBUG |
| Comment by Cliff White (Inactive) [ 10/Jul/17 ] |
|
There is a crash dump available on soak-31.

[ 4874.102156] Lustre: 3604:0:(client.c:2114:ptlrpc_expire_one_request()) Skipped 3 previous similar messages
[ 4928.198801] Lustre: soaked-OST0002-osc-ffff88105a2ec000: Connection restored to 192.168.1.104@o2ib (at 192.168.1.104@o2ib)
[ 4929.712442] Lustre: soaked-OST0014-osc-ffff88105a2ec000: Connection restored to 192.168.1.104@o2ib (at 192.168.1.104@o2ib)
[ 4929.724803] Lustre: Skipped 1 previous similar message
[ 4931.461826] LustreError: 3604:0:(recover.c:157:ptlrpc_replay_next()) ASSERTION( !list_empty(&req->rq_cli.cr_unreplied_list) ) failed:
[ 4931.475358] LustreError: 3604:0:(recover.c:157:ptlrpc_replay_next()) LBUG
[ 4931.482951] Pid: 3604, comm: ptlrpcd_rcv
[ 4931.487341] Call Trace:
[ 4931.491756] [<ffffffffa07807ee>] libcfs_call_trace+0x4e/0x60 [libcfs]
[ 4931.499055] [<ffffffffa078087c>] lbug_with_loc+0x4c/0xb0 [libcfs]
[ 4931.506024] [<ffffffffa0b19817>] ptlrpc_replay_next+0x447/0x450 [ptlrpc]
[ 4931.513639] [<ffffffffa0b3cde2>] ptlrpc_import_recovery_state_machine+0x1d2/0xbc0 [ptlrpc]
[ 4931.522999] [<ffffffffa0b11f1f>] ptlrpc_replay_interpret+0x17f/0x7d0 [ptlrpc]
[ 4931.531098] [<ffffffffa0b15c7c>] ptlrpc_check_set.part.23+0x42c/0x1da0 [ptlrpc]
[ 4931.539393] [<ffffffffa0b1764b>] ptlrpc_check_set+0x5b/0xe0 [ptlrpc]
[ 4931.546623] [<ffffffffa0b432bb>] ptlrpcd_check+0x4db/0x5d0 [ptlrpc]
[ 4931.553754] [<ffffffffa0b4366b>] ptlrpcd+0x2bb/0x560 [ptlrpc]
[ 4931.560279] [<ffffffff810c5080>] ? default_wake_function+0x0/0x20
[ 4931.567215] [<ffffffffa0b433b0>] ? ptlrpcd+0x0/0x560 [ptlrpc]
[ 4931.573730] [<ffffffff810b06ff>] kthread+0xcf/0xe0
[ 4931.579187] [<ffffffff810b0630>] ? kthread+0x0/0xe0
[ 4931.584740] [<ffffffff81696a58>] ret_from_fork+0x58/0x90
[ 4931.590778] [<ffffffff810b0630>] ? kthread+0x0/0xe0
[ 4931.597997] Kernel panic - not syncing: LBUG
[ 4931.602769] CPU: 5 PID: 3604 Comm: ptlrpcd_rcv Tainted: G IOE ------------ 3.10.0-514.10.2.el7.x86_64 #1
[ 4931.614514] Hardware name: Intel Corporation S2600WTT/S2600WTT, BIOS SE5C610.86B.01.01.0008.021120151325 02/11/2015
[ 4931.626160] ffffffffa079ee8b 00000000f7bba2b7 ffff880841747b30 ffffffff816864ef
[ 4931.634457] ffff880841747bb0 ffffffff8167f8f6 ffffffff00000008 ffff880841747bc0
[ 4931.642744] ffff880841747b60 00000000f7bba2b7 00000000f7bba2b7 ffff88085ed4f838
[ 4931.651035] Call Trace:
[ 4931.653764] [<ffffffff816864ef>] dump_stack+0x19/0x1b
[ 4931.659507] [<ffffffff8167f8f6>] panic+0xe3/0x1f2
[ 4931.664856] [<ffffffffa0780894>] lbug_with_loc+0x64/0xb0 [libcfs]
[ 4931.671781] [<ffffffffa0b19817>] ptlrpc_replay_next+0x447/0x450 [ptlrpc]
[ 4931.679388] [<ffffffffa0b3cde2>] ptlrpc_import_recovery_state_machine+0x1d2/0xbc0 [ptlrpc]
[ 4931.688742] [<ffffffffa0b11f1f>] ptlrpc_replay_interpret+0x17f/0x7d0 [ptlrpc]
[ 4931.696832] [<ffffffffa0b15c7c>] ptlrpc_check_set.part.23+0x42c/0x1da0 [ptlrpc]
[ 4931.705118] [<ffffffffa0b1764b>] ptlrpc_check_set+0x5b/0xe0 [ptlrpc]
[ 4931.712339] [<ffffffffa0b432bb>] ptlrpcd_check+0x4db/0x5d0 [ptlrpc]
[ 4931.719465] [<ffffffffa0b4366b>] ptlrpcd+0x2bb/0x560 [ptlrpc]
[ 4931.725981] [<ffffffff810c5080>] ? wake_up_state+0x20/0x20
[ 4931.732224] [<ffffffffa0b433b0>] ? ptlrpcd_check+0x5d0/0x5d0 [ptlrpc]
[ 4931.739517] [<ffffffff810b06ff>] kthread+0xcf/0xe0
[ 4931.744959] [<ffffffff810b0630>] ? kthread_create_on_node+0x140/0x140
[ 4931.752243] [<ffffffff81696a58>] ret_from_fork+0x58/0x90
[ 4931.758267] [<ffffffff810b0630>] ? kthread_create_on_node+0x140/0x140 |
| Comment by Cliff White (Inactive) [ 11/Jul/17 ] |
|
Now running lustre-b2_10 build #3.

Jul 11 01:13:20 soak-17 kernel: LustreError: 2937:0:(recover.c:157:ptlrpc_replay_next()) ASSERTION( !list_empty(&req->rq_cli.cr_unreplied_list) ) failed:
Jul 11 01:13:20 soak-17 kernel: LustreError: 2937:0:(recover.c:157:ptlrpc_replay_next()) LBUG
Jul 11 01:13:20 soak-17 kernel: Pid: 2937, comm: ptlrpcd_rcv
Jul 11 01:13:20 soak-17 kernel: Call Trace:
Jul 11 01:13:20 soak-17 kernel: [<ffffffffa08447ee>] libcfs_call_trace+0x4e/0x60 [libcfs]
Jul 11 01:13:20 soak-17 kernel: [<ffffffffa084487c>] lbug_with_loc+0x4c/0xb0 [libcfs]
Jul 11 01:13:20 soak-17 kernel: [<ffffffffa0b76817>] ptlrpc_replay_next+0x447/0x450 [ptlrpc]
Jul 11 01:13:20 soak-17 kernel: [<ffffffffa0b99de2>] ptlrpc_import_recovery_state_machine+0x1d2/0xbc0 [ptlrpc]
Jul 11 01:13:20 soak-17 kernel: [<ffffffffa0b6ef1f>] ptlrpc_replay_interpret+0x17f/0x7d0 [ptlrpc]
Jul 11 01:13:20 soak-17 kernel: [<ffffffffa0b72c7c>] ptlrpc_check_set.part.23+0x42c/0x1da0 [ptlrpc]
Jul 11 01:13:20 soak-17 kernel: [<ffffffffa0b7464b>] ptlrpc_check_set+0x5b/0xe0 [ptlrpc]
Jul 11 01:13:20 soak-17 kernel: [<ffffffffa0ba02bb>] ptlrpcd_check+0x4db/0x5d0 [ptlrpc]
Jul 11 01:13:20 soak-17 kernel: [<ffffffffa0ba066b>] ptlrpcd+0x2bb/0x560 [ptlrpc]
Jul 11 01:13:20 soak-17 kernel: [<ffffffff810c5080>] ? default_wake_function+0x0/0x20
Jul 11 01:13:20 soak-17 kernel: [<ffffffffa0ba03b0>] ? ptlrpcd+0x0/0x560 [ptlrpc]
Jul 11 01:13:20 soak-17 kernel: [<ffffffff810b06ff>] kthread+0xcf/0xe0
Jul 11 01:13:20 soak-17 kernel: [<ffffffff810b0630>] ? kthread+0x0/0xe0
Jul 11 01:13:20 soak-17 kernel: [<ffffffff81696a58>] ret_from_fork+0x58/0x90
Jul 11 01:13:20 soak-17 kernel: [<ffffffff810b0630>] ? kthread+0x0/0xe0
Jul 11 01:13:20 soak-17 kernel: Kernel panic - not syncing: LBUG

Crash dump is available on soak-17.

[ 3947.210017] perf: interrupt took too long (3225 > 3192), lowering kernel.perf_event_max_sample_rate to 62000
[ 4161.246530] Lustre: 2953:0:(client.c:2114:ptlrpc_expire_one_request()) @@@ Request sent has timed out for sent delay: [sent 1499735243/real 0] req@ffff880397b67200 x1572582391402544/t0(0) o101->soaked-OST0008-osc-ffff8808276d0000@192.168.1.104@o2ib:28/4 lens 328/400 e 0 to 1 dl 1499735256 ref 2 fl Rpc:X/0/ffffffff rc 0/-1
[ 4161.282464] Lustre: soaked-OST0008-osc-ffff8808276d0000: Connection to soaked-OST0008 (at 192.168.1.104@o2ib) was lost; in progress operations using this service will wait for recovery to complete
[ 4162.030527] Lustre: soaked-OST000e-osc-ffff8808276d0000: Connection to soaked-OST000e (at 192.168.1.104@o2ib) was lost; in progress operations using this service will wait for recovery to complete
[ 4169.926508] Lustre: soaked-OST0014-osc-ffff8808276d0000: Connection to soaked-OST0014 (at 192.168.1.104@o2ib) was lost; in progress operations using this service will wait for recovery to complete
[ 4194.948696] Lustre: 2944:0:(client.c:2114:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1499735290/real 1499735290] req@ffff880334b43c00 x1572582391418656/t0(0) o400->soaked-OST0002-osc-ffff8808276d0000@192.168.1.104@o2ib:28/4 lens 224/224 e 0 to 1 dl 1499735303 ref 1 fl Rpc:eXN/0/ffffffff rc 0/-1
[ 4194.986416] Lustre: 2944:0:(client.c:2114:ptlrpc_expire_one_request()) Skipped 7 previous similar messages
[ 4194.998649] Lustre: soaked-OST0002-osc-ffff8808276d0000: Connection to soaked-OST0002 (at 192.168.1.104@o2ib) was lost; in progress operations using this service will wait for recovery to complete
[ 4245.020397] Lustre: 2937:0:(client.c:2114:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1499735340/real 1499735340] req@ffff880334b42100 x1572582391449120/t0(0) o8->soaked-OST0002-osc-ffff8808276d0000@192.168.1.104@o2ib:28/4 lens 520/544 e 0 to 1 dl 1499735351 ref 1 fl Rpc:eXN/0/ffffffff rc 0/-1
[ 4245.057630] Lustre: 2937:0:(client.c:2114:ptlrpc_expire_one_request()) Skipped 4 previous similar messages
[ 4320.019906] Lustre: 2937:0:(client.c:2114:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1499735415/real 1499735415] req@ffff880335d55a00 x1572582391496080/t0(0) o8->soaked-OST0008-osc-ffff8808276d0000@192.168.1.104@o2ib:28/4 lens 520/544 e 0 to 1 dl 1499735436 ref 1 fl Rpc:eXN/0/ffffffff rc 0/-1
[ 4320.057297] Lustre: 2937:0:(client.c:2114:ptlrpc_expire_one_request()) Skipped 4 previous similar messages
[ 4451.018399] Lustre: 2937:0:(client.c:2114:ptlrpc_expire_one_request()) @@@ Request sent has timed out for sent delay: [sent 1499735515/real 0] req@ffff880335d53c00 x1572582391561888/t0(0) o8->soaked-OST0014-osc-ffff8808276d0000@192.168.1.104@o2ib:28/4 lens 520/544 e 0 to 1 dl 1499735546 ref 2 fl Rpc:XN/0/ffffffff rc 0/-1
[ 4451.054303] Lustre: 2937:0:(client.c:2114:ptlrpc_expire_one_request()) Skipped 7 previous similar messages
[ 4504.566715] LustreError: 2937:0:(recover.c:157:ptlrpc_replay_next()) ASSERTION( !list_empty(&req->rq_cli.cr_unreplied_list) ) failed:
[ 4504.587823] LustreError: 2937:0:(recover.c:157:ptlrpc_replay_next()) LBUG
[ 4504.598875] Pid: 2937, comm: ptlrpcd_rcv
[ 4504.606014] Call Trace:
[ 4504.615457] [<ffffffffa08447ee>] libcfs_call_trace+0x4e/0x60 [libcfs]
[ 4504.624991] [<ffffffffa084487c>] lbug_with_loc+0x4c/0xb0 [libcfs]
[ 4504.634021] [<ffffffffa0b76817>] ptlrpc_replay_next+0x447/0x450 [ptlrpc]
[ 4504.643628] [<ffffffffa0b99de2>] ptlrpc_import_recovery_state_machine+0x1d2/0xbc0 [ptlrpc]
[ 4504.654851] [<ffffffffa0b6ef1f>] ptlrpc_replay_interpret+0x17f/0x7d0 [ptlrpc]
[ 4504.664682] [<ffffffffa0b72c7c>] ptlrpc_check_set.part.23+0x42c/0x1da0 [ptlrpc]
[ 4504.674633] [<ffffffffa0b7464b>] ptlrpc_check_set+0x5b/0xe0 [ptlrpc]
[ 4504.683405] [<ffffffffa0ba02bb>] ptlrpcd_check+0x4db/0x5d0 [ptlrpc]
[ 4504.692018] [<ffffffffa0ba066b>] ptlrpcd+0x2bb/0x560 [ptlrpc]
[ 4504.699993] [<ffffffff810c5080>] ? default_wake_function+0x0/0x20
[ 4504.708298] [<ffffffffa0ba03b0>] ? ptlrpcd+0x0/0x560 [ptlrpc]
[ 4504.716207] [<ffffffff810b06ff>] kthread+0xcf/0xe0
[ 4504.723002] [<ffffffff810b0630>] ? kthread+0x0/0xe0
[ 4504.729868] [<ffffffff81696a58>] ret_from_fork+0x58/0x90
[ 4504.737213] [<ffffffff810b0630>] ? kthread+0x0/0xe0
[ 4504.746963] Kernel panic - not syncing: LBUG
[ 4504.752969] CPU: 14 PID: 2937 Comm: ptlrpcd_rcv Tainted: G OE ------------ 3.10.0-514.10.2.el7.x86_64 #1
[ 4504.766051] Hardware name: Intel Corporation S2600GZ ........../S2600GZ, BIOS SE5C600.86B.01.08.0003.022620131521 02/26/2013
[ 4504.779814] ffffffffa0862e8b 0000000057e50b83 ffff88082ab1fb30 ffffffff816864ef
[ 4504.789340] ffff88082ab1fbb0 ffffffff8167f8f6 ffffffff00000008 ffff88082ab1fbc0
[ 4504.798846] ffff88082ab1fb60 0000000057e50b83 0000000057e50b83 ffff88082d98f838
[ 4504.808330] Call Trace:
[ 4504.812212] [<ffffffff816864ef>] dump_stack+0x19/0x1b
[ 4504.819084] [<ffffffff8167f8f6>] panic+0xe3/0x1f2
[ 4504.825543] [<ffffffffa0844894>] lbug_with_loc+0x64/0xb0 [libcfs]
[ 4504.833554] [<ffffffffa0b76817>] ptlrpc_replay_next+0x447/0x450 [ptlrpc]
[ 4504.842225] [<ffffffffa0b99de2>] ptlrpc_import_recovery_state_machine+0x1d2/0xbc0 [ptlrpc]
[ 4504.852623] [<ffffffffa0b6ef1f>] ptlrpc_replay_interpret+0x17f/0x7d0 [ptlrpc]
[ 4504.861745] [<ffffffffa0b72c7c>] ptlrpc_check_set.part.23+0x42c/0x1da0 [ptlrpc]
[ 4504.871040] [<ffffffffa0b7464b>] ptlrpc_check_set+0x5b/0xe0 [ptlrpc]
[ 4504.879243] [<ffffffffa0ba02bb>] ptlrpcd_check+0x4db/0x5d0 [ptlrpc]
[ 4504.887319] [<ffffffffa0ba066b>] ptlrpcd+0x2bb/0x560 [ptlrpc] |
| Comment by Cliff White (Inactive) [ 11/Jul/17 ] |
|
Attached lustre dump from all MDS |
| Comment by Niu Yawei (Inactive) [ 11/Jul/17 ] |
|
Cliff, the fix hasn't been landed yet, so the crash is expected. |
| Comment by Peter Jones [ 11/Jul/17 ] |
|
Niu, the test also included a reconfiguration on soak, which it was hoped might make this race less likely, as the disk failure had coincided with this race being hit far more frequently (see DCO-7264). Given that this has not worked, would it be enough to just land this one fix, or did earlier testing suggest further fixes would be needed too? Peter |
| Comment by Niu Yawei (Inactive) [ 11/Jul/17 ] |
|
Peter, I think the fix from https://review.whamcloud.com/#/c/27920/ should be enough. |
| Comment by Gerrit Updater [ 11/Jul/17 ] |
|
Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/27920/ |
| Comment by Niu Yawei (Inactive) [ 18/Jul/17 ] |
|
Patch landed for 2.10. |