[LU-14342] replay-single test 135: Input/output error Created: 20/Jan/21 Updated: 09/Jan/23 |
|
| Status: | Open |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Minor |
| Reporter: | Maloo | Assignee: | WC Triage |
| Resolution: | Unresolved | Votes: | 0 |
| Labels: | None | ||
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
This issue was created by maloo for liuying <emoly@whamcloud.com> This issue relates to the following test suite run: https://testing.whamcloud.com/test_sets/5f683fd5-6f69-46c1-b58f-61bc065df204 [10777.099391] Lustre: lustre-OST0000: Denying connection for new client a276db64-2a82-43b4-86a4-2db5e4ba7638 (at 10.9.10.41@tcp), waiting for 5 known clients (2 recovered, 0 in progress, and 0 evicted) to recover in 1:04
[10777.102386] Lustre: Skipped 1 previous similar message
[10777.171736] Lustre: DEBUG MARKER: e2label /dev/mapper/ost1_flakey 2>/dev/null | grep -E ':[a-zA-Z]{3}[0-9]{4}'
[10777.495534] Lustre: DEBUG MARKER: e2label /dev/mapper/ost1_flakey 2>/dev/null
[10812.155116] Lustre: lustre-OST0000: Denying connection for new client a276db64-2a82-43b4-86a4-2db5e4ba7638 (at 10.9.10.41@tcp), waiting for 5 known clients (4 recovered, 0 in progress, and 0 evicted) to recover in 0:29
[10812.158156] Lustre: Skipped 8 previous similar messages
[10841.164223] Lustre: lustre-OST0000: recovery is timed out, evict stale exports
[10841.165400] Lustre: lustre-OST0000: disconnecting 1 stale clients
[10841.207538] Lustre: lustre-OST0000: Recovery over after 1:06, of 5 clients 4 recovered and 1 was evicted.
[10841.209578] Lustre: lustre-OST0000: deleting orphan objects from 0x0:44159 to 0x0:44169
[10841.211380] Lustre: lustre-OST0000: deleting orphan objects from 0x300000400:3987 to 0x300000400:4353
[10841.213387] Lustre: lustre-OST0000: deleting orphan objects from 0x300000402:6723 to 0x300000402:6753
[10841.214868] Lustre: lustre-OST0000: deleting orphan objects from 0x300000401:3945 to 0x300000401:4289
[10890.624440] Lustre: lustre-OST0000: haven't heard from client a276db64-2a82-43b4-86a4-2db5e4ba7638 (at 10.9.10.41@tcp) in 49 seconds. I think it's dead, and I am evicting it. exp ffff89ffc4174400, cur 1611043480 expire 1611043450 last 1611043431
|
| Comments |
| Comment by Cliff White (Inactive) [ 02/Aug/22 ] |
|
From this ZFS run: Seems like it's close. [28135.710737] Lustre: DEBUG MARKER: local REPLAY BARRIER on lustre-OST0000 [28146.699417] Lustre: 8035:0:(client.c:2295:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1658113488/real 1658113488] req@00000000590dea5d x1738628958349440/t0(0) o400->lustre-OST0001-osc-ffff8dca529bf000@10.240.24.92@tcp:28/4 lens 224/224 e 0 to 1 dl 1658113495 ref 1 fl Rpc:XNQr/0/ffffffff rc 0/-1 job:'kworker/u4:1.0' [28146.708477] Lustre: 8035:0:(client.c:2295:ptlrpc_expire_one_request()) Skipped 21 previous similar messages [28268.090225] Lustre: lustre-OST0000-osc-ffff8dca529bf000: Connection restored to 10.240.24.92@tcp (at 10.240.24.92@tcp) [28268.092333] Lustre: Skipped 13 previous similar messages [28507.126903] INFO: task sync:623318 blocked for more than 120 seconds. [28507.128311] Tainted: G OE --------- - - 4.18.0-348.2.1.el8_5.x86_64 #1 [28507.129877] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [28507.131368] task:sync state:D stack: 0 pid:623318 ppid:623316 flags:0x00004080 [28507.132964] Call Trace: [28507.133556] __schedule+0x2c4/0x700 [28507.134287] ? __switch_to_asm+0x41/0x70 [28507.135084] ? __switch_to_asm+0x41/0x70 [28507.135896] schedule+0x37/0xa0 [28507.136567] io_schedule+0x12/0x40 [28507.137291] wait_on_page_bit+0x137/0x230 [28507.138107] ? file_fdatawait_range+0x20/0x20 [28507.138982] __filemap_fdatawait_range+0x88/0xe0 [28507.139912] ? __schedule+0x2cc/0x700 [28507.140652] ? _cond_resched+0x15/0x30 [28507.141418] ? bdi_split_work_to_wbs+0x46/0x2e0 [28507.142339] ? finish_wait+0x3c/0x80 [28507.143068] filemap_fdatawait_keep_errors+0x1a/0x40 [28507.144061] sync_inodes_sb+0x227/0x2c0 [28507.144867] ? __x64_sys_splice+0x100/0x100 [28507.145705] iterate_supers+0x98/0x100 [28507.146477] ksys_sync+0x40/0xb0 [28507.147153] __ia32_sys_sync+0xa/0x10 [28507.147923] do_syscall_64+0x5b/0x1a0 [28507.148680] entry_SYSCALL_64_after_hwframe+0x65/0xca [28507.149686] RIP: 0033:0x7fdf2ce594fb [28507.150428] Code: Unable to access opcode bytes at RIP 0x7fdf2ce594d1. [28507.151677] RSP: 002b:00007ffeab7bfa98 EFLAGS: 00000246 ORIG_RAX: 00000000000000a2 [28507.153304] RAX: ffffffffffffffda RBX: 00007ffeab7bfbd8 RCX: 00007fdf2ce594fb [28507.154698] RDX: 00007fdf2d127501 RSI: 00007fdf2ceed300 RDI: 00007ffeab7c4088 [28507.156056] RBP: 0000000000000001 R08: 0000000000000000 R09: 0000000000000000 [28507.157419] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000 [28507.158778] R13: 0000000000000000 R14: 0000000000000000 R15: 00007fdf2d124374 [28629.999879] INFO: task sync:623318 blocked for more than 120 seconds. [28630.001324] Tainted: G OE --------- - - 4.18.0-348.2.1.el8_5.x86_64 #1 [28630.002899] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [28630.004386] task:sync state:D stack: 0 pid:623318 ppid:623316 flags:0x00004080 [28630.006028] Call Trace: [28630.006580] __schedule+0x2c4/0x700 [28630.007308] ? __switch_to_asm+0x41/0x70 [28630.008106] ? __switch_to_asm+0x41/0x70 [28630.008913] schedule+0x37/0xa0 [28630.009576] io_schedule+0x12/0x40 [28630.010278] wait_on_page_bit+0x137/0x230 [28630.011100] ? file_fdatawait_range+0x20/0x20 [28630.012009] __filemap_fdatawait_range+0x88/0xe0 [28630.012939] ? __schedule+0x2cc/0x700 [28630.013683] ? _cond_resched+0x15/0x30 [28630.014449] ? bdi_split_work_to_wbs+0x46/0x2e0 [28630.015369] ? finish_wait+0x3c/0x80 [28630.016108] filemap_fdatawait_keep_errors+0x1a/0x40 [28630.017092] sync_inodes_sb+0x227/0x2c0 [28630.017888] ? __x64_sys_splice+0x100/0x100 [28630.018737] iterate_supers+0x98/0x100 [28630.019503] ksys_sync+0x40/0xb0 [28630.020179] __ia32_sys_sync+0xa/0x10 [28630.020940] do_syscall_64+0x5b/0x1a0 [28630.021701] entry_SYSCALL_64_after_hwframe+0x65/0xca [28630.022698] RIP: 0033:0x7fdf2ce594fb [28630.023444] Code: Unable to access opcode bytes at RIP 0x7fdf2ce594d1. [28630.024705] RSP: 002b:00007ffeab7bfa98 EFLAGS: 00000246 ORIG_RAX: 00000000000000a2 [28630.026153] RAX: ffffffffffffffda RBX: 00007ffeab7bfbd8 RCX: 00007fdf2ce594fb [28630.027550] RDX: 00007fdf2d127501 RSI: 00007fdf2ceed300 RDI: 00007ffeab7c4088 [28630.028917] RBP: 0000000000000001 R08: 0000000000000000 R09: 0000000000000000 [28630.030283] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000 [28630.031650] R13: 0000000000000000 R14: 0000000000000000 R15: 00007fdf2d124374 [28752.872906] INFO: task sync:623318 blocked for more than 120 seconds |