[LU-14342] replay-single test 135: Input/output error Created: 20/Jan/21  Updated: 09/Jan/23

Status: Open
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: Maloo Assignee: WC Triage
Resolution: Unresolved Votes: 0
Labels: None

Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

This issue was created by maloo for liuying <emoly@whamcloud.com>

This issue relates to the following test suite run: https://testing.whamcloud.com/test_sets/5f683fd5-6f69-46c1-b58f-61bc065df204

[10777.099391] Lustre: lustre-OST0000: Denying connection for new client a276db64-2a82-43b4-86a4-2db5e4ba7638 (at 10.9.10.41@tcp), waiting for 5 known clients (2 recovered, 0 in progress, and 0 evicted) to recover in 1:04
[10777.102386] Lustre: Skipped 1 previous similar message
[10777.171736] Lustre: DEBUG MARKER: e2label /dev/mapper/ost1_flakey 				2>/dev/null | grep -E ':[a-zA-Z]{3}[0-9]{4}'
[10777.495534] Lustre: DEBUG MARKER: e2label /dev/mapper/ost1_flakey 2>/dev/null
[10812.155116] Lustre: lustre-OST0000: Denying connection for new client a276db64-2a82-43b4-86a4-2db5e4ba7638 (at 10.9.10.41@tcp), waiting for 5 known clients (4 recovered, 0 in progress, and 0 evicted) to recover in 0:29
[10812.158156] Lustre: Skipped 8 previous similar messages
[10841.164223] Lustre: lustre-OST0000: recovery is timed out, evict stale exports
[10841.165400] Lustre: lustre-OST0000: disconnecting 1 stale clients
[10841.207538] Lustre: lustre-OST0000: Recovery over after 1:06, of 5 clients 4 recovered and 1 was evicted.
[10841.209578] Lustre: lustre-OST0000: deleting orphan objects from 0x0:44159 to 0x0:44169
[10841.211380] Lustre: lustre-OST0000: deleting orphan objects from 0x300000400:3987 to 0x300000400:4353
[10841.213387] Lustre: lustre-OST0000: deleting orphan objects from 0x300000402:6723 to 0x300000402:6753
[10841.214868] Lustre: lustre-OST0000: deleting orphan objects from 0x300000401:3945 to 0x300000401:4289
[10890.624440] Lustre: lustre-OST0000: haven't heard from client a276db64-2a82-43b4-86a4-2db5e4ba7638 (at 10.9.10.41@tcp) in 49 seconds. I think it's dead, and I am evicting it. exp ffff89ffc4174400, cur 1611043480 expire 1611043450 last 1611043431


 Comments   
Comment by Cliff White (Inactive) [ 02/Aug/22 ]

From this ZFS run:
https://testing.whamcloud.com/test_sets/9d0bc162-93a4-49c8-87dc-4f5cf15bb573

Seems like it's close.

[28135.710737] Lustre: DEBUG MARKER: local REPLAY BARRIER on lustre-OST0000
[28146.699417] Lustre: 8035:0:(client.c:2295:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1658113488/real 1658113488]  req@00000000590dea5d x1738628958349440/t0(0) o400->lustre-OST0001-osc-ffff8dca529bf000@10.240.24.92@tcp:28/4 lens 224/224 e 0 to 1 dl 1658113495 ref 1 fl Rpc:XNQr/0/ffffffff rc 0/-1 job:'kworker/u4:1.0'
[28146.708477] Lustre: 8035:0:(client.c:2295:ptlrpc_expire_one_request()) Skipped 21 previous similar messages
[28268.090225] Lustre: lustre-OST0000-osc-ffff8dca529bf000: Connection restored to 10.240.24.92@tcp (at 10.240.24.92@tcp)
[28268.092333] Lustre: Skipped 13 previous similar messages
[28507.126903] INFO: task sync:623318 blocked for more than 120 seconds.
[28507.128311]       Tainted: G           OE    --------- -  - 4.18.0-348.2.1.el8_5.x86_64 #1
[28507.129877] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[28507.131368] task:sync            state:D stack:    0 pid:623318 ppid:623316 flags:0x00004080
[28507.132964] Call Trace:
[28507.133556]  __schedule+0x2c4/0x700
[28507.134287]  ? __switch_to_asm+0x41/0x70
[28507.135084]  ? __switch_to_asm+0x41/0x70
[28507.135896]  schedule+0x37/0xa0
[28507.136567]  io_schedule+0x12/0x40
[28507.137291]  wait_on_page_bit+0x137/0x230
[28507.138107]  ? file_fdatawait_range+0x20/0x20
[28507.138982]  __filemap_fdatawait_range+0x88/0xe0
[28507.139912]  ? __schedule+0x2cc/0x700
[28507.140652]  ? _cond_resched+0x15/0x30
[28507.141418]  ? bdi_split_work_to_wbs+0x46/0x2e0
[28507.142339]  ? finish_wait+0x3c/0x80
[28507.143068]  filemap_fdatawait_keep_errors+0x1a/0x40
[28507.144061]  sync_inodes_sb+0x227/0x2c0
[28507.144867]  ? __x64_sys_splice+0x100/0x100
[28507.145705]  iterate_supers+0x98/0x100
[28507.146477]  ksys_sync+0x40/0xb0
[28507.147153]  __ia32_sys_sync+0xa/0x10
[28507.147923]  do_syscall_64+0x5b/0x1a0
[28507.148680]  entry_SYSCALL_64_after_hwframe+0x65/0xca
[28507.149686] RIP: 0033:0x7fdf2ce594fb
[28507.150428] Code: Unable to access opcode bytes at RIP 0x7fdf2ce594d1.
[28507.151677] RSP: 002b:00007ffeab7bfa98 EFLAGS: 00000246 ORIG_RAX: 00000000000000a2
[28507.153304] RAX: ffffffffffffffda RBX: 00007ffeab7bfbd8 RCX: 00007fdf2ce594fb
[28507.154698] RDX: 00007fdf2d127501 RSI: 00007fdf2ceed300 RDI: 00007ffeab7c4088
[28507.156056] RBP: 0000000000000001 R08: 0000000000000000 R09: 0000000000000000
[28507.157419] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
[28507.158778] R13: 0000000000000000 R14: 0000000000000000 R15: 00007fdf2d124374
[28629.999879] INFO: task sync:623318 blocked for more than 120 seconds.
[28630.001324]       Tainted: G           OE    --------- -  - 4.18.0-348.2.1.el8_5.x86_64 #1
[28630.002899] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[28630.004386] task:sync            state:D stack:    0 pid:623318 ppid:623316 flags:0x00004080
[28630.006028] Call Trace:
[28630.006580]  __schedule+0x2c4/0x700
[28630.007308]  ? __switch_to_asm+0x41/0x70
[28630.008106]  ? __switch_to_asm+0x41/0x70
[28630.008913]  schedule+0x37/0xa0
[28630.009576]  io_schedule+0x12/0x40
[28630.010278]  wait_on_page_bit+0x137/0x230
[28630.011100]  ? file_fdatawait_range+0x20/0x20
[28630.012009]  __filemap_fdatawait_range+0x88/0xe0
[28630.012939]  ? __schedule+0x2cc/0x700
[28630.013683]  ? _cond_resched+0x15/0x30
[28630.014449]  ? bdi_split_work_to_wbs+0x46/0x2e0
[28630.015369]  ? finish_wait+0x3c/0x80
[28630.016108]  filemap_fdatawait_keep_errors+0x1a/0x40
[28630.017092]  sync_inodes_sb+0x227/0x2c0
[28630.017888]  ? __x64_sys_splice+0x100/0x100
[28630.018737]  iterate_supers+0x98/0x100
[28630.019503]  ksys_sync+0x40/0xb0
[28630.020179]  __ia32_sys_sync+0xa/0x10
[28630.020940]  do_syscall_64+0x5b/0x1a0
[28630.021701]  entry_SYSCALL_64_after_hwframe+0x65/0xca
[28630.022698] RIP: 0033:0x7fdf2ce594fb
[28630.023444] Code: Unable to access opcode bytes at RIP 0x7fdf2ce594d1.
[28630.024705] RSP: 002b:00007ffeab7bfa98 EFLAGS: 00000246 ORIG_RAX: 00000000000000a2
[28630.026153] RAX: ffffffffffffffda RBX: 00007ffeab7bfbd8 RCX: 00007fdf2ce594fb
[28630.027550] RDX: 00007fdf2d127501 RSI: 00007fdf2ceed300 RDI: 00007ffeab7c4088
[28630.028917] RBP: 0000000000000001 R08: 0000000000000000 R09: 0000000000000000
[28630.030283] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
[28630.031650] R13: 0000000000000000 R14: 0000000000000000 R15: 00007fdf2d124374
[28752.872906] INFO: task sync:623318 blocked for more than 120 seconds
Generated at Sat Feb 10 03:08:54 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.