Details
- Type: Bug
- Resolution: Unresolved
- Priority: Minor
- Severity: 3
Description
This issue was created by maloo for Marc Vef <mvef@whamcloud.com>
This issue relates to the following test suite run: https://testing.whamcloud.com/test_sets/98af5177-ce97-46ab-b92f-886d62c47a80
Test session details:
clients: https://build.whamcloud.com/job/lustre-reviews/109507 - 5.15.0-94-generic
servers: https://build.whamcloud.com/job/lustre-reviews/109507 - 4.18.0-553.27.1.el8_lustre.x86_64
-----------------
test_64e did not start and the test log is empty (zero bytes).
The MDT/MGS console log (console.trevis-103vm6.log) reports that the MGS lost its connection to the client:
[21636.492168] Lustre: MGS: haven't heard from client 07a60d18-2319-4f0a-b076-212c29a4a87b (at 10.240.42.216@tcp) in 32 seconds. I think it's dead, and I am evicting it. exp ffff91de06791400, cur 1733818201 expire 1733818171 last 1733818169
[21782.426674] Autotest: Test running for 360 minutes (lustre-reviews_review-ldiskfs-ubuntu_109507.43)
[22085.669364] Autotest: Test running for 365 minutes (lustre-reviews_review-ldiskfs-ubuntu_109507.43)
[22388.696585] Autotest: Test running for 370 minutes (lustre-reviews_review-ldiskfs-ubuntu_109507.43)
[22692.112781] Autotest: Test running for 375 minutes (lustre-reviews_review-ldiskfs-ubuntu_109507.43)
[22995.106618] Autotest: Test running for 380 minutes (lustre-reviews_review-ldiskfs-ubuntu_109507.43)
[23298.205374] Autotest: Test running for 385 minutes (lustre-reviews_review-ldiskfs-ubuntu_109507.43)
[23601.358682] Autotest: Test running for 390 minutes (lustre-reviews_review-ldiskfs-ubuntu_109507.43)
[23904.609470] Autotest: Test running for 395 minutes (lustre-reviews_review-ldiskfs-ubuntu_109507.43)
[24207.734955] Autotest: Test running for 400 minutes (lustre-reviews_review-ldiskfs-ubuntu_109507.43)
[24511.273212] Autotest: Test running for 405 minutes (lustre-reviews_review-ldiskfs-ubuntu_109507.43)
[24814.289573] Autotest: Test running for 410 minutes (lustre-reviews_review-ldiskfs-ubuntu_109507.43)
[25117.463642] Autotest: Test running for 415 minutes (lustre-reviews_review-ldiskfs-ubuntu_109507.43)
[25360.521363] Autotest: Test running for 420 minutes (lustre-reviews_review-ldiskfs-ubuntu_109507.43)
[25663.594201] Autotest: Test running for 425 minutes (lustre-reviews_review-ldiskfs-ubuntu_109507.43)
[25966.699200] Autotest: Test running for 430 minutes (lustre-reviews_review-ldiskfs-ubuntu_109507.43)
[26270.241227] Autotest: Test running for 435 minutes (lustre-reviews_review-ldiskfs-ubuntu_109507.43)
[26573.298800] Autotest: Test running for 440 minutes (lustre-reviews_review-ldiskfs-ubuntu_109507.43)
[26876.586990] Autotest: Test running for 445 minutes (lustre-reviews_review-ldiskfs-ubuntu_109507.43)
[27037.375680] Autotest: Timing session out! Timeout has been exceeded but the test framework is still running (lustre-reviews_review-ldiskfs-ubuntu_109507.43)
[27040.336304] Autotest: Suite sanity-sec timed out, collecting logs (lustre-reviews_review-ldiskfs-ubuntu_109507.43)
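As a side note, the epoch timestamps carried in the eviction message can be decoded to confirm the reported 32-second gap; a minimal Python sketch, using only values copied from the log line above:

```python
from datetime import datetime, timezone

# Epoch timestamps from the MGS eviction message
stamps = {"last": 1733818169, "expire": 1733818171, "cur": 1733818201}

for name, ts in stamps.items():
    # Convert to UTC wall-clock time for readability
    print(name, datetime.fromtimestamp(ts, tz=timezone.utc).isoformat())

# Gap between the last message from the client and the eviction decision
print("gap:", stamps["cur"] - stamps["last"], "seconds")  # -> gap: 32 seconds
```

The 32-second gap matches the "in 32 seconds" wording of the eviction message, and the decoded times fall just after the test_64e start at 08:09:02 (1733818142) seen in the client log.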
Client 1 console log (console.trevis-82vm1.log) reports a failure to open the changelog catalog, followed by a lost connection to the MDT, when starting test_64e:
[21579.452458] Lustre: DEBUG MARKER: == sanity-sec test 64e: Nodemap enforces chlg_ops RBAC roles ========================================================== 08:09:02 (1733818142)
[21589.165628] Lustre: DEBUG MARKER: /usr/sbin/lctl list_nids | grep -w tcp | cut -f 1 -d @
[21601.328237] LustreError: 1072822:0:(mdc_changelog.c:300:chlg_load()) lustre-MDT0000-mdc-ffff8ed013b67000: fail to open changelog catalog: rc = -13
[21626.780424] Lustre: 1018572:0:(client.c:2358:ptlrpc_expire_one_request()) @@@ Request sent has timed out for sent delay: [sent 1733818174/real 0] req@ffff8ed099796d80 x1818039306845312/t0(0) o400->lustre-MDT0000-mdc-ffff8ed013b67000@10.240.44.81@tcp:12/10 lens 224/224 e 0 to 1 dl 1733818190 ref 2 fl Rpc:XNr/200/ffffffff rc 0/-1 job:'kworker.0' uid:0 gid:0
[21626.791359] Lustre: lustre-MDT0000-mdc-ffff8ed013b67000: Connection to lustre-MDT0000 (at 10.240.44.81@tcp) was lost; in progress operations using this service will wait for recovery to complete
[21626.800404] LustreError: MGC10.240.44.81@tcp: Connection to MGS (at 10.240.44.81@tcp) was lost; in progress operations using this service will fail
[21631.900204] Lustre: 1018571:0:(client.c:2358:ptlrpc_expire_one_request()) @@@ Request sent has timed out for sent delay: [sent 1733818179/real 0] req@ffff8ed099797a80 x1818039306846464/t0(0) o400->lustre-OST0000-osc-ffff8ed013b67000@10.240.42.218@tcp:28/4 lens 224/224 e 0 to 1 dl 1733818195 ref 2 fl Rpc:XNr/200/ffffffff rc 0/-1 job:'kworker.0' uid:0 gid:0
[21631.908279] Lustre: 1018571:0:(client.c:2358:ptlrpc_expire_one_request()) Skipped 7 previous similar messages
[21635.996012] Lustre: 1018574:0:(client.c:2358:ptlrpc_expire_one_request()) @@@ Request sent has timed out for sent delay: [sent 1733818184/real 0] req@ffff8ed050cb8000 x1818039306847360/t0(0) o400->lustre-MDT0000-mdc-ffff8ed013b67000@10.240.44.81@tcp:12/10 lens 224/224 e 0 to 1 dl 1733818200 ref 2 fl Rpc:XNr/200/ffffffff rc 0/-1 job:'kworker.0' uid:0 gid:0
[21636.002016] Lustre: 1018574:0:(client.c:2358:ptlrpc_expire_one_request()) Skipped 6 previous similar messages
[21641.115749] Lustre: 1018574:0:(client.c:2358:ptlrpc_expire_one_request()) @@@ Request sent has timed out for sent delay: [sent 1733818189/real 0] req@ffff8ed050cb9040 x1818039306848384/t0(0) o400->lustre-MDT0000-mdc-ffff8ed013b67000@10.240.44.81@tcp:12/10 lens 224/224 e 0 to 1 dl 1733818205 ref 2 fl Rpc:XNr/200/ffffffff rc 0/-1 job:'kworker.0' uid:0 gid:0
[21641.121737] Lustre: 1018574:0:(client.c:2358:ptlrpc_expire_one_request()) Skipped 6 previous similar messages
[21791.637090] nfs: server 10.240.32.204 not responding, still trying
[21796.756808] LNet: 2 peer NIs in recovery (showing 2): 10.240.42.218@tcp, 10.240.44.81@tcp
[21796.756919] Lustre: 1018571:0:(client.c:2358:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1733818179/real 1733818360] req@ffff8ed099796a40 x1818039306846208/t0(0) o400->MGC10.240.44.81@tcp@10.240.44.81@tcp:26/25 lens 224/224 e 0 to 1 dl 1733818195 ref 1 fl Rpc:EeXNQU/200/ffffffff rc -5/-1 job:'kworker.0' uid:0 gid:0
[21796.765813] Lustre: 1018571:0:(client.c:2358:ptlrpc_expire_one_request()) Skipped 15 previous similar messages
[21796.767919] LNetError: Unexpected error -2 connecting to 10.240.44.81@tcp at host 10.240.44.81:7988
[21796.769780] LNetError: Skipped 1 previous similar message
[21797.780806] LNetError: Unexpected error -2 connecting to 10.240.42.218@tcp at host 10.240.42.218:7988
[21798.804711] LNetError: Unexpected error -2 connecting to 10.240.42.218@tcp at host 10.240.42.218:7988
[21798.806612] LNetError: Skipped 2 previous similar messages
[21802.900545] LNetError: Unexpected error -2 connecting to 10.240.42.218@tcp at host 10.240.42.218:7988
[21806.996399] nfs: server 10.240.32.204 not responding, still trying
[21811.092195] LNetError: Unexpected error -2 connecting to 10.240.42.218@tcp at host 10.240.42.218:7988
[21811.094149] LNetError: Skipped 2 previous similar messages
[21819.283763] nfs: server 10.240.32.204 not responding, timed out
[21827.475654] LNetError: Unexpected error -2 connecting to 10.240.42.218@tcp at host 10.240.42.218:7988
[21827.478122] LNetError: Skipped 1 previous similar message
[21856.146125] LNet: 2 peer NIs in recovery (showing 2): 10.240.42.218@tcp, 10.240.44.81@tcp
[21859.218017] LNetError: Unexpected error -2 connecting to 10.240.42.218@tcp at host 10.240.42.218:7988
[21859.219964] LNetError: Skipped 1 previous similar message
[21870.481491] INFO: task tee:892177 blocked for more than 120 seconds.
[21870.482872] Tainted: G W OE 5.15.0-94-generic #104-Ubuntu
[21870.484273] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[21870.485883] task:tee state:D stack: 0 pid:892177 ppid: 3114 flags:0x00000002
[21870.487582] Call Trace:
[21870.488175] <TASK>
[21870.488738] __schedule+0x24e/0x590
[21870.489582] schedule+0x69/0x110
[21870.490315] io_schedule+0x46/0x80
[21870.491074] wait_on_page_bit_common+0x10c/0x3d0
[21870.492078] ? filemap_invalidate_unlock_two+0x50/0x50
[21870.493156] wait_on_page_bit+0x3f/0x50
[21870.494011] wait_on_page_writeback+0x26/0x80
[21870.494944] wait_for_stable_page+0x32/0x40
[21870.495846] grab_cache_page_write_begin+0x31/0x40
[21870.496861] nfs_write_begin+0x61/0x300 [nfs]
[21870.497947] generic_perform_write+0xc9/0x200
[21870.498878] ? pipe_write+0x474/0x6a0
[21870.499714] ? __cond_resched+0x1a/0x50
[21870.500556] nfs_file_write+0x1a7/0x2c0 [nfs]
[21870.501519] new_sync_write+0x114/0x1a0
[21870.502361] vfs_write+0x1d5/0x270
[21870.503120] ksys_write+0x67/0xf0
[21870.503863] __x64_sys_write+0x19/0x20
[21870.504683] do_syscall_64+0x5c/0xc0
[21870.505498] ? exit_to_user_mode_prepare+0x96/0xb0
[21870.506539] ? syscall_exit_to_user_mode+0x35/0x50
[21870.507551] ? __x64_sys_read+0x19/0x20
[21870.508391] ? do_syscall_64+0x69/0xc0
[21870.509211] ? syscall_exit_to_user_mode+0x35/0x50
[21870.510265] ? __x64_sys_write+0x19/0x20
[21870.511123] ? do_syscall_64+0x69/0xc0
[21870.511939] ? exit_to_user_mode_prepare+0x37/0xb0
[21870.512948] ? syscall_exit_to_user_mode+0x35/0x50
[21870.513973] ? __x64_sys_write+0x19/0x20
[21870.514825] ? do_syscall_64+0x69/0xc0
[21870.515646] ? do_syscall_64+0x69/0xc0
[21870.516465] ? do_syscall_64+0x69/0xc0
[21870.517299] entry_SYSCALL_64_after_hwframe+0x62/0xcc
[21870.518377] RIP: 0033:0x7f7529ef4887
[21870.519179] RSP: 002b:00007ffffaaf0ca8 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
[21870.520724] RAX: ffffffffffffffda RBX: 000000000000005b RCX: 00007f7529ef4887
[21870.522172] RDX: 000000000000005b RSI: 00007ffffaaf0de0 RDI: 0000000000000003
[21870.523603] RBP: 00007ffffaaf0de0 R08: 000000000000005b R09: 0000000000000001
[21870.525047] R10: 00000000000001b6 R11: 0000000000000246 R12: 000000000000005b
[21870.526494] R13: 000055cc0da642a0 R14: 00007f7529ff6a00 R15: 000000000000005b
[21870.527933] </TASK>
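For context on the changelog failure above: kernel code paths return negated errno values, so the rc = -13 reported by chlg_load() is -EACCES (permission denied). Whether that denial is the RBAC behavior test_64e intends to exercise, or part of the failure, cannot be determined from the empty test log. A quick sketch to decode the code:

```python
import errno
import os

# chlg_load() reported rc = -13; negate to look up the
# symbolic errno name and its message.
rc = -13
print(errno.errorcode[-rc], "->", os.strerror(-rc))  # -> EACCES -> Permission denied
```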
Attachments
Issue Links
- duplicates: LU-18569 "sanity-sec: timeout - clients lost connection to MGS" (Open)