  Lustre / LU-18527

sanity-sec: test_64e timed out (client lost connection)

Details

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Minor
    • Severity: 3

    Description

      This issue was created by maloo for Marc Vef <mvef@whamcloud.com>

      This issue relates to the following test suite run: https://testing.whamcloud.com/test_sets/98af5177-ce97-46ab-b92f-886d62c47a80

      Test session details:
      clients: https://build.whamcloud.com/job/lustre-reviews/109507 - 5.15.0-94-generic
      servers: https://build.whamcloud.com/job/lustre-reviews/109507 - 4.18.0-553.27.1.el8_lustre.x86_64

      -----------------

      test_64e did not start and its test log is empty (zero bytes); the 'tee' task shown blocked writing to NFS server 10.240.32.204 in the client console trace below would explain why nothing was captured.
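
      A minimal sketch for confirming the empty log in a collected results directory (the path below is a placeholder, not from this report):

      # List any zero-byte test_64e logs under the results tree.
      find /path/to/results -name '*test_64e*' -size 0 -ls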

      MDT/MGS console log (console.trevis-103vm6.log) reports lost connection:

      [21636.492168] Lustre: MGS: haven't heard from client 07a60d18-2319-4f0a-b076-212c29a4a87b (at 10.240.42.216@tcp) in 32 seconds. I think it's dead, and I am evicting it. exp ffff91de06791400, cur 1733818201 expire 1733818171 last 1733818169
      [21782.426674] Autotest: Test running for 360 minutes (lustre-reviews_review-ldiskfs-ubuntu_109507.43)
      [22085.669364] Autotest: Test running for 365 minutes (lustre-reviews_review-ldiskfs-ubuntu_109507.43)
      [22388.696585] Autotest: Test running for 370 minutes (lustre-reviews_review-ldiskfs-ubuntu_109507.43)
      [22692.112781] Autotest: Test running for 375 minutes (lustre-reviews_review-ldiskfs-ubuntu_109507.43)
      [22995.106618] Autotest: Test running for 380 minutes (lustre-reviews_review-ldiskfs-ubuntu_109507.43)
      [23298.205374] Autotest: Test running for 385 minutes (lustre-reviews_review-ldiskfs-ubuntu_109507.43)
      [23601.358682] Autotest: Test running for 390 minutes (lustre-reviews_review-ldiskfs-ubuntu_109507.43)
      [23904.609470] Autotest: Test running for 395 minutes (lustre-reviews_review-ldiskfs-ubuntu_109507.43)
      [24207.734955] Autotest: Test running for 400 minutes (lustre-reviews_review-ldiskfs-ubuntu_109507.43)
      [24511.273212] Autotest: Test running for 405 minutes (lustre-reviews_review-ldiskfs-ubuntu_109507.43)
      [24814.289573] Autotest: Test running for 410 minutes (lustre-reviews_review-ldiskfs-ubuntu_109507.43)
      [25117.463642] Autotest: Test running for 415 minutes (lustre-reviews_review-ldiskfs-ubuntu_109507.43)
      [25360.521363] Autotest: Test running for 420 minutes (lustre-reviews_review-ldiskfs-ubuntu_109507.43)
      [25663.594201] Autotest: Test running for 425 minutes (lustre-reviews_review-ldiskfs-ubuntu_109507.43)
      [25966.699200] Autotest: Test running for 430 minutes (lustre-reviews_review-ldiskfs-ubuntu_109507.43)
      [26270.241227] Autotest: Test running for 435 minutes (lustre-reviews_review-ldiskfs-ubuntu_109507.43)
      [26573.298800] Autotest: Test running for 440 minutes (lustre-reviews_review-ldiskfs-ubuntu_109507.43)
      [26876.586990] Autotest: Test running for 445 minutes (lustre-reviews_review-ldiskfs-ubuntu_109507.43)
      [27037.375680] Autotest: Timing session out! Timeout has been exceeded but the test framework is still running (lustre-reviews_review-ldiskfs-ubuntu_109507.43)
      [27040.336304] Autotest: Suite sanity-sec timed out, collecting logs (lustre-reviews_review-ldiskfs-ubuntu_109507.43)
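
      The eviction message above fits the normal export-eviction path: last heard 1733818169, expire 1733818171, current 1733818201 (a 32-second gap). As a hedged sketch, the knobs that drive this deadline and the MDT's view of its client exports can be inspected with stock lctl parameters (the device name is taken from this report; the exports tree layout depends on the Lustre version):

      # RPC timeout and adaptive-timeout bounds that feed the eviction deadline.
      lctl get_param timeout at_min at_max
      # On the MDS: list the UUIDs of clients the MDT still considers connected.
      lctl get_param mdt.lustre-MDT0000.exports.*.uuid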


      Client 1 console log (console.trevis-82vm1.log) reports a failure to open the changelog catalog, followed by a lost connection to the MDT, when test_64e starts:

      [21579.452458] Lustre: DEBUG MARKER: == sanity-sec test 64e: Nodemap enforces chlg_ops RBAC roles ========================================================== 08:09:02 (1733818142)
      [21589.165628] Lustre: DEBUG MARKER: /usr/sbin/lctl list_nids | grep -w tcp | cut -f 1 -d @
      [21601.328237] LustreError: 1072822:0:(mdc_changelog.c:300:chlg_load()) lustre-MDT0000-mdc-ffff8ed013b67000: fail to open changelog catalog: rc = -13
      [21626.780424] Lustre: 1018572:0:(client.c:2358:ptlrpc_expire_one_request()) @@@ Request sent has timed out for sent delay: [sent 1733818174/real 0]  req@ffff8ed099796d80 x1818039306845312/t0(0) o400->lustre-MDT0000-mdc-ffff8ed013b67000@10.240.44.81@tcp:12/10 lens 224/224 e 0 to 1 dl 1733818190 ref 2 fl Rpc:XNr/200/ffffffff rc 0/-1 job:'kworker.0' uid:0 gid:0
      [21626.791359] Lustre: lustre-MDT0000-mdc-ffff8ed013b67000: Connection to lustre-MDT0000 (at 10.240.44.81@tcp) was lost; in progress operations using this service will wait for recovery to complete
      [21626.800404] LustreError: MGC10.240.44.81@tcp: Connection to MGS (at 10.240.44.81@tcp) was lost; in progress operations using this service will fail
      [21631.900204] Lustre: 1018571:0:(client.c:2358:ptlrpc_expire_one_request()) @@@ Request sent has timed out for sent delay: [sent 1733818179/real 0]  req@ffff8ed099797a80 x1818039306846464/t0(0) o400->lustre-OST0000-osc-ffff8ed013b67000@10.240.42.218@tcp:28/4 lens 224/224 e 0 to 1 dl 1733818195 ref 2 fl Rpc:XNr/200/ffffffff rc 0/-1 job:'kworker.0' uid:0 gid:0
      [21631.908279] Lustre: 1018571:0:(client.c:2358:ptlrpc_expire_one_request()) Skipped 7 previous similar messages
      [21635.996012] Lustre: 1018574:0:(client.c:2358:ptlrpc_expire_one_request()) @@@ Request sent has timed out for sent delay: [sent 1733818184/real 0]  req@ffff8ed050cb8000 x1818039306847360/t0(0) o400->lustre-MDT0000-mdc-ffff8ed013b67000@10.240.44.81@tcp:12/10 lens 224/224 e 0 to 1 dl 1733818200 ref 2 fl Rpc:XNr/200/ffffffff rc 0/-1 job:'kworker.0' uid:0 gid:0
      [21636.002016] Lustre: 1018574:0:(client.c:2358:ptlrpc_expire_one_request()) Skipped 6 previous similar messages
      [21641.115749] Lustre: 1018574:0:(client.c:2358:ptlrpc_expire_one_request()) @@@ Request sent has timed out for sent delay: [sent 1733818189/real 0]  req@ffff8ed050cb9040 x1818039306848384/t0(0) o400->lustre-MDT0000-mdc-ffff8ed013b67000@10.240.44.81@tcp:12/10 lens 224/224 e 0 to 1 dl 1733818205 ref 2 fl Rpc:XNr/200/ffffffff rc 0/-1 job:'kworker.0' uid:0 gid:0
      [21641.121737] Lustre: 1018574:0:(client.c:2358:ptlrpc_expire_one_request()) Skipped 6 previous similar messages
      [21791.637090] nfs: server 10.240.32.204 not responding, still trying
      [21796.756808] LNet: 2 peer NIs in recovery (showing 2): 10.240.42.218@tcp, 10.240.44.81@tcp
      [21796.756919] Lustre: 1018571:0:(client.c:2358:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1733818179/real 1733818360]  req@ffff8ed099796a40 x1818039306846208/t0(0) o400->MGC10.240.44.81@tcp@10.240.44.81@tcp:26/25 lens 224/224 e 0 to 1 dl 1733818195 ref 1 fl Rpc:EeXNQU/200/ffffffff rc -5/-1 job:'kworker.0' uid:0 gid:0
      [21796.765813] Lustre: 1018571:0:(client.c:2358:ptlrpc_expire_one_request()) Skipped 15 previous similar messages
      [21796.767919] LNetError: Unexpected error -2 connecting to 10.240.44.81@tcp at host 10.240.44.81:7988
      [21796.769780] LNetError: Skipped 1 previous similar message
      [21797.780806] LNetError: Unexpected error -2 connecting to 10.240.42.218@tcp at host 10.240.42.218:7988
      [21798.804711] LNetError: Unexpected error -2 connecting to 10.240.42.218@tcp at host 10.240.42.218:7988
      [21798.806612] LNetError: Skipped 2 previous similar messages
      [21802.900545] LNetError: Unexpected error -2 connecting to 10.240.42.218@tcp at host 10.240.42.218:7988
      [21806.996399] nfs: server 10.240.32.204 not responding, still trying
      [21811.092195] LNetError: Unexpected error -2 connecting to 10.240.42.218@tcp at host 10.240.42.218:7988
      [21811.094149] LNetError: Skipped 2 previous similar messages
      [21819.283763] nfs: server 10.240.32.204 not responding, timed out
      [21827.475654] LNetError: Unexpected error -2 connecting to 10.240.42.218@tcp at host 10.240.42.218:7988
      [21827.478122] LNetError: Skipped 1 previous similar message
      [21856.146125] LNet: 2 peer NIs in recovery (showing 2): 10.240.42.218@tcp, 10.240.44.81@tcp
      [21859.218017] LNetError: Unexpected error -2 connecting to 10.240.42.218@tcp at host 10.240.42.218:7988
      [21859.219964] LNetError: Skipped 1 previous similar message
      [21870.481491] INFO: task tee:892177 blocked for more than 120 seconds.
      [21870.482872]       Tainted: G        W  OE     5.15.0-94-generic #104-Ubuntu
      [21870.484273] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      [21870.485883] task:tee             state:D stack:    0 pid:892177 ppid:  3114 flags:0x00000002
      [21870.487582] Call Trace:
      [21870.488175]  <TASK>
      [21870.488738]  __schedule+0x24e/0x590
      [21870.489582]  schedule+0x69/0x110
      [21870.490315]  io_schedule+0x46/0x80
      [21870.491074]  wait_on_page_bit_common+0x10c/0x3d0
      [21870.492078]  ? filemap_invalidate_unlock_two+0x50/0x50
      [21870.493156]  wait_on_page_bit+0x3f/0x50
      [21870.494011]  wait_on_page_writeback+0x26/0x80
      [21870.494944]  wait_for_stable_page+0x32/0x40
      [21870.495846]  grab_cache_page_write_begin+0x31/0x40
      [21870.496861]  nfs_write_begin+0x61/0x300 [nfs]
      [21870.497947]  generic_perform_write+0xc9/0x200
      [21870.498878]  ? pipe_write+0x474/0x6a0
      [21870.499714]  ? __cond_resched+0x1a/0x50
      [21870.500556]  nfs_file_write+0x1a7/0x2c0 [nfs]
      [21870.501519]  new_sync_write+0x114/0x1a0
      [21870.502361]  vfs_write+0x1d5/0x270
      [21870.503120]  ksys_write+0x67/0xf0
      [21870.503863]  __x64_sys_write+0x19/0x20
      [21870.504683]  do_syscall_64+0x5c/0xc0
      [21870.505498]  ? exit_to_user_mode_prepare+0x96/0xb0
      [21870.506539]  ? syscall_exit_to_user_mode+0x35/0x50
      [21870.507551]  ? __x64_sys_read+0x19/0x20
      [21870.508391]  ? do_syscall_64+0x69/0xc0
      [21870.509211]  ? syscall_exit_to_user_mode+0x35/0x50
      [21870.510265]  ? __x64_sys_write+0x19/0x20
      [21870.511123]  ? do_syscall_64+0x69/0xc0
      [21870.511939]  ? exit_to_user_mode_prepare+0x37/0xb0
      [21870.512948]  ? syscall_exit_to_user_mode+0x35/0x50
      [21870.513973]  ? __x64_sys_write+0x19/0x20
      [21870.514825]  ? do_syscall_64+0x69/0xc0
      [21870.515646]  ? do_syscall_64+0x69/0xc0
      [21870.516465]  ? do_syscall_64+0x69/0xc0
      [21870.517299]  entry_SYSCALL_64_after_hwframe+0x62/0xcc
      [21870.518377] RIP: 0033:0x7f7529ef4887
      [21870.519179] RSP: 002b:00007ffffaaf0ca8 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
      [21870.520724] RAX: ffffffffffffffda RBX: 000000000000005b RCX: 00007f7529ef4887
      [21870.522172] RDX: 000000000000005b RSI: 00007ffffaaf0de0 RDI: 0000000000000003
      [21870.523603] RBP: 00007ffffaaf0de0 R08: 000000000000005b R09: 0000000000000001
      [21870.525047] R10: 00000000000001b6 R11: 0000000000000246 R12: 000000000000005b
      [21870.526494] R13: 000055cc0da642a0 R14: 00007f7529ff6a00 R15: 000000000000005b
      [21870.527933]  </TASK>
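
      rc = -13 in chlg_load() is -EACCES, which is plausibly the RBAC denial that test_64e ("Nodemap enforces chlg_ops RBAC roles") deliberately provokes; the actual failure is the connection loss that follows. For context, a hedged sketch of the changelog path the test exercises (device and nodemap names are illustrative, not confirmed from this run):

      # On the MDS: register a changelog consumer (prints an id such as cl1).
      lctl --device lustre-MDT0000 changelog_register
      # On the client: read changelog records; this goes through chlg_load()
      # in mdc_changelog.c and returns -EACCES (-13) when the client's
      # nodemap lacks the chlg_ops RBAC role.
      lfs changelog lustre-MDT0000 0
      # Grant the role on the nodemap ('c0' is a placeholder name).
      lctl nodemap_modify --name c0 --property rbac --value chlg_ops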

            People

              Assignee: wc-triage (WC Triage)
              Reporter: maloo (Maloo)