Lustre / LU-18569

sanity-sec: timeout - clients lost connection to MGS


Details

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Minor
    • Fix Version/s: None
    • Affects Version/s: Lustre 2.17.0, Lustre 2.15.7
    • Labels: None
    • Severity: 3
    • Rank (Obsolete): 9223372036854775807

    Description

      This issue was created by maloo for Marc Vef <mvef@whamcloud.com>

      This issue relates to the following test suite run: https://testing.whamcloud.com/test_sets/a1bdbb1f-684a-44dd-9683-4a52e4eaaf4c

      test_31 failed with the following error:

      Timeout occurred after 445 minutes, last suite running was sanity-sec
      

      Test session details:
      clients: https://build.whamcloud.com/job/lustre-reviews/109661 - 5.15.0-94-generic
      servers: https://build.whamcloud.com/job/lustre-reviews/109661 - 4.18.0-553.27.1.el8_lustre.x86_64

      Both clients (vm1 and vm2) lost their connections to the MGS, MDT0000, and the OSTs about 20 seconds after mounting:

      [21607.508267] Lustre: Mounted lustre-client
      [21607.835897] Lustre: DEBUG MARKER: mount | grep /mnt/lustre' '
      [21629.086605] Lustre: 79408:0:(client.c:2358:ptlrpc_expire_one_request()) @@@ Request sent has timed out for sent delay: [sent 1734413476/real 0]  req@ffff9c605474f740 x1818664259031424/t0(0) o400->lustre-OST0002-osc-ffff9c6049741800@10.240.22.182@tcp:28/4 lens 224/224 e 0 to 1 dl 1734413492 ref 2 fl Rpc:XNr/200/ffffffff rc 0/-1 job:'' uid:0 gid:0
      [21629.086637] Lustre: lustre-MDT0000-mdc-ffff9c6049741800: Connection to lustre-MDT0000 (at 10.240.22.189@tcp) was lost; in progress operations using this service will wait for recovery to complete
      [21629.091974] Lustre: 79408:0:(client.c:2358:ptlrpc_expire_one_request()) Skipped 1 previous similar message
      [21629.097039] LustreError: MGC10.240.22.189@tcp: Connection to MGS (at 10.240.22.189@tcp) was lost; in progress operations using this service will fail
      [21634.206602] Lustre: 79408:0:(client.c:2358:ptlrpc_expire_one_request()) @@@ Request sent has timed out for sent delay: [sent 1734413481/real 0]  req@ffff9c605474e700 x1818664259032448/t0(0) o400->lustre-OST0001-osc-ffff9c6049741800@10.240.22.182@tcp:28/4 lens 224/224 e 0 to 1 dl 1734413497 ref 2 fl Rpc:XNr/200/ffffffff rc 0/-1 job:'' uid:0 gid:0
      [21634.212144] Lustre: 79408:0:(client.c:2358:ptlrpc_expire_one_request()) Skipped 8 previous similar messages
      [21639.326639] Lustre: 79408:0:(client.c:2358:ptlrpc_expire_one_request()) @@@ Request sent has timed out for sent delay: [sent 1734413486/real 0]  req@ffff9c6054447a80 x1818664259034240/t0(0) o400->lustre-OST0006-osc-ffff9c6049741800@10.240.22.182@tcp:28/4 lens 224/224 e 0 to 1 dl 1734413502 ref 2 fl Rpc:XNr/200/ffffffff rc 0/-1 job:'' uid:0 gid:0
      [21639.332231] Lustre: 79408:0:(client.c:2358:ptlrpc_expire_one_request()) Skipped 4 previous similar messages
      [21644.446523] Lustre: 79407:0:(client.c:2358:ptlrpc_expire_one_request()) @@@ Request sent has timed out for sent delay: [sent 1734413492/real 0]  req@ffff9c6053e87a80 x1818664259035776/t0(0) o400->lustre-OST0000-osc-ffff9c6049741800@10.240.22.182@tcp:28/4 lens 224/224 e 0 to 1 dl 1734413508 ref 2 fl Rpc:XNr/200/ffffffff rc 0/-1 job:'' uid:0 gid:0
      [21644.452090] Lustre: 79407:0:(client.c:2358:ptlrpc_expire_one_request()) Skipped 10 previous similar messages
      [21797.277618] nfs: server 10.240.16.204 not responding, timed out
      [21831.069446] LNet: 2 peer NIs in recovery (showing 2): 10.240.22.189@tcp, 10.240.22.182@tcp
      [21831.071116] Lustre: 79408:0:(client.c:2358:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1734413486/real 1734413694]  req@ffff9c6054445380 x1818664259033216/t0(0) o400->MGC10.240.22.189@tcp@10.240.22.189@tcp:26/25 lens 224/224 e 0 to 1 dl 1734413502 ref 1 fl Rpc:EeXNQU/200/ffffffff rc -5/-1 job:'' uid:0 gid:0
      [21831.076441] Lustre: 79408:0:(client.c:2358:ptlrpc_expire_one_request()) Skipped 9 previous similar messages
      [21831.077739] LNetError: Unexpected error -2 connecting to 10.240.22.189@tcp at host 10.240.22.189:7988
      [21832.093479] LNetError: Unexpected error -2 connecting to 10.240.22.189@tcp at host 10.240.22.189:7988
      [21832.095303] LNetError: Skipped 2 previous similar messages
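
As a triage aid, the ptlrpc_expire_one_request() lines above can be parsed to see which targets and NIDs timed out and how long each request waited. The following is a minimal Python sketch, not part of the test framework: the regular expression is inferred only from the two message shapes shown in this log, and the names EXPIRE_RE, parse_expired, OST_LINE and MGC_LINE are hypothetical. It assumes that opcode 400 is OBD_PING and that the first rc value is a negated Linux errno.

```python
import errno
import re

# Pattern inferred from the two message variants in this ticket's log only;
# other Lustre versions may format these lines differently.
EXPIRE_RE = re.compile(
    r"\[\s*(?P<uptime>\d+\.\d+)\]\s+Lustre: .*ptlrpc_expire_one_request\(\)\) @@@ "
    r"Request sent has (?P<reason>timed out for sent delay|failed due to network error): "
    r"\[sent (?P<sent>\d+)/real (?P<real>\d+)\]"
    r".*? o(?P<opc>\d+)->(?P<target>\S+)@(?P<nid>[\d.]+@\w+):"
    r".*? dl (?P<deadline>\d+).*? rc (?P<rc>-?\d+)/"
)


def parse_expired(line):
    """Return a dict of fields for an expired-request line, or None."""
    m = EXPIRE_RE.search(line)
    if m is None:
        return None
    rc = int(m["rc"])
    return {
        "uptime": float(m["uptime"]),
        "reason": m["reason"],
        "opcode": int(m["opc"]),          # assumed: 400 is OBD_PING
        "target": m["target"],
        "nid": m["nid"],
        "sent": int(m["sent"]),
        "real": int(m["real"]),           # 0 in the timed-out lines of this log
        "window": int(m["deadline"]) - int(m["sent"]),
        "rc": rc,
        # Assumed: rc is a negated errno, so -5 decodes to EIO.
        "rc_name": errno.errorcode.get(-rc) if rc < 0 else None,
    }


# Two lines copied verbatim from the console log in this ticket.
OST_LINE = (
    "[21629.086605] Lustre: 79408:0:(client.c:2358:ptlrpc_expire_one_request()) "
    "@@@ Request sent has timed out for sent delay: [sent 1734413476/real 0]  "
    "req@ffff9c605474f740 x1818664259031424/t0(0) "
    "o400->lustre-OST0002-osc-ffff9c6049741800@10.240.22.182@tcp:28/4 "
    "lens 224/224 e 0 to 1 dl 1734413492 ref 2 fl Rpc:XNr/200/ffffffff "
    "rc 0/-1 job:'' uid:0 gid:0"
)
MGC_LINE = (
    "[21831.071116] Lustre: 79408:0:(client.c:2358:ptlrpc_expire_one_request()) "
    "@@@ Request sent has failed due to network error: "
    "[sent 1734413486/real 1734413694]  req@ffff9c6054445380 "
    "x1818664259033216/t0(0) o400->MGC10.240.22.189@tcp@10.240.22.189@tcp:26/25 "
    "lens 224/224 e 0 to 1 dl 1734413502 ref 1 fl Rpc:EeXNQU/200/ffffffff "
    "rc -5/-1 job:'' uid:0 gid:0"
)

if __name__ == "__main__":
    for sample in (OST_LINE, MGC_LINE):
        rec = parse_expired(sample)
        print(rec["target"], rec["nid"], rec["reason"], rec["window"], rec["rc_name"])
```

Run against the full console.onyx-138vm1.log, the same function could be used to group expired requests by NID. For the two sample lines it reports a 16-second window between sent and dl in both cases, and decodes the MGC request's rc -5 as EIO.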
      

      VVVVVVV DO NOT REMOVE LINES BELOW, Added by Maloo for auto-association VVVVVVV
      sanity-sec test_31 - Timeout occurred after 445 minutes, last suite running was sanity-sec

      Attachments

        1. console.onyx-138vm1.log
          1.22 MB
          Marc Vef
        2. test_failures_Jan-Sep_25.pdf
          56 kB
          Marc Vef

            People

              Assignee: DevOps Triage (devops-triage)
              Reporter: Maloo (maloo)