  1. Lustre
  2. LU-10961

Clients hang after failovers. LustreError: 223668:0:(file.c:4213:ll_inode_revalidate_fini()) soaked: revalidate FID [0x200000007:0x1:0x0] error: rc = -4


Details

    • Bug
    • Resolution: Fixed
    • Blocker
    • Lustre 2.12.0
    • Lustre 2.12.0
    • soak cluster
    • 3

    Description

      We are seeing repeated hard hangs on clients after server failover.
      'df' on a client hangs and user tasks do not complete. So far there have been no hard faults; the node just grinds to a halt. Yesterday this occurred on soak-17 and soak-23. I have dumped stacks on both nodes, and crash dumps are available on soak.
      We see:

      • Connections to one or more OSTs drop, and the client does not reconnect:
        Apr 27 03:28:42 soak-23 kernel: Lustre: 2084:0:(client.c:2099:ptlrpc_expire_one_request()) @@@ Request sent has timed out for sent delay: [sent 1524799714/real 0]  req@ffff8808fee67500 x1598738343197024/t0(0) o400->soaked-OST000b-osc-ffff8807f6ba0800@192.168.1.107@o2ib:28/4 lens 224/224 e 0 to 1 dl 1524799721 ref 2 fl Rpc:XN/0/ffffffff rc 0/-1
        Apr 27 03:28:42 soak-23 kernel: Lustre: soaked-OST0011-osc-ffff8807f6ba0800: Connection to soaked-OST0011 (at 192.168.1.107@o2ib) was lost; in progress operations using this service will wait for recovery to complete
        Apr 27 03:28:42 soak-23 kernel: Lustre: Skipped 3 previous similar messages
        Apr 27 03:28:42 soak-23 kernel: Lustre: 2084:0:(client.c:2099:ptlrpc_expire_one_request()) Skipped 3 previous similar messages
        Apr 27 03:28:42 soak-23 kernel: Lustre: soaked-OST000b-osc-ffff8807f6ba0800: Connection to soaked-OST000b (at 192.168.1.107@o2ib) was lost; in progress operations using this service will wait for recovery to complete
        

        As of 1700 hours (14 hours after failover) the node still has not reconnected to this OST.
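
To make the "still not reconnected" check repeatable across clients, the "was lost" lines quoted above can be pulled out of a client syslog with a short script. This is a minimal sketch, not part of any Lustre tooling: the regular expression is modeled only on the two messages shown in this report, the function name `lost_connections` is hypothetical, and the year (2018) and the "as of" time (Apr 27 17:00) are assumptions, since syslog timestamps carry no year and the report gives only "1700 hours".

```python
import re
from datetime import datetime

# Pattern modeled on the "Connection to ... was lost" lines quoted in this
# report; other Lustre versions may word the message differently.
LOST_RE = re.compile(
    r"^(?P<stamp>\w{3}\s+\d+ \d{2}:\d{2}:\d{2}) (?P<host>\S+) kernel: Lustre: "
    r"(?P<imp>\S+): Connection to (?P<target>\S+) \(at (?P<nid>[^)]+)\) was lost"
)


def lost_connections(lines, year=2018):
    """Map each target name to (time the connection was lost, server NID)."""
    lost = {}
    for line in lines:
        m = LOST_RE.match(line.strip())
        if m:
            # Syslog omits the year, so it is supplied by the caller (assumed).
            when = datetime.strptime(f"{year} {m['stamp']}", "%Y %b %d %H:%M:%S")
            lost[m["target"]] = (when, m["nid"])
    return lost


# Lines copied from the soak-23 excerpt above.
SAMPLE = [
    "Apr 27 03:28:42 soak-23 kernel: Lustre: soaked-OST0011-osc-ffff8807f6ba0800: "
    "Connection to soaked-OST0011 (at 192.168.1.107@o2ib) was lost; in progress "
    "operations using this service will wait for recovery to complete",
    "Apr 27 03:28:42 soak-23 kernel: Lustre: Skipped 3 previous similar messages",
    "Apr 27 03:28:42 soak-23 kernel: Lustre: soaked-OST000b-osc-ffff8807f6ba0800: "
    "Connection to soaked-OST000b (at 192.168.1.107@o2ib) was lost; in progress "
    "operations using this service will wait for recovery to complete",
]

if __name__ == "__main__":
    as_of = datetime(2018, 4, 27, 17, 0, 0)  # assumed date for "1700 hours"
    for target, (when, nid) in sorted(lost_connections(SAMPLE).items()):
        hours = (as_of - when).total_seconds() / 3600
        print(f"{target} via {nid}: lost at {when}, {hours:.1f} h without reconnect")
```

Under those assumptions, the gap between 03:28:42 and 17:00 is about 13.5 hours, consistent with the rounded "14 hours" above. Note that "Skipped 3 previous similar messages" means the excerpt may not name every affected OST.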

      We also see repeated errors referencing the MDT:

      Apr 27 17:25:42 soak-23 kernel: LustreError: 223668:0:(file.c:4213:ll_inode_revalidate_fini()) soaked: revalidate FID [0x200000007:0x1:0x0] error: rc = -4
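
For triage, the FID and return code in the message above can be decoded mechanically. The sketch below assumes that `rc` follows the usual Linux kernel convention of a negated errno value, under which -4 corresponds to EINTR (interrupted system call). The helper name `decode_revalidate_error` is hypothetical, and the pattern is modeled only on the single line quoted in this report.

```python
import errno
import re

# Pattern modeled on the ll_inode_revalidate_fini() message quoted above.
# A Lustre FID is printed as [sequence:object id:version], all in hex.
ERR_RE = re.compile(
    r"revalidate FID \[(?P<seq>0x[0-9a-f]+):(?P<oid>0x[0-9a-f]+):(?P<ver>0x[0-9a-f]+)\] "
    r"error: rc = (?P<rc>-?\d+)"
)


def decode_revalidate_error(line):
    """Return the FID components and symbolic errno name, or None if no match."""
    m = ERR_RE.search(line)
    if m is None:
        return None
    rc = int(m["rc"])
    return {
        "seq": int(m["seq"], 16),
        "oid": int(m["oid"], 16),
        "ver": int(m["ver"], 16),
        "rc": rc,
        # Assumes rc is a negated errno value (kernel convention).
        "errno_name": errno.errorcode.get(-rc, "UNKNOWN"),
    }


# Line copied from the soak-23 excerpt above.
LINE = (
    "Apr 27 17:25:42 soak-23 kernel: LustreError: "
    "223668:0:(file.c:4213:ll_inode_revalidate_fini()) soaked: "
    "revalidate FID [0x200000007:0x1:0x0] error: rc = -4"
)

if __name__ == "__main__":
    print(decode_revalidate_error(LINE))
```

An EINTR result would be consistent with a blocked request being interrupted by a signal rather than failing on the server, but that reading is an inference from the errno value alone and should be checked against the attached stack traces.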
      

      The error appears very repeatable. Logs and stack traces are attached.

      Attachments

        1. mds.lustre.log.txt.gz
          17.85 MB
          Cliff White
        2. s-17.client.hang.txt.gz
          7.80 MB
          Cliff White
        3. soak-17.log.gz
          281 kB
          Cliff White
        4. soak-17.lustre.log.txt.gz
          0.9 kB
          Cliff White
        5. soak-17.stacktrace.txt
          553 kB
          Cliff White
        6. soak-18.lustre.log.txt.gz
          1.68 MB
          Cliff White
        7. soak-19.lustre.log.txt.gz
          1.52 MB
          Cliff White
        8. soak-21.06-05-2018.gz
          17.40 MB
          Cliff White
        9. soak-23.client.hang.txt.gz
          7.92 MB
          Cliff White
        10. soak-23.stacks.txt
          574 kB
          Cliff White
        11. soak-24.0430.txt.gz
          19.33 MB
          Cliff White
        12. soak-24.stack.txt
          567 kB
          Cliff White
        13. soak-42.log.gz
          355 kB
          Cliff White
        14. soak-42.lustre.log.txt.gz
          1.14 MB
          Cliff White
        15. soak-44.fini.txt
          136.00 MB
          Cliff White
        16. soak-8.console.log.gz
          2.43 MB
          Cliff White
        17. soak-8.log.gz
          152 kB
          Cliff White
        18. soak-8.lustre.log.2018-06-05.gz
          26.76 MB
          Cliff White
        19. soak-8.syslog.log.gz
          3.42 MB
          Cliff White


            People

              tappro Mikhail Pershin
              cliffw Cliff White (Inactive)