LU-10961

Clients hang after failovers. LustreError: 223668:0:(file.c:4213:ll_inode_revalidate_fini()) soaked: revalidate FID [0x200000007:0x1:0x0] error: rc = -4

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Blocker
    • Affects Version: Lustre 2.12.0
    • Fix Version: Lustre 2.12.0
    • Environment: soak cluster
    • Severity: 3

    Description

      We are seeing repeated hard hangs on clients after server failovers.
      'df' on a client will hang and user tasks do not complete. So far there are no hard faults; the node just grinds to a halt. Yesterday this occurred on soak-17 and soak-23. I have dumped stacks on both nodes, and crash dumps are available on soak.
      We see:

      • Connections to one or more OSTs drop, and the client does not reconnect:
        Apr 27 03:28:42 soak-23 kernel: Lustre: 2084:0:(client.c:2099:ptlrpc_expire_one_request()) @@@ Request sent has timed out for sent delay: [sent 1524799714/real 0]  req@ffff8808fee67500 x1598738343197024/t0(0) o400->soaked-OST000b-osc-ffff8807f6ba0800@192.168.1.107@o2ib:28/4 lens 224/224 e 0 to 1 dl 1524799721 ref 2 fl Rpc:XN/0/ffffffff rc 0/-1
        Apr 27 03:28:42 soak-23 kernel: Lustre: soaked-OST0011-osc-ffff8807f6ba0800: Connection to soaked-OST0011 (at 192.168.1.107@o2ib) was lost; in progress operations using this service will wait for recovery to complete
        Apr 27 03:28:42 soak-23 kernel: Lustre: Skipped 3 previous similar messages
        Apr 27 03:28:42 soak-23 kernel: Lustre: 2084:0:(client.c:2099:ptlrpc_expire_one_request()) Skipped 3 previous similar messages
        Apr 27 03:28:42 soak-23 kernel: Lustre: soaked-OST000b-osc-ffff8807f6ba0800: Connection to soaked-OST000b (at 192.168.1.107@o2ib) was lost; in progress operations using this service will wait for recovery to complete
        

        As of 1700 hours (14 hours after failover) the node still has not reconnected to this OST.

      We also see repeated errors referencing the MDT (rc = -4 is -EINTR):

      Apr 27 17:25:42 soak-23 kernel: LustreError: 223668:0:(file.c:4213:ll_inode_revalidate_fini()) soaked: revalidate FID [0x200000007:0x1:0x0] error: rc = -4
      

      The error appears very repeatable. Logs and stack traces are attached.

      Attachments

        1. mds.lustre.log.txt.gz
          17.85 MB
        2. s-17.client.hang.txt.gz
          7.80 MB
        3. soak-17.log.gz
          281 kB
        4. soak-17.lustre.log.txt.gz
          0.9 kB
        5. soak-17.stacktrace.txt
          553 kB
        6. soak-18.lustre.log.txt.gz
          1.68 MB
        7. soak-19.lustre.log.txt.gz
          1.52 MB
        8. soak-21.06-05-2018.gz
          17.40 MB
        9. soak-23.client.hang.txt.gz
          7.92 MB
        10. soak-23.stacks.txt
          574 kB
        11. soak-24.0430.txt.gz
          19.33 MB
        12. soak-24.stack.txt
          567 kB
        13. soak-42.log.gz
          355 kB
        14. soak-42.lustre.log.txt.gz
          1.14 MB
        15. soak-44.fini.txt
          136.00 MB
        16. soak-8.console.log.gz
          2.43 MB
        17. soak-8.log.gz
          152 kB
        18. soak-8.lustre.log.2018-06-05.gz
          26.76 MB
        19. soak-8.syslog.log.gz
          3.42 MB

          Activity

            [LU-10961] Clients hang after failovers. LustreError: 223668:0:(file.c:4213:ll_inode_revalidate_fini()) soaked: revalidate FID [0x200000007:0x1:0x0] error: rc = -4

            gerrit Gerrit Updater added a comment -
            Mike Pershin (mpershin@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/32791
            Subject: LU-10961 ldlm: don't cancel DoM locks before replay
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 5c5ef9be5904a7f16aa41a341c561b0c9c6beb42

            tappro Mikhail Pershin added a comment -
            I was able to reproduce the lock replay issue with DoM files; a patch will be ready soon. The problem is lock cancellation before lock replay: MDC cancels locks differently than OSC, and that causes DoM locks to be handled improperly.

            Meanwhile, that doesn't explain the failover failure with a PFL-only file, because there is no MDC involved, so there may be other reasons for that.
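
            For illustration only, here is a toy C model of the behaviour Mike describes above. It is not the code from patch 32791, and every type, constant, and function name in it is hypothetical: the point is simply that an unused OSC extent lock can be cancelled early on reconnect, while an IBITS lock carrying the DoM bit protects data cached on the MDT and must survive until lock replay.

              /*
               * Toy model (not Lustre source); all names are hypothetical.
               * Illustrates why locks protecting Data-on-MDT (DoM) cached data
               * must be kept until lock replay, while idle OSC extent locks
               * may be cancelled early on reconnect.
               */
              #include <stdbool.h>
              #include <stdio.h>

              enum lock_type { LOCK_EXTENT, LOCK_IBITS };

              struct toy_lock {
                      enum lock_type type;
                      unsigned int   ibits;   /* inode bits, valid for LOCK_IBITS   */
                      bool           unused;  /* no active users on the client side */
              };

              #define BIT_DOM 0x40            /* hypothetical DoM inode bit */

              /* Decide whether a lock may be cancelled before replay starts. */
              static bool cancel_before_replay(const struct toy_lock *lk, bool in_replay)
              {
                      if (!lk->unused)
                              return false;   /* never cancel busy locks */
                      if (in_replay && lk->type == LOCK_IBITS && (lk->ibits & BIT_DOM))
                              return false;   /* keep DoM locks for replay */
                      return true;            /* e.g. idle OSC extent locks */
              }

              int main(void)
              {
                      struct toy_lock dom  = { LOCK_IBITS,  BIT_DOM, true };
                      struct toy_lock data = { LOCK_EXTENT, 0,       true };

                      printf("cancel DoM lock before replay: %d\n",
                             cancel_before_replay(&dom, true));   /* 0: kept */
                      printf("cancel idle extent lock:       %d\n",
                             cancel_before_replay(&data, true));  /* 1: safe */
                      return 0;
              }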

            tappro Mikhail Pershin added a comment -
            Peter,

            yes, there are clues. I am working on a reproducer and a patch.
            pjones Peter Jones added a comment -

            Mike,

            Has the information provided any insight?

            Peter

            cliffw Cliff White (Inactive) added a comment -
            Console logs from soak-17, soak-42, and the MDS (soak-8) are attached.

            cliffw Cliff White (Inactive) added a comment -
            In the same incident, soak-18 and soak-19 were hung but did not show the _fini error. Lustre logs are attached, along with the Lustre log from the MDS.

            cliffw Cliff White (Inactive) added a comment -
            Hit the bug right away; Lustre logs for two clients are attached. Also dumped stacks and crash-dumped those nodes; the bits are on Spirit. Restarting with full debug.

            gerrit Gerrit Updater added a comment -
            Mike Pershin (mike.pershin@intel.com) uploaded a new patch: https://review.whamcloud.com/32710
            Subject: LU-10961 osc: add debug code
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 523ac2ce306d182b2dc5db7c9a0f401b39124963

            jamesanunez James Nunez (Inactive) added a comment -
            Mike - Would you please upload a patch with any necessary debug information that will cause an error, so we get a failure instead of waiting for a timeout?

            Thanks.

            tappro Mikhail Pershin added a comment -
            Yes, my question was just about the particular time when the error happened; it is not possible to select any specific load as the cause, so only logs from the moment of failure could make it clear. For this we can try to inject code which causes an error immediately instead of a timeout.

            Dmitry already showed the place where the timeout happens; we can probably output more debug info there and return an error immediately instead of waiting, so the Lustre logs will contain useful data.
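
            As a minimal, self-contained C sketch of that fault-injection idea (this is not Lustre code; in the real client something like the fail_loc mechanism would drive it, and wait_for_reply / fail_here are made-up names): check an injected switch before entering the wait, dump the state of interest, and return an error immediately so the Lustre logs cover the exact moment of failure.

              /*
               * Toy sketch (not Lustre source); all names are hypothetical.
               * Instead of blocking until a timeout expires, an injected fault
               * makes the wait path log its state and fail right away.
               */
              #include <errno.h>
              #include <stdbool.h>
              #include <stdio.h>

              static bool fail_here;  /* would be driven by something like fail_loc */

              static int wait_for_reply(int request_id, int timeout_sec)
              {
                      if (fail_here) {
                              /* Debug output at the point of interest, then fail fast. */
                              fprintf(stderr, "inject: req %d would wait %ds, failing now\n",
                                      request_id, timeout_sec);
                              return -EIO;
                      }
                      /* Real code would sleep here waiting for the reply. */
                      return 0;
              }

              int main(void)
              {
                      fail_here = true;       /* simulate the injected fault */
                      printf("rc = %d\n", wait_for_reply(42, 600));
                      return 0;
              }
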
            cliffw Cliff White (Inactive) added a comment - edited

            I thought we had explained soak to you. The tests running at the time of failure were:
            blogbench, iorssf, iorfpp, kcompile, mdtestfpp, mdtestssf, simul,
            fio (random, sequential, SAS simulation)
            The random mix is distributed across the clients, with the intent of seriously loading each client. It is difficult to tell exactly what is running on a specific node at the time of failure, but generally a node will have 2-3 different jobs running at any given time.

            People

              Assignee: tappro Mikhail Pershin
              Reporter: cliffw Cliff White (Inactive)
              Votes: 0
              Watchers: 10
