Seeing LNetError: 11e-e: Unexpected error -22 connecting to 10.90.1.35@tcp55 at host 10.90.1.35:988

Details

    • Bug
    • Resolution: Fixed
    • Major
    • None
    • Lustre 2.15.4
    • Cray SLES clients
    • 3

    Description

      We recently moved our 200GB production systems to 2.15. On the clients we see the error reported in this ticket:

      LNetError: 11e-e: Unexpected error -22 connecting to NNNN at host XXXX
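      For readers decoding the message: kernel code usually reports failures as negative errno values, so -22 corresponds to EINVAL ("Invalid argument") on Linux. A minimal sketch of that lookup follows; the helper name describe_kernel_error is hypothetical, and the lookup says nothing about why LNet returned the code.

```python
import errno
import os


def describe_kernel_error(code: int) -> str:
    """Map a negative kernel-style return code (e.g. -22) to its errno name and text."""
    number = abs(code)
    # errno.errorcode maps numeric values to symbolic names on the running platform.
    name = errno.errorcode.get(number, "UNKNOWN")
    return f"{name}: {os.strerror(number)}"


if __name__ == "__main__":
    print(describe_kernel_error(-22))
```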

      While lfs df seems to work on such clients we do see evictions from time to time.

      2024-02-20T13:01:36.026377-05:00 XXXX kernel: Lustre: XXXXX-MDT0000: haven't heard from client d11610eb-9931-4127-ac5d-43ff433eab4e (at NNNN@tcp55) in 227 seconds. I think it's dead, and I am evicting it. exp 000000006594aa8b, cur 1708452096 expire 1708451946 last 1708451869

      2024-02-20T13:01:36.026377-05:00 XXXXX- kernel: Lustre: XXXXX-MDT0000: haven't heard from client d11610eb-9931-4127-ac5d-43ff433eab4e (at NNNN@tcp55]) in 227 seconds. I think it's dead, and I am evicting it. exp 000000006594aa8b, cur 1708452096 expire 1708451946 last 1708451869
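      The three numbers at the end of the eviction message look like epoch timestamps, and the "227 seconds" figure matches cur minus last. A small sketch that pulls them out of a line in the format shown above and does the arithmetic; the regular expression and the function name eviction_timing are illustrative assumptions, not part of Lustre, and reading "last" as the time the server last heard from the client is inferred from the arithmetic rather than confirmed here.

```python
import re

# Matches the tail of an eviction message in the format quoted in this ticket.
EVICTION_RE = re.compile(
    r"in (?P<silent>\d+) seconds\..*?"
    r"cur (?P<cur>\d+) expire (?P<expire>\d+) last (?P<last>\d+)"
)


def eviction_timing(line: str) -> dict:
    """Extract the timing fields from an eviction log line and compute their differences."""
    m = EVICTION_RE.search(line)
    if m is None:
        raise ValueError("not an eviction message in the expected format")
    cur, expire, last = (int(m.group(k)) for k in ("cur", "expire", "last"))
    return {
        "reported_silence": int(m.group("silent")),
        "cur_minus_last": cur - last,
        "cur_minus_expire": cur - expire,
        "expire_minus_last": expire - last,
    }


if __name__ == "__main__":
    sample = (
        "haven't heard from client d11610eb-9931-4127-ac5d-43ff433eab4e "
        "(at NNNN@tcp55) in 227 seconds. I think it's dead, and I am evicting it. "
        "exp 000000006594aa8b, cur 1708452096 expire 1708451946 last 1708451869"
    )
    print(eviction_timing(sample))
```

      For the line in this ticket, cur minus last is 227 seconds, cur minus expire is 150 seconds, and expire minus last is 77 seconds.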

      Normal ping works, but lctl ping works only some of the time and at other times gives an

      What information can I provide to resolve this? Also, for this system we have accept=all.

        Activity

          [LU-17591] Seeing LNetError: 11e-e: Unexpected error -22 connecting to 10.90.1.35@tcp55 at host 10.90.1.35:988

          simmonsja James A Simmons added a comment - Moving to 2.15 LTS seems to have made this problem occur a lot less. We can close this ticket.

          simmonsja James A Simmons added a comment - All the settings are in sync

          adilger Andreas Dilger added a comment - Is this client mounting multiple Lustre filesystems? Are the timeouts on this client the same on all of the filesystems? This is one of the common issues when a client is being evicted regularly after mounting multiple filesystems with different timeouts.

          Not sure about the LNet error. If TCP, is conns_per_peer set the same on the client and servers?
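          The consistency check asked about above can be sketched as a simple comparison once the values have been collected from each node by hand. Everything below is a hypothetical illustration: the host names, the "timeout" key, and the values are made up, and only conns_per_peer is a parameter name taken from this ticket.

```python
def find_mismatches(settings_by_host: dict) -> dict:
    """Return {parameter: {host: value}} for every parameter that differs between hosts.

    A parameter missing on a host is recorded as None, so it also counts as a mismatch.
    """
    all_names = set()
    for settings in settings_by_host.values():
        all_names.update(settings)

    mismatches = {}
    for name in sorted(all_names):
        per_host = {host: s.get(name) for host, s in settings_by_host.items()}
        if len(set(per_host.values())) > 1:
            mismatches[name] = per_host
    return mismatches


if __name__ == "__main__":
    # Made-up values for illustration only; collect the real ones from each node.
    collected = {
        "client": {"timeout": 100, "conns_per_peer": 4},
        "mds": {"timeout": 100, "conns_per_peer": 1},
        "oss": {"timeout": 100, "conns_per_peer": 1},
    }
    for name, per_host in find_mismatches(collected).items():
        print(name, per_host)
```

          An empty result means every collected parameter has the same value on all listed hosts.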

          People

            pjones Peter Jones
            simmonsja James A Simmons