LU-11931: RDMA packets sent from client to MGS are timing out

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: Lustre 2.11.0
    • Fix Version/s: Lustre 2.13.0, Lustre 2.12.1
    • Environment:
      Cray CLE6 system running 2.11 clients with 2.11 servers.
    • Severity:
      3

      Description

      On a production system we have seen the following errors, which are causing clients to be evicted.

      [85895.120239] LNetError: 18866:0:(o2iblnd_cb.c:3271:kiblnd_check_txs_locked()) Timed out tx: tx_queue, 8 seconds

      [85895.130310] LNetError: 18866:0:(o2iblnd_cb.c:3346:kiblnd_check_conns()) Timed out RDMA with 10.10.32.227@o2ib2 (51): c: 0, oc: 0, rc: 8

      [123887.254790] Lustre: MGS: haven't heard from client 51aa0ab0-3f34-cf7e-2fef-01e9ddcd4448 (at 732@gni4) in 227 seconds. I think it's dead, and I am evicting it. exp ffff961d87b9a000, cur 1547261222 expire 1547261072 last 1547260995
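
When many such messages accumulate, it can help to pull the peer NID and the trailing counters out of each line so that timeouts can be grouped per peer. The following is a minimal sketch, assuming the message layout matches the `kiblnd_check_conns()` line quoted above. The function name `parse_rdma_timeout` and the key `paren` (the number in parentheses) are hypothetical, and the script does not interpret what `c`, `oc`, and `rc` mean.

```python
import re

# Matches the kiblnd_check_conns() message layout quoted in this ticket.
# The layout is taken from the single sample line; other Lustre versions
# may format the message differently (assumption).
RDMA_TIMEOUT_RE = re.compile(
    r"Timed out RDMA with (?P<nid>\S+) \((?P<paren>\d+)\): "
    r"c: (?P<c>\d+), oc: (?P<oc>\d+), rc: (?P<rc>\d+)"
)


def parse_rdma_timeout(line):
    """Return the peer NID and counters from one log line, or None."""
    match = RDMA_TIMEOUT_RE.search(line)
    if match is None:
        return None
    fields = match.groupdict()
    return {
        "nid": fields["nid"],
        # Number in parentheses after the NID, kept without interpretation.
        "paren": int(fields["paren"]),
        "c": int(fields["c"]),
        "oc": int(fields["oc"]),
        "rc": int(fields["rc"]),
    }


sample = (
    "[85895.130310] LNetError: 18866:0:(o2iblnd_cb.c:3346:kiblnd_check_conns()) "
    "Timed out RDMA with 10.10.32.227@o2ib2 (51): c: 0, oc: 0, rc: 8"
)
print(parse_rdma_timeout(sample))
```

Lines that do not match, such as the MGS eviction message, return `None` and can be skipped.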

      Our setup has two back-end file systems: F1, which runs a 2.8.2 server stack, and F2, which runs a 2.11 server stack with ZFS (0.7.12). All clients run the Cray 2.11 client. The LNet configuration is:

      F1 file system server backend with 2.8.2 stack, ldiskfs:

          map_on_demand:0

          concurrent_sends:0

          peer_credits:8

      F2 file system server 2.11 (ZFS 0.7.12)

          map_on_demand:1

          concurrent_sends:63

          peer_credits:8

      C3 (Cray 2.11 router)

         map_on_demand:0

         concurrent_sends:16

         peer_credits:8 (o2ib)

         peer_credits:16 (gni)

      C4 (Cray 2.11 router)

         map_on_demand:0

         concurrent_sends:63

         peer_credits:8 (o2ib)

         peer_credits:16 (gni)
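
To see at a glance which tunables disagree between the four nodes, the listing above can be put into a small table and compared. This is a rough sketch only: the values are copied from this ticket, the names `O2IB_TUNABLES` and `differing_tunables` are hypothetical, and it assumes the `map_on_demand` and `concurrent_sends` values on the routers apply to their o2ib side. The gni `peer_credits` value is left out because only the routers list it.

```python
# o2ib tunables as listed in this ticket (gni peer_credits omitted).
O2IB_TUNABLES = {
    "F1": {"map_on_demand": 0, "concurrent_sends": 0, "peer_credits": 8},
    "F2": {"map_on_demand": 1, "concurrent_sends": 63, "peer_credits": 8},
    "C3": {"map_on_demand": 0, "concurrent_sends": 16, "peer_credits": 8},
    "C4": {"map_on_demand": 0, "concurrent_sends": 63, "peer_credits": 8},
}


def differing_tunables(nodes):
    """Return {tunable: {node: value}} for tunables that are not uniform."""
    names = sorted({name for settings in nodes.values() for name in settings})
    result = {}
    for name in names:
        values = {node: settings.get(name) for node, settings in nodes.items()}
        # Keep the tunable only if at least two nodes disagree on its value.
        if len(set(values.values())) > 1:
            result[name] = values
    return result


for tunable, per_node in differing_tunables(O2IB_TUNABLES).items():
    print(tunable, per_node)
```

With the values above, `peer_credits` is uniform at 8 on the o2ib side, while `map_on_demand` and `concurrent_sends` differ between nodes. The comparison only highlights the differences; it does not show which of them, if any, is responsible for the timeouts.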

      Currently the problems are seen only between 2.11 clients and the 2.11 file system (F2). Because F1 runs 2.8 with peer_credits set to 8, that setting constrains the rest of the systems.

              People

              • Assignee:
                ashehata Amir Shehata
                Reporter:
                simmonsja James A Simmons