Lustre / LU-11931

RDMA packets sent from client to MGS are timing out

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Critical
    • Fix Version/s: Lustre 2.13.0, Lustre 2.12.1
    • Affects Version/s: Lustre 2.11.0
    • Environment: Cray CLE6 system running 2.11 clients with 2.11 servers.
    • Severity: 3

    Description

      On a production system we have seen the following errors, which are causing clients to be evicted:

      [85895.120239] LNetError: 18866:0:(o2iblnd_cb.c:3271:kiblnd_check_txs_locked()) Timed out tx: tx_queue, 8 seconds

      [85895.130310] LNetError: 18866:0:(o2iblnd_cb.c:3346:kiblnd_check_conns()) Timed out RDMA with 10.10.32.227@o2ib2 (51): c: 0, oc: 0, rc: 8

      [123887.254790] Lustre: MGS: haven't heard from client 51aa0ab0-3f34-cf7e-2fef-01e9ddcd4448 (at 732@gni4) in 227 seconds. I think it's dead, and I am evicting it. exp ffff961d87b9a000, cur 1547261222 expire 1547261072 last 1547260995
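      When scanning many such logs, the fields in these messages can be pulled out with a short script. The following is only a sketch: the regular expressions are written against the lines quoted above and may not match other Lustre versions, and the function names and dictionary keys are hypothetical. It does not interpret the `c`, `oc`, and `rc` counters or the number in parentheses, since the report does not define them; it only extracts them. For the eviction line it also checks that `cur - last` matches the silence the MGS reports.

      ```python
      import re

      # Patterns written against the log lines quoted in this report (assumption:
      # other Lustre releases may format these messages differently).
      RDMA_TIMEOUT = re.compile(
          r"Timed out RDMA with (?P<nid>\S+) \((?P<paren>\d+)\): "
          r"c: (?P<c>\d+), oc: (?P<oc>\d+), rc: (?P<rc>\d+)"
      )
      EVICTION = re.compile(
          r"haven't heard from client (?P<uuid>\S+) \(at (?P<nid>\S+)\) "
          r"in (?P<silent>\d+) seconds\..*"
          r"cur (?P<cur>\d+) expire (?P<expire>\d+) last (?P<last>\d+)"
      )


      def parse_rdma_timeout(line):
          """Extract the peer NID and the raw counters from an RDMA timeout line."""
          m = RDMA_TIMEOUT.search(line)
          if m is None:
              return None
          return {
              "nid": m["nid"],
              "paren_value": int(m["paren"]),  # meaning not stated in the report
              "c": int(m["c"]),
              "oc": int(m["oc"]),
              "rc": int(m["rc"]),
          }


      def parse_eviction(line):
          """Extract the client identity and timestamps from an MGS eviction line."""
          m = EVICTION.search(line)
          if m is None:
              return None
          cur, expire, last = int(m["cur"]), int(m["expire"]), int(m["last"])
          return {
              "uuid": m["uuid"],
              "nid": m["nid"],
              "reported_silence": int(m["silent"]),
              "cur_minus_last": cur - last,      # seconds since the client was last heard
              "cur_minus_expire": cur - expire,  # seconds past the expiry timestamp
          }


      if __name__ == "__main__":
          rdma = ("[85895.130310] LNetError: 18866:0:(o2iblnd_cb.c:3346:kiblnd_check_conns()) "
                  "Timed out RDMA with 10.10.32.227@o2ib2 (51): c: 0, oc: 0, rc: 8")
          evict = ("[123887.254790] Lustre: MGS: haven't heard from client "
                   "51aa0ab0-3f34-cf7e-2fef-01e9ddcd4448 (at 732@gni4) in 227 seconds. "
                   "I think it's dead, and I am evicting it. exp ffff961d87b9a000, "
                   "cur 1547261222 expire 1547261072 last 1547260995")
          print(parse_rdma_timeout(rdma))
          print(parse_eviction(evict))
      ```

      For the eviction line above, `cur - last` is 1547261222 - 1547260995 = 227, which agrees with the "227 seconds" in the message, and `cur - expire` is 150.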

      Our setup has two back-end file systems: F1, which runs a 2.8.2 server stack, and F2, which runs a 2.11 server stack with ZFS (0.7.12). All clients run the Cray 2.11 client. The LNet configuration is:

      F1 file system server backend with 2.8.2 stack, ldiskfs:

          map_on_demand:0

          concurrent_sends:0

          peer_credits:8

      F2 file system server backend with 2.11 stack, ZFS 0.7.12:

          map_on_demand:1

          concurrent_sends:63

          peer_credits:8

      C3 (Cray 2.11 router):

         map_on_demand:0

         concurrent_sends:16

         peer_credits:8 (o2ib)

         peer_credits:16 (gni)

      C4 (Cray 2.11 router):

         map_on_demand:0

         concurrent_sends:63

         peer_credits:8 (o2ib)

         peer_credits:16 (gni)
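      To see at a glance where the four configurations above disagree, the values can be tabulated and compared. This is a minimal sketch, not a diagnosis: the dictionary layout and the helper name `differing_settings` are hypothetical, the single `peer_credits` value listed for F1 and F2 is assumed to apply to their o2ib interface, and a difference between nodes is not claimed here to be the cause of the timeouts.

      ```python
      # o2ib-side tunables transcribed from the configuration listed above.
      # Assumption: the single peer_credits value given for F1 and F2 is their
      # o2ib setting. The routers' gni peer_credits (16 on both C3 and C4) is
      # left out because the servers have no gni value to compare against.
      LNET_CONFIG = {
          "F1": {"map_on_demand": 0, "concurrent_sends": 0, "peer_credits_o2ib": 8},
          "F2": {"map_on_demand": 1, "concurrent_sends": 63, "peer_credits_o2ib": 8},
          "C3": {"map_on_demand": 0, "concurrent_sends": 16, "peer_credits_o2ib": 8},
          "C4": {"map_on_demand": 0, "concurrent_sends": 63, "peer_credits_o2ib": 8},
      }


      def differing_settings(config):
          """Return {setting: {node: value}} for settings that are not identical on every node."""
          settings = sorted({name for params in config.values() for name in params})
          result = {}
          for name in settings:
              values = {node: params.get(name) for node, params in config.items()}
              if len(set(values.values())) > 1:
                  result[name] = values
          return result


      if __name__ == "__main__":
          for name, values in differing_settings(LNET_CONFIG).items():
              print(name, values)
      ```

      With the values above, `map_on_demand` and `concurrent_sends` are reported as differing between nodes, while `peer_credits_o2ib` is 8 everywhere and is not reported.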

      Currently the problem is seen only between the 2.11 clients and the 2.11 file system (F2). Because F1 runs 2.8 and its peer_credits is set to 8, that setting affects the rest of the systems.

            People

              ashehata Amir Shehata (Inactive)
              simmonsja James A Simmons