
RDMA packets sent from client to MGS are timing out

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Critical
    • Fix Version/s: Lustre 2.13.0, Lustre 2.12.1
    • Affects Version/s: Lustre 2.11.0
    • Environment: Cray CLE6 system running 2.11 clients with 2.11 servers.
    • Severity: 3

    Description

      We have seen in a production system the following errors, which are causing clients to be evicted.

      [85895.120239] LNetError: 18866:0:(o2iblnd_cb.c:3271:kiblnd_check_txs_locked()) Timed out tx: tx_queue, 8 seconds

      [85895.130310] LNetError: 18866:0:(o2iblnd_cb.c:3346:kiblnd_check_conns()) Timed out RDMA with 10.10.32.227@o2ib2 (51): c: 0, oc: 0, rc: 8

      [123887.254790] Lustre: MGS: haven't heard from client 51aa0ab0-3f34-cf7e-2fef-01e9ddcd4448 (at 732@gni4) in 227 seconds. I think it's dead, and I am evicting it. exp ffff961d87b9a000, cur 1547261222 expire 1547261072 last 1547260995

      For our setup we have two back-end file systems: F1, which is running a 2.8.2 server stack, and F2, which is running a 2.11 server stack with ZFS (0.7.12). The clients are all running 2.11 Cray clients. The LNet configuration is:

      F1 file system server backend with 2.8.2 stack, ldiskfs:

          map_on_demand:0

          concurrent_sends:0

          peer_credits:8

      F2 file system server 2.11 (ZFS 0.7.12)

          map_on_demand:1

          concurrent_sends:63

          peer_credits:8

      C3 (cray 2.11 router)

         map_on_demand:0

         concurrent_sends:16

         peer_credits:8 (o2ib)

         peer_credits:16 (gni).

      C4 (cray 2.11 router)

         map_on_demand:0

         concurrent_sends:63

         peer_credits:8 (o2ib)

         peer_credits:16 (gni)

      Currently the problems are seen only with 2.11 clients against the 2.11 file system (F2). Since F1 is 2.8 and its peer_credits are set to 8, this impacts the rest of the systems.
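      For reference, o2iblnd tunables like the ones listed above are typically set as ko2iblnd module options. A hypothetical modprobe fragment matching the F2 values quoted in this ticket (the file path is an assumption; verify the option names against your Lustre version, since the patches on this ticket changed how concurrent_sends is handled):

```shell
# Hypothetical /etc/modprobe.d/ko2iblnd.conf fragment mirroring the F2
# settings from this ticket. Option names follow the stock o2iblnd
# module parameters; confirm them with `modinfo ko2iblnd` on your build.
options ko2iblnd map_on_demand=1 concurrent_sends=63 peer_credits=8
```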


          Activity

            pjones Peter Jones added a comment -

            ok. Main patch landed for 2.13. 34200 should be tracked under a new Jira ticket


            gerrit Gerrit Updater added a comment -

            Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/34646/
            Subject: LU-11931 lnd: bring back concurrent_sends
            Project: fs/lustre-release
            Branch: b2_12
            Current Patch Set:
            Commit: 056fe83188f0a24de9e27248a7574c8fae867163

            gerrit Gerrit Updater added a comment -

            Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/34396/
            Subject: LU-11931 lnd: bring back concurrent_sends
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 83e45ead69babfb2909a3157f054fcd8fdf33360

            gerrit Gerrit Updater added a comment -

            Amir Shehata (ashehata@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/34646
            Subject: LU-11931 lnd: bring back concurrent_sends
            Project: fs/lustre-release
            Branch: b2_12
            Current Patch Set: 1
            Commit: f08724454d919717e8a70555cf1194acddc731ad
            pjones Peter Jones added a comment -

            Ah, I see. Then we'll press to get 34396 landed, and 34200 should probably move to being tracked under a new Jira reference. That way it'll be simplest to ensure that we get the desired patch into 2.12.1.


            simmonsja James A Simmons added a comment -

            Actually, there are two patches. I nacked the one patch, but I like the other patch, which restored the concurrent_sends functionality. Also, the other patch is what we ended up running in production.
            pjones Peter Jones added a comment -

            We'll land it to b2_12 as soon as it's landed to master. ATM it's the -1 review from you gating that. Are you willing to reconsider that -1 in light of the success of the patch, or did you actually revise the patch as you had suggested before applying it in production?


            simmonsja James A Simmons added a comment -

            We have released this patch into our production server system, and it has resolved the peer-credit starvation issues on the MGS that were causing client evictions. The workaround before the patch was to remove a batch of client nodes until the evictions stopped. Now, with the patch in production, we have all the clients back in use. Please consider landing this for 2.12 LTS.

            gerrit Gerrit Updater added a comment -

            Amir Shehata (ashehata@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/34396
            Subject: LU-11931 lnd: bring back concurrent_sends
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 2dead91d20b77e6c279aa1ca048b53d6f5617b10

            ashehata Amir Shehata (Inactive) added a comment -

            Yes, this assert needs to change. Although, I'm now considering that it might be a good idea to bring back concurrent_sends. Initially I was thinking it's enough to limit the number of txs by the queue depth, but it seems that in order to saturate the link you might want to increase concurrent_sends above the queue depth. This will lead to queued txs, but it might be necessary to make sure that we maximize the bandwidth.
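            The interplay described above can be sketched with a toy model (not Lustre code; `queue_depth` and `concurrent_sends` here are simplified stand-ins for the o2iblnd tunables): when concurrent_sends exceeds the queue depth, the surplus txs sit on the tx queue, ready to be posted as completions free slots.

```python
def dispatch(n_txs, queue_depth, concurrent_sends):
    """Toy model of o2iblnd send dispatch.

    Returns (active, queued) after attempting to send n_txs:
    - the LND accepts at most concurrent_sends txs at once,
    - only queue_depth of those can be posted to the QP,
    - the remainder wait on the tx queue.
    """
    allowed = min(n_txs, concurrent_sends)  # cap on in-flight txs
    active = min(allowed, queue_depth)      # posted work requests
    queued = allowed - active               # waiting on the tx queue
    return active, queued

# With concurrent_sends (63) above the queue depth (8), txs queue up,
# keeping the link saturated as completions free posting slots.
print(dispatch(100, queue_depth=8, concurrent_sends=63))  # (8, 55)
```

            With concurrent_sends equal to the queue depth, nothing queues and the link may go idle between completions; that is the trade-off the comment above is weighing.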

            People

              Assignee: ashehata Amir Shehata (Inactive)
              Reporter: simmonsja James A Simmons
              Votes: 0
              Watchers: 13
