
Regression from LU-8057 causes loading of fld.ko hung in 2.7.2

Details

    • Type: Bug
    • Resolution: Cannot Reproduce
    • Priority: Critical
    • Fix Version/s: None
    • Affects Version/s: Lustre 2.7.0
    • Labels: None
    • Environment: Lustre server nas-2.7.2-3nasS running on CentOS 6.7.
    • Severity: 3
    • Rank (Obsolete): 9223372036854775807

    Description

      Since we rebased our nas-2.7.2-2nas onto b2_7_fe to produce nas-2.7.2-3nas, loading the Lustre module fld.ko hangs. modprobe consumes 100% CPU time and cannot be killed.

      I identified the culprit of the problem using git bisect:
      commit f23e22da88f07e95071ec76807aaa42ecd39e8ca
      Author: Amitoj Kaur Chawla <amitoj1606@gmail.com>
      Date: Thu Jun 16 23:12:03 2016 +0800

      LU-8057 ko2iblnd: Replace sg++ with sg = sg_next(sg)

      It is a b2_7_fe backport of the following change:
      Lustre-commit: d226464acaacccd240da43dcc22372fbf8cb04a6
      Lustre-change: http://review.whamcloud.com/19342
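For readers unfamiliar with the commit subject, the difference between `sg++` and `sg = sg_next(sg)` can be illustrated with a small userspace model. This is only a sketch of the general kernel scatterlist chaining idea, not the ko2iblnd code: the `Entry` and `Table` classes and both walk functions are hypothetical names invented for the illustration. A chained scatterlist ends one array with a link entry pointing at the next array; `sg_next()` follows that link, while a plain increment treats the link as data and then runs past the end of the array.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Entry:
    payload: Optional[str] = None    # data carried by a normal entry
    chain: Optional["Table"] = None  # set only on a chain-link entry
    is_last: bool = False            # marks the final data entry of the list


@dataclass
class Table:
    entries: List[Entry]


def sg_next(table, idx):
    """Model of sg_next(): position of the next data entry, or None at the end."""
    if table.entries[idx].is_last:
        return None
    idx += 1
    entry = table.entries[idx]
    if entry.chain is not None:
        # A chain link is not data: jump to the first entry of the next table.
        return entry.chain, 0
    return table, idx


def walk_with_sg_next(table):
    """Visit every data entry, following chain links."""
    seen = []
    pos = (table, 0)
    while pos is not None:
        tbl, idx = pos
        seen.append(tbl.entries[idx].payload)
        pos = sg_next(tbl, idx)
    return seen


def walk_with_increment(table, count):
    """Model of sg++: step through adjacent slots of the first table only."""
    seen = []
    for idx in range(count):
        if idx >= len(table.entries):
            seen.append("<out of bounds>")
        elif table.entries[idx].chain is not None:
            seen.append("<chain link>")
        else:
            seen.append(table.entries[idx].payload)
    return seen


# Four data entries split across two chained tables.
second = Table([Entry("c"), Entry("d", is_last=True)])
first = Table([Entry("a"), Entry("b"), Entry(chain=second)])

print(walk_with_sg_next(first))       # ['a', 'b', 'c', 'd']
print(walk_with_increment(first, 4))  # ['a', 'b', '<chain link>', '<out of bounds>']
```

In the model the two walks agree as long as the list is a single unchained array, which is why an increment-based loop can appear to work until chaining is involved.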

          Activity

            [LU-8715] Regression from LU-8057 causes loading of fld.ko hung in 2.7.2

            mhanafi Mahmoud Hanafi added a comment

            We have >12,000 clients. We do see some servers consume all the credits.

            jaylan Jay Lan (Inactive) added a comment

            @Bruno Faccini: Yes, I can reproduce the problem on our freshly rebooted Lustre servers by running 'modprobe fld'.

            jgmitter Joseph Gmitter (Inactive) added a comment

            Hi Doug,

            Can you please look into this issue, since it relates to the LU-8057 change?

            Thanks.
            Joe

            simmonsja James A Simmons added a comment (edited)

            The fix is correct and it fixes a real bug. What this change did was expose another problem in the ko2iblnd driver. I have to ask: is your system really consuming all those credits? I don't think the IB driver queue pair depth is big enough to handle all those credits.

            mhanafi Mahmoud Hanafi added a comment

            Before the patch, module load time was about 2-5 minutes, because we have large ntx values:
            (options ko2iblnd ntx=125536 credits=62768 fmr_pool_size=31385)
            After the patch it takes >20 minutes.
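To make the sizes in the quoted option line easier to compare, the sketch below parses a modprobe `options` line and checks how the three values relate to each other. The helper name `parse_modprobe_options` is hypothetical and the script says nothing about how ko2iblnd uses these parameters internally; it only does arithmetic on the numbers reported above.

```python
def parse_modprobe_options(line):
    """Split 'options <module> key=value ...' into (module, {key: int(value)})."""
    fields = line.split()
    if len(fields) < 2 or fields[0] != "options":
        raise ValueError("not a modprobe 'options' line: %r" % line)
    module = fields[1]
    params = {}
    for field in fields[2:]:
        key, _, value = field.partition("=")
        params[key] = int(value)  # every value in the quoted line is an integer
    return module, params


module, params = parse_modprobe_options(
    "options ko2iblnd ntx=125536 credits=62768 fmr_pool_size=31385"
)

# Relationships between the reported values.
print(module)                                                # ko2iblnd
print(params["credits"] * 2 == params["ntx"])                # True: credits is ntx / 2
print(params["fmr_pool_size"] == params["ntx"] // 4 + 1)     # True: ntx / 4 + 1
```

With the reported load times of 2-5 minutes before the patch and more than 20 minutes after it, the slowdown is at least a factor of four.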

            bfaccini Bruno Faccini (Inactive) added a comment

            Well, both the failure and the suspected cause look surprising.
            Do you mean that the fld.ko module load simply hangs on a fresh system when running "modprobe lustre"?

            People

              Assignee: ashehata Amir Shehata (Inactive)
              Reporter: jaylan Jay Lan (Inactive)
              Votes: 0
              Watchers: 9
