Performance improvements for lu_object locking

Details

    • Improvement
    • Resolution: Fixed
    • Major
    • Lustre 2.13.0, Lustre 2.12.5
    • Lustre 2.12.0
    • None

    Description

      When the LU-6800 work was ported upstream, the approach was poorly received because it wasn't a real improvement. Neil has created a patch series that breaks up the global lock to improve performance.
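      For readers unfamiliar with the idea of breaking up a global lock, here is a minimal userspace sketch of the general technique: each hash bucket gets its own small lock, so operations on different buckets no longer contend. This is not taken from the patch series. The names (struct table, table_insert, table_lookup, NBUCKETS) and the C11 atomic_flag spinlock are assumptions made purely for illustration, and the actual lu_object code may be structured quite differently.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

#define NBUCKETS 64 /* hypothetical table size; must stay 2^6 to match the shift below */

struct obj {
	uint64_t key;
	struct obj *next;
};

/* Each bucket carries its own lock instead of sharing one global lock. */
struct bucket {
	atomic_flag lock;
	struct obj *head;
};

struct table {
	struct bucket b[NBUCKETS];
};

void table_init(struct table *t)
{
	assert(t != NULL);
	for (size_t i = 0; i < NBUCKETS; i++) {
		atomic_flag_clear(&t->b[i].lock);
		t->b[i].head = NULL;
	}
}

/* Multiplicative hash; the top 6 bits of the product select one of 64 buckets. */
struct bucket *bucket_of(struct table *t, uint64_t key)
{
	return &t->b[(key * 0x9E3779B97F4A7C15ULL) >> 58];
}

void bucket_lock(struct bucket *b)
{
	/* Spin until the flag was previously clear, i.e. we now own the lock. */
	while (atomic_flag_test_and_set_explicit(&b->lock, memory_order_acquire))
		;
}

void bucket_unlock(struct bucket *b)
{
	atomic_flag_clear_explicit(&b->lock, memory_order_release);
}

/* Only the target bucket is locked; other buckets remain available. */
void table_insert(struct table *t, struct obj *o)
{
	struct bucket *b = bucket_of(t, o->key);

	bucket_lock(b);
	o->next = b->head;
	b->head = o;
	bucket_unlock(b);
}

/* Returns the matching object, or NULL if the key is not present. */
struct obj *table_lookup(struct table *t, uint64_t key)
{
	struct bucket *b = bucket_of(t, key);
	struct obj *o;

	bucket_lock(b);
	for (o = b->head; o != NULL; o = o->next)
		if (o->key == key)
			break;
	bucket_unlock(b);
	return o;
}
```

      With a single global lock every insert and lookup serializes. With per-bucket locks, two threads contend only when their keys hash to the same bucket. The sketch omits removal, reference counting, and resizing, which a real object cache would need.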

          Activity

            [LU-11089] Performance improvements for lu_object locking

            sihara Shuichi Ihara added a comment - Patrick, sorry! I forgot to update this ticket. Here is the new ticket: LU-11624.

            paf Patrick Farrell (Inactive) added a comment - Ihara, could you link that ticket here?  I'm interested in tracking it.  Our MDSes running 2.12 are crashing when we fail them over under load.  Pretty reliably.

            simmonsja James A Simmons added a comment - Thanks. I have a patch based on the LU-8130 work that should fix this.

            sihara Shuichi Ihara added a comment - James, the crashing servers were not related to your patches (LU-11089); it looks like a more general problem in master. Let me open a new Jira ticket for this.

            pjones Peter Jones added a comment - Could we please have a separate ticket for any instances seen on 2.12 or earlier releases without James's unlanded patches being applied? Is there any suggestion that this is happening more frequently on 2.12 compared to 2.11 and earlier releases?

            paf Patrick Farrell (Inactive) added a comment - Ah, I see now that none of the patches have landed.  So it is definitely a pre-existing bug.  Interesting.

            paf Patrick Farrell (Inactive) added a comment - Seen here during recovery as well.  Interesting.  I imagine even if the bug was already there, the changes made it easier to hit.  (Doesn't mean the changes are wrong, just that there's probably a reason we're suddenly seeing it.)
            simmonsja James A Simmons added a comment (edited) -

            Since the NID hash seems to be broken in general, I did a port to rhashtables. I still need to work on the /proc entries to display hash stats. Please try it out to see if it no longer crashes your nodes. The patch is at:

            https://review.whamcloud.com/#/c/33518

            The build breakage is only on SLES12SP3.

            simmonsja James A Simmons added a comment - Thanks for the info. Ruth has pointed out that this is a general bug. I have started the port of the nid hash to rhashtable and I'm seeing hidden issues with the original code.

            sihara Shuichi Ihara added a comment - This happens quite often. I saw a crash even at initial mount; e.g., after creating a filesystem and mounting Lustre on 32 clients, one of the OSSes crashed.

            ruth.klundt@gmail.com Ruth Klundt (Inactive) added a comment - One-off with no patches applied. After I brought the node back, all the clients mounted and the tests ran.

            People

              Assignee: simmonsja James A Simmons
              Reporter: simmonsja James A Simmons