  1. Lustre
  2. LU-8246

Leaks on ldlm granted locks counter on MDS leading to canceling loop


    • Type: Bug
    • Resolution: Fixed
    • Priority: Major
    • Fix Version: Lustre 2.9.0
    • Affects Versions: Lustre 2.5.3, Lustre 2.8.0

      A performance problem at one of our customers led us to find that the granted ldlm locks counter (/proc/fs/lustre/ldlm/namespaces/mdt-fsname-MDT0000_UUID/pool/granted) misses some decrements under conditions that are yet to be determined.

      Over time, this causes the counter to far exceed the value in /proc/fs/lustre/ldlm/namespaces/mdt-fsname-MDT0000_UUID/pool/limit.

      See here:

      [root@prolixmds1 pool]# pwd
      /proc/fs/lustre/ldlm/namespaces/mdt-scratch-MDT0000_UUID/pool
      [root@prolixmds1 pool]# cat limit
      3203616
      [root@prolixmds1 pool]# cat granted
      54882822
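
The comparison shown in the transcript above can be scripted so it is easy to repeat on other servers. This is only a minimal sketch: `check_pool` is a hypothetical helper name, and it assumes that `granted` and `limit` each contain a single integer, as they do in the output above.

```shell
#!/bin/sh
# Compare the granted-lock counter with the pool limit.
# check_pool is a hypothetical helper; its argument defaults to the pool
# directory quoted in this report.
check_pool() {
    pool_dir=${1:-/proc/fs/lustre/ldlm/namespaces/mdt-scratch-MDT0000_UUID/pool}
    # Assumption: each file holds one integer, as in the transcript above.
    granted=$(cat "$pool_dir/granted") || return 2
    limit=$(cat "$pool_dir/limit") || return 2
    if [ "$granted" -gt "$limit" ]; then
        echo "granted=$granted exceeds limit=$limit"
        return 1
    fi
    echo "granted=$granted within limit=$limit"
    return 0
}
```

Calling `check_pool` with no argument on the MDS reads the pool directory shown above. A return status of 1 means the counter is above the limit.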
      

      However, summing all the granted locks as seen by all the clients gives only 16k locks, which is also consistent with the slab consumption on the MDS.
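
This report does not say how the client-side total was collected. One possible way, given here as a sketch only, is to sum the per-namespace `lock_count` values that each client reports. `sum_lock_counts` is a hypothetical helper name, and it assumes input in the `name=value` form printed by `lctl get_param`.

```shell
#!/bin/sh
# sum_lock_counts is a hypothetical helper: it reads "name=value" lines on
# stdin and prints the sum of the values (0 for empty input).
sum_lock_counts() {
    awk -F= 'NF >= 2 { total += $NF } END { printf "%d\n", total }'
}
```

On a client this could be used as `lctl get_param 'ldlm.namespaces.*mdc*.lock_count' | sum_lock_counts`, assuming the MDC namespaces match that pattern. The per-client results are then added together and compared with `granted` on the MDS.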

      Once the counter is above the limit, the MDS constantly tries to cancel locks, even those that have not reached max_age. Clients then reacquire the locks but lose time in the process, which is what shows up as the performance problem.

      Note that since only the counter is wrong, there is no resource overconsumption tied to this problem.

      We found that this problem is also seen on 2.8.
      Can you help find where the leak comes from?

      I also wonder if there is any relation with the last comment from Shuichi Ihara in LU-5727.
      I also think Christopher Morrone pointed this out here

            Assignee: Oleg Drokin (green)
            Reporter: Sebastien Piechurski (spiechurski)