Lustre / LU-5727

MDS OOMs with 2.5.3 clients and lru_size != 0



    Description

      We have seen an MDS (admittedly with smallish memory) OOM while testing 2.5.3 clients, whereas there was no problem with 2.5.0. It turns out that, even though we have lru_size=800 everywhere, the client LDLM LRUs grow huge, so the unreclaimable ldlm slabs on the MDS fill memory.

      It looks like the root cause is the change to ldlm_cancel_aged_policy() in commit 0a6c6fcd46 on the 2.5 branch (LU-4786 osc: to not pick busy pages for ELC), which changed the lru_size != 0 behaviour. Prior to that, the non-lru_resize behaviour (at least through the early_lock_cancel path, which is the one we see being hit) was effectively

      cancel lock if (too many in lru cache || lock unused too long)

      In 2.5.3, it's

      cancel lock if (too many in lru cache && lock unused too long)
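To make the difference concrete, here is a minimal sketch of the two predicates as described above. It is not the Lustre source: the struct, its fields and both function names are hypothetical stand-ins for the state that ldlm_cancel_aged_policy() consults, and the real function has other inputs that are ignored here.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified view of one client LDLM namespace.
 * None of these names are taken from the Lustre tree. */
struct lru_state {
    size_t nr_unused; /* locks currently sitting in the LRU */
    size_t lru_size;  /* configured limit (non-zero, so lru_resize is off) */
    long   max_age;   /* seconds a lock may stay unused */
};

/* Behaviour before the change, as described: either condition
 * is enough to cancel the lock. */
static bool cancel_before_change(const struct lru_state *s, long idle_secs)
{
    return s->nr_unused > s->lru_size || idle_secs > s->max_age;
}

/* Behaviour in 2.5.3, as described: both conditions must hold,
 * so a recently used lock survives however large the LRU is. */
static bool cancel_in_253(const struct lru_state *s, long idle_secs)
{
    return s->nr_unused > s->lru_size && idle_secs > s->max_age;
}
```

With an LRU far over its limit and a lock that has only just gone idle, the first predicate cancels and the second does not, which matches the unbounded growth we observe with lru_size=800.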

      Disabling early_lock_cancel doesn't seem to help.

      It is arguable which of the two behaviours is correct, but the lru_size documentation suggests the former; the latter makes lru_size != 0 ineffective in practice. It also looks like the change was not actually necessary for LU-4300?

            People

              niu Niu Yawei (Inactive)
              dbs900 David Singleton