Lustre / LU-416

Many processes hung consuming a lot of CPU in Lustre-Client page-cache lookups

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Major

    Description

      Hi,

      At CEA, they quite often see a problem on Lustre clients where processes get stuck consuming a lot of CPU time in the Lustre layers. Unfortunately, the only way to really fix this for now is to reboot the impacted nodes (after waiting several hours for them), since the involved processes are not killable.

      Crash dump analysis shows processes stuck with the following stack traces (crash dumps can only be analyzed on the customer's site):

      =========================================================
      _spin_lock()
      cl_page_gang_lookup()
      cl_lock_page_out()
      osc_lock_flush()
      osc_lock_cancel()
      cl_lock_cancel0()
      .....
      =========================================================

      and/or
      =========================================================
      __cond_resched()
      _cond_resched()
      cfs_cond_resched()
      cl_lock_page_out()
      osc_lock_flush()
      osc_lock_cancel()
      cl_lock_cancel0()
      .....
      =========================================================

      Attached you will find 3 files:

      • node1330_dmesg is the dmesg of the faulty client;
      • node1330_lctl_dk is the 'lctl dk' output from the faulty client;
      • cmds.txt is the sequence of commands played to get the 'lctl dk' output.

      There are also "ll_imp_inval" threads stuck due to this problem, leaving OSCs in the "IN"active state for far too long, which finally causes timeouts and EIOs for client processes.
      The data structures involved are "cl_object_header.[coh_page_guard,coh_tree]", respectively the spinlock and the radix-tree used to manage the page cache associated with a Lustre-Client object.

      It seems to be a race around the OSC object's page lock/radix-tree when concurrent accesses occur (OOM, flush, invalidation, concurrent I/O). This problem seems to occur when, on the same Lustre client, there are concurrent accesses to the same Lustre objects, inducing competition for the associated lock and radix-tree from multiple CPUs.
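
      To make the structure in question concrete, here is a small userspace model of the layout described above (this is not Lustre code: the field names are only patterned after cl_object_header, and a plain array stands in for the kernel radix-tree). The point is simply that OOM writeback, lock cancellation, invalidation and regular I/O on one object all funnel through the same spinlock:
      =========================================================
      /* Toy model of the per-object page index: one spinlock
       * (cf. coh_page_guard) protecting one index structure
       * (cf. coh_tree).  Build with: gcc -Wall -pthread model.c */
      #include <pthread.h>
      #include <stdio.h>

      #define MAX_PAGES 4096

      struct toy_page { unsigned long index; };

      struct toy_object_header {
              pthread_spinlock_t coh_page_guard;        /* guards coh_tree */
              struct toy_page   *coh_tree[MAX_PAGES];   /* slot == page index */
      };

      /* Every caller -- writeback, lock cancel, invalidation, new I/O --
       * must take coh_page_guard before touching coh_tree. */
      static struct toy_page *toy_page_find(struct toy_object_header *h,
                                            unsigned long index)
      {
              struct toy_page *pg = NULL;

              pthread_spin_lock(&h->coh_page_guard);
              if (index < MAX_PAGES)
                      pg = h->coh_tree[index];
              pthread_spin_unlock(&h->coh_page_guard);
              return pg;
      }

      static void toy_page_insert(struct toy_object_header *h, struct toy_page *pg)
      {
              pthread_spin_lock(&h->coh_page_guard);
              h->coh_tree[pg->index] = pg;
              pthread_spin_unlock(&h->coh_page_guard);
      }

      int main(void)
      {
              struct toy_object_header hdr = { 0 };
              struct toy_page pg = { .index = 42 };

              pthread_spin_init(&hdr.coh_page_guard, PTHREAD_PROCESS_PRIVATE);
              toy_page_insert(&hdr, &pg);
              printf("found page %lu\n", toy_page_find(&hdr, 42)->index);
              return 0;
      }
      =========================================================
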
      To reproduce this issue, CEA uses one of their proprietary benchmarks. Basically, on a single node there are as many processes as cores on the machine, each process mapping a lot of memory. The processes write this memory to Lustre, preferably to the same OST, to reproduce the problem (a minimal sketch of this access pattern is given after the commands below). CEA noticed that the OSC inactivation performed on client eviction can be involved in reproducing the issue, so part of the reproducer can be to manually force client eviction on the OSS side by using either:
      lctl set_param obdfilter.<fs_name>-<OST_name>.evict_client=nid:<ipoib_clnt_addr>@<portal_name>
      or:
      echo 'nid:<ipoib_clnt_addr>@<portal_name>' > /proc/fs/lustre/obdfilter/<fs_name>/<OST_name>/evict_client
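
      The CEA benchmark itself is proprietary, so here is only a minimal sketch of the access pattern described above, not the actual reproducer: one writer process per core, each dirtying a large mmap()ed buffer and writing it to the same file on Lustre. The target path and the buffer size are placeholders; to favour a single OST, the target directory would typically be striped onto one OST beforehand (e.g. with lfs setstripe):
      =========================================================
      /* Sketch of the reproducer access pattern: one writer per core,
       * each dirtying a large mmap()ed buffer and writing it to the
       * same Lustre file.  Path and sizes are placeholders. */
      #include <fcntl.h>
      #include <stdio.h>
      #include <string.h>
      #include <sys/mman.h>
      #include <sys/wait.h>
      #include <unistd.h>

      #define BUF_SIZE (1UL << 30)                        /* 1 GiB per process */
      #define TARGET   "/mnt/lustre/stress/shared_file"   /* placeholder path */

      static void writer(int id)
      {
              char *buf = mmap(NULL, BUF_SIZE, PROT_READ | PROT_WRITE,
                               MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
              if (buf == MAP_FAILED) { perror("mmap"); _exit(1); }

              memset(buf, 'a' + id % 26, BUF_SIZE);        /* dirty the mapping */

              int fd = open(TARGET, O_WRONLY | O_CREAT, 0644);
              if (fd < 0) { perror("open"); _exit(1); }

              /* All writers hammer the same file/object; offset by rank so
               * they do not simply overwrite each other. */
              if (pwrite(fd, buf, BUF_SIZE, (off_t)id * BUF_SIZE) < 0)
                      perror("pwrite");
              close(fd);
              _exit(0);
      }

      int main(void)
      {
              long ncpu = sysconf(_SC_NPROCESSORS_ONLN);

              for (long i = 0; i < ncpu; i++)
                      if (fork() == 0)
                              writer((int)i);

              while (wait(NULL) > 0)
                      ;                                    /* reap all writers */
              return 0;
      }
      =========================================================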

      In order to cope with production imperatives, CEA has set up a workaround that consists of freeing the page cache with "echo 1 > /proc/sys/vm/drop_caches". After doing so, clients are able to reconnect. On the contrary, and it is interesting to note, clearing the LRU with "lctl set_param ldlm.namespaces.*.lru_size=clear" hangs the node!

      Does this issue sound familiar?
      Of course, CEA really needs a fix for this as soon as possible.

      Sebastien.

      Attachments

        1. cmds.txt
          0.4 kB
        2. node1330_dmesg
          247 kB
        3. node1330_lctl_dk
          475 kB
        4. radix-intro.pdf
          43 kB

        Activity

          pjones Peter Jones added a comment -

          Bull/CEA confirm that this issue was resolved by the LU-394 patch.


          jay Jinshan Xiong (Inactive) added a comment -

          Indeed, lru_size=clear will drop all cached locks on the client side, which has the same effect as "echo 1 > drop_caches" and as evicting the client node.

          Actually, I'm working on this issue in LU-437; can you please try the latest patch at http://review.whamcloud.com/#change,911 to see if it works?

          bfaccini Bruno Faccini (Inactive) added a comment -

          Just one more comment which may demonstrate the current inefficiency of coh_page_guard/coh_tree (i.e., respectively the spinlock and radix-tree data structures used to manage pages on a client) when dealing with concurrent accesses and a huge number of pages: "lctl set_param ldlm.namespaces.*.lru_size=clear" pseudo-hangs the same way as the other radix-tree competitors, while "echo 1 > /proc/sys/vm/drop_caches" succeeds in flushing the pages (I assume via the traditional kernel algorithms) and unblocks the situation!

          jay Jinshan Xiong (Inactive) added a comment -

          So when this problem occurs, it takes too much time for the OSC to write out all of the cached pages. This may be due to a deficiency in the implementation of cl_page_gang_lookup(); it can definitely worsen contention on ->coh_page_guard and slow things down.
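
          To make that cost concrete, here is a small userspace model of such a batched "gang lookup" walk under a single guard lock (this is not the actual cl_page_gang_lookup() code, just an illustration): each pass takes the lock, collects a small batch of entries, drops the lock, processes the batch, yields, and repeats. With millions of cached pages and several threads flushing the same object, every batch re-contends on the same lock.
          =========================================================
          /* Userspace model of a batched page-index walk under one lock.
           * Build with: gcc -Wall -pthread gang.c */
          #include <pthread.h>
          #include <sched.h>
          #include <stdio.h>

          #define NR_PAGES    (1 << 20)  /* cached pages on one object (model) */
          #define GANG_BATCH  16         /* entries looked up per lock hold */
          #define NR_FLUSHERS 4          /* concurrent writeout/cancel threads */

          static pthread_spinlock_t page_guard;        /* cf. coh_page_guard */
          static unsigned long page_index[NR_PAGES];   /* cf. coh_tree */

          static void *flush_object(void *arg)
          {
                  unsigned long start = 0, batch[GANG_BATCH], total = 0;

                  (void)arg;
                  while (start < NR_PAGES) {
                          unsigned int n = 0;

                          pthread_spin_lock(&page_guard);
                          while (n < GANG_BATCH && start + n < NR_PAGES) {
                                  batch[n] = page_index[start + n];
                                  n++;
                          }
                          pthread_spin_unlock(&page_guard);

                          for (unsigned int i = 0; i < n; i++)
                                  total += batch[i];  /* "process" batch outside the lock */

                          start += n;         /* advance past the batch */
                          sched_yield();      /* stands in for cfs_cond_resched() */
                  }
                  printf("flusher done, checksum %lu\n", total);
                  return NULL;
          }

          int main(void)
          {
                  pthread_t tid[NR_FLUSHERS];

                  pthread_spin_init(&page_guard, PTHREAD_PROCESS_PRIVATE);
                  for (int i = 0; i < NR_FLUSHERS; i++)
                          pthread_create(&tid[i], NULL, flush_object, NULL);
                  for (int i = 0; i < NR_FLUSHERS; i++)
                          pthread_join(tid[i], NULL);
                  return 0;
          }
          =========================================================
          With NR_PAGES around a million and only 16 entries per lock hold, each flusher re-acquires the lock tens of thousands of times, which matches the _spin_lock()/cl_page_gang_lookup() and cfs_cond_resched() stacks in the description.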

          louveta Alexandre Louvet (Inactive) added a comment -

          Jinshan,

          Waiting long enough gives the system time to make progress and complete, but it takes hours (even days). It doesn't look like a livelock (at least some processes complete). On Jun 16th, I got some numbers, in particular the number of locks assigned to the 'slow' client. Only 8 OSCs had locks, and none of them had more than 78 locks. The amount of buffer cache at that time was around 3 GB.

          Alex.

          jay Jinshan Xiong (Inactive) added a comment -

          Can you please try the patch at http://review.whamcloud.com/#change,911 if you have a test system?

          jay Jinshan Xiong (Inactive) added a comment -

          Hi Sebastien,

          I'm sorry, I still haven't figured out the root cause of this issue. There is a similar stack trace in LU-437, where LLNL hit it with IOR, so we're reproducing it in our lab. Meanwhile, I suspect there may be a problem in cl_page_gang_lookup() which could cause an infinite loop; this is why I'd like you guys to try that patch, and maybe we can find something new with it.

          It would be great if I could get that data, because I'd like to know whether the system is in a livelock state or keeps making progress. Anyway, it will be all right if we can reproduce it in our lab.

          Thanks,
          Jinshan

          sebastien.buisson Sebastien Buisson (Inactive) added a comment -

          OK, thank you Jinshan, we are looking forward to your patch.
          BTW, do you still need all the traces you asked for on July 17th? Because it is very complicated to get that sort of traces out of CEA.

          Cheers,
          Sebastien.

          jay Jinshan Xiong (Inactive) added a comment -

          It looks like there is an infinite loop problem in cl_lock_page_out(). I'm going to work out a patch to fix it.
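
          As a generic illustration of that kind of termination problem (this is not the actual cl_lock_page_out() logic, just a toy contrast): if a batched page-out walk retries from a start index that never advances past an entry it cannot handle, it can spin forever while that entry stays busy; stepping past every entry that was examined guarantees forward progress.
          =========================================================
          /* Toy contrast of the termination pitfall in a page-out walk. */
          #include <stdio.h>

          #define NR_PAGES 8

          static int page_busy(unsigned long idx)
          {
                  return idx == 3;   /* page 3 stays busy in this model */
          }

          /* Non-terminating variant: the index is not advanced past a
           * busy page, so the loop spins forever on page 3. */
          static void pageout_no_progress(void)
          {
                  unsigned long start = 0;

                  while (start < NR_PAGES) {
                          if (page_busy(start))
                                  continue;       /* BUG: start never advances */
                          printf("wrote out page %lu\n", start);
                          start++;
                  }
          }

          /* Terminating variant: always step past the entry just examined
           * and leave busy pages for a later pass. */
          static void pageout_with_progress(void)
          {
                  unsigned long start = 0;

                  while (start < NR_PAGES) {
                          if (!page_busy(start))
                                  printf("wrote out page %lu\n", start);
                          start++;                /* forward progress either way */
                  }
          }

          int main(int argc, char **argv)
          {
                  (void)argv;
                  if (argc > 1)
                          pageout_no_progress();  /* run only if asked: never returns */
                  pageout_with_progress();
                  return 0;
          }
          =========================================================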

          sebastien.buisson Sebastien Buisson (Inactive) added a comment -

          Hi Jay,

          I have requested the data you are asking for from our on-site support team.

          Sebastien.

          People

            Assignee:
            jay Jinshan Xiong (Inactive)
            Reporter:
            sebastien.buisson Sebastien Buisson (Inactive)
            Votes:
            0
            Watchers:
            6
