Details

    • Technical task
    • Resolution: Fixed
    • Minor
    • Lustre 2.6.0
    • Lustre 2.4.0
    • 7449

    Description

      By running the Coverity tool on the Lustre code, we may have found a possible thread deadlock.

      In function put_pages_on_daemon_list(), the loop with cfs_tcd_for_each_type_lock() acquires the lock "cfs_trace_cpu_data.tcd_lock". And inside this loop, the call to put_pages_on_tcd_daemon_list() acquires the lock "page_collection.pc_lock".

      However, we have two other cases where the locks tcd_lock and pc_lock are taken in reverse order.
      In collect_pages_on_all_cpus(), spin_lock(&pc->pc_lock) acquires the lock "pc->pc_lock", and then the loop with cfs_tcd_for_each_type_lock() acquires the lock "cfs_trace_cpu_data.tcd_lock".
      We have the same situation in put_pages_back_on_all_cpus().

      In the end it seems a thread deadlock could occur. What do you think?

      Sebastien.

      Attachments

        Activity

          [LU-3055] Possible deadlock in put_pages_on_daemon_list()
          pjones Peter Jones added a comment -

          Landed for 2.6

          pjones Peter Jones added a comment - Landed for 2.6
          liang Liang Zhen (Inactive) added a comment - patch is ready to land: http://review.whamcloud.com/#/c/7660/
          liang Liang Zhen (Inactive) added a comment - - edited

          This code is not very clean, but it's impossible to deadlock:

          • put_pages_on_daemon_list() is only called by kthread tracefiled(), and @pc here is a structure in thread stack
          • the only chance that the same @pc can be referred by collect_pages_on_all_cpus() is, the thread thread call collect_pages()->>collect_pages_on_all_cpus()

          so @pc is not going to be shared between different threads.
          Actually, after I looked into code, I think pc_lock is totally useless, it's there because we used to have a function trace_call_on_all_cpus() and it's supposed to protect race between CPUs (see BZ15878), now all use-cases of page_collection are actually just per-thread and in-stack, so we can simply remove this lock.

          liang Liang Zhen (Inactive) added a comment - - edited This code is not very clean, but it's impossible to deadlock: put_pages_on_daemon_list() is only called by kthread tracefiled(), and @pc here is a structure in thread stack the only chance that the same @pc can be referred by collect_pages_on_all_cpus() is, the thread thread call collect_pages()->>collect_pages_on_all_cpus() so @pc is not going to be shared between different threads. Actually, after I looked into code, I think pc_lock is totally useless, it's there because we used to have a function trace_call_on_all_cpus() and it's supposed to protect race between CPUs (see BZ15878), now all use-cases of page_collection are actually just per-thread and in-stack, so we can simply remove this lock.
          pjones Peter Jones added a comment -

          Liang

          Are you able to comment?

          Thanks

          Peter

          pjones Peter Jones added a comment - Liang Are you able to comment? Thanks Peter

          People

            liang Liang Zhen (Inactive)
            sebastien.buisson Sebastien Buisson (Inactive)
            Votes:
            0 Vote for this issue
            Watchers:
            5 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved: