  Lustre / LU-1428

MDT service threads spinning in cfs_hash_for_each_relax()

Details


    Description

      We have two MDT service threads using 100% CPU on a production MDS. I can't get a backtrace from crash because they do not yield the CPU, but based on oprofile they seem to be spinning in cfs_hash_for_each_relax(). At the same time we are seeing client hangs and high lock cancellation rates on the OSTs.

      samples  %        image name               app name                 symbol name
      4225020  33.0708  libcfs.ko                libcfs.ko                cfs_hash_for_each_relax
      3345225  26.1843  libcfs.ko                libcfs.ko                cfs_hash_hh_hhead
      532409    4.1674  ptlrpc.ko                ptlrpc.ko                ldlm_cancel_locks_for_export_cb
      307199    2.4046  ptlrpc.ko                ptlrpc.ko                lock_res_and_lock
      175349    1.3725  vmlinux                  vmlinux                  native_read_tsc
      151989    1.1897  ptlrpc.ko                ptlrpc.ko                ldlm_del_waiting_lock
      136679    1.0698  libcfs.ko                libcfs.ko                cfs_hash_rw_lock
      109269    0.8553  jbd2.ko                  jbd2.ko                  journal_clean_one_cp_list
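
      As a rough illustration of the failure mode (a minimal sketch with made-up names, not the libcfs implementation): a "relaxed" hash walk drops the bucket lock around each callback so the callback may block, which forces the scan to restart from the bucket head, and when the walk is meant to run until the hash is empty, entries that the callback never removes get revisited forever. That pattern would match cfs_hash_for_each_relax and cfs_hash_hh_hhead dominating the profile above.

      /* Minimal sketch with hypothetical names -- not the libcfs code.
       * A "drain until empty" hash walk: the lock is dropped around each
       * callback, so a pass may restart from the bucket head, and the
       * outer loop repeats until the table is empty.  An entry the
       * callback can never remove keeps the count nonzero, so the outer
       * loop spins at 100% CPU. */
      #include <stddef.h>

      struct entry { struct entry *next; };

      struct table {
              struct entry *head;   /* one bucket, for brevity */
              int           count;  /* number of hashed entries */
      };

      /* callback returns nonzero if it removed @e from the table */
      typedef int (*drain_cb_t)(struct table *t, struct entry *e);

      static void for_each_until_empty(struct table *t, drain_cb_t cb)
      {
              while (t->count > 0) {                  /* outer loop: "until empty" */
                      struct entry *e = t->head;

                      while (e != NULL) {
                              struct entry *next = e->next;

                              /* the real code drops the bucket lock here, runs
                               * the (possibly blocking) callback, then restarts
                               * the bucket scan from its head */
                              cb(t, e);
                              e = next;
                      }
                      /* if cb() skipped entries, count is still > 0 and the
                       * same entries are visited again on the next pass */
              }
      }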
      


          Activity

            [LU-1428] MDT service threads spinning in cfs_hash_for_each_relax()
            pjones Peter Jones added a comment -

            Landed for 2.1.3 and 2.3


            jaylan Jay Lan (Inactive) added a comment -

            NASA Ames hit this this morning on a 2.1.1 MDS.

            adilger Andreas Dilger added a comment -

            Patch landed to master for Lustre 2.3.0; it still needs to be landed to b2_1 for 2.1.3.

            bfaccini Bruno Faccini (Inactive) added a comment -

            Liang,

            I wanted to provide more details for this problem, and particularly to indicate that I felt pretty sure the looping situation would never clear, due to "orphaned" locks still being on the hash lists, but it seems that you finally found the bug/window!

            So, maybe it's too late, but just in case: digging in the server/client logs, it seems to me that at the start of the problem the concerned client/export reported the following errors/messages during MDT umount:

            1339507111 2012 Jun 12 15:18:31 lascaux3332 kern err kernel LustreError: 57793:0:(lmv_obd.c:665:lmv_disconnect_mdc()) Target scratch2-MDT0000_UUID disconnect error -110
            1339507111 2012 Jun 12 15:18:31 lascaux3332 kern warning kernel Lustre: client ffff88087d539400 umount complete
            1339507130 2012 Jun 12 15:18:50 lascaux3332 kern err kernel LustreError: 57863:0:(ldlm_request.c:1172:ldlm_cli_cancel_req()) Got rc -108 from cancel RPC: canceling anyway
            1339507130 2012 Jun 12 15:18:50 lascaux3332 kern err kernel LustreError: 57863:0:(ldlm_request.c:1172:ldlm_cli_cancel_req()) Skipped 52 previous similar messages

            Also interesting is that this situation may have a "low" impact when it concerns a single client/export which had already umounted and held locks on a very specific part/directory of the filesystem. This explains why we found multiple mdt_<id> threads spinning in this situation for days, with no problem reported nor noticed when accessing/using the concerned filesystem!
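
            To make the "orphaned lock" idea concrete, here is a minimal sketch (hypothetical names, building on the drain-until-empty sketch under the description above, not the actual ldlm code): if the per-lock callback gives up on a lock without unhashing it, for example because the export is already disconnected and the cancel cannot complete as in the log above, that entry keeps the hash non-empty and the drain loop never terminates.

            /* Sketch with hypothetical names -- not the ldlm code.
             * A callback that skips "orphaned" entries instead of unhashing
             * them leaves the table non-empty, so a drain-until-empty walk
             * over it never finishes. */
            #include <stddef.h>

            struct entry { struct entry *next; int orphaned; };
            struct table { struct entry *head; int count; };

            static void table_del(struct table *t, struct entry *e)
            {
                    struct entry **p = &t->head;

                    while (*p != NULL && *p != e)
                            p = &(*p)->next;
                    if (*p == e) {              /* unlink @e and drop the count */
                            *p = e->next;
                            t->count--;
                    }
            }

            /* returns 1 if @e was removed from the table, 0 if it was skipped */
            static int cancel_lock_cb(struct table *t, struct entry *e)
            {
                    if (e->orphaned)            /* cancel cannot complete */
                            return 0;           /* entry stays hashed: the walk spins */

                    table_del(t, e);            /* normal path: unhash, table drains */
                    return 1;
            }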


            nedbass Ned Bass (Inactive) added a comment -

            Thanks, we also need this patch for 2.1.

            liang Liang Zhen (Inactive) added a comment -

            I've posted a patch for review: http://review.whamcloud.com/#change,3028

            People

              Assignee: liang Liang Zhen (Inactive)
              Reporter: nedbass Ned Bass (Inactive)