
rm cause MDS to complain hung tasks and disconnecting clients

Details

    • Bug
    • Resolution: Fixed
    • Major
    • None
    • Lustre 2.4.3
    • Linux puma-mds-10-6.local 2.6.32-358.23.2.el6_lustre.x86_64 #1 SMP Thu Dec 19 19:57:45 PST 2013 x86_64 x86_64 x86_64 GNU/Linux
    • 2
    • 14881

    Description

      A client was running "rm" to remove a couple of million files when the MDS system load shot up to 30 and the kernel began dumping stack traces complaining about hung tasks. See the attached output from "dmesg".

      I would think this is a normal workload for a dual-Westmere CPU / 24GB RAM system with bonded Myricom 10Gbps networking.

      We have been seeing this happen more frequently in 2.4.3 than when we were at 1.8.7.

      Any suggestions?

      thanks,
      Haisong

      Attachments

        1. dmesg_log
          56 kB
          Haisong Cai
        2. dmesg.3369
          450 kB
          Haisong Cai
        3. lustre-log.tgz
          1.65 MB
          Haisong Cai


          Activity

            [LU-5333] rm cause MDS to complain hung tasks and disconnecting clients
            niu Niu Yawei (Inactive) added a comment - b2_5 port: http://review.whamcloud.com/#/c/13464/
            pjones Peter Jones added a comment -

            Haisong

            To be clear LU-5726 is targeted to be fixed in the 2.7 release but is not fixed yet. Your interest in this issue will raise the priority on this work and Niu will look at the possibilities/options to backport a fix to 2.5.x as part of this effort.

            Regards

            Peter


            haisong Haisong Cai (Inactive) added a comment -

            Hi Yawei,

            LU-5726 indicates the issue is fixed in 2.7.0.
            Could you comment on whether the fix can be back-ported into earlier versions, specifically 2.5.*?

            thanks,
            Haisong

            niu Niu Yawei (Inactive) added a comment -

            I think this could be related to LU-5726, and LU-5503 looks like another instance of the same problem.

            haisong Haisong Cai (Inactive) added a comment -

            Correction: server is running 2.4.2 not 2.4.3

            haisong Haisong Cai (Inactive) added a comment -

            We have another case where removing several million files from the filesystem caused the MDS to dump stack traces and gradually hang. I will attach some stack traces and dmesg output following this message.

            haisong Haisong Cai (Inactive) added a comment -

            Hi Oleg,

            Later that day, the MDS came to a point where local commands were hanging.
            We rebooted the server and ran e2fsck. That fixed a bunch of quota entries and a couple of inodes.

            It has been stable so far.

            Haisong
            green Oleg Drokin added a comment -

            From the traces it looks like it's a combination of OOM and journal deadlock of some sort.

            pjones Peter Jones added a comment -

            Niu

            Could you please advise on this issue?

            Thanks

            Peter


            People

              niu Niu Yawei (Inactive)
              haisong Haisong Cai (Inactive)