system hang when running sanity-quota on RHEL5-x86_64-OFED

Details

    • Bug
    • Resolution: Duplicate
    • Minor
    • None
    • Lustre 2.1.0, Lustre 2.1.1
    • None
    • lustre-master/RHEL5-x86_64/#120/ofa build
    • 3
    • 6100

    Description

      The system hangs when running sanity-quota on the RHEL5-x86_64-ofa build. Please see the attachments for all the logs.

      Attachments

        1. client-18-syslog-trace.log
          2.33 MB
        2. client-5-syslog-trace.log
          2.63 MB
        3. mds-debug.log
          2.15 MB
        4. mds-ost.tar.gz
          745 kB

        Issue Links

          Activity

            [LU-340] system hang when running sanity-quota on RHEL5-x86_64-OFED

            Thank you, Sarah. I think the debug log confirms that dqacq_handler failed because either group quota is not enabled or fail_loc is set.

            Could you run the following commands on client-5 and see what happens (quotacheck, then set the group quota)?
            lfs quotacheck -ug lustre_dir
            lfs setquota -g group_name -b 0 -B 0 -i 0 -I 0 lustre_dir
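The two commands above can be wrapped so that the second runs only if the first succeeds. This is a minimal sketch, not part of the original request: the function name `recheck_group_quota` and the `LFS` override variable are hypothetical helpers introduced here, while the `lfs` arguments are exactly those quoted in the comment.

```shell
#!/bin/sh
# LFS defaults to the real lfs binary; set LFS=echo to print the commands
# instead of running them (hypothetical convenience, not an lfs feature).
LFS=${LFS:-lfs}

# Usage: recheck_group_quota GROUP LUSTRE_DIR
recheck_group_quota() {
    group=$1
    dir=$2
    if [ -z "$group" ] || [ -z "$dir" ]; then
        echo "usage: recheck_group_quota GROUP LUSTRE_DIR" >&2
        return 2
    fi
    # Step 1: run quotacheck for both user and group quota.
    $LFS quotacheck -ug "$dir" || return 1
    # Step 2: set the group's block and inode limits to 0, as in the comment.
    $LFS setquota -g "$group" -b 0 -B 0 -i 0 -I 0 "$dir" || return 1
}
```

Running it with `LFS=echo` first shows exactly what would be executed before touching the file system.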

            niu Niu Yawei (Inactive) added a comment

            I think the default mask plus D_QUOTA will be fine. Thank you, Sarah.
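One way to collect a log with that mask is sketched below. It assumes the standard `lctl set_param debug=+quota` form for adding the quota flag to the current mask and `lctl dk` for dumping the kernel debug buffer; the function names and the `LCTL` override variable are hypothetical, and the thread does not say which commands were actually used.

```shell
#!/bin/sh
# LCTL defaults to the real lctl binary; set LCTL=echo for a dry run
# (hypothetical convenience, not an lctl feature).
LCTL=${LCTL:-lctl}

# Add the quota flag on top of whatever debug flags are already enabled.
enable_quota_debug() {
    $LCTL set_param debug=+quota
}

# Usage: dump_debug_log OUTPUT_FILE
# Dump the kernel debug buffer to a file that can be attached to the ticket.
dump_debug_log() {
    out=$1
    if [ -z "$out" ]; then
        echo "usage: dump_debug_log OUTPUT_FILE" >&2
        return 2
    fi
    $LCTL dk "$out"
}
```

The intended order would be: enable the flag on the MDS, reproduce the hang, then dump the log.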

            niu Niu Yawei (Inactive) added a comment
            sarah Sarah Liu added a comment -

            Is the D_QUOTA enabled?

            No. I can give you the debug log tomorrow. Please tell me which debug mask to use.


            Is D_QUOTA enabled? Can we get the debug log on the MDS?

            niu Niu Yawei (Inactive) added a comment
            sarah Sarah Liu added a comment -

            Is it repeatable? What's the /proc/fs/lustre/fail_loc on mds? Thanks.

            Yes, it can be reproduced.
            [root@fat-intel-1 ~]# more /proc/sys/lustre/fail_loc
            0
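The same check can be scripted so that it reports whether any fault injection point is set. This is only a sketch: the function name `fail_loc_is_set` is hypothetical, the default path is the one read in the command above, and treating any non-zero value (decimal or `0x` hex) as "set" is an assumption.

```shell
#!/bin/sh
# Usage: fail_loc_is_set [FILE]
# Returns 0 if the value is non-zero, 1 if it is zero, 2 on a read or format error.
fail_loc_is_set() {
    file=${1:-/proc/sys/lustre/fail_loc}
    if [ ! -r "$file" ]; then
        echo "cannot read $file" >&2
        return 2
    fi
    value=$(tr -d '[:space:]' < "$file")
    case $value in
        ''|*[!0-9a-fA-FxX]*)
            echo "unexpected fail_loc value: '$value'" >&2
            return 2
            ;;
    esac
    # Drop an optional 0x prefix, then delete every zero digit;
    # anything left over means the value is non-zero.
    digits=${value#0[xX]}
    nonzero=$(printf '%s' "$digits" | tr -d '0')
    if [ -n "$nonzero" ]; then
        echo "fail_loc is set ($value)"
        return 0
    fi
    echo "fail_loc is 0"
    return 1
}
```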


            From the log we can see that all pdflush threads on the client were waiting on a page lock, while the dd thread was holding that page lock to do synchronous I/O. Because something is wrong with group quota, the synchronous I/O can't finish in time, which caused the pdflush threads to stall.

            What confuses me is that there were lots of "dqacq/dqrel failed! (rc:-5)" errors while setting group quota, yet setting user quota succeeded and the user quota limit tests also passed. It looks like there are only two cases in which dqacq_handler() returns -EIO: one is OBD_FAIL_OBD_DQACQ being set, and the other is the ll_sb_has_quota_active() check failing.

            Hi, Sarah

            Is it repeatable? What's the /proc/fs/lustre/fail_loc on mds? Thanks.
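To see how many of the "dqacq/dqrel failed!" errors mentioned in this comment a given log contains, and with which return codes, the log can be tallied with standard tools. The sketch below assumes only the quoted substring `dqacq/dqrel failed! (rc:N)`; the rest of the log line format is not known here, and the function name `count_dqacq_failures` is hypothetical.

```shell
#!/bin/sh
# Usage: count_dqacq_failures LOGFILE
# Prints one line per distinct return code, e.g. "rc=-5 count=3".
count_dqacq_failures() {
    log=$1
    if [ ! -r "$log" ]; then
        echo "cannot read $log" >&2
        return 2
    fi
    # Extract just the return code from each matching line, then count
    # how often each code occurs.
    sed -n 's/.*dqacq\/dqrel failed! (rc:\(-\{0,1\}[0-9][0-9]*\)).*/\1/p' "$log" |
        sort | uniq -c | awk '{ print "rc=" $2 " count=" $1 }'
}
```

A log in which every failure shows rc=-5 (-EIO) would be consistent with the two cases described above, though it would not tell them apart.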

            niu Niu Yawei (Inactive) added a comment
            pjones Peter Jones added a comment -

            Niu

            Please look into this quotas issue when you get a chance

            Thanks

            Peter


            People

              niu Niu Yawei (Inactive)
              sarah Sarah Liu