LU-340

system hang when running sanity-quota on RHEL5-x86_64-OFED

Details

    • Type: Bug
    • Resolution: Duplicate
    • Priority: Minor
    • Fix Version/s: None
    • Affects Version/s: Lustre 2.1.0, Lustre 2.1.1
    • Components: None
    • Environment: lustre-master/RHEL5-x86_64/#120/ofa build
    • Severity: 3
    • Rank: 6100

    Description

      The system hangs when running sanity-quota on the RHEL5-x86_64-ofa build. Please see the attachments for all the logs.
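
      For reference, sanity-quota comes from the in-tree test framework. A typical invocation to reproduce might look like the sketch below; the install path and the ONLY variable are framework conventions and assumptions here, and test 1 is the test reported hung in the comments:

      cd /usr/lib64/lustre/tests
      # run only test 1 of the suite
      ONLY=1 sh sanity-quota.sh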

      Attachments

        1. client-18-syslog-trace.log
          2.33 MB
        2. client-5-syslog-trace.log
          2.63 MB
        3. mds-debug.log
          2.15 MB
        4. mds-ost.tar.gz
          745 kB

        Activity


            niu Niu Yawei (Inactive) added a comment -

            Fixed in LU-1782.
            yujian Jian Yu added a comment -

            Lustre Tag: v2_1_1_0_RC2
            Lustre Build: http://build.whamcloud.com/job/lustre-b2_1/41/
            Distro/Arch: RHEL5/x86_64 (kernel version: 2.6.18-274.12.1.el5)
            Network: IB (OFED 1.5.4)

            The same issue occurred: https://maloo.whamcloud.com/test_sets/f95cf180-584c-11e1-9df1-5254004bbbd3

            yujian Jian Yu added a comment -

            Lustre Branch: master
            Lustre Build: http://newbuild.whamcloud.com/job/lustre-master/273/
            Distro/Arch: RHEL5/x86_64
            Network: IB (OFED 1.5.3.1)

            The same failure occurred while running sanity-quota test: https://maloo.whamcloud.com/test_sets/4115f084-d2de-11e0-8d02-52540025f9af

            yujian Jian Yu added a comment -

            Lustre Clients:
            Tag: 1.8.6-wc1
            Distro/Arch: RHEL5/x86_64 (kernel version: 2.6.18_238.12.1.el5.x86_64)
            Build: http://newbuild.whamcloud.com/job/lustre-b1_8/100/arch=x86_64,build_type=client,distro=el5,ib_stack=ofa/
            Network: IB (OFED 1.5.3.1)

            Lustre Servers:
            Tag: v2_1_0_0_RC1
            Distro/Arch: RHEL5/x86_64 (kernel version: 2.6.18-238.19.1.el5_lustre.g65156ed.x86_64)
            Build: http://newbuild.whamcloud.com/job/lustre-master/273/arch=x86_64,build_type=server,distro=el5,ib_stack=ofa/
            Network: IB (OFED 1.5.3.1)

            sanity-quota test 1 hung: https://maloo.whamcloud.com/test_sets/842c0928-cfc6-11e0-8d02-52540025f9af

            Dmesg on MDS (fat-amd-1-ib) showed:

            Lustre: DEBUG MARKER: == test 1: Block hard limit (normal use and out of quota) === == 01:51:35
            Lustre: DEBUG MARKER: User quota (limit: 95511 kbytes)
            Lustre: DEBUG MARKER: Write ...
            Lustre: DEBUG MARKER: Done
            Lustre: DEBUG MARKER: Write out of block quota ...
            Lustre: DEBUG MARKER: --------------------------------------
            Lustre: DEBUG MARKER: Group quota (limit: 95511 kbytes)
            LustreError: 8250:0:(ldlm_lib.c:2341:target_handle_dqacq_callback()) dqacq/dqrel failed! (rc:-5)
            LustreError: 8251:0:(ldlm_lib.c:2341:target_handle_dqacq_callback()) dqacq/dqrel failed! (rc:-5)
            LustreError: 6520:0:(quota_context.c:708:dqacq_completion()) acquire qunit got error! (rc:-5)
            LustreError: 6520:0:(quota_master.c:1263:mds_init_slave_blimits()) error mds adjust local block quota! (rc:-5)
            LustreError: 6520:0:(quota_master.c:1442:mds_set_dqblk()) init slave blimits failed! (rc:-5)
            <~snip~>
            
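
            In the dmesg output above, rc = -5 is -EIO coming back from the dqacq/dqrel path. A quick check of the quota state at this point might look like the following sketch; the mount point, the group name, and especially the quota_type parameter path are assumptions for this vintage of Lustre:

            # from a client: see whether group limits are visible/enforced
            lfs quota -g quota_usr /mnt/lustre
            # on the MDS: check which quota types are enabled ("ug" expected);
            # the exact parameter path varies by version
            lctl get_param -n mds.*.quota_type
            # rule out an injected failure point
            cat /proc/sys/lnet/fail_loc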

            niu Niu Yawei (Inactive) added a comment -

            When I logged on to the system, I found that "lfs quotaon -ug" could not turn on the local fs group quota on the MDS, even though the command executed successfully and there were no abnormal messages in the debug log.

            The local fs group quota can be enabled by a separate "lfs quotaon -g"; once that has run, the system returns to normal, and the group quota can be enabled and disabled with "lfs quotaon/off -ug" again.

            This bug appeared only with the ofa server build, so I suspect it is ofa-build related; I will continue the investigation when I have time and spare nodes.
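
            A minimal sketch of that workaround sequence, assuming a client mount at /mnt/lustre:

            # reports success, but group quota stays off on the MDS
            lfs quotaon -ug /mnt/lustre
            # enabling group quota by itself does take effect
            lfs quotaon -g /mnt/lustre
            # after that, the combined enable/disable works normally again
            lfs quotaoff -ug /mnt/lustre
            lfs quotaon -ug /mnt/lustre
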
            sarah Sarah Liu added a comment -

            [root@client-15 ~]# lfs quotacheck -ug /mnt/lustre/
            [root@client-15 ~]# lfs setquota -g quota_usr -b 0 -B 0 -i 0 -I 0 /mnt/lustre/
            [root@client-15 ~]# mount
            /dev/sda1 on / type ext3 (rw)
            proc on /proc type proc (rw)
            sysfs on /sys type sysfs (rw)
            devpts on /dev/pts type devpts (rw,gid=5,mode=620)
            tmpfs on /dev/shm type tmpfs (rw)
            none on /proc/sys/fs/binfmt_misc type binfmt_misc (rw)
            sunrpc on /var/lib/nfs/rpc_pipefs type rpc_pipefs (rw)
            192.168.4.128@o2ib:/lustre on /mnt/lustre type lustre (rw,flock)


            niu Niu Yawei (Inactive) added a comment -

            Thank you, Sarah. I think the debug log confirms that dqacq_handler failed because group quota was not enabled or a fail_loc was set.

            Could you try the following commands on client-5 to see what happens (quotacheck, then set the group quota)?
            lfs quotacheck -ug lustre_dir
            lfs setquota -g group_name -b 0 -B 0 -i 0 -I 0 lustre_dir

            niu Niu Yawei (Inactive) added a comment -

            I think the default mask plus D_QUOTA will be fine. Thank you, Sarah.
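
            The mask change and log capture on the MDS might look like this sketch; the output path is an assumption:

            # add D_QUOTA on top of the current default debug mask
            lctl set_param debug=+quota
            # ... reproduce the hang ...
            # dump the kernel debug buffer to a file
            lctl dk /tmp/mds-debug.log
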
            sarah Sarah Liu added a comment -

            "Is D_QUOTA enabled?"

            No. I can give you a debug log tomorrow; please tell me the debug mask to use.


            niu Niu Yawei (Inactive) added a comment -

            Is D_QUOTA enabled? Can we get the debug log on the MDS?

            People

              Assignee: Niu Yawei (Inactive)
              Reporter: Sarah Liu
              Votes: 0
              Watchers: 6
