[LU-340] system hang when running sanity-quota on RHEL5-x86_64-OFED Created: 17/May/11  Updated: 01/Apr/13  Resolved: 01/Apr/13

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.1.0, Lustre 2.1.1
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: Sarah Liu Assignee: Niu Yawei (Inactive)
Resolution: Duplicate Votes: 0
Labels: None
Environment:

lustre-master/RHEL5-x86_64/#120/ofa build


Attachments: client-18-syslog-trace.log, client-5-syslog-trace.log, mds-debug.log, mds-ost.tar.gz
Issue Links:
Related
is related to LU-1782 Ignore sb_has_quota_active() in OFED'... Resolved
Severity: 3
Rank (Obsolete): 6100

 Description   

System hangs when running sanity-quota on the RHEL5-x86_64 OFA build. Please see the attachments for all the logs.



 Comments   
Comment by Peter Jones [ 18/May/11 ]

Niu

Please look into this quotas issue when you get a chance

Thanks

Peter

Comment by Niu Yawei (Inactive) [ 19/May/11 ]

From the log we can see that all pdflush threads on the client were waiting on a page lock, while the dd thread holding that page lock was doing synchronous I/O. Because something is wrong with group quota, the synchronous I/O couldn't finish in time, which caused the pdflush threads to stall.

What confuses me is that there were lots of "dqacq/dqrel failed! (rc:-5)" errors while setting the group quota, yet setting the user quota succeeded and the user quota limit tests passed as well. It looks like there are only two cases in which dqacq_handler() returns -EIO: either OBD_FAIL_OBD_DQACQ is set, or the ll_sb_has_quota_active() check fails.
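
To narrow it down between those two cases, a quick check on the MDS along these lines should help (a sketch only; the quota_type parameter names are assumptions from the 1.8/2.x quota code and may differ on this build):

lctl get_param fail_loc                # non-zero would indicate an OBD_FAIL_* injection
lctl get_param mds.*.quota_type        # should show both 'u' and 'g' if group quota is active on the MDS
lctl get_param obdfilter.*.quota_type  # the per-OST equivalent on the OSSes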

Hi, Sarah

Is it repeatable? What's the /proc/fs/lustre/fail_loc on mds? Thanks.

Comment by Sarah Liu [ 19/May/11 ]

Is it repeatable? What's the /proc/fs/lustre/fail_loc on mds? Thanks.

yes, it can be reproduced.
[root@fat-intel-1 ~]# more /proc/sys/lustre/fail_loc
0

Comment by Niu Yawei (Inactive) [ 19/May/11 ]

Is D_QUOTA enabled? Can we get the debug log from the MDS?

Comment by Sarah Liu [ 20/May/11 ]

Is D_QUOTA enabled?

No. I can give you the debug log tomorrow; please tell me the debug mask to use.

Comment by Niu Yawei (Inactive) [ 20/May/11 ]

I think the default mask plus D_QUOTA will be fine. Thank you, Sarah.
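
For reference, a sketch of how the mask could be set and the log collected on the MDS (lctl syntax assumed from the 2.x tools, and the output file name is only illustrative):

lctl set_param debug=+quota        # add D_QUOTA to the current debug mask
lctl clear                         # drop the old contents of the debug buffer
# ... reproduce the failure ...
lctl dk /tmp/mds-debug.log         # dump the kernel debug log to a file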

Comment by Niu Yawei (Inactive) [ 22/May/11 ]

Thank you, Sarah. I think the debug log confirms that dqacq_handler() failed because either group quota is not enabled or fail_loc is set.

Could you try the following commands on client-5 to see what happens (run quotacheck, then set a group quota)?
lfs quotacheck -ug lustre_dir
lfs setquota -g group_name -b 0 -B 0 -i 0 -I 0 lustre_dir

Comment by Sarah Liu [ 24/May/11 ]

[root@client-15 ~]# lfs quotacheck -ug /mnt/lustre/
[root@client-15 ~]# lfs setquota -g quota_usr -b 0 -B 0 -i 0 -I 0 /mnt/lustre/
[root@client-15 ~]# mount
/dev/sda1 on / type ext3 (rw)
proc on /proc type proc (rw)
sysfs on /sys type sysfs (rw)
devpts on /dev/pts type devpts (rw,gid=5,mode=620)
tmpfs on /dev/shm type tmpfs (rw)
none on /proc/sys/fs/binfmt_misc type binfmt_misc (rw)
sunrpc on /var/lib/nfs/rpc_pipefs type rpc_pipefs (rw)
192.168.4.128@o2ib:/lustre on /mnt/lustre type lustre (rw,flock)

Comment by Niu Yawei (Inactive) [ 26/May/11 ]

When I logged on to the system, I found that "lfs quotaon -ug" can't turn on the local fs group quota on the MDS, even though the command completes successfully and there are no abnormal messages in the debug log.

The local fs group quota can be enabled by "lfs quotaon -g"; after that was executed, the system returned to normal, and group quota could be enabled/disabled by "lfs quotaon/off -ug" again.
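
A sketch of that recovery sequence, with an illustrative mount point:

lfs quotaon -g /mnt/lustre         # force the local fs group quota on explicitly
lfs quotaoff -ug /mnt/lustre       # afterwards both types can be turned off...
lfs quotaon -ug /mnt/lustre        # ...and on again as expected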

This bug appears only on the OFA build servers, so I suspect it's OFA-build related. I will continue the investigation when I have time and spare nodes.

Comment by Jian Yu [ 29/Aug/11 ]

Lustre Clients:
Tag: 1.8.6-wc1
Distro/Arch: RHEL5/x86_64 (kernel version: 2.6.18_238.12.1.el5.x86_64)
Build: http://newbuild.whamcloud.com/job/lustre-b1_8/100/arch=x86_64,build_type=client,distro=el5,ib_stack=ofa/
Network: IB (OFED 1.5.3.1)

Lustre Servers:
Tag: v2_1_0_0_RC1
Distro/Arch: RHEL5/x86_64 (kernel version: 2.6.18-238.19.1.el5_lustre.g65156ed.x86_64)
Build: http://newbuild.whamcloud.com/job/lustre-master/273/arch=x86_64,build_type=server,distro=el5,ib_stack=ofa/
Network: IB (OFED 1.5.3.1)

sanity-quota test 1 hung: https://maloo.whamcloud.com/test_sets/842c0928-cfc6-11e0-8d02-52540025f9af

Dmesg on MDS (fat-amd-1-ib) showed:

Lustre: DEBUG MARKER: == test 1: Block hard limit (normal use and out of quota) === == 01:51:35
Lustre: DEBUG MARKER: User quota (limit: 95511 kbytes)
Lustre: DEBUG MARKER: Write ...
Lustre: DEBUG MARKER: Done
Lustre: DEBUG MARKER: Write out of block quota ...
Lustre: DEBUG MARKER: --------------------------------------
Lustre: DEBUG MARKER: Group quota (limit: 95511 kbytes)
LustreError: 8250:0:(ldlm_lib.c:2341:target_handle_dqacq_callback()) dqacq/dqrel failed! (rc:-5)
LustreError: 8251:0:(ldlm_lib.c:2341:target_handle_dqacq_callback()) dqacq/dqrel failed! (rc:-5)
LustreError: 6520:0:(quota_context.c:708:dqacq_completion()) acquire qunit got error! (rc:-5)
LustreError: 6520:0:(quota_master.c:1263:mds_init_slave_blimits()) error mds adjust local block quota! (rc:-5)
LustreError: 6520:0:(quota_master.c:1442:mds_set_dqblk()) init slave blimits failed! (rc:-5)
<~snip~>
Comment by Jian Yu [ 30/Aug/11 ]

Lustre Branch: master
Lustre Build: http://newbuild.whamcloud.com/job/lustre-master/273/
Distro/Arch: RHEL5/x86_64
Network: IB (OFED 1.5.3.1)

The same failure occurred while running sanity-quota test: https://maloo.whamcloud.com/test_sets/4115f084-d2de-11e0-8d02-52540025f9af

Comment by Jian Yu [ 16/Feb/12 ]

Lustre Tag: v2_1_1_0_RC2
Lustre Build: http://build.whamcloud.com/job/lustre-b2_1/41/
Distro/Arch: RHEL5/x86_64 (kernel version: 2.6.18-274.12.1.el5)
Network: IB (OFED 1.5.4)

The same issue occurred: https://maloo.whamcloud.com/test_sets/f95cf180-584c-11e1-9df1-5254004bbbd3

Comment by Niu Yawei (Inactive) [ 01/Apr/13 ]

Fixed in LU-1782.
