[LU-2284] Test failure on test suite sanity-quota, subtest test_7d Created: 05/Nov/12  Updated: 09/Dec/12  Resolved: 09/Dec/12

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.4.0
Fix Version/s: Lustre 2.4.0

Type: Bug Priority: Blocker
Reporter: Maloo Assignee: Niu Yawei (Inactive)
Resolution: Fixed Votes: 0
Labels: NFBlocker
Environment:

SLES11 SP2 client


Severity: 3
Rank (Obsolete): 5468

 Description   

This issue was created by maloo for sarah <sarah@whamcloud.com>

This issue relates to the following test suite run: https://maloo.whamcloud.com/test_sets/ba0c6718-2709-11e2-b04c-52540035b04c.

The sub-test test_7d failed with the following error:

quota_2usr write success, expect EDQUOT

== sanity-quota test 7d: Quota reintegration (Transfer index in multiple bulks) ====================== 13:56:52 (1352066212)
Waiting for local destroys to complete
Creating test directory
CMD: fat-intel-3vm7 lctl set_param fail_val=0
fail_val=0
CMD: fat-intel-3vm7 lctl set_param fail_loc=0
fail_loc=0
CMD: fat-intel-3vm8 lctl set_param fail_val=0
fail_val=0
CMD: fat-intel-3vm8 lctl set_param fail_loc=0
fail_loc=0
CMD: fat-intel-3vm8 lctl set_param fail_val=0
fail_val=0
CMD: fat-intel-3vm8 lctl set_param fail_loc=0
fail_loc=0
CMD: fat-intel-3vm8 lctl set_param fail_val=0
fail_val=0
CMD: fat-intel-3vm8 lctl set_param fail_loc=0
fail_loc=0
CMD: fat-intel-3vm8 lctl set_param fail_val=0
fail_val=0
CMD: fat-intel-3vm8 lctl set_param fail_loc=0
fail_loc=0
CMD: fat-intel-3vm8 lctl set_param fail_val=0
fail_val=0
CMD: fat-intel-3vm8 lctl set_param fail_loc=0
fail_loc=0
CMD: fat-intel-3vm8 lctl set_param fail_val=0
fail_val=0
CMD: fat-intel-3vm8 lctl set_param fail_loc=0
fail_loc=0
CMD: fat-intel-3vm8 lctl set_param fail_val=0
fail_val=0
CMD: fat-intel-3vm8 lctl set_param fail_loc=0
fail_loc=0
CMD: fat-intel-3vm7 /usr/sbin/lctl conf_param lustre.quota.ost=none
CMD: fat-intel-3vm8 /usr/sbin/lctl dl
CMD: fat-intel-3vm8 /usr/sbin/lctl get_param -n 				osd-ldiskfs.lustre-OST0000.quota_slave.enabled
Waiting 90 secs for update
CMD: fat-intel-3vm8 /usr/sbin/lctl get_param -n 				osd-ldiskfs.lustre-OST0000.quota_slave.enabled
CMD: fat-intel-3vm8 /usr/sbin/lctl get_param -n 				osd-ldiskfs.lustre-OST0000.quota_slave.enabled
CMD: fat-intel-3vm8 /usr/sbin/lctl get_param -n 				osd-ldiskfs.lustre-OST0000.quota_slave.enabled
CMD: fat-intel-3vm8 /usr/sbin/lctl get_param -n 				osd-ldiskfs.lustre-OST0000.quota_slave.enabled
CMD: fat-intel-3vm8 /usr/sbin/lctl get_param -n 				osd-ldiskfs.lustre-OST0000.quota_slave.enabled
CMD: fat-intel-3vm8 /usr/sbin/lctl get_param -n 				osd-ldiskfs.lustre-OST0000.quota_slave.enabled
Updated after 6s: wanted 'none' got 'none'
CMD: fat-intel-3vm8 /usr/sbin/lctl dl
CMD: fat-intel-3vm8 /usr/sbin/lctl get_param -n 				osd-ldiskfs.lustre-OST0001.quota_slave.enabled
CMD: fat-intel-3vm8 /usr/sbin/lctl dl
CMD: fat-intel-3vm8 /usr/sbin/lctl get_param -n 				osd-ldiskfs.lustre-OST0002.quota_slave.enabled
CMD: fat-intel-3vm8 /usr/sbin/lctl dl
CMD: fat-intel-3vm8 /usr/sbin/lctl get_param -n 				osd-ldiskfs.lustre-OST0003.quota_slave.enabled
CMD: fat-intel-3vm8 /usr/sbin/lctl dl
CMD: fat-intel-3vm8 /usr/sbin/lctl get_param -n 				osd-ldiskfs.lustre-OST0004.quota_slave.enabled
CMD: fat-intel-3vm8 /usr/sbin/lctl dl
CMD: fat-intel-3vm8 /usr/sbin/lctl get_param -n 				osd-ldiskfs.lustre-OST0005.quota_slave.enabled
CMD: fat-intel-3vm8 /usr/sbin/lctl dl
CMD: fat-intel-3vm8 /usr/sbin/lctl get_param -n 				osd-ldiskfs.lustre-OST0006.quota_slave.enabled
setquota failed: Invalid argument
CMD: fat-intel-3vm7 lctl set_param fail_val=0
fail_val=0
CMD: fat-intel-3vm7 lctl set_param fail_loc=0x608
fail_loc=0x608
CMD: fat-intel-3vm7 /usr/sbin/lctl conf_param lustre.quota.ost=u
CMD: fat-intel-3vm8 /usr/sbin/lctl dl
CMD: fat-intel-3vm8 /usr/sbin/lctl get_param -n 				osd-ldiskfs.lustre-OST0000.quota_slave.enabled
Waiting 90 secs for update
CMD: fat-intel-3vm8 /usr/sbin/lctl get_param -n 				osd-ldiskfs.lustre-OST0000.quota_slave.enabled
CMD: fat-intel-3vm8 /usr/sbin/lctl get_param -n 				osd-ldiskfs.lustre-OST0000.quota_slave.enabled
CMD: fat-intel-3vm8 /usr/sbin/lctl get_param -n 				osd-ldiskfs.lustre-OST0000.quota_slave.enabled
CMD: fat-intel-3vm8 /usr/sbin/lctl get_param -n 				osd-ldiskfs.lustre-OST0000.quota_slave.enabled
CMD: fat-intel-3vm8 /usr/sbin/lctl get_param -n 				osd-ldiskfs.lustre-OST0000.quota_slave.enabled
CMD: fat-intel-3vm8 /usr/sbin/lctl get_param -n 				osd-ldiskfs.lustre-OST0000.quota_slave.enabled
Updated after 6s: wanted 'u' got 'u'
CMD: fat-intel-3vm8 /usr/sbin/lctl dl
CMD: fat-intel-3vm8 /usr/sbin/lctl get_param -n 				osd-ldiskfs.lustre-OST0001.quota_slave.enabled
CMD: fat-intel-3vm8 /usr/sbin/lctl dl
CMD: fat-intel-3vm8 /usr/sbin/lctl get_param -n 				osd-ldiskfs.lustre-OST0002.quota_slave.enabled
CMD: fat-intel-3vm8 /usr/sbin/lctl dl
CMD: fat-intel-3vm8 /usr/sbin/lctl get_param -n 				osd-ldiskfs.lustre-OST0003.quota_slave.enabled
CMD: fat-intel-3vm8 /usr/sbin/lctl dl
CMD: fat-intel-3vm8 /usr/sbin/lctl get_param -n 				osd-ldiskfs.lustre-OST0004.quota_slave.enabled
CMD: fat-intel-3vm8 /usr/sbin/lctl dl
CMD: fat-intel-3vm8 /usr/sbin/lctl get_param -n 				osd-ldiskfs.lustre-OST0005.quota_slave.enabled
CMD: fat-intel-3vm8 /usr/sbin/lctl dl
CMD: fat-intel-3vm8 /usr/sbin/lctl get_param -n 				osd-ldiskfs.lustre-OST0006.quota_slave.enabled
CMD: fat-intel-3vm8 /usr/sbin/lctl dl
CMD: fat-intel-3vm8 /usr/sbin/lctl get_param -n 				osd-ldiskfs.lustre-OST0000.quota_slave.info |
				grep user | awk '{ print \$3 }'
CMD: fat-intel-3vm8 /usr/sbin/lctl dl
CMD: fat-intel-3vm8 /usr/sbin/lctl get_param -n 				osd-ldiskfs.lustre-OST0001.quota_slave.info |
				grep user | awk '{ print \$3 }'
CMD: fat-intel-3vm8 /usr/sbin/lctl dl
CMD: fat-intel-3vm8 /usr/sbin/lctl get_param -n 				osd-ldiskfs.lustre-OST0002.quota_slave.info |
				grep user | awk '{ print \$3 }'
CMD: fat-intel-3vm8 /usr/sbin/lctl dl
CMD: fat-intel-3vm8 /usr/sbin/lctl get_param -n 				osd-ldiskfs.lustre-OST0003.quota_slave.info |
				grep user | awk '{ print \$3 }'
CMD: fat-intel-3vm8 /usr/sbin/lctl dl
CMD: fat-intel-3vm8 /usr/sbin/lctl get_param -n 				osd-ldiskfs.lustre-OST0004.quota_slave.info |
				grep user | awk '{ print \$3 }'
CMD: fat-intel-3vm8 /usr/sbin/lctl dl
CMD: fat-intel-3vm8 /usr/sbin/lctl get_param -n 				osd-ldiskfs.lustre-OST0005.quota_slave.info |
				grep user | awk '{ print \$3 }'
CMD: fat-intel-3vm8 /usr/sbin/lctl dl
CMD: fat-intel-3vm8 /usr/sbin/lctl get_param -n 				osd-ldiskfs.lustre-OST0006.quota_slave.info |
				grep user | awk '{ print \$3 }'
CMD: fat-intel-3vm7 lctl set_param fail_val=0
fail_val=0
CMD: fat-intel-3vm7 lctl set_param fail_loc=0
fail_loc=0
running as uid/gid/euid/egid 60000/60000/60000/60000, groups:
 [dd] [if=/dev/zero] [bs=1M] [of=/mnt/lustre/d0.sanity-quota/d7/f.sanity-quota.7d] [count=21] [oflag=sync]
dd: writing `/mnt/lustre/d0.sanity-quota/d7/f.sanity-quota.7d': Disk quota exceeded
20+0 records in
19+0 records out
19922944 bytes (20 MB) copied, 47.8415 s, 416 kB/s
running as uid/gid/euid/egid 60001/60001/60001/60001, groups:
 [dd] [if=/dev/zero] [bs=1M] [of=/mnt/lustre/d0.sanity-quota/d7/f.sanity-quota.7d-1] [count=21] [oflag=sync]
21+0 records in
21+0 records out
22020096 bytes (22 MB) copied, 2.24614 s, 9.8 MB/s
Disk quotas for user quota_2usr (uid 60001):
     Filesystem  kbytes   quota   limit   grace   files   quota   limit   grace
    /mnt/lustre   21504  14450480 15173004       -       1  1048576 1101004       -
lustre-MDT0000_UUID
                      0       -       0       -       1       -  262144       -
lustre-OST0000_UUID
                      0       -       0       -       -       -       -       -
lustre-OST0001_UUID
                      0       -       0       -       -       -       -       -
lustre-OST0002_UUID
                      0       -   20480       -       -       -       -       -
lustre-OST0003_UUID
                  21504       - 1048576       -       -       -       -       -
lustre-OST0004_UUID
                      0       -       0       -       -       -       -       -
lustre-OST0005_UUID
                      0       -       0       -       -       -       -       -
lustre-OST0006_UUID
                      0       -       0       -       -       -       -       -
Files for user (quota_2usr):
  File: `/mnt/lustre/d0.sanity-quota/d7/f.sanity-quota.7d-1'
  Size: 22020096  	Blocks: 43008      IO Block: 2097152 regular file
Device: 2c54f966h/743766374d	Inode: 144117285630509373  Links: 1
Access: (0644/-rw-r--r--)  Uid: (60001/quota_2usr)   Gid: (60001/quota_2usr)
Access: 2012-11-04 13:58:04.000000000 -0800
Modify: 2012-11-04 13:58:07.000000000 -0800
Change: 2012-11-04 13:58:07.000000000 -0800
 Birth: -
 sanity-quota test_7d: @@@@@@ FAIL: quota_2usr write success, expect EDQUOT 


 Comments   
Comment by Jodi Levi (Inactive) [ 06/Nov/12 ]

Can you please have a look at this one?
Thank you!

Comment by Niu Yawei (Inactive) [ 06/Nov/12 ]

The setquota for TSTUSR2 failed:

setquota failed: Invalid argument

That's why the quota_2usr write didn't get EDQUOT. Unfortunately, I didn't find any helpful information in the MDS log.

Sarah, if it can be reproduced, could you reproduce it and collect a debug log from the MDS (with D_TRACE & D_QUOTA enabled)? I think that will help show why the setquota failed with -EINVAL. Thanks.
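For reference, collecting such a log on the MDS usually amounts to enlarging the debug buffer, adding the quota/trace flags to the debug mask, reproducing the failure, and dumping the kernel debug buffer. A minimal sketch (the buffer size and output path are only placeholders):

# On the MDS: enlarge the debug buffer and add quota/trace to the debug mask
lctl set_param debug_mb=512
lctl set_param debug="+trace +quota"

# ... reproduce the sanity-quota test_7d failure ...

# Dump the kernel debug buffer to a file for analysis
lctl dk > /tmp/mds-debug.log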

Comment by Sarah Liu [ 09/Nov/12 ]

I tried twice but could not reproduce it.

Comment by Peter Jones [ 09/Nov/12 ]

Sarah, how frequently is this test failing when it is run by autotest?

Comment by Sarah Liu [ 09/Nov/12 ]

Sarah, how frequently is this test failing when it is run by autotest?

Twice; here is another failure: https://maloo.whamcloud.com/test_sets/c03ea79c-2a13-11e2-94cd-52540035b04c
If it is easy to reproduce on autotest, maybe we can change PTLDEBUG in autotest's config file to capture the extra debug info Niu needs.

Comment by Niu Yawei (Inactive) [ 11/Nov/12 ]

http://review.whamcloud.com/4510

I made a debug patch that enables D_TRACE & D_QUOTA before the setquota in test_7d; hopefully we can reproduce it in autotest.

Comment by Andreas Dilger [ 23/Nov/12 ]

This is reported as failing 11 times in the past week, according to: https://maloo.whamcloud.com/test_sets/query?utf8=%E2%9C%93&test_set[test_set_script_id]=&test_set[status]=&test_set[query_bugs]=LU-2284&test_session[test_host]=&test_session[test_group]=&test_session[user_id]=&test_session[query_date]=&test_session[query_recent_period]=&test_node[os_type_id]=&test_node[distribution_type_id]=&test_node[architecture_type_id]=&test_node[file_system_type_id]=&test_node[lustre_branch_id]=&test_node_network[network_type_id]=&commit=Update+results

The most recent of which is:
https://maloo.whamcloud.com/test_sets/ba59a336-33b4-11e2-a9d7-52540035b04c

Do any of these failures have the debug information you need to resolve this problem?

Comment by Niu Yawei (Inactive) [ 27/Nov/12 ]

The most recent of which is:
https://maloo.whamcloud.com/test_sets/ba59a336-33b4-11e2-a9d7-52540035b04c

Do any of these failures have the debug information you need to resolve this problem?

This one is based on e464b9bbac8e0e900777a4e8a655e156fb901c5a, which doesn't include the debug patch.

Comment by Niu Yawei (Inactive) [ 28/Nov/12 ]

I found a few failures with the debug patch, for example: https://maloo.whamcloud.com/test_sets/dc835e9c-3711-11e2-a300-52540035b04c

00040000:00000001:1.0:1353843577.744099:0:24544:0:(qmt_entry.c:399:qmt_validate_limits()) Process entered
00040000:00000001:1.0:1353843577.744100:0:24544:0:(qmt_entry.c:403:qmt_validate_limits()) Process leaving (rc=18446744073709551594 : -22 : ffffffffffffffea)
00040000:00000001:1.0:1353843577.744101:0:24544:0:(qmt_handler.c:241:qmt_setquota()) Process leaving via out (rc=18446744073709551594 : -22 : 0xffffffffffffffea)

The setquota for quota_2usr failed because it tried to set invalid limits (soft > hard). However, we only specified the hard limit in test_7d, which means the soft limit was read from the old settings.
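To illustrate the failure mode: the test only sets the block hard limit, roughly like the sketch below (the limit value is illustrative and in kbytes; the exact command in the test script may differ).

# Only the block hard limit (-B) is specified; the soft limit (-b) is not
# touched, so whatever soft limit is already stored on the QMT is kept.
lfs setquota -u quota_2usr -B 20480 /mnt/lustre

# If a stale soft limit larger than 20480 is still stored for quota_2usr,
# qmt_validate_limits() rejects the request with -EINVAL (soft > hard).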

Comment by Niu Yawei (Inactive) [ 28/Nov/12 ]

http://review.whamcloud.com/4695

D_QUOTA needs to be re-enabled after each MDS/OST restart; otherwise we will lose the D_QUOTA log.
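A rough sketch of how that could look in the test script, using the test-framework do_facet helper (the facet list below is illustrative, not the actual patch):

# Re-apply the debug mask on the servers after a restart, since the
# debug settings revert to the defaults when a node is remounted.
for facet in mds1 ost1 ost2; do
	do_facet $facet "lctl set_param debug=+quota"
	do_facet $facet "lctl set_param debug=+trace"
done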

Comment by Peter Jones [ 30/Nov/12 ]

Landed for 2.4

Comment by Niu Yawei (Inactive) [ 02/Dec/12 ]

The landed patch is just another debug patch; I hope future test runs will show why the soft limit for quota_2usr wasn't 0.

Comment by Niu Yawei (Inactive) [ 02/Dec/12 ]

The log shows the limits for quota_2usr (60001) were not zero (both the block & inode limits):

inode limits:

00040000:04000000:0.0:1354380042.909110:0:30239:0:(qmt_handler.c:173:qmt_getquota()) $$$ getquota qmt:lustre-QMT0000 pool:0-md id:60001 enforced:1 hard:1101004 soft:1048576 granted:0 time:0 qunit:262144 edquot:0 may_rel:0 revoke:0

block limits:

00040000:00000001:0.0:1354380042.909123:0:30239:0:(lquota_entry.c:289:lqe_read()) Process entered
00040000:04000000:0.0:1354380042.909124:0:30239:0:(lquota_entry.c:296:lqe_read()) $$$ read qmt:lustre-QMT0000 pool:0-dt id:60001 enforced:0 hard:0 soft:0 granted:0 time:0 qunit:0 edquot:0 may_rel:0 revoke:0
00040000:00000001:0.0:1354380042.909126:0:30239:0:(qmt_entry.c:67:qmt_lqe_read()) Process entered
00040000:00000001:0.0:1354380042.909127:0:30239:0:(lquota_disk.c:579:lquota_disk_read()) Process entered
00000004:00000001:0.0:1354380042.909129:0:30239:0:(osd_handler.c:3034:osd_index_iam_lookup()) Process entered
00000004:00000001:0.0:1354380042.909140:0:30239:0:(osd_handler.c:3078:osd_index_iam_lookup()) Process leaving (rc=1 : 1 : 1)
00040000:00000001:0.0:1354380042.909141:0:30239:0:(lquota_disk.c:590:lquota_disk_read()) Process leaving (rc=0 : 0 : 0)
00040000:04000000:0.0:1354380042.909142:0:30239:0:(qmt_entry.c:100:qmt_lqe_read()) $$$ read qmt:lustre-QMT0000 pool:0-dt id:60001 enforced:1 hard:15173004 soft:14450480 granted:0 time:0 qunit:0 edquot:0 may_rel:0 revoke:0
00040000:00000001:0.0:1354380042.909144:0:30239:0:(qmt_entry.c:101:qmt_lqe_read()) Process leaving (rc=0 : 0 : 0)

I really don't know where those limits came from. Another thing worth mentioning: some test results report the backend filesystem as zfs, but the debug log shows it is actually an ldiskfs backend (osd-ldiskfs symbols in the debug log). That means all of the test failures were on the ldiskfs backend; I suspect there is some defect lurking in iam?

Comment by Andreas Dilger [ 03/Dec/12 ]

The original failure for this patch was:

sanity-quota test_7d: @@@@@@ FAIL: quota_2usr write success, expect EDQUOT

but all of the recent failures report a different error:

sanity-quota test_7d: @@@@@@ FAIL: set quota for quota_2usr failed

Is the original problem fixed?

Comment by Niu Yawei (Inactive) [ 03/Dec/12 ]

Hi, Andreas. They are the same failure; the original test script doesn't fail the test on a setquota failure, and we changed this in the first debug patch.

Comment by Niu Yawei (Inactive) [ 05/Dec/12 ]

http://review.whamcloud.com/4752

The log shows the limits of quota_2usr were already non-zero before test_7d; unfortunately, we can't see when they became non-zero since the log was truncated. To catch more of the log, this debug patch verifies the limits at test start and at the end of each test.
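The check could look roughly like the following (the helper name and output parsing are hypothetical, not necessarily the code in the patch; $LFS and error are the usual test-framework conventions):

# Hypothetical helper: fail if a user already has non-zero block limits.
# In the 'lfs quota -u' output (see the listing above), fields 3 and 4 of
# the filesystem summary line are the block soft and hard limits in kbytes.
check_limits_zero() {
	local user=$1
	local mntpt=$2
	local limits=$($LFS quota -u $user $mntpt | awk 'NR==3 { print $3, $4 }')
	[ "$limits" = "0 0" ] || error "unexpected limits for $user: $limits"
}

check_limits_zero quota_2usr /mnt/lustre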

Comment by Niu Yawei (Inactive) [ 07/Dec/12 ]

I see where the limits came from now: they are left over from other tests which have quota enabled (see setup_quota of t-f), so we can simply clear all the limits before testing:
http://review.whamcloud.com/4766
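A minimal sketch of what clearing the limits amounts to, assuming the standard test users quota_usr and quota_2usr and the mount point from the log above (not necessarily the exact change in the patch):

# Reset block and inode limits (both soft and hard) to 0, i.e. unlimited,
# for the test users before the quota tests start.
for user in quota_usr quota_2usr; do
	lfs setquota -u $user -b 0 -B 0 -i 0 -I 0 /mnt/lustre
done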

Comment by Niu Yawei (Inactive) [ 09/Dec/12 ]

patch landed for 2.4
