[LU-5638] sanity-quota test_33 for ZFS-based backend: Used inodes for user 60000 isn't 0. 1 Created: 18/Sep/14  Updated: 02/Aug/18  Resolved: 02/Aug/18

Status: Closed
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.7.0
Fix Version/s: Lustre 2.12.0

Type: Bug Priority: Major
Reporter: Maloo Assignee: Hongchao Zhang
Resolution: Fixed Votes: 0
Labels: zfs

Issue Links:
Related
is related to LU-11024 Broken inode accounting of MDT on ZFS Resolved
is related to LU-2435 inode accounting in osd-zfs is racy Resolved
is related to LU-7991 Add project quota for ZFS Resolved
is related to LU-9592 sanity-quota test cases 33 remove fro... Resolved
Severity: 3
Rank (Obsolete): 15788

 Description   

This issue was created by maloo for nasf <fan.yong@intel.com>

This issue relates to the following test suite run: https://testing.hpdd.intel.com/test_sets/a26efad0-3e95-11e4-916a-5254006e85c2.

The sub-test test_33 failed with the following error:

Used inodes for user 60000 isn't 0. 1

Please provide additional information about the failure here.

Info required for matching: sanity-quota 33



 Comments   
Comment by Isaac Huang (Inactive) [ 18/Sep/14 ]

Lots of errors like these in the debug logs:
00000001:04000000:0.0:1410962344.043221:0:30818:0:(osd_quota.c:98:osd_acct_index_lookup()) lustre-OST0001: id ea60 not found in DMU accounting ZAP
00000001:04000000:0.0:1410962344.043236:0:30818:0:(osd_quota.c:117:osd_acct_index_lookup()) lustre-OST0001: id ea60 not found in accounting ZAP

And 60000 = 0xea60, so it looks like the user hadn't created anything yet and zap_lookup() returned -ENOENT for both the DMU ZAP and the OSD ZAP. In that case osd_acct_index_lookup() already sets both rec->bspace and rec->ispace to 0. I'm a bit confused by the return values of osd_acct_index_lookup(), though: it returns either +1 or -errno, but lquota_disk_read() callers expect 0 for success and -ENOENT or another -errno on failure.

Someone who knows the quota code should comment.
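For reference, a minimal sketch of the pattern described, paraphrasing osd_acct_index_lookup() in lustre/osd-zfs/osd_quota.c (the function name, parameter list and trimmed error handling here are simplified stand-ins, not the exact source): an id that has created nothing yet simply has no entry in the accounting ZAPs, so -ENOENT is downgraded to "zero usage" and the function still returns +1 ("record found") per the index-ops convention.

        static int acct_index_lookup_sketch(objset_t *os, uint64_t dmu_zap,
                                            uint64_t osd_zap, const char *buf,
                                            struct lquota_acct_rec *rec)
        {
                int rc;

                rec->bspace = rec->ispace = 0;

                /* space usage maintained by the DMU itself */
                rc = -zap_lookup(os, dmu_zap, buf, sizeof(uint64_t), 1,
                                 &rec->bspace);
                if (rc && rc != -ENOENT)
                        return rc;      /* a real error */

                /* inode usage kept by osd-zfs in its own accounting ZAP */
                rc = -zap_lookup(os, osd_zap, buf, sizeof(uint64_t), 1,
                                 &rec->ispace);
                if (rc && rc != -ENOENT)
                        return rc;

                return 1;       /* "found": -ENOENT only means zero usage */
        }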

Comment by Niu Yawei (Inactive) [ 22/Sep/14 ]

I'm a bit confused by the return values of osd_acct_index_lookup(), though: it returns either +1 or -errno, but lquota_disk_read() callers expect 0 for success and -ENOENT or another -errno on failure.

dt_lookup() converts the return values.
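Paraphrasing the dt_lookup() wrapper in lustre/include/dt_object.h (argument list trimmed): an index-ops lookup returns a positive value when the key is found, 0 when it is not, or -errno on error, and the wrapper maps that onto the 0 / -ENOENT convention the quota callers expect. (osd_acct_index_lookup() never returns 0, since it handles -ENOENT internally as zero usage.)

        rc = dt->do_index_ops->dio_lookup(env, dt, rec, key);
        if (rc > 0)
                rc = 0;                 /* key found */
        else if (rc == 0)
                rc = -ENOENT;           /* key not present */
        return rc;                      /* otherwise pass -errno through */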

Comment by Niu Yawei (Inactive) [ 22/Sep/14 ]

Lots of errors like these in the debug logs:
00000001:04000000:0.0:1410962344.043221:0:30818:0:(osd_quota.c:98:osd_acct_index_lookup()) lustre-OST0001: id ea60 not found in DMU accounting ZAP
00000001:04000000:0.0:1410962344.043236:0:30818:0:(osd_quota.c:117:osd_acct_index_lookup()) lustre-OST0001: id ea60 not found in accounting ZAP

And 60000 = 0xea60, so it looks like the user hadn't created anything yet and zap_lookup() returned -ENOENT for both the DMU ZAP and the OSD ZAP. In that case osd_acct_index_lookup() already sets both rec->bspace and rec->ispace to 0.

These messages are from OSTs, and the test failed on incorrect inode usage, which is accounted on the MDT, so I think those OST messages are irrelevant.

I checked the MDT log but didn't find anything abnormal. I suspect this failure is caused by a race when updating the inode accounting ZAP: zap_increment_int() doesn't take a lock to make its "lookup -> update" sequence atomic. I believe the patch from LU-2600 could fix this problem; unfortunately, that patch was reverted for causing a regression. I'm not sure if Alex is still working on this.

As a short-term solution, perhaps we could introduce a lock in the osd layer to serialize zap_increment_int()?
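The race in outline: zap_increment_int() (module/zfs/zap.c) does a non-atomic lookup -> add -> update, so two concurrent callers can both read the same old count and one increment is lost. Something along these lines could serialize it in the osd layer (a hypothetical sketch, not a tested patch, and not the fix that eventually landed via LU-2435):

        /* one global lock just for illustration; per-ZAP would scale better */
        static DEFINE_MUTEX(osd_acct_mutex);

        static int osd_zap_increment_serialized(objset_t *os, uint64_t zapobj,
                                                uint64_t id, int64_t delta,
                                                dmu_tx_t *tx)
        {
                int rc;

                /* make the whole lookup -> update read-modify-write atomic */
                mutex_lock(&osd_acct_mutex);
                rc = -zap_increment_int(os, zapobj, id, delta, tx);
                mutex_unlock(&osd_acct_mutex);

                return rc;
        }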

Comment by Isaac Huang (Inactive) [ 21/Oct/14 ]

I think it makes sense to fix zap_increment_int() instead - it needs exclusive access to do zap_update() anyway.

Comment by Alex Zhuravlev [ 22/Oct/14 ]

Doing so on every accounting change would be very expensive, IMO. Instead we should do this at commit time, where all "user" transactions are done and we have exclusive access by definition.

Comment by Isaac Huang (Inactive) [ 22/Oct/14 ]

Yes, of course, batching the updates at sync time would be the best solution. Actually, that's exactly how the DMU updates DMU_USERUSED_OBJECT/DMU_GROUPUSED_OBJECT: in dsl_pool_sync() -> dmu_objset_do_userquota_updates() -> do_userquota_update() -> zap_increment_int().
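Roughly, the pattern is as follows (names and structure here are simplified/hypothetical; the real code is in module/zfs/dmu_objset.c): deltas accumulated during the txg are applied once per id from the sync thread, which has exclusive access by definition, so no extra locking is needed.

        typedef struct userquota_delta {
                uint64_t        uqd_id;         /* uid or gid */
                int64_t         uqd_delta;      /* net change for this txg */
        } userquota_delta_t;

        static void do_userquota_update_sketch(objset_t *os,
                                               const userquota_delta_t *uqd,
                                               dmu_tx_t *tx)
        {
                /* one ZAP read-modify-write per id per txg, not per operation */
                if (uqd->uqd_delta != 0)
                        VERIFY0(zap_increment_int(os, DMU_USERUSED_OBJECT,
                                                  uqd->uqd_id, uqd->uqd_delta,
                                                  tx));
        }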

Comment by Alex Zhuravlev [ 22/Oct/14 ]

Right, this is what I was trying to implement in http://review.whamcloud.com/#/c/10785/, but that attempt failed.

Comment by Isaac Huang (Inactive) [ 28/Oct/14 ]

Johann has asked me to work on adding dnode accounting support to ZFS in LU-2435.

Comment by James Nunez (Inactive) [ 01/Jul/15 ]

Another instance of this failure at https://testing.hpdd.intel.com/test_sets/2caf1f82-1f45-11e5-a4d6-5254006e85c2

Comment by Frederic Saunier [ 06/Jul/15 ]

These tests seem to be hitting occurrences of the same issue (sanity-quota 33, 34 and 35).

Comment by Gregoire Pichon [ 07/Jul/15 ]

Two new occurrences on master:
https://testing.hpdd.intel.com/test_sessions/761c8276-2419-11e5-91e9-5254006e85c2
https://testing.hpdd.intel.com/test_sessions/1b40755e-242a-11e5-87f6-5254006e85c2

Comment by James Nunez (Inactive) [ 07/Jul/15 ]

sanity-quota test 11 started failing less than a week ago with inode quota issues. The test fails with:

Used inodes(1) is less than 2

It looks like the test 11 failures might be the same as, or related to, this ticket, because the MDS debug_log contains the same messages as above:

(osd_quota.c:120:osd_acct_index_lookup()) lustre-MDT0000: id ea60 not found in DMU accounting ZAP

In the cases below, sanity-quota tests 33, 34 and 35 all fail after test 11 fails:
2015-07-03 12:14:19 - https://testing.hpdd.intel.com/test_sets/5ee4d41e-21ad-11e5-a979-5254006e85c2
2015-07-03 19:34:12 - https://testing.hpdd.intel.com/test_sets/41c9e6e6-21e7-11e5-b398-5254006e85c2
2015-07-05 17:43:49 - https://testing.hpdd.intel.com/test_sets/ff5d7856-2365-11e5-a6ad-5254006e85c2
2015-07-06 17:15:26 - https://testing.hpdd.intel.com/test_sets/7a443f68-242a-11e5-87f6-5254006e85c2
2015-07-06 21:12:05 - https://testing.hpdd.intel.com/test_sets/080b7f58-244b-11e5-91e9-5254006e85c2

Comment by Bruno Faccini (Inactive) [ 10/Jul/15 ]

Three new, consecutive occurrences for the same master patch (http://review.whamcloud.com/14384/) review:
https://testing.hpdd.intel.com/test_sessions/da77e524-21e6-11e5-b398-5254006e85c2
https://testing.hpdd.intel.com/test_sessions/0a6988f4-2585-11e5-a6b1-5254006e85c2
https://testing.hpdd.intel.com/test_sessions/72ee971e-2627-11e5-8b33-5254006e85c2
Is there any ongoing activity for this issue?

Comment by Gerrit Updater [ 13/Jul/15 ]

James Nunez (james.a.nunez@intel.com) uploaded a new patch: http://review.whamcloud.com/15590
Subject: LU-5638 tests: Skip sanity-quota 11 and 33 for ZFS
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: ff39144cf538fc5cac88b1ad61d16640b7854b09

Comment by James Nunez (Inactive) [ 13/Jul/15 ]

Temporarily skipping sanity-quota tests 11 and 33 for review-zfs-part-* until the patch for LU-2435 lands.

Patch at http://review.whamcloud.com/#/c/15590

Comment by Bob Glossman (Inactive) [ 17/Jul/15 ]

another on master:
https://testing.hpdd.intel.com/test_sets/e8dfb4f0-2c15-11e5-8c67-5254006e85c2

Comment by Gerrit Updater [ 18/Jul/15 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/15590/
Subject: LU-5638 tests: Skip sanity-quota tests for ZFS
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 87a28e4280e08947d852a317f49be174cc6f4cb6

Comment by Peter Jones [ 18/Jul/15 ]

Landed for 2.8

Comment by James Nunez (Inactive) [ 19/Jul/15 ]

This issue is not resolved. Only a patch to skip the tests was landed. The original problem causing sanity-quota 11, 33, 34, and 35 still exists.

Comment by Andreas Dilger [ 21/Apr/17 ]

There is a belief that this was caused by slow ZFS metadata performance, which has been improved in Lustre 2.9. It would be worthwhile to retest these skipped tests (with ZFS of course) to see if they now pass reliably.

Comment by Bob Glossman (Inactive) [ 02/Jun/17 ]

This is being seen in non-ZFS tests too. Example:
https://testing.hpdd.intel.com/test_sets/834ddd66-472e-11e7-b3fe-5254006e85c2

I note that test 33 is skipped via ALWAYS_EXCEPT for test runs on ZFS. Maybe it needs to be skipped all the time, on all backends.

Comment by Bob Glossman (Inactive) [ 03/Jun/17 ]

another on master:
https://testing.hpdd.intel.com/test_sets/1637dff4-4839-11e7-bc6c-5254006e85c2

Comment by Andreas Dilger [ 04/Jun/17 ]

I don't think skipping the test is the right way forward, except as a short-term workaround. Instead, someone needs to take the time to figure out what file is being left behind with this UID.

Comment by Niu Yawei (Inactive) [ 05/Jun/17 ]

I think the old issue should have been fixed once LU-2435 landed, so we can re-enable test_33 for ZFS testing now. I'll cook a patch to re-enable it.

The new occurrences on ldiskfs are another issue; I believe it's a defect in project quota:

sanity-quota test_33: @@@@@@ FAIL: Used space for project 1000:18432, expected:20480

I think we should open a new ticket for it.

Comment by Niu Yawei (Inactive) [ 05/Jun/17 ]

The new ticket has been filed as LU-9597.

Comment by Gerrit Updater [ 05/Jun/17 ]

Niu Yawei (yawei.niu@intel.com) uploaded a new patch: https://review.whamcloud.com/27423
Subject: LU-5638 tests: re-enable zfs quota tests
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: ec72f9a800214354e537b3f95c83c1ea509fa178

Comment by James Nunez (Inactive) [ 24/Jul/17 ]

It looks like sanity-quota test 33 is still failing with ZFS servers.

Logs for two recent failures are at:
https://testing.hpdd.intel.com/test_sets/506f3d2e-480d-11e7-91f4-5254006e85c2
https://testing.hpdd.intel.com/test_sets/ac775cfe-4a84-11e7-91f4-5254006e85c2

Comment by Dilip Krishnagiri (Inactive) [ 09/Aug/17 ]

sanity-quota test 33 is failing. Maloo link with the needed information:

https://testing.hpdd.intel.com/test_sets/8442a52c-7bad-11e7-a168-5254006e85c2

Error: 'Used inode for user 60000 is 1, expected 10'
Wait for setattr on objects finished...
sleep 5 for ZFS OSD
Waiting for local destroys to complete
CMD: onyx-45vm7,onyx-45vm8 lctl set_param -n osd*.MDT.force_sync=1
CMD: onyx-45vm10 lctl set_param -n osd*.OS.force_sync=1
Verify disk usage after write
Verify inode usage after write
sanity-quota test_33: @@@@@@ FAIL: Used inode for user 60000 is 1, expected 10
Trace dump:
= /usr/lib64/lustre/tests/test-framework.sh:5291:error()
= /usr/lib64/lustre/tests/sanity-quota.sh:2423:test_33()
= /usr/lib64/lustre/tests/test-framework.sh:5567:run_one()
= /usr/lib64/lustre/tests/test-framework.sh:5606:run_one_logged()
= /usr/lib64/lustre/tests/test-framework.sh:5453:run_test()
= /usr/lib64/lustre/tests/sanity-quota.sh:2450:main()
Dumping lctl log to /test_logs/2017-08-07/lustre-reviews-el7-x86_64-custom-1_101_1_49353_-70097615897520-184409/sanity-quota.test_33.*.1502136383.log
CMD: onyx-45vm10,onyx-45vm1.onyx.hpdd.intel.com,onyx-45vm2,onyx-45vm7,onyx-45vm8 /usr/sbin/lctl dk > /test_logs/2017-08-07/lustre-reviews-el7-x86_64-custom-1_101_1_49353_-70097615897520-184409/sanity-quota.test_33.debug_log.\$(hostname -s).1502136383.log;
dmesg > /test_logs/2017-08-07/lustre-reviews-el7-x86_64-custom-1_101_1_49353_-70097615897520-184409/sanity-quota.test_33.dmesg.\$(hostname -s).1502136383.log
Resetting fail_loc on all nodes...CMD: onyx-45vm10,onyx-45vm1.onyx.hpdd.intel.com,onyx-45vm2,onyx-45vm7,onyx-45vm8 lctl set_param -n fail_loc=0 fail_val=0 2>/dev/null
done.

Comment by Peter Jones [ 09/Aug/17 ]

Hongchao

Could you please advise on this one?

Thanks

Peter

Comment by Hongchao Zhang [ 26/Mar/18 ]

There is no abnormal information in the logs; it could still be related to ZFS performance. This issue has not occurred since Dec 19, 2017.

Comment by Andreas Dilger [ 16/May/18 ]

It appears that this was "fixed" by the landing of https://review.whamcloud.com/27093 which changed the detection of ZFS project quotas but broke detection of ZFS dnode accounting. That patch landed to b2_10 on Dec 20, 2017 (master landing on Nov 9, 2017).

Comment by Gerrit Updater [ 11/Jun/18 ]

James Nunez (james.a.nunez@intel.com) uploaded a new patch: https://review.whamcloud.com/32694
Subject: LU-5638 tests: resume running sanity-quota tests
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 78fee01723aece184b4e328335bab5e120667583

Comment by Gerrit Updater [ 24/Jul/18 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/32694/
Subject: LU-5638 tests: resume running sanity-quota tests
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 2fa651d7c9bbeb2ce87e149f25d753c7b66640ab

Comment by James Nunez (Inactive) [ 02/Aug/18 ]

Patch landed to remove sanity-quota 33 from the ALWAYS_EXCEPT list for 2.11.54.
