[LU-5638] sanity-quota test_33 for ZFS-based backend: Used inodes for user 60000 isn't 0. Created: 18/Sep/14 Updated: 02/Aug/18 Resolved: 02/Aug/18 |
|
| Status: | Closed |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.7.0 |
| Fix Version/s: | Lustre 2.12.0 |
| Type: | Bug | Priority: | Major |
| Reporter: | Maloo | Assignee: | Hongchao Zhang |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | zfs | ||
| Issue Links: |
|
| Severity: | 3 |
| Rank (Obsolete): | 15788 |
| Description |
|
This issue was created by maloo for nasf <fan.yong@intel.com>. This issue relates to the following test suite run: https://testing.hpdd.intel.com/test_sets/a26efad0-3e95-11e4-916a-5254006e85c2. The sub-test test_33 failed with the following error:
Please provide additional information about the failure here. Info required for matching: sanity-quota 33 |
| Comments |
| Comment by Isaac Huang (Inactive) [ 18/Sep/14 ] |
|
Lots of errors like these in the debug logs. And 60000 = 0xea60, so it looked like the user hadn't created anything yet, and zap_lookup() returned -ENOENT for both the DMU ZAP and the OSD ZAP. But in this case osd_acct_index_lookup() already sets both rec->bspace and rec->ispace to 0. I'm a bit confused by the return values of osd_acct_index_lookup(), though: it returns either +1 or -errno, but lquota_disk_read() callers expect 0 for success, -ENOENT for a missing entry, and other negative values for errors. Someone who knows the quota code should comment. |
| Comment by Niu Yawei (Inactive) [ 22/Sep/14 ] |
dt_lookup() converted the return values. |
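Niu's point can be illustrated with a small sketch (assumed semantics for illustration, not the actual Lustre source): the low-level index method returns +1 when an entry is found, 0 when it is not, and a negative errno on error, and the dt_lookup() wrapper folds that into the 0 / -ENOENT convention the quota callers expect.

```c
#include <errno.h>

/*
 * Hypothetical sketch of the conversion dt_lookup() performs on the
 * low-level index return code (dio_rc): +1 means "entry found",
 * 0 means "no entry", and negative values are errnos.
 */
static int dt_lookup_sketch(int dio_rc)
{
	if (dio_rc > 0)
		return 0;        /* entry found -> success */
	if (dio_rc == 0)
		return -ENOENT;  /* no entry -> not found */
	return dio_rc;           /* negative errno passed through */
}
```

Under this reading, a -ENOENT seen by lquota_disk_read() is a normal "no usage recorded yet" answer rather than an error, which matches Isaac's observation that the record is already zeroed in that path.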
| Comment by Niu Yawei (Inactive) [ 22/Sep/14 ] |
These messages were from OSTs, and the test failed because of incorrect inode usage (which is inode usage on the MDT), so I think those OST messages are irrelevant. I checked the MDT log, but didn't find anything abnormal. I suspect this failure is caused by a race in updating the inode accounting ZAP: zap_increment_int() doesn't take a lock to make "lookup -> update" atomic. I believe the patch from As a short-term solution, perhaps we could introduce a lock in the osd layer to serialize zap_increment_int()? |
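The race described here is a classic lost update on a read-modify-write sequence. A minimal sketch of the proposed short-term fix, a lock in the osd layer serializing the increment (all names here are illustrative, not the real ZFS/Lustre API):

```c
#include <pthread.h>

/* Toy stand-in for one accounting ZAP entry (id -> inode count). */
static long long zap_entry;
static pthread_mutex_t zap_lock = PTHREAD_MUTEX_INITIALIZER;

/*
 * Hypothetical serialized version of zap_increment_int(): without the
 * lock, two threads can both read the same old value (the "lookup")
 * and one of the two increments is lost on write-back (the "update").
 */
static void zap_increment_sketch(long long delta)
{
	pthread_mutex_lock(&zap_lock);
	long long cur = zap_entry;   /* lookup */
	zap_entry = cur + delta;     /* update */
	pthread_mutex_unlock(&zap_lock);
}
```

The lock makes the lookup and update a single critical section, at the cost of taking a mutex on every accounting change, which is the expense Alex objects to below.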
| Comment by Isaac Huang (Inactive) [ 21/Oct/14 ] |
|
I think it makes sense to fix zap_increment_int() instead - it needs exclusive access to do zap_update() anyway. |
| Comment by Alex Zhuravlev [ 22/Oct/14 ] |
|
Doing so on every accounting change would be very expensive, IMO. Instead, we should be doing this at commit, where all "user" transactions are done and we have exclusive access by definition. |
| Comment by Isaac Huang (Inactive) [ 22/Oct/14 ] |
|
Yes, of course, batching the updates at sync time would be the best solution. Actually, that's exactly how the DMU updates DMU_USERUSED_OBJECT/DMU_GROUPUSED_OBJECT, in dsl_pool_sync(). |
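The batching scheme discussed here can be sketched roughly as follows (an assumed toy design for illustration, loosely mirroring how dsl_pool_sync() folds accumulated user/group usage into the DMU objects; none of these names are real ZFS APIs): deltas are accumulated cheaply while transactions run, and the accounting object is touched exactly once per ID at sync, when access is exclusive by definition.

```c
#define NIDS 4  /* toy table: array slot == uid, for illustration only */

/* Pending per-ID deltas, accumulated while transactions run.
 * (Real code would need per-txg structures and synchronization.) */
static long pending[NIDS];

/* The "on-disk" accounting object, only updated at sync time. */
static long acct_obj[NIDS];

/* Called on every create/unlink: just accumulate, no ZAP lookup/update. */
static void acct_note_delta(unsigned uid, long delta)
{
	pending[uid] += delta;
}

/* Called once at commit/sync, with exclusive access by definition. */
static void acct_sync_flush(void)
{
	for (unsigned i = 0; i < NIDS; i++) {
		acct_obj[i] += pending[i];  /* one update per id per txg */
		pending[i] = 0;
	}
}
```

The design choice is to trade a small amount of memory for pending deltas against eliminating both the per-change lock and the per-change lookup, since the race only matters when two updaters touch the on-disk value concurrently.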
| Comment by Alex Zhuravlev [ 22/Oct/14 ] |
|
Right, this is what I was trying to implement in http://review.whamcloud.com/#/c/10785/, but failed. |
| Comment by Isaac Huang (Inactive) [ 28/Oct/14 ] |
|
Johann has asked me to work on adding dnode accounting support to ZFS in |
| Comment by James Nunez (Inactive) [ 01/Jul/15 ] |
|
Another instance of this failure at https://testing.hpdd.intel.com/test_sets/2caf1f82-1f45-11e5-a4d6-5254006e85c2 |
| Comment by Frederic Saunier [ 06/Jul/15 ] |
|
These tests seem to be hitting occurrences of the same issue (sanity-quota 33, 34 and 35): |
| Comment by Gregoire Pichon [ 07/Jul/15 ] |
|
Two new occurrences on master
| Comment by James Nunez (Inactive) [ 07/Jul/15 ] |
|
sanity-quota test 11 started failing less than a week ago with inode quota issues. The test is failing with "Used inodes(1) is less than 2". It looks like the test 11 failures might be the same as, or related to, this ticket, because the MDS debug log contains the same messages as above: (osd_quota.c:120:osd_acct_index_lookup()) lustre-MDT0000: id ea60 not found in DMU accounting ZAP. In the cases below, sanity-quota tests 33, 34 and 35 all fail after test 11 fails: |
| Comment by Bruno Faccini (Inactive) [ 10/Jul/15 ] |
|
Three new, consecutive occurrences for the same master patch (http://review.whamcloud.com/14384/) review: |
| Comment by Gerrit Updater [ 13/Jul/15 ] |
|
James Nunez (james.a.nunez@intel.com) uploaded a new patch: http://review.whamcloud.com/15590 |
| Comment by James Nunez (Inactive) [ 13/Jul/15 ] |
|
Temporarily skipping sanity-quota tests 11 and 33 for review-zfs-part-* until the patch for |
| Comment by Bob Glossman (Inactive) [ 17/Jul/15 ] |
|
another on master: |
| Comment by Gerrit Updater [ 18/Jul/15 ] |
|
Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/15590/ |
| Comment by Peter Jones [ 18/Jul/15 ] |
|
Landed for 2.8 |
| Comment by James Nunez (Inactive) [ 19/Jul/15 ] |
|
This issue is not resolved. Only a patch to skip the tests was landed. The original problem causing sanity-quota 11, 33, 34, and 35 still exists. |
| Comment by Andreas Dilger [ 21/Apr/17 ] |
|
There is a belief that this was caused by slow ZFS metadata performance, which has been improved in Lustre 2.9. It would be worthwhile to retest these skipped tests (with ZFS of course) to see if they now pass reliably. |
| Comment by Bob Glossman (Inactive) [ 02/Jun/17 ] |
|
Being seen in non-ZFS tests too; example: I note that test 33 is skipped with ALWAYS_EXCEPT for test runs on ZFS. Maybe it needs to be skipped all the time on everything. |
| Comment by Bob Glossman (Inactive) [ 03/Jun/17 ] |
|
another on master: |
| Comment by Andreas Dilger [ 04/Jun/17 ] |
|
I don't think skipping the test is the right way forward, except as a short-term workaround. Instead, someone needs to take the time to figure out what file is being left behind with this UID. |
| Comment by Niu Yawei (Inactive) [ 05/Jun/17 ] |
|
I think the old issue should have been fixed once the The new occurrences on ldiskfs are another issue; I believe it's a defect in project quota: sanity-quota test_33: @@@@@@ FAIL: Used space for project 1000:18432, expected:20480
I think we should open a new ticket for it. |
| Comment by Niu Yawei (Inactive) [ 05/Jun/17 ] |
|
The new issue is created at |
| Comment by Gerrit Updater [ 05/Jun/17 ] |
|
Niu Yawei (yawei.niu@intel.com) uploaded a new patch: https://review.whamcloud.com/27423 |
| Comment by James Nunez (Inactive) [ 24/Jul/17 ] |
|
It looks like sanity-quota test 33 is still failing with ZFS servers. Logs for two recent failures are at: |
| Comment by Dilip Krishnagiri (Inactive) [ 09/Aug/17 ] |
|
sanity-quota test 33 is failing. Maloo link with the needed information: https://testing.hpdd.intel.com/test_sets/8442a52c-7bad-11e7-a168-5254006e85c2 Error: 'Used inode for user 60000 is 1, expected 10' |
| Comment by Peter Jones [ 09/Aug/17 ] |
|
Hongchao, could you please advise on this one? Thanks, Peter |
| Comment by Hongchao Zhang [ 26/Mar/18 ] |
|
There is no abnormal information in the logs, and it could still be related to ZFS performance. |
| Comment by Andreas Dilger [ 16/May/18 ] |
|
It appears that this was "fixed" by the landing of https://review.whamcloud.com/27093 which changed the detection of ZFS project quotas but broke detection of ZFS dnode accounting. That patch landed to b2_10 on Dec 20, 2017 (master landing on Nov 9, 2017). |
| Comment by Gerrit Updater [ 11/Jun/18 ] |
|
James Nunez (james.a.nunez@intel.com) uploaded a new patch: https://review.whamcloud.com/32694 |
| Comment by Gerrit Updater [ 24/Jul/18 ] |
|
Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/32694/ |
| Comment by James Nunez (Inactive) [ 02/Aug/18 ] |
|
Patch landed to remove sanity-quota 33 from the ALWAYS_EXCEPT list for 2.11.54. |