[LU-13187] sanity test_129: current dir size 4096, previous limit 20480 Created: 01/Feb/20 Updated: 23/Sep/20 Resolved: 10/Sep/20 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.12.5 |
| Fix Version/s: | Lustre 2.14.0, Lustre 2.12.6 |
| Type: | Bug | Priority: | Major |
| Reporter: | Maloo | Assignee: | Dongyang Li |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Issue Links: |
|
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
This issue was created by maloo for Andreas Dilger <adilger@whamcloud.com>
This issue relates to the following test suite run:
test_129 failed with the following error: current dir size 4096, previous limit 20480
It looks like this started on 2020-01-28 when a number of patches landed. |
| Comments |
| Comment by Gerrit Updater [ 22/Feb/20 ] |
|
Andreas Dilger (adilger@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/37683 |
| Comment by Andreas Dilger [ 01/May/20 ] |
|
+42 failures in the past 4 weeks (2020-04-04 - 2020-04-30) |
| Comment by Andreas Dilger [ 01/May/20 ] |
|
Since this failure is only relevant to the ldiskfs code, it seems likely that the only ldiskfs patch that landed on 2020-01-28 is the culprit, namely patch https://review.whamcloud.com/37116 |
| Comment by Arshad Hussain [ 03/May/20 ] |
|
+1 on Master: https://testing.whamcloud.com/sub_tests/d7a74936-6a65-4d13-8de0-1749e351fb84 |
| Comment by Andreas Dilger [ 06/May/20 ] |
|
Dongyang, can you please look into this? It is being hit repeatedly in e2fsprogs testing, but also in regular review testing. |
| Comment by Dongyang Li [ 08/May/20 ] |
|
I'm having a hard time reproducing this, and the patch https://review.whamcloud.com/37116 is just about truncate; it should have nothing to do with the size of the directory. |
| Comment by Gerrit Updater [ 09/May/20 ] |
|
Andreas Dilger (adilger@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/38550 |
| Comment by Arshad Hussain [ 16/May/20 ] |
|
+1 on master https://testing.whamcloud.com/sub_tests/963e28b3-1f52-4221-ad58-5c7b5871893b |
| Comment by Gerrit Updater [ 21/May/20 ] |
|
Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/38550/ |
| Comment by Andreas Dilger [ 21/May/20 ] |
|
So the patch has landed, but the test is still failing (although a bit more verbosely than before):

CMD: trevis-13vm8 echo 32768 >/sys/fs/ldiskfs/dm-5/max_dir_size
CMD: trevis-13vm8 echo 24576 >/sys/fs/ldiskfs/dm-5/warning_dir_size
mcreate: cannot create `/mnt/lustre/d129.sanity/file_base_93' with mode 0100644: No space left on device
rc=28 returned as expected after 93 files
total: 5 open/close in 0.01 seconds: 634.44 ops/second
dirsize 4096 < 32768 after 93 files

Looking at the debug logs it is fairly clear that the MDS is returning -ENOSPC=-28 after creating only 93 files, but it isn't printing any errors related to reaching the directory limits. So it seems that this problem may be a defect in the RHEL7.8 ldiskfs code, or possibly in the upstream ext4 code for that kernel. Could someone with a RHEL7.8 or 8.0 kernel run a manual test to see whether the /sys/fs/ldiskfs/XXX/max_dir_size setting is working at all? |
| Comment by Dongyang Li [ 21/May/20 ] |
|
I've done that on RHEL7.8 some time ago and it's working:

[root@centos7 ~]# echo 20480 > /sys/fs/ldiskfs/vdb/max_dir_size
[root@centos7 ~]# echo 20480 > /sys/fs/ldiskfs/vdb/warning_dir_size
[root@centos7 ~]# sh ./test_129.sh
open(O_RDWR|O_CREAT): No space left on device
28 655
[root@centos7 ~]# stat /mnt/testdir
  File: ‘/mnt/testdir’
  Size: 20480      Blocks: 40         IO Block: 4096   directory
Device: fd10h/64784d    Inode: 1048577     Links: 2
Access: (0755/drwxr-xr-x)  Uid: (    0/    root)   Gid: (    0/    root)
Access: 2020-05-07 21:35:00.688000000 +1000
Modify: 2020-05-07 21:45:55.832000000 +1000
Change: 2020-05-07 21:45:55.832000000 +1000
 Birth: -

test_129.sh just creates files under testdir using multiop and prints the error code (28) and the number of files created (655). I haven't done it on RHEL8, but I think the ldiskfs side is fine. |
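For anyone who wants to repeat that manual check outside the Lustre test framework, here is a minimal stand-alone sketch in C (not the actual test_129.sh; the mount point /mnt/testdir and the file-count ceiling are assumptions). It creates files in one directory until open() fails and then prints the errno and the file count, which should be 28 (ENOSPC) once max_dir_size is exceeded.

    /* Minimal reproducer sketch (hypothetical; not the real test_129.sh).
     * Assumes an ldiskfs/ext4 filesystem is mounted at /mnt/testdir and that
     * /sys/fs/ldiskfs/<dev>/max_dir_size has already been set by hand. */
    #include <errno.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    int main(void)
    {
            char path[256];
            int i, fd;

            for (i = 0; i < 100000; i++) {
                    snprintf(path, sizeof(path), "/mnt/testdir/file_%06d", i);
                    fd = open(path, O_RDWR | O_CREAT, 0644);
                    if (fd < 0) {
                            /* With max_dir_size enforced, errno should be 28 */
                            printf("open(O_RDWR|O_CREAT): %s\n", strerror(errno));
                            printf("%d %d\n", errno, i);
                            return 0;
                    }
                    close(fd);
            }
            printf("no failure after %d files - limit not enforced?\n", i);
            return 1;
    }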
| Comment by Andreas Dilger [ 21/May/20 ] |
|
Dongyang, can you please submit a patch to test_129 that sets debug=-1 on the client and MDS, and maybe adds CDEBUG() calls to osd-ldiskfs, so we can see where the -28 is coming from? I was looking at the debug logs from the current failure and they didn't show enough detail. |
| Comment by Gerrit Updater [ 22/May/20 ] |
|
Li Dongyang (dongyangli@ddn.com) uploaded a new patch: https://review.whamcloud.com/38700 |
| Comment by Chris Horn [ 28/May/20 ] |
|
+1 on master https://testing.whamcloud.com/test_sets/52fe5475-a2f3-4bd6-a2b7-c1dc99b20590 |
| Comment by Jian Yu [ 14/Jun/20 ] |
|
+1 on Lustre b2_12 branch: https://testing.whamcloud.com/test_sets/5e20954f-2daf-47ff-9b6c-eacbf7a41dfc |
| Comment by Emoly Liu [ 17/Jun/20 ] |
|
more on master: |
| Comment by Chris Horn [ 18/Jun/20 ] |
|
+1 on master: https://testing.whamcloud.com/test_sets/1e95c770-b87b-4de6-9a58-08d40241c712 |
| Comment by Chris Horn [ 07/Aug/20 ] |
|
+1 on master https://testing.whamcloud.com/test_sets/62ce778c-aac4-4504-a1dc-ecd559e78533 |
| Comment by James A Simmons [ 07/Aug/20 ] |
|
Note that Neil also ran into this problem on SUSE15 and pushed a fix here: https://review.whamcloud.com/#/c/39571/ The same problem could exist for the RHEL platforms. |
| Comment by Chris Horn [ 20/Aug/20 ] |
|
+1 on master: https://testing.whamcloud.com/test_sets/18024581-0159-4d24-84ee-9ae6554ced77 |
| Comment by Andreas Dilger [ 30/Aug/20 ] |
|
+3 on master:
|
| Comment by Andreas Dilger [ 30/Aug/20 ] |
|
There were 55 failures of this subtest in the last week, which is about a 10% failure rate, but since sanity runs multiple times per patch, it affects patch landings more than that number suggests. |
| Comment by Gerrit Updater [ 31/Aug/20 ] |
|
James Simmons (jsimmons@infradead.org) uploaded a new patch: https://review.whamcloud.com/39773 |
| Comment by James A Simmons [ 31/Aug/20 ] |
|
RHEL8 and Ubuntu overlap for the ldiskfs patches, so they will need to be updated at the same time. I can update Ubuntu, but I don't have a RHEL8 system to fix it up on. |
| Comment by Neil Brown [ 02/Sep/20 ] |
|
I think this problem is caused by some metadata directory trying to grow. I added some tracing and found that the call to osd_ldiskfs_append() in iam_new_node() was failing with ENOSPC. Maybe the best fix would be to add a test to ldiskfs_append() to check whether it is a special Lustre metadata directory, and if so bypass the dir limit. Is there an easy way to detect Lustre metadata directories? |
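For context on where that ENOSPC originates: the directory-size limit is enforced at the top of ext4_append(), and ldiskfs carries the equivalent check. The snippet below is a paraphrase of the upstream fs/ext4/namei.c logic, not the exact ldiskfs source for any particular kernel, and is only meant to show which condition trips when iam_new_node() tries to grow an IAM/metadata object.

    /* Paraphrase of the size-limit check in upstream ext4_append()
     * (fs/ext4/namei.c); the ldiskfs copy is equivalent. This is the
     * check that propagates -ENOSPC up through osd_ldiskfs_append()
     * once the directory reaches max_dir_size. */
    if (unlikely(EXT4_SB(inode->i_sb)->s_max_dir_size_kb &&
                 ((inode->i_size >> 10) >=
                  EXT4_SB(inode->i_sb)->s_max_dir_size_kb)))
            return ERR_PTR(-ENOSPC);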
| Comment by Dongyang Li [ 03/Sep/20 ] |
|
Great, I just could not reproduce the problem. If it's failing with ENOSPC from osd_ldiskfs_append(), I think we can add a new parameter to ldiskfs/ext4_append() to bypass the limit check for the OI-related code paths, like iam_new_node(), iam_lfix_create() and iam_lvar_create(). The normal directory path uses a different route: osd_ldiskfs_add_entry()->__ldiskfs/ext4_add_entry()
|
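Purely as an illustration of that proposal (the parameter name is hypothetical, and as the next comment notes, the patch that actually landed may take a different shape), the idea amounts to something like:

    /* Hypothetical sketch of the proposed bypass flag; not the landed patch.
     * IAM/OI callers (iam_new_node(), iam_lfix_create(), iam_lvar_create())
     * would pass skip_dir_limit = true so the max_dir_size check is skipped,
     * while the normal __ldiskfs_add_entry() path keeps enforcing it. */
    static struct buffer_head *ldiskfs_append(handle_t *handle,
                                              struct inode *inode,
                                              ext4_lblk_t *block,
                                              bool skip_dir_limit)
    {
            if (!skip_dir_limit &&
                EXT4_SB(inode->i_sb)->s_max_dir_size_kb &&
                (inode->i_size >> 10) >= EXT4_SB(inode->i_sb)->s_max_dir_size_kb)
                    return ERR_PTR(-ENOSPC);

            /* ... existing block allocation path unchanged ... */
    }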
| Comment by Andreas Dilger [ 03/Sep/20 ] |
|
Neil, thanks for tracking this down. Dongyang, I think it would be better to avoid changing the API for ext4_append(), as that would need even more changes to the core ext4 code. I think there are two options that are relatively simple:
Probably the second option is less intrusive, as it is likely that patch could avoid conflicts with ext4-pdirop.patch and ext4-misc.patch, and hopefully would not need to be different for every kernel. |
| Comment by Gerrit Updater [ 03/Sep/20 ] |
|
Li Dongyang (dongyangli@ddn.com) uploaded a new patch: https://review.whamcloud.com/39823 |
| Comment by Gerrit Updater [ 10/Sep/20 ] |
|
Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/39823/ |
| Comment by Peter Jones [ 10/Sep/20 ] |
|
Landed for 2.14 |
| Comment by Gerrit Updater [ 11/Sep/20 ] |
|
Andreas Dilger (adilger@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/39882 |
| Comment by Gerrit Updater [ 19/Sep/20 ] |
|
Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/39882/ |