[LU-13187] sanity test_129: current dir size 4096, previous limit 20480 Created: 01/Feb/20  Updated: 23/Sep/20  Resolved: 10/Sep/20

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.12.5
Fix Version/s: Lustre 2.14.0, Lustre 2.12.6

Type: Bug
Priority: Major
Reporter: Maloo
Assignee: Dongyang Li
Resolution: Fixed
Votes: 0
Labels: None

Issue Links:
Related
is related to LU-11310 support for SLES 15 Resolved
is related to LU-13916 sanity test_129: dirsize 4096 < 32768... Resolved
Severity: 3

 Description   

This issue was created by maloo for Andreas Dilger <adilger@whamcloud.com>

This issue relates to the following test suite run:
https://testing.whamcloud.com/test_sets/2c10b9da-44b8-11ea-bffa-52540065bddc

test_129 failed with the following error:

current dir size 4096,  previous limit 20480

It looks like this started on 2020-01-28 when a number of patches landed.

VVVVVVV DO NOT REMOVE LINES BELOW, Added by Maloo for auto-association VVVVVVV
sanity test_129 - current dir size 4096, previous limit 20480
sanity test_129 - dirsize 4096 < 32768 after 93 files



 Comments   
Comment by Gerrit Updater [ 22/Feb/20 ]

Andreas Dilger (adilger@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/37683
Subject: LU-13187 tests: add ONLY_REPEAT parameter to repeat subtests
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 23dd0d69dc325e206a653d1379661e27d8320fd9

Comment by Andreas Dilger [ 01/May/20 ]

+42 failures in the past 4 weeks (2020-04-04 - 2020-04-30)

Comment by Andreas Dilger [ 01/May/20 ]

Since this failure is only relevant for the ldiskfs code, it seems likely that the only ldiskfs patch that landed on 2020-01-28 is the culprit, namely patch https://review.whamcloud.com/37116 "LU-12977 ldiskfs: properly take inode_lock() for truncates". Unfortunately, I can't seem to get the test to fail consistently even when run in a loop, so it isn't possible to know whether reverting this patch would fix the test failures.

Comment by Arshad Hussain [ 03/May/20 ]

+1 on Master: https://testing.whamcloud.com/sub_tests/d7a74936-6a65-4d13-8de0-1749e351fb84

Comment by Andreas Dilger [ 06/May/20 ]

Dongyang, can you please look into this? It is being hit repeatedly in e2fsprogs testing, and also in regular review testing.

Comment by Dongyang Li [ 08/May/20 ]

I'm having a hard time reproducing this, and the patch https://review.whamcloud.com/37116 only touches truncate, so it should have nothing to do with the size of the dir.

Comment by Gerrit Updater [ 09/May/20 ]

Andreas Dilger (adilger@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/38550
Subject: LU-13187 tests: improve sanity test_129 checking
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: e8a90531a24453778e7a39e2fbdf7f8ff138b547

Comment by Arshad Hussain [ 16/May/20 ]

+1 on master https://testing.whamcloud.com/sub_tests/963e28b3-1f52-4221-ad58-5c7b5871893b

Comment by Gerrit Updater [ 21/May/20 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/38550/
Subject: LU-13187 tests: improve sanity test_129 checking
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: eb80e37f023b85ef0e610ab65cee1d8ee07235fc

Comment by Andreas Dilger [ 21/May/20 ]

So the patch has landed, but the test is still failing (although a bit more verbosely than before):
https://testing.whamcloud.com/test_sets/3611de75-e067-402d-adce-920f82e2e2cc

CMD: trevis-13vm8 echo 32768 >/sys/fs/ldiskfs/dm-5/max_dir_size
CMD: trevis-13vm8 echo 24576 >/sys/fs/ldiskfs/dm-5/warning_dir_size
mcreate: cannot create `/mnt/lustre/d129.sanity/file_base_93' with mode 0100644: No space left on device
rc=28 returned as expected after 93 files
total: 5 open/close in 0.01 seconds: 634.44 ops/second
 	dirsize 4096 < 32768 after 93 files

Looking at the debug logs it is fairly clear that the MDS is returning -ENOSPC=-28 after creating only 93 files, but it isn't printing any errors related to reaching the directory limits.

So it seems that this problem may be a defect in the RHEL7.8 ldiskfs code, or possibly in the upstream ext4 code for that kernel. Could someone with a RHEL7.8 or 8.0 kernel run a manual test to see whether the /sys/fs/ldiskfs/XXX/max_dir_size setting is working at all?

Comment by Dongyang Li [ 21/May/20 ]

I've done that on RHEL7.8 some time ago and it's working:

[root@centos7 ~]# echo 20480 > /sys/fs/ldiskfs/vdb/max_dir_size
[root@centos7 ~]# echo 20480 > /sys/fs/ldiskfs/vdb/warning_dir_size 
[root@centos7 ~]# sh ./test_129.sh 
open(O_RDWR|O_CREAT): No space left on device
28
655
[root@centos7 ~]# stat /mnt/testdir
  File: ‘/mnt/testdir’
  Size: 20480     	Blocks: 40         IO Block: 4096   directory
Device: fd10h/64784d	Inode: 1048577     Links: 2
Access: (0755/drwxr-xr-x)  Uid: (    0/    root)   Gid: (    0/    root)
Access: 2020-05-07 21:35:00.688000000 +1000
Modify: 2020-05-07 21:45:55.832000000 +1000
Change: 2020-05-07 21:45:55.832000000 +1000
 Birth: -

The test_129.sh script just creates files under testdir using multiop, then prints the error code (28) and the number of files created (655).
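A script of this kind can be sketched roughly as follows. This is a minimal stand-in and not the test_129.sh used above: it creates empty files with touch rather than multiop, it reports the exit status of the failing command rather than errno, and the function name fill_dir and the file-count cap are assumptions added for illustration.

```shell
# Hypothetical stand-in for test_129.sh: create empty files in a directory
# until a create fails or a safety cap is reached.
# Usage: fill_dir DIR [MAX_FILES]
# Prints "<exit status of the last create> <number of files created>".
fill_dir() {
    local dir=$1
    local max=${2:-100000}   # safety cap so the loop ends on a dir with no limit
    local count=0
    local rc=0

    mkdir -p "$dir" || return 1
    while [ "$count" -lt "$max" ]; do
        if touch "$dir/file_base_$count" 2>/dev/null; then
            count=$((count + 1))
        else
            # A failed create (for example ENOSPC once the directory cannot
            # grow any further) shows up as a non-zero exit status.
            rc=$?
            break
        fi
    done
    echo "$rc $count"
}
```

On an ldiskfs mount with max_dir_size set, the count printed at failure and the directory size reported by `stat -c %s DIR` could then be compared against the configured limit.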

I haven't done it on RHEL8, but I think the ldiskfs side is fine.

Comment by Andreas Dilger [ 21/May/20 ]

Dongyang, can you please submit a patch to test_129 that sets debug=-1 on the client and MDS, and maybe adds CDEBUG() to osd-ldiskfs so we can see where the -28 is coming from? I was looking at the debug logs from the current failure and they didn't show enough detail.

Comment by Gerrit Updater [ 22/May/20 ]

Li Dongyang (dongyangli@ddn.com) uploaded a new patch: https://review.whamcloud.com/38700
Subject: LU-13187 tests: get more debug info from test_129
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: db1fab110e4a0e57afbbc0300e14f4fbbd2d6f94

Comment by Chris Horn [ 28/May/20 ]

+1 on master https://testing.whamcloud.com/test_sets/52fe5475-a2f3-4bd6-a2b7-c1dc99b20590

Comment by Jian Yu [ 14/Jun/20 ]

+1 on Lustre b2_12 branch: https://testing.whamcloud.com/test_sets/5e20954f-2daf-47ff-9b6c-eacbf7a41dfc

Comment by Emoly Liu [ 17/Jun/20 ]

more on master:
https://testing.whamcloud.com/test_sets/353838f4-221f-4336-accc-ccaea50e17e3
https://testing.whamcloud.com/test_sets/629cec52-dd19-40c0-b0f2-0c22435f81df

Comment by Chris Horn [ 18/Jun/20 ]

+1 on master: https://testing.whamcloud.com/test_sets/1e95c770-b87b-4de6-9a58-08d40241c712

Comment by Chris Horn [ 07/Aug/20 ]

+1 on master https://testing.whamcloud.com/test_sets/62ce778c-aac4-4504-a1dc-ecd559e78533

Comment by James A Simmons [ 07/Aug/20 ]

Note Neil also ran into this problem on SUSE15 and pushed a fix here:

https://review.whamcloud.com/#/c/39571/

The same problem could affect RHEL platforms.

Comment by Chris Horn [ 20/Aug/20 ]

+1 on master: https://testing.whamcloud.com/test_sets/18024581-0159-4d24-84ee-9ae6554ced77

Comment by Andreas Dilger [ 30/Aug/20 ]

+3 on master:
https://testing.whamcloud.com/test_sessions/7b47cafb-4b4e-4cc3-ae57-971c31e4ce84
https://testing.whamcloud.com/test_sessions/70e01f6b-f61c-4d82-a3c6-fa141eb170fe
https://testing.whamcloud.com/test_sessions/27615c0d-2da3-42c0-8bb9-230da1f3acb2

 

Comment by Andreas Dilger [ 30/Aug/20 ]

There were 55 failures of this subtest in the last week, which is about a 10% failure rate. Since sanity runs multiple times per patch, the impact on patch landing is higher than that rate suggests.
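To illustrate why the impact on landing exceeds the per-run rate: if each sanity run fails this subtest independently with probability p, the chance that at least one of n runs for a patch fails is 1 - (1 - p)^n. The sketch below evaluates that expression. The independence assumption, the helper name any_fail_prob, and the run counts are illustrative only, since the number of sanity runs per patch is not stated here.

```shell
# Probability that at least one of N independent runs fails, given a
# per-run failure probability P. Independence is an assumption.
# Usage: any_fail_prob P N
any_fail_prob() {
    # LC_ALL=C keeps '.' as the decimal point regardless of locale.
    LC_ALL=C awk -v p="$1" -v n="$2" \
        'BEGIN { printf "%.3f\n", 1 - (1 - p) ^ n }'
}

# Per-run rate of about 10%, with hypothetical run counts per patch.
for n in 1 2 4; do
    echo "runs=$n p_any_fail=$(any_fail_prob 0.10 "$n")"
done
```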

Comment by Gerrit Updater [ 31/Aug/20 ]

James Simmons (jsimmons@infradead.org) uploaded a new patch: https://review.whamcloud.com/39773
Subject: LU-13187 ldiskfs: Fix max_dir_size_kb for RHEL7
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 43fe1051ee6cab1c9f8b85863ec91aec2c06b251

Comment by James A Simmons [ 31/Aug/20 ]

RHEL8 and Ubuntu overlap for the ldiskfs patches, so they will need to be updated at the same time. I can update Ubuntu, but I don't have a RHEL8 system to fix it up on.

Comment by Neil Brown [ 02/Sep/20 ]

I think this problem is caused by some metadata directory trying to grow.

I added some tracing and found that the call to osd_ldiskfs_append() in iam_new_node() was failing with ENOSPC.

Maybe the best fix would be to add a test to ldiskfs_append() to check if it is a special lustre metadata directory, and if so to bypass the dir limit.

Is there an easy way to detect lustre metadata directories?

Comment by Dongyang Li [ 03/Sep/20 ]

Great, I just could not reproduce the problem.

If it's failing with ENOSPC from osd_ldiskfs_append(), I think we can add a new param to ldiskfs/ext4_append() to bypass the limit check for the OI-related code paths, like iam_new_node(), iam_lfix_create() and iam_lvar_create().

A normal dir uses a different code path, osd_ldiskfs_add_entry()->__ldiskfs/ext4_add_entry().

 

Comment by Andreas Dilger [ 03/Sep/20 ]

Neil, thanks for tracking this down.

Dongyang, I think it would be better to avoid changing the API for ext4_append(), as that would need even more changes to the core ext4 code.

I think there are two options that are relatively simple:

  • split ext4_append() into an outer function of the same name that checks the directory size limit, and a second internal function (e.g. ext4_append_nolimit() or similar) that can be called directly from the iam_* functions
  • set a new EXT4_STATE_IAM flag on the IAM objects when they are opened, and check that inside ext4_append() when checking the size limit

Probably the second option is less intrusive, as it is likely that patch could avoid conflicts with ext4-pdirop.patch and ext4-misc.patch, and hopefully would not need to be different for every kernel.
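The decision the second option would add can be modelled with a toy shell function. This is only a sketch of the logic: the real change would be C code inside ldiskfs, and the function name append_allowed, its arguments, and the treatment of a zero limit as "no limit" are assumptions made for this model.

```shell
# Toy model of the directory size check with an IAM exemption.
# Usage: append_allowed CUR_SIZE MAX_DIR_SIZE IS_IAM
# Returns 0 if another block may be appended, 1 if the append is refused
# (where the real code would return -ENOSPC).
append_allowed() {
    local size=$1
    local limit=$2
    local is_iam=$3

    # Objects flagged as IAM skip the directory size limit entirely.
    if [ "$is_iam" -eq 1 ]; then
        return 0
    fi
    # In this model a limit of 0 means no limit is configured.
    if [ "$limit" -eq 0 ]; then
        return 0
    fi
    # Refuse once the object has already reached the limit.
    [ "$size" -lt "$limit" ]
}
```

With a 32768-byte limit, `append_allowed 32768 32768 0` is refused, while `append_allowed 32768 32768 1` succeeds. That is the behaviour wanted for IAM objects that need to grow while a directory limit is in force.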

Comment by Gerrit Updater [ 03/Sep/20 ]

Li Dongyang (dongyangli@ddn.com) uploaded a new patch: https://review.whamcloud.com/39823
Subject: LU-13187 osd-ldiskfs: don't enforce max dir size limit on IAM objects
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 07c97b04411a396572d7124f3217a7d561a96d2b

Comment by Gerrit Updater [ 10/Sep/20 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/39823/
Subject: LU-13187 osd-ldiskfs: don't enforce max dir size limit on IAM objects
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 03e6db505be90d35ccacb3af7e15277784e5d448

Comment by Peter Jones [ 10/Sep/20 ]

Landed for 2.14

Comment by Gerrit Updater [ 11/Sep/20 ]

Andreas Dilger (adilger@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/39882
Subject: LU-13187 osd-ldiskfs: don't enforce max dir size limit on IAM objects
Project: fs/lustre-release
Branch: b2_12
Current Patch Set: 1
Commit: d8c40507e87798e37d05988d190ccef78b528c42

Comment by Gerrit Updater [ 19/Sep/20 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/39882/
Subject: LU-13187 osd-ldiskfs: don't enforce max dir size limit on IAM objects
Project: fs/lustre-release
Branch: b2_12
Current Patch Set:
Commit: a73f4e566debadfc156b6d8c48237a2e34ac75ba
