[LU-11915] conf-sanity test 115 is skipped or hangs Created: 01/Feb/19  Updated: 04/Aug/22

Status: Open
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.13.0
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: James Nunez (Inactive) Assignee: Artem Blagodarenko
Resolution: Unresolved Votes: 0
Labels: DNE, always_except
Environment:

DNE


Issue Links:
Gantt End to Start
has to be done before LU-11882 OST recreated objects gets badness ma... Closed
Related
is related to LU-10717 several conf-sanity tests failed: FAI... Resolved
is related to LU-15789 conf-sanity test_115() cleanup_115(... Resolved
is related to LU-1365 Implement ldiskfs LARGEDIR support fo... Resolved
is related to LU-11546 enable large_dir support for MDTs Resolved
is related to LU-15700 conf-sanity test 115 does not cleanup... Resolved
is related to LU-15365 conf-sanity/115 to cleanup properly Resolved
is related to LU-13604 rebase Lustre e2fsprogs onto 1.45.6 Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

conf-sanity test_115 is only run for ldiskfs MDS file systems and is skipped for ZFS. Looking back over the past couple of weeks, this test has mostly been skipped, and in the past few days it has started to hang.

For some reason, the test is skipped when formatting MDS1 fails:

        local mds_opts="$(mkfs_opts mds1 ${mdsdev}) --device-size=$IMAGESIZE   \
                --mkfsoptions='-O lazy_itable_init,ea_inode,^resize_inode,meta_bg \
                -i 1024'"
        add mds1 $mds_opts --mgs --reformat $mdsdev ||
                { skip_env "format large MDT failed"; return 0; }

Shouldn’t this be an error?
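
For comparison, a minimal sketch (not a landed patch) of treating the format failure as a hard test failure instead of an environment skip, using the test-framework error() helper:

        add mds1 $mds_opts --mgs --reformat $mdsdev ||
                error "format large MDT failed"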

Starting on January 30, 2019, conf-sanity test 115 started hanging only for review-dne-part-3 test sessions. Looking at the logs from a recent hang, https://testing.whamcloud.com/test_sets/d49db868-2610-11e9-8486-52540065bddc , the last thing seen in the client test_log is

== conf-sanity test 115: Access large xattr with inodes number over 2TB ============================== 09:51:24 (1549014684)
Stopping clients: onyx-34vm6.onyx.whamcloud.com,onyx-34vm7 /mnt/lustre (opts:)
CMD: onyx-34vm6.onyx.whamcloud.com,onyx-34vm7 running=\$(grep -c /mnt/lustre' ' /proc/mounts);
if [ \$running -ne 0 ] ; then
echo Stopping client \$(hostname) /mnt/lustre opts:;
lsof /mnt/lustre || need_kill=no;
if [ x != x -a x\$need_kill != xno ]; then
    pids=\$(lsof -t /mnt/lustre | sort -u);
    if [ -n \"\$pids\" ]; then
             kill -9 \$pids;
    fi
fi;
while umount  /mnt/lustre 2>&1 | grep -q busy; do
    echo /mnt/lustre is still busy, wait one second && sleep 1;
done;
fi

The console logs don’t have much information about this test; no errors, LBUGs, etc.

There were two new tests added to conf-sanity that run right before test 115: conf-sanity tests 110 and 111, added by https://review.whamcloud.com/22009 . Maybe there is some residual effect from these tests running in a DNE environment.

Logs for other hangs are at
https://testing.whamcloud.com/test_sets/d49db868-2610-11e9-8486-52540065bddc
https://testing.whamcloud.com/test_sets/8d6ec5d2-25ab-11e9-a318-52540065bddc

Logs for skipping this test are at
https://testing.whamcloud.com/test_sets/bb4dbd90-25a7-11e9-b97f-52540065bddc
https://testing.whamcloud.com/test_sets/1adf3cf6-2590-11e9-b54c-52540065bddc



 Comments   
Comment by Andreas Dilger [ 01/Feb/19 ]

Probably this needs to reset back to the original filesystem parameters and reformat the filesystem if the large MDT format fails.
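
A minimal sketch of that idea, assuming the usual test-framework helpers stopall and reformat_and_config (the cleanup that eventually lands may look different):

cleanup_115() {
        trap 0
        stopall
        # restore the default format options so later tests run on a sane filesystem
        reformat_and_config
}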

Comment by James Nunez (Inactive) [ 01/Feb/19 ]

Artem,

Would you please investigate these failures?

Thank you,
James

Comment by Gerrit Updater [ 02/Feb/19 ]

James Nunez (jnunez@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/34166
Subject: LU-11915 tests: stop running conf-sanity test 110
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: d1f889acd98d01ac7379372bca5b7971733aa937

Comment by Gerrit Updater [ 04/Feb/19 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/34166/
Subject: LU-11915 tests: stop running conf-sanity test 110
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 22ea5c8edfc9608ee5be056b61dd9fc000865791

Comment by James Nunez (Inactive) [ 08/Feb/19 ]

Maybe this is a DNE issue; https://testing.whamcloud.com/test_sets/e48fa04c-2b60-11e9-b3df-52540065bddc ?

Comment by Gerrit Updater [ 26/Feb/19 ]

Andreas Dilger (adilger@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/34334
Subject: LU-11915 tests: clean up after test_110 properly
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 1eaa048b40d2a8b657ae296ec1a77b1c4a15b8f3

Comment by Andreas Dilger [ 26/Feb/19 ]

Hit this on a single-MDS test run:
https://testing.whamcloud.com/test_sets/aaf3cd4c-39c3-11e9-8f69-52540065bddc

Comment by Alexander Boyko [ 27/Feb/19 ]

Andreas, I've created a big cleanup for conf-sanity, and your patch https://review.whamcloud.com/34334 tries to do the same. Reformatting at every test would add a lot of time to conf-sanity.

Could you inspect https://review.whamcloud.com/#/c/33589/? It passes Maloo with different configurations.

Comment by Andreas Dilger [ 27/Feb/19 ]

Added a link to the other ticket so it can be found.

Comment by Andreas Dilger [ 27/Feb/19 ]

Alexander, I didn't know about LU-10717 since there was no linkage to that ticket here, and no patch was available, and I wanted to get this test passing again consistently.

That said, test_110 is still being skipped in patch https://review.whamcloud.com/33589 "LU-10717 tests: tests should not start mgs" so it isn't clear if your patch is fixing the problem or not.

Comment by Andreas Dilger [ 27/Feb/19 ]

Alexander, could you or Artem please update https://review.whamcloud.com/34334 to not conflict with your https://review.whamcloud.com/33589 patch, or maybe just change it to skip test_110 always and fix it in a separate patch, since this is still causing test failures.

Comment by Artem Blagodarenko (Inactive) [ 21/Mar/19 ]

Andreas, it looks like conf-sanity test 115 cannot pass because the linkEA is never stored in an external xattr inode because of this fix:

commit e760042016bb5b12f9b21568304c02711930720f
Author: Fan Yong <fan.yong@intel.com>
Date:   Sun Aug 28 18:15:37 2016 +0800


    LU-8569 linkea: linkEA size limitation
    
    Under DNE mode, if we do not restrict the linkEA size, and if there
    are too many cross-MDTs hard links to the same object, then it will
    casue the llog overflow. On the other hand, too many linkEA entries
    in the linkEA will serious affect the linkEA performance because we
    only support to locate linkEA entry consecutively.
    
    So we need to restrict the linkEA size. Currently, it is 4096 bytes,
    that is independent from the backend. If too many hard links caused
    the linkEA overflowed, we will add overflow timestamp in the linkEA
    header. Such overflow timestamp has some functionalities:
    
    1. It will prevent the object being migrated to other MDT, because
       some name entries may be not in the linkEA, so we cannot update
       these name entries for the migration.
    
    2. It will tell the namespace LFSCK that the 'nlink' attribute may
       be more trustable than the linkEA, then avoid misguiding the
       namespace LFSCK to repair 'nlink' attribute based on linkEA.
    
    There will be subsequent patch(es) for namespace LFSCK to handle the
    linkEA size limitation and overflow cases.
    
    Signed-off-by: Fan Yong <fan.yong@intel.com>
    Change-Id: I2d6c2be04305c1d7e3af160d8b80e73b66a36483
    Reviewed-on: https://review.whamcloud.com/23500
    Tested-by: Jenkins
    Tested-by: Maloo <hpdd-maloo@intel.com>
    Reviewed-by: Andreas Dilger <andreas.dilger@intel.com>
    Reviewed-by: wangdi <di.wang@intel.com>
    Reviewed-by: Lai Siyao <lai.siyao@intel.com>
    Reviewed-by: Oleg Drokin <oleg.drokin@intel.com> 

We need some other way to create xattrs in an external xattr inode.
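
For illustration only (file name, variables and sizes are hypothetical): with the ea_inode feature enabled, a sufficiently large xattr value is stored in an external xattr inode, and a trusted.* xattr is not subject to the linkEA size cap, which is the direction the later patch took:

touch $MOUNT/$tfile
# a value too big for the in-inode/in-block xattr space, so ldiskfs stores it
# in an external xattr inode (requires the ea_inode feature)
large_value=$(head -c 60000 /dev/zero | tr '\0' 'A')
setfattr -n trusted.big_attr -v "$large_value" $MOUNT/$tfile
getfattr -e text -n trusted.big_attr $MOUNT/$tfile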

Comment by James Nunez (Inactive) [ 26/Apr/19 ]

We are seeing conf-sanity test 115 hang; https://testing.whamcloud.com/test_sets/373dc784-6661-11e9-aeec-52540065bddc . The interesting thing here is that the test says it is going to be skipped, but it hangs due to issues with the NFS server. From client 1:

[21609.829724] echo Stopping client $(hostname) /mnt/lustre2 opts:-f;
[21609.829724] lsof /mnt/lustre2 || need_kill=no;
[21609.829724] if [ x-f != x -a x$need_kill != xno ]; then
[21609.829724]     pids=$(lsof -t /mnt/lustre2 | sort -u);
[21800.653279] nfs: server trevis-1.trevis.whamcloud.com not responding, still trying
[21816.012465] nfs: server trevis-1.trevis.whamcloud.com not responding, still trying
[21831.371588] nfs: server trevis-2.trevis.whamcloud.com not responding, timed out
[21843.530876] nfs: server trevis-1.trevis.whamcloud.com not responding, timed out
[21870.281299] INFO: task tee:30317 blocked for more than 120 seconds.
[21870.282667]       Tainted: G           OE    4.15.0-45-generic #48-Ubuntu
[21870.283887] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[21870.285300] tee             D    0 30317   1211 0x00000000
[21870.286399] Call Trace:
[21870.286996]  __schedule+0x291/0x8a0
[21870.287718]  ? update_curr+0x173/0x1d0
[21870.288504]  ? bit_wait+0x60/0x60
[21870.289194]  schedule+0x2c/0x80
[21870.289854]  io_schedule+0x16/0x40
[21870.290563]  bit_wait_io+0x11/0x60
[21870.291238]  __wait_on_bit+0x4c/0x90
[21870.291947]  out_of_line_wait_on_bit+0x90/0xb0
[21870.292797]  ? bit_waitqueue+0x40/0x40
[21870.293622]  nfs_wait_on_request+0x46/0x50 [nfs]
[21870.294504]  nfs_lock_and_join_requests+0x121/0x510 [nfs]
[21870.295507]  ? __switch_to_asm+0x34/0x70
[21870.296273]  ? radix_tree_lookup_slot+0x22/0x50
[21870.297139]  nfs_updatepage+0x155/0x920 [nfs]
[21870.297996]  nfs_write_end+0x19a/0x4f0 [nfs]
[21870.298841]  generic_perform_write+0xf6/0x1b0
[21870.299672]  ? _cond_resched+0x19/0x40
[21870.300396]  ? _cond_resched+0x19/0x40
[21870.301135]  nfs_file_write+0xfd/0x240 [nfs]
[21870.301999]  new_sync_write+0xe7/0x140
[21870.302754]  __vfs_write+0x29/0x40
[21870.303433]  vfs_write+0xb1/0x1a0
[21870.304102]  SyS_write+0x55/0xc0
[21870.304767]  do_syscall_64+0x73/0x130
[21870.305497]  entry_SYSCALL_64_after_hwframe+0x3d/0xa2
[21870.306442] RIP: 0033:0x7ff061e0a154
[21870.307144] RSP: 002b:00007ffe153fc848 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
[21870.308494] RAX: ffffffffffffffda RBX: 0000000000000044 RCX: 00007ff061e0a154
[21870.309781] RDX: 0000000000000044 RSI: 00007ffe153fc930 RDI: 0000000000000003
[21870.311067] RBP: 00007ffe153fc930 R08: 0000000000000044 R09: 00007ff062309540
[21870.312334] R10: 00000000000001b6 R11: 0000000000000246 R12: 000055636b8aa460
[21870.313623] R13: 0000000000000044 R14: 00007ff0620e1760 R15: 0000000000000044

So, this may be an environment issue?

Comment by Gerrit Updater [ 23/May/19 ]

Artem Blagodarenko (c17828@cray.com) uploaded a new patch: https://review.whamcloud.com/34948
Subject: LU-11915 tests: exlude ea_link from conf-sanity test 115
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 7bdcca3d8cdb3e82b5b0caa8d95df6ed12c1fc93

Comment by Gerrit Updater [ 04/Jul/19 ]

Artem Blagodarenko (c17828@cray.com) uploaded a new patch: https://review.whamcloud.com/35417
Subject: LU-11915 deubugfs: add support for xattrs in external inodes
Project: tools/e2fsprogs
Branch: master-lustre
Current Patch Set: 1
Commit: 1975e0da16366a920503a039371deb0586cfe74d

Comment by Gerrit Updater [ 08/Jul/19 ]

Andreas Dilger (adilger@whamcloud.com) merged in patch https://review.whamcloud.com/35417/
Subject: LU-11915 deubugfs: add support for xattrs in external inodes
Project: tools/e2fsprogs
Branch: master-lustre
Current Patch Set:
Commit: ec5a7546520431ac6259fab6752321e32184ab6a

Comment by Gerrit Updater [ 12/Sep/19 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/34948/
Subject: LU-11915 tests: use trusted.* xattr for conf-sanity test_115
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: b6333589fb05feadccae86acfb6ceeeba87f8b9b

Comment by Andreas Dilger [ 31/Jan/20 ]

The conf-sanity test_110 is still being skipped on master. What is still needed to get that running and passing again?

Comment by Artem Blagodarenko (Inactive) [ 31/Jan/20 ]

adilger, this issue is about conf-sanity test_115, which is "Access large xattr with inodes number over 2TB".

conf-sanity test_110 is about the large_dir feature. Should your last comment be moved to LU-11546?

Comment by Andreas Dilger [ 31/Jan/20 ]

I think that test_110() is marked disabled because of this ticket? At least if it is enabled, it is test_115 that fails.

Comment by Artem Blagodarenko (Inactive) [ 31/Jan/20 ]

I think that test_110() is marked disabled because of this ticket? At least if it is enabled, it is test_115 that fails.

Is this disabled in the testing system configuration or somewhere in the conf-sanity sources?

Comment by Artem Blagodarenko (Inactive) [ 31/Jan/20 ]

Ok. I see the patch now.
James Nunez (jnunez@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/34166
Subject: LU-11915 tests: stop running conf-sanity test 110

Comment by Andreas Dilger [ 12/Feb/20 ]

Artem, I updated the e2fsprogs version number to 1.45.2.wc2 for the fixes under LU-13197, but that caused test_115 to fail. In fact, that test has not been running in testing ever since this patch landed; it has been skipped because of ZFS, because of the e2fsprogs-1.45.2.wc1 version, or because of "format large MDT failed". With the new e2fsprogs-1.45.2.wc2 build it is now failing with:
https://testing.whamcloud.com/sub_tests/a35ed82a-4cb3-11ea-aeb7-52540065bddc

CMD: trevis-54vm3 mkdir -p /mnt/lustre-ost1; mount -t lustre   /dev/mapper/ost1_flakey /mnt/lustre-ost1
trevis-54vm3: mount.lustre: mount /dev/mapper/ost1_flakey at /mnt/lustre-ost1 failed: Address already in use
trevis-54vm3: The target service's index is already in use. (/dev/mapper/ost1_flakey)
Start of /dev/mapper/ost1_flakey on ost1 failed 98
 conf-sanity test_115: @@@@@@ FAIL: start OSS failed 

I don't think this is fallout from the previous test, because test_115() calls stopall at the beginning.
From the messages, it looks like it is using a flakey backing device, even though there is a check to avoid this:

        is_dm_flakey_dev $SINGLEMDS $(mdsdevname 1) &&
                skip "This test can not be executed on flakey dev"

Comment by Andreas Dilger [ 12/Feb/20 ]

In the logs it also shows:

LustreError: 140-5: Server lustre-OST0000 requested index 0, but that index is already in use. Use --writeconf to force
LustreError: 16154:0:(mgs_handler.c:524:mgs_target_reg()) Failed to write lustre-OST0000 log (-98)
Lustre: DEBUG MARKER: /usr/sbin/lctl mark  conf-sanity test_115: @@@@@@ FAIL: start OSS failed 
Lustre: DEBUG MARKER: conf-sanity test_115: @@@@@@ FAIL: start OSS failed

It looks like the MGS needs to be reformatted for this test or something?
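
For reference, the error message's own suggestion; this is an illustration only (device paths are examples), run with all targets unmounted and remount the MGS/MDT first afterwards:

# regenerate the configuration logs to clear the stale "index already in use" state
tunefs.lustre --writeconf /dev/mapper/mds1_flakey
tunefs.lustre --writeconf /dev/mapper/ost1_flakey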

Comment by Gerrit Updater [ 12/Feb/20 ]

Andreas Dilger (adilger@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/37548
Subject: LU-11915 tests: add debugging to conf-sanity test_115
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 6bd24d6b04233cda09f09b79c5c6833e62827db6

Comment by Andreas Dilger [ 12/Feb/20 ]

Looking further at the logs, I see:

mkfs.lustre --mgs --fsname=lustre --mdt --index=0 ...
   Permanent disk data:
Target:     lustre:MDT0000
Mount type: ldiskfs
Flags:      0x65
              (MDT MGS first_time update)

So this is actually a combined MGS+MDS that is just newly formatted.

I've pushed a debug patch to see if there is something wrong with the config (left-over MGS or MDS device mounted).

Comment by Andreas Dilger [ 12/Feb/20 ]

I verified that the filesystems are newly formatted, and none of the modules are loaded during the test:
https://testing.whamcloud.com/test_sets/4244009c-4d94-11ea-b69a-52540065bddc

client: 
opening /dev/obd failed: No such file or directory
hint: the kernel modules may not be loaded
mds1: 
CMD: trevis-27vm12 lctl dl; mount | grep lustre
ost1: 
CMD: trevis-27vm3 lctl dl; mount | grep lustre

so it isn't at all clear to me why it considers the device already in use.

Comment by Gerrit Updater [ 25/Feb/20 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/37548/
Subject: LU-11915 tests: add debugging to conf-sanity test_115
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 037abde0c33f4fff49c237e8588a45cc3da21c59

Comment by Gerrit Updater [ 05/Jun/20 ]

Artem Blagodarenko (artem.blagodarenko@hpe.com) uploaded a new patch: https://review.whamcloud.com/38849
Subject: LU-11915 tests: fix conf-sanity 115 test
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 76c4c3f42b655db45a632f3608c2607e25a90f29

Comment by Artem Blagodarenko (Inactive) [ 05/Jun/20 ]

With parameter "FLAKEY=false" the error "trevis-54vm3: The target service's index is already in use (/dev/mapper/ost1_flakey)" is gone. With some fixes, test passed on my local test environment.

Comment by Gerrit Updater [ 13/Dec/21 ]

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/38849/
Subject: LU-11915 tests: fix conf-sanity 115 test
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: ef13d5464ec7c91c8479ef3e987732dc6355d5ee

Comment by Alex Zhuravlev [ 13/Dec/21 ]

the last patch breaks local testing:

 SKIP: conf-sanity test_115 format large MDT failed
./../tests/test-framework.sh: line 5894: cannot create temp file for here-document: No space left on device
cannot run remote command on /mnt/build/lustre/tests/../utils/lctl with no_dsh
./../tests/test-framework.sh: line 6391: echo: write error: No space left on device
./../tests/test-framework.sh: line 6395: echo: write error: No space left on device
Stopping clients: tmp.pMDa8L24t2 /mnt/lustre (opts:)
Stopping clients: tmp.pMDa8L24t2 /mnt/lustre2 (opts:)
SKIP 115 (38s)
tee: /tmp/conf-sanity.log: No space left on device

Comment by Artem Blagodarenko (Inactive) [ 13/Dec/21 ]

bzzz, could you please share the session URL so I can get more details? Thanks.

Comment by Alex Zhuravlev [ 13/Dec/21 ]

artem_blagodarenko, I can't; this is a local setup (with a small /tmp). I think you can reproduce this easily in a VM with a limited /tmp.

Comment by Artem Blagodarenko (Inactive) [ 13/Dec/21 ]

bzzz, despite the fact that the test "requires" 3072 GiB, it actually expects sparse files to be used. Here is the disk usage on my local VM just after the test passed successfully.

[root@CO82 lustre-wc]# ls -ls /tmp/lustre-*
 340 -rw-r--r-- 1 root root 204800000 Dec 13 16:04 /tmp/lustre-mdt1
4256 -rw-r--r-- 1 root root 204800000 Dec 13 16:04 /tmp/lustre-ost1
4268 -rw-r--r-- 1 root root 409600000 Dec 13 16:04 /tmp/lustre-ost2
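
For illustration (path and size are hypothetical), this is the sparse-file behaviour being relied on: the apparent size is the nominal device size, but blocks are only allocated as data is written, so a small /tmp is normally enough:

truncate -s 200000000 /tmp/lustre-mdt1-example   # allocates no blocks yet
ls -ls /tmp/lustre-mdt1-example                  # first column: blocks actually in use
du -h /tmp/lustre-mdt1-example                   # near zero until data is written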