[LU-11915] conf-sanity test 115 is skipped or hangs Created: 01/Feb/19 Updated: 04/Aug/22 |
|
| Status: | Open |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.13.0 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Minor |
| Reporter: | James Nunez (Inactive) | Assignee: | Artem Blagodarenko |
| Resolution: | Unresolved | Votes: | 0 |
| Labels: | DNE, always_except |
| Environment: | DNE |
| Issue Links: |
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
conf-sanity test_115 is only run for ldiskfs MDS file systems and is skipped for ZFS. Looking back over the past couple of weeks, this test has been skipped, and in the past few days it has started to hang. For some reason, the test is skipped when the formatting of MDS1 fails (conf-sanity.sh around line 8316):
local mds_opts="$(mkfs_opts mds1 ${mdsdev}) --device-size=$IMAGESIZE \
	--mkfsoptions='-O lazy_itable_init,ea_inode,^resize_inode,meta_bg \
	-i 1024'"
add mds1 $mds_opts --mgs --reformat $mdsdev ||
	{ skip_env "format large MDT failed"; return 0; }
Shouldn't this be an error?

Starting on January 30, 2019, conf-sanity test 115 started hanging, but only for review-dne-part-3 test sessions. Looking at the logs from a recent hang, https://testing.whamcloud.com/test_sets/d49db868-2610-11e9-8486-52540065bddc , the last thing seen in the client test_log is:

== conf-sanity test 115: Access large xattr with inodes number over 2TB ============================== 09:51:24 (1549014684)
Stopping clients: onyx-34vm6.onyx.whamcloud.com,onyx-34vm7 /mnt/lustre (opts:)
CMD: onyx-34vm6.onyx.whamcloud.com,onyx-34vm7 running=\$(grep -c /mnt/lustre' ' /proc/mounts);
if [ \$running -ne 0 ] ; then
echo Stopping client \$(hostname) /mnt/lustre opts:;
lsof /mnt/lustre || need_kill=no;
if [ x != x -a x\$need_kill != xno ]; then
pids=\$(lsof -t /mnt/lustre | sort -u);
if [ -n \"\$pids\" ]; then
kill -9 \$pids;
fi
fi;
while umount /mnt/lustre 2>&1 | grep -q busy; do
echo /mnt/lustre is still busy, wait one second && sleep 1;
done;
fi
The console logs don't have much information about this test in them; no errors, LBUGs, etc. There were two new tests added to conf-sanity that run right before test 115: conf-sanity tests 110 and 111, added by https://review.whamcloud.com/22009 . Maybe there is some residual effect from these tests running in a DNE environment.

Logs for other hangs are at
Logs for skipping this test are at |
| Comments |
| Comment by Andreas Dilger [ 01/Feb/19 ] |
|
Probably this needs to reset back to the original filesystem parameters and reformat the filesystem if the large MDT format fails. |
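A minimal sketch of that idea, assuming the add/mkfs_opts helpers from the snippet in the description and that a plain reformat with default options is an acceptable fallback (this is not the committed fix):

# if the oversized format fails, put mds1 back into a usable state with the
# default options before skipping, instead of leaving a half-formatted device
add mds1 $mds_opts --mgs --reformat $mdsdev || {
	add mds1 $(mkfs_opts mds1 ${mdsdev}) --mgs --reformat $mdsdev ||
		error "reformat mds1 with default options failed"
	skip_env "format large MDT failed"
	return 0
}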
| Comment by James Nunez (Inactive) [ 01/Feb/19 ] |
|
Artem, would you please investigate these failures? Thank you. |
| Comment by Gerrit Updater [ 02/Feb/19 ] |
|
James Nunez (jnunez@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/34166 |
| Comment by Gerrit Updater [ 04/Feb/19 ] |
|
Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/34166/ |
| Comment by James Nunez (Inactive) [ 08/Feb/19 ] |
|
Maybe this is a DNE issue; https://testing.whamcloud.com/test_sets/e48fa04c-2b60-11e9-b3df-52540065bddc ? |
| Comment by Gerrit Updater [ 26/Feb/19 ] |
|
Andreas Dilger (adilger@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/34334 |
| Comment by Andreas Dilger [ 26/Feb/19 ] |
|
Hit this on a single-MDS test run: |
| Comment by Alexander Boyko [ 27/Feb/19 ] |
|
Andreas, I've created a big cleanup for conf-sanity, and your patch https://review.whamcloud.com/34334 tries to do the same. Reformatting at every test would add a lot of time to conf-sanity. Could you inspect https://review.whamcloud.com/#/c/33589/ ? It passes Maloo with different configurations. |
| Comment by Andreas Dilger [ 27/Feb/19 ] |
|
Add a link to the other ticket so it can be found. |
| Comment by Andreas Dilger [ 27/Feb/19 ] |
|
Alexander, I didn't know about
That said, test_110 is still being skipped in patch https://review.whamcloud.com/33589 |
| Comment by Andreas Dilger [ 27/Feb/19 ] |
|
Alexander, could you or Artem please update https://review.whamcloud.com/34334 to not conflict with your https://review.whamcloud.com/33589 patch, or maybe just change it to skip test_110 always and fix it in a separate patch, since this is still causing test failures. |
| Comment by Artem Blagodarenko (Inactive) [ 21/Mar/19 ] |
|
Andreas, it looks like conf-sanity test 115 cannot pass because linkEA is never created in external node because of this fix:

commit e760042016bb5b12f9b21568304c02711930720f
Author: Fan Yong <fan.yong@intel.com>
Date: Sun Aug 28 18:15:37 2016 +0800

    LU-8569 linkea: linkEA size limitation

    Under DNE mode, if we do not restrict the linkEA size, and if there are
    too many cross-MDTs hard links to the same object, then it will casue
    the llog overflow. On the other hand, too many linkEA entries in the
    linkEA will serious affect the linkEA performance because we only
    support to locate linkEA entry consecutively.

    So we need to restrict the linkEA size. Currently, it is 4096 bytes,
    that is independent from the backend.

    If too many hard links caused the linkEA overflowed, we will add
    overflow timestamp in the linkEA header. Such overflow timestamp has
    some functionalities:

    1. It will prevent the object being migrated to other MDT, because
       some name entries may be not in the linkEA, so we cannot update
       these name entries for the migration.

    2. It will tell the namespace LFSCK that the 'nlink' attribute may be
       more trustable than the linkEA, then avoid misguiding the namespace
       LFSCK to repair 'nlink' attribute based on linkEA.

    There will be subsequent patch(es) for namespace LFSCK to handle the
    linkEA size limitation and overflow cases.

    Signed-off-by: Fan Yong <fan.yong@intel.com>
    Change-Id: I2d6c2be04305c1d7e3af160d8b80e73b66a36483
    Reviewed-on: https://review.whamcloud.com/23500
    Tested-by: Jenkins
    Tested-by: Maloo <hpdd-maloo@intel.com>
    Reviewed-by: Andreas Dilger <andreas.dilger@intel.com>
    Reviewed-by: wangdi <di.wang@intel.com>
    Reviewed-by: Lai Siyao <lai.siyao@intel.com>
    Reviewed-by: Oleg Drokin <oleg.drokin@intel.com>

We need some other way to create xattrs in external node. |
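One possible direction, shown here only as an illustrative sketch (the xattr name, size, and use of the usual $DIR/$tfile test-framework variables are assumptions, not what the test currently does): create a large user xattr directly with setfattr, rather than relying on linkEA growth from many hard links, which is now capped at 4096 bytes:

# create one file and attach a user xattr large enough that ldiskfs should
# store it in an external xattr inode (ea_inode) rather than inline
touch $DIR/$tfile
setfattr -n user.large -v "$(printf 'x%.0s' {1..60000})" $DIR/$tfile
getfattr -n user.large --only-values $DIR/$tfile | wc -c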
| Comment by James Nunez (Inactive) [ 26/Apr/19 ] |
|
We are seeing conf-sanity test 115 hang; https://testing.whamcloud.com/test_sets/373dc784-6661-11e9-aeec-52540065bddc . The interesting thing here is that the test says it is going to be skipped, but hangs due to issues with the NFS server. From client 1:

[21609.829724] echo Stopping client $(hostname) /mnt/lustre2 opts:-f;
[21609.829724] lsof /mnt/lustre2 || need_kill=no;
[21609.829724] if [ x-f != x -a x$need_kill != xno ]; then
[21609.829724] pids=$(lsof -t /mnt/lustre2 | sort -u);
[21800.653279] nfs: server trevis-1.trevis.whamcloud.com not responding, still trying
[21816.012465] nfs: server trevis-1.trevis.whamcloud.com not responding, still trying
[21831.371588] nfs: server trevis-2.trevis.whamcloud.com not responding, timed out
[21843.530876] nfs: server trevis-1.trevis.whamcloud.com not responding, timed out
[21870.281299] INFO: task tee:30317 blocked for more than 120 seconds.
[21870.282667] Tainted: G OE 4.15.0-45-generic #48-Ubuntu
[21870.283887] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[21870.285300] tee D 0 30317 1211 0x00000000
[21870.286399] Call Trace:
[21870.286996] __schedule+0x291/0x8a0
[21870.287718] ? update_curr+0x173/0x1d0
[21870.288504] ? bit_wait+0x60/0x60
[21870.289194] schedule+0x2c/0x80
[21870.289854] io_schedule+0x16/0x40
[21870.290563] bit_wait_io+0x11/0x60
[21870.291238] __wait_on_bit+0x4c/0x90
[21870.291947] out_of_line_wait_on_bit+0x90/0xb0
[21870.292797] ? bit_waitqueue+0x40/0x40
[21870.293622] nfs_wait_on_request+0x46/0x50 [nfs]
[21870.294504] nfs_lock_and_join_requests+0x121/0x510 [nfs]
[21870.295507] ? __switch_to_asm+0x34/0x70
[21870.296273] ? radix_tree_lookup_slot+0x22/0x50
[21870.297139] nfs_updatepage+0x155/0x920 [nfs]
[21870.297996] nfs_write_end+0x19a/0x4f0 [nfs]
[21870.298841] generic_perform_write+0xf6/0x1b0
[21870.299672] ? _cond_resched+0x19/0x40
[21870.300396] ? _cond_resched+0x19/0x40
[21870.301135] nfs_file_write+0xfd/0x240 [nfs]
[21870.301999] new_sync_write+0xe7/0x140
[21870.302754] __vfs_write+0x29/0x40
[21870.303433] vfs_write+0xb1/0x1a0
[21870.304102] SyS_write+0x55/0xc0
[21870.304767] do_syscall_64+0x73/0x130
[21870.305497] entry_SYSCALL_64_after_hwframe+0x3d/0xa2
[21870.306442] RIP: 0033:0x7ff061e0a154
[21870.307144] RSP: 002b:00007ffe153fc848 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
[21870.308494] RAX: ffffffffffffffda RBX: 0000000000000044 RCX: 00007ff061e0a154
[21870.309781] RDX: 0000000000000044 RSI: 00007ffe153fc930 RDI: 0000000000000003
[21870.311067] RBP: 00007ffe153fc930 R08: 0000000000000044 R09: 00007ff062309540
[21870.312334] R10: 00000000000001b6 R11: 0000000000000246 R12: 000055636b8aa460
[21870.313623] R13: 0000000000000044 R14: 00007ff0620e1760 R15: 0000000000000044

So, this may be an environment issue? |
| Comment by Gerrit Updater [ 23/May/19 ] |
|
Artem Blagodarenko (c17828@cray.com) uploaded a new patch: https://review.whamcloud.com/34948 |
| Comment by Gerrit Updater [ 04/Jul/19 ] |
|
Artem Blagodarenko (c17828@cray.com) uploaded a new patch: https://review.whamcloud.com/35417 |
| Comment by Gerrit Updater [ 08/Jul/19 ] |
|
Andreas Dilger (adilger@whamcloud.com) merged in patch https://review.whamcloud.com/35417/ |
| Comment by Gerrit Updater [ 12/Sep/19 ] |
|
Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/34948/ |
| Comment by Andreas Dilger [ 31/Jan/20 ] |
|
The conf-sanity test_110 is still being skipped on master. What is still needed to get that running and passing again? |
| Comment by Artem Blagodarenko (Inactive) [ 31/Jan/20 ] |
|
adilger, this issue is about conf-sanity test_115, which is "Access large xattr with inodes number over 2TB". The conf-sanity test_110 is about the large_dir feature. Should your last comment be moved to |
| Comment by Andreas Dilger [ 31/Jan/20 ] |
|
I think that test_110() is marked disabled because of this ticket? At least if it is enabled, it is test_115 that fails. |
| Comment by Artem Blagodarenko (Inactive) [ 31/Jan/20 ] |
Is this the testing system specification or something in conf-sanity sources? |
| Comment by Artem Blagodarenko (Inactive) [ 31/Jan/20 ] |
|
Ok. I see the patch now. |
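For reference, a minimal sketch of the usual mechanism conf-sanity.sh uses to disable a test in the script itself (the variable name and the exact entry added by the patch are assumptions and may differ):

# tests listed in ALWAYS_EXCEPT are skipped by the test framework on every run
ALWAYS_EXCEPT="$CONF_SANITY_EXCEPT 110 115"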
| Comment by Andreas Dilger [ 12/Feb/20 ] |
|
Artem, I updated the e2fsprogs version number to 1.45.2.wc2 for the fixes under

CMD: trevis-54vm3 mkdir -p /mnt/lustre-ost1; mount -t lustre /dev/mapper/ost1_flakey /mnt/lustre-ost1
trevis-54vm3: mount.lustre: mount /dev/mapper/ost1_flakey at /mnt/lustre-ost1 failed: Address already in use
trevis-54vm3: The target service's index is already in use. (/dev/mapper/ost1_flakey)
Start of /dev/mapper/ost1_flakey on ost1 failed 98
conf-sanity test_115: @@@@@@ FAIL: start OSS failed

I don't think this is fallout from the previous test, because test_115() calls stopall at the beginning.
is_dm_flakey_dev $SINGLEMDS $(mdsdevname 1) &&
skip "This test can not be executed on flakey dev"
|
| Comment by Andreas Dilger [ 12/Feb/20 ] |
|
In the logs it also shows:

LustreError: 140-5: Server lustre-OST0000 requested index 0, but that index is already in use. Use --writeconf to force
LustreError: 16154:0:(mgs_handler.c:524:mgs_target_reg()) Failed to write lustre-OST0000 log (-98)
Lustre: DEBUG MARKER: /usr/sbin/lctl mark conf-sanity test_115: @@@@@@ FAIL: start OSS failed
Lustre: DEBUG MARKER: conf-sanity test_115: @@@@@@ FAIL: start OSS failed

It looks like the MGS needs to be reformatted for this test or something? |
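For what it's worth, a hedged sketch of the manual recovery the 140-5 message points at (the device names follow the *_flakey naming from the log above and are assumptions; this is not part of the test):

# regenerate the configuration logs so the OST index registration is redone
# on the next mount of the MGS/MDT and the OST
tunefs.lustre --writeconf /dev/mapper/mds1_flakey
tunefs.lustre --writeconf /dev/mapper/ost1_flakey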
| Comment by Gerrit Updater [ 12/Feb/20 ] |
|
Andreas Dilger (adilger@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/37548 |
| Comment by Andreas Dilger [ 12/Feb/20 ] |
|
Looking further at the logs, I see:

mkfs.lustre --mgs --fsname=lustre --mdt --index=0 ...
Permanent disk data:
Target: lustre:MDT0000
Mount type: ldiskfs
Flags: 0x65
(MDT MGS first_time update)
So this is actually a combined MGS+MDS that is just newly formatted. I've pushed a debug patch to see if there is something wrong with the config (left-over MGS or MDS device mounted). |
| Comment by Andreas Dilger [ 12/Feb/20 ] |
|
I verified that the filesystems are newly formatted, and none of the modules are loaded during the test:

client: opening /dev/obd failed: No such file or directory
        hint: the kernel modules may not be loaded
mds1: CMD: trevis-27vm12 lctl dl; mount | grep lustre
ost1: CMD: trevis-27vm3 lctl dl; mount | grep lustre

so it isn't at all clear to me why it considers the device already in use. |
| Comment by Gerrit Updater [ 25/Feb/20 ] |
|
Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/37548/ |
| Comment by Gerrit Updater [ 05/Jun/20 ] |
|
Artem Blagodarenko (artem.blagodarenko@hpe.com) uploaded a new patch: https://review.whamcloud.com/38849 |
| Comment by Artem Blagodarenko (Inactive) [ 05/Jun/20 ] |
|
With parameter "FLAKEY=false" the error "trevis-54vm3: The target service's index is already in use (/dev/mapper/ost1_flakey)" is gone. With some fixes, test passed on my local test environment. |
| Comment by Gerrit Updater [ 13/Dec/21 ] |
|
"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/38849/ |
| Comment by Alex Zhuravlev [ 13/Dec/21 ] |
|
the last patch breaks local testing:
SKIP: conf-sanity test_115 format large MDT failed
./../tests/test-framework.sh: line 5894: cannot create temp file for here-document: No space left on device
cannot run remote command on /mnt/build/lustre/tests/../utils/lctl with no_dsh
./../tests/test-framework.sh: line 6391: echo: write error: No space left on device
./../tests/test-framework.sh: line 6395: echo: write error: No space left on device
Stopping clients: tmp.pMDa8L24t2 /mnt/lustre (opts:)
Stopping clients: tmp.pMDa8L24t2 /mnt/lustre2 (opts:)
SKIP 115 (38s)
tee: /tmp/conf-sanity.log: No space left on device
|
| Comment by Artem Blagodarenko (Inactive) [ 13/Dec/21 ] |
|
bzzz, could you please share the session URL so I can get more details? Thanks. |
| Comment by Alex Zhuravlev [ 13/Dec/21 ] |
|
artem_blagodarenko I can't, this is a local setup (with a small /tmp). I think you can reproduce this easily in a VM with a limited /tmp. |
| Comment by Artem Blagodarenko (Inactive) [ 13/Dec/21 ] |
|
Bzzz, despite the fact that the test "requires" 3072 GiB, it actually expects sparse files to be used. Here is the disk usage on my local VM just after the test passed successfully:

[root@CO82 lustre-wc]# ls -ls /tmp/lustre-*
340 -rw-r--r-- 1 root root 204800000 Dec 13 16:04 /tmp/lustre-mdt1
4256 -rw-r--r-- 1 root root 204800000 Dec 13 16:04 /tmp/lustre-ost1
4268 -rw-r--r-- 1 root root 409600000 Dec 13 16:04 /tmp/lustre-ost2 |
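A quick illustration of the sparse-file behavior being described; the file name and size are arbitrary examples, and this assumes /tmp sits on a filesystem that supports sparse files of that apparent size (ext4/xfs do):

# a sparse file has a large apparent size but allocates almost no blocks
# until data is written; ls -ls shows allocated KiB next to the byte size
truncate -s 100G /tmp/sparse-demo
ls -ls /tmp/sparse-demo    # allocated column stays near 0
du -h /tmp/sparse-demo
rm /tmp/sparse-demo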