[LU-13160] sanity-hsm test 70 timeout Created: 21/Jan/20 Updated: 20/May/20 Resolved: 20/May/20 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.13.0, Lustre 2.12.4 |
| Fix Version/s: | Lustre 2.14.0, Lustre 2.12.5 |
| Type: | Bug | Priority: | Minor |
| Reporter: | Maloo | Assignee: | Dongyang Li |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | rhel8 |
| Environment: | RHEL 8.1 client + RHEL 7.7 server |
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
This issue was created by maloo for jianyu <yujian@whamcloud.com>

This issue relates to the following test suite run: https://testing.whamcloud.com/test_sets/f4707d44-3ba7-11ea-bb75-52540065bddc

test_70 failed with the following error:

== sanity-hsm test 70: Copytool logs JSON register/unregister events to FIFO ========================= 06:37:48 (1579415868)
CMD: trevis-12vm7 mktemp --tmpdir=/tmp -d sanity-hsm.test_70.XXXX
CMD: trevis-12vm7 mkfifo -m 0644 /tmp/sanity-hsm.test_70.r3C7/fifo
CMD: trevis-12vm7 cat /tmp/sanity-hsm.test_70.r3C7/fifo > /tmp/sanity-hsm.test_70.r3C7/events & echo \$! > /tmp/sanity-hsm.test_70.r3C7/monitor_pid

Timeout occurred after 238 mins, last suite running was sanity-hsm, restarting cluster to continue tests |
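For readers unfamiliar with this part of the test, the logged CMD lines come from the copytool monitor setup: the framework creates a temporary directory and a FIFO on the agent node, then starts a background cat that drains the FIFO into an events file. A minimal sketch of that sequence, with an illustrative $MONITOR_DIR rather than the exact test code:

# Minimal sketch of the monitor setup (illustrative paths, not the exact test code)
MONITOR_DIR=$(mktemp --tmpdir=/tmp -d sanity-hsm.test_70.XXXX)
mkfifo -m 0644 "$MONITOR_DIR/fifo"
# The background reader blocks inside open() on the FIFO until a writer (the copytool) opens it
cat "$MONITOR_DIR/fifo" > "$MONITOR_DIR/events" &
echo $! > "$MONITOR_DIR/monitor_pid"

The timeout above indicates the test never progressed past this point on the remote client.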
| Comments |
| Comment by Peter Jones [ 21/Jan/20 ] |
|
Yingjin, could you please advise? Thanks, Peter |
| Comment by Jian Yu [ 29/Jan/20 ] |
|
The failure also occurred on the Lustre b2_12 branch with an RHEL 8.1 client: |
| Comment by Peter Jones [ 31/Jan/20 ] |
|
Dongyang, could you please look into why this regression has appeared with RHEL 8.1 rather than RHEL 8.0 clients? Peter |
| Comment by Dongyang Li [ 06/Feb/20 ] |
|
From both test sessions, it seems the process was stuck after the last cat command started on the second client. The stack log from the session suggests the cat process was indeed started:

[22875.131497] cat S 0 1279 1 0x00000080
[22875.132478] Call Trace:
[22875.132975] ? __schedule+0x253/0x830
[22875.133660] schedule+0x28/0x70
[22875.134253] pipe_wait+0x6c/0xb0
[22875.134880] ? finish_wait+0x80/0x80
[22875.135566] wait_for_partner+0x19/0x50
[22875.136269] fifo_open+0x27b/0x2b0
[22875.136911] ? pipe_release+0xa0/0xa0
[22875.137597] do_dentry_open+0x132/0x330
[22875.138310] path_openat+0x573/0x14d0
[22875.139008] do_filp_open+0x93/0x100
[22875.139681] ? __check_object_size+0xa3/0x181
[22875.140466] do_sys_open+0x184/0x220
[22875.141149] do_syscall_64+0x5b/0x1b0
[22875.141843] entry_SYSCALL_64_after_hwframe+0x65/0xca

This is part of copytool_monitor_setup; we should then proceed to

copytool setup --event-fifo "$HSMTOOL_MONITOR_DIR/fifo"

and should be able to see

CMD: trevis-16vm6 mkdir -p /tmp/arc1/sanity-hsm.test_70/

as the next command on the second client, but I couldn't see anything stopping us from doing that. I also tried to reproduce this with the same setup, no luck yet. Need some help here. |
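For what it's worth, the fifo_open()/wait_for_partner() frames are the normal behaviour of a blocking open of a FIFO for reading: the open() does not return until some process opens the FIFO for writing. A standalone sketch (not the test code; the temporary paths are illustrative) that reproduces the same wait channel and shows it clearing once a writer appears:

# Standalone demonstration: the read-side open of a FIFO blocks until a writer shows up
tmpdir=$(mktemp -d)
mkfifo "$tmpdir/fifo"
cat "$tmpdir/fifo" > "$tmpdir/events" &     # parked in fifo_open/wait_for_partner
sleep 1
sudo cat /proc/$!/stack                     # should show the same frames as the stack log above
echo hello > "$tmpdir/fifo"                 # opening the write side unblocks the reader
wait

If nothing ever opens the write side, the background cat stays parked there indefinitely, which matches the state seen in the stack log.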
| Comment by Jian Yu [ 12/Feb/20 ] |
|
Hi Dongyang,
I can reproduce this with two clients; the test passes with only one client. Did you use two clients when trying to reproduce the issue? |
| Comment by Dongyang Li [ 13/Feb/20 ] |
|
Yes, I was using 2 clients. Are you reproducing it with the lab VMs? If so, can you create a setup and make sure the test case fails, so I can get into the VM and have a look? Thanks a lot |
| Comment by Jian Yu [ 13/Feb/20 ] |
|
Sure, Dongyang. Here are the test nodes:

2 Clients: onyx-22vm3 (local), onyx-22vm5 (remote)
1 MGS/MDS: onyx-22vm1 (1 MDT)
1 OSS: onyx-22vm4 (2 OSTs)

sanity-hsm test 70 is now hanging on onyx-22vm3:

== sanity-hsm test 70: Copytool logs JSON register/unregister events to FIFO ========================= 06:18:09 (1581574689)
CMD: onyx-22vm5 mktemp --tmpdir=/tmp -d sanity-hsm.test_70.XXXX
CMD: onyx-22vm5 mkfifo -m 0644 /tmp/sanity-hsm.test_70.Whg8/fifo
CMD: onyx-22vm5 cat /tmp/sanity-hsm.test_70.Whg8/fifo > /tmp/sanity-hsm.test_70.Whg8/events & echo \$! > /tmp/sanity-hsm.test_70.Whg8/monitor_pid |
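While the hang is live, a quick sanity check on the remote client is whether the background cat is still parked in the FIFO open, as in the earlier stack log. A hedged sketch, run as root on onyx-22vm5 (the monitor_pid path is taken from the log above):

# Run as root on onyx-22vm5 while the test is hung
pid=$(cat /tmp/sanity-hsm.test_70.Whg8/monitor_pid)
cat /proc/"$pid"/stack                      # expect fifo_open/wait_for_partner if still blocked
ps -o pid,stat,wchan:20,args -p "$pid"      # the wchan column should show the same wait channel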
| Comment by Dongyang Li [ 13/Feb/20 ] |
|
Many thanks, Jian. It looks like the session on onyx-22vm3 has already finished. I was trying to start a new session with something like --only 70, but I noticed cfg/local.sh was not set up. Can I just use auster under /usr/lib64/lustre/tests on 22vm3? If not, can you provide a command for it? |
| Comment by Jian Yu [ 13/Feb/20 ] |
|
Sure, Dongyang.

[root@onyx-22vm3 tests]# PDSH="pdsh -t 120 -S -R ssh -w" NAME=ncli RCLIENTS=onyx-22vm5 mds_HOST=onyx-22vm1 MDSDEV=/dev/vda5 MDSSIZE=2097152 ost_HOST=onyx-22vm4 OSTCOUNT=2 OSTSIZE=2097152 OSTDEV1=/dev/vda5 OSTDEV2=/dev/vda6 SHARED_DIRECTORY="/home/jianyu/test_logs" VERBOSE=true bash auster -d /home/jianyu/test_logs -r -s -v -k sanity-hsm --only 70 |
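As a side note, most of those inline variables can also live in a test config file instead of the command line; NAME=ncli makes the framework load cfg/ncli.sh, which, as far as I recall, sources cfg/local.sh and picks up RCLIENTS. A hedged sketch of such a cfg/local.sh for this setup, mirroring the inline values above (the file placement is an assumption; the one-liner is what was actually run):

# Hypothetical cfg/local.sh snippet mirroring the inline variables above
PDSH="pdsh -t 120 -S -R ssh -w"
RCLIENTS=onyx-22vm5
mds_HOST=onyx-22vm1
MDSDEV=/dev/vda5
MDSSIZE=2097152
ost_HOST=onyx-22vm4
OSTCOUNT=2
OSTSIZE=2097152
OSTDEV1=/dev/vda5
OSTDEV2=/dev/vda6
SHARED_DIRECTORY="/home/jianyu/test_logs"
VERBOSE=true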
| Comment by Gerrit Updater [ 17/Feb/20 ] |
|
Li Dongyang (dongyangli@ddn.com) uploaded a new patch: https://review.whamcloud.com/37595 |
| Comment by Gerrit Updater [ 01/Mar/20 ] |
|
Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/37595/ |
| Comment by Gerrit Updater [ 02/Mar/20 ] |
|
Minh Diep (mdiep@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/37773 |
| Comment by Gerrit Updater [ 06/Apr/20 ] |
|
Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/37773/ |
| Comment by James A Simmons [ 20/May/20 ] |
|
Is this complete? |
| Comment by Dongyang Li [ 20/May/20 ] |
|
Yes, I'm closing the ticket. |