[LU-13160] sanity-hsm test 70 timeout Created: 21/Jan/20  Updated: 20/May/20  Resolved: 20/May/20

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.13.0, Lustre 2.12.4
Fix Version/s: Lustre 2.14.0, Lustre 2.12.5

Type: Bug Priority: Minor
Reporter: Maloo Assignee: Dongyang Li
Resolution: Fixed Votes: 0
Labels: rhel8
Environment:

RHEL 8.1 client + RHEL 7.7 server


Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

This issue was created by maloo for jianyu <yujian@whamcloud.com>

This issue relates to the following test suite run: https://testing.whamcloud.com/test_sets/f4707d44-3ba7-11ea-bb75-52540065bddc

test_70 failed with the following error:

== sanity-hsm test 70: Copytool logs JSON register/unregister events to FIFO ========================= 06:37:48 (1579415868)
CMD: trevis-12vm7 mktemp --tmpdir=/tmp -d sanity-hsm.test_70.XXXX
CMD: trevis-12vm7 mkfifo -m 0644 /tmp/sanity-hsm.test_70.r3C7/fifo
CMD: trevis-12vm7 cat /tmp/sanity-hsm.test_70.r3C7/fifo > /tmp/sanity-hsm.test_70.r3C7/events & echo \$! > /tmp/sanity-hsm.test_70.r3C7/monitor_pid

Timeout occurred after 238 mins, last suite running was sanity-hsm, restarting cluster to continue tests
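
For reference, the three CMD lines above are the copytool monitor setup that the test runs on the second client; reconstructed as a standalone sketch (illustrative, not the exact test-framework code):

# Monitor setup as reconstructed from the CMD lines above:
HSMTOOL_MONITOR_DIR=$(mktemp --tmpdir=/tmp -d sanity-hsm.test_70.XXXX)
mkfifo -m 0644 "$HSMTOOL_MONITOR_DIR/fifo"
cat "$HSMTOOL_MONITOR_DIR/fifo" > "$HSMTOOL_MONITOR_DIR/events" &
echo $! > "$HSMTOOL_MONITOR_DIR/monitor_pid"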




 Comments   
Comment by Peter Jones [ 21/Jan/20 ]

Yingjin

Could you please advise?

Thanks

Peter

Comment by Jian Yu [ 29/Jan/20 ]

The failure also occurred on the Lustre b2_12 branch with a RHEL 8.1 client:
https://testing.whamcloud.com/test_sets/59bb2c92-3ea7-11ea-9543-52540065bddc

Comment by Peter Jones [ 31/Jan/20 ]

Dongyang

Could you please look into why this regression has appeared using RHEL 8.1 rather than RHEL 8.0 clients?

Peter

Comment by Dongyang Li [ 06/Feb/20 ]

From both test sessions, it seems the process got stuck after the last cat command was started on the second client. The stack log from the session confirms the cat process was indeed started:

[22875.131497] cat             S    0  1279      1 0x00000080
[22875.132478] Call Trace:
[22875.132975]  ? __schedule+0x253/0x830
[22875.133660]  schedule+0x28/0x70
[22875.134253]  pipe_wait+0x6c/0xb0
[22875.134880]  ? finish_wait+0x80/0x80
[22875.135566]  wait_for_partner+0x19/0x50
[22875.136269]  fifo_open+0x27b/0x2b0
[22875.136911]  ? pipe_release+0xa0/0xa0
[22875.137597]  do_dentry_open+0x132/0x330
[22875.138310]  path_openat+0x573/0x14d0
[22875.139008]  do_filp_open+0x93/0x100
[22875.139681]  ? __check_object_size+0xa3/0x181
[22875.140466]  do_sys_open+0x184/0x220
[22875.141149]  do_syscall_64+0x5b/0x1b0
[22875.141843]  entry_SYSCALL_64_after_hwframe+0x65/0xca

That cat process is part of copytool_monitor_setup. From there we should proceed to

copytool setup --event-fifo "$HSMTOOL_MONITOR_DIR/fifo"

and should then see

CMD: trevis-16vm6 mkdir -p /tmp/arc1/sanity-hsm.test_70/

as the next command on the second client, but I couldn't see anything stopping us from getting that far.
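
For context on the stack above: the reader side of a FIFO blocks inside open() until a writer opens the other end, which is exactly the fifo_open()/wait_for_partner() frames shown. A minimal standalone sketch of that behavior (illustrative /tmp paths, not the test code):

# Reader: the backgrounded cat parks in open() on the FIFO.
mkfifo -m 0644 /tmp/demo.fifo
cat /tmp/demo.fifo > /tmp/demo.events &
echo $! > /tmp/demo.monitor_pid

# Writer: the first open for writing releases the blocked reader;
# cat then copies the line and exits on EOF when the writer closes.
echo '{"example": "event"}' > /tmp/demo.fifo
wait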

I also tried to reproduce this with the same setup, but no luck yet.

Need some help here.

Comment by Jian Yu [ 12/Feb/20 ]

Hi Dongyang,

> Also tried to reproduce this with the same setup, no luck yet.

I can reproduce it with two clients; I found the test passes with only one client. Did you try using two clients to reproduce the issue?

Comment by Dongyang Li [ 13/Feb/20 ]

Yes, I was using two clients.

Are you reproducing it on the lab VMs? If so, can you create a setup where the test case fails, so I can get into the VM and have a look?

Thanks a lot

Comment by Jian Yu [ 13/Feb/20 ]

Sure, Dongyang.

Here are the test nodes:

2 Clients: onyx-22vm3 (local), onyx-22vm5 (remote)
1 MGS/MDS: onyx-22vm1 (1 MDT)
1 OSS: onyx-22vm4 (2 OSTs)

Now sanity-hsm test 70 is hanging on onyx-22vm3:

== sanity-hsm test 70: Copytool logs JSON register/unregister events to FIFO ========================= 06:18:09 (1581574689)
CMD: onyx-22vm5 mktemp --tmpdir=/tmp -d sanity-hsm.test_70.XXXX
CMD: onyx-22vm5 mkfifo -m 0644 /tmp/sanity-hsm.test_70.Whg8/fifo
CMD: onyx-22vm5 cat /tmp/sanity-hsm.test_70.Whg8/fifo > /tmp/sanity-hsm.test_70.Whg8/events & echo \$! > /tmp/sanity-hsm.test_70.Whg8/monitor_pid
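
As a quick check on the remote client (a hedged diagnostic sketch: the monitor_pid file written by the last CMD above holds the PID of the backgrounded cat):

# On onyx-22vm5: confirm the reader is still blocked opening the FIFO.
pid=$(cat /tmp/sanity-hsm.test_70.Whg8/monitor_pid)
cat /proc/$pid/wchan; echo   # expect pipe_wait if blocked in fifo_open()
cat /proc/$pid/stack         # full kernel stack, needs root
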
Comment by Dongyang Li [ 13/Feb/20 ]

Many thanks Jian,

Looks like the session on onyx-22vm3 has already finished. I was trying to start a new session with something like --only 70, but I noticed cfg/local.sh was not set up.

Can I just run auster under /usr/lib64/lustre/tests on 22vm3? If not, can you provide a command for it?

Comment by Jian Yu [ 13/Feb/20 ]

Sure, Dongyang.
I usually specify the variable values directly when running auster. Here is the command I ran on onyx-22vm3 under /usr/lib64/lustre/tests:

[root@onyx-22vm3 tests]# PDSH="pdsh -t 120 -S -R ssh -w" NAME=ncli RCLIENTS=onyx-22vm5 mds_HOST=onyx-22vm1 MDSDEV=/dev/vda5 MDSSIZE=2097152 ost_HOST=onyx-22vm4 OSTCOUNT=2 OSTSIZE=2097152 OSTDEV1=/dev/vda5 OSTDEV2=/dev/vda6 SHARED_DIRECTORY="/home/jianyu/test_logs" VERBOSE=true bash auster -d /home/jianyu/test_logs -r -s -v -k sanity-hsm --only 70
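
The same invocation split across lines for readability (the variable and flag annotations are my reading of the Lustre test framework; treat them as a sketch):

PDSH="pdsh -t 120 -S -R ssh -w" \
NAME=ncli \
RCLIENTS=onyx-22vm5 \
mds_HOST=onyx-22vm1 MDSDEV=/dev/vda5 MDSSIZE=2097152 \
ost_HOST=onyx-22vm4 OSTCOUNT=2 OSTSIZE=2097152 \
OSTDEV1=/dev/vda5 OSTDEV2=/dev/vda6 \
SHARED_DIRECTORY="/home/jianyu/test_logs" VERBOSE=true \
bash auster -d /home/jianyu/test_logs -r -s -v -k sanity-hsm --only 70
# NAME=ncli selects the multi-client config and RCLIENTS names the remote client;
# -d sets the log directory, -r reformats the targets, -s sets SLOW=yes,
# -v enables verbose output, -k keeps going when a subtest fails,
# and --only 70 restricts the run to the one subtest.
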
Comment by Gerrit Updater [ 17/Feb/20 ]

Li Dongyang (dongyangli@ddn.com) uploaded a new patch: https://review.whamcloud.com/37595
Subject: LU-13160 tests: fix sanity-hsm monitor setup
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 093d9f2fe4d80060db46475bf32d4b985edebf97

Comment by Gerrit Updater [ 01/Mar/20 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/37595/
Subject: LU-13160 tests: fix sanity-hsm monitor setup
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 6724d8ca58e9b8474a180b013a4723cbdd8900d9

Comment by Gerrit Updater [ 02/Mar/20 ]

Minh Diep (mdiep@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/37773
Subject: LU-13160 tests: fix sanity-hsm monitor setup
Project: fs/lustre-release
Branch: b2_12
Current Patch Set: 1
Commit: 2f639102be1ded4504027a735affa041dde3552a

Comment by Gerrit Updater [ 06/Apr/20 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/37773/
Subject: LU-13160 tests: fix sanity-hsm monitor setup
Project: fs/lustre-release
Branch: b2_12
Current Patch Set:
Commit: c0a877ab3b049266042299a438d8d010ce3ce605

Comment by James A Simmons [ 20/May/20 ]

Is this complete?

Comment by Dongyang Li [ 20/May/20 ]

Yes, I'm closing the ticket.
