[LU-7500]  lnet-selftest: test failed to respond and timed out Created: 01/Dec/15  Updated: 26/Jun/17

Status: Open
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.8.0
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: Maloo Assignee: Amir Shehata (Inactive)
Resolution: Unresolved Votes: 0
Labels: None
Environment:

EL7.1 Server/SLES11 SP3 Client
Master - Build# 3252


Severity: 3

 Description   

This issue was created by maloo for Saurabh Tandan <saurabh.tandan@intel.com>

This issue relates to the following test suite run: https://testing.hpdd.intel.com/test_sets/a417b6d0-945f-11e5-a5ac-5254006e85c2.

The sub-test lnet-selftest failed with the following error:

test failed to respond and timed out

lnet-selftest timed out and no other subtest ran. No useful diagnostic information could be found because the log files were absent.
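
For reference, lnet-selftest.sh is a thin wrapper around the lst utility and the lnet_selftest kernel module. When the autotest logs are missing, a minimal manual session can help narrow a hang down to LNet itself; the NIDs below are hypothetical placeholders, so substitute the output of "lctl list_nids":

modprobe lnet_selftest                 # pulls in lnet as a dependency
export LST_SESSION=$$                  # lst requires a session identifier
lst new_session rw_check
lst add_group clients 10.0.0.1@tcp     # hypothetical client NID
lst add_group servers 10.0.0.2@tcp     # hypothetical server NID
lst add_batch bulk_rw
lst add_test --batch bulk_rw --from clients --to servers brw read size=1M
lst run bulk_rw
lst stat clients servers               # prints rates until interrupted
lst stop bulk_rw
lst end_session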



 Comments   
Comment by Saurabh Tandan (Inactive) [ 11/Dec/15 ]

master, build# 3264, tag 2.7.64
Regression: EL7.1 Server/SLES11 SP3 Client
https://testing.hpdd.intel.com/test_sets/1d317972-9f2b-11e5-bf9b-5254006e85c2

Comment by Peter Jones [ 16/Dec/15 ]

Doug

Could you please advise on this one?

Thanks

Peter

Comment by Andreas Dilger [ 05/Jan/16 ]

The one suite_stdout log from https://testing.hpdd.intel.com/test_sets/6c6a9940-9f0a-11e5-ba94-5254006e85c2 shows this is a hang at unmount:

16:37:57:CMD: shadow-9vm3 lsmod | grep lnet > /dev/null && lctl dl | grep ' ST '
16:37:59:CMD: shadow-9vm3 grep -c /mnt/ost5' ' /proc/mounts
16:37:59:Stopping /mnt/ost5 (opts:-f) on shadow-9vm3
16:37:59:CMD: shadow-9vm3 umount -d -f /mnt/ost5
16:37:59:CMD: shadow-9vm3 lsmod | grep lnet > /dev/null && lctl dl | grep ' ST '
16:37:59:CMD: shadow-9vm3 grep -c /mnt/ost6' ' /proc/mounts
16:37:59:Stopping /mnt/ost6 (opts:-f) on shadow-9vm3
16:37:59:CMD: shadow-9vm3 umount -d -f /mnt/ost6
16:38:00:CMD: shadow-9vm3 lsmod | grep lnet > /dev/null && lctl dl | grep ' ST '
16:38:00:CMD: shadow-9vm3 grep -c /mnt/ost7' ' /proc/mounts
16:38:00:Stopping /mnt/ost7 (opts:-f) on shadow-9vm3
16:38:00:CMD: shadow-9vm3 umount -d -f /mnt/ost7
17:37:42:********** Timeout by autotest system **********

Unfortunately, there are no console logs from the OST that might indicate what the problem is.
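
For future instances, a task-stack dump taken on the OST node while umount is hung would show where it is stuck. A minimal collection sketch, assuming sysrq is usable on the node:

echo 1 > /proc/sys/kernel/sysrq       # enable sysrq if it is off
echo t > /proc/sysrq-trigger          # dump all task stacks to the kernel log
dmesg > /tmp/oss-stacks.txt           # save the stack traces
lctl dk /tmp/oss-lustre-debug.txt     # save the Lustre debug buffer
ps -eo pid,stat,wchan:30,cmd | awk '$2 ~ /D/'   # threads in uninterruptible sleep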

Comment by James Nunez (Inactive) [ 06/Jan/16 ]

After speaking with Saurabh: the failure at https://testing.hpdd.intel.com/test_sets/6c6a9940-9f0a-11e5-ba94-5254006e85c2, whose suite_stdout excerpt Andreas posted above, is probably LU-7326/LU-7038, so it should not be considered part of this ticket.

The two remaining failures listed in this ticket so far are for tests between SLES clients and CentOS servers, and the only information we have about them is from the suite_stdout log:

19:44:42:-----============= acceptance-small: lnet-selftest ============----- Wed Dec  9 18:44:38 PST 2015
19:44:42:Running: bash /usr/lib64/lustre/tests/lnet-selftest.sh
19:44:42:CMD: shadow-14vm12,shadow-14vm7 /usr/sbin/lctl list_nids | grep tcp | cut -f 1 -d '@'
19:44:42:CMD: shadow-14vm5,shadow-14vm6 /usr/sbin/lctl list_nids | grep tcp | cut -f 1 -d '@'
19:44:42:Stopping clients: shadow-14vm5,shadow-14vm6 /mnt/lustre (opts:)
19:44:42:CMD: shadow-14vm5,shadow-14vm6 running=\$(grep -c /mnt/lustre' ' /proc/mounts);
19:44:42:if [ \$running -ne 0 ] ; then
19:44:42:echo Stopping client \$(hostname) /mnt/lustre opts:;
19:44:42:lsof /mnt/lustre || need_kill=no;
19:44:42:if [ x != x -a x\$need_kill != xno ]; then
19:44:42:    pids=\$(lsof -t /mnt/lustre | sort -u);
19:44:42:    if [ -n \"\$pids\" ]; then
19:44:42:             kill -9 \$pids;
19:44:42:    fi
19:44:42:fi;
19:44:42:while umount  /mnt/lustre 2>&1 | grep -q busy; do
19:44:42:    echo /mnt/lustre is still busy, wait one second && sleep 1;
19:44:42:done;
19:44:42:fi
20:45:17:********** Timeout by autotest system **********
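
De-escaped, the generated stop-clients script above is equivalent to the sketch below. Note that the test "[ x != x -a ... ]" is always false (the literal "x != x" comparison is what an empty variable expansion, presumably the force/kill flag, leaves behind), so the kill -9 branch is dead code. One plausible reading of the hour-long gap before the autotest timeout is that the mount stayed busy and the umount loop spun:

#!/bin/bash
# Hedged reconstruction of the logged script; /mnt/lustre comes from the log.
running=$(grep -c '/mnt/lustre ' /proc/mounts)
if [ "$running" -ne 0 ]; then
    echo "Stopping client $(hostname) /mnt/lustre opts:"
    lsof /mnt/lustre || need_kill=no
    # Always false: the first operand expanded to "x != x" in the log,
    # so processes pinning the mount are never killed.
    if [ x != x -a "x$need_kill" != xno ]; then
        pids=$(lsof -t /mnt/lustre | sort -u)
        [ -n "$pids" ] && kill -9 $pids
    fi
    # With the kill branch dead, a busy mount loops here indefinitely.
    while umount /mnt/lustre 2>&1 | grep -q busy; do
        echo "/mnt/lustre is still busy, wait one second" && sleep 1
    done
fi
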
Comment by Saurabh Tandan (Inactive) [ 10/Feb/16 ]

Another instance found for interop tag 2.7.66 - EL6.7 Server/2.7.1 Client, build# 3316
https://testing.hpdd.intel.com/test_sets/70b97fbe-cc98-11e5-b80c-5254006e85c2

Comment by nasf (Inactive) [ 07/Jun/16 ]

I hit similar trouble in conf-sanity:
https://testing.hpdd.intel.com/test_logs/64716322-2c90-11e6-bbf5-5254006e85c2/show_text
