[LU-13251] conf-sanity test_116 hangs Created: 14/Feb/20  Updated: 26/Oct/22

Status: Open
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.14.0
Fix Version/s: None

Type: Bug
Priority: Major
Reporter: Maloo
Assignee: WC Triage
Resolution: Unresolved
Votes: 0
Labels: None

Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

This issue was created by maloo for jianyu <yujian@whamcloud.com>

This issue relates to the following test suite run: https://testing.whamcloud.com/test_sets/f9a3b2f8-4f30-11ea-a90e-52540065bddc

test_116 failed with the following error:

== conf-sanity test 116: big size MDT support ======================================================== 14:30:44 (1581604244)
CMD: trevis-41vm12 which mkfs.xfs
/sbin/mkfs.xfs
Stopping clients: trevis-41vm10,trevis-41vm9.trevis.whamcloud.com /mnt/lustre (opts:)
CMD: trevis-41vm10,trevis-41vm9.trevis.whamcloud.com running=\$(grep -c /mnt/lustre' ' /proc/mounts);
if [ \$running -ne 0 ] ; then
echo Stopping client \$(hostname) /mnt/lustre opts:;
lsof /mnt/lustre || need_kill=no;
if [ x != x -a x\$need_kill != xno ]; then
    pids=\$(lsof -t /mnt/lustre | sort -u);
    if [ -n \"\$pids\" ]; then
             kill -9 \$pids;
    fi
fi;
while umount  /mnt/lustre 2>&1 | grep -q busy; do
    echo /mnt/lustre is still busy, wait one second && sleep 1;
done;
fi

Console log on OSS:

[51410.683520] Lustre: DEBUG MARKER: == conf-sanity test 116: big size MDT support ======================================================== 14:30:44 (1581604244)
[51831.596292] LNetError: 120-3: Refusing connection from 127.0.0.1 for 0.0.0.0@tcp: No matching NI
[51831.597893] LNetError: Skipped 6 previous similar messages
[51831.598897] LNetError: 10598:0:(socklnd_cb.c:1817:ksocknal_recv_hello()) Error -104 reading HELLO from 127.0.0.1
[51831.600826] LNetError: 10598:0:(socklnd_cb.c:1817:ksocknal_recv_hello()) Skipped 6 previous similar messages
[51831.602520] LNetError: 11b-b: Connection to 0.0.0.0@tcp at host 0.0.0.0 on port 7988 was reset: is it running a compatible version of Lustre and is 0.0.0.0@tcp one of its NIDs?
[51831.605249] LNetError: Skipped 6 previous similar messages
[52106.585997] LNetError: 10606:0:(peer.c:3706:lnet_peer_ni_add_to_recoveryq_locked()) lpni 0.0.0.0@tcp added to recovery queue. Health = 0
[52106.588260] LNetError: 10606:0:(peer.c:3706:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 2 previous similar messages
[52456.590555] LNetError: 120-3: Refusing connection from 127.0.0.1 for 0.0.0.0@tcp: No matching NI
[52456.592517] LNetError: Skipped 7 previous similar messages
[52456.593570] LNetError: 10598:0:(socklnd_cb.c:1817:ksocknal_recv_hello()) Error -104 reading HELLO from 127.0.0.1
[52456.595610] LNetError: 10598:0:(socklnd_cb.c:1817:ksocknal_recv_hello()) Skipped 7 previous similar messages
[52456.597393] LNetError: 11b-b: Connection to 0.0.0.0@tcp at host 0.0.0.0 on port 7988 was reset: is it running a compatible version of Lustre and is 0.0.0.0@tcp one of its NIDs?
[52456.600207] LNetError: Skipped 7 previous similar messages
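To gauge how often the refused connection to 0.0.0.0@tcp recurs, a saved copy of the console log can be tallied with a small helper. This is only a sketch: the function name count_refused is hypothetical, and it assumes that a "Skipped N previous similar messages" line printed directly after a "Refusing connection" line counts suppressed repeats of that same message.

```shell
#!/bin/sh
# Hypothetical helper (not part of the Lustre test framework).
# Usage: count_refused LOGFILE
# Prints the number of "Refusing connection ... for 0.0.0.0@tcp" events,
# adding the count from a "Skipped N previous similar messages" line when
# it immediately follows such an event (assumed to refer to the same message).
count_refused() {
    awk '
        /LNetError: 120-3: Refusing connection from .* for 0\.0\.0\.0@tcp/ {
            total++; pending = 1; next
        }
        pending && /LNetError: Skipped [0-9]+ previous similar messages/ {
            # The number follows the word "Skipped".
            for (i = 1; i < NF; i++)
                if ($i == "Skipped") total += $(i + 1)
            pending = 0; next
        }
        { pending = 0 }
        END { print total + 0 }
    ' "$1"
}
```

For example, `count_refused oss-console.log` (the file name is a placeholder) prints a single number that can be compared across test runs or against a production node's log.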

VVVVVVV DO NOT REMOVE LINES BELOW, Added by Maloo for auto-association VVVVVVV
conf-sanity test_116 - Timeout occurred after 917 mins, last suite running was conf-sanity



 Comments   
Comment by Jian Yu [ 14/Feb/20 ]

conf-sanity tests 117 and 123aa also failed with this issue:
https://testing.whamcloud.com/test_sets/63c7f97e-4af3-11ea-aeb7-52540065bddc
https://testing.whamcloud.com/test_sets/1c4839c0-4dcc-11ea-aeb7-52540065bddc

Comment by Jian Yu [ 10/May/20 ]

+1 on master branch:
https://testing.whamcloud.com/test_sets/709ca0df-1122-49e3-9a6b-290e68078db2

Comment by James Nunez (Inactive) [ 05/Oct/20 ]

We're seeing the same issue for replay-single test 74: https://testing.whamcloud.com/test_sets/818fe728-c4ad-48b3-8aec-4440d4e899d2

Comment by James A Simmons [ 26/Oct/22 ]

We are seeing this on a production machine. It doesn't break the production machine, but we do see it in the logs.

Generated at Sat Feb 10 02:59:41 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.