[LU-13251] conf-sanity test_116 hangs Created: 14/Feb/20 Updated: 26/Oct/22 |
|
| Status: | Open |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.14.0 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major |
| Reporter: | Maloo | Assignee: | WC Triage |
| Resolution: | Unresolved | Votes: | 0 |
| Labels: | None |
| Severity: | 3 |
| Description |
|
This issue was created by maloo for jianyu <yujian@whamcloud.com>
This issue relates to the following test suite run: https://testing.whamcloud.com/test_sets/f9a3b2f8-4f30-11ea-a90e-52540065bddc
test_116 failed with the following error:
== conf-sanity test 116: big size MDT support ======================================================== 14:30:44 (1581604244)
CMD: trevis-41vm12 which mkfs.xfs
/sbin/mkfs.xfs
Stopping clients: trevis-41vm10,trevis-41vm9.trevis.whamcloud.com /mnt/lustre (opts:)
CMD: trevis-41vm10,trevis-41vm9.trevis.whamcloud.com running=\$(grep -c /mnt/lustre' ' /proc/mounts);
if [ \$running -ne 0 ] ; then
echo Stopping client \$(hostname) /mnt/lustre opts:;
lsof /mnt/lustre || need_kill=no;
if [ x != x -a x\$need_kill != xno ]; then
pids=\$(lsof -t /mnt/lustre | sort -u);
if [ -n \"\$pids\" ]; then
kill -9 \$pids;
fi
fi;
while umount /mnt/lustre 2>&1 | grep -q busy; do
echo /mnt/lustre is still busy, wait one second && sleep 1;
done;
fi
Console log on OSS:
[51410.683520] Lustre: DEBUG MARKER: == conf-sanity test 116: big size MDT support ======================================================== 14:30:44 (1581604244)
[51831.596292] LNetError: 120-3: Refusing connection from 127.0.0.1 for 0.0.0.0@tcp: No matching NI
[51831.597893] LNetError: Skipped 6 previous similar messages
[51831.598897] LNetError: 10598:0:(socklnd_cb.c:1817:ksocknal_recv_hello()) Error -104 reading HELLO from 127.0.0.1
[51831.600826] LNetError: 10598:0:(socklnd_cb.c:1817:ksocknal_recv_hello()) Skipped 6 previous similar messages
[51831.602520] LNetError: 11b-b: Connection to 0.0.0.0@tcp at host 0.0.0.0 on port 7988 was reset: is it running a compatible version of Lustre and is 0.0.0.0@tcp one of its NIDs?
[51831.605249] LNetError: Skipped 6 previous similar messages
[52106.585997] LNetError: 10606:0:(peer.c:3706:lnet_peer_ni_add_to_recoveryq_locked()) lpni 0.0.0.0@tcp added to recovery queue. Health = 0
[52106.588260] LNetError: 10606:0:(peer.c:3706:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 2 previous similar messages
[52456.590555] LNetError: 120-3: Refusing connection from 127.0.0.1 for 0.0.0.0@tcp: No matching NI
[52456.592517] LNetError: Skipped 7 previous similar messages
[52456.593570] LNetError: 10598:0:(socklnd_cb.c:1817:ksocknal_recv_hello()) Error -104 reading HELLO from 127.0.0.1
[52456.595610] LNetError: 10598:0:(socklnd_cb.c:1817:ksocknal_recv_hello()) Skipped 7 previous similar messages
[52456.597393] LNetError: 11b-b: Connection to 0.0.0.0@tcp at host 0.0.0.0 on port 7988 was reset: is it running a compatible version of Lustre and is 0.0.0.0@tcp one of its NIDs?
[52456.600207] LNetError: Skipped 7 previous similar messages
VVVVVVV DO NOT REMOVE LINES BELOW, Added by Maloo for auto-association VVVVVVV |
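Note for anyone checking their own logs for this signature: the following is a minimal sketch, not part of the Lustre test framework, for tallying the "Refusing connection ... No matching NI" messages in a saved console log. The function name count_refused and the file name console.log are hypothetical. It assumes one log entry per line, and it assumes that a "Skipped N previous similar messages" line directly following a refusal reports suppressed repeats of that same message; that reading of the rate limiter is an assumption.

```shell
# Hypothetical helper: count "No matching NI" refusals in a console log.
# Assumes one entry per line. A "Skipped N previous similar messages" line
# that immediately follows a refusal is treated as N suppressed repeats of
# that refusal (an assumption about how the rate limiter reports them).
count_refused() {
    awk '
        /LNetError: 120-3: Refusing connection/ {
            total++          # the refusal that was actually printed
            armed = 1        # the next line may carry the suppressed count
            next
        }
        armed && /LNetError: Skipped [0-9]+ previous similar messages/ {
            for (i = 1; i < NF; i++)
                if ($i == "Skipped") total += $(i + 1)
            armed = 0
            next
        }
        { armed = 0 }        # any other line ends the pairing
        END { print total + 0 }
    ' "$1"
}
```

Usage: `count_refused console.log` prints a single number. Comparing that number across test runs, or over time on a production node, gives a rough sense of how often the loopback connection attempt is being refused.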
| Comments |
| Comment by Jian Yu [ 14/Feb/20 ] |
|
conf-sanity test 117 and 123aa also failed with this issue: |
| Comment by Jian Yu [ 10/May/20 ] |
|
+1 on master branch: |
| Comment by James Nunez (Inactive) [ 05/Oct/20 ] |
|
We're seeing the same issue for replay-single test 74: https://testing.whamcloud.com/test_sets/818fe728-c4ad-48b3-8aec-4440d4e899d2 |
| Comment by James A Simmons [ 26/Oct/22 ] |
|
We are seeing this on a production machine. It doesn't break the machine, but we do see it in the logs. |