[LU-12744] conf-sanity test 32c hangs in migration Created: 10/Sep/19  Updated: 21/Apr/23  Resolved: 21/Apr/23

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.13.0
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: James Nunez (Inactive) Assignee: WC Triage
Resolution: Cannot Reproduce Votes: 0
Labels: DNE
Environment:

DNE


Issue Links:
Related
is related to LU-15857 conf-sanity test_32c: timeout Open
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

conf-sanity test_32c hangs. The last update in the suite_log is

test migration
/tmp/t32/mnt/lustre /usr/lib64/lustre/tests

It looks like this issue started on 18-August-2019 with one exception. There is one conf-sanity test 32c test hang on 22-July-2019 that looks similar and may be the first occurrence of this issue; logs at https://testing.whamcloud.com/test_sets/61bcf480-ac5c-11e9-8fc1-52540065bddc, for patch https://review.whamcloud.com/35563/.

Looking at the hang at https://testing.whamcloud.com/test_sets/5528ff0a-d38c-11e9-a2b6-52540065bddc, the MDS1 (46vm11) has the following LNet errors

[10395.738022] Lustre: DEBUG MARKER: cat /tmp/t32/list
[10400.803342] LNetError: 17654:0:(lib-msg.c:481:lnet_handle_local_failure()) ni 10.9.5.194@tcp added to recovery queue. Health = 900
[10400.805485] LNetError: 17654:0:(lib-msg.c:481:lnet_handle_local_failure()) Skipped 4 previous similar messages
[10416.806341] LNetError: 17654:0:(lib-msg.c:481:lnet_handle_local_failure()) ni 10.9.5.194@tcp added to recovery queue. Health = 900
[10460.811339] LNetError: 17654:0:(lib-msg.c:481:lnet_handle_local_failure()) ni 10.9.5.194@tcp added to recovery queue. Health = 900
[10460.813456] LNetError: 17654:0:(lib-msg.c:481:lnet_handle_local_failure()) Skipped 2 previous similar messages
[10525.814339] LNetError: 17654:0:(lib-msg.c:481:lnet_handle_local_failure()) ni 10.9.5.194@tcp added to recovery queue. Health = 900
[10525.816593] LNetError: 17654:0:(lib-msg.c:481:lnet_handle_local_failure()) Skipped 4 previous similar messages
[10546.305918] Lustre: DEBUG MARKER: /usr/sbin/lctl set_param -n osd*.*.force_sync=1
[10555.882400] Lustre: DEBUG MARKER: test -f /tmp/t32/sha1sums
[10556.249212] Lustre: DEBUG MARKER: cat /tmp/t32/sha1sums
[10575.523117] Lustre: DEBUG MARKER: /usr/sbin/lctl set_param -n mdt.t32fs*.enable_remote_dir=1
[10875.789488] LNetError: 17659:0:(peer.c:3713:lnet_peer_ni_add_to_recoveryq_locked()) lpni 1.2.3.4@tcp added to recovery queue. Health = 0
[10875.791705] LNetError: 17659:0:(peer.c:3713:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 38 previous similar messages
[11480.794448] LNetError: 17659:0:(peer.c:3713:lnet_peer_ni_add_to_recoveryq_locked()) lpni 1.2.3.4@tcp added to recovery queue. Health = 0
[11480.796650] LNetError: 17659:0:(peer.c:3713:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 37 previous similar messages

In the client 1 () console logs we see similar LNet errors

[ 8464.161118] Lustre: DEBUG MARKER: == conf-sanity test 32c: dne upgrade test ============================================================ 04:16:14 (1568088974)
[ 8471.051044] LNetError: 20186:0:(lib-msg.c:481:lnet_handle_local_failure()) ni 10.9.5.59@tcp added to recovery queue. Health = 416
[ 8476.053138] LNetError: 20191:0:(peer.c:3713:lnet_peer_ni_add_to_recoveryq_locked()) lpni 1.2.3.4@tcp added to recovery queue. Health = 0
[ 8611.055031] LNetError: 20191:0:(lib-move.c:3044:lnet_resend_pending_msgs_locked()) Error sending GET to 12345-1.2.3.4@tcp: -125
[ 8611.472153] Lustre: Mounted t32fs-client
[ 8616.057105] LNetError: 20191:0:(peer.c:3713:lnet_peer_ni_add_to_recoveryq_locked()) lpni 1.2.3.4@tcp added to recovery queue. Health = 0
[ 8631.053042] LNetError: 20186:0:(lib-msg.c:481:lnet_handle_local_failure()) ni 10.9.5.59@tcp added to recovery queue. Health = 900
[ 8756.078407] LNetError: 20191:0:(peer.c:3713:lnet_peer_ni_add_to_recoveryq_locked()) lpni 1.2.3.4@tcp added to recovery queue. Health = 0
[ 8766.082072] LNetError: 20191:0:(peer.c:3713:lnet_peer_ni_add_to_recoveryq_locked()) lpni 1.2.3.4@tcp added to recovery queue. Health = 0
[ 8781.078070] LNetError: 20186:0:(lib-msg.c:481:lnet_handle_local_failure()) ni 10.9.5.59@tcp added to recovery queue. Health = 900
[ 8781.080441] LNetError: 20186:0:(lib-msg.c:481:lnet_handle_local_failure()) Skipped 9 previous similar messages
[ 8911.084112] LNetError: 20191:0:(peer.c:3713:lnet_peer_ni_add_to_recoveryq_locked()) lpni 1.2.3.4@tcp added to recovery queue. Health = 0
[ 9047.082034] LNetError: 20186:0:(lib-msg.c:481:lnet_handle_local_failure()) ni 10.9.5.59@tcp added to recovery queue. Health = 900
[ 9047.084098] LNetError: 20186:0:(lib-msg.c:481:lnet_handle_local_failure()) Skipped 10 previous similar messages
[ 9071.086136] LNetError: 20191:0:(peer.c:3713:lnet_peer_ni_add_to_recoveryq_locked()) lpni 1.2.3.4@tcp added to recovery queue. Health = 0
[ 9071.088202] LNetError: 20191:0:(peer.c:3713:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 1 previous similar message
[ 9111.089105] LNetError: 20191:0:(peer.c:3713:lnet_peer_ni_add_to_recoveryq_locked()) lpni 1.2.3.4@tcp added to recovery queue. Health = 0
[ 9111.091216] LNetError: 20191:0:(peer.c:3713:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 2 previous similar messages
[ 9271.093137] LNetError: 20191:0:(peer.c:3713:lnet_peer_ni_add_to_recoveryq_locked()) lpni 1.2.3.4@tcp added to recovery queue. Health = 0
[ 9271.095269] LNetError: 20191:0:(peer.c:3713:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 1 previous similar message
[ 9461.097114] LNetError: 20191:0:(peer.c:3713:lnet_peer_ni_add_to_recoveryq_locked()) lpni 1.2.3.4@tcp added to recovery queue. Health = 0
[ 9461.099271] LNetError: 20191:0:(peer.c:3713:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 3 previous similar messages
[ 9567.085032] LNetError: 20186:0:(lib-msg.c:481:lnet_handle_local_failure()) ni 10.9.5.59@tcp added to recovery queue. Health = 900
[ 9567.087041] LNetError: 20186:0:(lib-msg.c:481:lnet_handle_local_failure()) Skipped 21 previous similar messages
[ 9846.101090] LNetError: 20191:0:(peer.c:3713:lnet_peer_ni_add_to_recoveryq_locked()) lpni 1.2.3.4@tcp added to recovery queue. Health = 0
[ 9846.103176] LNetError: 20191:0:(peer.c:3713:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 7 previous similar messages
[10154.362691] Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n mdc.t32fs-MDT0000*.max_rpcs_in_flight
[10154.378201] Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n mdc.t32fs-MDT0000*.max_rpcs_in_flight
[10154.786948] Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n mdc.t32fs-MDT0000*.max_rpcs_in_flight
[10154.799759] Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n mdc.t32fs-MDT0000*.max_rpcs_in_flight
[10155.813149] Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n mdc.t32fs-MDT0000*.max_rpcs_in_flight
[10156.827367] Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n mdc.t32fs-MDT0000*.max_rpcs_in_flight
[10157.841681] Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n mdc.t32fs-MDT0000*.max_rpcs_in_flight
[10158.855999] Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n mdc.t32fs-MDT0000*.max_rpcs_in_flight
[10159.869779] Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n mdc.t32fs-MDT0000*.max_rpcs_in_flight
[10160.883948] Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n mdc.t32fs-MDT0000*.max_rpcs_in_flight
[10161.897838] Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n mdc.t32fs-MDT0000*.max_rpcs_in_flight
[10162.040091] Lustre: Unmounted t32fs-client
[10296.089054] LNetError: 20186:0:(lib-msg.c:481:lnet_handle_local_failure()) ni 10.9.5.59@tcp added to recovery queue. Health = 900
[10296.091112] LNetError: 20186:0:(lib-msg.c:481:lnet_handle_local_failure()) Skipped 22 previous similar messages
[10426.104119] LNetError: 20191:0:(peer.c:3713:lnet_peer_ni_add_to_recoveryq_locked()) lpni 1.2.3.4@tcp added to recovery queue. Health = 0
[10426.106258] LNetError: 20191:0:(peer.c:3713:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 11 previous similar messages
[10536.911104] LustreError: 11-0: t32fs-MDT0000-mdc-ffff9cb8f61ea800: operation mds_connect to node 10.9.5.194@tcp failed: rc = -114
[10541.919003] LustreError: 11-0: t32fs-MDT0000-mdc-ffff9cb8f61ea800: operation mds_connect to node 10.9.5.194@tcp failed: rc = -114
[10546.940716] Lustre: Mounted t32fs-client
[10896.106040] LNetError: 20186:0:(lib-msg.c:481:lnet_handle_local_failure()) ni 10.9.5.59@tcp added to recovery queue. Health = 900
[10896.108273] LNetError: 20186:0:(lib-msg.c:481:lnet_handle_local_failure()) Skipped 13 previous similar messages
[11026.122103] LNetError: 20191:0:(peer.c:3713:lnet_peer_ni_add_to_recoveryq_locked()) lpni 1.2.3.4@tcp added to recovery queue. Health = 0
[11026.124299] LNetError: 20191:0:(peer.c:3713:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 13 previous similar messages
[11506.110027] LNetError: 20186:0:(lib-msg.c:481:lnet_handle_local_failure()) ni 10.9.5.59@tcp added to recovery queue. Health = 900
[11506.112068] LNetError: 20186:0:(lib-msg.c:481:lnet_handle_local_failure()) Skipped 17 previous similar messages
[11636.126075] LNetError: 20191:0:(peer.c:3713:lnet_peer_ni_add_to_recoveryq_locked()) lpni 1.2.3.4@tcp added to recovery queue. Health = 0
[11636.128150] LNetError: 20191:0:(peer.c:3713:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 13 previous similar messages

Logs for hung test sessions are at
https://testing.whamcloud.com/test_sets/4d846992-c1e9-11e9-98c8-52540065bddc
https://testing.whamcloud.com/test_sets/605ec094-c577-11e9-90ad-52540065bddc
https://testing.whamcloud.com/test_sets/54161a1a-c592-11e9-a2b6-52540065bddc



 Comments   
Comment by James Nunez (Inactive) [ 16/Sep/19 ]

We see a similar issue for conf-sanity test 116 hang at https://testing.whamcloud.com/test_sets/179c9e2e-d70d-11e9-9fc9-52540065bddc .

On the OSS (vm4), we see

[61594.376664] Lustre: DEBUG MARKER: == conf-sanity test 116: big size MDT support ======================================================== 17:28:44 (1568395724)
[61675.203648] LNetError: 120-3: Refusing connection from 127.0.0.1 for 0.0.0.0@tcp: No matching NI
[61675.205213] LNetError: Skipped 9 previous similar messages
[61675.206173] LNetError: 7800:0:(socklnd_cb.c:1817:ksocknal_recv_hello()) Error -104 reading HELLO from 127.0.0.1
[61675.207905] LNetError: 7800:0:(socklnd_cb.c:1817:ksocknal_recv_hello()) Skipped 9 previous similar messages
[61675.209464] LNetError: 11b-b: Connection to 0.0.0.0@tcp at host 0.0.0.0 on port 7988 was reset: is it running a compatible version of Lustre and is 0.0.0.0@tcp one of its NIDs?
[61675.212000] LNetError: Skipped 9 previous similar messages
[62005.203466] LNetError: 7809:0:(peer.c:3713:lnet_peer_ni_add_to_recoveryq_locked()) lpni 0.0.0.0@tcp added to recovery queue. Health = 0
[62005.205482] LNetError: 7809:0:(peer.c:3713:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 31 previous similar messages
[62305.206637] LNetError: 120-3: Refusing connection from 127.0.0.1 for 0.0.0.0@tcp: No matching NI
[62305.208161] LNetError: Skipped 9 previous similar messages
[62305.209285] LNetError: 7801:0:(socklnd_cb.c:1817:ksocknal_recv_hello()) Error -104 reading HELLO from 127.0.0.1
[62305.211124] LNetError: 7801:0:(socklnd_cb.c:1817:ksocknal_recv_hello()) Skipped 9 previous similar messages
[62305.212673] LNetError: 11b-b: Connection to 0.0.0.0@tcp at host 0.0.0.0 on port 7988 was reset: is it running a compatible version of Lustre and is 0.0.0.0@tcp one of its NIDs?
[62305.215211] LNetError: Skipped 9 previous similar messages
[62620.206491] LNetError: 7809:0:(peer.c:3713:lnet_peer_ni_add_to_recoveryq_locked()) lpni 0.0.0.0@tcp added to recovery queue. Health = 0
[62620.208563] LNetError: 7809:0:(peer.c:3713:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 31 previous similar messages
[62925.210714] LNetError: 120-3: Refusing connection from 127.0.0.1 for 0.0.0.0@tcp: No matching NI
[62925.212278] LNetError: Skipped 9 previous similar messages
[62925.213471] LNetError: 7800:0:(socklnd_cb.c:1817:ksocknal_recv_hello()) Error -104 reading HELLO from 127.0.0.1
[62925.215180] LNetError: 7800:0:(socklnd_cb.c:1817:ksocknal_recv_hello()) Skipped 9 previous similar messages
[62925.216751] LNetError: 11b-b: Connection to 0.0.0.0@tcp at host 0.0.0.0 on port 7988 was reset: is it running a compatible version of Lustre and is 0.0.0.0@tcp one of its NIDs?
[62925.219476] LNetError: Skipped 9 previous similar messages
[63220.210482] LNetError: 7809:0:(peer.c:3713:lnet_peer_ni_add_to_recoveryq_locked()) lpni 0.0.0.0@tcp added to recovery queue. Health = 0
[63220.212527] LNetError: 7809:0:(peer.c:3713:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 31 previous similar messages
[63555.213604] LNetError: 120-3: Refusing connection from 127.0.0.1 for 0.0.0.0@tcp: No matching NI
[63555.215120] LNetError: Skipped 9 previous similar messages
[63555.216062] LNetError: 7801:0:(socklnd_cb.c:1817:ksocknal_recv_hello()) Error -104 reading HELLO from 127.0.0.1
[63555.217715] LNetError: 7801:0:(socklnd_cb.c:1817:ksocknal_recv_hello()) Skipped 9 previous similar messages
[63555.219301] LNetError: 11b-b: Connection to 0.0.0.0@tcp at host 0.0.0.0 on port 7988 was reset: is it running a compatible version of Lustre and is 0.0.0.0@tcp one of its NIDs?
[63555.221883] LNetError: Skipped 9 previous similar messages
[63820.213571] LNetError: 7809:0:(peer.c:3713:lnet_peer_ni_add_to_recoveryq_locked()) lpni 0.0.0.0@tcp added to recovery queue. Health = 0
[63820.216233] LNetError: 7809:0:(peer.c:3713:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 30 previous similar messages
[64165.217617] LNetError: 120-3: Refusing connection from 127.0.0.1 for 0.0.0.0@tcp: No matching NI
[64165.219175] LNetError: Skipped 9 previous similar messages
[64165.220149] LNetError: 7800:0:(socklnd_cb.c:1817:ksocknal_recv_hello()) Error -104 reading HELLO from 127.0.0.1
[64165.222019] LNetError: 7800:0:(socklnd_cb.c:1817:ksocknal_recv_hello()) Skipped 9 previous similar messages
[64165.223598] LNetError: 11b-b: Connection to 0.0.0.0@tcp at host 0.0.0.0 on port 7988 was reset: is it running a compatible version of Lustre and is 0.0.0.0@tcp one of its NIDs?
[64165.226145] LNetError: Skipped 9 previous similar messages
[64430.217496] LNetError: 7809:0:(peer.c:3713:lnet_peer_ni_add_to_recoveryq_locked()) lpni 0.0.0.0@tcp added to recovery queue. Health = 0
[64430.219536] LNetError: 7809:0:(peer.c:3713:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 31 previous similar messages
[64795.220642] LNetError: 120-3: Refusing connection from 127.0.0.1 for 0.0.0.0@tcp: No matching NI
[64795.222149] LNetError: Skipped 9 previous similar messages
[64795.223084] LNetError: 7801:0:(socklnd_cb.c:1817:ksocknal_recv_hello()) Error -104 reading HELLO from 127.0.0.1
[64795.224887] LNetError: 7801:0:(socklnd_cb.c:1817:ksocknal_recv_hello()) Skipped 9 previous similar messages
[64795.226457] LNetError: 11b-b: Connection to 0.0.0.0@tcp at host 0.0.0.0 on port 7988 was reset: is it running a compatible version of Lustre and is 0.0.0.0@tcp one of its NIDs?
[64795.228985] LNetError: Skipped 9 previous similar messages
[65035.220473] LNetError: 7809:0:(peer.c:3713:lnet_peer_ni_add_to_recoveryq_locked()) lpni 0.0.0.0@tcp added to recovery queue. Health = 0
[65035.222497] LNetError: 7809:0:(peer.c:3713:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 31 previous similar messages
Comment by Jian Yu [ 21/Jan/20 ]

+1 on master branch: https://testing.whamcloud.com/test_sets/ef710ad8-3b62-11ea-80b4-52540065bddc

Comment by Jian Yu [ 14/Feb/20 ]

I created LU-13251 for conf-sanity test 116 failure because the issue does not only exist on DNE configuration. Many conf-sanity sub-tests failed with the same issue.

Generated at Sat Feb 10 02:55:16 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.