[LU-12744] conf-sanity test 32c hangs in migration Created: 10/Sep/19 Updated: 21/Apr/23 Resolved: 21/Apr/23 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.13.0 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Minor |
| Reporter: | James Nunez (Inactive) | Assignee: | WC Triage |
| Resolution: | Cannot Reproduce | Votes: | 0 |
| Labels: | DNE |
| Environment: | DNE |
| Issue Links: | |
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
conf-sanity test_32c hangs. The last update in the suite_log is "test migration /tmp/t32/mnt/lustre /usr/lib64/lustre/tests". It looks like this issue started on 18-August-2019, with one possible exception: there is one conf-sanity test 32c hang on 22-July-2019 that looks similar and may be the first occurrence of this issue; logs are at https://testing.whamcloud.com/test_sets/61bcf480-ac5c-11e9-8fc1-52540065bddc, for patch https://review.whamcloud.com/35563/.

Looking at the hang at https://testing.whamcloud.com/test_sets/5528ff0a-d38c-11e9-a2b6-52540065bddc, MDS1 (46vm11) has the following LNet errors:

[10395.738022] Lustre: DEBUG MARKER: cat /tmp/t32/list
[10400.803342] LNetError: 17654:0:(lib-msg.c:481:lnet_handle_local_failure()) ni 10.9.5.194@tcp added to recovery queue. Health = 900
[10400.805485] LNetError: 17654:0:(lib-msg.c:481:lnet_handle_local_failure()) Skipped 4 previous similar messages
[10416.806341] LNetError: 17654:0:(lib-msg.c:481:lnet_handle_local_failure()) ni 10.9.5.194@tcp added to recovery queue. Health = 900
[10460.811339] LNetError: 17654:0:(lib-msg.c:481:lnet_handle_local_failure()) ni 10.9.5.194@tcp added to recovery queue. Health = 900
[10460.813456] LNetError: 17654:0:(lib-msg.c:481:lnet_handle_local_failure()) Skipped 2 previous similar messages
[10525.814339] LNetError: 17654:0:(lib-msg.c:481:lnet_handle_local_failure()) ni 10.9.5.194@tcp added to recovery queue. Health = 900
[10525.816593] LNetError: 17654:0:(lib-msg.c:481:lnet_handle_local_failure()) Skipped 4 previous similar messages
[10546.305918] Lustre: DEBUG MARKER: /usr/sbin/lctl set_param -n osd*.*.force_sync=1
[10555.882400] Lustre: DEBUG MARKER: test -f /tmp/t32/sha1sums
[10556.249212] Lustre: DEBUG MARKER: cat /tmp/t32/sha1sums
[10575.523117] Lustre: DEBUG MARKER: /usr/sbin/lctl set_param -n mdt.t32fs*.enable_remote_dir=1
[10875.789488] LNetError: 17659:0:(peer.c:3713:lnet_peer_ni_add_to_recoveryq_locked()) lpni 1.2.3.4@tcp added to recovery queue. Health = 0
[10875.791705] LNetError: 17659:0:(peer.c:3713:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 38 previous similar messages
[11480.794448] LNetError: 17659:0:(peer.c:3713:lnet_peer_ni_add_to_recoveryq_locked()) lpni 1.2.3.4@tcp added to recovery queue. Health = 0
[11480.796650] LNetError: 17659:0:(peer.c:3713:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 37 previous similar messages

In the client 1 () console logs, we see similar LNet errors:

[ 8464.161118] Lustre: DEBUG MARKER: == conf-sanity test 32c: dne upgrade test ============================================================ 04:16:14 (1568088974)
[ 8471.051044] LNetError: 20186:0:(lib-msg.c:481:lnet_handle_local_failure()) ni 10.9.5.59@tcp added to recovery queue. Health = 416
[ 8476.053138] LNetError: 20191:0:(peer.c:3713:lnet_peer_ni_add_to_recoveryq_locked()) lpni 1.2.3.4@tcp added to recovery queue. Health = 0
[ 8611.055031] LNetError: 20191:0:(lib-move.c:3044:lnet_resend_pending_msgs_locked()) Error sending GET to 12345-1.2.3.4@tcp: -125
[ 8611.472153] Lustre: Mounted t32fs-client
[ 8616.057105] LNetError: 20191:0:(peer.c:3713:lnet_peer_ni_add_to_recoveryq_locked()) lpni 1.2.3.4@tcp added to recovery queue. Health = 0
[ 8631.053042] LNetError: 20186:0:(lib-msg.c:481:lnet_handle_local_failure()) ni 10.9.5.59@tcp added to recovery queue. Health = 900
[ 8756.078407] LNetError: 20191:0:(peer.c:3713:lnet_peer_ni_add_to_recoveryq_locked()) lpni 1.2.3.4@tcp added to recovery queue. Health = 0
[ 8766.082072] LNetError: 20191:0:(peer.c:3713:lnet_peer_ni_add_to_recoveryq_locked()) lpni 1.2.3.4@tcp added to recovery queue. Health = 0
[ 8781.078070] LNetError: 20186:0:(lib-msg.c:481:lnet_handle_local_failure()) ni 10.9.5.59@tcp added to recovery queue. Health = 900
[ 8781.080441] LNetError: 20186:0:(lib-msg.c:481:lnet_handle_local_failure()) Skipped 9 previous similar messages
[ 8911.084112] LNetError: 20191:0:(peer.c:3713:lnet_peer_ni_add_to_recoveryq_locked()) lpni 1.2.3.4@tcp added to recovery queue. Health = 0
[ 9047.082034] LNetError: 20186:0:(lib-msg.c:481:lnet_handle_local_failure()) ni 10.9.5.59@tcp added to recovery queue. Health = 900
[ 9047.084098] LNetError: 20186:0:(lib-msg.c:481:lnet_handle_local_failure()) Skipped 10 previous similar messages
[ 9071.086136] LNetError: 20191:0:(peer.c:3713:lnet_peer_ni_add_to_recoveryq_locked()) lpni 1.2.3.4@tcp added to recovery queue. Health = 0
[ 9071.088202] LNetError: 20191:0:(peer.c:3713:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 1 previous similar message
[ 9111.089105] LNetError: 20191:0:(peer.c:3713:lnet_peer_ni_add_to_recoveryq_locked()) lpni 1.2.3.4@tcp added to recovery queue. Health = 0
[ 9111.091216] LNetError: 20191:0:(peer.c:3713:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 2 previous similar messages
[ 9271.093137] LNetError: 20191:0:(peer.c:3713:lnet_peer_ni_add_to_recoveryq_locked()) lpni 1.2.3.4@tcp added to recovery queue. Health = 0
[ 9271.095269] LNetError: 20191:0:(peer.c:3713:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 1 previous similar message
[ 9461.097114] LNetError: 20191:0:(peer.c:3713:lnet_peer_ni_add_to_recoveryq_locked()) lpni 1.2.3.4@tcp added to recovery queue. Health = 0
[ 9461.099271] LNetError: 20191:0:(peer.c:3713:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 3 previous similar messages
[ 9567.085032] LNetError: 20186:0:(lib-msg.c:481:lnet_handle_local_failure()) ni 10.9.5.59@tcp added to recovery queue. Health = 900
[ 9567.087041] LNetError: 20186:0:(lib-msg.c:481:lnet_handle_local_failure()) Skipped 21 previous similar messages
[ 9846.101090] LNetError: 20191:0:(peer.c:3713:lnet_peer_ni_add_to_recoveryq_locked()) lpni 1.2.3.4@tcp added to recovery queue. Health = 0
[ 9846.103176] LNetError: 20191:0:(peer.c:3713:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 7 previous similar messages
[10154.362691] Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n mdc.t32fs-MDT0000*.max_rpcs_in_flight
[10154.378201] Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n mdc.t32fs-MDT0000*.max_rpcs_in_flight
[10154.786948] Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n mdc.t32fs-MDT0000*.max_rpcs_in_flight
[10154.799759] Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n mdc.t32fs-MDT0000*.max_rpcs_in_flight
[10155.813149] Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n mdc.t32fs-MDT0000*.max_rpcs_in_flight
[10156.827367] Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n mdc.t32fs-MDT0000*.max_rpcs_in_flight
[10157.841681] Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n mdc.t32fs-MDT0000*.max_rpcs_in_flight
[10158.855999] Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n mdc.t32fs-MDT0000*.max_rpcs_in_flight
[10159.869779] Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n mdc.t32fs-MDT0000*.max_rpcs_in_flight
[10160.883948] Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n mdc.t32fs-MDT0000*.max_rpcs_in_flight
[10161.897838] Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n mdc.t32fs-MDT0000*.max_rpcs_in_flight
[10162.040091] Lustre: Unmounted t32fs-client
[10296.089054] LNetError: 20186:0:(lib-msg.c:481:lnet_handle_local_failure()) ni 10.9.5.59@tcp added to recovery queue. Health = 900
[10296.091112] LNetError: 20186:0:(lib-msg.c:481:lnet_handle_local_failure()) Skipped 22 previous similar messages
[10426.104119] LNetError: 20191:0:(peer.c:3713:lnet_peer_ni_add_to_recoveryq_locked()) lpni 1.2.3.4@tcp added to recovery queue. Health = 0
[10426.106258] LNetError: 20191:0:(peer.c:3713:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 11 previous similar messages
[10536.911104] LustreError: 11-0: t32fs-MDT0000-mdc-ffff9cb8f61ea800: operation mds_connect to node 10.9.5.194@tcp failed: rc = -114
[10541.919003] LustreError: 11-0: t32fs-MDT0000-mdc-ffff9cb8f61ea800: operation mds_connect to node 10.9.5.194@tcp failed: rc = -114
[10546.940716] Lustre: Mounted t32fs-client
[10896.106040] LNetError: 20186:0:(lib-msg.c:481:lnet_handle_local_failure()) ni 10.9.5.59@tcp added to recovery queue. Health = 900
[10896.108273] LNetError: 20186:0:(lib-msg.c:481:lnet_handle_local_failure()) Skipped 13 previous similar messages
[11026.122103] LNetError: 20191:0:(peer.c:3713:lnet_peer_ni_add_to_recoveryq_locked()) lpni 1.2.3.4@tcp added to recovery queue. Health = 0
[11026.124299] LNetError: 20191:0:(peer.c:3713:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 13 previous similar messages
[11506.110027] LNetError: 20186:0:(lib-msg.c:481:lnet_handle_local_failure()) ni 10.9.5.59@tcp added to recovery queue. Health = 900
[11506.112068] LNetError: 20186:0:(lib-msg.c:481:lnet_handle_local_failure()) Skipped 17 previous similar messages
[11636.126075] LNetError: 20191:0:(peer.c:3713:lnet_peer_ni_add_to_recoveryq_locked()) lpni 1.2.3.4@tcp added to recovery queue. Health = 0
[11636.128150] LNetError: 20191:0:(peer.c:3713:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 13 previous similar messages

Logs for hung test sessions are at |
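Editorial note, not part of the original report: the following is a minimal, hypothetical triage sketch for this kind of hang, assuming a standard Lustre test installation with the scripts under /usr/lib64/lustre/tests (as seen in the DEBUG MARKER lines above); the exact lnetctl output fields vary by Lustre version.

# Run only the hanging subtest via the test framework's ONLY filter
cd /usr/lib64/lustre/tests
ONLY=32c bash ./conf-sanity.sh

# While the test is hung, inspect LNet health state on the MDS and client
# (the "added to recovery queue" messages come from LNet health handling)
lnetctl net show -v      # local NIs, e.g. 10.9.5.194@tcp, with statistics
lnetctl peer show -v     # peer NIs, e.g. the 1.2.3.4@tcp lpni from the logs
lnetctl stats show       # global LNet counters

Collecting this state on both the MDS and the client while the migration is stuck would help show whether the hang is in LNet itself or only accompanied by these health messages.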
| Comments |
| Comment by James Nunez (Inactive) [ 16/Sep/19 ] |
|
We see a similar issue with a conf-sanity test 116 hang at https://testing.whamcloud.com/test_sets/179c9e2e-d70d-11e9-9fc9-52540065bddc. On the OSS (vm4), we see:

[61594.376664] Lustre: DEBUG MARKER: == conf-sanity test 116: big size MDT support ======================================================== 17:28:44 (1568395724)
[61675.203648] LNetError: 120-3: Refusing connection from 127.0.0.1 for 0.0.0.0@tcp: No matching NI
[61675.205213] LNetError: Skipped 9 previous similar messages
[61675.206173] LNetError: 7800:0:(socklnd_cb.c:1817:ksocknal_recv_hello()) Error -104 reading HELLO from 127.0.0.1
[61675.207905] LNetError: 7800:0:(socklnd_cb.c:1817:ksocknal_recv_hello()) Skipped 9 previous similar messages
[61675.209464] LNetError: 11b-b: Connection to 0.0.0.0@tcp at host 0.0.0.0 on port 7988 was reset: is it running a compatible version of Lustre and is 0.0.0.0@tcp one of its NIDs?
[61675.212000] LNetError: Skipped 9 previous similar messages
[62005.203466] LNetError: 7809:0:(peer.c:3713:lnet_peer_ni_add_to_recoveryq_locked()) lpni 0.0.0.0@tcp added to recovery queue. Health = 0
[62005.205482] LNetError: 7809:0:(peer.c:3713:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 31 previous similar messages
[62305.206637] LNetError: 120-3: Refusing connection from 127.0.0.1 for 0.0.0.0@tcp: No matching NI
[62305.208161] LNetError: Skipped 9 previous similar messages
[62305.209285] LNetError: 7801:0:(socklnd_cb.c:1817:ksocknal_recv_hello()) Error -104 reading HELLO from 127.0.0.1
[62305.211124] LNetError: 7801:0:(socklnd_cb.c:1817:ksocknal_recv_hello()) Skipped 9 previous similar messages
[62305.212673] LNetError: 11b-b: Connection to 0.0.0.0@tcp at host 0.0.0.0 on port 7988 was reset: is it running a compatible version of Lustre and is 0.0.0.0@tcp one of its NIDs?
[62305.215211] LNetError: Skipped 9 previous similar messages
[62620.206491] LNetError: 7809:0:(peer.c:3713:lnet_peer_ni_add_to_recoveryq_locked()) lpni 0.0.0.0@tcp added to recovery queue. Health = 0
[62620.208563] LNetError: 7809:0:(peer.c:3713:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 31 previous similar messages
[62925.210714] LNetError: 120-3: Refusing connection from 127.0.0.1 for 0.0.0.0@tcp: No matching NI
[62925.212278] LNetError: Skipped 9 previous similar messages
[62925.213471] LNetError: 7800:0:(socklnd_cb.c:1817:ksocknal_recv_hello()) Error -104 reading HELLO from 127.0.0.1
[62925.215180] LNetError: 7800:0:(socklnd_cb.c:1817:ksocknal_recv_hello()) Skipped 9 previous similar messages
[62925.216751] LNetError: 11b-b: Connection to 0.0.0.0@tcp at host 0.0.0.0 on port 7988 was reset: is it running a compatible version of Lustre and is 0.0.0.0@tcp one of its NIDs?
[62925.219476] LNetError: Skipped 9 previous similar messages
[63220.210482] LNetError: 7809:0:(peer.c:3713:lnet_peer_ni_add_to_recoveryq_locked()) lpni 0.0.0.0@tcp added to recovery queue. Health = 0
[63220.212527] LNetError: 7809:0:(peer.c:3713:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 31 previous similar messages
[63555.213604] LNetError: 120-3: Refusing connection from 127.0.0.1 for 0.0.0.0@tcp: No matching NI
[63555.215120] LNetError: Skipped 9 previous similar messages
[63555.216062] LNetError: 7801:0:(socklnd_cb.c:1817:ksocknal_recv_hello()) Error -104 reading HELLO from 127.0.0.1
[63555.217715] LNetError: 7801:0:(socklnd_cb.c:1817:ksocknal_recv_hello()) Skipped 9 previous similar messages
[63555.219301] LNetError: 11b-b: Connection to 0.0.0.0@tcp at host 0.0.0.0 on port 7988 was reset: is it running a compatible version of Lustre and is 0.0.0.0@tcp one of its NIDs?
[63555.221883] LNetError: Skipped 9 previous similar messages
[63820.213571] LNetError: 7809:0:(peer.c:3713:lnet_peer_ni_add_to_recoveryq_locked()) lpni 0.0.0.0@tcp added to recovery queue. Health = 0
[63820.216233] LNetError: 7809:0:(peer.c:3713:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 30 previous similar messages
[64165.217617] LNetError: 120-3: Refusing connection from 127.0.0.1 for 0.0.0.0@tcp: No matching NI
[64165.219175] LNetError: Skipped 9 previous similar messages
[64165.220149] LNetError: 7800:0:(socklnd_cb.c:1817:ksocknal_recv_hello()) Error -104 reading HELLO from 127.0.0.1
[64165.222019] LNetError: 7800:0:(socklnd_cb.c:1817:ksocknal_recv_hello()) Skipped 9 previous similar messages
[64165.223598] LNetError: 11b-b: Connection to 0.0.0.0@tcp at host 0.0.0.0 on port 7988 was reset: is it running a compatible version of Lustre and is 0.0.0.0@tcp one of its NIDs?
[64165.226145] LNetError: Skipped 9 previous similar messages
[64430.217496] LNetError: 7809:0:(peer.c:3713:lnet_peer_ni_add_to_recoveryq_locked()) lpni 0.0.0.0@tcp added to recovery queue. Health = 0
[64430.219536] LNetError: 7809:0:(peer.c:3713:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 31 previous similar messages
[64795.220642] LNetError: 120-3: Refusing connection from 127.0.0.1 for 0.0.0.0@tcp: No matching NI
[64795.222149] LNetError: Skipped 9 previous similar messages
[64795.223084] LNetError: 7801:0:(socklnd_cb.c:1817:ksocknal_recv_hello()) Error -104 reading HELLO from 127.0.0.1
[64795.224887] LNetError: 7801:0:(socklnd_cb.c:1817:ksocknal_recv_hello()) Skipped 9 previous similar messages
[64795.226457] LNetError: 11b-b: Connection to 0.0.0.0@tcp at host 0.0.0.0 on port 7988 was reset: is it running a compatible version of Lustre and is 0.0.0.0@tcp one of its NIDs?
[64795.228985] LNetError: Skipped 9 previous similar messages
[65035.220473] LNetError: 7809:0:(peer.c:3713:lnet_peer_ni_add_to_recoveryq_locked()) lpni 0.0.0.0@tcp added to recovery queue. Health = 0
[65035.222497] LNetError: 7809:0:(peer.c:3713:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 31 previous similar messages |
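Editorial aside, not taken from the comment above: one way to sanity-check the 0.0.0.0@tcp peer seen in these errors is to compare the node's configured NIDs against what LNet holds in its tables; a rough sketch, assuming the lctl and lnetctl utilities from the installed Lustre release:

lctl list_nids        # NIDs actually configured on this node
lnetctl net show      # local NIs as LNet sees them
lnetctl peer show     # peer table; an entry for 0.0.0.0@tcp here is clearly invalid

The "No matching NI" message suggests the incoming connection was addressed to a NID (0.0.0.0@tcp) that no local NI serves, which is consistent with a bogus peer entry rather than a real remote node.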
| Comment by Jian Yu [ 21/Jan/20 ] |
|
+1 on master branch: https://testing.whamcloud.com/test_sets/ef710ad8-3b62-11ea-80b4-52540065bddc |
| Comment by Jian Yu [ 14/Feb/20 ] |
|
I created LU-13251 for the conf-sanity test 116 failure because that issue is not limited to DNE configurations; many conf-sanity sub-tests failed with the same issue. |