Details
-
Bug
-
Resolution: Cannot Reproduce
-
Minor
-
None
-
Lustre 2.13.0
-
DNE
-
3
-
9223372036854775807
Description
conf-sanity test_32c hangs. The last update in the suite_log is
test migration /tmp/t32/mnt/lustre /usr/lib64/lustre/tests
It looks like this issue started on 18-August-2019 with one exception. There is one conf-sanity test 32c test hang on 22-July-2019 that looks similar and may be the first occurrence of this issue; logs at https://testing.whamcloud.com/test_sets/61bcf480-ac5c-11e9-8fc1-52540065bddc, for patch https://review.whamcloud.com/35563/.
Looking at the hang at https://testing.whamcloud.com/test_sets/5528ff0a-d38c-11e9-a2b6-52540065bddc, the MDS1 (46vm11) has the following LNet errors
[10395.738022] Lustre: DEBUG MARKER: cat /tmp/t32/list [10400.803342] LNetError: 17654:0:(lib-msg.c:481:lnet_handle_local_failure()) ni 10.9.5.194@tcp added to recovery queue. Health = 900 [10400.805485] LNetError: 17654:0:(lib-msg.c:481:lnet_handle_local_failure()) Skipped 4 previous similar messages [10416.806341] LNetError: 17654:0:(lib-msg.c:481:lnet_handle_local_failure()) ni 10.9.5.194@tcp added to recovery queue. Health = 900 [10460.811339] LNetError: 17654:0:(lib-msg.c:481:lnet_handle_local_failure()) ni 10.9.5.194@tcp added to recovery queue. Health = 900 [10460.813456] LNetError: 17654:0:(lib-msg.c:481:lnet_handle_local_failure()) Skipped 2 previous similar messages [10525.814339] LNetError: 17654:0:(lib-msg.c:481:lnet_handle_local_failure()) ni 10.9.5.194@tcp added to recovery queue. Health = 900 [10525.816593] LNetError: 17654:0:(lib-msg.c:481:lnet_handle_local_failure()) Skipped 4 previous similar messages [10546.305918] Lustre: DEBUG MARKER: /usr/sbin/lctl set_param -n osd*.*.force_sync=1 [10555.882400] Lustre: DEBUG MARKER: test -f /tmp/t32/sha1sums [10556.249212] Lustre: DEBUG MARKER: cat /tmp/t32/sha1sums [10575.523117] Lustre: DEBUG MARKER: /usr/sbin/lctl set_param -n mdt.t32fs*.enable_remote_dir=1 [10875.789488] LNetError: 17659:0:(peer.c:3713:lnet_peer_ni_add_to_recoveryq_locked()) lpni 1.2.3.4@tcp added to recovery queue. Health = 0 [10875.791705] LNetError: 17659:0:(peer.c:3713:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 38 previous similar messages [11480.794448] LNetError: 17659:0:(peer.c:3713:lnet_peer_ni_add_to_recoveryq_locked()) lpni 1.2.3.4@tcp added to recovery queue. Health = 0 [11480.796650] LNetError: 17659:0:(peer.c:3713:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 37 previous similar messages
In the client 1 () console logs we see similar LNet errors
[ 8464.161118] Lustre: DEBUG MARKER: == conf-sanity test 32c: dne upgrade test ============================================================ 04:16:14 (1568088974) [ 8471.051044] LNetError: 20186:0:(lib-msg.c:481:lnet_handle_local_failure()) ni 10.9.5.59@tcp added to recovery queue. Health = 416 [ 8476.053138] LNetError: 20191:0:(peer.c:3713:lnet_peer_ni_add_to_recoveryq_locked()) lpni 1.2.3.4@tcp added to recovery queue. Health = 0 [ 8611.055031] LNetError: 20191:0:(lib-move.c:3044:lnet_resend_pending_msgs_locked()) Error sending GET to 12345-1.2.3.4@tcp: -125 [ 8611.472153] Lustre: Mounted t32fs-client [ 8616.057105] LNetError: 20191:0:(peer.c:3713:lnet_peer_ni_add_to_recoveryq_locked()) lpni 1.2.3.4@tcp added to recovery queue. Health = 0 [ 8631.053042] LNetError: 20186:0:(lib-msg.c:481:lnet_handle_local_failure()) ni 10.9.5.59@tcp added to recovery queue. Health = 900 [ 8756.078407] LNetError: 20191:0:(peer.c:3713:lnet_peer_ni_add_to_recoveryq_locked()) lpni 1.2.3.4@tcp added to recovery queue. Health = 0 [ 8766.082072] LNetError: 20191:0:(peer.c:3713:lnet_peer_ni_add_to_recoveryq_locked()) lpni 1.2.3.4@tcp added to recovery queue. Health = 0 [ 8781.078070] LNetError: 20186:0:(lib-msg.c:481:lnet_handle_local_failure()) ni 10.9.5.59@tcp added to recovery queue. Health = 900 [ 8781.080441] LNetError: 20186:0:(lib-msg.c:481:lnet_handle_local_failure()) Skipped 9 previous similar messages [ 8911.084112] LNetError: 20191:0:(peer.c:3713:lnet_peer_ni_add_to_recoveryq_locked()) lpni 1.2.3.4@tcp added to recovery queue. Health = 0 [ 9047.082034] LNetError: 20186:0:(lib-msg.c:481:lnet_handle_local_failure()) ni 10.9.5.59@tcp added to recovery queue. Health = 900 [ 9047.084098] LNetError: 20186:0:(lib-msg.c:481:lnet_handle_local_failure()) Skipped 10 previous similar messages [ 9071.086136] LNetError: 20191:0:(peer.c:3713:lnet_peer_ni_add_to_recoveryq_locked()) lpni 1.2.3.4@tcp added to recovery queue. Health = 0 [ 9071.088202] LNetError: 20191:0:(peer.c:3713:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 1 previous similar message [ 9111.089105] LNetError: 20191:0:(peer.c:3713:lnet_peer_ni_add_to_recoveryq_locked()) lpni 1.2.3.4@tcp added to recovery queue. Health = 0 [ 9111.091216] LNetError: 20191:0:(peer.c:3713:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 2 previous similar messages [ 9271.093137] LNetError: 20191:0:(peer.c:3713:lnet_peer_ni_add_to_recoveryq_locked()) lpni 1.2.3.4@tcp added to recovery queue. Health = 0 [ 9271.095269] LNetError: 20191:0:(peer.c:3713:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 1 previous similar message [ 9461.097114] LNetError: 20191:0:(peer.c:3713:lnet_peer_ni_add_to_recoveryq_locked()) lpni 1.2.3.4@tcp added to recovery queue. Health = 0 [ 9461.099271] LNetError: 20191:0:(peer.c:3713:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 3 previous similar messages [ 9567.085032] LNetError: 20186:0:(lib-msg.c:481:lnet_handle_local_failure()) ni 10.9.5.59@tcp added to recovery queue. Health = 900 [ 9567.087041] LNetError: 20186:0:(lib-msg.c:481:lnet_handle_local_failure()) Skipped 21 previous similar messages [ 9846.101090] LNetError: 20191:0:(peer.c:3713:lnet_peer_ni_add_to_recoveryq_locked()) lpni 1.2.3.4@tcp added to recovery queue. Health = 0 [ 9846.103176] LNetError: 20191:0:(peer.c:3713:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 7 previous similar messages [10154.362691] Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n mdc.t32fs-MDT0000*.max_rpcs_in_flight [10154.378201] Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n mdc.t32fs-MDT0000*.max_rpcs_in_flight [10154.786948] Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n mdc.t32fs-MDT0000*.max_rpcs_in_flight [10154.799759] Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n mdc.t32fs-MDT0000*.max_rpcs_in_flight [10155.813149] Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n mdc.t32fs-MDT0000*.max_rpcs_in_flight [10156.827367] Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n mdc.t32fs-MDT0000*.max_rpcs_in_flight [10157.841681] Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n mdc.t32fs-MDT0000*.max_rpcs_in_flight [10158.855999] Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n mdc.t32fs-MDT0000*.max_rpcs_in_flight [10159.869779] Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n mdc.t32fs-MDT0000*.max_rpcs_in_flight [10160.883948] Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n mdc.t32fs-MDT0000*.max_rpcs_in_flight [10161.897838] Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n mdc.t32fs-MDT0000*.max_rpcs_in_flight [10162.040091] Lustre: Unmounted t32fs-client [10296.089054] LNetError: 20186:0:(lib-msg.c:481:lnet_handle_local_failure()) ni 10.9.5.59@tcp added to recovery queue. Health = 900 [10296.091112] LNetError: 20186:0:(lib-msg.c:481:lnet_handle_local_failure()) Skipped 22 previous similar messages [10426.104119] LNetError: 20191:0:(peer.c:3713:lnet_peer_ni_add_to_recoveryq_locked()) lpni 1.2.3.4@tcp added to recovery queue. Health = 0 [10426.106258] LNetError: 20191:0:(peer.c:3713:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 11 previous similar messages [10536.911104] LustreError: 11-0: t32fs-MDT0000-mdc-ffff9cb8f61ea800: operation mds_connect to node 10.9.5.194@tcp failed: rc = -114 [10541.919003] LustreError: 11-0: t32fs-MDT0000-mdc-ffff9cb8f61ea800: operation mds_connect to node 10.9.5.194@tcp failed: rc = -114 [10546.940716] Lustre: Mounted t32fs-client [10896.106040] LNetError: 20186:0:(lib-msg.c:481:lnet_handle_local_failure()) ni 10.9.5.59@tcp added to recovery queue. Health = 900 [10896.108273] LNetError: 20186:0:(lib-msg.c:481:lnet_handle_local_failure()) Skipped 13 previous similar messages [11026.122103] LNetError: 20191:0:(peer.c:3713:lnet_peer_ni_add_to_recoveryq_locked()) lpni 1.2.3.4@tcp added to recovery queue. Health = 0 [11026.124299] LNetError: 20191:0:(peer.c:3713:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 13 previous similar messages [11506.110027] LNetError: 20186:0:(lib-msg.c:481:lnet_handle_local_failure()) ni 10.9.5.59@tcp added to recovery queue. Health = 900 [11506.112068] LNetError: 20186:0:(lib-msg.c:481:lnet_handle_local_failure()) Skipped 17 previous similar messages [11636.126075] LNetError: 20191:0:(peer.c:3713:lnet_peer_ni_add_to_recoveryq_locked()) lpni 1.2.3.4@tcp added to recovery queue. Health = 0 [11636.128150] LNetError: 20191:0:(peer.c:3713:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 13 previous similar messages
Logs for hung test sessions are at
https://testing.whamcloud.com/test_sets/4d846992-c1e9-11e9-98c8-52540065bddc
https://testing.whamcloud.com/test_sets/605ec094-c577-11e9-90ad-52540065bddc
https://testing.whamcloud.com/test_sets/54161a1a-c592-11e9-a2b6-52540065bddc