Uploaded image for project: 'Lustre'
  1. Lustre
  2. LU-12744

conf-sanity test 32c hangs in migration

    XMLWordPrintable

Details

    • Bug
    • Resolution: Cannot Reproduce
    • Minor
    • None
    • Lustre 2.13.0
    • DNE
    • 3
    • 9223372036854775807

    Description

      conf-sanity test_32c hangs. The last update in the suite_log is

      test migration
      /tmp/t32/mnt/lustre /usr/lib64/lustre/tests
      

      It looks like this issue started on 18-August-2019 with one exception. There is one conf-sanity test 32c test hang on 22-July-2019 that looks similar and may be the first occurrence of this issue; logs at https://testing.whamcloud.com/test_sets/61bcf480-ac5c-11e9-8fc1-52540065bddc, for patch https://review.whamcloud.com/35563/.

      Looking at the hang at https://testing.whamcloud.com/test_sets/5528ff0a-d38c-11e9-a2b6-52540065bddc, the MDS1 (46vm11) has the following LNet errors

      [10395.738022] Lustre: DEBUG MARKER: cat /tmp/t32/list
      [10400.803342] LNetError: 17654:0:(lib-msg.c:481:lnet_handle_local_failure()) ni 10.9.5.194@tcp added to recovery queue. Health = 900
      [10400.805485] LNetError: 17654:0:(lib-msg.c:481:lnet_handle_local_failure()) Skipped 4 previous similar messages
      [10416.806341] LNetError: 17654:0:(lib-msg.c:481:lnet_handle_local_failure()) ni 10.9.5.194@tcp added to recovery queue. Health = 900
      [10460.811339] LNetError: 17654:0:(lib-msg.c:481:lnet_handle_local_failure()) ni 10.9.5.194@tcp added to recovery queue. Health = 900
      [10460.813456] LNetError: 17654:0:(lib-msg.c:481:lnet_handle_local_failure()) Skipped 2 previous similar messages
      [10525.814339] LNetError: 17654:0:(lib-msg.c:481:lnet_handle_local_failure()) ni 10.9.5.194@tcp added to recovery queue. Health = 900
      [10525.816593] LNetError: 17654:0:(lib-msg.c:481:lnet_handle_local_failure()) Skipped 4 previous similar messages
      [10546.305918] Lustre: DEBUG MARKER: /usr/sbin/lctl set_param -n osd*.*.force_sync=1
      [10555.882400] Lustre: DEBUG MARKER: test -f /tmp/t32/sha1sums
      [10556.249212] Lustre: DEBUG MARKER: cat /tmp/t32/sha1sums
      [10575.523117] Lustre: DEBUG MARKER: /usr/sbin/lctl set_param -n mdt.t32fs*.enable_remote_dir=1
      [10875.789488] LNetError: 17659:0:(peer.c:3713:lnet_peer_ni_add_to_recoveryq_locked()) lpni 1.2.3.4@tcp added to recovery queue. Health = 0
      [10875.791705] LNetError: 17659:0:(peer.c:3713:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 38 previous similar messages
      [11480.794448] LNetError: 17659:0:(peer.c:3713:lnet_peer_ni_add_to_recoveryq_locked()) lpni 1.2.3.4@tcp added to recovery queue. Health = 0
      [11480.796650] LNetError: 17659:0:(peer.c:3713:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 37 previous similar messages
      

      In the client 1 () console logs we see similar LNet errors

      [ 8464.161118] Lustre: DEBUG MARKER: == conf-sanity test 32c: dne upgrade test ============================================================ 04:16:14 (1568088974)
      [ 8471.051044] LNetError: 20186:0:(lib-msg.c:481:lnet_handle_local_failure()) ni 10.9.5.59@tcp added to recovery queue. Health = 416
      [ 8476.053138] LNetError: 20191:0:(peer.c:3713:lnet_peer_ni_add_to_recoveryq_locked()) lpni 1.2.3.4@tcp added to recovery queue. Health = 0
      [ 8611.055031] LNetError: 20191:0:(lib-move.c:3044:lnet_resend_pending_msgs_locked()) Error sending GET to 12345-1.2.3.4@tcp: -125
      [ 8611.472153] Lustre: Mounted t32fs-client
      [ 8616.057105] LNetError: 20191:0:(peer.c:3713:lnet_peer_ni_add_to_recoveryq_locked()) lpni 1.2.3.4@tcp added to recovery queue. Health = 0
      [ 8631.053042] LNetError: 20186:0:(lib-msg.c:481:lnet_handle_local_failure()) ni 10.9.5.59@tcp added to recovery queue. Health = 900
      [ 8756.078407] LNetError: 20191:0:(peer.c:3713:lnet_peer_ni_add_to_recoveryq_locked()) lpni 1.2.3.4@tcp added to recovery queue. Health = 0
      [ 8766.082072] LNetError: 20191:0:(peer.c:3713:lnet_peer_ni_add_to_recoveryq_locked()) lpni 1.2.3.4@tcp added to recovery queue. Health = 0
      [ 8781.078070] LNetError: 20186:0:(lib-msg.c:481:lnet_handle_local_failure()) ni 10.9.5.59@tcp added to recovery queue. Health = 900
      [ 8781.080441] LNetError: 20186:0:(lib-msg.c:481:lnet_handle_local_failure()) Skipped 9 previous similar messages
      [ 8911.084112] LNetError: 20191:0:(peer.c:3713:lnet_peer_ni_add_to_recoveryq_locked()) lpni 1.2.3.4@tcp added to recovery queue. Health = 0
      [ 9047.082034] LNetError: 20186:0:(lib-msg.c:481:lnet_handle_local_failure()) ni 10.9.5.59@tcp added to recovery queue. Health = 900
      [ 9047.084098] LNetError: 20186:0:(lib-msg.c:481:lnet_handle_local_failure()) Skipped 10 previous similar messages
      [ 9071.086136] LNetError: 20191:0:(peer.c:3713:lnet_peer_ni_add_to_recoveryq_locked()) lpni 1.2.3.4@tcp added to recovery queue. Health = 0
      [ 9071.088202] LNetError: 20191:0:(peer.c:3713:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 1 previous similar message
      [ 9111.089105] LNetError: 20191:0:(peer.c:3713:lnet_peer_ni_add_to_recoveryq_locked()) lpni 1.2.3.4@tcp added to recovery queue. Health = 0
      [ 9111.091216] LNetError: 20191:0:(peer.c:3713:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 2 previous similar messages
      [ 9271.093137] LNetError: 20191:0:(peer.c:3713:lnet_peer_ni_add_to_recoveryq_locked()) lpni 1.2.3.4@tcp added to recovery queue. Health = 0
      [ 9271.095269] LNetError: 20191:0:(peer.c:3713:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 1 previous similar message
      [ 9461.097114] LNetError: 20191:0:(peer.c:3713:lnet_peer_ni_add_to_recoveryq_locked()) lpni 1.2.3.4@tcp added to recovery queue. Health = 0
      [ 9461.099271] LNetError: 20191:0:(peer.c:3713:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 3 previous similar messages
      [ 9567.085032] LNetError: 20186:0:(lib-msg.c:481:lnet_handle_local_failure()) ni 10.9.5.59@tcp added to recovery queue. Health = 900
      [ 9567.087041] LNetError: 20186:0:(lib-msg.c:481:lnet_handle_local_failure()) Skipped 21 previous similar messages
      [ 9846.101090] LNetError: 20191:0:(peer.c:3713:lnet_peer_ni_add_to_recoveryq_locked()) lpni 1.2.3.4@tcp added to recovery queue. Health = 0
      [ 9846.103176] LNetError: 20191:0:(peer.c:3713:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 7 previous similar messages
      [10154.362691] Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n mdc.t32fs-MDT0000*.max_rpcs_in_flight
      [10154.378201] Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n mdc.t32fs-MDT0000*.max_rpcs_in_flight
      [10154.786948] Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n mdc.t32fs-MDT0000*.max_rpcs_in_flight
      [10154.799759] Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n mdc.t32fs-MDT0000*.max_rpcs_in_flight
      [10155.813149] Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n mdc.t32fs-MDT0000*.max_rpcs_in_flight
      [10156.827367] Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n mdc.t32fs-MDT0000*.max_rpcs_in_flight
      [10157.841681] Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n mdc.t32fs-MDT0000*.max_rpcs_in_flight
      [10158.855999] Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n mdc.t32fs-MDT0000*.max_rpcs_in_flight
      [10159.869779] Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n mdc.t32fs-MDT0000*.max_rpcs_in_flight
      [10160.883948] Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n mdc.t32fs-MDT0000*.max_rpcs_in_flight
      [10161.897838] Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n mdc.t32fs-MDT0000*.max_rpcs_in_flight
      [10162.040091] Lustre: Unmounted t32fs-client
      [10296.089054] LNetError: 20186:0:(lib-msg.c:481:lnet_handle_local_failure()) ni 10.9.5.59@tcp added to recovery queue. Health = 900
      [10296.091112] LNetError: 20186:0:(lib-msg.c:481:lnet_handle_local_failure()) Skipped 22 previous similar messages
      [10426.104119] LNetError: 20191:0:(peer.c:3713:lnet_peer_ni_add_to_recoveryq_locked()) lpni 1.2.3.4@tcp added to recovery queue. Health = 0
      [10426.106258] LNetError: 20191:0:(peer.c:3713:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 11 previous similar messages
      [10536.911104] LustreError: 11-0: t32fs-MDT0000-mdc-ffff9cb8f61ea800: operation mds_connect to node 10.9.5.194@tcp failed: rc = -114
      [10541.919003] LustreError: 11-0: t32fs-MDT0000-mdc-ffff9cb8f61ea800: operation mds_connect to node 10.9.5.194@tcp failed: rc = -114
      [10546.940716] Lustre: Mounted t32fs-client
      [10896.106040] LNetError: 20186:0:(lib-msg.c:481:lnet_handle_local_failure()) ni 10.9.5.59@tcp added to recovery queue. Health = 900
      [10896.108273] LNetError: 20186:0:(lib-msg.c:481:lnet_handle_local_failure()) Skipped 13 previous similar messages
      [11026.122103] LNetError: 20191:0:(peer.c:3713:lnet_peer_ni_add_to_recoveryq_locked()) lpni 1.2.3.4@tcp added to recovery queue. Health = 0
      [11026.124299] LNetError: 20191:0:(peer.c:3713:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 13 previous similar messages
      [11506.110027] LNetError: 20186:0:(lib-msg.c:481:lnet_handle_local_failure()) ni 10.9.5.59@tcp added to recovery queue. Health = 900
      [11506.112068] LNetError: 20186:0:(lib-msg.c:481:lnet_handle_local_failure()) Skipped 17 previous similar messages
      [11636.126075] LNetError: 20191:0:(peer.c:3713:lnet_peer_ni_add_to_recoveryq_locked()) lpni 1.2.3.4@tcp added to recovery queue. Health = 0
      [11636.128150] LNetError: 20191:0:(peer.c:3713:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 13 previous similar messages
      

      Logs for hung test sessions are at
      https://testing.whamcloud.com/test_sets/4d846992-c1e9-11e9-98c8-52540065bddc
      https://testing.whamcloud.com/test_sets/605ec094-c577-11e9-90ad-52540065bddc
      https://testing.whamcloud.com/test_sets/54161a1a-c592-11e9-a2b6-52540065bddc

      Attachments

        Issue Links

          Activity

            People

              wc-triage WC Triage
              jamesanunez James Nunez (Inactive)
              Votes:
              0 Vote for this issue
              Watchers:
              4 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: