Lustre / LU-11386

insanity test 0 hangs on failing over MDT

Details

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Minor
    • Fix Version/s: None
    • Affects Version/s: Lustre 2.12.0
    • Labels: DNE
    • Severity: 3

    Description

      insanity test 0 started hanging for review-dne-part-4 on July 19, 2018; the first instance is https://testing.whamcloud.com/test_sets/7b3d1b66-8af1-11e8-9028-52540065bddc .

      insanity test 0 fails over each MDT and each OST in turn. For each successful failover, the suite_log shows the target being failed, the facet being rebooted, and the target being started again, like the following:

      Failing mds1 on onyx-41vm9
      …
      reboot facets: mds1
      Failover mds1 to onyx-41vm9
      …
      mount facets: mds1
      …
      Starting mds1:   /dev/mapper/mds1_flakey /mnt/lustre-mds1
      …
      Started lustre-MDT0000
      
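      For context, each of these failovers corresponds roughly to the loop below. This is a minimal sketch built around the test-framework fail() helper; it is not the actual insanity.sh test_0 code, and the facet list is an assumption for a review-dne-part-4 style configuration (four MDTs, eight OSTs).

      #!/bin/bash
      # Illustrative sketch only: drive one failover per target, matching the
      # suite_log excerpt above. Assumes a configured Lustre test environment;
      # facet names and counts are assumptions.
      LUSTRE=${LUSTRE:-$(cd $(dirname "$0")/..; echo $PWD)}
      . $LUSTRE/tests/test-framework.sh
      init_test_env "$@"

      for facet in mds{1..4} ost{1..8}; do
          echo "Failing $facet"
          # fail() shuts down the facet, fails it over/reboots it, remounts its
          # target, and checks that clients respond again; it produces the
          # "Failing .../reboot facets/mount facets/Started ..." lines above.
          fail $facet
      done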

      We should see this sequence for each MDT and for each OST. Yet, in the tests that hang, for example https://testing.whamcloud.com/test_sets/3f2ce6e0-ba91-11e8-8c12-52540065bddc, we see the first three MDTs fail, reboot, and mount, but not the fourth MDT. The last thing we see in the suite_log is the third MDT starting:

      CMD: trevis-37vm4 e2label /dev/mapper/mds3_flakey 2>/dev/null | grep -E ':[a-zA-Z]{3}[0-9]{4}'
      CMD: trevis-37vm4 e2label /dev/mapper/mds3_flakey 2>/dev/null | grep -E ':[a-zA-Z]{3}[0-9]{4}'
      CMD: trevis-37vm4 e2label /dev/mapper/mds3_flakey 2>/dev/null
      Started lustre-MDT0002
      
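      As an aside, the repeated e2label calls above appear to be the framework polling the device label after mount. As I understand it, a newly formatted target carries a colon-style label (e.g. lustre:MDT0002) that is rewritten to the dash form (lustre-MDT0002) once the target has registered with the MGS, and the grep matches the colon form. A rough sketch of that polling (device path taken from the log; the loop and timeout are illustrative, not framework code):

      dev=/dev/mapper/mds3_flakey
      # Wait (up to ~60s) for the colon-style label to be replaced, then print it.
      for i in $(seq 1 30); do
          e2label "$dev" 2>/dev/null | grep -qE ':[a-zA-Z]{3}[0-9]{4}' || break
          sleep 2
      done
      e2label "$dev" 2>/dev/null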

      Looking at the console log of the node hosting the fourth MDT (vm5), we don't see anything that indicates a problem with MDT0003, the fourth MDT. In fact, nothing obviously wrong shows up in the console logs of any of the nodes.
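
      Since the console logs are quiet, a possible way to get more information the next time this hangs (standard Linux/Lustre debugging commands, not something taken from these logs) is to dump task stacks and the Lustre debug buffer on the node whose failover is stuck:

      # Run on the suspect MDS while the failover is hung (illustrative only).
      echo t > /proc/sysrq-trigger              # dump all task stacks to the console log
      lctl dk /tmp/lustre-debug.$(hostname)     # flush the Lustre kernel debug buffer to a file
      dmesg | tail -n 200                       # any recent kernel messages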

      In all of these failures, the failover of one of the MDTs hangs. Here are links to logs for more insanity test 0 hangs:
      https://testing.whamcloud.com/test_sets/f7b26fc6-b9b7-11e8-8c12-52540065bddc
      https://testing.whamcloud.com/test_sets/3dc54fa2-adfa-11e8-bbd1-52540065bddc
      https://testing.whamcloud.com/test_sets/e0b3ea6e-8f5c-11e8-b0aa-52540065bddc
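
      For anyone trying to reproduce this outside autotest, something like the following should run just this test; the MDS/OST counts mimic a review-dne-part-4 style DNE setup and are assumptions, not values taken from these logs:

      # From a built Lustre tree on a test cluster with a formatted filesystem;
      # ONLY= restricts insanity.sh to test 0.
      cd lustre/tests
      MDSCOUNT=4 OSTCOUNT=8 ONLY=0 bash insanity.sh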

      Attachments

        Activity

          jamesanunez James Nunez (Inactive) made changes -
          Description edited: "the failover of one of the MDTs on the second MDS hangs" was changed to "the failover of one of the MDTs hangs".
          jamesanunez James Nunez (Inactive) made changes -
          Description edited: added the sentence "In all of these failures, the failover of one of the MDTs on the second MDS hangs." before the list of links.
          jamesanunez James Nunez (Inactive) created issue -

          People

            Assignee: wc-triage WC Triage
            Reporter: jamesanunez James Nunez (Inactive)
