Details
- Type: Bug
- Resolution: Unresolved
- Priority: Minor
- Fix Version/s: None
- Affects Version/s: Lustre 2.12.0
- Labels: DNE
- Severity: 3
Description
insanity test 0 started hanging on July 19, 2018 for review-dne-part-4; see https://testing.whamcloud.com/test_sets/7b3d1b66-8af1-11e8-9028-52540065bddc.
insanity test 0 fails over all of the MDTs and then all of the OSTs. For each successful failover, the suite log shows the target being failed, the node being rebooted, and the target being restarted, like the excerpt below (a sketch of the loop behind this sequence follows the excerpt):
Failing mds1 on onyx-41vm9
…
reboot facets: mds1
Failover mds1 to onyx-41vm9
…
mount facets: mds1
…
Starting mds1: /dev/mapper/mds1_flakey /mnt/lustre-mds1
…
Started lustre-MDT0000
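For reference, the sequence above is driven by a simple per-target loop in lustre/tests/insanity.sh. The following is only a rough sketch, under the assumption that test_0 just calls the test-framework.sh fail helper for every MDS and OST facet (the exact test body may differ):

# Rough sketch of insanity test_0's failover loop (assumed, not copied from
# lustre/tests/insanity.sh). "fail" is the test-framework.sh helper that
# fails over the facet, reboots its node, remounts the target, and waits for
# recovery, producing the "Failing ... / Started ..." lines quoted above.
for num in $(seq $MDSCOUNT); do
	fail mds$num
done
for num in $(seq $OSTCOUNT); do
	fail ost$num
done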
We should see this for each MDT and for each OST. Yet, in the tests that hang, for example https://testing.whamcloud.com/test_sets/3f2ce6e0-ba91-11e8-8c12-52540065bddc, we see the first three MDTs fail, reboot, and mount, but not the fourth MDT. The last thing we see in the suite_log is the third MDT starting:
CMD: trevis-37vm4 e2label /dev/mapper/mds3_flakey 2>/dev/null | grep -E ':[a-zA-Z]{3}[0-9]{4}'
CMD: trevis-37vm4 e2label /dev/mapper/mds3_flakey 2>/dev/null | grep -E ':[a-zA-Z]{3}[0-9]{4}'
CMD: trevis-37vm4 e2label /dev/mapper/mds3_flakey 2>/dev/null
Started lustre-MDT0002
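The repeated e2label lines appear to be the framework polling the target's label: a newly formatted ldiskfs target carries a temporary label of the form lustre:MDT0002 and only switches to lustre-MDT0002 once it has registered with the MGS, so the grep for ':[a-zA-Z]{3}[0-9]{4}' matches only while registration is still pending. Below is a minimal stand-alone sketch of that check; the device and node names are taken from the log above, and the loop structure is an assumption, not the actual test-framework.sh code.

# Assumed shape of the label poll seen in the log above.
node=trevis-37vm4
dev=/dev/mapper/mds3_flakey
# While the label still has the temporary "fsname:MDTxxxx" form, the target
# has not yet registered with the MGS; keep waiting.
while ssh $node "e2label $dev 2>/dev/null" | grep -qE ':[a-zA-Z]{3}[0-9]{4}'; do
	sleep 1
done
echo "Started $(ssh $node "e2label $dev 2>/dev/null")"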
Looking at the console log of the fourth MDS node (vm5), we see nothing that indicates a problem with MDT0003, the fourth MDT. In fact, nothing looks obviously wrong on any of the nodes judging by their console logs.
In all of these failures, the failover of one of the MDTs hangs. Here are links to logs for more insanity test 0 runs that hang:
https://testing.whamcloud.com/test_sets/f7b26fc6-b9b7-11e8-8c12-52540065bddc
https://testing.whamcloud.com/test_sets/3dc54fa2-adfa-11e8-bbd1-52540065bddc
https://testing.whamcloud.com/test_sets/e0b3ea6e-8f5c-11e8-b0aa-52540065bddc