[LU-11795] replay-vbr test 8b fails with 'Restart of mds1 failed!' Created: 17/Dec/18 Updated: 08/Oct/20
| Status: | Open |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.12.0, Lustre 2.13.0, Lustre 2.12.1, Lustre 2.12.4 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Minor |
| Reporter: | James Nunez (Inactive) | Assignee: | WC Triage |
| Resolution: | Unresolved | Votes: | 0 |
| Labels: | failover |
| Issue Links: | |
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
replay-vbr test_8b fails with 'Restart of mds1 failed!'. So far, this test has failed only once: https://testing.whamcloud.com/test_sets/4fca0808-fd1b-11e8-8512-52540065bddc

Looking at the client test_log, we see the MDS has problems mounting the target:

mount facets: mds1
CMD: trevis-16vm8 dmsetup status /dev/mapper/mds1_flakey >/dev/null 2>&1
CMD: trevis-16vm8 test -b /dev/lvm-Role_MDS/P1
CMD: trevis-16vm8 loop_dev=\$(losetup -j /dev/lvm-Role_MDS/P1 | cut -d : -f 1); if [[ -z \$loop_dev ]]; then loop_dev=\$(losetup -f); losetup \$loop_dev /dev/lvm-Role_MDS/P1 || loop_dev=; fi; echo -n \$loop_dev
trevis-16vm8: losetup: /dev/lvm-Role_MDS/P1: failed to set up loop device: No such file or directory
CMD: trevis-16vm8 test -b /dev/lvm-Role_MDS/P1
CMD: trevis-16vm8 e2label /dev/lvm-Role_MDS/P1
trevis-16vm8: e2label: No such file or directory while trying to open /dev/lvm-Role_MDS/P1
trevis-16vm8: Couldn't find valid filesystem superblock.
Starting mds1: -o loop /dev/lvm-Role_MDS/P1 /mnt/lustre-mds1
CMD: trevis-16vm8 mkdir -p /mnt/lustre-mds1; mount -t lustre -o loop /dev/lvm-Role_MDS/P1 /mnt/lustre-mds1
trevis-16vm8: mount: /dev/lvm-Role_MDS/P1: failed to setup loop device: No such file or directory
Start of /dev/lvm-Role_MDS/P1 on mds1 failed 32
replay-vbr test_8b: @@@@@@ FAIL: Restart of mds1 failed!

Looking at the MDS1 (vm8) console log, we see replay-vbr test 8a start up, MDS1 disconnect, and some stack traces (possibly from the replay-vbr test_8c hang); the next Lustre test script output is for replay-single test 0a. The MDS2 (vm7) console log has similar content.

In the dmesg log for the OSS (vm5), we see some errors for test 8b:

[ 8057.763283] Lustre: DEBUG MARKER: == replay-vbr test 8b: create | unlink, create shouldn't fail ======================================== 17:11:00 (1544490660)
[ 8058.291385] Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-16vm3: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4
[ 8058.489932] Lustre: DEBUG MARKER: trevis-16vm3: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4
[ 8067.329126] LNetError: 6975:0:(socklnd.c:1679:ksocknal_destroy_conn()) Completing partial receive from 12345-10.9.4.191@tcp[1], ip 10.9.4.191:7988, with error, wanted: 152, left: 152, last alive is 5 secs ago
[ 8067.330996] LustreError: 6975:0:(events.c:305:request_in_callback()) event type 2, status -5, service ost
[ 8067.331917] LustreError: 26344:0:(pack_generic.c:590:__lustre_unpack_msg()) message length 0 too small for magic/version check
[ 8067.332999] LustreError: 26344:0:(sec.c:2068:sptlrpc_svc_unwrap_request()) error unpacking request from 12345-10.9.4.191@tcp x1619515714571600
[ 8077.892741] Lustre: DEBUG MARKER: /usr/sbin/lctl mark replay-vbr test_8b: @@@@@@ FAIL: Restart of mds1 failed!
[ 8078.081333] Lustre: DEBUG MARKER: replay-vbr test_8b: @@@@@@ FAIL: Restart of mds1 failed!
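For context, the facet mount path builds the loop device with a find-or-create step, as the CMD line above shows. Below is a minimal standalone sketch of that logic (the device path is copied from the log; the script itself is illustrative, not the actual lustre/tests framework code):

  #!/bin/bash
  # Sketch of the find-or-create loop device step from the CMD line above.
  dev=/dev/lvm-Role_MDS/P1

  # Reuse an existing loop device already bound to $dev, if there is one.
  loop_dev=$(losetup -j "$dev" | cut -d : -f 1)

  if [[ -z $loop_dev ]]; then
          # No existing binding: take the first free loop device and bind it.
          # This is the point where the test fails: losetup returns "No such
          # file or directory" because the backing LVM volume node is gone.
          loop_dev=$(losetup -f)
          losetup "$loop_dev" "$dev" || loop_dev=
  fi

  echo -n "$loop_dev"

Both branches assume /dev/lvm-Role_MDS/P1 exists; neither the preceding 'test -b' in the log nor losetup itself recreates the device node, so once the volume disappears the mount cannot proceed.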
| Comments |
| Comment by James Nunez (Inactive) [ 25/Apr/19 ] |
We're seeing a very similar issue where an OSS cannot be restarted during recovery-mds-scale test_failover_ost: https://testing.whamcloud.com/test_sets/a2fd473a-6632-11e9-bd0e-52540065bddc

From the suite_log, we see:

trevis-34vm6: losetup: /dev/lvm-Role_OSS/P1: failed to set up loop device: No such file or directory
CMD: trevis-34vm6 test -b /dev/lvm-Role_OSS/P1
CMD: trevis-34vm6 e2label /dev/lvm-Role_OSS/P1
trevis-34vm6: e2label: No such file or directory while trying to open /dev/lvm-Role_OSS/P1
trevis-34vm6: Couldn't find valid filesystem superblock.
Starting ost1: -o loop /dev/lvm-Role_OSS/P1 /mnt/lustre-ost1
CMD: trevis-34vm6 mkdir -p /mnt/lustre-ost1; mount -t lustre -o loop /dev/lvm-Role_OSS/P1 /mnt/lustre-ost1
trevis-34vm6: mount: /dev/lvm-Role_OSS/P1: failed to setup loop device: No such file or directory
Start of /dev/lvm-Role_OSS/P1 on ost1 failed 32
recovery-mds-scale test_failover_ost: @@@@@@ FAIL: Restart of ost1 failed!
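In both failures losetup reports 'No such file or directory' for the backing LVM volume, which suggests the logical volume's device node is missing or inactive when the facet is restarted. A diagnostic sketch one could run on the affected node before the mount (the volume and VG names are taken from the log above; the reactivation step is an assumption about a possible workaround, not something the test framework does today):

  #!/bin/bash
  # Check that the backing LVM volume node exists; if not, try to
  # reactivate the volume group and wait for udev to recreate the node.
  dev=/dev/lvm-Role_OSS/P1
  vg=lvm-Role_OSS

  if [[ ! -b $dev ]]; then
          echo "$dev missing; current LV state:"
          lvs "$vg" 2>&1
          vgchange -ay "$vg"     # (re)activate all LVs in the VG
          udevadm settle         # wait for device nodes to appear
  fi

  if test -b "$dev"; then
          echo "$dev is present"
  else
          echo "$dev still missing" >&2
          exit 1
  fi

If the node comes back after 'vgchange -ay', that would point at LV activation racing with the failover/restart rather than at actual loss of the volume.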
| Comment by James Nunez (Inactive) [ 17/Oct/19 ] |
We're seeing the same errors in non-failover testing as well; all tests failed for performance-sanity: https://testing.whamcloud.com/test_sets/c9195b96-e591-11e9-9874-52540065bddc