[LU-11256] replay-vbr test 7f is failing with 'Restart of mds1 failed!' Created: 15/Aug/18 Updated: 30/Nov/18 |
|
| Status: | Open |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.12.0, Lustre 2.10.4, Lustre 2.10.5 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Minor |
| Reporter: | James Nunez (Inactive) | Assignee: | WC Triage |
| Resolution: | Unresolved | Votes: | 0 |
| Labels: | None | ||
| Issue Links: |
|
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
replay-vbr test_7f fails on mounting an MDS. It is not clear when this test started failing with this error, but the test did not fail on MDS mount for two and a half months and then started failing again on July 14, 2018. All failures of this type since March 2018 are listed below.

Looking at the failure at https://testing.whamcloud.com/test_sets/5b253cd8-878f-11e8-9028-52540065bddc, in the test_log, the only sign of trouble is when we try to mount the failover MDS:

Failing mds1 on trevis-4vm8
+ pm -h powerman --off trevis-4vm8
Command completed successfully
reboot facets: mds1
+ pm -h powerman --on trevis-4vm8
Command completed successfully
Failover mds1 to trevis-4vm7
12:30:33 (1531571433) waiting for trevis-4vm7 network 900 secs ...
12:30:33 (1531571433) network interface is UP
CMD: trevis-4vm7 hostname
mount facets: mds1
CMD: trevis-4vm7 test -b /dev/lvm-Role_MDS/P1
CMD: trevis-4vm7 e2label /dev/lvm-Role_MDS/P1
trevis-4vm7: e2label: No such file or directory while trying to open /dev/lvm-Role_MDS/P1
trevis-4vm7: Couldn't find valid filesystem superblock.
Starting mds1: -o loop /dev/lvm-Role_MDS/P1 /mnt/lustre-mds1
CMD: trevis-4vm7 mkdir -p /mnt/lustre-mds1; mount -t lustre -o loop /dev/lvm-Role_MDS/P1 /mnt/lustre-mds1
trevis-4vm7: mount: /dev/lvm-Role_MDS/P1: failed to setup loop device: No such file or directory
Start of /dev/lvm-Role_MDS/P1 on mds1 failed 32
replay-vbr test_7f: @@@@@@ FAIL: Restart of mds1 failed!

In all the following cases, test 7g hangs when test 7f fails in this way:

2018-08-15 2.10.5 RC2 – fails in "test_7f.5 last"

In the following test session, replay-vbr test 7e fails in the way described above and test 7f hangs.
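The mount failure itself looks like a symptom: the loop-device error means /dev/lvm-Role_MDS/P1 is not present on the failover node at all. A minimal way to check this by hand on the failover node (hostname and device path taken from the log above; the lvs/vgs calls assume the MDT sits on an LVM logical volume, which the device path suggests but the log does not confirm):

host=trevis-4vm7
dev=/dev/lvm-Role_MDS/P1

# Does the device node exist, and is it a block device?
ssh "$host" "test -e $dev" && echo "$dev exists" || echo "$dev is missing"
ssh "$host" "test -b $dev" && echo "$dev is a block device"

# If it is missing, check whether the logical volume is visible but not
# activated after the reboot (assumption: the device is an LVM LV).
ssh "$host" "ls -lR $(dirname $dev) 2>&1; lvs; vgs"
|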
| Comments |
| Comment by Andreas Dilger [ 16/Aug/18 ] |
|
It looks like there is a check in mount_facet() that tests whether the "device" is a block device and, if it is not, adds "-o loop" to the mount options:
mount_facet() {
    ...
    if [ $(facet_fstype $facet) == ldiskfs ] &&
       ! do_facet $facet test -b ${!dev}; then
        opts=$(csa_add "$opts" -o loop)
    fi
    ...
    case $fstype in
    ldiskfs)
        devicelabel=$(do_facet ${facet} "$E2LABEL ${!dev}");;
    esac
    ...
    echo "Starting ${facet}: $opts ${!dev} $mntpt"
    ...
    if [ $RC -ne 0 ]; then
        echo "Start of ${!dev} on ${facet} failed ${RC}"
        return $RC
    fi
    ...
}
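In the failure above, this branch is taken because /dev/lvm-Role_MDS/P1 does not exist at all on trevis-4vm7: "test -b" is false both for a regular file (the intended loop-mount case) and for a missing path, so "-o loop" is appended and the real problem only surfaces later as "failed to setup loop device: No such file or directory". A stand-alone sketch of that decision (host and device taken from the log; ssh and echo stand in for do_facet and the actual mount):

host=trevis-4vm7
dev=/dev/lvm-Role_MDS/P1
opts=""

# test -b cannot distinguish "regular file backing a loop mount" from
# "path does not exist", so a missing device also lands on the loop path.
if ! ssh "$host" "test -b $dev"; then
    opts="-o loop"
fi
echo "would run: mount -t lustre $opts $dev /mnt/lustre-mds1"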
To further debug this issue, it would make sense to add extra diagnostics to the "test -b" failure case to determine whether the device exists at all:
    if [ $(facet_fstype $facet) == ldiskfs ] &&
       ! do_facet $facet test -b ${!dev}; then
        if ! do_facet $facet test -e ${!dev}; then
            do_facet $facet "ls -lR $(dirname ${!dev})"
            error "$facet: device ${!dev} does not exist"
        fi
        opts=$(csa_add "$opts" -o loop)
    fi
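With this change the failure would be reported at the "test -b" check as "device ... does not exist" instead of as a loop-device error from mount. If the root cause turns out to be that the volume group is simply not activated after the failover node reboots (an assumption, not something the log confirms), the same branch could also try to reactivate it before giving up. A rough sketch on top of the check above (the vgchange call, and deriving the VG name from the /dev/<vg>/<lv> path, are assumptions):

        if ! do_facet $facet test -e ${!dev}; then
            do_facet $facet "ls -lR $(dirname ${!dev})"
            # Hypothetical extra step: if the LV exists but its VG is not
            # active after the reboot, activating the VG (name assumed from
            # the /dev/<vg>/<lv> path) recreates the device node.
            do_facet $facet "vgchange -ay $(basename $(dirname ${!dev}))"
            do_facet $facet test -e ${!dev} ||
                error "$facet: device ${!dev} does not exist"
        fi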
|
| Comment by James Nunez (Inactive) [ 20/Aug/18 ] |
|
Similar failure on recovery-scale-mds in test failover_mds at https://testing.whamcloud.com/test_sets/4e5eb398-a24d-11e8-a5f2-52540065bddc |