[LU-11256] replay-vbr test 7f is failing with 'Restart of mds1 failed!' Created: 15/Aug/18  Updated: 30/Nov/18

Status: Open
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.12.0, Lustre 2.10.4, Lustre 2.10.5
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: James Nunez (Inactive) Assignee: WC Triage
Resolution: Unresolved Votes: 0
Labels: None

Issue Links:
Related
is related to LU-10519 replay-vbr fails to start running tes... Open
is related to LU-10708 replay-single test_20b: Restart of md... Open
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

replay-vbr test_7f fails while mounting an MDS. It's not clear when this test started failing with this error, but it looks like the test did not fail on MDS mount for about two and a half months before failing again on July 14, 2018. All failures of this type since March 2018 are listed below.

Looking at the failure at https://testing.whamcloud.com/test_sets/5b253cd8-878f-11e8-9028-52540065bddc, the only sign of trouble in the test_log is when we try to mount the failover MDS:

Failing mds1 on trevis-4vm8
+ pm -h powerman --off trevis-4vm8
Command completed successfully
reboot facets: mds1
+ pm -h powerman --on trevis-4vm8
Command completed successfully
Failover mds1 to trevis-4vm7
12:30:33 (1531571433) waiting for trevis-4vm7 network 900 secs ...
12:30:33 (1531571433) network interface is UP
CMD: trevis-4vm7 hostname
mount facets: mds1
CMD: trevis-4vm7 test -b /dev/lvm-Role_MDS/P1
CMD: trevis-4vm7 e2label /dev/lvm-Role_MDS/P1
trevis-4vm7: e2label: No such file or directory while trying to open /dev/lvm-Role_MDS/P1
trevis-4vm7: Couldn't find valid filesystem superblock.
Starting mds1:   -o loop /dev/lvm-Role_MDS/P1 /mnt/lustre-mds1
CMD: trevis-4vm7 mkdir -p /mnt/lustre-mds1; mount -t lustre   -o loop 		                   /dev/lvm-Role_MDS/P1 /mnt/lustre-mds1
trevis-4vm7: mount: /dev/lvm-Role_MDS/P1: failed to setup loop device: No such file or directory
Start of /dev/lvm-Role_MDS/P1 on mds1 failed 32
 replay-vbr test_7f: @@@@@@ FAIL: Restart of mds1 failed! 
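
The failing mount shows that the device node /dev/lvm-Role_MDS/P1 is missing entirely on the failover node (both e2label and the loop-device setup fail with "No such file or directory"), not just that the filesystem is unreadable. Since the path appears to name an LVM logical volume (VG lvm-Role_MDS, LV P1), one possible manual check, sketched below using the node and device names from the log above and not part of the test framework, would be to see whether the volume group was activated after the node was powered back on:

# Sketch only: run on the failover node (trevis-4vm7 in the log above);
# the VG/LV names are taken from the failing mount command.
test -e /dev/lvm-Role_MDS/P1 || echo "device node missing"
lvscan                      # is LV P1 listed, and is it ACTIVE?
vgs                         # is VG lvm-Role_MDS visible at all?
vgchange -ay lvm-Role_MDS   # try activating the VG, then re-check
ls -l /dev/lvm-Role_MDS/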

In all the following cases, test 7g hangs when test 7f fails in this way.

2018-08-15 2.10.5 RC2 - fails in “test_7f.5 last”
https://testing.whamcloud.com/test_sets/a75d306e-a081-11e8-8ee3-52540065bddc
2018-08-02 2.10.4.14 - fails in “test_7f.5 last”
https://testing.whamcloud.com/test_sets/7405ad54-9645-11e8-a9f7-52540065bddc
2018-07-14 2.10.4.8 - fails in “test_7f.1 last”
https://testing.whamcloud.com/test_sets/5b253cd8-878f-11e8-9028-52540065bddc
2018-04-12 2.11.50.51 - fails in “test_7f.4 last”
https://testing.whamcloud.com/test_sets/37bad538-3e69-11e8-b45c-52540065bddc
2018-03-03 2.10.3.35 - fails in “test_7f.4 last”
https://testing.whamcloud.com/test_sets/f33a4326-1f0f-11e8-a6ca-52540065bddc

In the following test sessions, replay-vbr test 7e fails in the way described above and test 7f hangs:
2018-07-15 2.10.4.8 - fails in “test_7e.5 last”
https://testing.whamcloud.com/test_sets/0ca6ca46-87fc-11e8-b376-52540065bddc
2018-03-14 2.10.59 - fails in “test_7e.5 last”
https://testing.whamcloud.com/test_sets/d26f60b0-2809-11e8-b6a0-52540065bddc



 Comments   
Comment by Andreas Dilger [ 16/Aug/18 ]

It looks like there is a check in mount_facet() (excerpted below from test-framework.sh, with unrelated parts elided) that checks whether the "device" is a block device and, if not, adds "-o loop" to the mount options:

mount_facet() {
        # ...
        if [ $(facet_fstype $facet) == ldiskfs ] &&
           ! do_facet $facet test -b ${!dev}; then
                opts=$(csa_add "$opts" -o loop)
        fi
        # ...
        case $fstype in
        ldiskfs)
                devicelabel=$(do_facet ${facet} "$E2LABEL ${!dev}");;
        esac
        # ...
        echo "Starting ${facet}: $opts ${!dev} $mntpt"
        # ...
        if [ $RC -ne 0 ]; then
                echo "Start of ${!dev} on ${facet} failed ${RC}"
                return $RC
        fi
        # ...
}

To further debug this issue, it would make sense to add some debugging to the "test -b" failure case to determine whether the device even exists:

        if [ $(facet_fstype $facet) == ldiskfs ] &&
           ! do_facet $facet test -b ${!dev}; then
                if ! do_facet $facet test -e ${!dev}; then
                        do_facet $facet "ls -lR $(dirname ${!dev})"
                        error "$facet: device ${!dev} does not exist"
                fi
                opts=$(csa_add "$opts" -o loop)
        fi
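
A hypothetical extension of the sketch above could also dump LVM and device-mapper state in the same error path, since the failing device in the logs appears to be a logical volume; the extra do_facet call below is untested and only illustrative:

        if [ $(facet_fstype $facet) == ldiskfs ] &&
           ! do_facet $facet test -b ${!dev}; then
                if ! do_facet $facet test -e ${!dev}; then
                        do_facet $facet "ls -lR $(dirname ${!dev})"
                        # sketch: also capture LVM state for LV-backed targets
                        # such as /dev/lvm-Role_MDS/P1
                        do_facet $facet "lvscan; vgs; dmsetup ls"
                        error "$facet: device ${!dev} does not exist"
                fi
                opts=$(csa_add "$opts" -o loop)
        fi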
Comment by James Nunez (Inactive) [ 20/Aug/18 ]

A similar failure occurred in recovery-mds-scale test failover_mds at https://testing.whamcloud.com/test_sets/4e5eb398-a24d-11e8-a5f2-52540065bddc
