[LU-12245] replay-vbr test 5b fails with 'Restart of mds1 failed!' Created: 29/Apr/19 Updated: 15/Nov/19 |
|
| Status: | Open |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.12.1, Lustre 2.12.3 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Minor |
| Reporter: | James Nunez (Inactive) | Assignee: | WC Triage |
| Resolution: | Unresolved | Votes: | 0 |
| Labels: | failover | ||
| Environment: |
SLES12 SP4 |
||
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
replay-vbr test_5b fails with 'Restart of mds1 failed!'; so far, only for SLES12 SP4. Looking at the suite_log for a recent failure with logs at https://testing.whamcloud.com/test_sets/be1853f6-6692-11e9-8bb1-52540065bddc , we see that mounting the failed-over MDS does not work:

Failover mds1 to trevis-35vm7
00:10:41 (1556089841) waiting for trevis-35vm7 network 900 secs ...
00:10:41 (1556089841) network interface is UP
CMD: trevis-35vm7 hostname
mount facets: mds1
CMD: trevis-35vm7 dmsetup status /dev/mapper/mds1_flakey >/dev/null 2>&1
CMD: trevis-35vm7 test -b /dev/lvm-Role_MDS/P1
CMD: trevis-35vm7 loop_dev=\$(losetup -j /dev/lvm-Role_MDS/P1 | cut -d : -f 1); if [[ -z \$loop_dev ]]; then loop_dev=\$(losetup -f); losetup \$loop_dev /dev/lvm-Role_MDS/P1 || loop_dev=; fi; echo -n \$loop_dev
trevis-35vm7: losetup: /dev/lvm-Role_MDS/P1: failed to set up loop device: No such file or directory
CMD: trevis-35vm7 test -b /dev/lvm-Role_MDS/P1
CMD: trevis-35vm7 e2label /dev/lvm-Role_MDS/P1
trevis-35vm7: e2label: No such file or directory while trying to open /dev/lvm-Role_MDS/P1
trevis-35vm7: Couldn't find valid filesystem superblock.
Starting mds1: -o loop /dev/lvm-Role_MDS/P1 /mnt/lustre-mds1
CMD: trevis-35vm7 mkdir -p /mnt/lustre-mds1; mount -t lustre -o loop /dev/lvm-Role_MDS/P1 /mnt/lustre-mds1
trevis-35vm7: mount: /dev/lvm-Role_MDS/P1: failed to setup loop device: No such file or directory
Start of /dev/lvm-Role_MDS/P1 on mds1 failed 32
 replay-vbr test_5b: @@@@@@ FAIL: Restart of mds1 failed!
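The loop-device lookup that fails in the CMD above can be sketched as a standalone POSIX-shell function. This is a reconstruction of the quoted snippet, not the framework's actual code; `find_loop_dev` is an illustrative name, and the stubbed `losetup` only mimics the "No such file or directory" failure seen in the log so the sketch runs without root or real block devices:

```shell
#!/bin/sh
# Reconstruction of the loop-device lookup from the CMD line above:
# print the loop device backing $1, attaching a free one if none is
# attached yet; print nothing if attachment fails.
find_loop_dev() {
	backing=$1
	loop_dev=$(losetup -j "$backing" | cut -d : -f 1)
	if [ -z "$loop_dev" ]; then
		loop_dev=$(losetup -f)
		losetup "$loop_dev" "$backing" || loop_dev=
	fi
	printf '%s' "$loop_dev"
}

# Stub losetup (shell function shadows the binary) so the sketch runs
# unprivileged; it reproduces the failure mode from the log when the
# backing device is absent.
losetup() {
	case $1 in
	-j) : ;;                 # nothing attached yet: print nothing
	-f) echo /dev/loop0 ;;   # pretend loop0 is the next free device
	*)  [ -e "$2" ] || {
		echo "losetup: $2: failed to set up loop device:" \
		     "No such file or directory" >&2
		return 1
	    } ;;
	esac
}

dev=$(find_loop_dev /dev/lvm-Role_MDS/P1 2>/dev/null)
echo "loop_dev='$dev'"
```

Because /dev/lvm-Role_MDS/P1 does not exist on trevis-35vm7, the helper yields an empty string, and the subsequent `mount -t lustre -o loop` fails the same way.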
Looking at the console log for MDS1 (vm8), we see the MDS failover:

[ 266.800298] Lustre: DEBUG MARKER: /usr/sbin/lctl mark == replay-vbr test 5b: link checks version of target parent ========================================== 00:10:22 \(1556089822\)
[ 266.982167] Lustre: DEBUG MARKER: == replay-vbr test 5b: link checks version of target parent ========================================== 00:10:22 (1556089822)
[ 267.088184] Lustre: lustre-MDT0000: Connection restored to 144cd783-70f4-6475-93c8-b9b0a8f6fd6b (at 10.9.5.120@tcp)
[ 267.089958] Lustre: Skipped 6 previous similar messages
[ 267.890140] Lustre: DEBUG MARKER: /usr/sbin/lctl set_param mdd.lustre-MDT0000.sync_permission=0
[ 268.211897] Lustre: DEBUG MARKER: /usr/sbin/lctl set_param mdt.lustre-MDT0000.commit_on_sharing=0
[ 268.641380] Lustre: DEBUG MARKER: sync; sync; sync
[ 269.842265] Lustre: DEBUG MARKER: /usr/sbin/lctl --device lustre-MDT0000 notransno
[ 270.163063] Lustre: DEBUG MARKER: modprobe dm-flakey;
[ 270.163063] dmsetup targets | grep -q flakey
[ 270.487834] Lustre: DEBUG MARKER: dmsetup table /dev/mapper/mds1_flakey
[ 270.820089] Lustre: DEBUG MARKER: dmsetup suspend --nolockfs --noflush /dev/mapper/mds1_flakey
[ 271.139890] Lustre: DEBUG MARKER: dmsetup load /dev/mapper/mds1_flakey --table "0 20971520 flakey 252:0 0 0 1800 1 drop_writes"
[ 271.459528] Lustre: DEBUG MARKER: dmsetup resume /dev/mapper/mds1_flakey
[ 271.812452] Lustre: DEBUG MARKER: /usr/sbin/lctl mark mds1 REPLAY BARRIER on lustre-MDT0000
[ 271.975373] Lustre: DEBUG MARKER: mds1 REPLAY BARRIER on lustre-MDT0000
[ 272.633700] Lustre: DEBUG MARKER: /usr/sbin/lctl dl
[ 272.964835] Lustre: DEBUG MARKER: modprobe dm-flakey;
[ 272.964835] dmsetup targets | grep -q flakey
[ 273.353504] Lustre: DEBUG MARKER: /usr/sbin/lctl dl
<ConMan> Console [trevis-35vm8] disconnected from <trevis-35:6007> at 04-24 07:10.
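The dmsetup suspend/load/resume sequence in the console log is the test framework's replay barrier: the MDT device is remapped through a dm-flakey target with the drop_writes feature, so writes issued after the barrier are silently discarded until the device is later restored. A minimal sketch of the table being loaded, with the field layout taken from the kernel's dm-flakey documentation (`flakey_table` is an illustrative helper; the dmsetup commands themselves need root, so they appear only as comments):

```shell
#!/bin/sh
# Build a dm-flakey table line:
#   <start> <length> flakey <dev> <offset> <up_interval> <down_interval>
#   <num_features> <feature>
# Length, device (major:minor) and intervals here match the log above.
flakey_table() {
	# $1 = length in sectors, $2 = underlying device (major:minor),
	# $3 = offset, $4 = up interval (s), $5 = down interval (s)
	echo "0 $1 flakey $2 $3 $4 $5 1 drop_writes"
}

TABLE=$(flakey_table 20971520 252:0 0 0 1800)
echo "$TABLE"

# With root, the barrier would then be applied as (not run here):
#   dmsetup suspend --nolockfs --noflush /dev/mapper/mds1_flakey
#   dmsetup load /dev/mapper/mds1_flakey --table "$TABLE"
#   dmsetup resume /dev/mapper/mds1_flakey
```

With up_interval=0 and down_interval=1800, the device stays in its "down" (write-dropping) state for the duration of the test, which is what makes the barrier effective.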
Looking at the failover MDS (vm7), we see no indication that the MDS failed over, and we don't see replay-vbr test 5c start on either MDS. There are no other replay-vbr test 5b failures like this in the past four months. |
| Comments |
| Comment by James Nunez (Inactive) [ 15/Oct/19 ] |
|
We see a similar test hang for replay-single test 101 at https://testing.whamcloud.com/test_sets/6c0fb4d6-ea6e-11e9-be86-52540065bddc |
| Comment by Sebastien Buisson [ 15/Nov/19 ] |
|
Possibly a new occurrence via recovery-small test_136: |