Details
- Type: Bug
- Resolution: Unresolved
- Priority: Minor
- Fix Version/s: None
- Affects Version/s: Lustre 2.12.0, Lustre 2.10.6
- Labels: None
- Severity: 3
Description
replay-single test_11 fails with the error message “Restart of mds1 failed!”. Looking at the client test_log at https://testing.whamcloud.com/test_sets/79ecfc38-f0e8-11e8-bfe1-52540065bddc, we see a problem with the failover/mount of mds1:
Failover mds1 to trevis-39vm8
18:18:54 (1543169934) waiting for trevis-39vm8 network 900 secs ...
18:18:54 (1543169934) network interface is UP
CMD: trevis-39vm8 hostname
mount facets: mds1
CMD: trevis-39vm8 test -b /dev/lvm-Role_MDS/P1
CMD: trevis-39vm8 e2label /dev/lvm-Role_MDS/P1
trevis-39vm8: e2label: No such file or directory while trying to open /dev/lvm-Role_MDS/P1
trevis-39vm8: Couldn't find valid filesystem superblock.
Starting mds1: -o loop /dev/lvm-Role_MDS/P1 /mnt/lustre-mds1
CMD: trevis-39vm8 mkdir -p /mnt/lustre-mds1; mount -t lustre -o loop /dev/lvm-Role_MDS/P1 /mnt/lustre-mds1
trevis-39vm8: mount: /dev/lvm-Role_MDS/P1: failed to setup loop device: No such file or directory
Start of /dev/lvm-Role_MDS/P1 on mds1 failed 32
 replay-single test_11: @@@@@@ FAIL: Restart of mds1 failed!
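The e2label and loop-setup errors above indicate that /dev/lvm-Role_MDS/P1 never appeared on the failover node before the mount was attempted. A minimal sketch of the pre-mount check the harness is effectively performing (check_mds_device is a hypothetical helper for illustration, not part of test-framework.sh):

```shell
# Verify the MDS block device exists and carries a filesystem label
# before attempting "mount -t lustre", mirroring the "test -b" and
# "e2label" steps in the test log above.
dev=/dev/lvm-Role_MDS/P1

check_mds_device() {
    local d=$1
    if [ ! -b "$d" ]; then
        # This is the state the log shows: the LV is absent (or not yet
        # activated) on the failover node, so e2label cannot find a superblock.
        echo "missing block device: $d"
        return 1
    fi
    e2label "$d" || return 1
}

check_mds_device "$dev" || echo "device not ready on failover node"
```

If the check fails, the subsequent loop-device mount is guaranteed to fail exactly as seen ("failed to setup loop device: No such file or directory").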
In the console log for the MDS (vm7), we see the node failing:
[ 117.663213] Lustre: DEBUG MARKER: /usr/sbin/lctl mark == replay-single test 11: create open write rename \|X\| create-old-name read ========================== 18:18:38 \(1543169918\)
[ 117.849978] Lustre: DEBUG MARKER: == replay-single test 11: create open write rename |X| create-old-name read ========================== 18:18:38 (1543169918)
[ 118.031444] Lustre: DEBUG MARKER: sync; sync; sync
[ 119.317310] Lustre: DEBUG MARKER: /usr/sbin/lctl --device lustre-MDT0000 notransno
[ 119.653586] Lustre: DEBUG MARKER: /usr/sbin/lctl --device lustre-MDT0000 readonly
[ 119.911790] LustreError: 6330:0:(osd_handler.c:2198:osd_ro()) *** setting lustre-MDT0000 read-only ***
[ 120.034914] Turning device dm-0 (0xfc00000) read-only
[ 120.202030] Lustre: DEBUG MARKER: /usr/sbin/lctl mark mds1 REPLAY BARRIER on lustre-MDT0000
[ 120.367027] Lustre: DEBUG MARKER: mds1 REPLAY BARRIER on lustre-MDT0000
[ 120.608673] Lustre: DEBUG MARKER: /usr/sbin/lctl dl
<ConMan> Console [trevis-39vm7] disconnected from <trevis-39:6006> at 11-25 18:18.
<ConMan> Console [trevis-39vm7] connected to <trevis-39:6006> at 11-25 18:19.
......... ok
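The DEBUG MARKER lines show the replay-barrier sequence that test-framework.sh runs on the MDS before the failover. A rough sketch of the same sequence (replay_barrier_mds1 and the RUN hook are hypothetical names for illustration; set RUN=echo to dry-run without lctl):

```shell
# Replay-barrier sequence as visible in the console log above:
# flush data, stop assigning transaction numbers, set the OSD
# read-only, then leave a marker in the logs.
RUN=${RUN:-}

replay_barrier_mds1() {
    local dev=lustre-MDT0000
    $RUN sync; $RUN sync; $RUN sync                 # flush dirty data first
    $RUN lctl --device "$dev" notransno             # stop new transnos
    $RUN lctl --device "$dev" readonly              # osd_ro(): device read-only
    $RUN lctl mark "mds1 REPLAY BARRIER on $dev"    # marker seen in the log
}
```

After this point the MDT commits nothing further, so everything the clients did since the last commit must be replayed when mds1 comes back, which is what test 11 is exercising.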
In the console log for the failover MDS (vm8), we see the start of replay-single test 10 and then a call trace:
[ 117.876287] Lustre: DEBUG MARKER: /usr/sbin/lctl mark == replay-single test 10: create \|X\| rename unlink =================================================== 18:17:21 \(1543169841\)
[ 118.074071] Lustre: DEBUG MARKER: == replay-single test 10: create |X| rename unlink =================================================== 18:17:21 (1543169841)
[ 118.243565] Lustre: DEBUG MARKER: sync; sync; sync
[ 119.576474] Lustre: DEBUG MARKER: /usr/sbin/lctl --device lustre-MDT0000 notransno
[ 119.905069] Lustre: DEBUG MARKER: /usr/sbin/lctl --device lustre-MDT0000 readonly
[ 120.175396] LustreError: 5633:0:(osd_handler.c:2198:osd_ro()) *** setting lustre-MDT0000 read-only ***
[ 120.348338] Turning device dm-0 (0xfc00000) read-only
[ 120.510660] Lustre: DEBUG MARKER: /usr/sbin/lctl mark mds1 REPLAY BARRIER on lustre-MDT0000
[ 120.672831] Lustre: DEBUG MARKER: mds1 REPLAY BARRIER on lustre-MDT0000
[ 120.914538] Lustre: DEBUG MARKER: /usr/sbin/lctl dl
<ConMan> Console [trevis-39vm8] disconnected from <trevis-39:6007> at 11-25 18:17.
<ConMan> Console [trevis-39vm8] connected to <trevis-39:6007> at 11-25 18:17.
.........
ok … trevis-39vm8 login:
[  130.156081] random: crng init done
[  133.845539] session1: session recovery timed out after 120 secs
[  133.846770] scsi 2:0:0:0: rejecting I/O to offline device
[  133.847790] scsi 2:0:0:0: rejecting I/O to offline device
[  133.849029] scsi 2:0:0:0: rejecting I/O to offline device
[  133.870016] FS-Cache: Loaded
[  133.905941] FS-Cache: Netfs 'nfs' registered for caching
[  133.917634] Key type dns_resolver registered
[  133.946337] NFS: Registering the id_resolver key type
[  133.947290] Key type id_resolver registered
[  133.948061] Key type id_legacy registered
[ 3805.931636] SysRq : Changing Loglevel
[ 3805.932529] Loglevel set to 8
[ 3806.457206] SysRq : Show State
[ 3806.457932]   task                        PC stack   pid father
[ 3806.459017] systemd         S ffff98503c140000     0     1      0 0x00000000
[ 3806.460287] Call Trace:
[ 3806.460809]  [<ffffffffa4167bc9>] schedule+0x29/0x70
[ 3806.461812]  [<ffffffffa4166dfd>] schedule_hrtimeout_range_clock+0x12d/0x150
[ 3806.463126]  [<ffffffffa3c8e869>] ? ep_scan_ready_list.isra.7+0x1b9/0x1f0
[ 3806.464436]  [<ffffffffa4166e33>] schedule_hrtimeout_range+0x13/0x20
[ 3806.465578]  [<ffffffffa3c8eafe>] ep_poll+0x23e/0x360
[ 3806.466496]  [<ffffffffa3c531f1>] ? do_unlinkat+0xf1/0x2d0
[ 3806.467542]  [<ffffffffa3ad67b0>] ? wake_up_state+0x20/0x20
[ 3806.468515]  [<ffffffffa3c8ffcd>] SyS_epoll_wait+0xed/0x120
[ 3806.469550]  [<ffffffffa4174d15>] ? system_call_after_swapgs+0xa2/0x146
[ 3806.470680]  [<ffffffffa4174ddb>] system_call_fastpath+0x22/0x27
[ 3806.471796]  [<ffffffffa4174d21>] ? system_call_after_swapgs+0xae/0x146
…
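The "Changing Loglevel" and "Show State" messages are the kernel's SysRq facility; the task-state dump above was requested that way roughly 3700 seconds after boot, i.e. after the node had sat idle for an hour. A sketch of how such a dump is triggered (sysrq_dump_tasks is a hypothetical helper; the mechanism autotest actually uses to collect it may differ, and on a real node the target is /proc/sysrq-trigger, which requires root):

```shell
# Trigger the same two SysRq actions seen in the console log:
# '8' raises the console loglevel, 't' dumps the state of every task.
sysrq_dump_tasks() {
    local trig=${1:-/proc/sysrq-trigger}
    echo 8 > "$trig"   # "SysRq : Changing Loglevel" / "Loglevel set to 8"
    echo t > "$trig"   # "SysRq : Show State" -> stack trace of every task
}
```

In the dump above, systemd is simply parked in epoll_wait, i.e. the node is idle rather than wedged in Lustre code, which fits the picture of the failover mount never having been attempted successfully.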
Neither of these console logs shows the start of replay-single test 12. When replay-single test 11 fails in this way, we see test 12 hang.
I’ve looked back at results for all branches since April and found only one failure that looks the same as this one. Logs are at https://testing.whamcloud.com/test_sets/efc9036e-a90a-11e8-80f7-52540065bddc .