[LU-11722] replay-single test 11 fails with “Restart of mds1 failed!” Created: 30/Nov/18  Updated: 30/Nov/18

Status: Open
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.12.0, Lustre 2.10.6
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: James Nunez (Inactive) Assignee: WC Triage
Resolution: Unresolved Votes: 0
Labels: None

Severity: 3
Rank (Obsolete): 9223372036854775807

Description

replay-single test_11 fails with the error message “Restart of mds1 failed!”. Looking at the client test_log at https://testing.whamcloud.com/test_sets/79ecfc38-f0e8-11e8-bfe1-52540065bddc, we see a problem with the failover/mount of mds1:

Failover mds1 to trevis-39vm8
18:18:54 (1543169934) waiting for trevis-39vm8 network 900 secs ...
18:18:54 (1543169934) network interface is UP
CMD: trevis-39vm8 hostname
mount facets: mds1
CMD: trevis-39vm8 test -b /dev/lvm-Role_MDS/P1
CMD: trevis-39vm8 e2label /dev/lvm-Role_MDS/P1
trevis-39vm8: e2label: No such file or directory while trying to open /dev/lvm-Role_MDS/P1
trevis-39vm8: Couldn't find valid filesystem superblock.
Starting mds1:   -o loop /dev/lvm-Role_MDS/P1 /mnt/lustre-mds1
CMD: trevis-39vm8 mkdir -p /mnt/lustre-mds1; mount -t lustre   -o loop 		                   /dev/lvm-Role_MDS/P1 /mnt/lustre-mds1
trevis-39vm8: mount: /dev/lvm-Role_MDS/P1: failed to setup loop device: No such file or directory
Start of /dev/lvm-Role_MDS/P1 on mds1 failed 32
 replay-single test_11: @@@@@@ FAIL: Restart of mds1 failed! 
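The mount fails because the MDT device /dev/lvm-Role_MDS/P1 is not visible on the failover node: the framework's "test -b" check fails, so it falls back to a loopback mount (the "-o loop" in the "Starting mds1" line), and that fails too because the path does not exist. Below is a minimal sketch of the checks one could run on the failover node to confirm this, with host and device names taken from the log above (the vgchange step is only an assumption about how the shared volume group would normally be re-activated):

# Hedged sketch: is the MDT device present/activated on the failover node?
ssh trevis-39vm8 'lsblk; lvscan'
ssh trevis-39vm8 'test -b /dev/lvm-Role_MDS/P1 && echo present || echo missing'

# Assumption: if the LV exists but is simply not active after the failover,
# re-activating the volume group should let e2label and the mount succeed:
ssh trevis-39vm8 'vgchange -ay lvm-Role_MDS && e2label /dev/lvm-Role_MDS/P1'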

In the console log for the MDS (vm7), we see the node being failed over:

[  117.663213] Lustre: DEBUG MARKER: /usr/sbin/lctl mark == replay-single test 11: create open write rename \|X\| create-old-name read ========================== 18:18:38 \(1543169918\)
[  117.849978] Lustre: DEBUG MARKER: == replay-single test 11: create open write rename |X| create-old-name read ========================== 18:18:38 (1543169918)
[  118.031444] Lustre: DEBUG MARKER: sync; sync; sync
[  119.317310] Lustre: DEBUG MARKER: /usr/sbin/lctl --device lustre-MDT0000 notransno
[  119.653586] Lustre: DEBUG MARKER: /usr/sbin/lctl --device lustre-MDT0000 readonly
[  119.911790] LustreError: 6330:0:(osd_handler.c:2198:osd_ro()) *** setting lustre-MDT0000 read-only ***
[  120.034914] Turning device dm-0 (0xfc00000) read-only
[  120.202030] Lustre: DEBUG MARKER: /usr/sbin/lctl mark mds1 REPLAY BARRIER on lustre-MDT0000
[  120.367027] Lustre: DEBUG MARKER: mds1 REPLAY BARRIER on lustre-MDT0000
[  120.608673] Lustre: DEBUG MARKER: /usr/sbin/lctl dl

<ConMan> Console [trevis-39vm7] disconnected from <trevis-39:6006> at 11-25 18:18.

<ConMan> Console [trevis-39vm7] connected to <trevis-39:6006> at 11-25 18:19.
......... ok
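For reference, the DEBUG MARKER lines above correspond to the test's replay-barrier step before the failover. A rough sketch of what is being run on the MDS, with the commands taken from the markers in the log (the ssh calls stand in for the test framework's remote-execution helper):

# Hedged sketch of the replay barrier seen in the markers above.
ssh trevis-39vm7 'sync; sync; sync'                                   # flush dirty data
ssh trevis-39vm7 'lctl --device lustre-MDT0000 notransno'             # stop assigning new transaction numbers
ssh trevis-39vm7 'lctl --device lustre-MDT0000 readonly'              # set the backing device read-only
ssh trevis-39vm7 'lctl mark "mds1 REPLAY BARRIER on lustre-MDT0000"'
# After the barrier, the test fails mds1 over to trevis-39vm8 and expects the
# client to replay its uncommitted operations once the MDT restarts there.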

In the console log for the failover MDS (vm8), we see the start of replay-single test 10 and then a call trace:

[  117.876287] Lustre: DEBUG MARKER: /usr/sbin/lctl mark == replay-single test 10: create \|X\| rename unlink =================================================== 18:17:21 \(1543169841\)
[  118.074071] Lustre: DEBUG MARKER: == replay-single test 10: create |X| rename unlink =================================================== 18:17:21 (1543169841)
[  118.243565] Lustre: DEBUG MARKER: sync; sync; sync
[  119.576474] Lustre: DEBUG MARKER: /usr/sbin/lctl --device lustre-MDT0000 notransno
[  119.905069] Lustre: DEBUG MARKER: /usr/sbin/lctl --device lustre-MDT0000 readonly
[  120.175396] LustreError: 5633:0:(osd_handler.c:2198:osd_ro()) *** setting lustre-MDT0000 read-only ***
[  120.348338] Turning device dm-0 (0xfc00000) read-only
[  120.510660] Lustre: DEBUG MARKER: /usr/sbin/lctl mark mds1 REPLAY BARRIER on lustre-MDT0000
[  120.672831] Lustre: DEBUG MARKER: mds1 REPLAY BARRIER on lustre-MDT0000
[  120.914538] Lustre: DEBUG MARKER: /usr/sbin/lctl dl

<ConMan> Console [trevis-39vm8] disconnected from <trevis-39:6007> at 11-25 18:17.

<ConMan> Console [trevis-39vm8] connected to <trevis-39:6007> at 11-25 18:17.
......... ok
…
trevis-39vm8 login: [  130.156081] random: crng init done
[  133.845539]  session1: session recovery timed out after 120 secs
[  133.846770] scsi 2:0:0:0: rejecting I/O to offline device
[  133.847790] scsi 2:0:0:0: rejecting I/O to offline device
[  133.849029] scsi 2:0:0:0: rejecting I/O to offline device
[  133.870016] FS-Cache: Loaded
[  133.905941] FS-Cache: Netfs 'nfs' registered for caching
[  133.917634] Key type dns_resolver registered
[  133.946337] NFS: Registering the id_resolver key type
[  133.947290] Key type id_resolver registered
[  133.948061] Key type id_legacy registered
[ 3805.931636] SysRq : Changing Loglevel
[ 3805.932529] Loglevel set to 8
[ 3806.457206] SysRq : Show State
[ 3806.457932]   task                        PC stack   pid father
[ 3806.459017] systemd         S ffff98503c140000     0     1      0 0x00000000
[ 3806.460287] Call Trace:
[ 3806.460809]  [<ffffffffa4167bc9>] schedule+0x29/0x70
[ 3806.461812]  [<ffffffffa4166dfd>] schedule_hrtimeout_range_clock+0x12d/0x150
[ 3806.463126]  [<ffffffffa3c8e869>] ? ep_scan_ready_list.isra.7+0x1b9/0x1f0
[ 3806.464436]  [<ffffffffa4166e33>] schedule_hrtimeout_range+0x13/0x20
[ 3806.465578]  [<ffffffffa3c8eafe>] ep_poll+0x23e/0x360
[ 3806.466496]  [<ffffffffa3c531f1>] ? do_unlinkat+0xf1/0x2d0
[ 3806.467542]  [<ffffffffa3ad67b0>] ? wake_up_state+0x20/0x20
[ 3806.468515]  [<ffffffffa3c8ffcd>] SyS_epoll_wait+0xed/0x120
[ 3806.469550]  [<ffffffffa4174d15>] ? system_call_after_swapgs+0xa2/0x146
[ 3806.470680]  [<ffffffffa4174ddb>] system_call_fastpath+0x22/0x27
[ 3806.471796]  [<ffffffffa4174d21>] ? system_call_after_swapgs+0xae/0x146
…

Neither of these console logs has the start of replay-single test 12.

When replay-single test 11 fails in this way, we see test 12 hang.
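The long "Show State" dump in the vm8 console is the standard SysRq task-state dump, presumably triggered because the node looked hung. If test 12 hangs like this again, the same information can be collected manually on the stuck node:

# Hedged sketch: trigger the same SysRq output seen in the vm8 console log.
echo 8 > /proc/sysrq-trigger    # "Changing Loglevel" / "Loglevel set to 8"
echo t > /proc/sysrq-trigger    # "Show State": dump every task's stack
# The output goes to the console/dmesg, so it ends up in the ConMan console log.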

I’ve looked back at the results for all branches since April and found only one failure that looks the same as this one. Logs are at https://testing.whamcloud.com/test_sets/efc9036e-a90a-11e8-80f7-52540065bddc.
