
Recovery already passed deadline with DNE

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Critical
    • Fix Version/s: Lustre 2.10.0
    • None
    • Environment: lustre-2.8.0_3.chaos-1.ch6.x86_64
      16 MDTs
    • Severity: 3

    Description

      MDT[0-1,6-16] (decimal) have timed out of recovery; approximately 1473 clients recovered, 1 evicted.
      MDT[2-5] reach the timeout, and report in the log that recovery has hung and should be aborted. After lctl abort_recovery, the nodes begin emitting large numbers of errors in the console log. The nodes are up, but mrsh into them hangs, as if they are too busy to service the mrsh session.

      2016-10-15 15:49:40 [ 1088.878945] Lustre: lsh-MDT0002: Recovery already passed deadline 0:32, It is most likely due to DNE recovery is failed or stuck, please wait a few more minutes or abort the recovery.
      2016-10-15 15:49:40 [ 1088.899333] Lustre: Skipped 157 previous similar messages
      2016-10-15 15:50:12 [ 1121.013380] Lustre: lsh-MDT0002: Recovery already passed deadline 1:04, It is most likely due to DNE recovery is failed or stuck, please wait a few more minutes or abort the recovery.
      2016-10-15 15:50:12 [ 1121.033744] Lustre: Skipped 735 previous similar messages
      
      <ConMan> Console [zinc3] departed by <root@localhost> on pts/0 at 10-15 15:50.
      2016-10-15 15:50:52 [ 1161.329645] LustreError: 38991:0:(mdt_handler.c:5737:mdt_iocontrol()) lsh-MDT0002: Aborting recovery for device
      2016-10-15 15:50:52 [ 1161.341983] LustreError: 38991:0:(ldlm_lib.c:2565:target_stop_recovery_thread()) lsh-MDT0002: Aborting recovery
      2016-10-15 15:50:52 [ 1161.343686] LustreError: 18435:0:(lod_dev.c:419:lod_sub_recovery_thread()) lsh-MDT0004-osp-MDT0002 getting update log failed: rc = -108
      2016-10-15 15:50:52 [ 1161.377751] Lustre: 18461:0:(ldlm_lib.c:2014:target_recovery_overseer()) recovery is aborted, evict exports in recovery
      

      The earliest such messages are:

      2016-10-15 15:50:52 [ 1161.390842] LustreError: 18461:0:(update_records.c:72:update_records_dump()) master transno = 4295056926 batchid = 35538 flags = 0 ops = 42 params = 32
      2016-10-15 15:50:52 [ 1161.408040] LustreError: 18461:0:(update_records.c:72:update_records_dump()) master transno = 4295056931 batchid = 35542 flags = 0 ops = 42 params = 32
      

      The last few are:

      2016-10-15 15:52:11 [ 1240.343780] LustreError: 18461:0:(update_records.c:72:update_records_dump()) master transno = 4295064355 batchid = 39987 flags = 0 ops = 42 params = 32
      2016-10-15 15:52:11 [ 1240.361375] LustreError: 18461:0:(update_records.c:72:update_records_dump()) master transno = 4295064356 batchid = 39999 flags = 0 ops = 42 params = 32
      2016-10-15 15:52:11 [ 1240.378995] LustreError: 18461:0:(update_records.c:72:update_records_dump()) master transno = 4295064357 batchid = 40018 flags = 0 ops = 42 params = 32
      2016-10-15 15:52:11 [ 1240.396579] LustreError: 18461:0:(update_records.c:72:update_records_dump()) master transno = 4295064358 batchid = 40011 flags = 0 ops = 42 params = 32
      2016-10-15 15:52:11 [ 1240.414180] LustreError: 18461:0:(update_records.c:72:update_records_dump()) master transno = 4295064360 batchid = 40005 flags = 0 ops = 42 params = 32
      

      We have seen this type of behavior on multiple DNE filesystems. Also, is there any way to determine whether these errors have been corrected, abandoned, etc.?
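
      For reference, a minimal sketch of the commands used above to watch and, if necessary, abort recovery on an MDS; the target name is only an example taken from the logs in this ticket, and parameter names may vary by Lustre version:

      # Show recovery state for every MDT mounted on this MDS.
      lctl get_param mdt.*.recovery_status

      # Watch only the interesting fields until recovery completes or stalls.
      watch -n 30 "lctl get_param mdt.*.recovery_status | egrep 'status|completed_clients|evicted_clients'"

      # If a target is confirmed stuck (as in the console messages above),
      # abort its recovery explicitly:
      lctl --device lsh-MDT0002 abort_recovery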

      Attachments

        1. 0x48000a04b-0x1-0x0.tgz
          106 kB
          Olaf Faaland
        2. console_logs.nov28.tgz
          18 kB
          Olaf Faaland
        3. console.jet11.2016-12-13-14-47
          14 kB
          Olaf Faaland
        4. console.jet7.gz
          1.12 MB
          Olaf Faaland
        5. console.since-dec13.tgz
          1.71 MB
          Olaf Faaland
        6. console.zinc11.2016-12-19
          169 kB
          Olaf Faaland
        7. dk.jet1.1478223101.gz
          596 kB
          Olaf Faaland
        8. dk.jet1.1478565846.gz
          681 kB
          Olaf Faaland
        9. dk.recovery_stuck.jet7.1477593159.gz
          53 kB
          Olaf Faaland
        10. dk.recovery_stuck.jet7.1477593344.gz
          7 kB
          Olaf Faaland
        11. dk.zinc1.1480375634.gz
          12.12 MB
          Olaf Faaland
        12. dk.zinc13.1480375634.gz
          13.32 MB
          Olaf Faaland
        13. dk.zinc7.1480375634.gz
          13.52 MB
          Olaf Faaland
        14. logs.2016-11-14.tgz
          12.23 MB
          Olaf Faaland
        15. lsh-mdt000c-1b70.nov28.tgz
          6.81 MB
          Olaf Faaland
        16. lustre.log.gz
          4.33 MB
          Olaf Faaland
        17. mdt09.0x240019a58_0x6_0x0.tgz
          12.23 MB
          Giuseppe Di Natale
        18. mdt0b.0x240019a58_0x6_0x0.tgz
          12.23 MB
          Olaf Faaland
        19. target_to_node_map.nov28.txt
          0.3 kB
          Olaf Faaland


          Activity

            [LU-8753] Recovery already passed deadline with DNE
            pjones Peter Jones added a comment -

            As per Di, the remaining patch was a debug-only patch. This issue is fixed for 2.10 and the patches will be backported to maintenance releases.


            gerrit Gerrit Updater added a comment -

            Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/24364/
            Subject: LU-8753 osp: add rpc generation
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 0844905a308d614c86b56df70c8f03e5d59ee286


            gerrit Gerrit Updater added a comment -

            Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/24008/
            Subject: LU-8753 llog: remove lgh_write_offset
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: f36daac69fe6e0cd35e2369967f4bae11bd2666f
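
            For anyone checking whether a local build already carries both fixes, a small sketch (assuming a clone of fs/lustre-release) that tests whether the merged commits are ancestors of the current branch:

            # Check that the two LU-8753 commits are present in this checkout.
            for c in 0844905a308d614c86b56df70c8f03e5d59ee286 \
                     f36daac69fe6e0cd35e2369967f4bae11bd2666f; do
                if git merge-base --is-ancestor "$c" HEAD 2>/dev/null; then
                    echo "present: $c"
                else
                    echo "missing: $c"
                fi
            done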


            di.wang Di Wang (Inactive) added a comment -

            Hi, Olaf

            So the current corruption happened when MDT0006 retrieved the update log from MDT000a (from console.since-dec13):

            2016-12-13 13:49:38 [336387.573280] Lustre: 86734:0:(llog.c:529:llog_process_thread()) invalid length 0 in llog record for index 0/80
            2016-12-13 13:49:38 [336387.585565] LustreError: 86734:0:(lod_dev.c:419:lod_sub_recovery_thread()) lquake-MDT000a-osp-MDT0006 getting update log failed: rc = -22
            

            Do you still have the console log on MDT0006 for the last run? I want to check whether this corrupt log was hit during the last recovery. Thanks.
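
            In case it helps, a small sketch of the kind of search that can answer this, assuming the MDT0006 console logs are available as plain files (the path is a placeholder); the two patterns are taken verbatim from the messages quoted above:

            # Look for earlier hits of the corrupted update log on MDT0006.
            grep -H "invalid length 0 in llog record" /path/to/console.mdt0006*
            grep -H "lquake-MDT000a-osp-MDT0006 getting update log failed" /path/to/console.mdt0006*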


            di.wang Di Wang (Inactive) added a comment -

            According to the console log, it looks like recovery was stuck but then succeeded, as you said. This is good but strange:

            2016-12-19 18:09:35 [ 9014.403506] Lustre: lquake-MDT000a: Recovery already passed deadline 12:36. It is due to DNE recovery failed/stuck on the 1 MDT(s): 000a. Please wait until all MDTs recovered or abort the recovery by force.
            2016-12-19 18:09:35 [ 9014.426106] Lustre: Skipped 60 previous similar messages
            2016-12-19 18:10:40 [ 9079.948790] Lustre: lquake-MDT000a: Recovery already passed deadline 11:30. It is due to DNE recovery failed/stuck on the 1 MDT(s): 000a. Please wait until all MDTs recovered or abort the recovery by force.
            2016-12-19 18:10:40 [ 9079.971290] Lustre: Skipped 63 previous similar messages
            2016-12-19 18:11:05 [ 9104.433087] Lustre: lquake-MDT000a: Connection restored to 172.19.1.111@o2ib100 (at 172.19.1.111@o2ib100)
            2016-12-19 18:11:05 [ 9104.444864] Lustre: Skipped 130 previous similar messages
            2016-12-19 18:12:35 [ 9194.452689] Lustre: lquake-MDT000a: Recovery already passed deadline 9:35. If you do not want to wait more, please abort the recovery by force.
            2016-12-19 18:12:36 [ 9195.708675] Lustre: lquake-MDT000a: Recovery already passed deadline 9:34. If you do not want to wait more, please abort the recovery by force.
            2016-12-19 18:12:36 [ 9195.725009] Lustre: Skipped 7 previous similar messages
            2016-12-19 18:12:37 [ 9196.061115] Lustre: lquake-MDT000a: Recovery over after 5:26, of 70 clients 70 recovered and 0 were evicted.
            

            Please try 24364 + 24008; I hope this resolves all of these corrupt update log issues, and that all of these recovery troubles will go away. Thanks.
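
            A sketch of one way the two changes can be pulled from Gerrit for testing; the patch-set numbers are not listed in this ticket, so they are left as variables to be filled in from the review pages, and the fetch URL form is an assumption:

            # Fetch and cherry-pick the two proposed changes into a lustre-release clone.
            PS_24364=${PS_24364:?set to the current patch set of change 24364}
            PS_24008=${PS_24008:?set to the current patch set of change 24008}
            cd lustre-release || exit 1
            git fetch https://review.whamcloud.com/fs/lustre-release "refs/changes/64/24364/${PS_24364}" \
                && git cherry-pick FETCH_HEAD
            git fetch https://review.whamcloud.com/fs/lustre-release "refs/changes/08/24008/${PS_24008}" \
                && git cherry-pick FETCH_HEAD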


            ofaaland Olaf Faaland added a comment -

            > Hmm, there is already a corrupted update log. Ah, did you reformat the FS? If not, then I would expect recovery to get stuck. Do you have the console log on MDT000a? Thanks.

            This FS (lquake) was last reformatted November 21. I expected recovery to get stuck as well, given the presence of the corrupted update log. I don't know why it did not. I've attached the console log; see console.zinc11.2016-12-19.


            di.wang Di Wang (Inactive) added a comment -

            Olaf: I just added another fix to this ticket, https://review.whamcloud.com/24364 ; hopefully this can resolve the update log corruption issue. Please try it. Thanks.


            di.wang Di Wang (Inactive) added a comment -

            Hmm, there is already a corrupted update log. Ah, did you reformat the FS? If not, then I would expect recovery to get stuck. Do you have the console log on MDT000a? Thanks.

            ofaaland Olaf Faaland added a comment -

            Di,

            I bounced servers or stopped and restarted Lustre several times today, in varying combinations. MDT000a still seems to take longer to connect than the others; several times the recovery_status procfile showed MDTs waiting for MDT000a. However, within a few minutes it seems to complete recovery successfully.

            I have the debug logs from one of these attempts, and can upload them if that's helpful. Let me know.
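
            For reference, a minimal sketch of how such debug logs can be captured on a server (matching the dk.*.gz attachments on this ticket); the debug buffer size is only an example:

            # Enlarge the debug buffer, then dump and compress it on this node.
            lctl set_param debug_mb=512
            out=/tmp/dk.$(hostname -s).$(date +%s)
            lctl dk "$out"      # dumps and clears the kernel debug buffer
            gzip "$out" && echo "wrote ${out}.gz"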


            di.wang Di Wang (Inactive) added a comment -

            Hi, Olaf

            Thanks for uploading the file. It looks like some update log records were not written, which leaves a hole there:

            master transno = 60130033439 batchid = 55834779575 flags = 0 ops = 172 params = 84 rec_len 8520
            Bit 78 of 4 not set offset 491520
            offset 500040 index 79 type 106a0000
            master transno = 60130041273 batchid = 55834779576 flags = 0 ops = 88 params = 21 rec_len 3048
            rec #79 type=106a0000 len=3048 offset 500040, total 1
            offset 503088 index 0 type 0
            master transno = 0 batchid = 0 flags = 0 ops = 0 params = 0 rec_len 0
            off 503088 skip 8520 to next chunk. test_bit yes      ---->>> skip 8520 bytes, then it is valid again.
            offset 511608 index 81 type 106a0000
            master transno = 60130043788 batchid = 55834779578 flags = 0 ops = 18 params = 3 rec_len 728
            Bit 81 of 4 not set offset 511608
            offset 512336 index 82 type 106a0000
            master transno = 60130043789 batchid = 55834779579 flags = 0 ops = 18 params = 3 rec_len 728
            Bit 82 of 4 not set offset 512336
            offset 513064 index 83 type 106a0000
            master transno = 60130043798 batchid = 55834779580 flags = 0 ops = 88 params = 21 rec_len 2968
            Bit 83 of 4 not set offset 513064
            offset 516032 index 84 type 106a0000
            

            So it looks like an update record (rec_len = 8520) was being written, but the following write was not cancelled, so it caused a hole in the update log, which caused the issue.
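
            To make such holes easier to spot, a small sketch that scans a saved dump in the format pasted above for empty records and skipped chunks; it assumes exactly that text layout:

            # Flag holes in an update_records dump saved as a text file.
            dump=update_records.dump   # placeholder path for the saved dump
            awk '
                /rec_len 0($| )/        { print "empty record near: " prev }
                /skip .* to next chunk/ { print "hole: " $0 }
                { prev = $0 }
            ' "$dump"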


            People

              Assignee: laisiyao Lai Siyao
              Reporter: dinatale2 Giuseppe Di Natale (Inactive)
              Votes: 0
              Watchers: 11

              Dates

                Created:
                Updated:
                Resolved: