Lustre / LU-4290

osp_sync_threads encounters EIO on mount

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Critical
    • Fix Version/s: Lustre 2.6.0
    • Affects Version/s: Lustre 2.4.1
    • None
    • Environment: RHEL 6.4/distro IB
    • 2
    • 11773

    Description

      We encountered this assertion in production; libcfs_panic_on_lbug was set to 1, so the server rebooted. On mount, the same assertion and LBUG would occur. The filesystem will mount with panic_on_lbug set to 0. We've captured a crash dump and Lustre log messages with the following debug flags set:

      [root@atlas-mds3 ~]# cat /proc/sys/lnet/debug
      trace ioctl neterror warning other error emerg ha config console

      Ran e2fsck:
      e2fsck -f -j /dev/mapper/atlas2-mdt1-journal /dev/mapper/atlas2-mdt1

      and only fixed the quota inconsistencies it found.

      At the moment, we are back in production after the osp_sync_thread LBUGs on mount. There are hung-task messages about osp_sync_threads, as would be expected. We want to fix the root issue that is causing the assertions.
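
      For reference, a minimal sketch of toggling the panic-on-LBUG setting mentioned above, assuming the usual libcfs proc entry and module parameter on this release (file name under modprobe.d is illustrative):

      # Disable panic-on-LBUG at runtime so the filesystem can be mounted for debugging.
      echo 0 > /proc/sys/lnet/panic_on_lbug

      # Or make it persistent across reboots via the libcfs module option
      # (the modprobe.d file name here is only an example).
      echo 'options libcfs libcfs_panic_on_lbug=0' >> /etc/modprobe.d/lustre.conf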

      Kernel messages during one of the failed mounts:
      Nov 21 21:16:44 atlas-mds3 kernel: [ 911.319839] LDISKFS-fs (dm-2): mounted filesystem with ordered data mode. quota=on. Opts:
      Nov 21 21:16:44 atlas-mds3 kernel: [ 911.986208] Lustre: mdt_num_threads module parameter is deprecated, use mds_num_threads instead or unset both for dynamic thread startup
      Nov 21 21:16:46 atlas-mds3 kernel: [ 913.069371] Lustre: atlas2-MDT0000: used disk, loading
      Nov 21 21:16:47 atlas-mds3 kernel: [ 914.261572] LustreError: 18945:0:(osp_sync.c:862:osp_sync_thread()) ASSERTION( rc == 0 || rc == LLOG_PROC_BREAK ) failed: 0 changes, 0 in progress, 0 in flight: -5
      Nov 21 21:16:47 atlas-mds3 kernel: [ 914.278318] LustreError: 18945:0:(osp_sync.c:862:osp_sync_thread()) LBUG
      Nov 21 21:16:47 atlas-mds3 kernel: [ 914.286036] Pid: 18945, comm: osp-syn-256
      Nov 21 21:16:47 atlas-mds3 kernel: [ 914.290841]
      Nov 21 21:16:47 atlas-mds3 kernel: [ 914.290844] Call Trace:

      We also see this message:
      Nov 21 23:01:01 atlas-mds3 kernel: [ 1512.633528] ERST: NVRAM ERST Log Address Range is not implemented yet

      Attachments

        1. 6305.llog_out
          73 kB
        2. 6306.llog_out
          73 kB
        3. 6307.llog_out
          73 kB
        4. 6308.llog_out
          73 kB
        5. lustre-log.1385095225.19969.gz
          38 kB
        6. lustre-log.1385095225.19971.gz
          7 kB
        7. lustre-log.1385095225.19973.gz
          2.88 MB
        8. lustre-log.1385095225.19975.gz
          5 kB

        Activity


          bzzz Alex Zhuravlev added a comment -

          Looks so. I suggest renaming them using a direct ldiskfs mount and restarting the MDS.
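
          A minimal sketch of that procedure, assuming the MDT device path from the description, a hypothetical /mnt/mdt mount point, and the four llog paths identified below; the .B suffix follows the rename suggestion later in this ticket:

          # Stop the MDS target first, then mount the MDT device directly as ldiskfs.
          mount -t ldiskfs /dev/mapper/atlas2-mdt1 /mnt/mdt

          # Rename the suspect llog files instead of deleting them, so they can be
          # restored later if needed.
          mv /mnt/mdt/O/1/d1/6305 /mnt/mdt/O/1/d1/6305.B
          mv /mnt/mdt/O/1/d2/6306 /mnt/mdt/O/1/d2/6306.B
          mv /mnt/mdt/O/1/d3/6307 /mnt/mdt/O/1/d3/6307.B
          mv /mnt/mdt/O/1/d4/6308 /mnt/mdt/O/1/d4/6308.B

          # Unmount and restart the MDS / remount the MDT as lustre.
          umount /mnt/mdt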

          blakecaldwell Blake Caldwell added a comment -

          If someone could please confirm whether the attached llogs contain the expected information, we will proceed with renaming them. Until then we will operate in this non-optimal state. Thanks.

          blakecaldwell Blake Caldwell added a comment -

          Alex,

          We pulled the files below from debugfs and put them through llog_reader. The output of each (0x18a4, 0x18a3, 0x18a2, 0x18a1) is attached. We compared it to another log file that was processed successfully, which was empty.

          We will wait to hear back if these look good and will make a copy and remove.
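
          For reference, a minimal sketch of that extraction for one of the files, assuming the MDT device from the description and hypothetical temporary paths; the same steps apply to 6305, 6306 and 6308:

          # Copy the plain llog object out of the MDT with debugfs (read-only),
          # then decode its records with llog_reader.
          debugfs -c -R 'dump /O/1/d3/6307 /tmp/6307' /dev/mapper/atlas2-mdt1
          llog_reader /tmp/6307 > 6307.llog_out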

          bzzz Alex Zhuravlev added a comment -

          In 0x18a3:1:0, the 1 is the sequence, so /O/1 is the hierarchy storing all the objects from sequence 1.

          (gdb) p 0x18a3
          $1 = 6307
          (gdb) p 0x18a3 & 31
          $1 = 3

          /O/1/d3/6307

          I'd suggest renaming the files to something like 6307.B rather than just removing them, so they can be recovered if needed.
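
          A small shell sketch of that mapping for all four logs, derived from the objid and objid & 31 calculation above and assuming the same 32-way /O/<sequence>/d<N> hash directory layout:

          # Map each plain llog id (0x18a1..0x18a4, sequence 1) to its path on the MDT:
          # objid in decimal, and objid & 31 selects the /O/1/d<N> hash directory.
          for id in 0x18a1 0x18a2 0x18a3 0x18a4; do
              objid=$(( id ))
              printf '/O/1/d%d/%d\n' $(( objid & 31 )) "$objid"
          done
          # prints /O/1/d1/6305, /O/1/d2/6306, /O/1/d3/6307, /O/1/d4/6308
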
          dillowda David Dillow added a comment -

          Here are the logs being processed by the four threads that LBUG'd:
          00000040:00080000:15.0:1385095225.247128:0:19969:0:(llog_cat.c:558:llog_cat_process_cb()) processing log 0x18a1:1:0 at index 6 of catalog 0x200:1
          00000040:00080000:15.0:1385095225.253880:0:19971:0:(llog_cat.c:558:llog_cat_process_cb()) processing log 0x18a2:1:0 at index 6 of catalog 0x202:1
          00000040:00080000:15.0:1385095225.257044:0:19973:0:(llog_cat.c:558:llog_cat_process_cb()) processing log 0x18a3:1:0 at index 6 of catalog 0x204:1
          00000040:00080000:15.0:1385095225.261700:0:19975:0:(llog_cat.c:558:llog_cat_process_cb()) processing log 0x18a4:1:0 at index 6 of catalog 0x206:1

          dillowda David Dillow added a comment -

          Alex, can you comment on the mapping from 0x18a3:1:0 to /O/1/d3/6307 so we can replicate it for the other logs?

          dillowda David Dillow added a comment -

          We'll need to find four files IIRC, since we had four separate OST processes LBUG, correct?

          Blake, we should see how many of the llog_cat.c:558 messages we can find and correlate with the 4 threads that died; we may have to iterate a few times to get them all.
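
          A minimal way to pull those messages out of the attached lustre-log dumps, assuming lctl debug_file is available on the MDS (file names taken from the attachments above, output paths hypothetical):

          # Convert one of the binary Lustre debug dumps to text, then pick out the
          # llog_cat_process_cb lines for the threads that hit the assertion.
          gunzip lustre-log.1385095225.19973.gz
          lctl debug_file lustre-log.1385095225.19973 /tmp/19973.txt
          grep 'llog_cat_process_cb()) processing log' /tmp/19973.txt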

          bzzz Alex Zhuravlev added a comment -

          The responsibility of osp_sync_thread is to destroy OST objects once the corresponding files are unlinked. Technically it should be OK to run the MDS with an OSP thread stuck, but I'd rather remove the file manually...

          00000040:00080000:15.0:1385095225.257044:0:19973:0:(llog_cat.c:558:llog_cat_process_cb()) processing log 0x18a3:1:0 at index 6 of catalog 0x204:1

          So it should be /O/1/d3/6307. I'd verify this with the llog_reader utility, and the file's size should be a bit more than 227712. It would be great if you could attach the output of llog_reader here. Once the file is confirmed, you can remove it manually and start the MDS.
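
          A quick way to check that file's size without mounting the MDT as ldiskfs, assuming the device path from the description (debugfs here is read-only):

          # Show the inode details, including size in bytes, for the suspect llog file.
          debugfs -c -R 'stat /O/1/d3/6307' /dev/mapper/atlas2-mdt1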

          blakecaldwell Blake Caldwell added a comment -

          We will have an opportunity next Tuesday if more work is needed (truncate the llog?). When running with hung osp_sync_thread tasks, are we turning away client I/O for that OST?

          blakecaldwell Blake Caldwell added a comment -

          Which file should we delete for the OST llog?

          bzzz Alex Zhuravlev added a comment -

          I can't say for sure, but the llog code detected missing records in the file, which might be a result of I/O issues.
          The simplest solution to the issue is to remove that file, at the cost of orphaned objects on the OST (so, lost disk space) until an lfsck run.
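
          For the later cleanup of those orphaned OST objects, a sketch of starting LFSCK on the MDT, assuming a release with layout LFSCK support (the site is on 2.4.1, where this may not yet be available; the target name atlas2-MDT0000 is taken from the mount messages above):

          # Start a layout LFSCK pass on the MDT to reclaim orphaned OST objects.
          lctl lfsck_start -M atlas2-MDT0000 -t layout

          # Check progress.
          lctl get_param -n mdd.atlas2-MDT0000.lfsck_layout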

          People

            Assignee: bzzz Alex Zhuravlev
            Reporter: blakecaldwell Blake Caldwell
            Votes: 0
            Watchers: 9

            Dates

              Created:
              Updated:
              Resolved: