
LU-5805: tgt_recov blocked and "waking for gap in transno"


Details

    • Type: Bug
    • Resolution: Cannot Reproduce
    • Priority: Critical
    • Fix Version/s: None
    • Affects Version/s: Lustre 2.5.3
    • Severity: 3
    • 16286

    Description

      We are testing our 2.5.3-based branch using osd-zfs. The client, LNet router, and server nodes all had version 2.5.3-1chaos installed (see github.com/chaos/lustre).

      On the recommendation from LU-5803, I made a test build of Lustre consisting of 2.5.3-1chaos plus http://review.whamcloud.com/12365.

      I installed this on the servers only. At the time, we had the SWL IO test running (a mixture of ior, mdtest, simul, etc., all running at the same time).

      I then rebooted just the servers onto the test build. The OSS nodes show lots of startup error messages, many of which we didn't see without this new patch. Granted, it was just one run.

      See attached file named simply "log". This is the console log from one of the OSS nodes.

      Here's my initial view of what is going on:

      The OSS nodes boot significantly faster than the MGS/MDS node. We have retry set to 32. I suspect that this noise is related to the MGS not yet having started:

      2014-10-24 14:21:38 LustreError: 7421:0:(client.c:1083:ptlrpc_import_delay_req()) @@@ send limit expired   req@ffff880fef274800 x1482881169883144/t0(0) o253->MGC10.1.1.169@o2ib9@10.1.1.169@o2ib9:26/25 lens 4768/4768 e 0 to 0 dl 0 ref 2 fl Rpc:W/0/ffffffff rc 0/-1
      2014-10-24 14:21:38 LustreError: 7421:0:(obd_mount_server.c:1120:server_register_target()) lcy-OST0001: error registering with the MGS: rc = -5 (not fatal)
      
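      (Context on the retry count mentioned above: I believe this is the mount.lustre "retry=" option, which controls how many times the mount is retried before giving up. A rough sketch of the kind of server mount entry involved; the pool/dataset name and mount point below are made up for illustration:)

      # hypothetical /etc/fstab entry for one osd-zfs OST target
      lcy-ost1/ost0001   /mnt/lustre/lcy-OST0001   lustre   retry=32   0 0

      # equivalent manual mount
      mount -t lustre -o retry=32 lcy-ost1/ost0001 /mnt/lustre/lcy-OST0001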

      The MGS/MDS node doesn't start mounting the MGS and MDS devices until 14:25:47 and 14:25:47, respectively.

      The MDS enters recovery at this time:

      2014-10-24 14:26:23 zwicky-lcy-mds1 login: Lustre: lcy-MDT0000: Will be in recovery for at least 5:00, or until 134 clients reconnect.
      
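      (Aside: the quickest way to watch recovery progress on the servers is the recovery_status parameter; the exact parameter paths below are from memory, so treat them as approximate:)

      # on the MDS
      lctl get_param mdt.*.recovery_status
      # on an OSS (OST recovery state is exported under obdfilter)
      lctl get_param obdfilter.*.recovery_status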

      So there are at least 4 problems here (each can be tallied in the attached log with the greps sketched after this list). We may need to split them up into separate subtickets:

      1. OSS noise before MGS/MDS has started
      2. tgt_recov "blocked for more than 102 seconds" (some of the 16 OSS nodes did this)
      3. "waking for gap in transno"; the MDS and some of the OSS nodes show a swath of these
      4. Many OSS nodes hit "2014-10-24 14:36:23 LustreError: 7479:0:(ost_handler.c:1776:ost_blocking_ast()) Error -2 syncing data on lock cancel" within a few minutes of recovery being complete
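
      (The following is just a sketch for tallying each symptom in the attached console log; the patterns are substrings of the messages quoted above, and "log" is the attachment name:)

      grep -c 'send limit expired' log
      grep -c 'blocked for more than' log
      grep -c 'waking for gap in transno' log
      grep -c 'Error -2 syncing data on lock cancel' log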

      Attachments

        1. log
          11 kB
          Christopher Morrone

        Issue Links

          Activity

            People

              Assignee: Oleg Drokin (green)
              Reporter: Christopher Morrone (morrone) (Inactive)
              Votes: 0
              Watchers: 7

              Dates

                Created:
                Updated:
                Resolved: