Lustre / LU-7020

OST_DESTROY message times out on MDS repeatedly, indefinitely

Details

    • Type: Bug
    • Resolution: Duplicate
    • Priority: Major

    Description

      We are seeing a single message on one of our MDS nodes time out over and over with no apparent end in sight. The message is destined for an OST, so each time it times out the connection with the OST is assumed dead, and the MDS reconnects. Here is a sample of the timeout:

      00000100:00000100:15.0:1437780834.157231:0:12825:0:(client.c:1976:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1437780188/real 1437780188]  req@ffff880c55ff7400 x1506778035514940/t0(0) o6->lsd-OST003a-osc-MDT0000@172.19.2.163@o2ib100:28/4 lens 664/432 e 27 to 1 dl 1437780833 ref 1 fl Rpc:Xru/2/ffffffff rc -11/-1
      00000100:00000200:15.0:1437780834.157246:0:12825:0:(events.c:99:reply_in_callback()) @@@ type 6, status 0  req@ffff880c55ff7400 x1506778035514940/t0(0) o6->lsd-OST003a-osc-MDT0000@172.19.2.163@o2ib100:28/4 lens 664/432 e 27 to 1 dl 1437780833 ref 1 fl Rpc:Xru/2/ffffffff rc -11/-1
      00000100:00000200:15.0:1437780834.157250:0:12825:0:(events.c:120:reply_in_callback()) @@@ unlink  req@ffff880c55ff7400 x1506778035514940/t0(0) o6->lsd-OST003a-osc-MDT0000@172.19.2.163@o2ib100:28/4 lens 664/432 e 27 to 1 dl 1437780833 ref 1 fl Rpc:X/2/ffffffff rc -11/-1
      00000400:00000200:15.0:1437780834.157254:0:12825:0:(lib-md.c:73:lnet_md_unlink()) Unlinking md ffff88072424e1c0
      00000100:02000400:15.0:1437780834.157258:0:12825:0:(import.c:179:ptlrpc_set_import_discon()) lsd-OST003a-osc-MDT0000: Connection to lsd-OST003a (at 172.19.2.163@o2ib100) was lost; in progress operations using this service will wait for recovery to complete
      

      Thread 12825 (ptlrpcd_14) is repeatedly timing out an opcode 6 (OST_DESTROY) request to OST lsd-OST003a. Interestingly, the request carries the same XID (x1506778035514940) each time. I cannot, however, see any resend of the request, so the OSS never really gets an opportunity to reply. It is especially strange that the send time is updated on each expiry even though the message does not appear to be resent.
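
      For context on the "sent .../dl ..." fields above, here is a minimal sketch (not the actual Lustre code; the struct and function names below are invented for illustration) of the deadline check that ptlrpc_expire_one_request() reports, using the numbers from the first debug line:

      /*
       * Illustrative only: a request carries a send time and a deadline
       * ("sent 1437780188 ... dl 1437780833" in the log), and once the wall
       * clock passes the deadline the request is declared timed out and the
       * connection to the OST is treated as lost.
       */
      #include <stdbool.h>
      #include <stdio.h>
      #include <time.h>

      struct toy_request {              /* not struct ptlrpc_request */
          unsigned long long xid;       /* "x1506778035514940" */
          time_t sent;                  /* "sent 1437780188" */
          time_t deadline;              /* "dl 1437780833" */
      };

      static bool toy_request_expired(const struct toy_request *req, time_t now)
      {
          return now > req->deadline;
      }

      int main(void)
      {
          /* Values taken from the first debug line above. */
          struct toy_request req = {
              .xid      = 1506778035514940ULL,
              .sent     = 1437780188,
              .deadline = 1437780833,
          };
          time_t now = 1437780834;      /* timestamp of the expiry message */

          printf("x%llu waited %lld s before expiring\n",
                 req.xid, (long long)(req.deadline - req.sent));   /* 645 s */
          if (toy_request_expired(&req, now))
              printf("timed out -> connection to the OST marked lost\n");
          return 0;
      }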

      This problem has persisted through reboots of both the MDS and OSTs. The sysadmins tried failing over the OST, and the problem followed the OST to the failover OSS (that part probably isn't terribly surprising).

      This has persisted for weeks now.

      Currently, the servers are running Lustre version 2.5.4-4chaos.

      Attachments

        1. debug_log.pilsner59.LU-7020.2015.10.15.txt.gz
          837 kB
          D. Marc Stearman
        2. debug_log.pilsner-mds1.LU-7020.2015.10.15.txt.gz
          0.3 kB
          D. Marc Stearman
        3. pilsner59.stack.out.03Nov2015.gz
          89 kB
          Cameron Harr
        4. pilsner59.stack.out.gz
          82 kB
          Cameron Harr

        Issue Links

          duplicates LU-5242

          Activity

            [LU-7020] OST_DESTROY message times out on MDS repeatedly, indefinitely
            pjones Peter Jones added a comment -

            Thanks Cameron


            charr Cameron Harr added a comment -

            Yes. It can be closed.

            pjones Peter Jones added a comment -

            That's good news Cameron. So, can we close this ticket as a duplicate of LU-5242?

            charr Cameron Harr added a comment -

            Hongchao,
            These messages were symptomatic of the reported issue, following the same format and period. However, looking again, the messages stopped about 10 minutes after I copied them. It appears the problem did resolve itself with the patched code; I should have waited about 22 more minutes before responding.


            hongchao.zhang Hongchao Zhang added a comment -

            Hi Cameron,

            The logs above look normal. The precreated objects on the OST are deleted by the MDT when the MDT recovers its connection to the OST; the "deleting orphan objects from xxx to xxx ..." messages correspond to exactly those cases, and the request opcode involved, OST_CREATE (5), is different from the opcode in this issue (OST_DESTROY, 6).

            Are there any other logs indicating that the OST_DESTROY request is resent periodically?

            Thanks

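            As a quick reference for the opcode numbers discussed here, a minimal excerpt (consistent with Lustre's enum ost_cmd and with the values quoted in this thread) showing why "o5" and "o6" in the debug lines refer to different operations:

            #include <stdio.h>

            /* Abridged excerpt: the orphan cleanup done after a reconnect is sent
             * as OST_CREATE ("o5" in the debug lines), while the request stuck in
             * this ticket is OST_DESTROY ("o6"). */
            enum ost_cmd_excerpt {
                OST_CREATE  = 5,
                OST_DESTROY = 6,
            };

            int main(void)
            {
                printf("OST_CREATE=%d OST_DESTROY=%d\n", OST_CREATE, OST_DESTROY);
                return 0;
            }
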
            charr Cameron Harr added a comment -

            I updated to Lustre 2.5.5-3chaos today on the system and there appears to be no change in the behavior. Following is output taken after the system has been updated:

            Jan 14 09:34:34 pilsner-mds1 kernel: Lustre: lsd-OST003a-osc-MDT0000: Connection to lsd-OST003a (at 172.19.2.162@o2ib100) was lost; in progress operations using this service will wait for recovery to complete
            Jan 14 09:34:34 pilsner59 kernel: Lustre: lsd-OST003a: Client lsd-MDT0000-mdtlov_UUID (at 172.19.2.102@o2ib100) reconnecting
            Jan 14 09:34:34 pilsner59 kernel: Lustre: lsd-OST003a: deleting orphan objects from 0x0:26258401 to 0x0:26258561
            Jan 14 09:34:34 pilsner-mds1 kernel: Lustre: lsd-OST003a-osc-MDT0000: Connection restored to lsd-OST003a (at 172.19.2.162@o2ib100)
            Jan 14 09:34:34 pilsner-mds1 kernel: Lustre: Skipped 1 previous similar message
            Jan 14 09:45:20 pilsner-mds1 kernel: Lustre: lsd-OST003a-osc-MDT0000: Connection to lsd-OST003a (at 172.19.2.162@o2ib100) was lost; in progress operations using this service will wait for recovery to complete
            Jan 14 09:45:20 pilsner59 kernel: Lustre: lsd-OST003a: Client lsd-MDT0000-mdtlov_UUID (at 172.19.2.102@o2ib100) reconnecting
            Jan 14 09:45:20 pilsner-mds1 kernel: Lustre: lsd-OST003a-osc-MDT0000: Connection restored to lsd-OST003a (at 172.19.2.162@o2ib100)
            Jan 14 09:45:39 pilsner59 kernel: Lustre: lsd-OST003a: deleting orphan objects from 0x0:26260525 to 0x0:26260961
            Jan 14 09:56:05 pilsner-mds1 kernel: Lustre: lsd-OST003a-osc-MDT0000: Connection to lsd-OST003a (at 172.19.2.162@o2ib100) was lost; in progress operations using this service will wait for recovery to complete
            Jan 14 09:56:05 pilsner59 kernel: Lustre: lsd-OST003a: Client lsd-MDT0000-mdtlov_UUID (at 172.19.2.102@o2ib100) reconnecting
            Jan 14 09:56:05 pilsner59 kernel: Lustre: lsd-OST003a: deleting orphan objects from 0x0:26262647 to 0x0:26262977
            Jan 14 09:56:05 pilsner-mds1 kernel: Lustre: lsd-OST003a-osc-MDT0000: Connection restored to lsd-OST003a (at 172.19.2.162@o2ib100)
            
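            For reference, the reconnect cycle in the messages above repeats roughly every 645-646 seconds, which matches the sent-to-deadline window in the original debug log (1437780833 - 1437780188 = 645 s), suggesting the same request is expiring once per timeout interval. A trivial check of the arithmetic:

            #include <stdio.h>

            /* Gaps between the three reconnect timestamps quoted above
             * (09:34:34, 09:45:20, 09:56:05), expressed as seconds since midnight. */
            int main(void)
            {
                int t[] = { 9*3600 + 34*60 + 34,
                            9*3600 + 45*60 + 20,
                            9*3600 + 56*60 +  5 };

                for (int i = 1; i < 3; i++)
                    printf("cycle %d: %d s\n", i, t[i] - t[i-1]);   /* 646 s, 645 s */
                return 0;
            }
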
            hongchao.zhang Hongchao Zhang added a comment - - edited

            The patch back-ported from http://review.whamcloud.com/#/c/13630/ in LU-5242 is tracked at http://review.whamcloud.com/#/c/17154/

            morrone Christopher Morrone (Inactive) added a comment -

            Yes, it certainly sounds like LU-5242 is a likely candidate for fixing this issue. It looks like something that fell through the cracks at Intel when it came time to land the backport.

            charr Cameron Harr added a comment -

            This is late notice, but we have planned OS upgrades on this system this morning (08:30-10:00 PST) and I will have the system offline for a small amount of time. It looks like we have some pointers to chase down, but let me know if you'd like me to do anything with that downtime.

            bzzz Alex Zhuravlev added a comment - - edited

            Object size is not equal to transaction size. The transaction size includes the blocks the transaction modifies, but the blocks used for the user's data aren't modified, so they are not part of the transaction. IOW, it's only the metadata blocks (dnode, tree).


            People

              Assignee: hongchao.zhang Hongchao Zhang
              Reporter: morrone Christopher Morrone (Inactive)
              Votes: 0
              Watchers: 10
