Description
We are seeing a single message on one of our MDS nodes time out over and over, with no end in sight. The message is destined for an OST, so each time it times out the connection to the OST is assumed dead and the MDS reconnects. Here is a sample of the timeout:
00000100:00000100:15.0:1437780834.157231:0:12825:0:(client.c:1976:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1437780188/real 1437780188] req@ffff880c55ff7400 x1506778035514940/t0(0) o6->lsd-OST003a-osc-MDT0000@172.19.2.163@o2ib100:28/4 lens 664/432 e 27 to 1 dl 1437780833 ref 1 fl Rpc:Xru/2/ffffffff rc -11/-1
00000100:00000200:15.0:1437780834.157246:0:12825:0:(events.c:99:reply_in_callback()) @@@ type 6, status 0 req@ffff880c55ff7400 x1506778035514940/t0(0) o6->lsd-OST003a-osc-MDT0000@172.19.2.163@o2ib100:28/4 lens 664/432 e 27 to 1 dl 1437780833 ref 1 fl Rpc:Xru/2/ffffffff rc -11/-1
00000100:00000200:15.0:1437780834.157250:0:12825:0:(events.c:120:reply_in_callback()) @@@ unlink req@ffff880c55ff7400 x1506778035514940/t0(0) o6->lsd-OST003a-osc-MDT0000@172.19.2.163@o2ib100:28/4 lens 664/432 e 27 to 1 dl 1437780833 ref 1 fl Rpc:X/2/ffffffff rc -11/-1
00000400:00000200:15.0:1437780834.157254:0:12825:0:(lib-md.c:73:lnet_md_unlink()) Unlinking md ffff88072424e1c0
00000100:02000400:15.0:1437780834.157258:0:12825:0:(import.c:179:ptlrpc_set_import_discon()) lsd-OST003a-osc-MDT0000: Connection to lsd-OST003a (at 172.19.2.163@o2ib100) was lost; in progress operations using this service will wait for recovery to complete
Thread 12825 (ptlrpcd_14) is repeatedly timing out an opcode 6 (OST_DESTROY) request to OST lsd-OST003a. Interestingly, the request has the same xid (x1506778035514940) each time. I cannot, however, see any resend of the request, so the OSS never really gets an opportunity to reply to the message. It is especially strange that the send time is updated each time even though the message does not appear to be resent.
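For what it's worth, the "no resend" observation above can be checked mechanically by filtering a debug-log dump for the request's xid and seeing whether any actual send events appear between the expirations. This is only a sketch: the log path and the inline sample are assumptions (on a live node the dump would come from `lctl dk`), and the `ptl_send_rpc` marker is just the send-side function one would look for.

```shell
# Sketch: does xid x1506778035514940 ever get resent, or only expired?
# LOG path and sample contents are assumptions for illustration; a real
# dump would be captured with something like: lctl dk > /tmp/lustre.dk
LOG=/tmp/sample.dk

# Inline stand-in for real `lctl dk` output (two expiry-side lines,
# no send-side lines), mirroring the trace in this ticket.
cat > "$LOG" <<'EOF'
00000100:00000100:15.0:1437780834.157231:0:12825:0:(client.c:1976:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: req@ffff880c55ff7400 x1506778035514940/t0(0)
00000100:00000200:15.0:1437780834.157246:0:12825:0:(events.c:99:reply_in_callback()) @@@ type 6, status 0 req@ffff880c55ff7400 x1506778035514940/t0(0)
EOF

# How many times the xid shows up at all:
grep -c 'x1506778035514940' "$LOG"

# How many of those are send-side events (ptl_send_rpc); zero here,
# consistent with the request timing out without ever being resent.
# (grep -c prints 0 but exits nonzero on no match, hence the || true.)
grep -c 'ptl_send_rpc' "$LOG" || true
```

With the sample above this prints 2 and then 0: the xid is seen only in expiry/callback paths, never in a send path.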
This problem has persisted through reboots of both the MDS and OSTs. The sysadmins tried failing over the OST, and the problem followed the OST to the failover OSS (that part probably isn't terribly surprising).
This has persisted for weeks now.
Currently, the servers are running Lustre version 2.5.4-4chaos.
Issue Links
- is related to LU-5242 "Test hang sanity test_132, test_133: umount ost" (Resolved)