[LU-2455] lctl ping takes too long to timeout Created: 10/Dec/12 Updated: 05/Mar/16 Resolved: 05/Mar/16 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.1.2 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Minor |
| Reporter: | Shuichi Ihara (Inactive) | Assignee: | Bruno Faccini (Inactive) |
| Resolution: | Incomplete | Votes: | 0 |
| Labels: | None |
| Attachments: |
|
| Severity: | 3 |
| Rank (Obsolete): | 5797 |
| Description |
|
lctl ping in theory takes a timeout parameter, with the default timeout being 1 second. In practice, however, the timeout can be significantly longer. It appears that it changes the timeout to 60s if it needs to UNLINK. Is there any way to eliminate this or is it mandatory that if there is an error, pings could take up to 60s? As it is now, the timeout is not very useful. |
| Comments |
| Comment by Peter Jones [ 11/Dec/12 ] |
|
Bruno, could you please advise on this one? Thanks, Peter |
| Comment by Bruno Faccini (Inactive) [ 11/Dec/12 ] |
|
I am afraid that's the way it is coded in the lnet_ping() routine, and this applies to both the 60s timeout value and the automatic wait/retry that uses it. BTW, can you provide the "lctl ping" output and, if possible, enable "echo +neterror +net > /proc/sys/lnet/[debug,print]" when you hit this kind of error? This will help me confirm the responsible code path in the sources. |
| Comment by Kit Westneat (Inactive) [ 11/Dec/12 ] |
|
Here's the relevant dk output; I'll attach the full dk:

00000400:00000200:0.0:1355245748.810589:0:5818:0:(lib-move.c:2705:LNetGet()) LNetGet -> 12345-10.10.10.179@tcp1

If I am reading it right, the ping times out correctly after 1s, but then the unlinking takes 20s.

Here is how I reproduced it:

On an IB network, we have seen it take over 50s. |
| Comment by Kit Westneat (Inactive) [ 11/Dec/12 ] |
|
full dk from the time period of the ping |
| Comment by Bruno Faccini (Inactive) [ 11/Dec/12 ] |
|
The unlink will occur asynchronously ("lnet_md_unlink()) Queueing unlink") because at least one msg may still reference the MD. The way and timing in which msgs are terminated or discarded seems to be NAL dependent, which may explain the differences you've seen. I will try to explain this fully using your reproduction method. |
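The queued-unlink behaviour described in this comment can be illustrated with a minimal sketch. This is not the LNet implementation: `toy_md`, `toy_md_unlink()` and `toy_msg_finalize()` are hypothetical names, and the sketch only assumes that an MD cannot be torn down while an in-flight message still references it.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for a memory descriptor; not the real LNet struct. */
struct toy_md {
    int  refs;            /* in-flight messages still referencing the MD */
    bool unlink_pending;  /* an unlink was requested while refs > 0 */
    bool unlinked;        /* the MD has actually been torn down */
};

/* Request an unlink. Returns true if it completed immediately,
 * false if it had to be queued behind in-flight messages. */
static bool toy_md_unlink(struct toy_md *md)
{
    if (md->refs > 0) {
        md->unlink_pending = true;   /* "Queueing unlink" */
        return false;
    }
    md->unlinked = true;
    return true;
}

/* Called when the network layer completes or discards a message.
 * The last reference to drop performs any queued unlink. */
static void toy_msg_finalize(struct toy_md *md)
{
    assert(md->refs > 0);
    md->refs--;
    if (md->refs == 0 && md->unlink_pending) {
        md->unlink_pending = false;
        md->unlinked = true;
    }
}
```

In this model the caller waits for as long as the network layer takes to finalize the last message, which would be consistent with the delay differing between TCP and IB.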
| Comment by Kit Westneat (Inactive) [ 13/Dec/12 ] |
|
I guess the problem is that it's difficult to put a timeout on a TCP connect operation. That appears to be what is blocking for 20s. It's too bad that there is no way to handle the cleanup in the background and return to userspace before the connect times out. I saw that the MD is created with auto-unlink enabled, but there doesn't appear to be a corresponding "autofree" for the event queue. Is that correct? Would it be possible to add an event handler callback that would do the cleanup? That way the lnet_ping code could enqueue the unlink and then return an error immediately. It's probably not that straightforward, I suppose. |
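The callback idea proposed in this comment could look roughly like the sketch below. It is purely illustrative and does not reflect how lnet_ping() is written: `toy_eq`, `toy_ping_md`, `toy_unlink_done()` and `toy_ping_timed_out()` are all hypothetical names, and the sketch assumes the network layer can invoke a handler once the unlink finally completes.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical event queue; "freed" stands in for releasing it. */
struct toy_eq { bool freed; };

typedef void (*toy_cleanup_fn)(struct toy_eq *eq);

/* Hypothetical MD used by a ping; not the real LNet struct. */
struct toy_ping_md {
    struct toy_eq *eq;
    toy_cleanup_fn on_unlink;  /* runs once the unlink completes */
    bool           busy;       /* e.g. a connect is still outstanding */
};

static void toy_free_eq(struct toy_eq *eq) { eq->freed = true; }

/* Invoked (by the network layer, in this model) when the unlink is done. */
static void toy_unlink_done(struct toy_ping_md *md)
{
    assert(md->eq != NULL);
    md->busy = false;
    if (md->on_unlink != NULL) {
        md->on_unlink(md->eq);
        md->on_unlink = NULL;  /* make sure cleanup runs only once */
    }
}

/* Ping timed out: register the cleanup and return right away
 * instead of blocking until the unlink completes. */
static int toy_ping_timed_out(struct toy_ping_md *md)
{
    md->on_unlink = toy_free_eq;
    if (!md->busy)
        toy_unlink_done(md);   /* nothing outstanding, clean up now */
    return -ETIMEDOUT;
}
```

The trade-off in such a design is that the event queue outlives the call that created it, so the handler must own everything it frees.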