[LU-7676] OSS Servers stuck in connecting/disconnect loop - Whamcloud Community JIRA

Details

Type: Bug
Resolution: Duplicate
Priority: Minor
Fix Version/s: None
Affects Version/s: None
Labels: None
Severity: 1
Rank (Obsolete):
9223372036854775807

Description

Several of our OSSes have started cycling between disconnecting from and reconnecting to clients. Sometimes they clear up and then re-enter the same state later. Even after a reboot they re-enter the same state.

Attaching a Lustre debug dump. Please advise on what additional information is needed for debugging.

Attachments

out.1452910586.gz
0.2 kB
16/Jan/16 2:28 AM

Issue Links

is related to

LU-7054 ib_cm scalling issue when lustre clients connect to OSS

Resolved

LU-7569 IB leaf switch caused LNet routers to crash

Resolved

Activity

Jay Lan (Inactive) added a comment - 18/Jan/16 8:53 PM

It looks like http://review.whamcloud.com/#/c/18025/ is a backport of the LU-7569 patch http://review.whamcloud.com/#/c/17892/ that I have been asking for. Thanks.

Jay Lan (Inactive) added a comment - 18/Jan/16 7:54 PM

Patch 18026 missed a newline at line 1909.

Doug Oucharek (Inactive) added a comment - 18/Jan/16 6:29 PM

NASA: Please apply both patches being discussed here: http://review.whamcloud.com/#/c/18025/ and http://review.whamcloud.com/18026.

Doug Oucharek (Inactive) added a comment - 18/Jan/16 5:41 PM

Liang: is your patch in addition to the one Amir ported or is it a replacement for it?

Liang Zhen (Inactive) added a comment - 17/Jan/16 2:56 PM - edited

I checked my original patch, and it seems I forgot to call set_current_state() before schedule_timeout(). Without that the patch can't really help, because the current thread stays runnable and schedule_timeout() returns without sleeping. I have updated the patch uploaded by Amir (http://review.whamcloud.com/#/c/16470/), and I also ported it to 2_5_fe (http://review.whamcloud.com/18026).

Amir Shehata (Inactive) added a comment - 17/Jan/16 7:30 AM

Ported the patch here:
http://review.whamcloud.com/#/c/18025/

Amir Shehata (Inactive) added a comment - 16/Jan/16 7:41 PM

I looked through the attached log file and I see 442 instances of connection races, which occur when two nodes attempt to reconnect to each other at the same time. This could result in a flurry of reconnects, which could consume memory. There is a prototype patch that was written to address the same issue at another site. I'm porting it to NASA's branch and will push it later today for you to try.
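
For anyone who wants to produce a similar count from their own dump, here is a rough Python sketch. It assumes the debug dump has already been converted to text, and that each connection race shows up as a line containing `Conn race <peer NID>`. That marker string is an assumption about the o2iblnd message text, not something taken from this ticket, so adjust `RACE_MARKER` to whatever your log actually contains. The function names and the sample NIDs are made up for illustration.

```python
import gzip
import re
from collections import Counter

# ASSUMPTION: a connection race is logged as "Conn race <peer NID>".
# Change this pattern to match the message text in your own dump.
RACE_MARKER = re.compile(r"Conn race (\S+)")


def count_conn_races(lines):
    """Return (total, per_peer) for all lines that match RACE_MARKER."""
    per_peer = Counter()
    for line in lines:
        match = RACE_MARKER.search(line)
        if match:
            per_peer[match.group(1)] += 1
    return sum(per_peer.values()), per_peer


def count_conn_races_in_gzip(path):
    """Same as count_conn_races, but reads a gzip-compressed text log."""
    with gzip.open(path, "rt", errors="replace") as handle:
        return count_conn_races(handle)


# Made-up sample lines, only to show the shape of the result.
sample = [
    "kiblnd_passive_connect()) Conn race 10.0.0.1@o2ib",
    "some unrelated line",
    "kiblnd_passive_connect()) Conn race 10.0.0.2@o2ib",
    "kiblnd_passive_connect()) Conn race 10.0.0.1@o2ib",
]
total, per_peer = count_conn_races(sample)
print(total, per_peer.most_common(1))
```

Breaking the count down per peer may help show whether the races are spread across all clients or concentrated on a few nodes.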

Bob Ciotti (Inactive) added a comment - 16/Jan/16 7:55 AM

We still have two production filesystems down. This is a critical problem.

We are going to try to run jobs on the remaining filesystems, but there were issues doing this earlier. So risky.

We are going to investigate network issues. We have found no HW problems.

Assuming that it's not a network problem, do you have any suggestions as to where we should look? Debug settings? Other information we can provide to you? Mahmoud said that the uploaded traces cover the period from boot to encountering the issue.

Bob Ciotti (Inactive) added a comment - 16/Jan/16 6:59 AM - edited

We also see many messages like this:
out.nbp2-oss18.1452913951.gz.denum:
00000800:00000200:15.0:1452913946.993806:0:21340:0:(o2iblnd.c:1898:kiblnd_pool_alloc_node()) Another thread is allocating new TX pool, waiting 1024 HZs for her to complete.trips = 83498830

This message was part of a patch generated in https://jira.hpdd.intel.com/browse/LU-7054
http://review.whamcloud.com/#/c/16470/2/lnet/klnds/o2iblnd/o2iblnd.c
but we still see a very large number of "trips" through this wait. I had assumed that the "waiting 1024 HZs" would slow this down. Or does it simply schedule other threads, if any are waiting, and not sleep? (Unclear to me.) In the traces I've looked at, I don't see any new pools being successfully created (or the indication of how long pool creation took to complete).

You must forgive me, I'm grasping a little from memory... I seem to recall that there was some competition between the freeing (unregister) and pool allocation. Is it possible that something slow in the deallocation prevents new pools from being created?

Also, I'm not familiar with this code (and I'm looking at this on my Apple Watch). The "schedule_timeout(interval)" mapped to an inline null function, so I couldn't decipher it yet.
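
To track how the "trips" counter grows over time, a small Python sketch like the one below can split a debug line such as the one quoted above into fields. The field names (`subsystem`, `mask`, `cpu`, `stack`, `extpid`, and so on) are labels I chose for the colon-separated columns and should be treated as assumptions. Only the layout of the sample line itself comes from this ticket.

```python
import re

# Field names are illustrative guesses for the colon-separated columns.
DEBUG_LINE = re.compile(
    r"^(?P<subsystem>[0-9a-f]{8}):(?P<mask>[0-9a-f]{8}):"
    r"(?P<cpu>\d+)\.(?P<type>\d+):(?P<timestamp>\d+\.\d+):"
    r"(?P<stack>\d+):(?P<pid>\d+):(?P<extpid>\d+):"
    r"\((?P<file>[^:]+):(?P<line>\d+):(?P<function>\w+)\(\)\) "
    r"(?P<message>.*)$"
)
TRIPS = re.compile(r"trips = (\d+)")


def parse_debug_line(text):
    """Return a dict of fields for one debug line, or None if it does not match."""
    match = DEBUG_LINE.match(text.strip())
    return match.groupdict() if match else None


def trips_count(fields):
    """Extract the integer after 'trips = ' from the message, or None."""
    match = TRIPS.search(fields["message"])
    return int(match.group(1)) if match else None


# The line quoted in the comment above.
sample = (
    "00000800:00000200:15.0:1452913946.993806:0:21340:0:"
    "(o2iblnd.c:1898:kiblnd_pool_alloc_node()) Another thread is allocating "
    "new TX pool, waiting 1024 HZs for her to complete.trips = 83498830"
)
fields = parse_debug_line(sample)
print(fields["function"], fields["timestamp"], trips_count(fields))
```

Comparing `trips_count` across two lines with known timestamps gives a rough rate of trips per second, which would show whether the waiting thread is actually sleeping between checks.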

Oleg Drokin added a comment - 16/Jan/16 6:31 AM

That unexpectedly long timeout is more of the same.
The network, network driver, or network card is slow to unregister the buffers we are trying to unregister. Slow as in it takes over 300 seconds to unregister such buffers (this is what triggers the message).

I think this is another sign of an unhealthy network/card/driver. It's not normal for a connection to a peer to fail with ETIMEDOUT (-110)/UNREACHABLE as seen in the last snippet in my previous comment.

Mahmoud Hanafi added a comment - 16/Jan/16 5:57 AM - edited

We haven't been able to identify any network issues. As far as we can tell the network is fine.

What do you make of these messages? The downward slide of the servers is preceded by these:

Jan 15 19:10:09 nbp2-oss20 kernel: Lustre: 22081:0:(niobuf.c:285:ptlrpc_abort_bulk()) Unexpectedly long timeout: desc ffff880494ed2000
Jan 15 19:10:09 nbp2-oss20 kernel: Lustre: 22081:0:(niobuf.c:285:ptlrpc_abort_bulk()) Skipped 5 previous similar messages
Jan 15 19:10:22 nbp2-oss18 kernel: Lustre: 21874:0:(niobuf.c:285:ptlrpc_abort_bulk()) Unexpectedly long timeout: desc ffff881b124e4000
Jan 15 19:10:22 nbp2-oss18 kernel: Lustre: 21874:0:(niobuf.c:285:ptlrpc_abort_bulk()) Skipped 9 previous similar messages
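
To see which servers hit this first and how often, the syslog lines can be tallied per host. The Python sketch below is one way to do it. It assumes that each "Skipped N previous similar messages" line coming from the abort_bulk call site stands for N additional long-timeout messages that were rate-limited. The function name is hypothetical.

```python
import re
from collections import defaultdict

SYSLOG = re.compile(
    r"^(?P<stamp>\w{3} +\d+ [\d:]{8}) (?P<host>\S+) kernel: Lustre: (?P<rest>.*)$"
)
SKIPPED = re.compile(r"Skipped (\d+) previous similar messages")


def long_timeouts_per_host(lines):
    """Count long-timeout messages per host, including rate-limited ones."""
    counts = defaultdict(int)
    for line in lines:
        match = SYSLOG.match(line.strip())
        if not match or "abort_bulk" not in match.group("rest"):
            continue
        rest = match.group("rest")
        if "Unexpectedly long timeout" in rest:
            counts[match.group("host")] += 1
        else:
            # ASSUMPTION: skipped messages were the same long-timeout message.
            skipped = SKIPPED.search(rest)
            if skipped:
                counts[match.group("host")] += int(skipped.group(1))
    return dict(counts)


sample = [
    "Jan 15 19:10:09 nbp2-oss20 kernel: Lustre: 22081:0:(niobuf.c:285:ptlrpc_abort_bulk()) Unexpectedly long timeout: desc ffff880494ed2000",
    "Jan 15 19:10:09 nbp2-oss20 kernel: Lustre: 22081:0:(niobuf.c:285:ptlrpc_abort_bulk()) Skipped 5 previous similar messages",
    "Jan 15 19:10:22 nbp2-oss18 kernel: Lustre: 21874:0:(niobuf.c:285:ptlrpc_abort_bulk()) Unexpectedly long timeout: desc ffff881b124e4000",
    "Jan 15 19:10:22 nbp2-oss18 kernel: Lustre: 21874:0:(niobuf.c:285:ptlrpc_abort_bulk()) Skipped 9 previous similar messages",
]
totals = long_timeouts_per_host(sample)
print(totals)
```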

People

Assignee: Oleg Drokin

Reporter: Mahmoud Hanafi

Votes: 0

Watchers: 11

Dates

Created: 16/Jan/16 2:28 AM

Updated: 22/Sep/16 10:38 PM

Resolved: 22/Sep/16 10:38 PM