[LU-12215] OSS network errors 2.12 - Whamcloud Community JIRA

Details

Type: Bug
Resolution: Cannot Reproduce
Priority: Blocker
Fix Version/s: None
Affects Version/s: Lustre 2.12.0
Labels:
None
Environment:
CentOS 7.6, Servers and clients 2.12.0+patches

Severity:
2
Rank (Obsolete):
9223372036854775807

Description

I recently commented in several tickets regarding OSS issues. I think this is some kind of deadlock like we had for Oak in LU-12162 or ~~LU-12018~~. But we run with 2.12.0+ the patch for ~~LU-12018~~ (quota: do not start a thread under memory pressure) so I don't think this is the same issue. I tried a SysRq+t and I see a high numbers of threads blocked in ldiskfs. Because the dump took a SUPER long time, and we couldn't wait more, I tried a crash dump but it failed. So I have logs and a partial sysrq-t.

What makes me think of a new deadlock is that the load keeps increasing if the server doesn't crash:

fir-io2-s2:  19:09:23 up  8:45,  1 user,  load average: 797.66, 750.86, 519.06

Symptoms can be either this trace like in LU-11644 reported by NASA:

Apr 22 17:12:27 fir-io2-s2 kernel: Pid: 83769, comm: ll_ost01_036 3.10.0-957.1.3.el7_lustre.x86_64 #1 SMP Fri Dec 7 14:50:35 PST 2018
Apr 22 17:12:27 fir-io2-s2 kernel: Call Trace:
Apr 22 17:12:27 fir-io2-s2 kernel: [<ffffffffc1232640>] ptlrpc_set_wait+0x500/0x8d0 [ptlrpc]
Apr 22 17:12:27 fir-io2-s2 kernel: [<ffffffffc11effe5>] ldlm_run_ast_work+0xd5/0x3a0 [ptlrpc]
Apr 22 17:12:27 fir-io2-s2 kernel: [<ffffffffc121169b>] ldlm_glimpse_locks+0x3b/0x100 [ptlrpc]
Apr 22 17:12:27 fir-io2-s2 kernel: [<ffffffffc181310b>] ofd_intent_policy+0x69b/0x920 [ofd]
Apr 22 17:12:27 fir-io2-s2 kernel: [<ffffffffc11f0d26>] ldlm_lock_enqueue+0x366/0xa60 [ptlrpc]
Apr 22 17:12:27 fir-io2-s2 kernel: [<ffffffffc12196d7>] ldlm_handle_enqueue0+0xa47/0x15a0 [ptlrpc]
Apr 22 17:12:27 fir-io2-s2 kernel: [<ffffffffc12a00b2>] tgt_enqueue+0x62/0x210 [ptlrpc]
Apr 22 17:12:27 fir-io2-s2 kernel: [<ffffffffc12a710a>] tgt_request_handle+0xaea/0x1580 [ptlrpc]
Apr 22 17:12:27 fir-io2-s2 kernel: [<ffffffffc124b6db>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
Apr 22 17:12:27 fir-io2-s2 kernel: [<ffffffffc124f00c>] ptlrpc_main+0xafc/0x1fc0 [ptlrpc]
Apr 22 17:12:27 fir-io2-s2 kernel: [<ffffffffa90c1c31>] kthread+0xd1/0xe0
Apr 22 17:12:27 fir-io2-s2 kernel: [<ffffffffa9774c24>] ret_from_fork_nospec_begin+0xe/0x21
Apr 22 17:12:27 fir-io2-s2 kernel: [<ffffffffffffffff>] 0xffffffffffffffff
Apr 22 17:12:27 fir-io2-s2 kernel: LustreError: dumping log to /tmp/lustre-log.1555978347.83769

or these network-related messages:

Apr 22 17:12:47 fir-io2-s2 kernel: LustreError: 81272:0:(events.c:450:server_bulk_callback()) event type 5, status -125, desc ffff90da74f1d200
Apr 22 17:12:53 fir-io2-s2 kernel: LustreError: 81276:0:(events.c:450:server_bulk_callback()) event type 5, status -125, desc ffff90fedd133c00
Apr 22 17:12:53 fir-io2-s2 kernel: LustreError: 81276:0:(events.c:450:server_bulk_callback()) event type 5, status -125, desc ffff90fedd133c00

Apr 22 18:48:53 fir-io2-s2 kernel: LustreError: 81277:0:(events.c:450:server_bulk_callback()) event type 5, status -125, desc ffff90e22fd35000
Apr 22 18:48:53 fir-io2-s2 kernel: LustreError: 38519:0:(ldlm_lib.c:3264:target_bulk_io()) @@@ network error on bulk WRITE  req@ffff90e109b76850 x1631329926812992/t0(0) o4->95e6fd6a-706d-ff18-fa02-0b0e9d53d014@10.8.19.8@o2ib6:301/0 lens 488/448 e 0 to 0 dl 1555984331 ref 1 fl Interpret:/0/0 rc 0/0

but clearly it's not a network issue, just a server deadlock. Again, very like the issue on Oak from LU-12162. I'm not sure the list of the tasks provided here is enough to troubleshoot this, but it would be great if you could take a look. This should be considered at least a Sev 2 as we have been pretty much down lately due to this. At the moment, I'm trying to restart without ost quota disabled to see if that's better. Thanks much.

NOTE: we tried to run with net enabled but this doesn't seem to help. also we see a spike in msgs_alloc like NASA but I think it's just due to the deadlock.

Attaching kernel logs + sysrq-t (PARTIAL) as fir-io2-s2_kernel+sysrq_PARTIAL_20190422.log

Attachments

- Sort By Name
- Sort By Date
- Ascending
- Descending
- Thumbnails
- List
- Download All

fir-io1-s1-previous_bulk.log
108 kB
23/Apr/19 4:56 PM
fir-io2-s2_kernel+sysrq_PARTIAL_20190422.log
2.70 MB
23/Apr/19 2:37 AM
fir-io3-s2_foreach_bt_20190422_204054.txt
1.10 MB
23/Apr/19 6:59 AM
fir-io3-s2-vmcore-dmesg-20190422_204054.log
825 kB
23/Apr/19 6:59 AM
fir-lnet-msgs_alloc-20190423.png
258 kB
23/Apr/19 4:12 PM
fir-ossbw-20190423.png
302 kB
23/Apr/19 4:12 PM
lnet-fir-oss.txt
1 kB
23/Apr/19 6:01 PM
lnet-sh-1-fdr.txt
1 kB
23/Apr/19 6:01 PM
lnet-sh-2-edr.txt
1 kB
23/Apr/19 6:01 PM
lnet-sh-rtr-fir-1.txt
2 kB
23/Apr/19 6:01 PM
lnet-sh-rtr-fir-2.txt
2 kB
23/Apr/19 6:01 PM

Activity

[LU-12215] OSS network errors 2.12

Patrick Farrell (Inactive) added a comment - 12/Jul/19 9:12 PM

This particular issue has not reocurred recently, and there is another bug (https://jira.whamcloud.com/browse/LU-12451) capturing similar issues, so we will focus on that and re-open this if necessary.

Patrick Farrell (Inactive) added a comment - 12/Jul/19 9:12 PM This particular issue has not reocurred recently, and there is another bug ( https://jira.whamcloud.com/browse/LU-12451 ) capturing similar issues, so we will focus on that and re-open this if necessary.

Patrick Farrell (Inactive) added a comment - 12/Jul/19 9:11 PM

OK - Rather than keep open an old ticket, we'll focus in the new ones. Thanks, Stephane.

Patrick Farrell (Inactive) added a comment - 12/Jul/19 9:11 PM OK - Rather than keep open an old ticket, we'll focus in the new ones. Thanks, Stephane.

Stephane Thiell added a comment - 12/Jul/19 9:06 PM

Hi Patrick,

No, not recently, at least not with these specific errors. The last problem we had on Fir is described in LU-12451. I'm not sure why we had some routers running out of rtr credits.

Stephane Thiell added a comment - 12/Jul/19 9:06 PM Hi Patrick, No, not recently, at least not with these specific errors. The last problem we had on Fir is described in LU-12451 . I'm not sure why we had some routers running out of rtr credits.

Patrick Farrell (Inactive) added a comment - 12/Jul/19 8:44 PM

Stephane,

I'm kind of thinking this specific bug has been overcome by events and is maybe captured elsewhere, as we've worked through various issues recently - Are you still seeing the issues described here?

Patrick Farrell (Inactive) added a comment - 12/Jul/19 8:44 PM Stephane, I'm kind of thinking this specific bug has been overcome by events and is maybe captured elsewhere, as we've worked through various issues recently - Are you still seeing the issues described here?

Stephane Thiell added a comment - 01/May/19 5:44 AM

The lack of peer credits seems to have been a consequence of having enabled lnet debugging. We did some tests with +net at some point but this was a bad idea I guess. Setting debug and subsystem_debug back to -all seems to have fixed it. We're still seeing the following messages but it's hard to tell what could be wrong:

fir-md1-s2: Apr 30 20:23:26 fir-md1-s2 kernel: LNetError: 121174:0:(lib-msg.c:811:lnet_is_health_check()) Msg is in inconsistent state, don't perform health checking (0, 5)
fir-md1-s2: Apr 30 20:23:26 fir-md1-s2 kernel: LNetError: 121174:0:(lib-msg.c:811:lnet_is_health_check()) Skipped 1 previous similar message
fir-md1-s2: Apr 30 20:33:38 fir-md1-s2 kernel: LNetError: 121174:0:(lib-msg.c:811:lnet_is_health_check()) Msg is in inconsistent state, don't perform health checking (0, 5)
fir-md1-s2: Apr 30 20:51:04 fir-md1-s2 kernel: LNetError: 121172:0:(lib-msg.c:811:lnet_is_health_check()) Msg is in inconsistent state, don't perform health checking (0, 5)
fir-md1-s2: Apr 30 21:06:09 fir-md1-s2 kernel: LNetError: 121173:0:(lib-msg.c:811:lnet_is_health_check()) Msg is in inconsistent state, don't perform health checking (0, 5)
fir-md1-s2: Apr 30 21:06:09 fir-md1-s2 kernel: LNetError: 121173:0:(lib-msg.c:811:lnet_is_health_check()) Skipped 3 previous similar messages
fir-md1-s2: Apr 30 21:18:03 fir-md1-s2 kernel: LNetError: 121176:0:(lib-msg.c:811:lnet_is_health_check()) Msg is in inconsistent state, don't perform health checking (0, 5)
fir-md1-s2: Apr 30 21:18:03 fir-md1-s2 kernel: LNetError: 121176:0:(lib-msg.c:811:lnet_is_health_check()) Skipped 1 previous similar message
fir-md1-s2: Apr 30 21:47:15 fir-md1-s2 kernel: LNetError: 121173:0:(lib-msg.c:811:lnet_is_health_check()) Msg is in inconsistent state, don't perform health checking (0, 5)
fir-md1-s2: Apr 30 22:25:11 fir-md1-s2 kernel: LNetError: 121175:0:(lib-msg.c:811:lnet_is_health_check()) Msg is in inconsistent state, don't perform health checking (0, 5)
fir-io1-s2: Apr 30 19:51:58 fir-io1-s2 kernel: LNetError: 108541:0:(lib-msg.c:811:lnet_is_health_check()) Msg is in inconsistent state, don't perform health checking (0, 5)
fir-io1-s2: Apr 30 19:52:03 fir-io1-s2 kernel: LNetError: 108539:0:(lib-msg.c:811:lnet_is_health_check()) Msg is in inconsistent state, don't perform health checking (0, 5)
fir-io1-s2: Apr 30 20:08:25 fir-io1-s2 kernel: LNetError: 108535:0:(lib-msg.c:811:lnet_is_health_check()) Msg is in inconsistent state, don't perform health checking (0, 5)
fir-io1-s2: Apr 30 20:09:26 fir-io1-s2 kernel: LNetError: 108545:0:(lib-msg.c:811:lnet_is_health_check()) Msg is in inconsistent state, don't perform health checking (0, 5)
fir-io1-s2: Apr 30 20:10:03 fir-io1-s2 kernel: LNetError: 108541:0:(lib-msg.c:811:lnet_is_health_check()) Msg is in inconsistent state, don't perform health checking (0, 5)
fir-io1-s2: Apr 30 20:13:34 fir-io1-s2 kernel: LNetError: 108535:0:(lib-msg.c:811:lnet_is_health_check()) Msg is in inconsistent state, don't perform health checking (0, 5)
fir-io1-s2: Apr 30 20:17:04 fir-io1-s2 kernel: LNetError: 108533:0:(lib-msg.c:811:lnet_is_health_check()) Msg is in inconsistent state, don't perform health checking (0, 5)

Stephane Thiell added a comment - 01/May/19 5:44 AM The lack of peer credits seems to have been a consequence of having enabled lnet debugging. We did some tests with +net at some point but this was a bad idea I guess. Setting debug and subsystem_debug back to -all seems to have fixed it. We're still seeing the following messages but it's hard to tell what could be wrong: fir-md1-s2: Apr 30 20:23:26 fir-md1-s2 kernel: LNetError: 121174:0:(lib-msg.c:811:lnet_is_health_check()) Msg is in inconsistent state, don't perform health checking (0, 5) fir-md1-s2: Apr 30 20:23:26 fir-md1-s2 kernel: LNetError: 121174:0:(lib-msg.c:811:lnet_is_health_check()) Skipped 1 previous similar message fir-md1-s2: Apr 30 20:33:38 fir-md1-s2 kernel: LNetError: 121174:0:(lib-msg.c:811:lnet_is_health_check()) Msg is in inconsistent state, don't perform health checking (0, 5) fir-md1-s2: Apr 30 20:51:04 fir-md1-s2 kernel: LNetError: 121172:0:(lib-msg.c:811:lnet_is_health_check()) Msg is in inconsistent state, don't perform health checking (0, 5) fir-md1-s2: Apr 30 21:06:09 fir-md1-s2 kernel: LNetError: 121173:0:(lib-msg.c:811:lnet_is_health_check()) Msg is in inconsistent state, don't perform health checking (0, 5) fir-md1-s2: Apr 30 21:06:09 fir-md1-s2 kernel: LNetError: 121173:0:(lib-msg.c:811:lnet_is_health_check()) Skipped 3 previous similar messages fir-md1-s2: Apr 30 21:18:03 fir-md1-s2 kernel: LNetError: 121176:0:(lib-msg.c:811:lnet_is_health_check()) Msg is in inconsistent state, don't perform health checking (0, 5) fir-md1-s2: Apr 30 21:18:03 fir-md1-s2 kernel: LNetError: 121176:0:(lib-msg.c:811:lnet_is_health_check()) Skipped 1 previous similar message fir-md1-s2: Apr 30 21:47:15 fir-md1-s2 kernel: LNetError: 121173:0:(lib-msg.c:811:lnet_is_health_check()) Msg is in inconsistent state, don't perform health checking (0, 5) fir-md1-s2: Apr 30 22:25:11 fir-md1-s2 kernel: LNetError: 121175:0:(lib-msg.c:811:lnet_is_health_check()) Msg is in inconsistent state, don't perform health checking (0, 5) fir-io1-s2: Apr 30 19:51:58 fir-io1-s2 kernel: LNetError: 108541:0:(lib-msg.c:811:lnet_is_health_check()) Msg is in inconsistent state, don't perform health checking (0, 5) fir-io1-s2: Apr 30 19:52:03 fir-io1-s2 kernel: LNetError: 108539:0:(lib-msg.c:811:lnet_is_health_check()) Msg is in inconsistent state, don't perform health checking (0, 5) fir-io1-s2: Apr 30 20:08:25 fir-io1-s2 kernel: LNetError: 108535:0:(lib-msg.c:811:lnet_is_health_check()) Msg is in inconsistent state, don't perform health checking (0, 5) fir-io1-s2: Apr 30 20:09:26 fir-io1-s2 kernel: LNetError: 108545:0:(lib-msg.c:811:lnet_is_health_check()) Msg is in inconsistent state, don't perform health checking (0, 5) fir-io1-s2: Apr 30 20:10:03 fir-io1-s2 kernel: LNetError: 108541:0:(lib-msg.c:811:lnet_is_health_check()) Msg is in inconsistent state, don't perform health checking (0, 5) fir-io1-s2: Apr 30 20:13:34 fir-io1-s2 kernel: LNetError: 108535:0:(lib-msg.c:811:lnet_is_health_check()) Msg is in inconsistent state, don't perform health checking (0, 5) fir-io1-s2: Apr 30 20:17:04 fir-io1-s2 kernel: LNetError: 108533:0:(lib-msg.c:811:lnet_is_health_check()) Msg is in inconsistent state, don't perform health checking (0, 5)

Stephane Thiell added a comment - 29/Apr/19 6:50 PM - edited

I see quite a lot of these msgs this morning, on OSS but also on one of our MDS:

Apr 29 09:29:55 fir-md1-s2 kernel: LNetError: 105174:0:(lib-msg.c:811:lnet_is_health_check()) Msg is in inconsistent state, don't perform health checking (0, 5)
Apr 29 10:01:32 fir-md1-s2 kernel: LNetError: 105171:0:(lib-msg.c:811:lnet_is_health_check()) Msg is in inconsistent state, don't perform health checking (0, 5)
Apr 29 11:35:44 fir-md1-s2 kernel: LNetError: 105174:0:(lib-msg.c:811:lnet_is_health_check()) Msg is in inconsistent state, don't perform health checking (-125, 0)
Apr 29 11:35:59 fir-md1-s2 kernel: LNetError: 105169:0:(lib-msg.c:811:lnet_is_health_check()) Msg is in inconsistent state, don't perform health checking (-125, 0)
Apr 29 11:38:34 fir-md1-s2 kernel: LNetError: 105165:0:(lib-msg.c:811:lnet_is_health_check()) Msg is in inconsistent state, don't perform health checking (0, 5)
Apr 29 11:38:52 fir-md1-s2 kernel: LNetError: 105177:0:(lib-msg.c:811:lnet_is_health_check()) Msg is in inconsistent state, don't perform health checking (-125, 0)
Apr 29 11:40:57 fir-md1-s2 kernel: LNetError: 105169:0:(lib-msg.c:811:lnet_is_health_check()) Msg is in inconsistent state, don't perform health checking (-125, 0)
Apr 29 11:41:43 fir-md1-s2 kernel: LNetError: 105167:0:(lib-msg.c:811:lnet_is_health_check()) Msg is in inconsistent state, don't perform health checking (0, 5)
Apr 29 11:42:53 fir-md1-s2 kernel: LNetError: 105169:0:(lib-msg.c:811:lnet_is_health_check()) Msg is in inconsistent state, don't perform health checking (-125, 0)
Apr 29 11:44:39 fir-md1-s2 kernel: LNetError: 105172:0:(lib-msg.c:811:lnet_is_health_check()) Msg is in inconsistent state, don't perform health checking (0, 5)

When enabling +net logging, I see some of the following messages (which are our Lustre routers):

[root@fir-md1-s2 tmp]# strings dk.1556563279 | grep 'no credits' | tail
00000800:00000200:7.0:1556563272.841468:0:105176:0:(o2iblnd_cb.c:894:kiblnd_post_tx_locked()) 10.0.10.210@o2ib7: no credits
00000800:00000200:21.0:1556563272.841474:0:105168:0:(o2iblnd_cb.c:894:kiblnd_post_tx_locked()) 10.0.10.209@o2ib7: no credits
00000800:00000200:29.0:1556563272.841478:0:105169:0:(o2iblnd_cb.c:894:kiblnd_post_tx_locked()) 10.0.10.209@o2ib7: no credits
00000800:00000200:31.0:1556563272.841481:0:105177:0:(o2iblnd_cb.c:894:kiblnd_post_tx_locked()) 10.0.10.210@o2ib7: no credits
00000800:00000200:13.0:1556563272.841481:0:105170:0:(o2iblnd_cb.c:894:kiblnd_post_tx_locked()) 10.0.10.209@o2ib7: no credits
00000800:00000200:45.0:1556563272.841484:0:105171:0:(o2iblnd_cb.c:894:kiblnd_post_tx_locked()) 10.0.10.209@o2ib7: no credits
00000800:00000200:47.0:1556563272.841486:0:105178:0:(o2iblnd_cb.c:894:kiblnd_post_tx_locked()) 10.0.10.210@o2ib7: no credits
00000800:00000200:6.0:1556563272.841486:0:105173:0:(o2iblnd_cb.c:894:kiblnd_post_tx_locked()) 10.0.10.211@o2ib7: no credits
00000800:00000200:21.0:1556563272.841487:0:105168:0:(o2iblnd_cb.c:894:kiblnd_post_tx_locked()) 10.0.10.209@o2ib7: no credits
00000800:00000200:29.0:1556563272.841491:0:105169:0:(o2iblnd_cb.c:894:kiblnd_post_tx_locked()) 10.0.10.209@o2ib7: no credits

Below, o2ib7 is Fir's IB EDR fabric,
10.0.10.5x@o2ib7 are MDS,
10.0.10.1xx@o2ib7 are OSS
10.0.10.2xx@o2ib7 are routers

[root@fir-md1-s2 tmp]# cat /sys/kernel/debug/lnet/nis
nid                      status alive refs peer  rtr   max    tx   min
0@lo                       down     0    2    0    0     0     0     0
0@lo                       down     0    0    0    0     0     0     0
0@lo                       down     0    0    0    0     0     0     0
0@lo                       down     0    0    0    0     0     0     0
10.0.10.52@o2ib7             up    -1   24    8    0    64    56    39
10.0.10.52@o2ib7             up    -1   22    8    0    64    56    24
10.0.10.52@o2ib7             up    -1   23    8    0    64    56    39
10.0.10.52@o2ib7             up    -1   24    8    0    64    56    40

[root@fir-md1-s2 tmp]# cat /sys/kernel/debug/lnet/peers  | grep o2ib7
10.0.10.202@o2ib7           4    up    -1     8     8     8     8 -2451 0
10.0.10.3@o2ib7             1    NA    -1     8     8     8     8     4 0
10.0.10.105@o2ib7           1    NA    -1     8     8     8     8   -39 0
10.0.10.212@o2ib7           8    up    -1     8     8     8     4 -1376 2520
10.0.10.102@o2ib7           1    NA    -1     8     8     8     8   -32 0
10.0.10.204@o2ib7           4    up    -1     8     8     8     8 -3054 0
10.0.10.107@o2ib7           1    NA    -1     8     8     8     8   -40 0
10.0.10.209@o2ib7           5    up    -1     8     8     8     7 -1386 496
10.0.10.201@o2ib7           4    up    -1     8     8     8     8 -2971 0
10.0.10.104@o2ib7           1    NA    -1     8     8     8     8   -40 0
10.0.10.211@o2ib7           8    up    -1     8     8     8     4 -1168 1520
10.0.10.101@o2ib7           1    NA    -1     8     8     8     8   -30 0
10.0.10.203@o2ib7           4    up    -1     8     8     8     8 -2430 0
10.0.10.106@o2ib7           1    NA    -1     8     8     8     8   -36 0
10.0.10.51@o2ib7            1    NA    -1     8     8     8     8  -437 0
10.0.10.103@o2ib7           1    NA    -1     8     8     8     8   -32 0
10.0.10.108@o2ib7           1    NA    -1     8     8     8     8   -33 0
10.0.10.210@o2ib7           5    up    -1     8     8     8     7 -1166 65608

Perhaps we should increase the peer_credits so that our MDS can handle more load to Sherlock's Lnet routers:

    - net type: o2ib7
      local NI(s):
        - nid: 10.0.10.52@o2ib7
          status: up
          interfaces:
              0: ib0
          statistics:
              send_count: 911564827
              recv_count: 916989206
              drop_count: 16536
          tunables:
              peer_timeout: 180
              peer_credits: 8
              peer_buffer_credits: 0
              credits: 256
              peercredits_hiw: 4
              map_on_demand: 0
              concurrent_sends: 0
              fmr_pool_size: 512
              fmr_flush_trigger: 384
              fmr_cache: 1
              ntx: 512
              conns_per_peer: 1
          lnd tunables:
          dev cpt: 2
          tcp bonding: 0
          CPT: "[0,1,2,3]"

But that means we have to change it on the routers too I assume.

Stephane Thiell added a comment - 29/Apr/19 6:50 PM - edited I see quite a lot of these msgs this morning, on OSS but also on one of our MDS: Apr 29 09:29:55 fir-md1-s2 kernel: LNetError: 105174:0:(lib-msg.c:811:lnet_is_health_check()) Msg is in inconsistent state, don't perform health checking (0, 5) Apr 29 10:01:32 fir-md1-s2 kernel: LNetError: 105171:0:(lib-msg.c:811:lnet_is_health_check()) Msg is in inconsistent state, don't perform health checking (0, 5) Apr 29 11:35:44 fir-md1-s2 kernel: LNetError: 105174:0:(lib-msg.c:811:lnet_is_health_check()) Msg is in inconsistent state, don't perform health checking (-125, 0) Apr 29 11:35:59 fir-md1-s2 kernel: LNetError: 105169:0:(lib-msg.c:811:lnet_is_health_check()) Msg is in inconsistent state, don't perform health checking (-125, 0) Apr 29 11:38:34 fir-md1-s2 kernel: LNetError: 105165:0:(lib-msg.c:811:lnet_is_health_check()) Msg is in inconsistent state, don't perform health checking (0, 5) Apr 29 11:38:52 fir-md1-s2 kernel: LNetError: 105177:0:(lib-msg.c:811:lnet_is_health_check()) Msg is in inconsistent state, don't perform health checking (-125, 0) Apr 29 11:40:57 fir-md1-s2 kernel: LNetError: 105169:0:(lib-msg.c:811:lnet_is_health_check()) Msg is in inconsistent state, don't perform health checking (-125, 0) Apr 29 11:41:43 fir-md1-s2 kernel: LNetError: 105167:0:(lib-msg.c:811:lnet_is_health_check()) Msg is in inconsistent state, don't perform health checking (0, 5) Apr 29 11:42:53 fir-md1-s2 kernel: LNetError: 105169:0:(lib-msg.c:811:lnet_is_health_check()) Msg is in inconsistent state, don't perform health checking (-125, 0) Apr 29 11:44:39 fir-md1-s2 kernel: LNetError: 105172:0:(lib-msg.c:811:lnet_is_health_check()) Msg is in inconsistent state, don't perform health checking (0, 5) When enabling +net logging, I see some of the following messages (which are our Lustre routers): [root@fir-md1-s2 tmp]# strings dk.1556563279 | grep 'no credits' | tail 00000800:00000200:7.0:1556563272.841468:0:105176:0:(o2iblnd_cb.c:894:kiblnd_post_tx_locked()) 10.0.10.210@o2ib7: no credits 00000800:00000200:21.0:1556563272.841474:0:105168:0:(o2iblnd_cb.c:894:kiblnd_post_tx_locked()) 10.0.10.209@o2ib7: no credits 00000800:00000200:29.0:1556563272.841478:0:105169:0:(o2iblnd_cb.c:894:kiblnd_post_tx_locked()) 10.0.10.209@o2ib7: no credits 00000800:00000200:31.0:1556563272.841481:0:105177:0:(o2iblnd_cb.c:894:kiblnd_post_tx_locked()) 10.0.10.210@o2ib7: no credits 00000800:00000200:13.0:1556563272.841481:0:105170:0:(o2iblnd_cb.c:894:kiblnd_post_tx_locked()) 10.0.10.209@o2ib7: no credits 00000800:00000200:45.0:1556563272.841484:0:105171:0:(o2iblnd_cb.c:894:kiblnd_post_tx_locked()) 10.0.10.209@o2ib7: no credits 00000800:00000200:47.0:1556563272.841486:0:105178:0:(o2iblnd_cb.c:894:kiblnd_post_tx_locked()) 10.0.10.210@o2ib7: no credits 00000800:00000200:6.0:1556563272.841486:0:105173:0:(o2iblnd_cb.c:894:kiblnd_post_tx_locked()) 10.0.10.211@o2ib7: no credits 00000800:00000200:21.0:1556563272.841487:0:105168:0:(o2iblnd_cb.c:894:kiblnd_post_tx_locked()) 10.0.10.209@o2ib7: no credits 00000800:00000200:29.0:1556563272.841491:0:105169:0:(o2iblnd_cb.c:894:kiblnd_post_tx_locked()) 10.0.10.209@o2ib7: no credits Below, o2ib7 is Fir's IB EDR fabric, 10.0.10.5x@o2ib7 are MDS, 10.0.10.1xx@o2ib7 are OSS 10.0.10.2xx@o2ib7 are routers [root@fir-md1-s2 tmp]# cat /sys/kernel/debug/lnet/nis nid status alive refs peer rtr max tx min 0@lo down 0 2 0 0 0 0 0 0@lo down 0 0 0 0 0 0 0 0@lo down 0 0 0 0 0 0 0 0@lo down 0 0 0 0 0 0 0 10.0.10.52@o2ib7 up -1 24 8 0 64 56 39 10.0.10.52@o2ib7 up -1 22 8 0 64 56 24 10.0.10.52@o2ib7 up -1 23 8 0 64 56 39 10.0.10.52@o2ib7 up -1 24 8 0 64 56 40 [root@fir-md1-s2 tmp]# cat /sys/kernel/debug/lnet/peers | grep o2ib7 10.0.10.202@o2ib7 4 up -1 8 8 8 8 -2451 0 10.0.10.3@o2ib7 1 NA -1 8 8 8 8 4 0 10.0.10.105@o2ib7 1 NA -1 8 8 8 8 -39 0 10.0.10.212@o2ib7 8 up -1 8 8 8 4 -1376 2520 10.0.10.102@o2ib7 1 NA -1 8 8 8 8 -32 0 10.0.10.204@o2ib7 4 up -1 8 8 8 8 -3054 0 10.0.10.107@o2ib7 1 NA -1 8 8 8 8 -40 0 10.0.10.209@o2ib7 5 up -1 8 8 8 7 -1386 496 10.0.10.201@o2ib7 4 up -1 8 8 8 8 -2971 0 10.0.10.104@o2ib7 1 NA -1 8 8 8 8 -40 0 10.0.10.211@o2ib7 8 up -1 8 8 8 4 -1168 1520 10.0.10.101@o2ib7 1 NA -1 8 8 8 8 -30 0 10.0.10.203@o2ib7 4 up -1 8 8 8 8 -2430 0 10.0.10.106@o2ib7 1 NA -1 8 8 8 8 -36 0 10.0.10.51@o2ib7 1 NA -1 8 8 8 8 -437 0 10.0.10.103@o2ib7 1 NA -1 8 8 8 8 -32 0 10.0.10.108@o2ib7 1 NA -1 8 8 8 8 -33 0 10.0.10.210@o2ib7 5 up -1 8 8 8 7 -1166 65608 Perhaps we should increase the peer_credits so that our MDS can handle more load to Sherlock's Lnet routers: - net type: o2ib7 local NI(s): - nid: 10.0.10.52@o2ib7 status: up interfaces: 0: ib0 statistics: send_count: 911564827 recv_count: 916989206 drop_count: 16536 tunables: peer_timeout: 180 peer_credits: 8 peer_buffer_credits: 0 credits: 256 peercredits_hiw: 4 map_on_demand: 0 concurrent_sends: 0 fmr_pool_size: 512 fmr_flush_trigger: 384 fmr_cache: 1 ntx: 512 conns_per_peer: 1 lnd tunables: dev cpt: 2 tcp bonding: 0 CPT: "[0,1,2,3]" But that means we have to change it on the routers too I assume.

Stephane Thiell added a comment - 23/Apr/19 6:04 PM

Perhaps something is wrong with our lnet config? I attached the output of lnetctl net show -v on:

Fir servers (1 x IB EDR) as lnet-fir-oss.txt
Fir(EDR)-Sherlock1(FDR) lnet routers (EDR-FDR) as lnet-sh-rtr-fir-1.txt
Fir(EDR)-Sherlock2(EDR) lnet routers (EDR-EDR) as lnet-sh-rtr-fir-2.txt
Sherlock1 client (1xFDR) as lnet-sh-1-fdr.txt
Sherlock2 client (1xEDR) as lnet-sh-2-edr.txt

We usually don't tune much lnet and try to stick to the default values.

Stephane Thiell added a comment - 23/Apr/19 6:04 PM Perhaps something is wrong with our lnet config? I attached the output of lnetctl net show -v on: Fir servers (1 x IB EDR) as lnet-fir-oss.txt Fir(EDR)-Sherlock1(FDR) lnet routers (EDR-FDR) as lnet-sh-rtr-fir-1.txt Fir(EDR)-Sherlock2(EDR) lnet routers (EDR-EDR) as lnet-sh-rtr-fir-2.txt Sherlock1 client (1xFDR) as lnet-sh-1-fdr.txt Sherlock2 client (1xEDR) as lnet-sh-2-edr.txt We usually don't tune much lnet and try to stick to the default values.

Stephane Thiell added a comment - 23/Apr/19 5:36 PM - edited

Please note that we're also using the patch from ~~LU-12065~~ (lnd: increase CQ entries) everywhere now (servers, routers, clients). This one was added following some networking issues on OSS that were tracked in LU-12096.

Stephane Thiell added a comment - 23/Apr/19 5:36 PM - edited Please note that we're also using the patch from LU-12065 (lnd: increase CQ entries) everywhere now (servers, routers, clients). This one was added following some networking issues on OSS that were tracked in LU-12096 .

Stephane Thiell added a comment - 23/Apr/19 4:58 PM

Yes, I can see many of them actually. An example is attached in fir-io1-s1-previous_bulk.log, just scroll a bit and you'll see. Then we can also see a few backtraces that NASA is seeing too. Looking at the I/O graph during that time, I see a small drop in I/O at the time of the event, and then it recovers.

Stephane Thiell added a comment - 23/Apr/19 4:58 PM Yes, I can see many of them actually. An example is attached in fir-io1-s1-previous_bulk.log , just scroll a bit and you'll see. Then we can also see a few backtraces that NASA is seeing too. Looking at the I/O graph during that time, I see a small drop in I/O at the time of the event, and then it recovers.

Patrick Farrell (Inactive) added a comment - 23/Apr/19 4:48 PM - edited

That's possible. To clarify an earlier question:
Do you see the "-125" error code associated with those messages in earlier instances?

By the way, if you are rebooting clients, is the dip in activity perhaps from the MDS related hang on client loss we've discussed previously that you recently reminded me is still an issue? (Are you still not unmounting on reboot? )

[Edit:]

Actually, the length of the gap mostly rules that out.

Patrick Farrell (Inactive) added a comment - 23/Apr/19 4:48 PM - edited That's possible. To clarify an earlier question: Do you see the "-125" error code associated with those messages in earlier instances? By the way, if you are rebooting clients, is the dip in activity perhaps from the MDS related hang on client loss we've discussed previously that you recently reminded me is still an issue? (Are you still not unmounting on reboot? ) [Edit:] Actually, the length of the gap mostly rules that out.

People

Assignee:: Patrick Farrell (Inactive)

Reporter:: Stephane Thiell

Votes:: 0 Vote for this issue

Watchers:: 4 Start watching this issue

Dates

Created:: 23/Apr/19 2:40 AM

Updated:: 13/Jul/19 3:44 PM

Resolved:: 12/Jul/19 9:12 PM