[LU-519] (o2iblnd_cb.c:980:kiblnd_tx_complete()) ASSERTION(tx->tx_sending > 0) failed Created: 20/Jul/11  Updated: 25/May/16  Resolved: 10/May/16

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: Lustre 2.8.0

Type: Bug Priority: Minor
Reporter: Prakash Surya (Inactive) Assignee: Doug Oucharek (Inactive)
Resolution: Fixed Votes: 0
Labels: llnl
Environment:

lustre-modules-1.8.5.0-5chaos_2.6.18_107chaos.ch4.4
lustre-1.8.5.0-5chaos_2.6.18_107chaos.ch4.4
lustre-tools-llnl-1.2-6.ch4.4
chaos 4.4-3


Attachments: File print_failed_tx.diff    
Issue Links:
Related
Severity: 3
Bugzilla ID: 22723
Rank (Obsolete): 9755

 Description   

This is the same bug that was discussed on bugzilla ticket 22723 found here: https://bugzilla.lustre.org/show_bug.cgi?id=22723

We have run into this issue in the past, and we recently ran into the issue again on one of our machines running chaos 4.4-3, with lustre 1.8.5 installed.

This time we had a patch from Liang (from the above-mentioned bugzilla ticket) installed to print some more debugging information to the console:

18632:0:(o2iblnd_cb.c:980:kiblnd_tx_complete()) ASSERTION(tx->tx_sending > 0)
failed: TX: ffffc200003da540, type: IMMEDIATE, magic: deadbeef, sending: 0,
waiting: 0, queued: 0, cookie: 9361598, comps: 1, status: 0

And here is some more console output from the crash this time around:

2011-07-15 19:44:40 LustreError:
18632:0:(o2iblnd_cb.c:980:kiblnd_tx_complete()) ASSERTION(tx->tx_sending > 0)
failed: TX: ffffc200003da540, type: IMMEDIATE, magic: deadbeef, sending: 0,
waiting: 0, queued: 0, cookie: 9361598, comps: 1, status: 0
2011-07-15 19:44:40 ib_mthca 0000:07:00.0: SQ 000413 full (292641 head, 292873
tail, 4096 max, 0 nreq)
2011-07-15 19:44:40 LustreError:
19245:0:(o2iblnd_cb.c:886:kiblnd_post_tx_locked()) Error -12 posting transmit
to 192.168.123.110@o2ib1
2011-07-15 19:44:40 Lustre: lsa-OST0036-osc-ffff8100cc104000: Request obd_ping
sent 0s ago to 172.16.68.23@tcp has failed due to network error (limit 105s)
2011-07-15 19:44:40 Lustre: Skipped 1 previous similar message
2011-07-15 19:44:40 Lustre: lsa-OST0036-osc-ffff8100cc104000: Connection to
lsa-OST0036 (at 172.16.68.23@tcp) was lost; in progress operations using this
service will wait for recovery to complete
2011-07-15 19:44:40 LustreError:
18632:0:(o2iblnd_cb.c:980:kiblnd_tx_complete()) LBUG
2011-07-15 19:44:40 Pid: 18632, comm: kiblnd_sd_03
2011-07-15 19:44:40
2011-07-15 19:44:40 Call Trace:
2011-07-15 19:44:40  [<ffffffff885d478f>] libcfs_debug_dumpstack+0x5f/0x80
[libcfs]
2011-07-15 19:44:40  [<ffffffff885d4cbf>] lbug_with_loc+0x7f/0xd0 [libcfs]
2011-07-15 19:44:40  [<ffffffff88657965>] kiblnd_tx_complete+0x155/0x460
[ko2iblnd]
2011-07-15 19:44:40  [<ffffffff80091f8c>] __wake_up_common+0x3e/0x68
2011-07-15 19:44:40  [<ffffffff88658f1c>] kiblnd_complete+0xbc/0xe0 [ko2iblnd]
2011-07-15 19:44:40  [<ffffffff8865eeee>] kiblnd_scheduler+0x50e/0x6b0
[ko2iblnd]
2011-07-15 19:44:40  [<ffffffff80093b5a>] default_wake_function+0x0/0xf
2011-07-15 19:44:40  [<ffffffff8006101d>] child_rip+0xa/0x11
2011-07-15 19:44:40  [<ffffffff80061013>] child_rip+0x0/0x11
2011-07-15 19:44:40  [<ffffffff8865e9e0>] kiblnd_scheduler+0x0/0x6b0 [ko2iblnd]
2011-07-15 19:44:40  [<ffffffff80061013>] child_rip+0x0/0x11
2011-07-15 19:44:40
2011-07-15 19:44:40 ib_mthca 00Linux version 2.6.18-107chaos
(mockbuild@chaos4-builder1) (gcc version 4.1.2 20080704 (Red Hat 4.1.2-48)) #1
SMP Thu Jun 23 14:36:14 PDT 2011
2011-07-15 19:44:41 Command line: initrd=initrd console=ttyS0,115200n8
elevator=deadline swiotlb=65536 selinux=0 BOOT_IMAGE=vmlinuz
BOOTIF=01-00-30-48-57-9b-24 irqpoll maxcpus=1 reset_devices  memmap=exactmap
memmap=640K@0K memmap=5312K@16384K memmap=125104K@22336K elfcorehdr=147440K
memmap=56K#3391360K memmap=69K#3391416K memmap=4K$3391484K memmap=4K$4173824K
memmap=1024K$4175872K memmap=9216K$4185088K
2011-07-15 19:44:41 BIOS-provided physical RAM map:



 Comments   
Comment by Prakash Surya (Inactive) [ 20/Jul/11 ]

Sorry, I forgot to put the formatting markup around the console output. Am I able to edit the description?

Comment by Peter Jones [ 20/Jul/11 ]

Prakash, if you can't I can! Let me know what changes you need made.

Liang, it would be great if you could make an initial assessment of this output but please let me know if you then need this issue to be reassigned to another engineer

Thanks

Peter

Comment by Prakash Surya (Inactive) [ 21/Jul/11 ]

Just for readability purposes, I usually put the copy/pasted console output in 'noformat' tags. If you could add these I think it would definitely help people decipher the bug. Let me know if it isn't clear where the tags should go.

Thanks Peter!

Comment by Liang Zhen (Inactive) [ 22/Jul/11 ]

Thanks, it does prove that we got an unexpected completion event after ib_post_send() failed (we shouldn't get any event in that case), so the TX has been finalized twice... I will need to look into the OFA source.

Comment by Liang Zhen (Inactive) [ 22/Jul/11 ]

One interesting thing is:

2011-07-15 19:44:40 ib_mthca 0000:07:00.0: SQ 000413 full (292641 head, 292873
tail, 4096 max, 0 nreq)

We can see head < tail; how can this happen? In my understanding we should always see head >= tail, or am I wrong?

Liang

Comment by Peter Jones [ 26/Jul/11 ]

Prakash

Finally got to fixing the formatting in the initial comment - sorry for the delay

Peter

Comment by Prakash Surya (Inactive) [ 26/Jul/11 ]

Thanks Peter!

Comment by Peter Jones [ 27/Jul/11 ]

Lai

Could you please dig into this one further? Liang is available to help if need be.

Thanks

Peter

Comment by Lai Siyao [ 02/Aug/11 ]

Liang,

Could you give me a hint as to which part of the OFA source code I should look into to verify that a TX may be finalized twice? The HCA driver (e.g. mthca) or the core? Thanks!

Comment by Peter Jones [ 08/Aug/11 ]

Liang

Ah, I see that you were not added as a watcher to this ticket so you missed this question

"Liang,

Could you give me a hint as to which part of the OFA source code I should look into to verify that a TX may be finalized twice? The HCA driver (e.g. mthca) or the core? Thanks!"

Comment by Liang Zhen (Inactive) [ 08/Aug/11 ]

Lai, I think it's mthca, because we got this error message:
ib_mthca 0000:07:00.0: SQ 000413 full (292641 head, 292873 tail, 4096 max, 0 nreq)
We probably need to look into the post_send implementation in drivers/infiniband/hw/mthca/mthca_qp.c.

The LASSERT gives us information like this:
(o2iblnd_cb.c:980:kiblnd_tx_complete()) ASSERTION(tx->tx_sending > 0)
failed: TX: ffffc200003da540, type: IMMEDIATE, magic: deadbeef, sending: 0,
waiting: 0, queued: 0, cookie: 9361598, comps: 1, status: 0

This information is printed by the following patch:
https://bugzilla.lustre.org/attachment.cgi?id=31074

The magic number is 0xdeadbeef and the completion counter is 1, which means the TX has already been finalized.
I suspect kiblnd_post_tx_locked()->ib_post_send() failed for some reason, so the TX was finalized in kiblnd_post_tx_locked() and we shouldn't see any completion event for this TX anymore; but for some unknown reason we did see an unexpected event and called kiblnd_tx_complete() for the same TX again.
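
The double-finalize detection idea from the bugzilla debug patch can be sketched as follows. This is a userspace mock with hypothetical names (struct tx_dbg, TX_MAGIC_LIVE), not the actual patch: the TX is poisoned with a magic value when finalized and completions are counted, so a second completion for the same TX is caught with full state rather than a bare assertion.

```c
#include <stdint.h>

#define TX_MAGIC_LIVE 0xcafebabeu
#define TX_MAGIC_DEAD 0xdeadbeefu   /* "magic: deadbeef" in the log */

struct tx_dbg {
    uint32_t magic;   /* LIVE while in flight, DEAD once finalized */
    int      comps;   /* "comps" completion counter in the log     */
};

/* Finalize the TX exactly once; poison it so reuse is detectable. */
static void tx_finalize(struct tx_dbg *tx)
{
    tx->magic = TX_MAGIC_DEAD;
}

/* Completion handler: returns 0 normally, -1 when the completion
   arrives for an already-finalized TX (the state the LASSERT saw). */
static int tx_complete(struct tx_dbg *tx)
{
    tx->comps++;
    if (tx->magic == TX_MAGIC_DEAD)
        return -1;        /* duplicate completion detected */
    tx_finalize(tx);
    return 0;
}
```

In the real crash the first finalize happened on the post path, so the poisoned magic plus comps: 1 is exactly the signature the assertion message reported.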

Comment by Lai Siyao [ 05/Dec/11 ]

The ib_mthca "SQ full" message looks similar to the problem in http://linux.derkeiler.com/Mailing-Lists/Kernel/2006-11/msg00108.html . Prakash, could you check whether that patch is included in your kernel?

Liang, the asserted TX's tx_comps: 1 means the completion has already been handled, and this is the second time it is called. Therefore, I'm not sure whether this tx was finalized in kiblnd_post_tx_locked() or in kiblnd_tx_complete().

I made a debug patch to print address of failed tx in kiblnd_post_tx_locked(), Prakash, could you merge this patch (along with Liang's) and reproduce this issue?

Comment by Lai Siyao [ 05/Dec/11 ]

print address of failed tx in kiblnd_post_tx_locked().

Comment by Prakash Surya (Inactive) [ 19/Jan/12 ]

We recently hit this again. Here's the report from our admin:

fyi: it seems to have happened again on atlas825 at a newer version of lustre
lustre-modules-1.8.5.0-6chaos_2.6.18_108chaos.ch4.4
lustre-1.8.5.0-6chaos_2.6.18_108chaos.ch4.4
lustre-tools-llnl-1.3-1.ch4.4
No infiniband errors reported for that node at that time.
I'll leave it for a couple of days, but it appears hung

<ConMan> Console [atlas825] log at 2012-01-19 13:00:00 PST.

<ConMan> Console [atlas825] log at 2012-01-19 14:00:00 PST.
2012-01-19 14:40:35 LustreError: 18699:0:(o2iblnd_cb.c:980:kiblnd_tx_complete()) ASSERTION(tx->tx_sending > 0) failed:
TX: ffffc200003d7e30, type: IMMEDIATE, magic: deadbeef, sending: 0, waiting: 0, queued: 0, cookie: 263119177, comps: 1,
status: 0
2012-01-19 14:40:35 LustreError: 18699:0:(o2iblnd_cb.c:980:kiblnd_tx_complete()) LBUG
2012-01-19 14:40:35 Pid: 18699, comm: kiblnd_sd_02
2012-01-19 14:40:35
2012-01-19 14:40:35 Call Trace:
2012-01-19 14:40:35  [<ffffffff885e078f>] libcfs_debug_dumpstack+0x5f/0x80 [libcfs]
2012-01-19 14:40:35  [<ffffffff885e0cbf>] lbug_with_loc+0x7f/0xd0 [libcfs]
2012-01-19 14:40:35  [<ffffffff88663965>] kiblnd_tx_complete+0x155/0x460 [ko2iblnd]
2012-01-19 14:40:35  [<ffffffff80091f8c>] __wake_up_common+0x3e/0x68
2012-01-19 14:40:35  [<ffffffff88664f1c>] kiblnd_complete+0xbc/0xe0 [ko2iblnd]
2012-01-19 14:40:35  [<ffffffff8866aeee>] kiblnd_scheduler+0x50e/0x6b0 [ko2iblnd]
2012-01-19 14:40:35  [<ffffffff80093b5a>] default_wake_function+0x0/0xf
2012-01-19 14:40:35  [<ffffffff8006101d>] child_rip+0xa/0x11
2012-01-19 14:40:35  [<ffffffff80061013>] child_rip+0x0/0x11
2012-01-19 14:40:35  [<ffffffff8866a9e0>] kiblnd_scheduler+0x0/0x6b0 [ko2iblnd]
2012-01-19 14:40:35  [<ffffffff80061013>] child_rip+0x0/0x11
2012-01-19 14:40:35
2012-01-19 14:40:35 Jan 19 14:40:35 Kernel panic - not syncing: LBUG
2012-01-19 14:40:35 atlas825 LustreE rr
<ConMan> Console [atlas825] log at 2012-01-19 15:00:00 PST.

<ConMan> Console [atlas825] joined by <root@localhost> on pts/159 at 01-19 15:49.

<ConMan> Console [atlas825] departed by <root@localhost> on pts/159 at 01-19 15:49.
Comment by Liang Zhen (Inactive) [ 20/Jan/12 ]

Prakash, would it be possible to post your o2iblnd source code here so I can check through it?

Comment by Prakash Surya (Inactive) [ 23/Jan/12 ]

Liang, our Lustre tree (with the o2iblnd source) can be found on our CHAOS github account: https://github.com/chaos/lustre

Comment by Doug Oucharek (Inactive) [ 15/Jun/12 ]

I came across some references on the web indicating that some IB drivers may do a "partial send" when returning ENOMEM from ib_post_send() (see http://www.mail-archive.com/linux-rdma@vger.kernel.org/msg01390.html). I noticed that the first console log posted in this Jira shows a -12 (-ENOMEM) being reported from kiblnd_post_tx_locked(). Here is what I suspect happened:

1- In kiblnd_post_tx_locked(), while holding the ibc_lock, tx_sending is incremented.
2- Then ib_post_send() is called.
3- ib_post_send() experiences a memory issue but does a partial send. -ENOMEM is returned.
4- Because of the non-zero return code, tx_sending is decremented (becomes zero) and then the ibc_lock is released.
5- The partial send triggers a complete callback which executes before the -ENOMEM error message gets printed. That callback triggers the ASSERT.

There seem to be two bugs here: 1- the IB driver experienced a memory issue, and 2- o2iblnd asserts on a situation which can actually happen (a partial send on error).
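
The five-step sequence above can be simulated in userspace. This is a sketch with mock types (struct tx_sim and friends are hypothetical, not driver code): tx_sending goes 0 to 1 and back to 0 on the failed post, then the unexpected completion from the partial send finds exactly the state that trips ASSERTION(tx->tx_sending > 0).

```c
struct tx_sim {
    int sending;     /* like tx_sending                 */
    int finalized;   /* times the TX has been torn down */
};

/* Steps 1-4: post path under ibc_lock; rc is the simulated
   ib_post_send() return value (-12 == -ENOMEM in the console log). */
static int post_tx_locked_sim(struct tx_sim *tx, int rc)
{
    tx->sending++;           /* 1: incremented before posting       */
    if (rc != 0) {           /* 3: ib_post_send() fails             */
        tx->sending--;       /* 4: back to 0; TX finalized on error */
        tx->finalized++;
    }
    return rc;
}

/* Step 5: completion callback from the partial send. Returns -1 when
   it finds the state that would trigger the LBUG. */
static int tx_complete_sim(struct tx_sim *tx)
{
    if (tx->sending <= 0)
        return -1;           /* ASSERTION(tx->tx_sending > 0) fails */
    tx->sending--;
    tx->finalized++;
    return 0;
}
```

Note that if the completion were allowed to proceed here, the TX would be finalized a second time, which is the double-finalize Liang suspected.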

Comment by Doug Oucharek (Inactive) [ 19/Jun/12 ]

I have submitted a patch to Gerrit to cover the case where Lustre is asserting in a situation which, apparently, can happen. The Gerrit URL is: http://review.whamcloud.com/#change,3148

Comment by Liang Zhen (Inactive) [ 19/Jun/12 ]

Doug, I think we only set IB_SEND_SIGNALED on the last ib_send_wr (see kiblnd_init_tx_msg), so I'm not sure whether the callback will still be called for a partial send without IB_SEND_SIGNALED.

Comment by Doug Oucharek (Inactive) [ 20/Jun/12 ]

In this discussion thread: http://comments.gmane.org/gmane.linux.drivers.rdma/6119 , Roland Dreier indicates that the following is in the IB spec: "If a QP transitions to error state, then all work requests, whether or not they were signaled, generate a completion with status 'flush'."

This indicates to me that it is possible to get a completion on a partial send even if only the last wr is set with IB_SEND_SIGNALED.
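
The behaviour Roland describes can be sketched with mock types (all names here are illustrative, not verbs API code): in normal operation only signaled WRs generate completions, but once the QP enters the error state every outstanding WR produces a flush completion, signaled or not.

```c
enum wc_status_sim { WC_OK_SIM, WC_FLUSH_ERR_SIM };

struct wr_sim { int signaled; };

/* Normal operation: one completion per signaled WR only. */
static int comps_normal(const struct wr_sim *wrs, int n)
{
    int c = 0;
    for (int i = 0; i < n; i++)
        if (wrs[i].signaled)
            c++;
    return c;
}

/* QP in error state: a flush completion for every outstanding WR,
   regardless of the IB_SEND_SIGNALED flag. */
static int comps_qp_error(int n_outstanding, enum wc_status_sim *status)
{
    *status = WC_FLUSH_ERR_SIM;
    return n_outstanding;
}
```

So a TX whose three WRs have only the last one signaled, as in kiblnd_init_tx_msg, would still see completions for all three if the partial send pushed the QP into the error state.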

Comment by Liang Zhen (Inactive) [ 25/Jun/12 ]

Does it mean that ib_post_send() only and always generates a completion event when it returns -ENOMEM? What if it returns another error code; will there still be a completion event?

Comment by Doug Oucharek (Inactive) [ 25/Jun/12 ]

That's a good question, Liang. I'm not familiar with the IB driver(s), so I'm not sure. Can we assume that when tx_status is non-zero, we should always fail in the completion? From this Jira report, I know that a tx_status of ENOMEM should trigger a failure; I don't know about the others.

Is there somewhere to see all the possible tx_status values which can come from the IB driver?

Comment by Gerrit Updater [ 17/Nov/14 ]

Liang Zhen (liang.zhen@intel.com) uploaded a new patch: http://review.whamcloud.com/12747
Subject: LU-519 o2iblnd: check wr_id returned by ib_poll_cq
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 524f19e949f38670483968fe0854074be31bc4ff

Comment by Liang Zhen (Inactive) [ 17/Nov/14 ]

I suspect this is a driver bug which results in ib_poll_cq() returning a positive count but an uninitialised wc::wr_id. If this indeed happened, kiblnd_scheduler() would dereference a stale pointer on the stack (see LU-5271) and run into an unpredictable situation. This patch may help to confirm it:

http://review.whamcloud.com/12747
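
The defensive idea behind the patch can be sketched like this. This is a hypothetical userspace mock, not the actual change from review 12747, and the tag encoding in the low bits is illustrative, not o2iblnd's real one: validate wc.wr_id before treating it as an event cookie, rather than trusting a driver that may return a positive poll count with an uninitialised wr_id.

```c
#include <stdint.h>

#define EV_TX_SIM 0x1u
#define EV_RX_SIM 0x2u

struct wc_sim { uint64_t wr_id; };

static int wr_id_valid(uint64_t wr_id)
{
    uint64_t type = wr_id & 0x3;
    /* A stale/uninitialised wr_id is unlikely to carry a known tag. */
    return type == EV_TX_SIM || type == EV_RX_SIM;
}

/* Poll-loop sketch: refuse to dispatch a completion whose wr_id does
   not validate, instead of dereferencing a stale pointer. */
static int handle_completion_sim(const struct wc_sim *wc)
{
    if (!wr_id_valid(wc->wr_id))
        return -1;    /* bad event: close the connection and log */
    return 0;         /* dispatch to the TX/RX completion handler */
}
```

This matches the merged patch's outcome as Doug later describes it: the connection in a bad state is closed and the details are logged, rather than hitting the assertion.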

Comment by Gerrit Updater [ 12/Jan/16 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/12747/
Subject: LU-519 o2iblnd: check wr_id returned by ib_poll_cq
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: f571d0fdb78516b615574cfcd903436eba975fc5

Comment by Doug Oucharek (Inactive) [ 10/May/16 ]

The merged patch will close the connection which gets into a bad state and log the details, so we should no longer see the assert. If the new log message shows up, a new ticket should be opened with the details so we can address it.

Generated at Sat Feb 10 01:07:50 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.