[LU-17] LBUG in ksocklnd Created: 22/Nov/10  Updated: 03/Feb/11  Resolved: 03/Feb/11

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Bug
Priority: Minor
Reporter: Kit Westneat (Inactive)
Assignee: Liang Zhen (Inactive)
Resolution: Fixed
Votes: 0
Labels: None

Severity: 3
Bugzilla ID: 24218
Rank (Obsolete): 10093

 Description   

While testing iozone, a client hit this LBUG:

Nov 18 15:43:53 psana0107 kernel: LustreError: 28278:0:(socklnd_cb.c:550:ksocknal_process_transmit()) ASSERTION(rc < 0) failed
Nov 18 15:43:53 psana0107 kernel: LustreError: 28278:0:(socklnd_cb.c:550:ksocknal_process_transmit()) LBUG
Nov 18 15:43:53 psana0107 kernel: Pid: 28278, comm: socknal_sd01
Nov 18 15:43:53 psana0107 kernel:
Nov 18 15:43:53 psana0107 kernel: Call Trace:
Nov 18 15:43:53 psana0107 kernel: [<ffffffff884b56a1>] libcfs_debug_dumpstack+0x51/0x60 [libcfs]
Nov 18 15:43:53 psana0107 kernel: [<ffffffff884b5bda>] lbug_with_loc+0x7a/0xd0 [libcfs]
Nov 18 15:43:53 psana0107 kernel: [<ffffffff884bdf40>] tracefile_init+0x0/0x110 [libcfs]
Nov 18 15:43:53 psana0107 kernel: [<ffffffff886a582a>] ksocknal_process_transmit+0x33a/0x640 [ksocklnd]
Nov 18 15:43:53 psana0107 kernel: [<ffffffff886a75cb>] ksocknal_scheduler+0x38b/0x640 [ksocklnd]
Nov 18 15:43:53 psana0107 kernel: [<ffffffff800a0abe>] autoremove_wake_function+0x0/0x2e
Nov 18 15:43:53 psana0107 kernel: [<ffffffff8005dfb1>] child_rip+0xa/0x11
Nov 18 15:43:53 psana0107 kernel: [<ffffffff886a7240>] ksocknal_scheduler+0x0/0x640 [ksocklnd]
Nov 18 15:43:53 psana0107 kernel: [<ffffffff8005dfa7>] child_rip+0x0/0x11

Liang Zhen originally posted a patch for this in LU-15:
http://review.whamcloud.com/#change,127

This bug is to track work on that.



 Comments   
Comment by Dan Ferber (Inactive) [ 22/Nov/10 ]

Thanks Kit. I assigned this to Liang as he is working on it.

Comment by Robert Read (Inactive) [ 22/Nov/10 ]

Liang, please post the patch to a new bugzilla bug and request an inspection from Isaac.

Comment by Dan Ferber (Inactive) [ 22/Nov/10 ]

Cross-referencing Liang's comment from LU-15, which Kit opened this bug (LU-17) to track:

Liang Zhen added a comment - 18/Nov/10 8:06 PM
I've posted a patch at http://review.whamcloud.com/#change,127
(the description of the patch should be "fix contention on ksock_tx_t")

Description of the problem:

If the connection is closed before ksocknal_transmit() returns to ksocknal_process_transmit(), then nobody holds a refcount on conn::ksnc_sock, and all pending ZC requests are finalized via ksocknal_connsock_decref->ksocknal_finalize_zcreq.
ksocknal_finalize_zcreq marks each not-yet-acked ZC request as failed by setting tx::tx_resid = -1.
This is a race, because ksocknal_process_transmit() checks tx::tx_resid right after calling ksocknal_transmit(); it can see
both tx->tx_resid != 0 and rc == 0, and then hit the later LASSERT(rc < 0).
I've added Jay and Lai as reviewers. I will also file a bug on bugzilla and try to push it into mainstream.
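
The race described above can be modeled in a few lines of standalone C. This is only a sketch, not the ksocklnd code and not necessarily what the patch does: every `sketch_*` name is hypothetical, the concurrent close is simulated by a callback that runs just before the transmit returns, and the separate `tx_zc_aborted` flag is one assumed way to avoid overloading tx_resid.

```c
/* Hypothetical, simplified stand-in for ksock_tx_t (not the real layout). */
struct sketch_tx {
    int tx_resid;      /* bytes still to send; the racy path overloads -1 as "failed" */
    int tx_zc_aborted; /* assumed separate flag, used only by the safer variant */
};

/* Models the close path as described in the report: it marks a
 * not-yet-acked ZC request as failed by overwriting tx_resid. */
static void sketch_finalize_overload(struct sketch_tx *tx)
{
    tx->tx_resid = -1;
}

/* Assumed alternative: record the abort in its own field and leave
 * tx_resid for the sender to interpret. */
static void sketch_finalize_flag(struct sketch_tx *tx)
{
    tx->tx_zc_aborted = 1;
}

/* Models a transmit that sends everything and returns rc == 0. The
 * racing_close callback stands in for the connection being closed
 * before the caller gets to inspect tx. */
static int sketch_transmit(struct sketch_tx *tx,
                           void (*racing_close)(struct sketch_tx *))
{
    tx->tx_resid = 0;
    if (racing_close)
        racing_close(tx);
    return 0;
}

/* Mirrors the check order described in the report. Returns 1 when the
 * sender would reach the "rc < 0" assertion with rc == 0, else 0. */
static int sketch_would_lbug(const struct sketch_tx *tx, int rc)
{
    if (tx->tx_resid == 0)
        return 0;      /* everything was sent */
    return !(rc < 0);  /* the error path expects a negative rc */
}
```

Calling sketch_transmit() with sketch_finalize_overload leaves tx_resid at -1 while rc is 0, which is the combination that trips the assertion. With sketch_finalize_flag the sender still sees tx_resid == 0 and takes the normal completion path.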

However, this bug is not the cause of the bad performance, so we still need to investigate the performance issue.

Regards
Liang

Comment by Liang Zhen (Inactive) [ 22/Nov/10 ]

I've filed a bug on BZ and will try to push it into mainstream:
https://bugzilla.lustre.org/show_bug.cgi?id=24218

Regards
Liang

Comment by Liang Zhen (Inactive) [ 16/Dec/10 ]

Patch landed on 2.x; still pending testing for 1.8.

Comment by Liang Zhen (Inactive) [ 03/Feb/11 ]

Patch landed on both 1.8.* and 2.*; marking it as resolved.
