[LU-1742] Fix 'Timed out tx' error message Created: 13/Aug/12  Updated: 29/Oct/20  Resolved: 29/Oct/20

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: Lustre 2.14.0

Type: Bug Priority: Trivial
Reporter: Brian Behlendorf Assignee: Cyril Bordage
Resolution: Fixed Votes: 0
Labels: easy, llnl

Issue Links:
Related
is related to LU-13675 LNetError: 14769:0:(o2iblnd.h:1003:ki... Resolved
Severity: 3
Rank (Obsolete): 9757

 Description   

The error message from kiblnd_check_txs_locked() is misleading. The value it reports is the number of seconds by which the deadline was exceeded. What I (and everyone else here) expected before reading the source is that the value would be the total time the tx was outstanding before the RDMA timed out.

LNetError: 3073:0:(o2iblnd_cb.c:2988:kiblnd_check_txs_locked()) Timed out tx: active_txs, 10 seconds
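
To make the difference concrete, here is a minimal sketch of the two quantities, not the o2iblnd code itself. The function and field names are hypothetical, and the 50-second timeout is an assumption inferred from the comment below that equates 60 seconds with 10 plus the default ko2iblnd timeout.

```python
from dataclasses import dataclass


@dataclass
class TxTimes:
    overdue: int      # seconds past the deadline (what the message reports)
    outstanding: int  # total seconds since the tx was queued (what readers expect)


def tx_times(queued_at: int, timeout: int, now: int) -> TxTimes:
    """Compute both values for a tx, with all times in seconds.

    Hypothetical helper for illustration only; the real check lives in
    kiblnd_check_txs_locked().
    """
    deadline = queued_at + timeout
    return TxTimes(overdue=now - deadline, outstanding=now - queued_at)


# Assumed numbers: queued at t=0, 50 s timeout, checked at t=60.
t = tx_times(queued_at=0, timeout=50, now=60)
print(f"past deadline: {t.overdue} s, total outstanding: {t.outstanding} s")
```

Under these assumptions the "10 seconds" in the message corresponds to a tx that has been outstanding for 60 seconds in total.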


 Comments   
Comment by Brian Behlendorf [ 13/Aug/12 ]

http://review.whamcloud.com/3622

Comment by Johann Lombardi (Inactive) [ 13/Aug/12 ]

Hi Brian, did you really intend to file this bug as a severity 1?

Comment by Isaac Huang (Inactive) [ 13/Aug/12 ]

Another problem with this error message is that it doesn't tell us how long the tx has actually been on the wire. For example, the error message above tells us that a tx expired 60 seconds (10 + default ko2iblnd timeout) after it was queued, BUT:

  1. It could have been sitting in the TX queue waiting for a credit for most of the 60 seconds and just barely got onto the active queue, which indicates problems with the flow control protocol. Or:
  2. It could have spent most of the 60 seconds on the wire without completion, which usually indicates problems in the IB fabric.

It'd be very useful to be able to distinguish the two cases.
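
One way to tell the two cases apart is to record a second timestamp when the tx is posted. The following is only a sketch: `queued_at`, `posted_at`, and `classify_tx` are hypothetical names, and the "whichever phase took longer" rule is a simplifying assumption, not what any landed patch does.

```python
from typing import Optional, Tuple


def classify_tx(queued_at: int, posted_at: Optional[int], now: int) -> Tuple[int, int, str]:
    """Split a timed-out tx's lifetime into queue time and wire time (seconds).

    posted_at is None if the tx never got a credit and was never posted.
    Returns (seconds_queued, seconds_on_wire, likely_cause).
    """
    if posted_at is None:
        seconds_queued, seconds_on_wire = now - queued_at, 0
    else:
        seconds_queued, seconds_on_wire = posted_at - queued_at, now - posted_at

    # Assumed heuristic: blame whichever phase consumed more of the lifetime.
    if seconds_queued > seconds_on_wire:
        cause = "flow control (waiting for a credit)"
    else:
        cause = "IB fabric (posted but no completion)"
    return seconds_queued, seconds_on_wire, cause


# Case 1: waited 58 s for a credit, then only 2 s on the wire.
print(classify_tx(queued_at=0, posted_at=58, now=60))
# Case 2: posted after 1 s, then 59 s on the wire without completion.
print(classify_tx(queued_at=0, posted_at=1, now=60))
```

Reporting both durations in the error message would let a reader make this distinction directly from the log.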

Comment by Peter Jones [ 13/Aug/12 ]

Brian B, I have dropped the severity because I assume that LLNL is not down as a result of this issue. Please speak up if I am mistaken.

Isaac, can you please take care of this one?

Comment by Brian Behlendorf [ 13/Aug/12 ]

Sorry, this was accidentally filed as high priority. The fix to update the error message is of course not critical.

However, ORI-735 is a big deal for us since it's currently preventing us from running IOR on Sequoia. So we absolutely need to get to the bottom of why that is happening; we're just starting to investigate in the context of ORI-735. I'll probably update the patch based on Isaac's suggestion so we can get some more visibility into what's actually going wrong.

Comment by Gerrit Updater [ 26/Sep/18 ]

Sonia Sharma (sharmaso@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/33235
Subject: LU-1742 o2iblnd: 'Timed out tx' error message
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 244d556e8784bbfdafcc13937ec73d53ef416b1a

Comment by Gerrit Updater [ 10/Jun/20 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/33235/
Subject: LU-1742 o2iblnd: 'Timed out tx' error message
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 7308662efc02fde077216f54728ecf278f31311b

Patch has been reverted due to LU-13675.

Comment by Peter Jones [ 17/Jun/20 ]

Is this still a live issue for LLNL?

Comment by Gerrit Updater [ 29/Oct/20 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/3622/
Subject: LU-1742 o2iblnd: 'Timed out tx' error message
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 2be289a2b1f12bddce06e9eee65b5581b0b9fb5d

Comment by Peter Jones [ 29/Oct/20 ]

Landed for 2.14
