[LU-1517] no retry for the bulk operation Created: 13/Jun/12  Updated: 22/Feb/13  Resolved: 12/Dec/12

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 1.8.8
Fix Version/s: Lustre 2.4.0, Lustre 2.1.4, Lustre 1.8.9

Type: Bug Priority: Critical
Reporter: Alexander Boyko Assignee: Keith Mannthey (Inactive)
Resolution: Fixed Votes: 0
Labels: patch

Severity: 3
Rank (Obsolete): 3998

 Description   
lnd_cb.c:558:kgnilnd_setup_phys_buffer()) failed to allocate tx_phys
[2012-04-07 02:08:24][c5-0c0s5n2]LNet: 29099:0:(gnilnd_cb.c:1068:kgnilnd_tx_done()) $$ error -12 on tx 0xffff88000fe06b40-><?> id 0/0 state GNILND_TX_ALLOCD age 17481575s  msg@0xffff88000fe06bc0 m/v/ty/ck/pck/pl b00fbabe/8/3/0/78db/0 x0:GNILND_MSG_PUT_REQ
[2012-04-07 02:08:24][c5-0c0s5n2]LustreError: 29099:0:(events.c:198:client_bulk_callback()) event type 0, status -5, desc ffff880627c24000

The error is detected on both the client and the server. The server expects the client to retry, but the client does not: Lustre does not retry the bulk operation when it detects the error. In the meantime, the OSS issues a lock callback to the client, which does not respond because it is still waiting for the I/O to complete. Eventually the OSS evicts the client.
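
The retry behaviour the server expects can be illustrated with a small standalone sketch. This is not the change in review 3102 and does not use any Lustre or LNet API; `struct bulk_sim`, `bulk_attempt()` and `bulk_with_retry()` are hypothetical names invented for illustration. It only assumes that a failed bulk transfer surfaces as a negative errno (such as the -5 seen in client_bulk_callback() above) and that the caller resends instead of continuing to wait.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical stand-in for a bulk transfer; not a Lustre structure. */
struct bulk_sim {
    int failures_left;  /* attempts that will still fail with -EIO */
    int attempts;       /* attempts made so far */
};

/* Simulate one bulk attempt: 0 on success, -EIO on a transport error. */
static int bulk_attempt(struct bulk_sim *sim)
{
    sim->attempts++;
    if (sim->failures_left > 0) {
        sim->failures_left--;
        return -EIO;
    }
    return 0;
}

/*
 * Resend the request whenever the bulk attempt reports an error,
 * giving up after max_attempts tries and returning the last error.
 */
static int bulk_with_retry(struct bulk_sim *sim, int max_attempts)
{
    int rc = -EINVAL;

    assert(max_attempts > 0);
    for (int i = 0; i < max_attempts; i++) {
        rc = bulk_attempt(sim);
        if (rc == 0)
            break;  /* transfer completed, stop retrying */
    }
    return rc;
}
```

In this sketch a single transient failure is absorbed by the second attempt, while a persistent failure is still reported to the caller once the attempt budget is exhausted, so the caller never blocks indefinitely on an I/O that has already failed.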



 Comments   
Comment by Alexander Boyko [ 13/Jun/12 ]

Review request http://review.whamcloud.com/3102

Comment by Colin Faber [X] (Inactive) [ 14/Aug/12 ]

Hi,

Have WC/Intel had time to review our proposed patch yet?

-cf

Comment by Cory Spitz [ 14/Aug/12 ]

Cray has been using this patch and it is effective.

Comment by Keith Mannthey (Inactive) [ 02/Oct/12 ]

The patch has been reviewed and is awaiting final merge.

Comment by Keith Mannthey (Inactive) [ 04/Oct/12 ]

I just wanted to say that 1.8 is in more of a maintenance mode at this point. The patch looks fine but very few things are landing in 1.8 right now.

Comment by Cory Spitz [ 04/Oct/12 ]

Understood. Triggering this bug results in client eviction, though. Just FYI.

Comment by Johann Lombardi (Inactive) [ 10/Oct/12 ]

I'm fine with landing the patch on b1_8. That said, it seems that master suffers from the same issue, right?
If so, it would be great to push a patch against master.
Thanks in advance!

Comment by Cory Spitz [ 10/Oct/12 ]

Johann, for master, have you seen LU-901 and change #4092? Or does it only partially address the fault?

Comment by Cory Spitz [ 18/Oct/12 ]

Yes, it is curious that we landed this fix to b1_8 before master. FYI, Cray has been using this fix on our 2.2 for some time and testing has gone well. We should push it to master now while we wait for LU-901.

Comment by Johann Lombardi (Inactive) [ 18/Oct/12 ]

Sure, let's push review 3102 to all branches first (b1_8, b2_1, b2_2, b2_3 and master). Then more intrusive changes (like 4092) can be considered on master.

Comment by Keith Mannthey (Inactive) [ 18/Oct/12 ]

The patch ported easily as expected so I sent it to the various branches.

http://review.whamcloud.com/4296 <- b2_1
http://review.whamcloud.com/4297 <- b2_2
http://review.whamcloud.com/4298 <- b2_3
http://review.whamcloud.com/4299 <- Master

Comment by Cory Spitz [ 21/Nov/12 ]

I think that we should at least land the master patch.

Comment by Keith Mannthey (Inactive) [ 12/Dec/12 ]

OK, it seems the patches for 1.8, 2.1 and master have been merged. 2.2 and 2.3 are dead branches at this point.

I think this issue is safe to close. Please reopen if you disagree.
