[LU-5266] LBUG on Failover - ldlm_process_extent_lock() ASSERTION( lock->l_granted_mode != lock->l_req_mode ) Created: 27/Jun/14  Updated: 22/Oct/14  Resolved: 11/Jul/14

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.6.0
Fix Version/s: Lustre 2.6.0, Lustre 2.5.4

Type: Bug Priority: Blocker
Reporter: Cliff White (Inactive) Assignee: Hongchao Zhang
Resolution: Fixed Votes: 0
Labels: HB
Environment:

Hyperion - 2.5.60 build 2538


Attachments: iws19.crash.txt, iws23.dmesg, iws29.lustre-log.1405540543.8118.txt, iws29.lustre-log.1405540562.8181.txt, iws29.messages.txt, lustre-log.1404916388.10827.txt, lustre-log.1404916402.10701.txt, lustre-log.1404916421.10764.txt
Issue Links:
Related
is related to LU-2827 mdt_intent_fixup_resent() cannot find... Resolved
is related to LU-5496 fix for LU-5266 Resolved
Severity: 3
Rank (Obsolete): 14694

 Description   

After a hard failover of devices to server iws19, the server wedged and then hit the LBUG.
Services never complete recovery and appear to restart the recovery timer:

Jun 27 10:08:53 iws19 kernel: Lustre: lustre-OST000c: Will be in recovery for at least 2:30, or until 316 clients reconnect
Jun 27 10:11:53 iws19 kernel: Lustre: lustre-OST000c: recovery is timed out, evict stale exports
Jun 27 10:18:30 iws19 kernel: Lustre: lustre-OST000c: Client c52d4856-d1df-b87b-911c-f1bfbc23a24d (at 192.168.124.182@o2ib) reconnecting, waiting for 316 clients in recovery for 2:27

The server reported being CPU-bound prior to the failure.
The console log is attached; unfortunately, the dump after the LBUG failed.



 Comments   
Comment by Oleg Drokin [ 30/Jun/14 ]

We need the backtrace for the crash please.

Comment by Andreas Dilger [ 30/Jun/14 ]

Cliff,
Did you try to reboot the server again and/or try more failovers after this one?

Comment by Cliff White (Inactive) [ 30/Jun/14 ]

Yes, we had multiple failures

Comment by Jodi Levi (Inactive) [ 30/Jun/14 ]

HongChao,
could you please look into this one?
Thank you!

Comment by Vitaly Fertman [ 30/Jun/14 ]

I am not sure whether this failure is the same as the one already fixed, but since it is caught by the same assertion, rather than creating another ticket I am posting the patch here: http://review.whamcloud.com/10903

Comment by Hongchao Zhang [ 02/Jul/14 ]

Yes, the issue could be triggered by a resent lock request.
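
For illustration, a minimal userspace model of that failure mode, assuming the resent request is re-enqueued even though the original request was already granted (demo_lock, demo_grant and demo_process_extent_lock below are simplified stand-ins, not the Lustre LDLM code):

#include <assert.h>
#include <stdio.h>

/* Illustrative lock modes and lock structure; simplified stand-ins,
 * not the real Lustre LDLM definitions. */
enum demo_mode { DEMO_MINMODE = 0, DEMO_PW = 2 };

struct demo_lock {
	enum demo_mode l_granted_mode;	/* mode already granted to the client */
	enum demo_mode l_req_mode;	/* mode requested by the client */
};

/* Simplified enqueue path: the server grants the requested mode. */
static void demo_grant(struct demo_lock *lock)
{
	lock->l_granted_mode = lock->l_req_mode;
}

/* Simplified extent-lock processing, mirroring the assertion that fired:
 * a lock entering this path is expected to be not yet granted. */
static void demo_process_extent_lock(struct demo_lock *lock)
{
	assert(lock->l_granted_mode != lock->l_req_mode);
	demo_grant(lock);
}

int main(void)
{
	struct demo_lock lock = { .l_granted_mode = DEMO_MINMODE,
				  .l_req_mode = DEMO_PW };

	/* Original request: processed and granted normally. */
	demo_process_extent_lock(&lock);

	/* During recovery the client resends the same request; if the server
	 * fails to recognize it as already granted and re-enqueues it, the
	 * assertion trips, analogous to the LBUG seen here. */
	demo_process_extent_lock(&lock);

	printf("not reached\n");
	return 0;
}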

Comment by Andreas Dilger [ 04/Jul/14 ]

Was this introduced by http://review.whamcloud.com/5978?

Comment by Cliff White (Inactive) [ 09/Jul/14 ]

I tested the patch on Hyperion: no more LBUG, but there were a few evictions. Lustre logs and console logs from one run are attached.

Comment by Peter Jones [ 11/Jul/14 ]

Landed for 2.6

Comment by Cliff White (Inactive) [ 16/Jul/14 ]

While testing this patch, we still see some client evictions and log dumps. The message log and Lustre log from the latest run are attached.

Comment by Andreas Dilger [ 18/Jul/14 ]

This should probably go into a new bug, since the LASSERT is fixed.

Comment by Vitaly Fertman [ 15/Aug/14 ]

The previous fix was not entirely correct: http://review.whamcloud.com/11469

Comment by Peter Jones [ 15/Aug/14 ]

Vitaly
Could you please open a new JIRA ticket to track this additional change? The original fix was in the already-GA 2.6 release.
Thanks
Peter
