[LU-4555] Patched (LU-2779) 2.4.1 Lustre Clients still crashing with LBUG Created: 28/Jan/14  Updated: 22/May/14  Resolved: 09/Apr/14

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.4.1
Fix Version/s: None

Type: Bug Priority: Major
Reporter: Oz Rentas Assignee: Zhenyu Xu
Resolution: Fixed Votes: 0
Labels: None

Attachments: Text File lustre-client-2.4.1-ddn1.0_2.6.32_358.18.1.el6_lustre.es50.x86_64_ES.src.rpm    
Issue Links:
Related
is related to LU-4581 ASSERTION( (!(page->cp_type == CPT_CA... Resolved
Severity: 2
Rank (Obsolete): 12443

 Description   

The customer has applied the patch from LU-2779 to their 2.4.1 clients.

We've verified the patch has been applied:

diff -ru /usr/src/lustre-2.4.1/lustre/osc/osc_cache.c ./lustre/osc/osc_cache.c
--- /usr/src/lustre-2.4.1/lustre/osc/osc_cache.c 2013-09-19 11:06:59.000000000 -0700
+++ ./lustre/osc/osc_cache.c 2013-12-18 06:52:09.000000000 -0800
@@ -896,7 +896,7 @@
 "%s: wait ext to %d timedout, recovery in progress?\n",
 osc_export(obj)->exp_obd->obd_name, state);

- lwi = LWI_INTR(LWI_ON_SIGNAL_NOOP, NULL);
+ lwi = LWI_INTR(NULL, NULL);
 rc = l_wait_event(ext->oe_waitq, extent_wait_cb(ext, state),
 &lwi);
 }

The client RPMs have also been rebuilt and installed. However, the Lustre clients are still failing with the following error:

LustreError: 17450:0:(cl_lock.c:1964:discard_cb()) ASSERTION( (!(page->cp_type == CPT_CACHEABLE) || (!PageWriteback(cl_page_vmpage(env, page)))) ) failed:
LustreError: 17450:0:(cl_lock.c:1964:discard_cb()) LBUG
Kernel panic - not syncing: LBUG
Pid: 17450, comm: tar Tainted: GF

The patched RPM is attached. Please advise.



 Comments   
Comment by Peter Jones [ 28/Jan/14 ]

Bobijam

Could you please assist with this ticket?

Thanks

Peter

Comment by Oz Rentas [ 04/Feb/14 ]

Any updates on this one? Please advise.
Thanks

Comment by Bruno Faccini (Inactive) [ 05/Feb/14 ]

Hello Oz,
Is there a crash dump available for this issue? If so, can you provide it along with the matching vmlinux and Lustre modules?
BTW, LU-4581 seems to report the same issue, although we still need to clarify whether it occurs when running with patch/change #5419 from LU-2779.

Comment by Jinshan Xiong (Inactive) [ 05/Feb/14 ]

Duplicate of LU-4581. I closed this one because LU-4581 already has a stack trace.

Comment by Peter Jones [ 21/Mar/14 ]

This is still occurring at the DDN site, so it does not appear to be a duplicate of LU-4581. Can we please get a level set on where we are with this ticket? Oz, how frequently does this issue occur at the customer site? Are there any logs, stacks, or crash dumps associated with the crashes?

Comment by Bruno Faccini (Inactive) [ 26/Mar/14 ]

Here are some updates for this ticket after the conference call:
_ It has been agreed that further updates on this problem at AWE will be tracked in this ticket, and no longer in LU-4581, which covers LLNL.
_ There are no known recent occurrences of this problem at AWE.
_ As already requested in LU-4581, the customer is running with D_CACHE traces enabled on their clients.
_ The debug buffer size has also been increased (exact value to be provided).
_ A crash dump will be taken and made available for debugging upon the next occurrence.
_ It would also be of interest to have the exact Lustre version being run on the clients and servers, along with the list and details of any additional patches applied.

Comment by John Fuchs-Chesney (Inactive) [ 09/Apr/14 ]

We believe that the 2.4.3 patch has fixed this problem.
~ jfc.
