[LU-4576] Seeing kernel panics in osc_extent_wait for 2.4.0 Lustre clients
Created: 03/Feb/14  Updated: 20/Feb/14  Resolved: 20/Feb/14

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.4.0
Fix Version/s: None

Type: Bug
Priority: Blocker
Reporter: James A Simmons
Assignee: Jinshan Xiong (Inactive)
Resolution: Duplicate
Votes: 0
Labels: mn4
Environment:

Cray 2.4 clients running SLES11 SP1 distro


Issue Links:
Related
is related to LU-4509: clio can be stuck in osc_extent_wait (Resolved)
Severity: 1
Rank (Obsolete): 12499

 Description   

2014-02-02T11:59:55.236185-05:00 c24-0c0s2n3 [<ffffffffa02ef8ae>] cfs_waitq_wait+0xe/0x10 [libcfs]
2014-02-02T11:59:55.236191-05:00 c24-0c0s2n3 [<ffffffffa08a15b4>] osc_extent_wait+0x544/0x600 [osc]
2014-02-02T11:59:55.236197-05:00 c24-0c0s2n3 [<ffffffffa08a1c02>] osc_cache_wait_range+0x592/0x8f0 [osc]
2014-02-02T11:59:55.236203-05:00 c24-0c0s2n3 [<ffffffffa08a9a21>] osc_cache_writeback_range+0x1001/0x144c [osc]
2014-02-02T11:59:55.236208-05:00 c24-0c0s2n3 [<ffffffffa088e77e>] osc_lock_flush+0x7e/0x260 [osc]
2014-02-02T11:59:55.236214-05:00 c24-0c0s2n3 [<ffffffffa088f0d1>] osc_lock_cancel+0x101/0x1e0 [osc]
2014-02-02T11:59:55.236227-05:00 c24-0c0s2n3 [<ffffffffa04fa085>] cl_lock_cancel+0x1e5/0x360 [obdclass]
2014-02-02T11:59:55.236233-05:00 c24-0c0s2n3 [<ffffffffa08901a8>] osc_ldlm_blocking_ast+0x198/0x3a0 [osc]
2014-02-02T11:59:55.236239-05:00 c24-0c0s2n3 [<ffffffffa061965b>] ldlm_cancel_callback+0x6b/0x190 [ptlrpc]
2014-02-02T11:59:55.236244-05:00 c24-0c0s2n3 [<ffffffffa06374fa>] ldlm_cli_cancel_local+0x8a/0x470 [ptlrpc]
2014-02-02T11:59:55.236250-05:00 c24-0c0s2n3 [<ffffffffa063a9bc>] ldlm_cli_cancel_list_local+0xec/0x280 [ptlrpc]
2014-02-02T11:59:55.236255-05:00 c24-0c0s2n3 [<ffffffffa063bb45>] ldlm_cancel_lru_local+0x35/0x40 [ptlrpc]
2014-02-02T11:59:55.236261-05:00 c24-0c0s2n3 [<ffffffffa063bf1f>] ldlm_prep_elc_req+0x3cf/0x480 [ptlrpc]
2014-02-02T11:59:55.236266-05:00 c24-0c0s2n3 [<ffffffffa063bff8>] ldlm_prep_enqueue_req+0x28/0x30 [ptlrpc]
2014-02-02T11:59:55.236272-05:00 c24-0c0s2n3 [<ffffffffa08746a3>] osc_enqueue_base+0x103/0x550 [osc]
2014-02-02T11:59:55.236277-05:00 c24-0c0s2n3 [<ffffffffa088f9ce>] osc_lock_enqueue+0x4ee/0x940 [osc]
2014-02-02T11:59:55.236283-05:00 c24-0c0s2n3 [<ffffffffa04f9633>] cl_enqueue_try+0xf3/0x450 [obdclass]
2014-02-02T11:59:55.236289-05:00 c24-0c0s2n3 [<ffffffffa0928cca>] lov_lock_enqueue+0x1aa/0xce0 [lov]
2014-02-02T11:59:55.236294-05:00 c24-0c0s2n3 [<ffffffffa04f9633>] cl_enqueue_try+0xf3/0x450 [obdclass]
2014-02-02T11:59:55.236300-05:00 c24-0c0s2n3 [<ffffffffa04fd5bf>] cl_enqueue_locked+0x7f/0x1f0 [obdclass]
2014-02-02T11:59:55.236305-05:00 c24-0c0s2n3 [<ffffffffa04fd7b7>] cl_lock_request+0x87/0x320 [obdclass]
2014-02-02T11:59:55.236311-05:00 c24-0c0s2n3 [<ffffffffa09fcadf>] cl_glimpse_lock+0x17f/0x480 [lustre]
2014-02-02T11:59:55.236317-05:00 c24-0c0s2n3 [<ffffffffa09fcf99>] cl_glimpse_size0+0x1b9/0x210 [lustre]
2014-02-02T11:59:55.236322-05:00 c24-0c0s2n3 [<ffffffffa09b3050>] ll_inode_revalidate_it+0x1b0/0x1d0 [lustre]
2014-02-02T11:59:55.236328-05:00 c24-0c0s2n3 [<ffffffffa09b3241>] ll_getattr+0x61/0x180 [lustre]
2014-02-02T11:59:55.236334-05:00 c24-0c0s2n3 [<ffffffff81116ff8>] vfs_getattr+0x28/0x50
2014-02-02T11:59:55.236340-05:00 c24-0c0s2n3 [<ffffffff81117348>] vfs_fstatat+0x68/0x80
2014-02-02T11:59:55.236448-05:00 c24-0c0s2n3 [<ffffffff8111739b>] vfs_stat+0x1b/0x20
2014-02-02T11:59:55.236454-05:00 c24-0c0s2n3 [<ffffffff81117564>] sys_newstat+0x24/0x50
2014-02-02T11:59:55.236460-05:00 c24-0c0s2n3 [<ffffffff8100305b>] system_call_fastpath+0x16/0x1b
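
A trace like the one above can be checked for mechanically when scanning console logs from many nodes. The following is a minimal sketch, not part of the original report: the names `FRAME_RE`, `parse_frames`, and `is_extent_wait_hang` are hypothetical, and the frame format is assumed to match the lines shown in this ticket. It extracts the function and module from each frame and flags a trace in which osc_extent_wait is reached from osc_lock_cancel.

```python
import re

# Assumed frame format, taken from the lines in this ticket, e.g.:
#   <timestamp> <node> [<ffffffffa08a15b4>] osc_extent_wait+0x544/0x600 [osc]
# The trailing [module] is absent for core kernel symbols.
FRAME_RE = re.compile(
    r"\[<(?P<addr>[0-9a-f]+)>\]\s+"
    r"(?P<func>[A-Za-z_][\w.]*)\+0x[0-9a-f]+/0x[0-9a-f]+"
    r"(?:\s+\[(?P<module>\w+)\])?"
)


def parse_frames(log_text):
    """Return (function, module) pairs in log order (innermost frame first)."""
    frames = []
    for line in log_text.splitlines():
        match = FRAME_RE.search(line)
        if match:
            frames.append((match.group("func"), match.group("module") or "kernel"))
    return frames


def is_extent_wait_hang(frames):
    """True if osc_extent_wait appears deeper in the stack than osc_lock_cancel."""
    funcs = [func for func, _ in frames]
    if "osc_extent_wait" not in funcs or "osc_lock_cancel" not in funcs:
        return False
    # Frames are listed innermost first, so the waiter has the smaller index.
    return funcs.index("osc_extent_wait") < funcs.index("osc_lock_cancel")


# Three frames copied from the trace in this ticket.
sample = """\
2014-02-02T11:59:55.236191-05:00 c24-0c0s2n3 [<ffffffffa08a15b4>] osc_extent_wait+0x544/0x600 [osc]
2014-02-02T11:59:55.236214-05:00 c24-0c0s2n3 [<ffffffffa088f0d1>] osc_lock_cancel+0x101/0x1e0 [osc]
2014-02-02T11:59:55.236334-05:00 c24-0c0s2n3 [<ffffffff81116ff8>] vfs_getattr+0x28/0x50
"""

if __name__ == "__main__":
    for func, module in parse_frames(sample):
        print(f"{module}: {func}")
    print("matches signature:", is_extent_wait_hang(parse_frames(sample)))
```

The check only matches function names; it does not distinguish between the possible underlying bugs that produce the same stack.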



 Comments   
Comment by Peter Jones [ 03/Feb/14 ]

Jinshan

Could you please advise on this one?

Thanks

Peter

Comment by Jinshan Xiong (Inactive) [ 03/Feb/14 ]

Can you please apply the patch http://review.whamcloud.com/8922 and see if it helps?

Jinshan

Comment by Andreas Dilger [ 03/Feb/14 ]

Link to LU-4509 where the 8922 patch has just landed to master.

Comment by Patrick Farrell (Inactive) [ 03/Feb/14 ]

James - Can you specify what the kernel panic is? I see a stack trace, but no panic...? (Sorry, perhaps this should be obvious to me.)

Comment by James A Simmons [ 04/Feb/14 ]

I will ask the admin for the full log tomorrow.

Comment by James A Simmons [ 04/Feb/14 ]

We haven't tried the LU-4509 patch yet, but from what I heard, that patch didn't resolve the problem in Cray's testing. I did get hold of the logs, and after examining them I have determined that the true bug is LU-4300.

Comment by John Lewis (Inactive) [ 04/Feb/14 ]

To clarify, the issue observed has not been a kernel panic, LBUG, etc. Client applications are hanging; the stack trace in the description is characteristic of what we see when killing the node via an NMI.

Comment by Peter Jones [ 20/Feb/14 ]

Believed to be a duplicate of LU-4300
