  Lustre / LU-3941

(lov_lock.c:671:lov_lock_enqueue()) ASSERTION( sublock->cll_conflict == ((void *)0) ) failed

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Major
    • None
    • Lustre 2.1.6
    • 3
    • 10424

    Description

      An LBUG was hit at MeteoFrance running Lustre 2.1.6, with the following console messages:

      <ConMan> Console [beaufix1034] log at 2013-09-08 04:00:00 UTC.
      2013-09-08 04:03:18 LustreError: 11-0: an error occurred while communicating with 172.23.80.66@o2ib1. The obd_ping operation failed with -107
      2013-09-08 04:03:18 Lustre: scratch-OST0014-osc-ffff880464f5a400: Connection to scratch-OST0014 (at 172.23.80.66@o2ib1) was lost; in progress operations using this service will wait for recovery to complete
      2013-09-08 04:03:18 LustreError: 167-0: This client was evicted by scratch-OST0014; in progress operations using this service will fail.
      2013-09-08 04:03:18 LustreError: 31914:0:(ldlm_resource.c:751:ldlm_resource_complain()) Namespace scratch-OST0014-osc-ffff880464f5a400 resource refcount nonzero (1) after lock cleanup; forcing cleanup.
      2013-09-08 04:03:18 LustreError: 31914:0:(ldlm_resource.c:757:ldlm_resource_complain()) Resource: ffff88016d5fbe40 (48993688/0/0/0) (rc: 1)
      2013-09-08 04:03:18 Lustre: scratch-OST0014-osc-ffff880464f5a400: Connection restored to scratch-OST0014 (at 172.23.80.66@o2ib1)
      2013-09-08 04:19:08 LustreError: 11-0: an error occurred while communicating with 172.23.80.69@o2ib1. The obd_ping operation failed with -107
      2013-09-08 04:19:08 Lustre: scratch-OST001f-osc-ffff880464f5a400: Connection to scratch-OST001f (at 172.23.80.69@o2ib1) was lost; in progress operations using this service will wait for recovery to complete
      2013-09-08 04:19:08 LustreError: 167-0: This client was evicted by scratch-OST001f; in progress operations using this service will fail.
      2013-09-08 04:19:08 LustreError: 32128:0:(ldlm_resource.c:751:ldlm_resource_complain()) Namespace scratch-OST001f-osc-ffff880464f5a400 resource refcount nonzero (1) after lock cleanup; forcing cleanup.
      2013-09-08 04:19:08 LustreError: 32128:0:(ldlm_resource.c:757:ldlm_resource_complain()) Resource: ffff88047ba341c0 (48992886/0/0/0) (rc: 1)
      2013-09-08 04:19:08 Lustre: scratch-OST001f-osc-ffff880464f5a400: Connection restored to scratch-OST001f (at 172.23.80.69@o2ib1)
      2013-09-08 04:29:08 LustreError: 11-0: an error occurred while communicating with 172.23.64.63@o2ib. The obd_ping operation failed with -107
      2013-09-08 04:29:08 Lustre: scratch-OST0004-osc-ffff880464f5a400: Connection to scratch-OST0004 (at 172.23.64.63@o2ib) was lost; in progress operations using this service will wait for recovery to complete
      2013-09-08 04:29:08 LustreError: 167-0: This client was evicted by scratch-OST0004; in progress operations using this service will fail.
      2013-09-08 04:29:08 LustreError: 32232:0:(ldlm_resource.c:751:ldlm_resource_complain()) Namespace scratch-OST0004-osc-ffff880464f5a400 resource refcount nonzero (1) after lock cleanup; forcing cleanup.
      2013-09-08 04:29:08 LustreError: 32232:0:(ldlm_resource.c:757:ldlm_resource_complain()) Resource: ffff880812ed6bc0 (48991503/0/0/0) (rc: 1)
      2013-09-08 04:29:08 Lustre: scratch-OST0004-osc-ffff880464f5a400: Connection restored to scratch-OST0004 (at 172.23.64.63@o2ib)
      
      <ConMan> Console [beaufix1034] log at 2013-09-08 05:00:00 UTC.
      2013-09-08 05:04:58 LustreError: 11-0: an error occurred while communicating with 172.23.80.68@o2ib1. The obd_ping operation failed with -107
      2013-09-08 05:04:58 Lustre: scratch-OST002b-osc-ffff880464f5a400: Connection to scratch-OST002b (at 172.23.80.68@o2ib1) was lost; in progress operations using this service will wait for recovery to complete
      2013-09-08 05:04:58 LustreError: 167-0: This client was evicted by scratch-OST002b; in progress operations using this service will fail.
      2013-09-08 05:04:58 LustreError: 33732:0:(ldlm_resource.c:751:ldlm_resource_complain()) Namespace scratch-OST002b-osc-ffff880464f5a400 resource refcount nonzero (1) after lock cleanup; forcing cleanup.
      2013-09-08 05:04:58 LustreError: 33732:0:(ldlm_resource.c:757:ldlm_resource_complain()) Resource: ffff880320afb480 (48996861/0/0/0) (rc: 1)
      2013-09-08 05:04:59 Lustre: scratch-OST002b-osc-ffff880464f5a400: Connection restored to scratch-OST002b (at 172.23.80.68@o2ib1)
      2013-09-08 05:05:05 LustreError: 33253:0:(lov_lock.c:671:lov_lock_enqueue()) ASSERTION( sublock->cll_conflict == ((void *)0) ) failed:
      2013-09-08 05:05:05 LustreError: 33253:0:(lov_lock.c:671:lov_lock_enqueue()) LBUG
      2013-09-08 05:05:05 Pid: 33253, comm: arpege
      2013-09-08 05:05:05
      2013-09-08 05:05:05 Call Trace:
      2013-09-08 05:05:05 [<ffffffffa03977f5>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
      2013-09-08 05:05:05 [<ffffffffa0397e07>] lbug_with_loc+0x47/0xb0 [libcfs]
      2013-09-08 05:05:05 [<ffffffffa0741690>] ? osc_lock_enqueue+0x0/0x810 [osc]
      2013-09-08 05:05:05 [<ffffffffa08858dd>] lov_lock_enqueue+0x71d/0x820 [lov]
      2013-09-08 05:05:05 [<ffffffffa04b37fc>] cl_enqueue_try+0xfc/0x300 [obdclass]
      2013-09-08 05:05:05 [<ffffffffa04b4bfd>] cl_enqueue_locked+0x6d/0x200 [obdclass]
      2013-09-08 05:05:05 [<ffffffffa04b587e>] cl_lock_request+0x7e/0x1c0 [obdclass]
      2013-09-08 05:05:05 [<ffffffffa09f02cf>] cl_glimpse_lock+0x16f/0x410 [lustre]
      2013-09-08 05:05:05 [<ffffffffa09f0a94>] cl_glimpse_size+0x154/0x160 [lustre]
      2013-09-08 05:05:05 [<ffffffffa09b49ae>] ll_inode_revalidate_it+0x5e/0x170 [lustre]
      2013-09-08 05:05:05 [<ffffffffa09b4b09>] ll_getattr_it+0x49/0x170 [lustre]
      2013-09-08 05:05:05 [<ffffffffa09b4c67>] ll_getattr+0x37/0x40 [lustre]
      2013-09-08 05:05:05 [<ffffffff8116c591>] vfs_getattr+0x51/0x80
      2013-09-08 05:05:05 [<ffffffff81113abf>] ? free_hot_page+0x2f/0x60
      2013-09-08 05:05:05 [<ffffffff8116c84f>] vfs_fstat+0x3f/0x60
      2013-09-08 05:05:05 [<ffffffff8116c894>] sys_newfstat+0x24/0x40
      2013-09-08 05:05:05 [<ffffffff81173e85>] ? putname+0x35/0x50
      2013-09-08 05:05:06 [<ffffffff811647e1>] ? do_sys_open+0x101/0x140
      2013-09-08 05:05:06 [<ffffffff810030f2>] system_call_fastpath+0x16/0x1b
      2013-09-08 05:05:06
      2013-09-08 05:05:06 Kernel panic - not syncing: LBUG
      2013-09-08 05:05:06 Pid: 33253, comm: arpege Tainted: G W --------------- H 2.6.32-279.5.2.bl6.Bull.36.x86_64 #1
      2013-09-08 05:05:06 Call Trace:
      2013-09-08 05:05:06 [<ffffffff81495fe3>] ? panic+0xa0/0x168
      2013-09-08 05:05:06 [<ffffffffa0397e5b>] ? lbug_with_loc+0x9b/0xb0 [libcfs]
      2013-09-08 05:05:06 [<ffffffffa0741690>] ? osc_lock_enqueue+0x0/0x810 [osc]
      2013-09-08 05:05:06 [<ffffffffa08858dd>] ? lov_lock_enqueue+0x71d/0x820 [lov]
      2013-09-08 05:05:06 [<ffffffffa04b37fc>] ? cl_enqueue_try+0xfc/0x300 [obdclass]
      2013-09-08 05:05:06 [<ffffffffa04b4bfd>] ? cl_enqueue_locked+0x6d/0x200 [obdclass]
      2013-09-08 05:05:06 [<ffffffffa04b587e>] ? cl_lock_request+0x7e/0x1c0 [obdclass]
      2013-09-08 05:05:06 [<ffffffffa09f02cf>] ? cl_glimpse_lock+0x16f/0x410 [lustre]
      2013-09-08 05:05:06 [<ffffffffa09f0a94>] ? cl_glimpse_size+0x154/0x160 [lustre]
      2013-09-08 05:05:06 [<ffffffffa09b49ae>] ? ll_inode_revalidate_it+0x5e/0x170 [lustre]
      2013-09-08 05:05:06 [<ffffffffa09b4b09>] ? ll_getattr_it+0x49/0x170 [lustre]
      2013-09-08 05:05:06 [<ffffffffa09b4c67>] ? ll_getattr+0x37/0x40 [lustre]
      2013-09-08 05:05:06 [<ffffffff8116c591>] ? vfs_getattr+0x51/0x80
      2013-09-08 05:05:06 [<ffffffff81113abf>] ? free_hot_page+0x2f/0x60
      2013-09-08 05:05:06 [<ffffffff8116c84f>] ? vfs_fstat+0x3f/0x60
      2013-09-08 05:05:06 [<ffffffff8116c894>] ? sys_newfstat+0x24/0x40
      2013-09-08 05:05:06 [<ffffffff81173e85>] ? putname+0x35/0x50
      2013-09-08 05:05:06 [<ffffffff811647e1>] ? do_sys_open+0x101/0x140
      2013-09-08 05:05:06 [<ffffffff810030f2>] ? system_call_fastpath+0x16/0x1b
      

      This looks very similar to LU-3874, but the patch suspected in that ticket is not applied to our version (it is only in master and b2_4).

      Unfortunately, we do not have any dump to analyze further.

      Thanks,
      Sebastien.

      Attachments

        Activity


          Hi Bruno,

          Actually, there has been no new occurrence of this issue: there were only 4 occurrences over 5 days in mid-September, and then nothing.
          Is the patch planned for full review and landing on b2_1?

          spiechurski Sebastien Piechurski added a comment

          Seb, looking at the history/comments of this ticket, I wonder whether you already gave my patch a try on-site, or whether the problem simply did not re-occur?

          bfaccini Bruno Faccini (Inactive) added a comment

          Cool, thanks Seb for these details and for adding the debug flags.

          bfaccini Bruno Faccini (Inactive) added a comment

          Hi Bruno,

          We don't have any new occurrences.
          We will enable debug logs in case there is a new occurrence.
          The application is a hybrid one (MPI + OpenMP) which might access some files concurrently, inducing multiple locks on the same region of the same objects (a sketch of that pattern follows below).

          We are also tracking another issue where this application triggers lock callback timeouts on the OSS, with eviction of some clients. I don't know whether those two problems are related, but they seem quite close. We will open a new Jira ticket when we have more data about it.
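          As a purely illustrative sketch of that access pattern (not a confirmed reproducer; the file path, thread count and iteration count below are placeholders), several threads stat-ing the same striped file concurrently is enough to drive concurrent glimpse-lock enqueues on the same object, which is the path seen in the panic stack:

          /* Illustrative only: concurrent stat() calls on one shared file,
           * the kind of access a hybrid MPI+OpenMP job can generate.  Each
           * stat() of a striped Lustre file triggers a glimpse to fetch the
           * size from the OSTs (cl_glimpse_size in the stack above).
           * Compile with: gcc -pthread sketch.c */
          #include <pthread.h>
          #include <stdio.h>
          #include <sys/stat.h>

          #define NTHREADS 8
          #define NITERS   1000

          static const char *path = "/scratch/shared_output.dat"; /* placeholder */

          static void *worker(void *arg)
          {
              struct stat st;
              (void)arg;
              for (int i = 0; i < NITERS; i++)
                  if (stat(path, &st) != 0)
                      perror("stat");
              return NULL;
          }

          int main(void)
          {
              pthread_t tids[NTHREADS];

              for (int i = 0; i < NTHREADS; i++)
                  pthread_create(&tids[i], NULL, worker, NULL);
              for (int i = 0; i < NTHREADS; i++)
                  pthread_join(tids[i], NULL);
              return 0;
          }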

          spiechurski Sebastien Piechurski added a comment

          The patch successfully passed the standard auto-tests and the additional parallel-scale* runs that were requested.

          Sebs, any news on your side? Have you already triggered new occurrences? Have you been able to add the Lustre debug flags? Any details to provide about the Arpege app?

          bfaccini Bruno Faccini (Inactive) added a comment

          The b2_1 backport of change #7569 from LU-3027 is at http://review.whamcloud.com/7878.

          bfaccini Bruno Faccini (Inactive) added a comment

          Crash-dump analysis seems to indicate that the cause of the LBUG is that the glimpse lock's per-stripe lov_lock_sub.sub_got and osc_lock sub-lock structs do not have CEF_ASYNC and ols_glimpse set, respectively. This allows a conflict to be detected and recorded in cll_conflict (and CLO_WAIT to be returned) by osc_lock_enqueue_wait(), which is not expected to happen for a glimpse lock (i.e. one with cld_enq_flags & CEF_ASYNC and ols_flags & LDLM_FL_HAS_INTENT): in lov_lock_enqueue_one() the CLO_WAIT return is suppressed for such locks without regard for cll_conflict.
          This then causes the LBUG/assertion to trigger in lov_lock_enqueue(), because cll_conflict is found to be set after the forced 0 return from lov_lock_enqueue_one().

          Based on this, I now strongly suspect that change #7569 for LU-3027, which makes lov_sublock_modify() modify only the extent data in sub_got (and not cld_enq_flags!), could be the fix for this problem. I will back-port it to b2_1 and see how it works.
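          To make the sequence easier to follow, here is a small self-contained sketch that models the logic described above. It is not the actual Lustre code: the struct and function names (sub_lock, osc_enqueue_wait_model, lov_enqueue_one_model) are simplified stand-ins for lov_lock_sub/osc_lock, osc_lock_enqueue_wait() and lov_lock_enqueue_one(). The sub-lock has lost the glimpse/async hints, so a conflict is recorded and CLO_WAIT returned, but the glimpse top-lock forces the return code back to 0, leaving cll_conflict set when the assertion in lov_lock_enqueue() checks it.

          /* Simplified model only -- not the real Lustre 2.1 code. */
          #include <assert.h>
          #include <stddef.h>
          #include <stdio.h>

          #define CEF_ASYNC 0x01          /* stand-in for the real enqueue flag */
          enum { CLO_DONE = 0, CLO_WAIT = 1 };

          struct sub_lock {
              int   is_glimpse;           /* stand-in for osc_lock->ols_glimpse  */
              void *cll_conflict;         /* conflicting lock found on enqueue   */
          };

          /* Models osc_lock_enqueue_wait(): if the sub-lock does not look like
           * a glimpse lock, the conflict is recorded and CLO_WAIT returned. */
          static int osc_enqueue_wait_model(struct sub_lock *sub, void *conflicting)
          {
              if (conflicting != NULL && !sub->is_glimpse) {
                  sub->cll_conflict = conflicting;
                  return CLO_WAIT;
              }
              return CLO_DONE;
          }

          /* Models lov_lock_enqueue_one(): for a glimpse top-lock (CEF_ASYNC)
           * the CLO_WAIT result is turned into 0 without clearing cll_conflict. */
          static int lov_enqueue_one_model(struct sub_lock *sub,
                                           unsigned int top_flags, void *conflicting)
          {
              int rc = osc_enqueue_wait_model(sub, conflicting);

              if (rc == CLO_WAIT && (top_flags & CEF_ASYNC))
                  rc = 0;
              return rc;
          }

          int main(void)
          {
              struct sub_lock sub = { .is_glimpse = 0, .cll_conflict = NULL };
              int conflicting_lock;     /* any non-NULL pointer will do */

              int rc = lov_enqueue_one_model(&sub, CEF_ASYNC, &conflicting_lock);
              printf("rc = %d, cll_conflict = %p\n", rc, sub.cll_conflict);

              /* Models the assertion in lov_lock_enqueue(): it aborts here,
               * just as the LBUG fires on the client. */
              assert(rc != 0 || sub.cll_conflict == NULL);
              return 0;
          }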

          bfaccini Bruno Faccini (Inactive) added a comment
          bfaccini Bruno Faccini (Inactive) added a comment (edited)

          Oleg, thanks.

          Sebs, does this problem occur frequently on nodes running the "Arpege" app? If so, do you have some details about it (MPI? OpenMP? I/O profile/workload? MPI-IO? ...)?

          Also, can you add some Lustre debug flags (mainly "warning dlmtrace error emerg") to get more info on the next occurrence? Right now you seem to be running with /proc/sys/lnet/debug=0 ...
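          For reference, the debug mask can be raised at runtime by writing the flag names to /proc/sys/lnet/debug (or with "lctl set_param debug=..."). A minimal sketch, assuming that proc file is present on the client and the caller is root:

          /* Minimal sketch: set the suggested Lustre debug flags by writing
           * them to /proc/sys/lnet/debug.  In practice a simple
           *   echo "warning dlmtrace error emerg" > /proc/sys/lnet/debug
           * does the same thing. */
          #include <stdio.h>
          #include <string.h>

          int main(void)
          {
              const char *mask = "warning dlmtrace error emerg";
              FILE *f = fopen("/proc/sys/lnet/debug", "w");

              if (f == NULL) {
                  perror("open /proc/sys/lnet/debug");
                  return 1;
              }
              if (fwrite(mask, 1, strlen(mask), f) != strlen(mask))
                  perror("write debug mask");
              fclose(f);
              return 0;
          }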

          green Oleg Drokin added a comment -

          Apparently this is being hit in Maloo too: https://maloo.whamcloud.com/sub_tests/5eac2b4e-174d-11e3-9d30-52540035b04c (on the 2.4.92 tag)


          Hello Sebs!
          The crash-dump log contains a bunch of hung arpege thread stack dumps (from previous runs) like:

          INFO: task arpege:10985 blocked for more than 120 seconds.
          "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
          arpege        D 0000000000000025     0 10985  10978 0x00000000
           ffff88045d703968 0000000000000086 ffff880419aec000 0000000000000086
           ffff880419aec000 ffff88042b25fa60 0000000000000000 ffff88042b25fa78
           ffff88035de67b10 ffff88045d703fd8 000000000000e160 ffff88035de67b10
          Call Trace:
           [<ffffffff81497c5e>] __mutex_lock_slowpath+0x14e/0x190
           [<ffffffff81497aeb>] mutex_lock+0x2b/0x50
           [<ffffffffa0545d87>] cl_lock_mutex_get+0x77/0xe0 [obdclass]
           [<ffffffffa0548107>] cl_lock_enclosure+0x187/0x230 [obdclass]
           [<ffffffffa05481f3>] cl_lock_closure_build+0x43/0x150 [obdclass]
           [<ffffffffa09729b6>] lovsub_lock_closure+0x76/0x150 [lov]
           [<ffffffffa0548230>] cl_lock_closure_build+0x80/0x150 [obdclass]
           [<ffffffffa0968d5b>] lov_sublock_lock+0x6b/0x2f0 [lov]
           [<ffffffffa096a325>] lov_lock_enqueue+0x165/0x820 [lov]
           [<ffffffffa05477fc>] cl_enqueue_try+0xfc/0x300 [obdclass]
           [<ffffffffa0548bfd>] cl_enqueue_locked+0x6d/0x200 [obdclass]
           [<ffffffffa054987e>] cl_lock_request+0x7e/0x1c0 [obdclass]
           [<ffffffffa0a212cf>] cl_glimpse_lock+0x16f/0x410 [lustre]
           [<ffffffffa0a21a94>] cl_glimpse_size+0x154/0x160 [lustre]
           [<ffffffffa09e59ae>] ll_inode_revalidate_it+0x5e/0x170 [lustre]
           [<ffffffffa09e5b09>] ll_getattr_it+0x49/0x170 [lustre]
           [<ffffffffa09e5c67>] ll_getattr+0x37/0x40 [lustre]
           [<ffffffff8116c591>] vfs_getattr+0x51/0x80
           [<ffffffff81113abf>] ? free_hot_page+0x2f/0x60
           [<ffffffff8116c84f>] vfs_fstat+0x3f/0x60
           [<ffffffff8116c894>] sys_newfstat+0x24/0x40
           [<ffffffff81173e85>] ? putname+0x35/0x50
           [<ffffffff811647e1>] ? do_sys_open+0x101/0x140
           [<ffffffff810030f2>] system_call_fastpath+0x16/0x1b
          

          and also several lock dumps like the following:

          LustreError: 42482:0:(cl_lock.c:1413:cl_unuse_try()) lock@ffff8804655ec0b8[7 4 0 4 0 00000000] P(0):[0, 18446744073709551615]@[0x200dbc17a:0x19529:0x0] {
          LustreError: 42482:0:(cl_lock.c:1413:cl_unuse_try())     vvp@ffff8802fb74e6f0:
          LustreError: 42482:0:(cl_lock.c:1413:cl_unuse_try())     lov@ffff880465f7ca20: 4
          LustreError: 42482:0:(cl_lock.c:1413:cl_unuse_try())     0 0: ---
          LustreError: 42482:0:(cl_lock.c:1413:cl_unuse_try())     1 0: lock@ffff88084a564a58[0 5 0 0 0 00000000] R(1):[0, 18446744073709551615]@[0x100260000:0x31d165b:0x0] {
          LustreError: 42482:0:(cl_lock.c:1413:cl_unuse_try())     lovsub@ffff88086678baa0: [1 ffff880465f7ca20 P(0):[0, 18446744073709551615]@[0x200dbc17a:0x19529:0x0]]
          LustreError: 42482:0:(cl_lock.c:1413:cl_unuse_try())     osc@ffff880869f0eee8: ffff88086a61f000 00101001 0x75459d96781d6cfc 4 (null) size: 2097152 mtime: 1378913901 atime: 1378913901 ctime: 1378913901 blocks: 4096
          LustreError: 42482:0:(cl_lock.c:1413:cl_unuse_try()) } lock@ffff88084a564a58
          LustreError: 42482:0:(cl_lock.c:1413:cl_unuse_try())     2 0: lock@ffff8806b82be7b8[0 5 0 0 0 00000000] R(1):[0, 18446744073709551615]@[0x100160000:0x31d36ce:0x0] {
          LustreError: 42482:0:(cl_lock.c:1413:cl_unuse_try())     lovsub@ffff88087991c960: [2 ffff880465f7ca20 P(0):[0, 18446744073709551615]@[0x200dbc17a:0x19529:0x0]]
          LustreError: 42482:0:(cl_lock.c:1413:cl_unuse_try())     osc@ffff880869f0e368: ffff8807f14e3d80 00101001 0x75459d96781d6d03 4 (null) size: 2097152 mtime: 1378913901 atime: 1378913901 ctime: 1378913901 blocks: 4096
          LustreError: 42482:0:(cl_lock.c:1413:cl_unuse_try()) } lock@ffff8806b82be7b8
          LustreError: 42482:0:(cl_lock.c:1413:cl_unuse_try())     3 0: lock@ffff880869fcb078[0 5 0 0 0 00000000] R(1):[0, 18446744073709551615]@[0x100100000:0x31d431c:0x0] {
          LustreError: 42482:0:(cl_lock.c:1413:cl_unuse_try())     lovsub@ffff88068ae36a20: [3 ffff880465f7ca20 P(0):[0, 18446744073709551615]@[0x200dbc17a:0x19529:0x0]]
          LustreError: 42482:0:(cl_lock.c:1413:cl_unuse_try())     osc@ffff880869f0e700: ffff88068ae3ad80 00101001 0x75459d96781d6d0a 4 (null) size: 2097152 mtime: 1378913901 atime: 1378913901 ctime: 1378913901 blocks: 4096
          LustreError: 42482:0:(cl_lock.c:1413:cl_unuse_try()) } lock@ffff880869fcb078
          LustreError: 42482:0:(cl_lock.c:1413:cl_unuse_try())
          LustreError: 42482:0:(cl_lock.c:1413:cl_unuse_try()) } lock@ffff8804655ec0b8
          LustreError: 42482:0:(cl_lock.c:1413:cl_unuse_try()) unuse return -108
          

          Since these stacks look similar to the panic stack for this problem (and to the stacks of other threads racing against the same [sub-]lock structs ...), it is a bit suspect, and I wonder whether this is a usual behavior/symptom on-site?
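          For reference, the "unuse return -108" in the lock dump above is -ESHUTDOWN on Linux, which would be consistent with the sub-locks being torn down after the evictions seen in the console log. A quick check:

          /* Quick check of what errno 108 maps to on Linux. */
          #include <errno.h>
          #include <stdio.h>
          #include <string.h>

          int main(void)
          {
              /* Typically prints:
               * ESHUTDOWN = 108: Cannot send after transport endpoint shutdown */
              printf("ESHUTDOWN = %d: %s\n", ESHUTDOWN, strerror(ESHUTDOWN));
              return 0;
          }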

          bfaccini Bruno Faccini (Inactive) added a comment

          Hello Seb,
          So you finally got another occurrence with a valid crash-dump?
          I will look into it and update you soon.
          Thanks for your help, bye.

          bfaccini Bruno Faccini (Inactive) added a comment

          People

            Assignee: bfaccini Bruno Faccini (Inactive)
            Reporter: sebastien.buisson Sebastien Buisson (Inactive)
            Votes: 0
            Watchers: 6

            Dates

              Created:
              Updated:
              Resolved: