[LU-3941] (lov_lock.c:671:lov_lock_enqueue()) ASSERTION( sublock->cll_conflict == ((void *)0) ) failed

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Major
    • Fix Version/s: None
    • Affects Version/s: Lustre 2.1.6
    • Severity: 3
    • Rank: 10424

    Description

      An LBUG was hit at MeteoFrance, running Lustre 2.1.6, with the following messages:

      <ConMan> Console [beaufix1034] log at 2013-09-08 04:00:00 UTC.
      2013-09-08 04:03:18 LustreError: 11-0: an error occurred while communicating with 172.23.80.66@o2ib1. The obd_ping operation failed with -107
      2013-09-08 04:03:18 Lustre: scratch-OST0014-osc-ffff880464f5a400: Connection to scratch-OST0014 (at 172.23.80.66@o2ib1) was lost; in progress operations using this service will wait for recovery to complete
      2013-09-08 04:03:18 LustreError: 167-0: This client was evicted by scratch-OST0014; in progress operations using this service will fail.
      2013-09-08 04:03:18 LustreError: 31914:0:(ldlm_resource.c:751:ldlm_resource_complain()) Namespace scratch-OST0014-osc-ffff880464f5a400 resource refcount nonzero (1) after lock cleanup; forcing cleanup.
      2013-09-08 04:03:18 LustreError: 31914:0:(ldlm_resource.c:757:ldlm_resource_complain()) Resource: ffff88016d5fbe40 (48993688/0/0/0) (rc: 1)
      2013-09-08 04:03:18 Lustre: scratch-OST0014-osc-ffff880464f5a400: Connection restored to scratch-OST0014 (at 172.23.80.66@o2ib1)
      2013-09-08 04:19:08 LustreError: 11-0: an error occurred while communicating with 172.23.80.69@o2ib1. The obd_ping operation failed with -107
      2013-09-08 04:19:08 Lustre: scratch-OST001f-osc-ffff880464f5a400: Connection to scratch-OST001f (at 172.23.80.69@o2ib1) was lost; in progress operations using this service will wait for recovery to complete
      2013-09-08 04:19:08 LustreError: 167-0: This client was evicted by scratch-OST001f; in progress operations using this service will fail.
      2013-09-08 04:19:08 LustreError: 32128:0:(ldlm_resource.c:751:ldlm_resource_complain()) Namespace scratch-OST001f-osc-ffff880464f5a400 resource refcount nonzero (1) after lock cleanup; forcing cleanup.
      2013-09-08 04:19:08 LustreError: 32128:0:(ldlm_resource.c:757:ldlm_resource_complain()) Resource: ffff88047ba341c0 (48992886/0/0/0) (rc: 1)
      2013-09-08 04:19:08 Lustre: scratch-OST001f-osc-ffff880464f5a400: Connection restored to scratch-OST001f (at 172.23.80.69@o2ib1)
      2013-09-08 04:29:08 LustreError: 11-0: an error occurred while communicating with 172.23.64.63@o2ib. The obd_ping operation failed with -107
      2013-09-08 04:29:08 Lustre: scratch-OST0004-osc-ffff880464f5a400: Connection to scratch-OST0004 (at 172.23.64.63@o2ib) was lost; in progress operations using this service will wait for recovery to complete
      2013-09-08 04:29:08 LustreError: 167-0: This client was evicted by scratch-OST0004; in progress operations using this service will fail.
      2013-09-08 04:29:08 LustreError: 32232:0:(ldlm_resource.c:751:ldlm_resource_complain()) Namespace scratch-OST0004-osc-ffff880464f5a400 resource refcount nonzero (1) after lock cleanup; forcing cleanup.
      2013-09-08 04:29:08 LustreError: 32232:0:(ldlm_resource.c:757:ldlm_resource_complain()) Resource: ffff880812ed6bc0 (48991503/0/0/0) (rc: 1)
      2013-09-08 04:29:08 Lustre: scratch-OST0004-osc-ffff880464f5a400: Connection restored to scratch-OST0004 (at 172.23.64.63@o2ib)
      
      <ConMan> Console [beaufix1034] log at 2013-09-08 05:00:00 UTC.
      2013-09-08 05:04:58 LustreError: 11-0: an error occurred while communicating with 172.23.80.68@o2ib1. The obd_ping operation failed with -107
      2013-09-08 05:04:58 Lustre: scratch-OST002b-osc-ffff880464f5a400: Connection to scratch-OST002b (at 172.23.80.68@o2ib1) was lost; in progress operations using this service will wait for recovery to complete
      2013-09-08 05:04:58 LustreError: 167-0: This client was evicted by scratch-OST002b; in progress operations using this service will fail.
      2013-09-08 05:04:58 LustreError: 33732:0:(ldlm_resource.c:751:ldlm_resource_complain()) Namespace scratch-OST002b-osc-ffff880464f5a400 resource refcount nonzero (1) after lock cleanup; forcing cleanup.
      2013-09-08 05:04:58 LustreError: 33732:0:(ldlm_resource.c:757:ldlm_resource_complain()) Resource: ffff880320afb480 (48996861/0/0/0) (rc: 1)
      2013-09-08 05:04:59 Lustre: scratch-OST002b-osc-ffff880464f5a400: Connection restored to scratch-OST002b (at 172.23.80.68@o2ib1)
      2013-09-08 05:05:05 LustreError: 33253:0:(lov_lock.c:671:lov_lock_enqueue()) ASSERTION( sublock->cll_conflict == ((void *)0) ) failed:
      2013-09-08 05:05:05 LustreError: 33253:0:(lov_lock.c:671:lov_lock_enqueue()) LBUG
      2013-09-08 05:05:05 Pid: 33253, comm: arpege
      2013-09-08 05:05:05
      2013-09-08 05:05:05 Call Trace:
      2013-09-08 05:05:05 [<ffffffffa03977f5>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
      2013-09-08 05:05:05 [<ffffffffa0397e07>] lbug_with_loc+0x47/0xb0 [libcfs]
      2013-09-08 05:05:05 [<ffffffffa0741690>] ? osc_lock_enqueue+0x0/0x810 [osc]
      2013-09-08 05:05:05 [<ffffffffa08858dd>] lov_lock_enqueue+0x71d/0x820 [lov]
      2013-09-08 05:05:05 [<ffffffffa04b37fc>] cl_enqueue_try+0xfc/0x300 [obdclass]
      2013-09-08 05:05:05 [<ffffffffa04b4bfd>] cl_enqueue_locked+0x6d/0x200 [obdclass]
      2013-09-08 05:05:05 [<ffffffffa04b587e>] cl_lock_request+0x7e/0x1c0 [obdclass]
      2013-09-08 05:05:05 [<ffffffffa09f02cf>] cl_glimpse_lock+0x16f/0x410 [lustre]
      2013-09-08 05:05:05 [<ffffffffa09f0a94>] cl_glimpse_size+0x154/0x160 [lustre]
      2013-09-08 05:05:05 [<ffffffffa09b49ae>] ll_inode_revalidate_it+0x5e/0x170 [lustre]
      2013-09-08 05:05:05 [<ffffffffa09b4b09>] ll_getattr_it+0x49/0x170 [lustre]
      2013-09-08 05:05:05 [<ffffffffa09b4c67>] ll_getattr+0x37/0x40 [lustre]
      2013-09-08 05:05:05 [<ffffffff8116c591>] vfs_getattr+0x51/0x80
      2013-09-08 05:05:05 [<ffffffff81113abf>] ? free_hot_page+0x2f/0x60
      2013-09-08 05:05:05 [<ffffffff8116c84f>] vfs_fstat+0x3f/0x60
      2013-09-08 05:05:05 [<ffffffff8116c894>] sys_newfstat+0x24/0x40
      2013-09-08 05:05:05 [<ffffffff81173e85>] ? putname+0x35/0x50
      2013-09-08 05:05:06 [<ffffffff811647e1>] ? do_sys_open+0x101/0x140
      2013-09-08 05:05:06 [<ffffffff810030f2>] system_call_fastpath+0x16/0x1b
      2013-09-08 05:05:06
      2013-09-08 05:05:06 Kernel panic - not syncing: LBUG
      2013-09-08 05:05:06 Pid: 33253, comm: arpege Tainted: G W --------------- H 2.6.32-279.5.2.bl6.Bull.36.x86_64 #1
      2013-09-08 05:05:06 Call Trace:
      2013-09-08 05:05:06 [<ffffffff81495fe3>] ? panic+0xa0/0x168
      2013-09-08 05:05:06 [<ffffffffa0397e5b>] ? lbug_with_loc+0x9b/0xb0 [libcfs]
      2013-09-08 05:05:06 [<ffffffffa0741690>] ? osc_lock_enqueue+0x0/0x810 [osc]
      2013-09-08 05:05:06 [<ffffffffa08858dd>] ? lov_lock_enqueue+0x71d/0x820 [lov]
      2013-09-08 05:05:06 [<ffffffffa04b37fc>] ? cl_enqueue_try+0xfc/0x300 [obdclass]
      2013-09-08 05:05:06 [<ffffffffa04b4bfd>] ? cl_enqueue_locked+0x6d/0x200 [obdclass]
      2013-09-08 05:05:06 [<ffffffffa04b587e>] ? cl_lock_request+0x7e/0x1c0 [obdclass]
      2013-09-08 05:05:06 [<ffffffffa09f02cf>] ? cl_glimpse_lock+0x16f/0x410 [lustre]
      2013-09-08 05:05:06 [<ffffffffa09f0a94>] ? cl_glimpse_size+0x154/0x160 [lustre]
      2013-09-08 05:05:06 [<ffffffffa09b49ae>] ? ll_inode_revalidate_it+0x5e/0x170 [lustre]
      2013-09-08 05:05:06 [<ffffffffa09b4b09>] ? ll_getattr_it+0x49/0x170 [lustre]
      2013-09-08 05:05:06 [<ffffffffa09b4c67>] ? ll_getattr+0x37/0x40 [lustre]
      2013-09-08 05:05:06 [<ffffffff8116c591>] ? vfs_getattr+0x51/0x80
      2013-09-08 05:05:06 [<ffffffff81113abf>] ? free_hot_page+0x2f/0x60
      2013-09-08 05:05:06 [<ffffffff8116c84f>] ? vfs_fstat+0x3f/0x60
      2013-09-08 05:05:06 [<ffffffff8116c894>] ? sys_newfstat+0x24/0x40
      2013-09-08 05:05:06 [<ffffffff81173e85>] ? putname+0x35/0x50
      2013-09-08 05:05:06 [<ffffffff811647e1>] ? do_sys_open+0x101/0x140
      2013-09-08 05:05:06 [<ffffffff810030f2>] ? system_call_fastpath+0x16/0x1b
      

      This looks very similar to LU-3874, but the patch suspected in that ticket is not applied to our version (it is only in master and b2_4).

      Unfortunately, we do not have any dump to analyze further.

      Thanks,
      Sebastien.

      Attachments

        Activity


          bfaccini Bruno Faccini (Inactive) added a comment:

          The patch successfully passed the standard auto-tests and the additional parallel-scale* runs that were requested.

          Sebs, any news on your side? Have you triggered new occurrences already? Have you been able to add the Lustre debug flags? Any details about the Arpege application to provide?

          bfaccini Bruno Faccini (Inactive) added a comment:

          The b2_1 backport of change #7569 from LU-3027 is at http://review.whamcloud.com/7878.

          bfaccini Bruno Faccini (Inactive) added a comment:

          Crash-dump analysis seems to indicate that the cause of the LBUG is that the glimpse lock's per-stripe lov_lock_sub.sub_got/osc_lock sub-lock structs have, respectively, neither CEF_ASYNC nor ols_glimpse set. This allows conflicts to be resolved (cll_conflict set and CLO_WAIT returned) in osc_lock_enqueue_wait(), which is not expected to happen for a glimpse lock (i.e., one with cld_enq_flags & CEF_ASYNC and ols_flags & LDLM_FL_HAS_INTENT): in lov_lock_enqueue_one() the CLO_WAIT return is suppressed without regard to cll_conflict. The LBUG/assertion then triggers in lov_lock_enqueue(), because cll_conflict is found set after the forced 0 return from lov_lock_enqueue_one().

          Based on this, I now strongly suspect that change #7569 for LU-3027, which makes lov_sublock_modify() update only the extent data in sub_got (and not cld_enq_flags!), could be the fix for this problem. I will back-port it to b2_1 and see how it works.
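          To make the failure path easier to follow, here is a minimal, self-contained sketch of the logic described in this comment. All struct fields, flag values, and helper names below are simplified assumptions modeled on the functions cited above (osc_lock_enqueue_wait(), lov_lock_enqueue_one(), lov_lock_enqueue()); they are not the actual Lustre 2.1 source.

              /* Sketch of the LU-3941 failure path; simplified, assumed types. */
              #include <assert.h>
              #include <stddef.h>

              enum { CLO_WAIT = 1 };      /* "retry later" return code */
              #define CEF_ASYNC 0x01      /* async (glimpse) enqueue flag */

              struct cl_lock {
                  struct cl_lock *cll_conflict; /* conflicting lock, if any */
                  int             ols_glimpse;  /* osc-level glimpse marker */
              };

              /* Stand-in for osc_lock_enqueue_wait(): when ols_glimpse has
               * been lost, the conflict is recorded and CLO_WAIT returned. */
              static int osc_enqueue_wait_sketch(struct cl_lock *sub,
                                                 struct cl_lock *conflict)
              {
                  if (conflict != NULL && !sub->ols_glimpse) {
                      sub->cll_conflict = conflict;
                      return CLO_WAIT;
                  }
                  return 0;
              }

              /* Stand-in for lov_lock_enqueue_one(): for an async (glimpse)
               * enqueue, CLO_WAIT is forced to 0 regardless of cll_conflict. */
              static int lov_enqueue_one_sketch(struct cl_lock *sub,
                                                unsigned flags,
                                                struct cl_lock *conflict)
              {
                  int rc = osc_enqueue_wait_sketch(sub, conflict);

                  if (rc == CLO_WAIT && (flags & CEF_ASYNC))
                      rc = 0;             /* forced 0: conflict left set */
                  return rc;
              }

              int main(void)
              {
                  struct cl_lock sub      = { NULL, 0 }; /* ols_glimpse lost */
                  struct cl_lock conflict = { NULL, 0 };

                  if (lov_enqueue_one_sketch(&sub, CEF_ASYNC, &conflict) == 0)
                      /* Equivalent of the lov_lock.c:671 LASSERT: it fires,
                       * since cll_conflict was set on the CLO_WAIT path. */
                      assert(sub.cll_conflict == NULL);
                  return 0;
              }

          Compiled and run, the assert() aborts for the same reason the LASSERT does in the ticket: the forced 0 return masks CLO_WAIT, but cll_conflict has already been populated.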

          bfaccini Bruno Faccini (Inactive) added a comment (edited):

          Oleg, thanks.

          Sebs, does this problem occur frequently on nodes running the "Arpege" app? If yes, do you have some details about it (MPI? OpenMP? I/O profile/workload? MPI-IO? ...)?

          Also, can you add some Lustre debug flags (mainly "warning dlmtrace error emerg") to get more information on the next occurrence? At the moment you seem to be running with /proc/sys/lnet/debug=0 ...

          green Oleg Drokin added a comment:

          Apparently this is also being hit in Maloo: https://maloo.whamcloud.com/sub_tests/5eac2b4e-174d-11e3-9d30-52540035b04c (on the 2.4.92 tag).


          bfaccini Bruno Faccini (Inactive) added a comment:

          Hello Sebs!
          The crash-dump log contains a bunch of hung "arpege" thread stack dumps (from previous runs), like:

          INFO: task arpege:10985 blocked for more than 120 seconds.
          "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
          arpege        D 0000000000000025     0 10985  10978 0x00000000
           ffff88045d703968 0000000000000086 ffff880419aec000 0000000000000086
           ffff880419aec000 ffff88042b25fa60 0000000000000000 ffff88042b25fa78
           ffff88035de67b10 ffff88045d703fd8 000000000000e160 ffff88035de67b10
          Call Trace:
           [<ffffffff81497c5e>] __mutex_lock_slowpath+0x14e/0x190
           [<ffffffff81497aeb>] mutex_lock+0x2b/0x50
           [<ffffffffa0545d87>] cl_lock_mutex_get+0x77/0xe0 [obdclass]
           [<ffffffffa0548107>] cl_lock_enclosure+0x187/0x230 [obdclass]
           [<ffffffffa05481f3>] cl_lock_closure_build+0x43/0x150 [obdclass]
           [<ffffffffa09729b6>] lovsub_lock_closure+0x76/0x150 [lov]
           [<ffffffffa0548230>] cl_lock_closure_build+0x80/0x150 [obdclass]
           [<ffffffffa0968d5b>] lov_sublock_lock+0x6b/0x2f0 [lov]
           [<ffffffffa096a325>] lov_lock_enqueue+0x165/0x820 [lov]
           [<ffffffffa05477fc>] cl_enqueue_try+0xfc/0x300 [obdclass]
           [<ffffffffa0548bfd>] cl_enqueue_locked+0x6d/0x200 [obdclass]
           [<ffffffffa054987e>] cl_lock_request+0x7e/0x1c0 [obdclass]
           [<ffffffffa0a212cf>] cl_glimpse_lock+0x16f/0x410 [lustre]
           [<ffffffffa0a21a94>] cl_glimpse_size+0x154/0x160 [lustre]
           [<ffffffffa09e59ae>] ll_inode_revalidate_it+0x5e/0x170 [lustre]
           [<ffffffffa09e5b09>] ll_getattr_it+0x49/0x170 [lustre]
           [<ffffffffa09e5c67>] ll_getattr+0x37/0x40 [lustre]
           [<ffffffff8116c591>] vfs_getattr+0x51/0x80
           [<ffffffff81113abf>] ? free_hot_page+0x2f/0x60
           [<ffffffff8116c84f>] vfs_fstat+0x3f/0x60
           [<ffffffff8116c894>] sys_newfstat+0x24/0x40
           [<ffffffff81173e85>] ? putname+0x35/0x50
           [<ffffffff811647e1>] ? do_sys_open+0x101/0x140
           [<ffffffff810030f2>] system_call_fastpath+0x16/0x1b
          

          and also several lock dumps like the following:

          LustreError: 42482:0:(cl_lock.c:1413:cl_unuse_try()) lock@ffff8804655ec0b8[7 4 0 4 0 00000000] P(0):[0, 18446744073709551615]@[0x200dbc17a:0x19529:0x0] {
          LustreError: 42482:0:(cl_lock.c:1413:cl_unuse_try())     vvp@ffff8802fb74e6f0:
          LustreError: 42482:0:(cl_lock.c:1413:cl_unuse_try())     lov@ffff880465f7ca20: 4
          LustreError: 42482:0:(cl_lock.c:1413:cl_unuse_try())     0 0: ---
          LustreError: 42482:0:(cl_lock.c:1413:cl_unuse_try())     1 0: lock@ffff88084a564a58[0 5 0 0 0 00000000] R(1):[0, 18446744073709551615]@[0x100260000:0x31d165b:0x0] {
          LustreError: 42482:0:(cl_lock.c:1413:cl_unuse_try())     lovsub@ffff88086678baa0: [1 ffff880465f7ca20 P(0):[0, 18446744073709551615]@[0x200dbc17a:0x19529:0x0]]
          LustreError: 42482:0:(cl_lock.c:1413:cl_unuse_try())     osc@ffff880869f0eee8: ffff88086a61f000 00101001 0x75459d96781d6cfc 4 (null) size: 2097152 mtime: 1378913901 atime: 1378913901 ctime: 1378913901 blocks: 4096
          LustreError: 42482:0:(cl_lock.c:1413:cl_unuse_try()) } lock@ffff88084a564a58
          LustreError: 42482:0:(cl_lock.c:1413:cl_unuse_try())     2 0: lock@ffff8806b82be7b8[0 5 0 0 0 00000000] R(1):[0, 18446744073709551615]@[0x100160000:0x31d36ce:0x0] {
          LustreError: 42482:0:(cl_lock.c:1413:cl_unuse_try())     lovsub@ffff88087991c960: [2 ffff880465f7ca20 P(0):[0, 18446744073709551615]@[0x200dbc17a:0x19529:0x0]]
          LustreError: 42482:0:(cl_lock.c:1413:cl_unuse_try())     osc@ffff880869f0e368: ffff8807f14e3d80 00101001 0x75459d96781d6d03 4 (null) size: 2097152 mtime: 1378913901 atime: 1378913901 ctime: 1378913901 blocks: 4096
          LustreError: 42482:0:(cl_lock.c:1413:cl_unuse_try()) } lock@ffff8806b82be7b8
          LustreError: 42482:0:(cl_lock.c:1413:cl_unuse_try())     3 0: lock@ffff880869fcb078[0 5 0 0 0 00000000] R(1):[0, 18446744073709551615]@[0x100100000:0x31d431c:0x0] {
          LustreError: 42482:0:(cl_lock.c:1413:cl_unuse_try())     lovsub@ffff88068ae36a20: [3 ffff880465f7ca20 P(0):[0, 18446744073709551615]@[0x200dbc17a:0x19529:0x0]]
          LustreError: 42482:0:(cl_lock.c:1413:cl_unuse_try())     osc@ffff880869f0e700: ffff88068ae3ad80 00101001 0x75459d96781d6d0a 4 (null) size: 2097152 mtime: 1378913901 atime: 1378913901 ctime: 1378913901 blocks: 4096
          LustreError: 42482:0:(cl_lock.c:1413:cl_unuse_try()) } lock@ffff880869fcb078
          LustreError: 42482:0:(cl_lock.c:1413:cl_unuse_try())
          LustreError: 42482:0:(cl_lock.c:1413:cl_unuse_try()) } lock@ffff8804655ec0b8
          LustreError: 42482:0:(cl_lock.c:1413:cl_unuse_try()) unuse return -108
          

          Since these stacks look similar to the panic stack for this problem (and to the stacks of other threads racing on the same [sub-]lock structs), it is a bit suspect, and I wonder whether this is a usual behavior/symptom on site?


          bfaccini Bruno Faccini (Inactive) added a comment:

          Hello Seb,
          So you finally got another occurrence with a valid crash-dump?
          I will look into it and update you soon.
          Thanks for your help, bye.

          spiechurski Sebastien Piechurski added a comment:

          Oleg,
          In case this helps confirm your thoughts, I am currently uploading a dump with the associated kernel and modules to the FTP server, in uploads/LU-3941.
          Please ignore the vmcore file, as the transfer was interrupted and I am unable to resume it; use the compressed vmcore.xz instead.

          sebastien.buisson Sebastien Buisson (Inactive) added a comment:

          Hi,

          I am not sure I understand what we should do on our Lustre 2.1 here: integrate patch http://review.whamcloud.com/#/c/7569/2? And for what purpose, to fix the issue or to ease its diagnosis?

          TIA,
          Sebastien.
          green Oleg Drokin added a comment:

          I believe this is all the evidence we need to confirm the long-suspected fact that LU-3874 is a long-standing issue that the LU-3027 patch just made easier to hit.

          As such, patch http://review.whamcloud.com/#/c/7569/2 should help here too.

          We should probably close this as a duplicate of LU-3874 once that is confirmed.
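          In rough outline, the effect attributed to change #7569 in the crash-dump analysis above can be sketched as follows. The descriptor type and field names here are simplified assumptions; the authoritative version is the patch itself at http://review.whamcloud.com/#/c/7569/2.

              /* Simplified stand-in for the sub-lock descriptor (sub_got). */
              struct lock_descr_sketch {
                  unsigned long start;     /* extent start   */
                  unsigned long end;       /* extent end     */
                  unsigned      enq_flags; /* CEF_ASYNC etc. */
              };

              /* Stand-in for lov_sublock_modify() updating sub_got.  Before
               * the fix, a whole-descriptor copy clobbered enq_flags, dropping
               * CEF_ASYNC from glimpse sub-locks; with the fix, only the
               * extent data is copied and the enqueue flags are preserved. */
              static void sublock_modify_sketch(struct lock_descr_sketch *sub_got,
                                                const struct lock_descr_sketch *updated)
              {
                  sub_got->start = updated->start;
                  sub_got->end   = updated->end;
                  /* sub_got->enq_flags intentionally left untouched. */
              }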


          bfaccini Bruno Faccini (Inactive) added a comment:

          Hello Seb,
          Too bad about the lack of a crash-dump ...
          Also, what about the impacted application/process "arpege": is it a multi-[process,thread,node] application?

          People

            Assignee: bfaccini Bruno Faccini (Inactive)
            Reporter: sebastien.buisson Sebastien Buisson (Inactive)
            Votes: 0
            Watchers: 6
