[LU-3941] (lov_lock.c:671:lov_lock_enqueue()) ASSERTION( sublock->cll_conflict == ((void *)0) ) failed

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Major
    • Fix Version/s: None
    • Affects Version/s: Lustre 2.1.6
    • Severity: 3
    • Rank: 10424

    Description

      An LBUG was hit at MeteoFrance, running Lustre 2.1.6, with the following messages:

      <ConMan> Console [beaufix1034] log at 2013-09-08 04:00:00 UTC.
      2013-09-08 04:03:18 LustreError: 11-0: an error occurred while communicating with 172.23.80.66@o2ib1. The obd_ping operation failed with -107
      2013-09-08 04:03:18 Lustre: scratch-OST0014-osc-ffff880464f5a400: Connection to scratch-OST0014 (at 172.23.80.66@o2ib1) was lost; in progress operations using this service will wait for recovery to complete
      2013-09-08 04:03:18 LustreError: 167-0: This client was evicted by scratch-OST0014; in progress operations using this service will fail.
      2013-09-08 04:03:18 LustreError: 31914:0:(ldlm_resource.c:751:ldlm_resource_complain()) Namespace scratch-OST0014-osc-ffff880464f5a400 resource refcount nonzero (1) after lock cleanup; forcing cleanup.
      2013-09-08 04:03:18 LustreError: 31914:0:(ldlm_resource.c:757:ldlm_resource_complain()) Resource: ffff88016d5fbe40 (48993688/0/0/0) (rc: 1)
      2013-09-08 04:03:18 Lustre: scratch-OST0014-osc-ffff880464f5a400: Connection restored to scratch-OST0014 (at 172.23.80.66@o2ib1)
      2013-09-08 04:19:08 LustreError: 11-0: an error occurred while communicating with 172.23.80.69@o2ib1. The obd_ping operation failed with -107
      2013-09-08 04:19:08 Lustre: scratch-OST001f-osc-ffff880464f5a400: Connection to scratch-OST001f (at 172.23.80.69@o2ib1) was lost; in progress operations using this service will wait for recovery to complete
      2013-09-08 04:19:08 LustreError: 167-0: This client was evicted by scratch-OST001f; in progress operations using this service will fail.
      2013-09-08 04:19:08 LustreError: 32128:0:(ldlm_resource.c:751:ldlm_resource_complain()) Namespace scratch-OST001f-osc-ffff880464f5a400 resource refcount nonzero (1) after lock cleanup; forcing cleanup.
      2013-09-08 04:19:08 LustreError: 32128:0:(ldlm_resource.c:757:ldlm_resource_complain()) Resource: ffff88047ba341c0 (48992886/0/0/0) (rc: 1)
      2013-09-08 04:19:08 Lustre: scratch-OST001f-osc-ffff880464f5a400: Connection restored to scratch-OST001f (at 172.23.80.69@o2ib1)
      2013-09-08 04:29:08 LustreError: 11-0: an error occurred while communicating with 172.23.64.63@o2ib. The obd_ping operation failed with -107
      2013-09-08 04:29:08 Lustre: scratch-OST0004-osc-ffff880464f5a400: Connection to scratch-OST0004 (at 172.23.64.63@o2ib) was lost; in progress operations using this service will wait for recovery to complete
      2013-09-08 04:29:08 LustreError: 167-0: This client was evicted by scratch-OST0004; in progress operations using this service will fail.
      2013-09-08 04:29:08 LustreError: 32232:0:(ldlm_resource.c:751:ldlm_resource_complain()) Namespace scratch-OST0004-osc-ffff880464f5a400 resource refcount nonzero (1) after lock cleanup; forcing cleanup.
      2013-09-08 04:29:08 LustreError: 32232:0:(ldlm_resource.c:757:ldlm_resource_complain()) Resource: ffff880812ed6bc0 (48991503/0/0/0) (rc: 1)
      2013-09-08 04:29:08 Lustre: scratch-OST0004-osc-ffff880464f5a400: Connection restored to scratch-OST0004 (at 172.23.64.63@o2ib)
      
      <ConMan> Console [beaufix1034] log at 2013-09-08 05:00:00 UTC.
      2013-09-08 05:04:58 LustreError: 11-0: an error occurred while communicating with 172.23.80.68@o2ib1. The obd_ping operation failed with -107
      2013-09-08 05:04:58 Lustre: scratch-OST002b-osc-ffff880464f5a400: Connection to scratch-OST002b (at 172.23.80.68@o2ib1) was lost; in progress operations using this service will wait for recovery to complete
      2013-09-08 05:04:58 LustreError: 167-0: This client was evicted by scratch-OST002b; in progress operations using this service will fail.
      2013-09-08 05:04:58 LustreError: 33732:0:(ldlm_resource.c:751:ldlm_resource_complain()) Namespace scratch-OST002b-osc-ffff880464f5a400 resource refcount nonzero (1) after lock cleanup; forcing cleanup.
      2013-09-08 05:04:58 LustreError: 33732:0:(ldlm_resource.c:757:ldlm_resource_complain()) Resource: ffff880320afb480 (48996861/0/0/0) (rc: 1)
      2013-09-08 05:04:59 Lustre: scratch-OST002b-osc-ffff880464f5a400: Connection restored to scratch-OST002b (at 172.23.80.68@o2ib1)
      2013-09-08 05:05:05 LustreError: 33253:0:(lov_lock.c:671:lov_lock_enqueue()) ASSERTION( sublock->cll_conflict == ((void *)0) ) failed:
      2013-09-08 05:05:05 LustreError: 33253:0:(lov_lock.c:671:lov_lock_enqueue()) LBUG
      2013-09-08 05:05:05 Pid: 33253, comm: arpege
      2013-09-08 05:05:05
      2013-09-08 05:05:05 Call Trace:
      2013-09-08 05:05:05 [<ffffffffa03977f5>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
      2013-09-08 05:05:05 [<ffffffffa0397e07>] lbug_with_loc+0x47/0xb0 [libcfs]
      2013-09-08 05:05:05 [<ffffffffa0741690>] ? osc_lock_enqueue+0x0/0x810 [osc]
      2013-09-08 05:05:05 [<ffffffffa08858dd>] lov_lock_enqueue+0x71d/0x820 [lov]
      2013-09-08 05:05:05 [<ffffffffa04b37fc>] cl_enqueue_try+0xfc/0x300 [obdclass]
      2013-09-08 05:05:05 [<ffffffffa04b4bfd>] cl_enqueue_locked+0x6d/0x200 [obdclass]
      2013-09-08 05:05:05 [<ffffffffa04b587e>] cl_lock_request+0x7e/0x1c0 [obdclass]
      2013-09-08 05:05:05 [<ffffffffa09f02cf>] cl_glimpse_lock+0x16f/0x410 [lustre]
      2013-09-08 05:05:05 [<ffffffffa09f0a94>] cl_glimpse_size+0x154/0x160 [lustre]
      2013-09-08 05:05:05 [<ffffffffa09b49ae>] ll_inode_revalidate_it+0x5e/0x170 [lustre]
      2013-09-08 05:05:05 [<ffffffffa09b4b09>] ll_getattr_it+0x49/0x170 [lustre]
      2013-09-08 05:05:05 [<ffffffffa09b4c67>] ll_getattr+0x37/0x40 [lustre]
      2013-09-08 05:05:05 [<ffffffff8116c591>] vfs_getattr+0x51/0x80
      2013-09-08 05:05:05 [<ffffffff81113abf>] ? free_hot_page+0x2f/0x60
      2013-09-08 05:05:05 [<ffffffff8116c84f>] vfs_fstat+0x3f/0x60
      2013-09-08 05:05:05 [<ffffffff8116c894>] sys_newfstat+0x24/0x40
      2013-09-08 05:05:05 [<ffffffff81173e85>] ? putname+0x35/0x50
      2013-09-08 05:05:06 [<ffffffff811647e1>] ? do_sys_open+0x101/0x140
      2013-09-08 05:05:06 [<ffffffff810030f2>] system_call_fastpath+0x16/0x1b
      2013-09-08 05:05:06
      2013-09-08 05:05:06 Kernel panic - not syncing: LBUG
      2013-09-08 05:05:06 Pid: 33253, comm: arpege Tainted: G W --------------- H 2.6.32-279.5.2.bl6.Bull.36.x86_64 #1
      2013-09-08 05:05:06 Call Trace:
      2013-09-08 05:05:06 [<ffffffff81495fe3>] ? panic+0xa0/0x168
      2013-09-08 05:05:06 [<ffffffffa0397e5b>] ? lbug_with_loc+0x9b/0xb0 [libcfs]
      2013-09-08 05:05:06 [<ffffffffa0741690>] ? osc_lock_enqueue+0x0/0x810 [osc]
      2013-09-08 05:05:06 [<ffffffffa08858dd>] ? lov_lock_enqueue+0x71d/0x820 [lov]
      2013-09-08 05:05:06 [<ffffffffa04b37fc>] ? cl_enqueue_try+0xfc/0x300 [obdclass]
      2013-09-08 05:05:06 [<ffffffffa04b4bfd>] ? cl_enqueue_locked+0x6d/0x200 [obdclass]
      2013-09-08 05:05:06 [<ffffffffa04b587e>] ? cl_lock_request+0x7e/0x1c0 [obdclass]
      2013-09-08 05:05:06 [<ffffffffa09f02cf>] ? cl_glimpse_lock+0x16f/0x410 [lustre]
      2013-09-08 05:05:06 [<ffffffffa09f0a94>] ? cl_glimpse_size+0x154/0x160 [lustre]
      2013-09-08 05:05:06 [<ffffffffa09b49ae>] ? ll_inode_revalidate_it+0x5e/0x170 [lustre]
      2013-09-08 05:05:06 [<ffffffffa09b4b09>] ? ll_getattr_it+0x49/0x170 [lustre]
      2013-09-08 05:05:06 [<ffffffffa09b4c67>] ? ll_getattr+0x37/0x40 [lustre]
      2013-09-08 05:05:06 [<ffffffff8116c591>] ? vfs_getattr+0x51/0x80
      2013-09-08 05:05:06 [<ffffffff81113abf>] ? free_hot_page+0x2f/0x60
      2013-09-08 05:05:06 [<ffffffff8116c84f>] ? vfs_fstat+0x3f/0x60
      2013-09-08 05:05:06 [<ffffffff8116c894>] ? sys_newfstat+0x24/0x40
      2013-09-08 05:05:06 [<ffffffff81173e85>] ? putname+0x35/0x50
      2013-09-08 05:05:06 [<ffffffff811647e1>] ? do_sys_open+0x101/0x140
      2013-09-08 05:05:06 [<ffffffff810030f2>] ? system_call_fastpath+0x16/0x1b
      

      This looks very similar to LU-3874, but the patch suspected in that ticket is not applied to our version (it is only in master and b2_4).

      Unfortunately, we do not have any dump to analyze further.

      Thanks,
      Sebastien.

      Attachments

        Activity


          bfaccini Bruno Faccini (Inactive) added a comment:

          The patch successfully passed the standard auto-tests and the additional parallel-scale* runs that were requested.

          Sebs, any news on your side? Have you triggered new occurrences already? Have you been able to add the Lustre debug flags? Any details about the Arpege application to provide?

          bfaccini Bruno Faccini (Inactive) added a comment:

          The b2_1 backport of change #7569 from LU-3027 is at http://review.whamcloud.com/7878.

          bfaccini Bruno Faccini (Inactive) added a comment:

          Crash-dump analysis seems to indicate that the cause of the LBUG is that the glimpse lock's per-stripe lov_lock_sub.sub_got/osc_lock sub-lock structs have, respectively, neither CEF_ASYNC nor ols_glimpse set. This allows conflicts to be resolved (cll_conflict set and CLO_WAIT returned) in osc_lock_enqueue_wait(), which is not expected to happen for a glimpse lock (i.e., one with cld_enq_flags & CEF_ASYNC and ols_flags & LDLM_FL_HAS_INTENT): in lov_lock_enqueue_one() the CLO_WAIT return is suppressed without regard to cll_conflict. The LBUG/assertion then triggers in lov_lock_enqueue(), because cll_conflict is found set after the forced 0 return from lov_lock_enqueue_one().

          Based on this, I now strongly suspect that change #7569 for LU-3027, which makes lov_sublock_modify() update only the extent data in sub_got (and not cld_enq_flags!), could be the fix for this problem. I will back-port it to b2_1 and see how it works.
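          To make the failure path easier to follow, here is a minimal, self-contained sketch of the logic described in this comment. All struct fields, flag values, and helper names below are simplified assumptions modeled on the functions cited above (osc_lock_enqueue_wait(), lov_lock_enqueue_one(), lov_lock_enqueue()); they are not the actual Lustre 2.1 source.

              /* Sketch of the LU-3941 failure path; simplified, assumed types. */
              #include <assert.h>
              #include <stddef.h>

              enum { CLO_WAIT = 1 };      /* "retry later" return code */
              #define CEF_ASYNC 0x01      /* async (glimpse) enqueue flag */

              struct cl_lock {
                  struct cl_lock *cll_conflict; /* conflicting lock, if any */
                  int             ols_glimpse;  /* osc-level glimpse marker */
              };

              /* Stand-in for osc_lock_enqueue_wait(): when ols_glimpse has
               * been lost, the conflict is recorded and CLO_WAIT returned. */
              static int osc_enqueue_wait_sketch(struct cl_lock *sub,
                                                 struct cl_lock *conflict)
              {
                  if (conflict != NULL && !sub->ols_glimpse) {
                      sub->cll_conflict = conflict;
                      return CLO_WAIT;
                  }
                  return 0;
              }

              /* Stand-in for lov_lock_enqueue_one(): for an async (glimpse)
               * enqueue, CLO_WAIT is forced to 0 regardless of cll_conflict. */
              static int lov_enqueue_one_sketch(struct cl_lock *sub,
                                                unsigned flags,
                                                struct cl_lock *conflict)
              {
                  int rc = osc_enqueue_wait_sketch(sub, conflict);

                  if (rc == CLO_WAIT && (flags & CEF_ASYNC))
                      rc = 0;             /* forced 0: conflict left set */
                  return rc;
              }

              int main(void)
              {
                  struct cl_lock sub      = { NULL, 0 }; /* ols_glimpse lost */
                  struct cl_lock conflict = { NULL, 0 };

                  if (lov_enqueue_one_sketch(&sub, CEF_ASYNC, &conflict) == 0)
                      /* Equivalent of the lov_lock.c:671 LASSERT: it fires,
                       * since cll_conflict was set on the CLO_WAIT path. */
                      assert(sub.cll_conflict == NULL);
                  return 0;
              }

          Compiled and run, the assert() aborts for the same reason the LASSERT does in the ticket: the forced 0 return masks CLO_WAIT, but cll_conflict has already been populated.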

          bfaccini Bruno Faccini (Inactive) added a comment (edited):

          Oleg, thanks.

          Sebs, does this problem occur frequently on nodes running the "Arpege" app? If yes, do you have some details about it (MPI? OpenMP? I/O profile/workload? MPI-IO? ...)?

          Also, can you add some Lustre debug flags (mainly "warning dlmtrace error emerg") to get more information on the next occurrence? At the moment you seem to be running with /proc/sys/lnet/debug=0 ...

          green Oleg Drokin added a comment:

          Apparently this is also being hit in Maloo: https://maloo.whamcloud.com/sub_tests/5eac2b4e-174d-11e3-9d30-52540035b04c (on the 2.4.92 tag).


          bfaccini Bruno Faccini (Inactive) added a comment:

          Hello Sebs!
          The crash-dump log contains a bunch of hung "arpege" thread stack dumps (from previous runs), like:

          INFO: task arpege:10985 blocked for more than 120 seconds.
          "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
          arpege        D 0000000000000025     0 10985  10978 0x00000000
           ffff88045d703968 0000000000000086 ffff880419aec000 0000000000000086
           ffff880419aec000 ffff88042b25fa60 0000000000000000 ffff88042b25fa78
           ffff88035de67b10 ffff88045d703fd8 000000000000e160 ffff88035de67b10
          Call Trace:
           [<ffffffff81497c5e>] __mutex_lock_slowpath+0x14e/0x190
           [<ffffffff81497aeb>] mutex_lock+0x2b/0x50
           [<ffffffffa0545d87>] cl_lock_mutex_get+0x77/0xe0 [obdclass]
           [<ffffffffa0548107>] cl_lock_enclosure+0x187/0x230 [obdclass]
           [<ffffffffa05481f3>] cl_lock_closure_build+0x43/0x150 [obdclass]
           [<ffffffffa09729b6>] lovsub_lock_closure+0x76/0x150 [lov]
           [<ffffffffa0548230>] cl_lock_closure_build+0x80/0x150 [obdclass]
           [<ffffffffa0968d5b>] lov_sublock_lock+0x6b/0x2f0 [lov]
           [<ffffffffa096a325>] lov_lock_enqueue+0x165/0x820 [lov]
           [<ffffffffa05477fc>] cl_enqueue_try+0xfc/0x300 [obdclass]
           [<ffffffffa0548bfd>] cl_enqueue_locked+0x6d/0x200 [obdclass]
           [<ffffffffa054987e>] cl_lock_request+0x7e/0x1c0 [obdclass]
           [<ffffffffa0a212cf>] cl_glimpse_lock+0x16f/0x410 [lustre]
           [<ffffffffa0a21a94>] cl_glimpse_size+0x154/0x160 [lustre]
           [<ffffffffa09e59ae>] ll_inode_revalidate_it+0x5e/0x170 [lustre]
           [<ffffffffa09e5b09>] ll_getattr_it+0x49/0x170 [lustre]
           [<ffffffffa09e5c67>] ll_getattr+0x37/0x40 [lustre]
           [<ffffffff8116c591>] vfs_getattr+0x51/0x80
           [<ffffffff81113abf>] ? free_hot_page+0x2f/0x60
           [<ffffffff8116c84f>] vfs_fstat+0x3f/0x60
           [<ffffffff8116c894>] sys_newfstat+0x24/0x40
           [<ffffffff81173e85>] ? putname+0x35/0x50
           [<ffffffff811647e1>] ? do_sys_open+0x101/0x140
           [<ffffffff810030f2>] system_call_fastpath+0x16/0x1b
          

          and also several lock dumps like the following:

          LustreError: 42482:0:(cl_lock.c:1413:cl_unuse_try()) lock@ffff8804655ec0b8[7 4 0 4 0 00000000] P(0):[0, 18446744073709551615]@[0x200dbc17a:0x19529:0x0] {
          LustreError: 42482:0:(cl_lock.c:1413:cl_unuse_try())     vvp@ffff8802fb74e6f0:
          LustreError: 42482:0:(cl_lock.c:1413:cl_unuse_try())     lov@ffff880465f7ca20: 4
          LustreError: 42482:0:(cl_lock.c:1413:cl_unuse_try())     0 0: ---
          LustreError: 42482:0:(cl_lock.c:1413:cl_unuse_try())     1 0: lock@ffff88084a564a58[0 5 0 0 0 00000000] R(1):[0, 18446744073709551615]@[0x100260000:0x31d165b:0x0] {
          LustreError: 42482:0:(cl_lock.c:1413:cl_unuse_try())     lovsub@ffff88086678baa0: [1 ffff880465f7ca20 P(0):[0, 18446744073709551615]@[0x200dbc17a:0x19529:0x0]]
          LustreError: 42482:0:(cl_lock.c:1413:cl_unuse_try())     osc@ffff880869f0eee8: ffff88086a61f000 00101001 0x75459d96781d6cfc 4 (null) size: 2097152 mtime: 1378913901 atime: 1378913901 ctime: 1378913901 blocks: 4096
          LustreError: 42482:0:(cl_lock.c:1413:cl_unuse_try()) } lock@ffff88084a564a58
          LustreError: 42482:0:(cl_lock.c:1413:cl_unuse_try())     2 0: lock@ffff8806b82be7b8[0 5 0 0 0 00000000] R(1):[0, 18446744073709551615]@[0x100160000:0x31d36ce:0x0] {
          LustreError: 42482:0:(cl_lock.c:1413:cl_unuse_try())     lovsub@ffff88087991c960: [2 ffff880465f7ca20 P(0):[0, 18446744073709551615]@[0x200dbc17a:0x19529:0x0]]
          LustreError: 42482:0:(cl_lock.c:1413:cl_unuse_try())     osc@ffff880869f0e368: ffff8807f14e3d80 00101001 0x75459d96781d6d03 4 (null) size: 2097152 mtime: 1378913901 atime: 1378913901 ctime: 1378913901 blocks: 4096
          LustreError: 42482:0:(cl_lock.c:1413:cl_unuse_try()) } lock@ffff8806b82be7b8
          LustreError: 42482:0:(cl_lock.c:1413:cl_unuse_try())     3 0: lock@ffff880869fcb078[0 5 0 0 0 00000000] R(1):[0, 18446744073709551615]@[0x100100000:0x31d431c:0x0] {
          LustreError: 42482:0:(cl_lock.c:1413:cl_unuse_try())     lovsub@ffff88068ae36a20: [3 ffff880465f7ca20 P(0):[0, 18446744073709551615]@[0x200dbc17a:0x19529:0x0]]
          LustreError: 42482:0:(cl_lock.c:1413:cl_unuse_try())     osc@ffff880869f0e700: ffff88068ae3ad80 00101001 0x75459d96781d6d0a 4 (null) size: 2097152 mtime: 1378913901 atime: 1378913901 ctime: 1378913901 blocks: 4096
          LustreError: 42482:0:(cl_lock.c:1413:cl_unuse_try()) } lock@ffff880869fcb078
          LustreError: 42482:0:(cl_lock.c:1413:cl_unuse_try())
          LustreError: 42482:0:(cl_lock.c:1413:cl_unuse_try()) } lock@ffff8804655ec0b8
          LustreError: 42482:0:(cl_lock.c:1413:cl_unuse_try()) unuse return -108
          

          Since these stacks look similar to the panic stack for this problem (and to the stacks of other threads racing on the same [sub-]lock structs), it is a bit suspect, and I wonder whether this is a usual behavior/symptom on site?


          bfaccini Bruno Faccini (Inactive) added a comment:

          Hello Seb,
          So you finally got another occurrence with a valid crash-dump?
          I will look into it and update you soon.
          Thanks for your help, bye.

          spiechurski Sebastien Piechurski added a comment:

          Oleg,
          In case this helps confirm your thoughts, I am currently uploading a dump with the associated kernel and modules to the FTP server, in uploads/LU-3941.
          Please ignore the vmcore file, as the transfer was interrupted and I am unable to resume it; use the compressed vmcore.xz instead.

          sebastien.buisson Sebastien Buisson (Inactive) added a comment:

          Hi,

          I am not sure I understand what we should do on our Lustre 2.1 here: integrate patch http://review.whamcloud.com/#/c/7569/2? And for what purpose, to fix the issue or to ease its diagnosis?

          TIA,
          Sebastien.
          green Oleg Drokin added a comment:

          I believe this is all the evidence we need to confirm the long-suspected fact that LU-3874 is a long-standing issue that the LU-3027 patch just made easier to hit.

          As such, patch http://review.whamcloud.com/#/c/7569/2 should help here too.

          We should probably close this as a duplicate of LU-3874 once that is confirmed.
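          In rough outline, the effect attributed to change #7569 in the crash-dump analysis above can be sketched as follows. The descriptor type and field names here are simplified assumptions; the authoritative version is the patch itself at http://review.whamcloud.com/#/c/7569/2.

              /* Simplified stand-in for the sub-lock descriptor (sub_got). */
              struct lock_descr_sketch {
                  unsigned long start;     /* extent start   */
                  unsigned long end;       /* extent end     */
                  unsigned      enq_flags; /* CEF_ASYNC etc. */
              };

              /* Stand-in for lov_sublock_modify() updating sub_got.  Before
               * the fix, a whole-descriptor copy clobbered enq_flags, dropping
               * CEF_ASYNC from glimpse sub-locks; with the fix, only the
               * extent data is copied and the enqueue flags are preserved. */
              static void sublock_modify_sketch(struct lock_descr_sketch *sub_got,
                                                const struct lock_descr_sketch *updated)
              {
                  sub_got->start = updated->start;
                  sub_got->end   = updated->end;
                  /* sub_got->enq_flags intentionally left untouched. */
              }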


          bfaccini Bruno Faccini (Inactive) added a comment:

          Hello Seb,
          Too bad about the lack of a crash-dump ...
          Also, what about the impacted application/process "arpege": is it a multi-[process,thread,node] application?

          People

            Assignee: bfaccini Bruno Faccini (Inactive)
            Reporter: sebastien.buisson Sebastien Buisson (Inactive)
            Votes: 0
            Watchers: 6
