Uploaded image for project: 'Lustre'
  1. Lustre
  2. LU-3941

(lov_lock.c:671:lov_lock_enqueue()) ASSERTION( sublock->cll_conflict == ((void *)0) ) failed

    XMLWordPrintable

Details

    • Bug
    • Resolution: Fixed
    • Major
    • None
    • Lustre 2.1.6
    • 3
    • 10424

    Description

      A LBUG was hit at MeteoFrance running Lustre 2.1.6 with the following messages:

      <ConMan> Console [beaufix1034] log at 2013-09-08 04:00:00 UTC.
      2013-09-08 04:03:18 LustreError: 11-0: an error occurred while communicating with 172.23.80.66@o2ib1. The obd_ping operation failed with -107
      2013-09-08 04:03:18 Lustre: scratch-OST0014-osc-ffff880464f5a400: Connection to scratch-OST0014 (at 172.23.80.66@o2ib1) was lost; in progress operations using this service will wait for recovery to complete
      2013-09-08 04:03:18 LustreError: 167-0: This client was evicted by scratch-OST0014; in progress operations using this service will fail.
      2013-09-08 04:03:18 LustreError: 31914:0:(ldlm_resource.c:751:ldlm_resource_complain()) Namespace scratch-OST0014-osc-ffff880464f5a400 resource refcount nonzero (1) after lock cleanup; forcing cleanup.
      2013-09-08 04:03:18 LustreError: 31914:0:(ldlm_resource.c:757:ldlm_resource_complain()) Resource: ffff88016d5fbe40 (48993688/0/0/0) (rc: 1)
      2013-09-08 04:03:18 Lustre: scratch-OST0014-osc-ffff880464f5a400: Connection restored to scratch-OST0014 (at 172.23.80.66@o2ib1)
      2013-09-08 04:19:08 LustreError: 11-0: an error occurred while communicating with 172.23.80.69@o2ib1. The obd_ping operation failed with -107
      2013-09-08 04:19:08 Lustre: scratch-OST001f-osc-ffff880464f5a400: Connection to scratch-OST001f (at 172.23.80.69@o2ib1) was lost; in progress operations using this service will wait for recovery to complete
      2013-09-08 04:19:08 LustreError: 167-0: This client was evicted by scratch-OST001f; in progress operations using this service will fail.
      2013-09-08 04:19:08 LustreError: 32128:0:(ldlm_resource.c:751:ldlm_resource_complain()) Namespace scratch-OST001f-osc-ffff880464f5a400 resource refcount nonzero (1) after lock cleanup; forcing cleanup.
      2013-09-08 04:19:08 LustreError: 32128:0:(ldlm_resource.c:757:ldlm_resource_complain()) Resource: ffff88047ba341c0 (48992886/0/0/0) (rc: 1)
      2013-09-08 04:19:08 Lustre: scratch-OST001f-osc-ffff880464f5a400: Connection restored to scratch-OST001f (at 172.23.80.69@o2ib1)
      2013-09-08 04:29:08 LustreError: 11-0: an error occurred while communicating with 172.23.64.63@o2ib. The obd_ping operation failed with -107
      2013-09-08 04:29:08 Lustre: scratch-OST0004-osc-ffff880464f5a400: Connection to scratch-OST0004 (at 172.23.64.63@o2ib) was lost; in progress operations using this service will wait for recovery to complete
      2013-09-08 04:29:08 LustreError: 167-0: This client was evicted by scratch-OST0004; in progress operations using this service will fail.
      2013-09-08 04:29:08 LustreError: 32232:0:(ldlm_resource.c:751:ldlm_resource_complain()) Namespace scratch-OST0004-osc-ffff880464f5a400 resource refcount nonzero (1) after lock cleanup; forcing cleanup.
      2013-09-08 04:29:08 LustreError: 32232:0:(ldlm_resource.c:757:ldlm_resource_complain()) Resource: ffff880812ed6bc0 (48991503/0/0/0) (rc: 1)
      2013-09-08 04:29:08 Lustre: scratch-OST0004-osc-ffff880464f5a400: Connection restored to scratch-OST0004 (at 172.23.64.63@o2ib)
      
      <ConMan> Console [beaufix1034] log at 2013-09-08 05:00:00 UTC.
      2013-09-08 05:04:58 LustreError: 11-0: an error occurred while communicating with 172.23.80.68@o2ib1. The obd_ping operation failed with -107
      2013-09-08 05:04:58 Lustre: scratch-OST002b-osc-ffff880464f5a400: Connection to scratch-OST002b (at 172.23.80.68@o2ib1) was lost; in progress operations using this service will wait for recovery to complete
      2013-09-08 05:04:58 LustreError: 167-0: This client was evicted by scratch-OST002b; in progress operations using this service will fail.
      2013-09-08 05:04:58 LustreError: 33732:0:(ldlm_resource.c:751:ldlm_resource_complain()) Namespace scratch-OST002b-osc-ffff880464f5a400 resource refcount nonzero (1) after lock cleanup; forcing cleanup.
      2013-09-08 05:04:58 LustreError: 33732:0:(ldlm_resource.c:757:ldlm_resource_complain()) Resource: ffff880320afb480 (48996861/0/0/0) (rc: 1)
      2013-09-08 05:04:59 Lustre: scratch-OST002b-osc-ffff880464f5a400: Connection restored to scratch-OST002b (at 172.23.80.68@o2ib1)
      2013-09-08 05:05:05 LustreError: 33253:0:(lov_lock.c:671:lov_lock_enqueue()) ASSERTION( sublock->cll_conflict == ((void *)0) ) failed:
      2013-09-08 05:05:05 LustreError: 33253:0:(lov_lock.c:671:lov_lock_enqueue()) LBUG
      2013-09-08 05:05:05 Pid: 33253, comm: arpege
      2013-09-08 05:05:05
      2013-09-08 05:05:05 Call Trace:
      2013-09-08 05:05:05 [<ffffffffa03977f5>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
      2013-09-08 05:05:05 [<ffffffffa0397e07>] lbug_with_loc+0x47/0xb0 [libcfs]
      2013-09-08 05:05:05 [<ffffffffa0741690>] ? osc_lock_enqueue+0x0/0x810 [osc]
      2013-09-08 05:05:05 [<ffffffffa08858dd>] lov_lock_enqueue+0x71d/0x820 [lov]
      2013-09-08 05:05:05 [<ffffffffa04b37fc>] cl_enqueue_try+0xfc/0x300 [obdclass]
      2013-09-08 05:05:05 [<ffffffffa04b4bfd>] cl_enqueue_locked+0x6d/0x200 [obdclass]
      2013-09-08 05:05:05 [<ffffffffa04b587e>] cl_lock_request+0x7e/0x1c0 [obdclass]
      2013-09-08 05:05:05 [<ffffffffa09f02cf>] cl_glimpse_lock+0x16f/0x410 [lustre]
      2013-09-08 05:05:05 [<ffffffffa09f0a94>] cl_glimpse_size+0x154/0x160 [lustre]
      2013-09-08 05:05:05 [<ffffffffa09b49ae>] ll_inode_revalidate_it+0x5e/0x170 [lustre]
      2013-09-08 05:05:05 [<ffffffffa09b4b09>] ll_getattr_it+0x49/0x170 [lustre]
      2013-09-08 05:05:05 [<ffffffffa09b4c67>] ll_getattr+0x37/0x40 [lustre]
      2013-09-08 05:05:05 [<ffffffff8116c591>] vfs_getattr+0x51/0x80
      2013-09-08 05:05:05 [<ffffffff81113abf>] ? free_hot_page+0x2f/0x60
      2013-09-08 05:05:05 [<ffffffff8116c84f>] vfs_fstat+0x3f/0x60
      2013-09-08 05:05:05 [<ffffffff8116c894>] sys_newfstat+0x24/0x40
      2013-09-08 05:05:05 [<ffffffff81173e85>] ? putname+0x35/0x50
      2013-09-08 05:05:06 [<ffffffff811647e1>] ? do_sys_open+0x101/0x140
      2013-09-08 05:05:06 [<ffffffff810030f2>] system_call_fastpath+0x16/0x1b
      2013-09-08 05:05:06
      2013-09-08 05:05:06 Kernel panic - not syncing: LBUG
      2013-09-08 05:05:06 Pid: 33253, comm: arpege Tainted: G W --------------- H 2.6.32-279.5.2.bl6.Bull.36.x86_64 #1
      2013-09-08 05:05:06 Call Trace:
      2013-09-08 05:05:06 [<ffffffff81495fe3>] ? panic+0xa0/0x168
      2013-09-08 05:05:06 [<ffffffffa0397e5b>] ? lbug_with_loc+0x9b/0xb0 [libcfs]
      2013-09-08 05:05:06 [<ffffffffa0741690>] ? osc_lock_enqueue+0x0/0x810 [osc]
      2013-09-08 05:05:06 [<ffffffffa08858dd>] ? lov_lock_enqueue+0x71d/0x820 [lov]
      2013-09-08 05:05:06 [<ffffffffa04b37fc>] ? cl_enqueue_try+0xfc/0x300 [obdclass]
      2013-09-08 05:05:06 [<ffffffffa04b4bfd>] ? cl_enqueue_locked+0x6d/0x200 [obdclass]
      2013-09-08 05:05:06 [<ffffffffa04b587e>] ? cl_lock_request+0x7e/0x1c0 [obdclass]
      2013-09-08 05:05:06 [<ffffffffa09f02cf>] ? cl_glimpse_lock+0x16f/0x410 [lustre]
      2013-09-08 05:05:06 [<ffffffffa09f0a94>] ? cl_glimpse_size+0x154/0x160 [lustre]
      2013-09-08 05:05:06 [<ffffffffa09b49ae>] ? ll_inode_revalidate_it+0x5e/0x170 [lustre]
      2013-09-08 05:05:06 [<ffffffffa09b4b09>] ? ll_getattr_it+0x49/0x170 [lustre]
      2013-09-08 05:05:06 [<ffffffffa09b4c67>] ? ll_getattr+0x37/0x40 [lustre]
      2013-09-08 05:05:06 [<ffffffff8116c591>] ? vfs_getattr+0x51/0x80
      2013-09-08 05:05:06 [<ffffffff81113abf>] ? free_hot_page+0x2f/0x60
      2013-09-08 05:05:06 [<ffffffff8116c84f>] ? vfs_fstat+0x3f/0x60
      2013-09-08 05:05:06 [<ffffffff8116c894>] ? sys_newfstat+0x24/0x40
      2013-09-08 05:05:06 [<ffffffff81173e85>] ? putname+0x35/0x50
      2013-09-08 05:05:06 [<ffffffff811647e1>] ? do_sys_open+0x101/0x140
      2013-09-08 05:05:06 [<ffffffff810030f2>] ? system_call_fastpath+0x16/0x1b
      

      This looks very similar to LU-3874, but the patch that was suspected in this jira is not applied to our version (only in master and b2_4).

      Unfortunately, we do not have any dump to analyze further.

      Thanks,
      Sebastien.

      Attachments

        Activity

          People

            bfaccini Bruno Faccini (Inactive)
            sebastien.buisson Sebastien Buisson (Inactive)
            Votes:
            0 Vote for this issue
            Watchers:
            6 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved: