Lustre / LU-6271

(osc_cache.c:3150:discard_cb()) ASSERTION( (!(page->cp_type == CPT_CACHEABLE) || (!PageDirty(cl_page_vmpage(page)))) ) failed:

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Blocker
    • Fix Version: Lustre 2.8.0
    • Affects Version: Lustre 2.7.0
    • Hyperon
    • 3
    • 17581

    Description

      While performing failover testing:

      LustreError: 167-0: lustre-OST0000-osc-ffff88083b339000: This client was evicted by lustre-OST0000; in progress operations using this service will fail.
      Feb 23 09:22:48 iwc260 kernel: LustreError: 51585:0:(ldlm_resource.c:777:ldlm_resource_complain()) lustre-OST0000-osc-ffff88083b339000: namespace resource [0x2d097bb:0x0:0x0].0 (ffff88106a992e40) refcount nonzero (1) after lock cleanup; forcing cleanup.
      Feb 23 09:22:48 iwc260 kernel: LustreError: 51585:0:(ldlm_resource.c:1374:ldlm_resource_dump()) --- Resource: [0x2d097bb:0x0:0x0].0 (ffff88106a992e40) refcount = 3
      Feb 23 09:22:48 iwc260 kernel: LustreError: 51585:0:(ldlm_resource.c:1377:ldlm_resource_dump()) Granted locks (in reverse order):
      Feb 23 09:22:48 iwc260 kernel: LustreError: 51585:0:(ldlm_resource.c:1380:ldlm_resource_dump()) ### ### ns: lustre-OST0000-osc-ffff88083b339000 lock: ffff88106a60d540/0xa1f402a8812988f4 lrc: 3/0,1 mode: PW/PW res: [0x2d097bb:0x0:0x0].0 rrc: 3 type: EXT [0->18446744073709551615] (req 0->8191) flags: 0x126400020000 nid: local remote: 0x72a9f4c80e66e07b expref: -99 pid: 51570 timeout: 0 lvb_type: 1
      Feb 23 09:22:48 iwc260 kernel: Lustre: lustre-OST0000-osc-ffff88083b339000: Connection restored to lustre-OST0000 (at 192.168.120.13@o2ib)
      Feb 23 09:22:49 iwc260 kernel: LustreError: 90252:0:(osc_cache.c:3150:discard_cb()) ASSERTION( (!(page->cp_type == CPT_CACHEABLE) || (!PageDirty(cl_page_vmpage(page)))) ) failed:
      Feb 23 09:22:49 iwc260 kernel: LustreError: 90252:0:(osc_cache.c:3150:discard_cb()) ASSERTION( (!(page->cp_type == CPT_CACHEABLE) || (!PageDirty(cl_page_vmpage(page)))) ) failed:
      Feb 23 09:22:49 iwc260 kernel: LustreError: 90252:0:(osc_cache.c:3150:discard_cb()) LBUG
      Feb 23 09:22:49 iwc260 kernel: LustreError: 90252:0:(osc_cache.c:3150:discard_cb()) LBUG
      Feb 23 09:22:49 iwc260 kernel: Pid: 90252, comm: ldlm_bl_36
      Feb 23 09:22:49 iwc260 kernel:
      Feb 23 09:22:49 iwc260 kernel: Call Trace:
      Feb 23 09:22:49 iwc260 kernel: [<ffffffffa0435895>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
      Feb 23 09:22:49 iwc260 kernel: [<ffffffffa0435e97>] lbug_with_loc+0x47/0xb0 [libcfs]
      Feb 23 09:22:49 iwc260 kernel: [<ffffffffa0b2ba56>] discard_cb+0x156/0x190 [osc]
      Feb 23 09:22:49 iwc260 kernel: [<ffffffffa0b2bdcc>] osc_page_gang_lookup+0x1ac/0x330 [osc]
      Feb 23 09:22:49 iwc260 kernel: [<ffffffffa0b2b900>] ? discard_cb+0x0/0x190 [osc]
      Feb 23 09:22:49 iwc260 kernel: [<ffffffffa0b2c094>] osc_lock_discard_pages+0x144/0x240 [osc]
      Feb 23 09:22:49 iwc260 kernel: [<ffffffffa04461c1>] ? libcfs_debug_msg+0x41/0x50 [libcfs]
      Feb 23 09:22:49 iwc260 kernel: [<ffffffffa0b2b900>] ? discard_cb+0x0/0x190 [osc]
      Feb 23 09:22:49 iwc260 kernel: [<ffffffffa0b2298b>] osc_lock_flush+0x8b/0x260 [osc]
      Feb 23 09:22:49 iwc260 kernel: [<ffffffffa0b22e08>] osc_ldlm_blocking_ast+0x2a8/0x3c0 [osc]
      Feb 23 09:22:49 iwc260 kernel: [<ffffffffa0761a6c>] ldlm_cancel_callback+0x6c/0x170 [ptlrpc]
      Feb 23 09:22:49 iwc260 kernel: [<ffffffffa077450a>] ldlm_cli_cancel_local+0x8a/0x470 [ptlrpc]
      Feb 23 09:22:49 iwc260 kernel: [<ffffffffa0779120>] ldlm_cli_cancel+0x60/0x360 [ptlrpc]
      Feb 23 09:22:49 iwc260 kernel: [<ffffffffa0b22c3b>] osc_ldlm_blocking_ast+0xdb/0x3c0 [osc]
      Feb 23 09:22:49 iwc260 kernel: [<ffffffffa04461c1>] ? libcfs_debug_msg+0x41/0x50 [libcfs]
      Feb 23 09:22:49 iwc260 kernel: [<ffffffffa077cb60>] ldlm_handle_bl_callback+0x130/0x400 [ptlrpc]
      Feb 23 09:22:49 iwc260 kernel: [<ffffffffa077d0c1>] ldlm_bl_thread_main+0x291/0x3f0 [ptlrpc]
      Feb 23 09:22:49 iwc260 kernel: [<ffffffff81061d00>] ? default_wake_function+0x0/0x20
      Feb 23 09:22:49 iwc260 kernel: [<ffffffffa077ce30>] ? ldlm_bl_thread_main+0x0/0x3f0 [ptlrpc]
      Feb 23 09:22:49 iwc260 kernel: [<ffffffff8109abf6>] kthread+0x96/0xa0
      Feb 23 09:22:49 iwc260 kernel: [<ffffffff8100c20a>] child_rip+0xa/0x20
      Feb 23 09:22:49 iwc260 kernel: [<ffffffff8109ab60>] ? kthread+0x0/0xa0
      Feb 23 09:22:49 iwc260 kernel: [<ffffffff8100c200>] ? child_rip+0x0/0x20
      

      Attachments

        1. iwc189.lbug.log.txt.gz
          0.2 kB
        2. iwc260.log.gz
          4.81 MB
        3. l-23.LU-6271.txt.gz
          0.2 kB
        4. lustre-log.iwc37.txt
          329 kB
        5. r2i1n4.messages.gz
          96 kB
        6. server_messages.tar.gz
          924 kB

        Issue Links

          Activity

            jay Jinshan Xiong (Inactive) added a comment - - edited

            To nip this in the bud - it seems like any ongoing work should move to a new ticket...

            To be clear, the customer was using a build with their own patches, there is no evidence that this issue can be seen on master.

            pjones Peter Jones added a comment -

            To nip this in the bud - it seems like any ongoing work should move to a new ticket...


            From the backtrace, the truncate process was waiting for i_alloc_sem while holding the inode mutex, but nothing in the stacks should have been holding i_alloc_sem. Unfortunately I don't have the sle11-sp3 kernel source on hand, otherwise I could take a look.

            One alternative is to capture a kernel dump when you hit this problem and search for the address of i_alloc_sem; you can probably find it on some process's stack.

            jay Jinshan Xiong (Inactive) added a comment -

            From the log, the root cause is clearly a deadlock; it has nothing to do with these patches.

            Nov 16 14:22:19 r2i1n4 kernel: [274907.716536] shald_dpd.out   D ffff88201d872010     0   825    772 0x00000000
            Nov 16 14:22:19 r2i1n4 kernel: [274907.716541]  ffff88201d873c30 0000000000000082 ffff88201d872010 0000000000010900
            Nov 16 14:22:19 r2i1n4 kernel: [274907.716548]  0000000000010900 0000000000010900 0000000000010900 ffff88201d873fd8
            Nov 16 14:22:19 r2i1n4 kernel: [274907.716554]  ffff88201d873fd8 0000000000010900 ffff881fbd5ac440 ffff881020932600
            Nov 16 14:22:19 r2i1n4 kernel: [274907.716561] Call Trace:
            Nov 16 14:22:19 r2i1n4 kernel: [274907.716568]  [<ffffffff814651a8>] __mutex_lock_slowpath+0xf8/0x150
            Nov 16 14:22:19 r2i1n4 kernel: [274907.716575]  [<ffffffff81464c3a>] mutex_lock+0x1a/0x40
            Nov 16 14:22:19 r2i1n4 kernel: [274907.716585]  [<ffffffff810fd10a>] generic_file_aio_write+0x3a/0xb0
            Nov 16 14:22:19 r2i1n4 kernel: [274907.716622]  [<ffffffffa0d51e2e>] vvp_io_write_start+0x12e/0x530 [lustre]
            Nov 16 14:22:19 r2i1n4 kernel: [274907.716707]  [<ffffffffa088d539>] cl_io_start+0x69/0x140 [obdclass]
            Nov 16 14:22:19 r2i1n4 kernel: [274907.716790]  [<ffffffffa0890993>] cl_io_loop+0xa3/0x190 [obdclass]
            Nov 16 14:22:19 r2i1n4 kernel: [274907.716860]  [<ffffffffa0cf0cc7>] ll_file_io_generic+0x757/0x830 [lustre]
            Nov 16 14:22:19 r2i1n4 kernel: [274907.716892]  [<ffffffffa0d01e14>] ll_file_aio_write+0x1b4/0x5d0 [lustre]
            Nov 16 14:22:19 r2i1n4 kernel: [274907.716926]  [<ffffffffa0d02415>] ll_file_write+0x1e5/0x270 [lustre]
            Nov 16 14:22:19 r2i1n4 kernel: [274907.716942]  [<ffffffff8115ed4e>] vfs_write+0xce/0x140
            Nov 16 14:22:19 r2i1n4 kernel: [274907.716951]  [<ffffffff8115eec3>] sys_write+0x53/0xa0
            Nov 16 14:22:19 r2i1n4 kernel: [274907.716958]  [<ffffffff8146e1f2>] system_call_fastpath+0x16/0x1b
            Nov 16 14:22:19 r2i1n4 kernel: [274907.716971]  [<00007fffec69579d>] 0x7fffec69579c
            
            Nov 16 14:22:19 r2i1n4 kernel: [274907.717055] shald_dpd.out   D ffff88201dc04010     0   826    772 0x00000000
            Nov 16 14:22:19 r2i1n4 kernel: [274907.717060]  ffff88201dc05b70 0000000000000086 ffff88201dc04010 0000000000010900
            Nov 16 14:22:19 r2i1n4 kernel: [274907.717067]  0000000000010900 0000000000010900 0000000000010900 ffff88201dc05fd8
            Nov 16 14:22:19 r2i1n4 kernel: [274907.717073]  ffff88201dc05fd8 0000000000010900 ffff881fd7fce600 ffffffff81a11020
            Nov 16 14:22:19 r2i1n4 kernel: [274907.717080] Call Trace:
            Nov 16 14:22:19 r2i1n4 kernel: [274907.717087]  [<ffffffff814651a8>] __mutex_lock_slowpath+0xf8/0x150
            Nov 16 14:22:19 r2i1n4 kernel: [274907.717094]  [<ffffffff81464c3a>] mutex_lock+0x1a/0x40
            Nov 16 14:22:19 r2i1n4 kernel: [274907.717121]  [<ffffffffa0d1adb0>] ll_setattr_raw+0x2c0/0xb60 [lustre]
            Nov 16 14:22:19 r2i1n4 kernel: [274907.717143]  [<ffffffff81178faf>] notify_change+0x19f/0x2f0
            Nov 16 14:22:19 r2i1n4 kernel: [274907.717151]  [<ffffffff8115d517>] do_truncate+0x57/0x80
            Nov 16 14:22:19 r2i1n4 kernel: [274907.717160]  [<ffffffff8116c053>] do_last+0x603/0x800
            Nov 16 14:22:19 r2i1n4 kernel: [274907.717168]  [<ffffffff8116ceb9>] path_openat+0xd9/0x420
            Nov 16 14:22:19 r2i1n4 kernel: [274907.717175]  [<ffffffff8116d33c>] do_filp_open+0x4c/0xc0
            Nov 16 14:22:19 r2i1n4 kernel: [274907.717183]  [<ffffffff8115de5f>] do_sys_open+0x17f/0x250
            Nov 16 14:22:19 r2i1n4 kernel: [274907.717191]  [<ffffffff8146e1f2>] system_call_fastpath+0x16/0x1b
            Nov 16 14:22:19 r2i1n4 kernel: [274907.717202]  [<00007fffeb9057bd>] 0x7fffeb9057bc
            
            Nov 16 14:22:19 r2i1n4 kernel: [274907.717285] shald_dpd.out   D 0000000000000000     0   827    772 0x00000000
            Nov 16 14:22:19 r2i1n4 kernel: [274907.717290]  ffff881fbd5afa60 0000000000000086 ffff881fbd5ae010 0000000000010900
            Nov 16 14:22:19 r2i1n4 kernel: [274907.717297]  0000000000010900 0000000000010900 0000000000010900 ffff881fbd5affd8
            Nov 16 14:22:19 r2i1n4 kernel: [274907.717303]  ffff881fbd5affd8 0000000000010900 ffff881fa07c20c0 ffff881020fa2080
            Nov 16 14:22:19 r2i1n4 kernel: [274907.717310] Call Trace:
            Nov 16 14:22:19 r2i1n4 kernel: [274907.717318]  [<ffffffff814661a5>] rwsem_down_failed_common+0xb5/0x160
            Nov 16 14:22:19 r2i1n4 kernel: [274907.717329]  [<ffffffff81264533>] call_rwsem_down_write_failed+0x13/0x20
            Nov 16 14:22:19 r2i1n4 kernel: [274907.717337]  [<ffffffff814654bc>] down_write+0x1c/0x20
            Nov 16 14:22:19 r2i1n4 kernel: [274907.717370]  [<ffffffffa0d51159>] vvp_io_setattr_start+0x129/0x170 [lustre]
            Nov 16 14:22:19 r2i1n4 kernel: [274907.717447]  [<ffffffffa088d539>] cl_io_start+0x69/0x140 [obdclass]
            Nov 16 14:22:19 r2i1n4 kernel: [274907.717528]  [<ffffffffa0890993>] cl_io_loop+0xa3/0x190 [obdclass]
            Nov 16 14:22:19 r2i1n4 kernel: [274907.717605]  [<ffffffffa0d4905f>] cl_setattr_ost+0x24f/0x2b0 [lustre]
            Nov 16 14:22:19 r2i1n4 kernel: [274907.717663]  [<ffffffffa0d1b2e4>] ll_setattr_raw+0x7f4/0xb60 [lustre]
            Nov 16 14:22:19 r2i1n4 kernel: [274907.717683]  [<ffffffff81178faf>] notify_change+0x19f/0x2f0
            Nov 16 14:22:19 r2i1n4 kernel: [274907.717692]  [<ffffffff8115d517>] do_truncate+0x57/0x80
            Nov 16 14:22:19 r2i1n4 kernel: [274907.717699]  [<ffffffff8116c053>] do_last+0x603/0x800
            Nov 16 14:22:19 r2i1n4 kernel: [274907.717708]  [<ffffffff8116ceb9>] path_openat+0xd9/0x420
            Nov 16 14:22:19 r2i1n4 kernel: [274907.717715]  [<ffffffff8116d33c>] do_filp_open+0x4c/0xc0
            Nov 16 14:22:19 r2i1n4 kernel: [274907.717723]  [<ffffffff8115de5f>] do_sys_open+0x17f/0x250
            Nov 16 14:22:19 r2i1n4 kernel: [274907.717731]  [<ffffffff8146e1f2>] system_call_fastpath+0x16/0x1b 
            

            You could try patch 14915, but most likely you'll end up with the same problem.

            jay Jinshan Xiong (Inactive) added a comment -
            icostelloddn Ian Costello added a comment - - edited

            NOTE: the 2.5.39-ddn7 has the patches:

            20401c7 LU-6271 osc: faulty assertion in osc_object_prune()
            b08341e LU-6271 osc: further OSC cleanup after eviction
            42e9257 LU-6271 osc: handle osc eviction correctly

            I think we should seriously consider using patch (written by Dongyang Li):
            http://review.whamcloud.com/#/c/14915/

            I am going to test the above patch at NCI, where we are able to reproduce the crash and inactive OST (due to client eviction)...


            In short, clients were evicted suddenly and the OSTs stayed INACTIVE for those clients until the clients were restarted forcibly.

            icostelloddn Ian Costello added a comment -

            Uploaded two log files detailing the problem; the customer is running the ddn-ieel branch 2.5.39-ddn7.

            icostelloddn Ian Costello added a comment -

            server messages

            icostelloddn Ian Costello added a comment -

            client messages

            icostelloddn Ian Costello added a comment -

            This bug is not resolved. At a customer site today, one of the issues reported with the patch series was reproduced:

            Here is the gathered information from one of the engineers:

            Nov 13 21:22:32 r2i1n4 kernel: [41261.553520] LustreError: 11-0: work-OST0018-osc-ffff880f4a556c00: Communicating with 10.148.254.146@o2ib, operation obd_ping failed with -107.
            Nov 13 21:22:32 r2i1n4 kernel: [41261.553537] Lustre: work-OST0018-osc-ffff880f4a556c00: Connection to work-OST0018 (at 10.148.254.146@o2ib) was lost; in progress operations using this service will wait for recovery to complete
            Nov 13 21:22:32 r2i1n4 kernel: [41261.553744] LustreError: 167-0: work-OST0018-osc-ffff880f4a556c00: This client was evicted by work-OST0018; in progress operations using this service will fail.

            Nov 13 21:22:19 oss3 kernel: LustreError: 0:0:(ldlm_lockd.c:344:waiting_locks_callback()) ### lock callback timer expired after 100s: evicting client at 10.148.4.3@o2ib ns: filter-work-OST0018_UUID lock: ffff881bf46fa1c0/0x1b850f2869999e97 lrc: 3/0,0 mode: PW/PW res: [0xcc6a24:0x0:0x0].0 rrc: 31 type: EXT [0->18446744073709551615] (req 0->18446744073709551615) flags: 0x60000080010020 nid: 10.148.4.3@o2ib remote: 0x7a3c9c39e0502fcc expref: 5 pid: 20187 timeout: 4335795111 lvb_type: 0

            r2i1n4:~ # lctl dl
            0 UP mgc MGC10.148.254.141@o2ib 2dc4c54f-dab6-1190-fca7-d7747093f936 5

            35 UP osc work-OST0017-osc-ffff880f4a556c00 9d007967-5de8-eb2e-eaf9-99d4b17d74e6 5
            36 IN osc work-OST0018-osc-ffff880f4a556c00 9d007967-5de8-eb2e-eaf9-99d4b17d74e6 5
            37 UP osc work-OST0019-osc-ffff880f4a556c00 9d007967-5de8-eb2e-eaf9-99d4b17d74e6 5

            r2i1n4:~ # lctl --device 36 activate
            error: activate: failed: Invalid argument

            r2i1n4:~ # cat /proc/fs/lustre/osc/work-OST0018-osc-ffff880f4a556c00/import
            import:
            name: work-OST0018-osc-ffff880f4a556c00
            target: work-OST0018_UUID
            state: EVICTED
            instance: 21
            connect_flags: [ write_grant, server_lock, version, request_portal, truncate_lock, max_byte_per_rpc, early_lock_cancel, adaptive_timeouts, alt_checksum_algorithm, fid_is_enabled, version_recovery, full20, layout_lock, 64bithash, object_max_bytes, jobstats, einprogress, lvb_type]
            import_flags: [ invalid, replayable, pingable, connect_tried ]
            connection:
            failover_nids: [ 10.148.254.146@o2ib, 10.148.254.147@o2ib ]
            current_connection: 10.148.254.146@o2ib
            connection_attempts: 2
            generation: 2
            in-progress_invalidations: 1
            rpcs:
            inflight: 0
            unregistering: 0
            timeouts: 0
            avg_waittime: 803 usec
            service_estimates:
            services: 1 sec
            network: 1 sec
            transactions:
            last_replay: 0
            peer_committed: 90194391109
            last_checked: 90194391109
            read_data_averages:
            bytes_per_rpc: 19550
            usec_per_rpc: 938
            MB_per_sec: 20.84
            write_data_averages:
            bytes_per_rpc: 508
            usec_per_rpc: 5386
            MB_per_sec: 0.09

            r2i1n4:~ # ps -ef | grep ll_imp_inval
            root 832 2 0 Nov13 ? 00:00:00 [ll_imp_inval]
            root 25849 25765 0 14:18 pts/0 00:00:00 grep ll_imp_inval

            Nov 16 14:22:19 r2i1n4 kernel: [274907.717827] ll_imp_inval S ffff881ffc253b00 0 832 2 0x00000000
            Nov 16 14:22:19 r2i1n4 kernel: [274907.717832] ffff881ffc34dc50 0000000000000046 ffff881ffc34c010 0000000000010900
            Nov 16 14:22:19 r2i1n4 kernel: [274907.717838] 0000000000010900 0000000000010900 0000000000010900 ffff881ffc34dfd8
            Nov 16 14:22:19 r2i1n4 kernel: [274907.717845] ffff881ffc34dfd8 0000000000010900 ffff881fd7ffa580 ffff881020f2c300
            Nov 16 14:22:19 r2i1n4 kernel: [274907.717852] Call Trace:
            Nov 16 14:22:19 r2i1n4 kernel: [274907.717879] [<ffffffffa0bcf361>] osc_object_invalidate+0x1c1/0x280 [osc]
            Nov 16 14:22:19 r2i1n4 kernel: [274907.717907] [<ffffffffa0bb5bae>] osc_ldlm_resource_invalidate+0xae/0x170 [osc]
            Nov 16 14:22:19 r2i1n4 kernel: [274907.717930] [<ffffffffa0716d68>] cfs_hash_for_each_relax+0x178/0x340 [libcfs]
            Nov 16 14:22:19 r2i1n4 kernel: [274907.717971] [<ffffffffa0718fc0>] cfs_hash_for_each_nolock+0x70/0x1c0 [libcfs]
            Nov 16 14:22:19 r2i1n4 kernel: [274907.718008] [<ffffffffa0bc989c>] osc_import_event+0xfac/0x1420 [osc]
            Nov 16 14:22:19 r2i1n4 kernel: [274907.718064] [<ffffffffa0a88592>] ptlrpc_invalidate_import+0x292/0x890 [ptlrpc]
            Nov 16 14:22:19 r2i1n4 kernel: [274907.718151] [<ffffffffa0a8a702>] ptlrpc_invalidate_import_thread+0x42/0x2e0 [ptlrpc]
            Nov 16 14:22:19 r2i1n4 kernel: [274907.718197] [<ffffffff810843f6>] kthread+0x96/0xa0
            Nov 16 14:22:19 r2i1n4 kernel: [274907.718205] [<ffffffff8146f364>] kernel_thread_helper+0x4/0x10

            icostelloddn Ian Costello added a comment -

            People

              jay Jinshan Xiong (Inactive)
              cliffw Cliff White (Inactive)
              Votes: 0
              Watchers: 25

              Dates

                Created:
                Updated:
                Resolved: