[LU-1650] crash of lustre clients in osc_req_attr_set() routine Created: 20/Jul/12  Updated: 19/Dec/12  Resolved: 19/Dec/12

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.1.0, Lustre 2.3.0
Fix Version/s: Lustre 2.4.0, Lustre 2.1.4

Type: Bug Priority: Major
Reporter: Gregoire Pichon Assignee: Lai Siyao
Resolution: Fixed Votes: 0
Labels: cl_lock, client, crash
Environment:

lustre 2.1.2 + a few additional patches
bullxlinux 6.1.1 (rhel6.1.1, kernel 2.6.32)


Attachments: Text File LU-1650_debug.patch     File LU1650_trace_cl_lock_at_page_dup_VS_osc_req_attr_set_LBUG.tgz     Text File cartan1139_crash_debug.txt     Text File cartan1206_crash_debug.txt     Text File console-grove409.txt     File console.hype205    
Issue Links:
Related
is related to LU-2172 osc_req_attr_set LBUG Resolved
Severity: 3
Rank (Obsolete): 4053

 Description   

Our lustre release is made of lustre 2.1.2 plus the following patches:

  • ORNL-22 "general ptlrpcd threads pool support"
  • LU-1144 implement a NUMA aware ptlrpcd binding policy
  • LU-1164 add the ability to choose the number of ko2iblnd threads at module
  • LU-857 support SELinux
  • LU-1110 MDS Oops in osd_xattr_get() during file open by FID
  • LU-1363 llite: Not held lock when calling security_d_instantiate
  • LU-948/LU-1059 recovery hang
  • LU-969/LU-1408 stack overflow
  • LU-645/BZ23978 getcwd failure
  • LU-1428 MDT service threads spinning in cfs_hash_for_each_relax()
  • LU-1299 loading large enough binary from lustre trigger OOM killer
  • LU-1493 assertion failed in dqacq_completion()
  • LU-1194 OSS LBUG in llog_recov_thread_stop() during umount

At CEA, several Lustre clients have crashed with LBUGs sharing the same signature/stack, like the following:

........
LustreError: 6614:0:(osc_io.c:698:osc_req_attr_set()) no cover page!
LustreError: 6614:0:(osc_io.c:700:osc_req_attr_set()) page@ffff880b9f71e180[2 ffff8806a0423448:2342912 ^(null)_ffff880b9f71e0c0 3 0 1 (null) ffff880af0057f80 0x0]
LustreError: 6614:0:(osc_io.c:700:osc_req_attr_set()) page@ffff880b9f71e0c0[1 ffff8806e43a6aa8:780288 ^ffff880b9f71e180_(null) 3 0 1 (null) (null) 0x0]
LustreError: 6614:0:(osc_io.c:700:osc_req_attr_set()) vvp-page@ffff880ba2855640(1:0:0) vm@ffffea0028af1e40 1400000000000801 3:0 ffff880b9f71e180 2342912 lru
LustreError: 6614:0:(osc_io.c:700:osc_req_attr_set()) lov-page@ffff880ba285bf48
LustreError: 6614:0:(osc_io.c:700:osc_req_attr_set()) osc-page@ffff880b9e38d070: 1< 0x845fed 1 0 - - + > 2< 3196059648 0 4096 0x7 0x8 | (null) ffff88102de0e8c8 ffff8806a08d4e40 ffffffffa08e6140 ffff880b9e38d070 > 3< + ffff880a8b481100 1 3668 0 > 4< 0 0 32 69689344 - | - - - - > 5< - - - - | 0 - - | 0 - ->
LustreError: 6614:0:(osc_io.c:700:osc_req_attr_set()) end page@ffff880b9f71e0c0
LustreError: 6614:0:(osc_io.c:700:osc_req_attr_set()) dump uncover page!
Pid: 6614, comm: %%USR123_%%%A456
 
Call Trace:
[<ffffffffa03bc865>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
[<ffffffffa08e3979>] osc_req_attr_set+0x2f9/0x310 [osc]
[<ffffffffa04de979>] cl_req_attr_set+0xc9/0x250 [obdclass]
[<ffffffffa08d088b>] osc_send_oap_rpc+0xc2b/0x1b40 [osc]
[<ffffffffa03bd86e>] ? cfs_mem_cache_free+0xe/0x10 [libcfs]
[<ffffffffa08d1a4e>] osc_check_rpcs+0x2ae/0x4c0 [osc]
[<ffffffffa08e4037>] osc_io_submit+0x1e7/0x540 [osc]
[<ffffffffa04ded00>] cl_io_submit_rw+0x70/0x180 [obdclass]
[<ffffffffa0962a4e>] lov_io_submit+0x4ee/0xc30 [lov]
[<ffffffffa04ded00>] cl_io_submit_rw+0x70/0x180 [obdclass]
[<ffffffffa04e0f40>] cl_io_read_page+0xb0/0x170 [obdclass]
[<ffffffffa04d5349>] ? cl_page_assume+0xf9/0x2d0 [obdclass]
[<ffffffffa0a1a6b6>] ll_readpage+0x96/0x200 [lustre]
[<ffffffff810fc9dc>] generic_file_aio_read+0x1fc/0x700
[<ffffffffa0a4237b>] vvp_io_read_start+0x13b/0x3e0 [lustre]
[<ffffffffa04defca>] cl_io_start+0x6a/0x140 [obdclass]
[<ffffffffa04e322c>] cl_io_loop+0xcc/0x190 [obdclass]
[<ffffffffa09f1ef7>] ll_file_io_generic+0x3a7/0x560 [lustre]
[<ffffffffa09f21e9>] ll_file_aio_read+0x139/0x2c0 [lustre]
[<ffffffffa09f26a9>] ll_file_read+0x169/0x2a0 [lustre]
[<ffffffff8115e355>] vfs_read+0xb5/0x1a0
[<ffffffff8115e491>] sys_read+0x51/0x90
[<ffffffff81003172>] system_call_fastpath+0x16/0x1b
 
LustreError: 6614:0:(osc_io.c:702:osc_req_attr_set()) LBUG
Pid: 6614, comm: %%USR123_%%%A456

and also, but much less frequently, with a stack like:

LustreError: 23020:0:(cl_lock.c:906:cl_lock_hold_release()) failed at lock->cll_state != CLS_HELD.
LustreError: 23020:0:(cl_lock.c:906:cl_lock_hold_release()) LBUG
Pid: 23020, comm: %%USR789_%%A37_

Call Trace:
  [<ffffffffa03dc865>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
  [<ffffffffa03dce77>] lbug_with_loc+0x47/0xb0 [libcfs]
  [<ffffffffa04f95e2>] cl_lock_hold_release+0x2a2/0x2b0 [obdclass]
  [<ffffffffa04fadf2>] cl_lock_release+0x82/0x180 [obdclass]
  [<ffffffffa0502338>] cl_lock_link_fini+0x68/0x160 [obdclass]
  [<ffffffffa0502565>] cl_io_unlock+0x135/0x2e0 [obdclass]
  [<ffffffffa0503245>] cl_io_loop+0xe5/0x190 [obdclass]
  [<ffffffffa0a5ac13>] cl_setattr_ost+0x1c3/0x240 [lustre]
  [<ffffffffa0a2e59a>] ll_setattr_raw+0x96a/0xf20 [lustre]
  [<ffffffffa0a2ebaf>] ll_setattr+0x5f/0x100 [lustre]
  [<ffffffff811796d8>] notify_change+0x168/0x340
  [<ffffffff8115be54>] do_truncate+0x64/0xa0
  [<ffffffff8116dff1>] do_filp_open+0x821/0xd30
  [<ffffffff8112af80>] ? unmap_region+0x110/0x130
  [<ffffffff8117a6a2>] ? alloc_fd+0x92/0x160
  [<ffffffff8115ac29>] do_sys_open+0x69/0x140
  [<ffffffff8115ad40>] sys_open+0x20/0x30
  [<ffffffff81003172>] system_call_fastpath+0x16/0x1b

The second LBUG() is an assertion introduced by LU-1299, so we reverted the LU-1299 patch and delivered a new version of Lustre, but the first LBUG() was still hit by several clients.

I have asked the support team to provide additional debug data (a debug log with "dlmtrace" enabled, and a dump image). But unfortunately it will take some time to get this information.



 Comments   
Comment by Peter Jones [ 20/Jul/12 ]

Assigning to Jinshan who has already looked into this issue a little

Comment by Gregoire Pichon [ 23/Jul/12 ]

Here are the crash-dump analyses from the support team:

  • cartan1139_crash_debug.txt, for the "(osc_io.c:702:osc_req_attr_set()) LBUG" in Lustre 2.1.2/Bull.2.215.
  • cartan1206_crash_debug.txt, for the "(cl_lock.c:906:cl_lock_hold_release()) failed at lock->cll_state != CLS_HELD." LBUG in Lustre 2.1.2/Bull.2.215.
Comment by Gregoire Pichon [ 23/Jul/12 ]

Jinshan,

From LU-1299, I see a note by Christopher Morrone (08/May/12 6:17 PM) reporting a crash in "(cl_lock.c:906:cl_lock_hold_release()) ASSERTION(lock->cll_state != CLS_HELD) failed". You answered him that this problem had been fixed in patch set 7 (of the master patch).

Could you confirm that this problem is also fixed by patch set 10 of the b2_1 patch? This is the patch version we have integrated into our Lustre release.

Thanks.

Comment by Jinshan Xiong (Inactive) [ 23/Jul/12 ]

I've found the root cause of the 2nd problem and will fix LU-1299 for b2_1. The 1st problem may be related to LU-1299, because it can leave pages cached in the VM without a covering cl_lock. After I submit the new patch, please apply it and see if you still hit the 1st problem.

Comment by Jinshan Xiong (Inactive) [ 24/Jul/12 ]

My fault: the fix was already in master, but I forgot to merge it into the patch for b2_1. I've fixed this problem in patch set 11 of http://review.whamcloud.com/2574.

Comment by Gregoire Pichon [ 24/Jul/12 ]

Thank you Jinshan.
We will apply the new patch as soon as it is approved.

Comment by Christopher Morrone [ 30/Jul/12 ]

We're still hitting the same assertion with patch set 11 of http://review.whamcloud.com/2574. Will attach console log excerpt.

We were running 2.1.2-3chaos.

Comment by Christopher Morrone [ 30/Jul/12 ]

Console log from assertion when running patchset 11 from the LU-1299 patch.

Comment by Jinshan Xiong (Inactive) [ 30/Jul/12 ]

Hi Chris, did you collect the backtrace, or was it exactly the same backtrace as before?

Comment by Prakash Surya (Inactive) [ 30/Jul/12 ]

Jinshan, the back trace is the same as what is posted in the description. For completeness it's pasted below as well:

crash> bt
PID: 28339  TASK: ffff88040aa03500  CPU: 6   COMMAND: "mpi_dump"
 #0 [ffff88040a9a3a90] machine_kexec at ffffffff8103216b
 #1 [ffff88040a9a3af0] crash_kexec at ffffffff810b8d12
 #2 [ffff88040a9a3bc0] panic at ffffffff814ee999
 #3 [ffff88040a9a3c40] lbug_with_loc at ffffffffa0438e1b [libcfs]
 #4 [ffff88040a9a3c60] cl_lock_hold_release at ffffffffa05b124f [obdclass]
 #5 [ffff88040a9a3c80] cl_lock_release at ffffffffa05b24a2 [obdclass]
 #6 [ffff88040a9a3cb0] cl_lock_link_fini at ffffffffa05b8128 [obdclass]
 #7 [ffff88040a9a3cf0] cl_io_unlock at ffffffffa05b8295 [obdclass]
 #8 [ffff88040a9a3d30] cl_io_loop at ffffffffa05b8ced [obdclass]
 #9 [ffff88040a9a3d80] cl_setattr_ost at ffffffffa09cb383 [lustre]
#10 [ffff88040a9a3de0] ll_setattr_raw at ffffffffa09a5854 [lustre]
#11 [ffff88040a9a3e60] ll_setattr at ffffffffa09a5c8f [lustre]
#12 [ffff88040a9a3e70] notify_change at ffffffff81193bc8
#13 [ffff88040a9a3ee0] do_truncate at ffffffff811762a4
#14 [ffff88040a9a3f50] sys_ftruncate at ffffffff81176570
#15 [ffff88040a9a3f80] system_call_fastpath at ffffffff8100b0f2
    RIP: 00002aaaac1d09d7  RSP: 00007fffffffb028  RFLAGS: 00010246
    RAX: 000000000000004d  RBX: ffffffff8100b0f2  RCX: 0000000000000000
    RDX: 0000000020000008  RSI: 0000000020000008  RDI: 0000000000000005
    RBP: 00007fffffffc860   R8: 0000000000000001   R9: 0000000000000000
    R10: 0000000000000001  R11: 0000000000000203  R12: 0000000000606080
    R13: 00007fffffffd160  R14: 0000000000000000  R15: 0000000020000008
    ORIG_RAX: 000000000000004d  CS: 0033  SS: 002b
Comment by Prakash Surya (Inactive) [ 30/Jul/12 ]

We also crashed a few of our Orion compute nodes over the weekend. I believe this bug was the cause of at least one of them. The stack trace is below:

crash> bt
PID: 71817  TASK: ffff88082ae9c080  CPU: 14  COMMAND: "ior"
 #0 [ffff880488f552d0] machine_kexec at ffffffff8103216b
 #1 [ffff880488f55330] crash_kexec at ffffffff810b8d12
 #2 [ffff880488f55400] panic at ffffffff814ee999
 #3 [ffff880488f55480] lbug_with_loc at ffffffffa032af2b [libcfs]
 #4 [ffff880488f554a0] osc_req_attr_set at ffffffffa0a329b1 [osc]
 #5 [ffff880488f554e0] cl_req_attr_set at ffffffffa06d3799 [obdclass]
 #6 [ffff880488f55540] osc_build_rpc at ffffffffa0a266d9 [osc]
 #7 [ffff880488f55670] osc_send_read_rpc at ffffffffa0a3b013 [osc]
 #8 [ffff880488f55810] osc_io_unplug0 at ffffffffa0a3ebcb [osc]
 #9 [ffff880488f55920] osc_io_unplug at ffffffffa0a3f9c1 [osc]
#10 [ffff880488f55930] osc_queue_sync_pages at ffffffffa0a45ec8 [osc]
#11 [ffff880488f559a0] osc_io_submit at ffffffffa0a32f60 [osc]
#12 [ffff880488f55a40] cl_io_submit_rw at ffffffffa06d325c [obdclass]
#13 [ffff880488f55a90] lov_io_submit at ffffffffa0abe032 [lov]
#14 [ffff880488f55b20] cl_io_submit_rw at ffffffffa06d325c [obdclass]
#15 [ffff880488f55b70] cl_io_read_page at ffffffffa06d4fe6 [obdclass]
#16 [ffff880488f55bc0] ll_readpage at ffffffffa0b592c0 [lustre]
#17 [ffff880488f55c00] generic_file_aio_read at ffffffff8111262c
#18 [ffff880488f55ce0] vvp_io_read_start at ffffffffa0b7e806 [lustre]
#19 [ffff880488f55d50] cl_io_start at ffffffffa06d3392 [obdclass]
#20 [ffff880488f55d80] cl_io_loop at ffffffffa06d6ca4 [obdclass]
#21 [ffff880488f55db0] ll_file_io_generic at ffffffffa0b38a87 [lustre]
#22 [ffff880488f55e20] ll_file_aio_read at ffffffffa0b39929 [lustre]
#23 [ffff880488f55e80] ll_file_read at ffffffffa0b39be3 [lustre]
#24 [ffff880488f55ef0] vfs_read at ffffffff811782d5
#25 [ffff880488f55f30] sys_read at ffffffff81178411
#26 [ffff880488f55f80] system_call_fastpath at ffffffff8100b0f2
    RIP: 00002aaaab65b4ed  RSP: 00007fffffffd040  RFLAGS: 00010246
    RAX: 0000000000000000  RBX: ffffffff8100b0f2  RCX: 0000008e50133f96
    RDX: 0000000008000000  RSI: 00002aaab8001000  RDI: 0000000000000009
    RBP: 0000000008000000   R8: 0000000000000064   R9: 00000000ea31beaf
    R10: 00007fffffffcd6c  R11: 0000000000000293  R12: 00002aaab8001000
    R13: 0000000000000003  R14: 00000000006c2ac0  R15: 0000000000000002
    ORIG_RAX: 0000000000000000  CS: 0033  SS: 002b

EDIT: We are not running with the patch from LU-1299 on Orion, so perhaps these crashes on the branch are expected?

Comment by Jinshan Xiong (Inactive) [ 30/Jul/12 ]

I've found the root cause of this problem and I'm working on a patch. This problem is revealed by the LU-1299 patch because the assertion was added in that patch.

The root cause is that we shouldn't set the lock to CLS_HELD after unusing it.

Comment by Jinshan Xiong (Inactive) [ 30/Jul/12 ]

Will you please apply this patch (http://review.whamcloud.com/3497) and LU-1299 patch set 11 or 12 to see if it can fix the problem?

Comment by Ian Colle (Inactive) [ 31/Jul/12 ]

Also observed at LLNL on Orion

Comment by Christopher Morrone [ 31/Jul/12 ]

I've added the http://review.whamcloud.com/3497 patch for testing, and we currently have patch set 11 of the original LU-1299 patch (http://review.whamcloud.com/2574).

Comment by Bruno Faccini (Inactive) [ 13/Sep/12 ]

We also still frequently hit the "(osc_io.c:702:osc_req_attr_set()) LBUG" with our Bull Lustre v2.1.2 build, which includes the LU-1299 (patch set 11) and ORNL-22 patches. An interesting finding: in some crash dumps we found, among the locks linked to the concerned object, one with a page range/index matching that of the searched page, while in other crash dumps we did not!

This seems to mean we are hitting some timing-window issue, where the expected but missing "matching" lock may have been linked to the object's lock list too late...

BTW, we have developed a local debug patch that triggers the same condition earlier: it moves the LBUG into a "clone" of the cl_lock_at_page() routine that is called only from osc_req_attr_set(), so when no lock is found we detect it and LBUG while still under coh_lock_guard protection, and with luck the future bringer of the matching lock will be caught stuck waiting on coh_lock_guard (a simplified illustration of this idea follows below).

We will also try running with the ptlrpc module settings "max_ptlrpcds=1" and "ptlrpcd_bind_policy=1", since we rather strongly suspect the "multiple + bound ptlrpcds" feature has an impact on this problem...
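To picture the idea behind this debug patch, here is a minimal user-space sketch. It is not the actual LU-1650_debug.patch (which is kernel code walking cl_object_header::coh_locks under coh_lock_guard); dbg_lock, obj_header and lock_at_page_dup below are illustrative stand-ins. The only point is that the lookup and the assertion both run while the list guard is held, so a late "lock bringer" is frozen on the guard and therefore visible in the crash dump:

#include <assert.h>
#include <pthread.h>
#include <stdio.h>

struct dbg_lock {
	unsigned long    start;     /* first page index covered */
	unsigned long    end;       /* last page index covered  */
	struct dbg_lock *next;      /* linkage on the object's lock list */
};

struct obj_header {
	pthread_spinlock_t guard;   /* stands in for coh_lock_guard */
	struct dbg_lock   *locks;   /* stands in for coh_locks      */
};

/* "Clone" of the lock lookup, called only from the request-attribute path. */
static struct dbg_lock *lock_at_page_dup(struct obj_header *hdr,
					 unsigned long index)
{
	struct dbg_lock *found = NULL;

	pthread_spin_lock(&hdr->guard);
	for (struct dbg_lock *l = hdr->locks; l != NULL; l = l->next) {
		if (l->start <= index && index <= l->end) {
			found = l;
			break;
		}
	}
	/* Assert while the guard is still held: a racing thread that is
	 * about to link the "matching" lock is now stuck on the guard and
	 * therefore identifiable in the dump. */
	assert(found != NULL);
	pthread_spin_unlock(&hdr->guard);
	return found;
}

int main(void)
{
	struct dbg_lock lk = { .start = 0, .end = 255, .next = NULL };
	struct obj_header hdr = { .locks = &lk };

	pthread_spin_init(&hdr.guard, PTHREAD_PROCESS_PRIVATE);
	printf("covering lock for page 42: %p\n",
	       (void *)lock_at_page_dup(&hdr, 42));
	pthread_spin_destroy(&hdr.guard);
	return 0;
}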

Comment by Jinshan Xiong (Inactive) [ 13/Sep/12 ]

An interesting finding: in some crash dumps we found, among the locks linked to the concerned object, one with a page range/index matching that of the searched page, while in other crash dumps we did not!

Did you notice the lock state when this issue happened? My gut feeling is that this page must be a readahead page, and that adding a readahead page into the cache and checking for a matching cl_lock are not atomic. Please dump the lock state and page state when you see this problem next time.

Another issue observed in LU-1299 can cause a similar problem as well: a cached cl_lock can be (wrongly) set to error if an unexpected signal is received, which causes the lock to be erred out without cleaning up the pages it covers.

Comment by Michael McKay (Inactive) [ 14/Sep/12 ]

I am currently looking at an issue that looks similar to the first stack trace in this ticket. The crashing trace in my case is:
[<ffffffff81007b11>] try_stack_unwind+0x1b1/0x200
[<ffffffff81006420>] dump_trace+0x90/0x300
[<ffffffffa014b8d2>] libcfs_debug_dumpstack+0x52/0x80 [libcfs]
[<ffffffffa0514536>] osc_req_attr_set+0x206/0x2a0 [osc]
[<ffffffffa024d0ae>] cl_req_attr_set+0xce/0x1f0 [obdclass]
[<ffffffffa0501e77>] osc_send_oap_rpc+0xc67/0x1c50 [osc]
[<ffffffffa0503136>] osc_check_rpcs+0x2d6/0x470 [osc]
[<ffffffffa0514bf0>] osc_io_submit+0x220/0x4d0 [osc]
[<ffffffffa024d3d8>] cl_io_submit_rw+0x78/0x1a0 [obdclass]
[<ffffffffa0575924>] lov_io_submit+0x254/0xb40 [lov]
[<ffffffffa024d3d8>] cl_io_submit_rw+0x78/0x1a0 [obdclass]
[<ffffffffa05e8424>] ll_writepage+0x2c4/0x480 [lustre]
[<ffffffff810d0b22>] __writepage+0x12/0x50
[<ffffffff810d1213>] write_cache_pages+0x1d3/0x3b0
[<ffffffff810d140f>] generic_writepages+0x1f/0x30
[<ffffffff810d143c>] do_writepages+0x1c/0x40
[<ffffffff8112c422>] writeback_single_inode+0x92/0x2c0
[<ffffffff8112cfca>] writeback_inodes_wb+0x26a/0x3e0
[<ffffffff8112d2e0>] wb_writeback+0x1a0/0x260
[<ffffffff8112d67c>] wb_do_writeback+0x1fc/0x210
[<ffffffff8112d6cb>] bdi_writeback_task+0x3b/0xc0
[<ffffffff810e2321>] bdi_start_fn+0x81/0xf0
[<ffffffff81063dae>] kthread+0x8e/0xa0
[<ffffffff81003d3a>] child_rip+0xa/0x20
> 2012-04-20T22:00:48.362745-05:00 c0-0c0s5n1 LustreError: 7196:0:(osc_io.c:702:osc_req_attr_set()) LBUG

In this case the object has 2 locks:
LOCK1 is a CLM_PHANTOM lock for the whole file that is being cancelled.
LOCK2 is a WRITE lock for the index that this write request wants, but it is currently in the CLS_QUEUING state and is blocked behind LOCK1 while that lock is being cancelled.

Could this be a similar issue?

Comment by Jinshan Xiong (Inactive) [ 15/Sep/12 ]

Hi Michael, it looks similar. Actually, when this bug is hit, it should dump all locks of this object. Can anybody post that dump here on the ticket?

Comment by Michael McKay (Inactive) [ 17/Sep/12 ]

Here is the output of the locks when the assertion was hit.

7196:0:(osc_io.c:698:osc_req_attr_set()) lock@ffff8803f4522358[3 2 0 1 1 00000006] P(0):[0, 18446744073709551615]@[0x100000000:0x4b1f5c0:0x0]

{ 7196:0:(osc_io.c:698:osc_req_attr_set()) lovsub@ffff8803f525e4e0: [0 ffff880404198470 P(0):[0, 18446744073709551615]@[0x23d979800:0x1:0x0]] 7196:0:(osc_io.c:698:osc_req_attr_set()) osc@ffff8803f4a3ab50: (null) 00001000 0x85ead42e54740750 1 (null) size: 0 mtime: 0 atime: 0 ctime: 0 blocks: 0 7196:0:(osc_io.c:698:osc_req_attr_set()) }

lock@ffff8803f4522358
7196:0:(osc_io.c:698:osc_req_attr_set()) no cover page!
7196:0:(osc_io.c:698:osc_req_attr_set()) lock@ffff8803f4522dd8[1 1 0 1 1 00000000] W(2):[0, 0]@[0x100000000:0x4b1f5c0:0x0]

{ 7196:0:(osc_io.c:698:osc_req_attr_set()) lovsub@ffff8803f7d0a4e0: [0 ffff8807e9ae1a88 W(2):[0, 18446744073709551615]@[0x23d979800:0x1:0x0]] > 2012-04-20T22:00:48.199478-05:00 c0-0c0s5n1 LustreError: 7196:0:(osc_io.c:698:osc_req_attr_set()) osc@ffff8803f5c37b50: (null) 00000000 0x0 0 (null) size: 0 mtime: 0 atime: 0 ctime: 0 blocks: 0 7196:0:(osc_io.c:698:osc_req_attr_set()) }

lock@ffff8803f4522dd8
7196:0:(osc_io.c:698:osc_req_attr_set()) no cover page!
7196:0:(osc_io.c:700:osc_req_attr_set()) page@ffff8803f4901680[3 ffff8807eb639b08:0 ^(null)_ffff8803f49015c0 2 0 1 (null) ffff8807e9b9c380 0x0]
7196:0:(osc_io.c:700:osc_req_attr_set()) page@ffff8803f49015c0[1 ffff8807ea301b58:0 ^ffff8803f4901680_(null) 2 0 1 (null) (null) 0x0]
7196:0:(osc_io.c:700:osc_req_attr_set()) vvp-page@ffff8803f7438f50(0:0:0) vm@ffffea000dc16b30 200000000005019 4:0 ffff8803f4901680 0 lru
7196:0:(osc_io.c:700:osc_req_attr_set()) lov-page@ffff8803f757d978
7196:0:(osc_io.c:700:osc_req_attr_set()) osc-page@ffff8803f4b317a8: 1< 0x845fed 2 0 - - + > 2< 0 0 2016 0x3 0x420 | (null) ffff880401df2628 ffff8807e847c340 ffffffffa0516a20 ffff8803f4b317a8 > 3< + ffff8807ffdfe0c0 1 39 0 > 4< 0 0 8 33759232 - | - - - - > 5< - - - - | 0 - - | 0 - ->
7196:0:(osc_io.c:700:osc_req_attr_set()) end page@ffff8803f49015c0
7196:0:(osc_io.c:700:osc_req_attr_set()) dump uncover page!

Comment by Jinshan Xiong (Inactive) [ 28/Sep/12 ]

I'm working on this issue.

Comment by Jinshan Xiong (Inactive) [ 28/Sep/12 ]

I realize that the issue mentioned by Michael McKay is actually different: there the assertion was hit in the ll_writepage() path, whereas initially the assertion was hit on a readahead page.

After taking a look at the crash log, the really strange thing is that the page was actually covered by a lock, as follows:

LustreError: 73977:0:(osc_io.c:698:osc_req_attr_set()) lock@ffff880b61a79518[0 5 0 0 0 00000000] W(2):[185984, 187903]@[0x102350000:0x536b:0x0] {
LustreError: 73977:0:(osc_io.c:698:osc_req_attr_set())     lovsub@ffff880b5c410620: [0 ffff880b57801678 W(2):[558720, 564735]@[0x2103bd364:0x1e89:0x0]]
LustreError: 73977:0:(osc_io.c:698:osc_req_attr_set())     osc@ffff880b64809648: ffff880b5a1e3480 40120002 0x6cc4082097a097e7 4 (null) size: 770703360 mtime: 1341558695 atime: 1341558695 ctime: 1341558695 blocks: 1459736
LustreError: 73977:0:(osc_io.c:698:osc_req_attr_set()) } lock@ffff880b61a79518
LustreError: 73977:0:(osc_io.c:698:osc_req_attr_set()) no cover page!

and page was:

LustreError: 73977:0:(osc_io.c:700:osc_req_attr_set()) page@ffff880b57936ec0[2 ffff8804ff1e1a48:561152 ^(null)_ffff880b57936e00 3 0 1 (null) ffff880b7608cec0 0x0]
LustreError: 73977:0:(osc_io.c:700:osc_req_attr_set()) page@ffff880b57936e00[1 ffff8804ff1e2d78:186368 ^ffff880b57936ec0_(null) 3 0 1 (null) (null) 0x0]
LustreError: 73977:0:(osc_io.c:700:osc_req_attr_set()) vvp-page@ffff880b57b38320(1:0:0) vm@ffffea0027a42cc0 1400000000000821 2:0 ffff880b57936ec0 561152 lru
LustreError: 73977:0:(osc_io.c:700:osc_req_attr_set()) lov-page@ffff880b5a276498
LustreError: 73977:0:(osc_io.c:700:osc_req_attr_set()) osc-page@ffff880b565f28b0: 1< 0x845fed 1 0 - - + > 2< 763363328 0 4096 0x7 0x8 | (null) ffff88104f42a588 ffff880501380840 ffffffffa08ea140 ffff880b565f28b0 > 3< + ffff8808220f0040 1 732 0 > 4< 0 0 32 72396800 - | - - + - > 5< - - + - | 0 - - | 208 + ->
LustreError: 73977:0:(osc_io.c:700:osc_req_attr_set()) end page@ffff880b57936e00
LustreError: 73977:0:(osc_io.c:700:osc_req_attr_set()) dump uncover page!

The page index is 186368, the lock is a PW lock covering [185984, 187903], and the lock is in the cached state. It all looks fine, yet the assertion was hit. I'll investigate further.

Comment by Jinshan Xiong (Inactive) [ 29/Sep/12 ]

Hi Gregoire, do you still have the crash file? I'd like to see the value of osc_page::ops_lock, because that lock was supposed to cover that readahead page. Otherwise, I'll provide a debug patch that prints it in osc_page_print() so that we can see it the next time you hit this.

Thanks

Comment by Bruno Faccini (Inactive) [ 03/Oct/12 ]

Hello Jinshan, I checked all the crash dumps still available on-site, and osc_page::ops_lock is always non-null and valid.
By "valid" I mean a cl_lock that is still allocated/in use, but that is not (or no longer?) linked, via cl_lock::cll_linkage, on the cl_object_header::coh_locks list being scanned, nor on any other list, at the time of the LBUG()/crash.

BTW, we have not hit this LBUG() since we started running (on a much smaller test configuration) with our own debug patch, which I described in my earlier comment on 13/Sep/12 6:33 PM.

Comment by Jinshan Xiong (Inactive) [ 04/Oct/12 ]

what's the status of that lock?

Comment by Peter Jones [ 04/Oct/12 ]

Lai

Could you please help out on this issue?

Thanks

Peter

Comment by Bruno Faccini (Inactive) [ 04/Oct/12 ]

The status (cll_state?) of osc_page::ops_lock found in the still-available crash dumps is 0x6/CLS_FREEING.

Comment by Cory Spitz [ 08/Oct/12 ]

FYI: LU-2094 was opened to track the same LBUG (but triggered down the ll_writepage path) that matches the signature that Michael McKay added to this ticket on 14/Sep/12.

Comment by Lai Siyao [ 11/Oct/12 ]

Hi Bruno, did you include http://review.whamcloud.com/3497 in your code?

Comment by Alexandre Louvet [ 11/Oct/12 ]

Hi Lai. No, we don't have this patch in our build tree.

Comment by Lai Siyao [ 11/Oct/12 ]

Alexandre, according to Jinshan's comment above, http://review.whamcloud.com/3497 is a possible fix for the LBUG in osc_req_attr_set(). Will you apply that patch and retest?

Comment by Alexandre Louvet [ 11/Oct/12 ]

The patch has been applied and we are going to test.

Comment by Jinshan Xiong (Inactive) [ 17/Oct/12 ]

Hi Bruno Faccini,

Can you please share your debug patch with me? We have several occurrences of this issue or similar ones, but the lock was there in the crash dump; I don't understand why it couldn't be found by cl_lock_at_pgoff().

Comment by Bruno Faccini (Inactive) [ 18/Oct/12 ]

You mean cl_lock_at_page()??
I attached our debug-patch as LU-1650_debug.patch file.

Comment by Jinshan Xiong (Inactive) [ 20/Oct/12 ]

Hi Bruno, can you please try this patch: http://review.whamcloud.com/4316? It turns out that lots of LBUGs are caused by this problem, and this patch can fix it.

Comment by Bruno Faccini (Inactive) [ 26/Oct/12 ]

Will try to integrate http://review.whamcloud.com/4316 in our builds asap.

On the other hand, we still lack real-world/production exposure of Lustre 2.1.2 on our Lustre clients, which we need either to reproduce the problem or to conclude that it has been fixed...

Comment by Bruno Faccini (Inactive) [ 12/Nov/12 ]

We upgraded the whole T100 system to Lustre 2.1.3 (Bull build .223) this past week, and we now encounter this issue again quite frequently (about 5 to 10 node crashes per day...).

BTW, http://review.whamcloud.com/4316 is still missing from the current Bull R&D builds, so we plan to integrate it locally ourselves, along with our debug patch...

Comment by Bruno Faccini (Inactive) [ 12/Nov/12 ]

I also have a question in mind: why does cl_lock_at_page() not consider cl_locks in the CLS_QUEUED/CLS_QUEUING states?

Comment by Jinshan Xiong (Inactive) [ 13/Nov/12 ]

Hi Bruno, patch 4316 is worth trying because it has fixed a couple of similar issues in our testing, but to be honest, I haven't seen this particular issue in our testing. Now that it can be reproduced at high frequency, this is a good chance for us to dig out the root cause. Can you please turn on the debug options to collect logs on the client?

To maintain data consistency, only granted locks are allowed to cache pages on the client side. If a lock is in the CLS_QUEUED state, the DLM enqueue request has been sent but the server has not granted it yet; usually this is because a conflicting lock held by another client must be revoked by the server before this one can be granted. For a CLS_QUEUING lock, the DLM request has not been sent yet; this happens when a conflicting lock exists on the same client, and we should cancel that lock before enqueuing the new one, otherwise the server would need extra RPCs to revoke it.
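As a rough illustration of that rule, here is a self-contained sketch; it is not the Lustre cl_lock code, the state names only mirror the terms used in this comment, and the real state machine has more states and transitions. A lookup for a lock that may "cover" a cached page accepts only granted locks and skips queuing/queued ones:

#include <stdbool.h>
#include <stdio.h>

enum sk_state {              /* simplified stand-in for the cl_lock states    */
	SK_QUEUING,          /* enqueue RPC not sent yet (conflict on client) */
	SK_QUEUED,           /* enqueue RPC sent, server has not granted it   */
	SK_HELD,             /* granted and actively held by an IO            */
	SK_CACHED,           /* granted, kept around for later reuse          */
	SK_FREEING,          /* being destroyed                               */
};

struct sk_lock {
	enum sk_state state;
	unsigned long start, end;   /* covered page index range */
};

/* Only a granted lock may cover pages cached on the client. */
static bool sk_granted(const struct sk_lock *lk)
{
	return lk->state == SK_HELD || lk->state == SK_CACHED;
}

/* Would this lock be accepted as the cover for page `index`? */
static bool sk_covers(const struct sk_lock *lk, unsigned long index)
{
	return sk_granted(lk) && lk->start <= index && index <= lk->end;
}

int main(void)
{
	struct sk_lock queued = { SK_QUEUED, 0, 1023 };
	struct sk_lock cached = { SK_CACHED, 0, 1023 };

	printf("queued lock covers page 10? %d\n", sk_covers(&queued, 10)); /* 0 */
	printf("cached lock covers page 10? %d\n", sk_covers(&cached, 10)); /* 1 */
	return 0;
}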

Comment by Antoine Percher [ 22/Nov/12 ]

We have upgraded the system to a Lustre version that integrates LU-1650_debug.patch
and the LU-2170 backport of http://review.whamcloud.com/4316, and an LBUG occurred
in cl_lock_at_page_dup()/.../osc_req_attr_set()/.../cl_req_attr_set()/.../ll_file_read()/...

And we saw 2 things:

  • another process is waiting on a spin_lock in the cl_lock_hold_mutex()/.../cl_lock_request()/ll_file_read() system call path
  • 2 ldlm threads are also waiting on a spin_lock:
    ldlm_cb_xxx: cl_lock_modify()/osc_lock_granted()/.../osc_ldlm_completion_ast()/ldlm_work_cp_ast_lock()/....
    ldlm_bl_xxx: cl_lock_delete0()/.../ldlm_cancel_callback()/.../cl_lock_cancel()/.../osc_ldlm_blocking_ast()/..
    All stack traces will be coming soon.
Comment by Antoine Percher [ 23/Nov/12 ]

Attached are some stack traces and Lustre debug messages captured after the LBUG in osc_req_attr_set()/cl_lock_at_page_dup().

From my point of view, the interesting part is the Lustre debug message trace file, because it can explain why we have a missing lock.

In the trace from germain1174, I found that lock 0xffff88004e087cf8 could be the missing lock, because it is deleted just before the LBUG:

CPU5 1353578056.708194 : vvp_io_write_start() write: [1090330624, 1090519040)
CPU0 1353578056.703907 : vvp_io_read_start() read: -> [1075838976, 1077616640)
CPU5 1353578056.710066 : osc_lock_enqueue_wait() lock ffff880626aec4d8 is conflicted with ffff88004e087cf8 will wait
CPU5 1353578056.710070 : cl_lock_enqueue_wait() enqueue wait: ffff88004e087cf8
CPU5 1353578056.710072 : cl_lock_cancel() cancel lock: ffff88004e087cf8
CPU5 1353578056.710126 : cl_lock_delete() delete lock: ffff88004e087cf8
CPU0 1353578056.710138 : cl_lock_at_page_dup() LBUG begin
CPU5 1353578056.710261 : cl_lock_delete() delete lock: ffff88004e087cf8

Do you agree? If you need more information I can try to extract it.

Comment by Jinshan Xiong (Inactive) [ 23/Nov/12 ]

So lock ffff88004e087cf8 was enqueued for read and ffff880626aec4d8 is being enqueued for write. However, the read is synchronous, so lock ffff88004e087cf8 shouldn't be canceled before the read completes; can you please verify that?

what's the message printed in function cl_lock_at_page_dup()?

Comment by Bruno Faccini (Inactive) [ 23/Nov/12 ]

Debug data that Antoine was not able to attach; he sent it to me by email.

Comment by Bruno Faccini (Inactive) [ 23/Nov/12 ]

Hello Jinshan,

My earlier comment was about the LU1650_trace_cl_lock_at_page_dup_VS_osc_req_attr_set_LBUG.tgz file I attached.

In this archive, you should find at least the cl_lock_at_page_dup() messages you requested.

Comment by Antoine Percher [ 26/Nov/12 ]

Did you see the LBUG message printed? In the trace_debug_germain1174_LBUG_cl_lock_at_page_dup.txt file, you can also see the printed struct of lock ffff88004e087cf8; its cll_state is CLS_FREEING.

I also noticed that for each issue/LBUG, the page index that triggers the LBUG is a multiple of the stripe size in pages.

For example: In my configuration the stripe count is one mega (1m = 256 pages)
and on the germain1174 node, the page was page@ 261168 = 1028*256

May be the lock was canceled/deleted just one bit/page too early ?
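A quick arithmetic check of that observation, assuming 4 KiB pages (the usual x86_64 page size) and the 1 MiB stripe mentioned above; the function name is purely illustrative:

#include <stdbool.h>
#include <stdio.h>

/* 1 MiB stripe / 4 KiB page = 256 pages per stripe, so the first page of a
 * stripe has an index that is a multiple of 256 (and the first page of a
 * stripe_size x stripe_count chunk is a multiple of 256 * stripe_count). */
static bool first_page_of_stripe(unsigned long page_index,
				 unsigned long stripe_bytes,
				 unsigned long page_bytes)
{
	return page_index % (stripe_bytes / page_bytes) == 0;
}

int main(void)
{
	unsigned long one_mib = 1048576, page = 4096;

	printf("pages per 1 MiB stripe: %lu\n", one_mib / page);        /* 256 */
	printf("index 1028 * 256 on a stripe boundary? %d\n",
	       first_page_of_stripe(1028UL * 256UL, one_mib, page));    /* 1 */
	return 0;
}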

Comment by Bruno Faccini (Inactive) [ 26/Nov/12 ]

Since we worked on this together, to expand on what Antoine started to detail: it seems to us that the page/index causing the LBUG (because it lacks a covering cl_lock) is the first page of the stripe-size x stripe-count chunk immediately following the chunk that contains the page where the file position plus the read size/offset lands...

Hmm, not sure I am being clear while trying to be detailed here! So let me try to show it another way:

File represented linearly in pages, split by "|" into chunks of (stripe size x stripe count):

...--|-...-|-...-|-...-|-...-|--...
                      ^ ^
                      | |_ 1st page of the next chunk
                      |
                      page of file->pos + read(fd,,size)

Now let's see how it looks in the Web interface...

Comment by Bruno Faccini (Inactive) [ 26/Nov/12 ]

Ok just as I expected, to see it "readable" you have to edit the comment, sorry.

Comment by Jinshan Xiong (Inactive) [ 26/Nov/12 ]

Hi Antoine and Bruno, thanks for the analysis. It's normal for a conflicting lock to be canceled during IO. In this case, lock ffff88004e087cf8 conflicts with a queuing write lock, and it covers 121 readahead pages, so the process did not hold this lock. However, if everything were okay, those pages would have been discarded when the lock was canceled, but apparently that didn't happen.

There must be some race window in cl_lock_page_out() and check_and_discard_cb(). I'm checking the code.
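For illustration only, the kind of interleaving being described could look like the sequential sketch below. This is not Lustre code (all names are stand-ins), and whether it matches the actual window in cl_lock_page_out()/check_and_discard_cb() is precisely what the investigation and the resulting patch address:

#include <stdbool.h>
#include <stdio.h>

#define NPAGES 16

struct sk_obj {
	bool cached[NPAGES];           /* page cache: is page index cached?    */
	bool lock_granted;             /* single covering lock: still granted? */
	unsigned long lock_start, lock_end;
};

/* Cancellation path: discard covered pages, then the lock leaves the granted
 * set (the role played by cl_lock_page_out() plus the state change). */
static void sk_cancel_lock(struct sk_obj *o)
{
	for (unsigned long i = o->lock_start; i <= o->lock_end; i++)
		o->cached[i] = false;
	o->lock_granted = false;
}

/* Readahead path, split into its two halves to expose the window. */
static bool sk_readahead_check(const struct sk_obj *o, unsigned long idx)
{
	return o->lock_granted && o->lock_start <= idx && idx <= o->lock_end;
}

static void sk_readahead_insert(struct sk_obj *o, unsigned long idx)
{
	o->cached[idx] = true;
}

int main(void)
{
	struct sk_obj o = { .lock_granted = true, .lock_start = 0, .lock_end = 7 };
	unsigned long idx = 3;

	bool covered = sk_readahead_check(&o, idx); /* (1) sees a granted cover   */
	sk_cancel_lock(&o);                         /* sweep runs in the window   */
	if (covered)
		sk_readahead_insert(&o, idx);       /* (2) caches the page anyway */

	/* Result: a cached page with no covering lock -- the state that
	 * osc_req_attr_set() asserts against ("no cover page!"). */
	printf("page %lu cached=%d, lock granted=%d\n",
	       idx, (int)o.cached[idx], (int)o.lock_granted);
	return 0;
}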

Comment by Jinshan Xiong (Inactive) [ 26/Nov/12 ]

I've found the root cause, working on a patch...

Comment by Jinshan Xiong (Inactive) [ 26/Nov/12 ]

Can you please apply this patch http://review.whamcloud.com/4673 and test it again? Thanks.

Comment by Alain GOB [ 29/Nov/12 ]

Dear,
The patch was applied and is currently running on the test system. Unfortunately, if this issue is now fixed, we will not see anything on the test system; for now we do not see other issues happening on it, and we will let you know about any problems. To definitively confirm that this issue is fixed, we will have to run on the production system, with its large number of nodes and meaningful crash statistics. FYI, on the test system we have about 1 crash a week, while on the production system we see about 5 crashes a day. Do you know when this fix will be landed so it can be included in an official Bull release and pushed to the production system?
Regards,
Alain GOB.

Comment by Peter Jones [ 29/Nov/12 ]

Alain

We are testing and reviewing the patch in parallel with your testing. It sounds to me like we would want to see at least two weeks of the test running on your test cluster before we could confidently assert that the situation has been improved by the fix. We'll aim to get the fix landed to our tree and thus expose it to broader testing to raise confidence that there are no adverse side-effects for this fix.

Peter

Comment by Antoine Percher [ 10/Dec/12 ]

After 6 days on a big server that used to reproduce the issue every day, I can tell you that with the first fix (patch set 1) the LBUG did not reproduce, so we can say that the fix is OK and can be landed in an official release.
Thanks for all.

Comment by Peter Jones [ 19/Dec/12 ]

Thanks. This fix has been landed for 2.1.4 and 2.4
