[LU-1650] crash of lustre clients in osc_req_attr_set() routine Created: 20/Jul/12 Updated: 19/Dec/12 Resolved: 19/Dec/12 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.1.0, Lustre 2.3.0 |
| Fix Version/s: | Lustre 2.4.0, Lustre 2.1.4 |
| Type: | Bug | Priority: | Major |
| Reporter: | Gregoire Pichon | Assignee: | Lai Siyao |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | cl_lock, client, crash | ||
| Environment: |
lustre 2.1.2 + a few additional patches |
||
| Attachments: |
|
||||||||
| Issue Links: |
|
||||||||
| Severity: | 3 | ||||||||
| Rank (Obsolete): | 4053 | ||||||||
| Description |
|
Our lustre release is made of lustre 2.1.2 plus the following patches:
At CEA, we have seen several Lustre client crashes/LBUGs with this same signature/stack, like the following:
........
LustreError: 6614:0:(osc_io.c:698:osc_req_attr_set()) no cover page!
LustreError: 6614:0:(osc_io.c:700:osc_req_attr_set()) page@ffff880b9f71e180[2 ffff8806a0423448:2342912 ^(null)_ffff880b9f71e0c0 3 0 1 (null) ffff880af0057f80 0x0]
LustreError: 6614:0:(osc_io.c:700:osc_req_attr_set()) page@ffff880b9f71e0c0[1 ffff8806e43a6aa8:780288 ^ffff880b9f71e180_(null) 3 0 1 (null) (null) 0x0]
LustreError: 6614:0:(osc_io.c:700:osc_req_attr_set()) vvp-page@ffff880ba2855640(1:0:0) vm@ffffea0028af1e40 1400000000000801 3:0 ffff880b9f71e180 2342912 lru
LustreError: 6614:0:(osc_io.c:700:osc_req_attr_set()) lov-page@ffff880ba285bf48
LustreError: 6614:0:(osc_io.c:700:osc_req_attr_set()) osc-page@ffff880b9e38d070: 1< 0x845fed 1 0 - - + > 2< 3196059648 0 4096 0x7 0x8 | (null) ffff88102de0e8c8 ffff8806a08d4e40 ffffffffa08e6140 ffff880b9e38d070 > 3< + ffff880a8b481100 1 3668 0 > 4< 0 0 32 69689344 - | - - - - > 5< - - - - | 0 - - | 0 - ->
LustreError: 6614:0:(osc_io.c:700:osc_req_attr_set()) end page@ffff880b9f71e0c0
LustreError: 6614:0:(osc_io.c:700:osc_req_attr_set()) dump uncover page!
Pid: 6614, comm: %%USR123_%%%A456
Call Trace:
[<ffffffffa03bc865>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
[<ffffffffa08e3979>] osc_req_attr_set+0x2f9/0x310 [osc]
[<ffffffffa04de979>] cl_req_attr_set+0xc9/0x250 [obdclass]
[<ffffffffa08d088b>] osc_send_oap_rpc+0xc2b/0x1b40 [osc]
[<ffffffffa03bd86e>] ? cfs_mem_cache_free+0xe/0x10 [libcfs]
[<ffffffffa08d1a4e>] osc_check_rpcs+0x2ae/0x4c0 [osc]
[<ffffffffa08e4037>] osc_io_submit+0x1e7/0x540 [osc]
[<ffffffffa04ded00>] cl_io_submit_rw+0x70/0x180 [obdclass]
[<ffffffffa0962a4e>] lov_io_submit+0x4ee/0xc30 [lov]
[<ffffffffa04ded00>] cl_io_submit_rw+0x70/0x180 [obdclass]
[<ffffffffa04e0f40>] cl_io_read_page+0xb0/0x170 [obdclass]
[<ffffffffa04d5349>] ? cl_page_assume+0xf9/0x2d0 [obdclass]
[<ffffffffa0a1a6b6>] ll_readpage+0x96/0x200 [lustre]
[<ffffffff810fc9dc>] generic_file_aio_read+0x1fc/0x700
[<ffffffffa0a4237b>] vvp_io_read_start+0x13b/0x3e0 [lustre]
[<ffffffffa04defca>] cl_io_start+0x6a/0x140 [obdclass]
[<ffffffffa04e322c>] cl_io_loop+0xcc/0x190 [obdclass]
[<ffffffffa09f1ef7>] ll_file_io_generic+0x3a7/0x560 [lustre]
[<ffffffffa09f21e9>] ll_file_aio_read+0x139/0x2c0 [lustre]
[<ffffffffa09f26a9>] ll_file_read+0x169/0x2a0 [lustre]
[<ffffffff8115e355>] vfs_read+0xb5/0x1a0
[<ffffffff8115e491>] sys_read+0x51/0x90
[<ffffffff81003172>] system_call_fastpath+0x16/0x1b
LustreError: 6614:0:(osc_io.c:702:osc_req_attr_set()) LBUG
Pid: 6614, comm: %%USR123_%%%A456

and also, but much less frequently, with a stack like:
LustreError: 23020:0:(cl_lock.c:906:cl_lock_hold_release()) failed at lock->cll_state != CLS_HELD.
LustreError: 23020:0:(cl_lock.c:906:cl_lock_hold_release()) LBUG
Pid: 23020, comm: %%USR789_%%A37_
Call Trace:
[<ffffffffa03dc865>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
[<ffffffffa03dce77>] lbug_with_loc+0x47/0xb0 [libcfs]
[<ffffffffa04f95e2>] cl_lock_hold_release+0x2a2/0x2b0 [obdclass]
[<ffffffffa04fadf2>] cl_lock_release+0x82/0x180 [obdclass]
[<ffffffffa0502338>] cl_lock_link_fini+0x68/0x160 [obdclass]
[<ffffffffa0502565>] cl_io_unlock+0x135/0x2e0 [obdclass]
[<ffffffffa0503245>] cl_io_loop+0xe5/0x190 [obdclass]
[<ffffffffa0a5ac13>] cl_setattr_ost+0x1c3/0x240 [lustre]
[<ffffffffa0a2e59a>] ll_setattr_raw+0x96a/0xf20 [lustre]
[<ffffffffa0a2ebaf>] ll_setattr+0x5f/0x100 [lustre]
[<ffffffff811796d8>] notify_change+0x168/0x340
[<ffffffff8115be54>] do_truncate+0x64/0xa0
[<ffffffff8116dff1>] do_filp_open+0x821/0xd30
[<ffffffff8112af80>] ? unmap_region+0x110/0x130
[<ffffffff8117a6a2>] ? alloc_fd+0x92/0x160
[<ffffffff8115ac29>] do_sys_open+0x69/0x140
[<ffffffff8115ad40>] sys_open+0x20/0x30
[<ffffffff81003172>] system_call_fastpath+0x16/0x1b

The second LBUG() is an assertion introduced by
I have asked the support team to provide additional debug data (a debug log with "dlmtrace" enabled, and a dump image), but unfortunately it will take some time to obtain this information. |
| Comments |
| Comment by Peter Jones [ 20/Jul/12 ] | ||||
|
Assigning to Jinshan, who has already looked into this issue a little. | ||||
| Comment by Gregoire Pichon [ 23/Jul/12 ] | ||||
|
Here is the crash-dump analysis from the support team:
| ||||
| Comment by Gregoire Pichon [ 23/Jul/12 ] | ||||
|
Jinshan, could you confirm that this problem is also fixed by patch set 10 of the b2_1 patch? This is the patch version we have integrated into our Lustre release. Thanks. | ||||
| Comment by Jinshan Xiong (Inactive) [ 23/Jul/12 ] | ||||
|
I've found the root cause of the 2nd problem and will fix it. | ||||
| Comment by Jinshan Xiong (Inactive) [ 24/Jul/12 ] | ||||
|
My fault: the fix was already in master, but I forgot to merge it into the patch for b2_1. I've fixed this problem in patch set 11 of http://review.whamcloud.com/2574. | ||||
| Comment by Gregoire Pichon [ 24/Jul/12 ] | ||||
|
Thank you Jinshan. | ||||
| Comment by Christopher Morrone [ 30/Jul/12 ] | ||||
|
We're still hitting the same assertion with patch set 11 of http://review.whamcloud.com/2574. Will attach console log excerpt. We were running 2.1.2-3chaos. | ||||
| Comment by Christopher Morrone [ 30/Jul/12 ] | ||||
|
Console log from the assertion when running patch set 11 of the patch mentioned above. | ||||
| Comment by Jinshan Xiong (Inactive) [ 30/Jul/12 ] | ||||
|
Hi Chris, did you collect the backtrace, or did it have exactly the same backtrace as before? | ||||
| Comment by Prakash Surya (Inactive) [ 30/Jul/12 ] | ||||
|
Jinshan, the backtrace is the same as the one posted in the description. For completeness, it's pasted below as well:
crash> bt
PID: 28339 TASK: ffff88040aa03500 CPU: 6 COMMAND: "mpi_dump"
#0 [ffff88040a9a3a90] machine_kexec at ffffffff8103216b
#1 [ffff88040a9a3af0] crash_kexec at ffffffff810b8d12
#2 [ffff88040a9a3bc0] panic at ffffffff814ee999
#3 [ffff88040a9a3c40] lbug_with_loc at ffffffffa0438e1b [libcfs]
#4 [ffff88040a9a3c60] cl_lock_hold_release at ffffffffa05b124f [obdclass]
#5 [ffff88040a9a3c80] cl_lock_release at ffffffffa05b24a2 [obdclass]
#6 [ffff88040a9a3cb0] cl_lock_link_fini at ffffffffa05b8128 [obdclass]
#7 [ffff88040a9a3cf0] cl_io_unlock at ffffffffa05b8295 [obdclass]
#8 [ffff88040a9a3d30] cl_io_loop at ffffffffa05b8ced [obdclass]
#9 [ffff88040a9a3d80] cl_setattr_ost at ffffffffa09cb383 [lustre]
#10 [ffff88040a9a3de0] ll_setattr_raw at ffffffffa09a5854 [lustre]
#11 [ffff88040a9a3e60] ll_setattr at ffffffffa09a5c8f [lustre]
#12 [ffff88040a9a3e70] notify_change at ffffffff81193bc8
#13 [ffff88040a9a3ee0] do_truncate at ffffffff811762a4
#14 [ffff88040a9a3f50] sys_ftruncate at ffffffff81176570
#15 [ffff88040a9a3f80] system_call_fastpath at ffffffff8100b0f2
RIP: 00002aaaac1d09d7 RSP: 00007fffffffb028 RFLAGS: 00010246
RAX: 000000000000004d RBX: ffffffff8100b0f2 RCX: 0000000000000000
RDX: 0000000020000008 RSI: 0000000020000008 RDI: 0000000000000005
RBP: 00007fffffffc860 R8: 0000000000000001 R9: 0000000000000000
R10: 0000000000000001 R11: 0000000000000203 R12: 0000000000606080
R13: 00007fffffffd160 R14: 0000000000000000 R15: 0000000020000008
ORIG_RAX: 000000000000004d CS: 0033 SS: 002b
| ||||
| Comment by Prakash Surya (Inactive) [ 30/Jul/12 ] | ||||
|
We also crashed a few of our Orion compute nodes over the weekend. I believe this bug was the cause of at least one of them. The stack trace is below:
crash> bt
PID: 71817 TASK: ffff88082ae9c080 CPU: 14 COMMAND: "ior"
#0 [ffff880488f552d0] machine_kexec at ffffffff8103216b
#1 [ffff880488f55330] crash_kexec at ffffffff810b8d12
#2 [ffff880488f55400] panic at ffffffff814ee999
#3 [ffff880488f55480] lbug_with_loc at ffffffffa032af2b [libcfs]
#4 [ffff880488f554a0] osc_req_attr_set at ffffffffa0a329b1 [osc]
#5 [ffff880488f554e0] cl_req_attr_set at ffffffffa06d3799 [obdclass]
#6 [ffff880488f55540] osc_build_rpc at ffffffffa0a266d9 [osc]
#7 [ffff880488f55670] osc_send_read_rpc at ffffffffa0a3b013 [osc]
#8 [ffff880488f55810] osc_io_unplug0 at ffffffffa0a3ebcb [osc]
#9 [ffff880488f55920] osc_io_unplug at ffffffffa0a3f9c1 [osc]
#10 [ffff880488f55930] osc_queue_sync_pages at ffffffffa0a45ec8 [osc]
#11 [ffff880488f559a0] osc_io_submit at ffffffffa0a32f60 [osc]
#12 [ffff880488f55a40] cl_io_submit_rw at ffffffffa06d325c [obdclass]
#13 [ffff880488f55a90] lov_io_submit at ffffffffa0abe032 [lov]
#14 [ffff880488f55b20] cl_io_submit_rw at ffffffffa06d325c [obdclass]
#15 [ffff880488f55b70] cl_io_read_page at ffffffffa06d4fe6 [obdclass]
#16 [ffff880488f55bc0] ll_readpage at ffffffffa0b592c0 [lustre]
#17 [ffff880488f55c00] generic_file_aio_read at ffffffff8111262c
#18 [ffff880488f55ce0] vvp_io_read_start at ffffffffa0b7e806 [lustre]
#19 [ffff880488f55d50] cl_io_start at ffffffffa06d3392 [obdclass]
#20 [ffff880488f55d80] cl_io_loop at ffffffffa06d6ca4 [obdclass]
#21 [ffff880488f55db0] ll_file_io_generic at ffffffffa0b38a87 [lustre]
#22 [ffff880488f55e20] ll_file_aio_read at ffffffffa0b39929 [lustre]
#23 [ffff880488f55e80] ll_file_read at ffffffffa0b39be3 [lustre]
#24 [ffff880488f55ef0] vfs_read at ffffffff811782d5
#25 [ffff880488f55f30] sys_read at ffffffff81178411
#26 [ffff880488f55f80] system_call_fastpath at ffffffff8100b0f2
RIP: 00002aaaab65b4ed RSP: 00007fffffffd040 RFLAGS: 00010246
RAX: 0000000000000000 RBX: ffffffff8100b0f2 RCX: 0000008e50133f96
RDX: 0000000008000000 RSI: 00002aaab8001000 RDI: 0000000000000009
RBP: 0000000008000000 R8: 0000000000000064 R9: 00000000ea31beaf
R10: 00007fffffffcd6c R11: 0000000000000293 R12: 00002aaab8001000
R13: 0000000000000003 R14: 00000000006c2ac0 R15: 0000000000000002
ORIG_RAX: 0000000000000000 CS: 0033 SS: 002b
EDIT: We are not running with the patch from | ||||
| Comment by Jinshan Xiong (Inactive) [ 30/Jul/12 ] | ||||
|
I've found the root cause of this problem and I'm working on a patch. This problem was revealed by the LU-1299 patch because the assertion was added in that patch. The root cause of the problem is that we shouldn't set the lock back to CLS_HELD after unusing it. | ||||
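For readers not familiar with the client-lock (cl_lock) state machine, the sketch below is a minimal, hypothetical model of the invariant described in this comment; the toy_* names are made up and this is not the actual Lustre code. The point is that the held state should follow the hold count: the unuse path drops the last hold and leaves the lock cached, and must not flip it back to CLS_HELD.

```c
/* Hypothetical, simplified model of the invariant above -- NOT the real
 * Lustre cl_lock implementation. */
#include <assert.h>

enum toy_cl_state { TOY_CLS_NEW, TOY_CLS_QUEUING, TOY_CLS_ENQUEUED,
                    TOY_CLS_HELD, TOY_CLS_CACHED, TOY_CLS_FREEING };

struct toy_cl_lock {
	enum toy_cl_state state;
	int               holds;	/* number of active holders */
};

/* Dropping a hold: the lock may only be HELD while someone holds it;
 * once the last hold is released it falls back to CACHED, and nothing
 * in the unuse path should mark it HELD again. */
static void toy_lock_unuse(struct toy_cl_lock *lk)
{
	assert(lk->state == TOY_CLS_HELD && lk->holds > 0);
	if (--lk->holds == 0)
		lk->state = TOY_CLS_CACHED;
}

int main(void)
{
	struct toy_cl_lock lk = { TOY_CLS_HELD, 1 };

	toy_lock_unuse(&lk);
	return lk.state == TOY_CLS_CACHED ? 0 : 1;
}
```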
| Comment by Jinshan Xiong (Inactive) [ 30/Jul/12 ] | ||||
|
Will you please apply this patch (http://review.whamcloud.com/3497) and retest? | ||||
| Comment by Ian Colle (Inactive) [ 31/Jul/12 ] | ||||
|
Also observed at LLNL on Orion | ||||
| Comment by Christopher Morrone [ 31/Jul/12 ] | ||||
|
I've added the http://review.whamcloud.com/3497 patch for testing, and we currently have patch set 11 of the original http://review.whamcloud.com/2574 patch as well. | ||||
| Comment by Bruno Faccini (Inactive) [ 13/Sep/12 ] | ||||
|
We also still frequently hit the "(osc_io.c:702:osc_req_attr_set()) LBUG" with our Bull Lustre v2.1.2 build. This seems to mean that we are hitting a timing-window issue where the expected but missing "matching" lock may have been linked to the object's lock list too late ... BTW, we have developed a local debug patch that triggers the same condition earlier, by moving the LBUG (in the case where no lock is found) into a "clone" of the cl_lock_at_page() routine that is only called from osc_req_attr_set(). This allows the condition to be detected, and the LBUG to fire, under coh_lock_guard protection this time, and perhaps even to catch the bringer of the future matching lock stuck waiting on coh_lock_guard !!... We will also try running with the "max_ptlrpcds=1" and "ptlrpcd_bind_policy=1" settings of the ptlrpc module, since we strongly suspect that the "multi+binding ptlrpcds" feature has some impact on this problem ... | ||||
| Comment by Jinshan Xiong (Inactive) [ 13/Sep/12 ] | ||||
Did you notice the lock state when this issue was happening? My gut feeling is that this page must be a readahead page, and that adding a readahead page into the cache and checking for a matching cl_lock are not atomic. Please dump the lock state and page state when you next see this problem. Another issue observed in | ||||
| Comment by Michael McKay (Inactive) [ 14/Sep/12 ] | ||||
|
I am currently looking at an issue that looks similar to the first stack trace in this ticket. The crashing trace in my case is: In this case the object has 2 locks: Could this be a similar issue? | ||||
| Comment by Jinshan Xiong (Inactive) [ 15/Sep/12 ] | ||||
|
Hi Michael, it looks similar. Actually when this bug is hit, it should dump all locks of this object. Can anybody post it here on the ticket? | ||||
| Comment by Michael McKay (Inactive) [ 17/Sep/12 ] | ||||
|
Here is the output of the locks when the assertion hit.
7196:0:(osc_io.c:698:osc_req_attr_set()) lock@ffff8803f4522358[3 2 0 1 1 00000006] P(0):[0, 18446744073709551615]@[0x100000000:0x4b1f5c0:0x0] {
7196:0:(osc_io.c:698:osc_req_attr_set()) lovsub@ffff8803f525e4e0: [0 ffff880404198470 P(0):[0, 18446744073709551615]@[0x23d979800:0x1:0x0]]
7196:0:(osc_io.c:698:osc_req_attr_set()) osc@ffff8803f4a3ab50: (null) 00001000 0x85ead42e54740750 1 (null) size: 0 mtime: 0 atime: 0 ctime: 0 blocks: 0
7196:0:(osc_io.c:698:osc_req_attr_set()) } lock@ffff8803f4522358
lock@ffff8803f4522dd8 | ||||
| Comment by Jinshan Xiong (Inactive) [ 28/Sep/12 ] | ||||
|
I'm working on this issue. | ||||
| Comment by Jinshan Xiong (Inactive) [ 28/Sep/12 ] | ||||
|
I realize that the issue mentioned by Michael McKay is actually different: there the assertion was hit in the ll_writepage() path, whereas initially the assertion was hit on a readahead page. After taking a look at the crash log, it's really strange, because the page was actually covered by a lock, as follows:
LustreError: 73977:0:(osc_io.c:698:osc_req_attr_set()) lock@ffff880b61a79518[0 5 0 0 0 00000000] W(2):[185984, 187903]@[0x102350000:0x536b:0x0] {
LustreError: 73977:0:(osc_io.c:698:osc_req_attr_set()) lovsub@ffff880b5c410620: [0 ffff880b57801678 W(2):[558720, 564735]@[0x2103bd364:0x1e89:0x0]]
LustreError: 73977:0:(osc_io.c:698:osc_req_attr_set()) osc@ffff880b64809648: ffff880b5a1e3480 40120002 0x6cc4082097a097e7 4 (null) size: 770703360 mtime: 1341558695 atime: 1341558695 ctime: 1341558695 blocks: 1459736
LustreError: 73977:0:(osc_io.c:698:osc_req_attr_set()) } lock@ffff880b61a79518
LustreError: 73977:0:(osc_io.c:698:osc_req_attr_set()) no cover page!
and the page was:
LustreError: 73977:0:(osc_io.c:700:osc_req_attr_set()) page@ffff880b57936ec0[2 ffff8804ff1e1a48:561152 ^(null)_ffff880b57936e00 3 0 1 (null) ffff880b7608cec0 0x0]
LustreError: 73977:0:(osc_io.c:700:osc_req_attr_set()) page@ffff880b57936e00[1 ffff8804ff1e2d78:186368 ^ffff880b57936ec0_(null) 3 0 1 (null) (null) 0x0]
LustreError: 73977:0:(osc_io.c:700:osc_req_attr_set()) vvp-page@ffff880b57b38320(1:0:0) vm@ffffea0027a42cc0 1400000000000821 2:0 ffff880b57936ec0 561152 lru
LustreError: 73977:0:(osc_io.c:700:osc_req_attr_set()) lov-page@ffff880b5a276498
LustreError: 73977:0:(osc_io.c:700:osc_req_attr_set()) osc-page@ffff880b565f28b0: 1< 0x845fed 1 0 - - + > 2< 763363328 0 4096 0x7 0x8 | (null) ffff88104f42a588 ffff880501380840 ffffffffa08ea140 ffff880b565f28b0 > 3< + ffff8808220f0040 1 732 0 > 4< 0 0 32 72396800 - | - - + - > 5< - - + - | 0 - - | 208 + ->
LustreError: 73977:0:(osc_io.c:700:osc_req_attr_set()) end page@ffff880b57936e00
LustreError: 73977:0:(osc_io.c:700:osc_req_attr_set()) dump uncover page!
The page index is 186368, the lock is a PW lock which covers [185984, 187903], and the lock is in the cached state. It all looks fine except that the assertion was hit. I'll investigate further. | ||||
| Comment by Jinshan Xiong (Inactive) [ 29/Sep/12 ] | ||||
|
Hi Gregoire, do you still have the crash file? I'd like to see the value of osc_page::ops_lock, because that lock was supposed to cover that readahead page. Otherwise, I'll provide you a debug patch to print it in osc_page_print() so that we can see it the next time you hit this. Thanks | ||||
| Comment by Bruno Faccini (Inactive) [ 03/Oct/12 ] | ||||
|
Hello Jinshan, I checked all the crash dumps still available on-site, and osc_page::ops_lock is always non-null and valid. BTW, we have not hit this LBUG() since we started running (on a much smaller test configuration) with our own debug patch, which I described in my earlier comment of 13/Sep/12 6:33 PM. | ||||
| Comment by Jinshan Xiong (Inactive) [ 04/Oct/12 ] | ||||
|
what's the status of that lock? | ||||
| Comment by Peter Jones [ 04/Oct/12 ] | ||||
|
Lai, could you please help out on this issue? Thanks, Peter | ||||
| Comment by Bruno Faccini (Inactive) [ 04/Oct/12 ] | ||||
|
The status (cll_state?) of osc_page::ops_lock found in the still-available crash dumps is 0x6/CLS_FREEING. | ||||
| Comment by Cory Spitz [ 08/Oct/12 ] | ||||
|
FYI: | ||||
| Comment by Lai Siyao [ 11/Oct/12 ] | ||||
|
Hi Bruno, did you include http://review.whamcloud.com/3497 in your code? | ||||
| Comment by Alexandre Louvet [ 11/Oct/12 ] | ||||
|
Hi Lai. No, we don't have this patch in our build tree. | ||||
| Comment by Lai Siyao [ 11/Oct/12 ] | ||||
|
Alexandre, according to Jinshan's comment above, http://review.whamcloud.com/3497 is a possible fix to the LBUG in osc_req_attr_set(), will you apply that patch and retest? | ||||
| Comment by Alexandre Louvet [ 11/Oct/12 ] | ||||
|
The patch has been applied and we are going to test. | ||||
| Comment by Jinshan Xiong (Inactive) [ 17/Oct/12 ] | ||||
|
Hi Bruno Faccini, can you please share your debug patch with me? We have several occurrences of this issue or similar ones, but the lock was there in the crash dump; I don't understand why it couldn't be found in cl_lock_at_pgoff(). | ||||
| Comment by Bruno Faccini (Inactive) [ 18/Oct/12 ] | ||||
|
You mean cl_lock_at_page()?? | ||||
| Comment by Jinshan Xiong (Inactive) [ 20/Oct/12 ] | ||||
|
Hi Bruno, can you please try patch: http://review.whamcloud.com/4316? It turns out that there are lots of LBUGs caused by this problem and this patch can fix it. | ||||
| Comment by Bruno Faccini (Inactive) [ 26/Oct/12 ] | ||||
|
We will try to integrate "http://review.whamcloud.com/4316" into our builds ASAP. On the other hand, we still lack real-world/production exposure of Lustre 2.1.2 on our Lustre clients | ||||
| Comment by Bruno Faccini (Inactive) [ 12/Nov/12 ] | ||||
|
We upgraded the whole T100 system to Lustre 2.1.3 (Bull build .223) this past week, and now we are encountering this issue again quite frequently. BTW, we still lack "http://review.whamcloud.com/4316" in the current Bull R&D builds, so we may plan to integrate it locally ourselves, along with our debug patch too ... | ||||
| Comment by Bruno Faccini (Inactive) [ 12/Nov/12 ] | ||||
|
I also have a question in mind: why does cl_lock_at_page() not take into account cl_locks in the CLS_QUEUED/CLS_QUEUEING states??? | ||||
| Comment by Jinshan Xiong (Inactive) [ 13/Nov/12 ] | ||||
|
Hi Bruno, patch 4316 is worth trying because it has fixed a couple of similar issues in our testing, but to be honest, I haven't seen this particular issue in our testing. Now that it can be reproduced at high frequency, this is a good chance for us to dig into the root cause. Can you please turn on the debug option to collect logs on the client? To maintain data consistency, only granted locks are allowed to cache pages on the client side. If a lock is in the CLS_QUEUED state, the DLM enqueue request has been sent but the server has not granted it yet; usually this is because a conflicting lock has been queued by another client and the server has to revoke it before granting this one. For a CLS_QUEUEING lock, the DLM request hasn't been sent yet; this happens when a conflicting lock exists on the same client, so we should cancel it before enqueuing this lock, otherwise it will take extra RPCs for the server to revoke it. | ||||
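As an illustration of the rule Jinshan describes (only granted locks may cover cached pages), here is a minimal, hypothetical lookup sketch; it is not the real cl_lock_at_page()/cl_lock_at_pgoff() code, and the toy_* types are invented for the example. The page index and extent in main() are taken from the lock dump quoted earlier in this ticket.

```c
/* Hypothetical sketch (NOT the real cl_lock_at_page()) showing why a
 * page-covering lookup skips locks that the server has not granted:
 * a QUEUEING/QUEUED lock carries no guarantee yet, so it cannot
 * protect cached pages. */
#include <stddef.h>

enum toy_state { TOY_QUEUEING, TOY_QUEUED, TOY_GRANTED, TOY_CACHED,
		 TOY_FREEING };

struct toy_lock {
	enum toy_state   state;
	unsigned long    start;	/* first covered page index */
	unsigned long    end;	/* last covered page index  */
	struct toy_lock *next;
};

static struct toy_lock *toy_lock_at_page(struct toy_lock *head,
					 unsigned long index)
{
	struct toy_lock *lk;

	for (lk = head; lk != NULL; lk = lk->next) {
		/* skip locks that are not (or no longer) granted */
		if (lk->state != TOY_GRANTED && lk->state != TOY_CACHED)
			continue;
		if (index >= lk->start && index <= lk->end)
			return lk;
	}
	return NULL;	/* "no cover page!" -> the LBUG in osc_req_attr_set() */
}

int main(void)
{
	/* extent and page index from the lock/page dump quoted above */
	struct toy_lock cached = { TOY_CACHED, 185984, 187903, NULL };

	return toy_lock_at_page(&cached, 186368) != NULL ? 0 : 1;
}
```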
| Comment by Antoine Percher [ 22/Nov/12 ] | ||||
|
We have upgraded the system with a Lustre version that integrates the patches, and we saw two things:
| ||||
| Comment by Antoine Percher [ 23/Nov/12 ] | ||||
|
I added in attachment some stack traces and Lustre debug messages. From my point of view the interesting thing is in the trace from germain1174: I found that the lock 0xffff88004e087cf8 ...
CPU5 1353578056.708194 : vvp_io_write_start() write: [1090330624, 1090519040)
Do you agree? If you need more information I can try to extract it | ||||
| Comment by Jinshan Xiong (Inactive) [ 23/Nov/12 ] | ||||
|
So lock ffff88004e087cf8 was enqueued for read and ffff880626aec4d8 is being enqueued for write. However, the read is synchronous, so lock ffff88004e087cf8 shouldn't be canceled before the read completes; can you please verify that? What's the message printed in the function cl_lock_at_page_dup()? | ||||
| Comment by Bruno Faccini (Inactive) [ 23/Nov/12 ] | ||||
|
Debug stuff that Antoine was not able to attach and he sent me by email. | ||||
| Comment by Bruno Faccini (Inactive) [ 23/Nov/12 ] | ||||
|
Hello Jinshan, my earlier comment was about the LU1650_trace_cl_lock_at_page_dup_VS_osc_req_attr_set_LBUG.tgz file I attached. In this archive you should find at least the cl_lock_at_page_dup() messages you requested. | ||||
| Comment by Antoine Percher [ 26/Nov/12 ] | ||||
|
Did you see the LBUG message printed? I also notice that for each issue/LBUG, the page number that triggers it sits right at a chunk boundary. For example: in my configuration the stripe size is one megabyte (1m = 256 pages). Maybe the lock was canceled/deleted just one bit/page too early? | ||||
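A quick back-of-the-envelope check of that arithmetic, assuming 4 KiB pages and the 1 MiB stripe size mentioned above, and reusing the byte offsets from the vvp_io_write_start() line quoted earlier in the thread (the actual faulting page index is not given in this comment, so the numbers are only illustrative):

```c
#include <stdio.h>

int main(void)
{
	const unsigned long long page_size   = 4096;		/* 4 KiB */
	const unsigned long long stripe_size = 1024 * 1024;	/* 1 MiB */
	/* end offset from "vvp_io_write_start() write: [1090330624, 1090519040)" */
	const unsigned long long write_end   = 1090519040ULL;

	printf("pages per 1 MiB chunk                  : %llu\n",
	       stripe_size / page_size);			/* 256 */
	printf("write ends exactly on a chunk boundary : %s\n",
	       write_end % stripe_size == 0 ? "yes" : "no");
	printf("first page of the next/adjacent chunk  : %llu\n",
	       write_end / page_size);				/* 266240 */
	return 0;
}
```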
| Comment by Bruno Faccini (Inactive) [ 26/Nov/12 ] | ||||
|
Since we worked on that together, and to expand on what Antoine started to detail: it seems to us that the page/index causing the LBUG (because it is missing a covering cl_lock) is the first page of the stripe-size x stripe-count chunk immediately adjacent to the chunk containing the page where the file position plus the read size/offset lands ... Hmmm, not sure this is clear when trying to be detailed here!! So let's try to show it another way:
– pages
File linearly represented : ...--
Now let's see how it looks in the Web interface ... | ||||
| Comment by Bruno Faccini (Inactive) [ 26/Nov/12 ] | ||||
|
Ok just as I expected, to see it "readable" you have to edit the comment, sorry. | ||||
| Comment by Jinshan Xiong (Inactive) [ 26/Nov/12 ] | ||||
|
Hi Antoine and Bruno, thanks for the analysis. It's normal for a conflicting lock to be canceled during I/O. In this case, lock ffff88004e087cf8 conflicts with a queueing write lock, and it covers 121 readahead pages, so the process didn't hold this lock. However, if everything were okay, those pages would have been discarded when the lock was canceled, but apparently that didn't happen. There must be some race window between cl_lock_page_out() and check_and_discard_cb(). I'm checking the code. | ||||
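To make the suspected race window concrete, the sketch below is a minimal, hypothetical two-thread model written with plain pthreads; it is not Lustre code, and the shared flags deliberately have no locking so the window stays open. One thread plays the lock-cancel path that discards covered pages, the other plays readahead inserting a page under the same lock. If the covering-lock check and the page insertion are not atomic with respect to the discard pass, the new page survives the cancel and is later submitted with no covering lock, which is exactly the osc_req_attr_set() assertion.

```c
/* Hypothetical illustration of the race window described above; the
 * page_cached[] array stands in for the pages covered by a cl_lock and
 * the two threads stand in for lock cancellation and readahead.
 * Synchronization is deliberately omitted to keep the window open. */
#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>
#include <unistd.h>

#define NPAGES 256

static bool page_cached[NPAGES];	/* pages currently in the cache */
static bool lock_granted = true;	/* the covering lock is granted */

static void *cancel_thread(void *arg)
{
	int i;

	(void)arg;
	/* discard every cached page covered by the lock ... */
	for (i = 0; i < NPAGES; i++)
		page_cached[i] = false;
	/* ... then drop the lock */
	lock_granted = false;
	return NULL;
}

static void *readahead_thread(void *arg)
{
	(void)arg;
	/* check for a covering lock, then insert the page; the gap
	 * between the check and the insert is the race window */
	if (lock_granted) {
		usleep(1000);		/* widen the window */
		page_cached[NPAGES - 1] = true;
	}
	return NULL;
}

int main(void)
{
	pthread_t ra, cancel;

	pthread_create(&ra, NULL, readahead_thread, NULL);
	pthread_create(&cancel, NULL, cancel_thread, NULL);
	pthread_join(ra, NULL);
	pthread_join(cancel, NULL);

	if (page_cached[NPAGES - 1] && !lock_granted)
		printf("page cached with no covering lock -> would LBUG\n");
	else
		printf("no race hit on this run\n");
	return 0;
}
```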
| Comment by Jinshan Xiong (Inactive) [ 26/Nov/12 ] | ||||
|
I've found the root cause, working on a patch... | ||||
| Comment by Jinshan Xiong (Inactive) [ 26/Nov/12 ] | ||||
|
Can you please apply this patch http://review.whamcloud.com/4673 and test it again? Thanks. | ||||
| Comment by Alain GOB [ 29/Nov/12 ] | ||||
|
Dear, | ||||
| Comment by Peter Jones [ 29/Nov/12 ] | ||||
|
Alain, we are testing and reviewing the patch in parallel with your testing. It sounds to me like we would want to see at least two weeks of the test running on your test cluster before we could confidently assert that the situation has been improved by the fix. We'll aim to get the fix landed in our tree and thus expose it to broader testing, to raise confidence that there are no adverse side effects from this fix. Peter | ||||
| Comment by Antoine Percher [ 10/Dec/12 ] | ||||
|
After 6 days on a big server that reproduces the issue each day, | ||||
| Comment by Peter Jones [ 19/Dec/12 ] | ||||
|
Thanks. This fix has been landed for 2.1.4 and 2.4 |