[LU-2557] osc_page_delete()) Trying to teardown failed: -16 (EBUSY) Created: 02/Jan/13 Updated: 25/Feb/13 Resolved: 25/Feb/13 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.3.0, Lustre 2.4.0 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Blocker |
| Reporter: | Bruce Korb (Inactive) | Assignee: | Hongchao Zhang |
| Resolution: | Duplicate | Votes: | 0 |
| Labels: | LB | ||
| Environment: |
Lustre client 2.3 rc6, SLES11 2.6.32.59 or SLES11 3.0.38 |
||
| Attachments: |
|
||||||||
| Issue Links: |
|
||||||||
| Severity: | 2 | ||||||||
| Rank (Obsolete): | 5983 | ||||||||
| Description |
|
The "teardown failure" is handled with an assertion failure. The stack trace from the assertion failure call: And extract from pid 617 Lustre traces: 00000008:00020000:21.0:1353473247.507160:0:617:0:(osc_page.c:411:osc_page_delete()) page@ffff88080aa2f180[1 ffff8801ff1e1508:51 ^ffff88080aa2f240_(null) 4 0 1 (null) (null) 0x0] Customer commentary: A dump is in: Reference: |
| Comments |
| Comment by Bruce Korb (Inactive) [ 09/Jan/13 ] |
|
The attached "clio disable CILR_PEEK optimization" patch does not seem to work. Vitaly's analysis of the issues falling out from The new symptom is a hang. Two threads have the following stack trace and the data for one of the stack traces appears to have been corrupted. Upon request, I'll add those details, but meanwhile any help at all in understanding the root cause of the Thank you. Stack Trace of Hung Process PID: 18325 TASK: ffff8803fe1381c0 CPU: 1 COMMAND: "fsx-linux-aio"
#0 [ffff8803c51dd538] schedule at ffffffff8141e405
#1 [ffff8803c51dd610] schedule_timeout at ffffffff8141ee14
#2 [ffff8803c51dd690] cfs_waitq_timedwait at ffffffffa01a66c1 [libcfs]
#3 [ffff8803c51dd6a0] osc_extent_wait at ffffffffa0725694 [osc]
#4 [ffff8803c51dd810] osc_cache_wait_range at ffffffffa0726167 [osc]
#5 [ffff8803c51dd9a0] osc_io_fsync_end at ffffffffa0717a79 [osc]
#6 [ffff8803c51dd9e0] cl_io_end at ffffffffa03048b2 [obdclass]
#7 [ffff8803c51dda10] lov_io_end_wrapper at ffffffffa07c26f9 [lov]
#8 [ffff8803c51dda30] lov_io_fsync_end at ffffffffa07c407b [lov]
#9 [ffff8803c51dda60] cl_io_end at ffffffffa03048b2 [obdclass]
#10 [ffff8803c51dda90] cl_io_loop at ffffffffa0308c4a [obdclass]
#11 [ffff8803c51ddac0] cl_sync_file_range at ffffffffa08aacbb [lustre]
#12 [ffff8803c51ddb30] ll_writepages at ffffffffa08d0164 [lustre]
#13 [ffff8803c51ddb70] do_writepages at ffffffff810cd631
#14 [ffff8803c51ddb80] __filemap_fdatawrite_range at ffffffff810c5b65
#15 [ffff8803c51ddbe0] do_sync_mapping_range at ffffffff8112ea01
#16 [ffff8803c51ddc10] __generic_file_aio_write at ffffffff810c6d6c
#17 [ffff8803c51ddcc0] generic_file_aio_write at ffffffff810c6e42
#18 [ffff8803c51ddd00] vvp_io_write_start at ffffffffa08fd4f6 [lustre]
#19 [ffff8803c51ddd40] cl_io_start at ffffffffa0304a1a [obdclass]
#20 [ffff8803c51ddd70] cl_io_loop at ffffffffa0308c3c [obdclass]
#21 [ffff8803c51ddda0] ll_file_io_generic at ffffffffa08a9cc5 [lustre]
#22 [ffff8803c51dde10] ll_file_aio_write at ffffffffa08aa01b [lustre]
#23 [ffff8803c51dde60] aio_rw_vect_retry at ffffffff811426d4
#24 [ffff8803c51dde90] aio_run_iocb at ffffffff81143ffa
#25 [ffff8803c51ddec0] sys_io_submit at ffffffff81144e38
#26 [ffff8803c51ddf80] system_call_fastpath at ffffffff8100305b
RIP: 00007fa64d7d06f7 RSP: 00007fff6d99f318 RFLAGS: 00010206
RAX: 00000000000000d1 RBX: ffffffff8100305b RCX: 000000000000002d
RDX: 00007fff6d99f2c0 RSI: 0000000000000001 RDI: 00007fa64dbe6000
RBP: 0000000000006000 R8: 0000000000039000 R9: 0000000000000003
R10: 262b0c2b4d2b962b R11: 0000000000000202 R12: 00007fff6d99f770
R13: 0000000000404658 R14: 000000000003f000 R15: 000000000003f000
ORIG_RAX: 00000000000000d1 CS: 0033 SS: 002b
|
| Comment by Hongchao Zhang [ 29/Jan/13 ] |
|
this ticket could be related to the "write_one_page" call in ll_page_mkwrite0 if there is no lock to protect the write. else if (result == -ENODATA) { unlock_page(vmpage); if there is an extent covering this page is waiting to be flushed back to OST, "write_one_page" will put the page into the pending list. Hi Bruce, could you please try to test it without the "write_one_page" call? Thanks. |
| Comment by Bruce Korb (Inactive) [ 29/Jan/13 ] |
|
Meaning that the call to "write_one_page(vmpage, 1)" did nothing useful, so just eliminate it entirely? OK... BTW, "noformat" inside of curly braces before and after a code snippet makes the code much more readable. code snippet from ll_page_mkwrite0 if (result == 0 || result == -ENODATA) {
...
if (vmpage->mapping == NULL) {
...
} else if (result == -ENODATA) {
write_one_page(vmpage, 1);
lock_page(vmpage);
if (vmpage->mapping != NULL) {
ll_invalidate_page(vmpage);
LASSERT(vmpage->mapping == NULL);
}
unlock_page(vmpage);
}
...
}
|
| Comment by Bruce Korb (Inactive) [ 29/Jan/13 ] |
|
longer snippet from ll_page_mkwrite0 if (result == 0 || result == -ENODATA) {
lock_page(vmpage);
if (vmpage->mapping == NULL) {
unlock_page(vmpage);
if (result == 0)
result = -ENODATA;
} else if (result == -ENODATA) {
write_one_page(vmpage, 1);
lock_page(vmpage);
if (vmpage->mapping != NULL) {
ll_invalidate_page(vmpage);
LASSERT(vmpage->mapping == NULL);
}
unlock_page(vmpage);
OK, I'm not understanding how the write_one_page is not within a "lock_page". |
| Comment by Bruce Korb (Inactive) [ 29/Jan/13 ] |
|
do not lock a page that is already locked |
| Comment by Hongchao Zhang [ 29/Jan/13 ] |
|
the page is locked before calling "write_one_lock", and it will unlock the page after it exits, the modified codes like longer snippet from ll_page_mkwrite0 if (result == 0 || result == -ENODATA) {
lock_page(vmpage);
if (vmpage->mapping == NULL) {
unlock_page(vmpage);
if (result == 0)
result = -ENODATA;
} else if (result == -ENODATA) {
if (vmpage->mapping != NULL) {
ll_invalidate_page(vmpage);
LASSERT(vmpage->mapping == NULL);
}
unlock_page(vmpage);
|
| Comment by Bruce Korb (Inactive) [ 30/Jan/13 ] |
|
Sorry – should have read code before commenting. I knew Linus strongly favored locking and unlocking in the same function, ... Anyway, according to Cray, removing the "write_one_page" call still hangs: Wally Wang updated LELUS-103:
-----------------------------
Attachment: lelus103-sp1-nowrite-one-page.tar.bz2
After removing the write_one_page(), it still hangs. See attachment
|
| Comment by Hongchao Zhang [ 31/Jan/13 ] |
|
Hi, is the previouse patch(removing CILR_PEEK) still applied with removing the write_one_page? |
| Comment by Bruce Korb (Inactive) [ 31/Jan/13 ] |
|
CF:
|
| Comment by Peter Jones [ 25/Feb/13 ] |
|
duplicate of |