[LU-3448] osc_page_delete()) ASSERTION(0) failed running racer Created: 10/Jun/13 Updated: 24/Jul/13 Resolved: 21/Jun/13 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.5.0 |
| Fix Version/s: | Lustre 2.5.0 |
| Type: | Bug | Priority: | Blocker |
| Reporter: | John Hammond | Assignee: | John Hammond |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | osc |
| Severity: | 3 |
| Rank (Obsolete): | 8619 |
| Description |
|
Running a modified racer ( LustreError: 23518:0:(osc_cache.c:2379:osc_teardown_async_page()) extent ffff8800abb54c60@{[0 -> 1/255], [3|1|-|active|wi|ffff8800aa8b4688], [8192|2|+|-|ffff8800ac9a2e78|256|(null)]} trunc at 0.
LustreError: 23518:0:(osc_page.c:430:osc_page_delete()) page@ffff8800acab8600[2 ffff8800ae3bb630:0 ^(null)_ffff880112410800 4 0 1 (null) (null) 0x0]
LustreError: 23518:0:(osc_page.c:430:osc_page_delete()) page@ffff880112410800[2 ffff8800a99e8508:0 ^ffff8800acab8600_(null) 4 0 1 (null) (null) 0x0]
LustreError: 23518:0:(osc_page.c:430:osc_page_delete()) vvp-page@ffff8800acab86c0(0:0:0) vm@ffffea0002c25250 20000000000035 3:0 0 0 lru
LustreError: 23518:0:(osc_page.c:430:osc_page_delete()) lov-page@ffff8800acab8710
LustreError: 23518:0:(osc_page.c:430:osc_page_delete()) osc-page@ffff8801124108e8: 1< 0x845fed 258 0 + - > 2< 0 0 4096 0x0 0x520 | (null) ffff8800b135e600 ffff8800aa8b4688 > 3< + ffff8800ade74080 0 0 0 > 4< 0 0 8 33980416 - | - - + - > 5< - - + - | 0 - | 7 - ->
LustreError: 23518:0:(osc_page.c:430:osc_page_delete()) end page@ffff8800acab8600
LustreError: 23518:0:(osc_page.c:430:osc_page_delete()) Trying to teardown failed: -16
LustreError: 23518:0:(osc_page.c:431:osc_page_delete()) ASSERTION( 0 ) failed:
LustreError: 23518:0:(osc_page.c:431:osc_page_delete()) LBUG
Pid: 23518, comm: cp
Call Trace:
[<ffffffffa02ae895>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
[<ffffffffa02aee97>] lbug_with_loc+0x47/0xb0 [libcfs]
[<ffffffffa0854701>] osc_page_delete+0x311/0x320 [osc]
[<ffffffffa0468bb5>] cl_page_delete0+0xc5/0x4e0 [obdclass]
[<ffffffffa0469012>] cl_page_delete+0x42/0x120 [obdclass]
[<ffffffffa0ce766d>] ll_invalidatepage+0x8d/0x160 [lustre]
[<ffffffff81131ae5>] do_invalidatepage+0x25/0x30
[<ffffffff81131e02>] truncate_inode_page+0xa2/0xc0
[<ffffffff811322d2>] truncate_inode_pages_range+0x292/0x500
[<ffffffffa02afa4e>] ? cfs_mem_cache_free+0xe/0x10 [libcfs]
[<ffffffff81143b62>] ? unmap_mapping_range+0x72/0x140
[<ffffffff811325d5>] truncate_inode_pages+0x15/0x20
[<ffffffff8113262f>] truncate_pagecache+0x4f/0x70
[<ffffffff811aa84a>] simple_setsize+0x3a/0x50
[<ffffffff811aa8a0>] simple_setattr+0x40/0x70
[<ffffffffa0cc1416>] ll_setattr_raw+0x2a6/0x1090 [lustre]
[<ffffffffa0cc225b>] ll_setattr+0x5b/0xf0 [lustre]
[<ffffffff8119fdc8>] notify_change+0x168/0x340
[<ffffffff811807e4>] do_truncate+0x64/0xa0
[<ffffffff8121e52f>] ? security_inode_permission+0x1f/0x30
[<ffffffff811946e4>] do_filp_open+0x844/0xdd0
[<ffffffff8104757c>] ? __do_page_fault+0x1ec/0x480
[<ffffffff811a0ca2>] ? alloc_fd+0x92/0x160
[<ffffffff8117f559>] do_sys_open+0x69/0x140
[<ffffffff8117f670>] sys_open+0x20/0x30
[<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
|
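The `-16` in "Trying to teardown failed: -16" is a negative errno value; on Linux, errno 16 is `EBUSY`. A minimal sketch of decoding such return codes follows. The helper name `rc_name` is hypothetical and covers only a few codes; it is not part of Lustre.

```c
#include <errno.h>

/*
 * Hypothetical helper (not a Lustre function): map a negative,
 * kernel-style return code to its symbolic errno name.
 * Only a handful of codes are listed for illustration.
 */
const char *rc_name(int rc)
{
    switch (-rc) {
    case EBUSY:
        return "EBUSY";   /* resource is still in use */
    case EBADF:
        return "EBADF";
    case EINVAL:
        return "EINVAL";
    default:
        return "unknown";
    }
}
```

Calling `rc_name(-16)` on Linux yields `"EBUSY"`, which matches the log: the page teardown was refused because the page was still busy.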
| Comments |
| Comment by John Hammond [ 12/Jun/13 ] |
|
This appears to have been introduced by a61ff59. After reverting that commit I no longer see it. On master I can see it using a non-modified racer as well. |
| Comment by Jinshan Xiong (Inactive) [ 13/Jun/13 ] |
|
My bad, this is a race. After the truncate finished in ll_setattr_ost(), it released the inode mutex, so a new write could produce more pages beyond the truncate size. John, if you get a chance, can you please restore the change in http://review.whamcloud.com/#patch,sidebyside,4816,18,lustre/llite/llite_lib.c and give it a try? |
| Comment by John Hammond [ 13/Jun/13 ] |
|
This works for me. I reverted the first 3 of the 4 hunks in llite_lib.c and the crash has gone away. I don't follow your explanation, however. It seems more likely to me that the difference is that the old way cleared ATTR_SIZE from valid in simple_setattr(). Please correct me if I'm wrong. |
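The hypothesis above turns on whether the size bit is still set in the attribute's valid mask when the generic setattr path runs. A toy sketch of that gating is below. All `toy_*` and `TOY_*` names are hypothetical and defined locally; this is not the kernel's `simple_setattr()` or Lustre's code, only an illustration of how masking out a size flag skips a second page cache truncation.

```c
/* Toy flag values, defined locally for illustration only. */
enum {
    TOY_ATTR_MODE = 1 << 0,
    TOY_ATTR_SIZE = 1 << 3
};

struct toy_iattr {
    unsigned int valid;  /* bitmask: which fields below are meaningful */
    long long size;
};

struct toy_inode {
    long long size;
    int pagecache_truncations;  /* counts page cache truncate passes */
};

/* Generic-style setattr: touches size and page cache only if the size bit is set. */
void toy_generic_setattr(struct toy_inode *inode, const struct toy_iattr *attr)
{
    if (attr->valid & TOY_ATTR_SIZE) {
        inode->size = attr->size;
        inode->pagecache_truncations++;
    }
}

/*
 * Caller that has already handled the size change itself: it clears
 * the size bit on its own copy of the attributes before delegating,
 * so the generic path does not truncate the page cache again.
 */
void toy_setattr_size_handled(struct toy_inode *inode, struct toy_iattr attr)
{
    inode->size = attr.size;
    attr.valid &= ~(unsigned int)TOY_ATTR_SIZE;
    toy_generic_setattr(inode, &attr);
}
```

With the bit set, the generic path performs its own truncate pass; with the bit cleared, it leaves size handling to the caller.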
| Comment by John Hammond [ 14/Jun/13 ] |
|
Please see http://review.whamcloud.com/6643. |
| Comment by John Hammond [ 14/Jun/13 ] |
|
Restoring the call to simple_setattr() fixed the LBUG but causes the added test (sanity 229) to fail. Shall I delete the test as well?

On master (2.4.50-79-gaed8203) which has the patch from

# MOUNT_2=y llmount.sh
# multiop /mnt/lustre/f0 H2c
# truncate --size=42 /mnt/lustre/f0
# stat /mnt/lustre/f0
  File: `/mnt/lustre/f0'
  Size: 42 Blocks: 0 IO Block: 4194304 regular file
...
# stat /mnt/lustre2/f0
  File: `/mnt/lustre2/f0'
  Size: 0 Blocks: 0 IO Block: 4194304 regular empty file
...
# stat /mnt/lustre/f0
  File: `/mnt/lustre/f0'
  Size: 42 Blocks: 0 IO Block: 4194304 regular file
...

Can someone explain why we support truncate (to non-zero size) on released files? Truncate to zero seems somewhat defensible, but to non-zero just asks for trouble. But in either case why bother? In practice won't truncate almost always be followed by write (requiring a restore)? |
| Comment by Jinshan Xiong (Inactive) [ 14/Jun/13 ] |
|
Originally this was worked out as an optimization. For example, when a released file is truncated, we just change the size on the MDT and restore it later. The problem is that if a released file is truncated down to size A and then up to size B, the file content in [A, B] should be zero. I think it's okay to remove the truncate part in the test case. |
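The zero-fill requirement described above (truncate down to A, then up to B, and the range between must read as zeros) can be illustrated with a small in-memory model. The `toy_file` type and `toy_truncate` function are hypothetical and have nothing to do with Lustre's implementation; they only show the semantics a deferred restore would have to preserve.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical in-memory file used only to illustrate truncate semantics. */
struct toy_file {
    unsigned char *data;
    size_t size;
};

/*
 * Resize the toy file. Shrinking discards the tail; growing must
 * expose zeros, never the bytes that were there before the shrink.
 * Returns 0 on success, -1 if memory allocation fails.
 */
int toy_truncate(struct toy_file *f, size_t new_size)
{
    unsigned char *p;

    assert(f != NULL);
    if (new_size == 0) {
        free(f->data);
        f->data = NULL;
        f->size = 0;
        return 0;
    }
    p = realloc(f->data, new_size);
    if (p == NULL)
        return -1;
    if (new_size > f->size)
        memset(p + f->size, 0, new_size - f->size);  /* zero the newly exposed range */
    f->data = p;
    f->size = new_size;
    return 0;
}
```

If a released file only had its size updated and the archived data were restored later without remembering the intermediate size A, the old bytes between A and B would reappear, which is the problem the comment points out.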
| Comment by John Hammond [ 14/Jun/13 ] |
|
Well that would make an even stronger case for disallowing truncates on released files. CEA colleagues, do you have opinions to share on this? |
| Comment by Johann Lombardi (Inactive) [ 15/Jun/13 ] |
|
I don't think we ever intended to support such an optimization. In general, truncate should trigger a restore. The only case we want to "optimize" is truncate to 0 where we can just discard the HSM copy. |
| Comment by jacques-charles lafoucriere [ 16/Jun/13 ] |
|
I confirm Johann's comment: truncate triggers a restore and blocks until the full restore completes. Later, with partial restore, we can optimize this. |
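The intended behavior described in the two comments above can be summarized as a small decision function. This is only a sketch of the stated policy; the enum and function names are hypothetical and do not correspond to Lustre symbols.

```c
#include <stdbool.h>

/* Hypothetical names summarizing the policy discussed in this ticket. */
enum toy_trunc_action {
    TOY_TRUNC_LOCAL,         /* file is not released: truncate normally */
    TOY_TRUNC_DISCARD_COPY,  /* released, new size 0: the HSM copy can be discarded */
    TOY_TRUNC_RESTORE_FIRST  /* released, new size > 0: restore fully, then truncate */
};

/* Decide how a truncate should be handled for a given file state. */
enum toy_trunc_action toy_truncate_policy(bool released, unsigned long long new_size)
{
    if (!released)
        return TOY_TRUNC_LOCAL;
    if (new_size == 0)
        return TOY_TRUNC_DISCARD_COPY;
    return TOY_TRUNC_RESTORE_FIRST;
}
```

Only the truncate-to-zero case on a released file avoids the restore; every other truncate on a released file waits for it.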
| Comment by Jinshan Xiong (Inactive) [ 17/Jun/13 ] |
|
It appears I misunderstood you guys. When you asked me to add the test case for truncate on a released file, I thought this was what you wanted to do. |
| Comment by Jinshan Xiong (Inactive) [ 17/Jun/13 ] |
|
To make it safe, let's deny truncate on HSM released files. John, can you please add this fix to your patch? The extra fix would be:

[jinxiong@intel mdc]$ git diff ../lov/lov_io.c
diff --git a/lustre/lov/lov_io.c b/lustre/lov/lov_io.c
index 6f6ea84..bec9fea 100644
--- a/lustre/lov/lov_io.c
+++ b/lustre/lov/lov_io.c
@@ -984,12 +984,12 @@ int lov_io_init_released(const struct lu_env *env, struct cl_object *obj,
                 LASSERTF(0, "invalid type %d\n", io->ci_type);
         case CIT_MISC:
         case CIT_FSYNC:
-        case CIT_SETATTR:
                 result = +1;
                 break;
         case CIT_READ:
         case CIT_WRITE:
         case CIT_FAULT:
+        case CIT_SETATTR:
                 /* TODO: need to restore the file. */
                 result = -EBADF;
                 break;

Without this fix, there will be problems handling the size of a released file after truncate. |
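The effect of the diff above can be modeled in a few lines of standalone C. This is a simplified sketch, not the real `lov_io_init_released()`: the `toy_` enum mirrors only the I/O types named in the diff, and the fallback return for unknown types is an assumption (the original asserts instead).

```c
#include <errno.h>

/* Local stand-ins for the I/O types named in the diff. */
enum toy_io_type {
    TOY_CIT_READ,
    TOY_CIT_WRITE,
    TOY_CIT_SETATTR,
    TOY_CIT_FAULT,
    TOY_CIT_FSYNC,
    TOY_CIT_MISC
};

/*
 * Result for I/O started on a released file, after the proposed change:
 * setattr joins read, write, and fault in being refused with -EBADF,
 * while fsync and misc keep returning +1 as before.
 */
int toy_io_init_released(enum toy_io_type type)
{
    switch (type) {
    case TOY_CIT_MISC:
    case TOY_CIT_FSYNC:
        return +1;
    case TOY_CIT_READ:
    case TOY_CIT_WRITE:
    case TOY_CIT_FAULT:
    case TOY_CIT_SETATTR:
        return -EBADF;
    }
    return -EINVAL;  /* assumption: the original code asserts on unknown types */
}
```

Before the change, the setattr case shared the `+1` branch, which is what allowed a truncate to proceed on a released file.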
| Comment by John Hammond [ 21/Jun/13 ] |
|
Patch landed to master. |
| Comment by John Hammond [ 24/Jul/13 ] |
|
Is there any plan to rehabilitate truncate to 0 (or O_TRUNC) for released files? |
| Comment by jacques-charles lafoucriere [ 24/Jul/13 ] |
|
Today we have 2 cases: truncate 0 or > 0 on 1) make an error. If we change to have truncate 0 work on 1), we can also change to have truncate > 0 work. This is only philosophy on "what is a released file without an HSM archive"