[LU-3448] osc_page_delete()) ASSERTION(0) failed running racer
Created: 10/Jun/13  Updated: 24/Jul/13  Resolved: 21/Jun/13

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.5.0
Fix Version/s: Lustre 2.5.0

Type: Bug
Priority: Blocker
Reporter: John Hammond
Assignee: John Hammond
Resolution: Fixed
Votes: 0
Labels: osc

Issue Links:
Related
is related to LU-2482 Define new layout for released file Resolved
is related to LU-2531 osc_page.c:432:osc_page_delete()) ASS... Resolved
Severity: 3
Rank (Obsolete): 8619

 Description   

Running a modified racer (LU-3072, + smaller dds + sleep in file_create.sh) I can reproduce this.

LustreError: 23518:0:(osc_cache.c:2379:osc_teardown_async_page()) extent ffff8800abb54c60@{[0 -> 1/255], [3|1|-|active|wi|ffff8800aa8b4688], [8192|2|+|-|ffff8800ac9a2e78|256|(null)]} trunc at 0.
LustreError: 23518:0:(osc_page.c:430:osc_page_delete()) page@ffff8800acab8600[2 ffff8800ae3bb630:0 ^(null)_ffff880112410800 4 0 1 (null) (null) 0x0]
LustreError: 23518:0:(osc_page.c:430:osc_page_delete()) page@ffff880112410800[2 ffff8800a99e8508:0 ^ffff8800acab8600_(null) 4 0 1 (null) (null) 0x0]
LustreError: 23518:0:(osc_page.c:430:osc_page_delete()) vvp-page@ffff8800acab86c0(0:0:0) vm@ffffea0002c25250 20000000000035 3:0 0 0 lru
LustreError: 23518:0:(osc_page.c:430:osc_page_delete()) lov-page@ffff8800acab8710
LustreError: 23518:0:(osc_page.c:430:osc_page_delete()) osc-page@ffff8801124108e8: 1< 0x845fed 258 0 + - > 2< 0 0 4096 0x0 0x520 | (null) ffff8800b135e600 ffff8800aa8b4688 > 3< + ffff8800ade74080 0 0 0 > 4< 0 0 8 33980416 - | - - + - > 5< - - + - | 0 - | 7 - ->
LustreError: 23518:0:(osc_page.c:430:osc_page_delete()) end page@ffff8800acab8600
LustreError: 23518:0:(osc_page.c:430:osc_page_delete()) Trying to teardown failed: -16
LustreError: 23518:0:(osc_page.c:431:osc_page_delete()) ASSERTION( 0 ) failed: 
LustreError: 23518:0:(osc_page.c:431:osc_page_delete()) LBUG
Pid: 23518, comm: cp

Call Trace:
 [<ffffffffa02ae895>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
 [<ffffffffa02aee97>] lbug_with_loc+0x47/0xb0 [libcfs]
 [<ffffffffa0854701>] osc_page_delete+0x311/0x320 [osc]
 [<ffffffffa0468bb5>] cl_page_delete0+0xc5/0x4e0 [obdclass]
 [<ffffffffa0469012>] cl_page_delete+0x42/0x120 [obdclass]
 [<ffffffffa0ce766d>] ll_invalidatepage+0x8d/0x160 [lustre]
 [<ffffffff81131ae5>] do_invalidatepage+0x25/0x30
 [<ffffffff81131e02>] truncate_inode_page+0xa2/0xc0
 [<ffffffff811322d2>] truncate_inode_pages_range+0x292/0x500
 [<ffffffffa02afa4e>] ? cfs_mem_cache_free+0xe/0x10 [libcfs]
 [<ffffffff81143b62>] ? unmap_mapping_range+0x72/0x140
 [<ffffffff811325d5>] truncate_inode_pages+0x15/0x20
 [<ffffffff8113262f>] truncate_pagecache+0x4f/0x70
 [<ffffffff811aa84a>] simple_setsize+0x3a/0x50
 [<ffffffff811aa8a0>] simple_setattr+0x40/0x70
 [<ffffffffa0cc1416>] ll_setattr_raw+0x2a6/0x1090 [lustre]
 [<ffffffffa0cc225b>] ll_setattr+0x5b/0xf0 [lustre]
 [<ffffffff8119fdc8>] notify_change+0x168/0x340
 [<ffffffff811807e4>] do_truncate+0x64/0xa0
 [<ffffffff8121e52f>] ? security_inode_permission+0x1f/0x30
 [<ffffffff811946e4>] do_filp_open+0x844/0xdd0
 [<ffffffff8104757c>] ? __do_page_fault+0x1ec/0x480
 [<ffffffff811a0ca2>] ? alloc_fd+0x92/0x160
 [<ffffffff8117f559>] do_sys_open+0x69/0x140
 [<ffffffff8117f670>] sys_open+0x20/0x30
 [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b


 Comments   
Comment by John Hammond [ 12/Jun/13 ]

This appears to have been introduced by a61ff59. After reverting that commit I no longer see it.

On master I can see it using a non-modified racer as well.

Comment by Jinshan Xiong (Inactive) [ 13/Jun/13 ]

My bad, this is a race. After the truncate finished in ll_setattr_ost(), the inode mutex was released, so a new write could produce more pages beyond the truncate size.

John, if you get a chance, can you please restore the change in http://review.whamcloud.com/#patch,sidebyside,4816,18,lustre/llite/llite_lib.c and give it a try?

Comment by John Hammond [ 13/Jun/13 ]

This works for me. I reverted the first 3 of the 4 hunks in llite_lib.c and the crash has gone away. I don't follow your explanation, however. It seems more likely to me that the difference is that the old way cleared ATTR_SIZE from valid in simple_setattr(). Please correct me if I'm wrong.

Comment by John Hammond [ 14/Jun/13 ]

Please see http://review.whamcloud.com/6643.

Comment by John Hammond [ 14/Jun/13 ]

Restoring the call to simple_setattr() fixed the LBUG but causes the added test (sanity 229) to fail. Shall I delete the test as well?

On master (2.4.50-79-gaed8203) which has the patch from LU-2482 I see that the effect of truncate on a released file is not seen consistently across clients:

# MOUNT_2=y llmount.sh
# multiop /mnt/lustre/f0 H2c
# truncate --size=42 /mnt/lustre/f0
# stat /mnt/lustre/f0
  File: `/mnt/lustre/f0'
  Size: 42        	Blocks: 0          IO Block: 4194304 regular file
...
# stat /mnt/lustre2/f0
  File: `/mnt/lustre2/f0'
  Size: 0         	Blocks: 0          IO Block: 4194304 regular empty file
...
# stat /mnt/lustre/f0
  File: `/mnt/lustre/f0'
  Size: 42        	Blocks: 0          IO Block: 4194304 regular file
...

Can someone explain why we support truncate (to non-zero size) on released files? Truncate to zero seems somewhat defensible, but to non-zero just asks for trouble. But in either case why bother? In practice won't truncate almost always be followed by write (requiring a restore)?

Comment by Jinshan Xiong (Inactive) [ 14/Jun/13 ]

Originally this was worked out as an optimization. For example, when a released file is truncated, we just change the size on the MDT and restore it later. But there is a problem: if a released file is truncated down to size A and then up to size B, the file content in [A, B] should contain zeros.
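The zero-fill requirement mentioned above can be seen on any POSIX filesystem. The following is a minimal sketch using plain POSIX calls on a local file; it is not Lustre or HSM code, and the function name gap_is_zero_after_truncate() and its parameters are hypothetical, chosen only for illustration.

```c
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Illustration only: plain POSIX on a local file, not Lustre code.
 * The function name and parameters are hypothetical.
 *
 * Writes b bytes of 0xAA, truncates down to a, truncates back up to b,
 * then checks that every byte in [a, b) reads back as zero.
 * Returns 1 if the gap is all zeros, 0 if stale data is visible,
 * and -1 on an I/O error.
 */
int gap_is_zero_after_truncate(const char *path, off_t a, off_t b)
{
	unsigned char buf[256];
	off_t off;
	int fd, rc = 1;

	assert(a >= 0 && a <= b);

	fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);
	if (fd < 0)
		return -1;

	/* Fill [0, b) with a non-zero pattern. */
	memset(buf, 0xAA, sizeof(buf));
	for (off = 0; off < b; ) {
		size_t want = sizeof(buf);
		ssize_t n;

		if ((off_t)want > b - off)
			want = (size_t)(b - off);
		n = pwrite(fd, buf, want, off);
		if (n <= 0) {
			rc = -1;
			goto out;
		}
		off += n;
	}

	/* Truncate down to A, then back up to B. */
	if (ftruncate(fd, a) < 0 || ftruncate(fd, b) < 0) {
		rc = -1;
		goto out;
	}

	/* Every byte in [a, b) must now read as zero. */
	for (off = a; off < b && rc == 1; ) {
		size_t want = sizeof(buf);
		ssize_t n, i;

		if ((off_t)want > b - off)
			want = (size_t)(b - off);
		n = pread(fd, buf, want, off);
		if (n <= 0) {
			rc = -1;
			break;
		}
		for (i = 0; i < n; i++) {
			if (buf[i] != 0) {
				rc = 0;
				break;
			}
		}
		off += n;
	}
out:
	close(fd);
	unlink(path);
	return rc;
}
```

A size-only update on the MDT followed by a later restore from the archive would have to reproduce this behaviour, which is the difficulty described above: the archived copy still holds the old data in [A, B].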

I think it's okay to remove the truncate part in the test case.

Comment by John Hammond [ 14/Jun/13 ]

Well that would make an even stronger case for disallowing truncates on released files.

CEA colleagues, do you have opinions to share on this?

Comment by Johann Lombardi (Inactive) [ 15/Jun/13 ]

I don't think we ever intended to support such an optimization. In general, truncate should trigger a restore. The only case we want to "optimize" is truncate to 0 where we can just discard the HSM copy.

Comment by jacques-charles lafoucriere [ 16/Jun/13 ]

I confirm Johann's comment: truncate triggers a restore and blocks until the full restore completes. Later, with partial restore, we can optimize this.
Today the only optimization is truncate to 0 (no restore).

Comment by Jinshan Xiong (Inactive) [ 17/Jun/13 ]

It appears I misunderstood you guys. When you asked me to add the test case for truncate on a released file, I thought this was what you would do.

Comment by Jinshan Xiong (Inactive) [ 17/Jun/13 ]

To be safe, let's deny truncate on HSM released files.

John, can you please add this fix in your patch? The extra fix would be:

[jinxiong@intel mdc]$ git diff ../lov/lov_io.c 
diff --git a/lustre/lov/lov_io.c b/lustre/lov/lov_io.c
index 6f6ea84..bec9fea 100644
--- a/lustre/lov/lov_io.c
+++ b/lustre/lov/lov_io.c
@@ -984,12 +984,12 @@ int lov_io_init_released(const struct lu_env *env, struct cl_object *obj,
                LASSERTF(0, "invalid type %d\n", io->ci_type);
        case CIT_MISC:
        case CIT_FSYNC:
-       case CIT_SETATTR:
                result = +1;
                break;
        case CIT_READ:
        case CIT_WRITE:
        case CIT_FAULT:
+       case CIT_SETATTR:
                /* TODO: need to restore the file. */
                result = -EBADF;
                break;

Without this fix, there will be problems handling the size of a released file after a truncate.
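For readers who want to see the effect of the hunk in isolation, here is a self-contained sketch of the dispatch after the change. The enum is a local stand-in that reuses the case labels from the diff; it is not the Lustre header, and released_io_result() is a hypothetical name. The handling of unknown types (-EINVAL) is an assumption of this sketch, since the diff only shows an LASSERTF() at that point.

```c
#include <assert.h>
#include <errno.h>

/* Local stand-in for the io types named in the diff (not the Lustre header). */
enum cit_model {
	CIT_READ,
	CIT_WRITE,
	CIT_SETATTR,
	CIT_FAULT,
	CIT_FSYNC,
	CIT_MISC
};

/*
 * Hypothetical model of the switch in the diff after the patch:
 * CIT_SETATTR moves from the "+1" group to the "-EBADF" group.
 */
int released_io_result(enum cit_model type)
{
	switch (type) {
	case CIT_MISC:
	case CIT_FSYNC:
		return +1;
	case CIT_READ:
	case CIT_WRITE:
	case CIT_FAULT:
	case CIT_SETATTR:
		/* Same comment as in the diff: the file would need a restore. */
		return -EBADF;
	}
	/* Assumption of this sketch: unknown types are rejected. */
	return -EINVAL;
}

/* Usage example: setattr (truncate) on a released file is now refused. */
void released_io_selfcheck(void)
{
	assert(released_io_result(CIT_SETATTR) == -EBADF);
	assert(released_io_result(CIT_FSYNC) == +1);
}
```

With the pre-patch ordering, CIT_SETATTR would have returned +1 alongside CIT_MISC and CIT_FSYNC, which is the path that let truncate through on a released file.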

Comment by John Hammond [ 21/Jun/13 ]

Patch landed to master.

Comment by John Hammond [ 24/Jul/13 ]

Is there any plan to rehabilitate truncate to 0 (or O_TRUNC) for released files?

Comment by jacques-charles lafoucriere [ 24/Jul/13 ]

Today we have 2 cases:
1) a released file (just created with a release layout as in test 229)
2) a released file associated with an archive (after hsm_archive + hsm_release or an import)

truncate 0 or > 0 on 1) returns an error
truncate > 0 on 2) restores the archived file and truncates it
truncate 0 on 2) should work without a restore (still needs to be done, see LU-3454)

If we change truncate 0 to work on 1), we can also change truncate > 0 to work. This is only a philosophical question of "what is a released file without an HSM archive".
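The case analysis above can be written down as a small decision function. This is only a sketch of the behaviour as described in this comment, not code from the Lustre tree; all type and function names are hypothetical, and the truncate-to-0 outcome for archived files is the intended behaviour that the comment says is still to be done (LU-3454).

```c
#include <assert.h>

/* Hypothetical names, for illustration only. */
enum released_kind {
	RELEASED_NO_ARCHIVE,	/* case 1: release layout, no HSM archive */
	RELEASED_WITH_ARCHIVE	/* case 2: archived, then released or imported */
};

enum truncate_outcome {
	TRUNC_ERROR,			/* truncate is refused */
	TRUNC_RESTORE_THEN_TRUNCATE,	/* restore the archive, then truncate */
	TRUNC_NO_RESTORE		/* intended: no restore needed */
};

/* Maps (file kind, new size) to the outcome listed in the comment above. */
enum truncate_outcome
released_truncate_outcome(enum released_kind kind, unsigned long long new_size)
{
	if (kind == RELEASED_NO_ARCHIVE)
		return TRUNC_ERROR;
	if (new_size > 0)
		return TRUNC_RESTORE_THEN_TRUNCATE;
	return TRUNC_NO_RESTORE;
}

/* Usage example covering the three rows of the list. */
void released_truncate_selfcheck(void)
{
	assert(released_truncate_outcome(RELEASED_NO_ARCHIVE, 0) == TRUNC_ERROR);
	assert(released_truncate_outcome(RELEASED_WITH_ARCHIVE, 42) ==
	       TRUNC_RESTORE_THEN_TRUNCATE);
	assert(released_truncate_outcome(RELEASED_WITH_ARCHIVE, 0) ==
	       TRUNC_NO_RESTORE);
}
```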
