[LU-2980] sanity.sh test_17b: Read-only file system Created: 18/Mar/13  Updated: 20/May/13  Resolved: 20/May/13

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.4.0
Fix Version/s: Lustre 2.4.0

Type: Bug
Priority: Major
Reporter: Maloo
Assignee: Hongchao Zhang
Resolution: Fixed
Votes: 0
Labels: HB

Issue Links:
Duplicate
is duplicated by LU-3005 MDT attempted to access beyond the disk Resolved
is duplicated by LU-2974 Interop 2.1.4<->2.4 failure on test s... Closed
Related
is related to LU-3005 MDT attempted to access beyond the disk Resolved
Severity: 3
Rank (Obsolete): 7263

 Description   

This issue was created by maloo for Andreas Dilger <andreas.dilger@intel.com>

This issue relates to the following test suite run: https://maloo.whamcloud.com/test_sets/43161786-8f2c-11e2-92ff-52540035b04c.

The sub-test test_17b failed with the following error in the MDS dmesg log:

Lustre: DEBUG MARKER: == sanity test 17a: symlinks: create, remove (real) ==================== 16:19:07 (1363475947)
attempt to access beyond end of device
dm-0: rw=0, want=855117409799080, limit=4194304
attempt to access beyond end of device
dm-0: rw=0, want=855117409799080, limit=4194304
attempt to access beyond end of device
dm-0: rw=0, want=855117409799080, limit=4194304
attempt to access beyond end of device
dm-0: rw=0, want=855117409799080, limit=4194304
LDISKFS-fs error (device dm-0): ldiskfs_xattr_delete_inode: inode 524372: block 106889676224884 read error
Aborting journal on device dm-0-8.
LDISKFS-fs (dm-0): Remounting filesystem read-only
LDISKFS-fs error (device dm-0) in ldiskfs_free_inode: Journal has aborted
LustreError: 2764:0:(osd_handler.c:635:osd_trans_commit_cb()) transaction @0xffff88007cdd41c0 commit error: 2

It seems that test_17a is somehow corrupting the filesystem: the bad block number is the same (106889676224884) in the few MDS dmesg logs I looked at, and it appears to be ASCII text from the test_17a() run.

(gdb) p /x 106889676224884
$1 = 0x6137312e7974

Read as little-endian bytes, this is the ASCII string "ty.17a<NUL><NUL>", which might be a fragment of $tdir or similar: "sani[ty.17a]".
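
The gdb conversion above can be reproduced outside the debugger. The following is a minimal sketch, not part of the original report. It assumes the block number is held as a little-endian value and, for the comparison with the "want=" figure, assumes 4096-byte filesystem blocks and 512-byte sectors (neither size is stated in the log).

```python
# Decode the bogus block number from the LDISKFS error as raw bytes.
block = 106889676224884                       # from "block ... read error"

raw = block.to_bytes(8, "little")             # assumption: little-endian value
text = raw.rstrip(b"\x00").decode("ascii")    # drop the trailing NUL bytes

# Assumption: 4096-byte blocks and 512-byte sectors, i.e. 8 sectors per block.
SECTORS_PER_BLOCK = 4096 // 512
end_sector = (block + 1) * SECTORS_PER_BLOCK  # first sector past the requested block

print(hex(block), raw, text)
print(end_sector)
```

Under those assumptions end_sector comes out as 855117409799080, the same figure as "want=" in the dmesg output, which is consistent with both messages describing the same read.
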

Info required for matching: sanity 17b



 Comments   
Comment by Andreas Dilger [ 18/Mar/13 ]

Searching back over the past 4 weeks, this only started failing on 2013-03-12, so it is likely a new regression introduced by a patch that landed on that day or the day before:

https://maloo.whamcloud.com/sub_tests/8d829c9c-8bfb-11e2-abec-52540035b04c
https://maloo.whamcloud.com/sub_tests/dc5453f0-8bcf-11e2-aa89-52540035b04c
https://maloo.whamcloud.com/sub_tests/2eee8a56-8c5f-11e2-af77-52540035b04c
https://maloo.whamcloud.com/sub_tests/38d2778c-8cf3-11e2-af77-52540035b04c
https://maloo.whamcloud.com/sub_tests/77a75086-8cfd-11e2-af77-52540035b04c
https://maloo.whamcloud.com/sub_tests/100ab654-8dbe-11e2-abc6-52540035b04c
https://maloo.whamcloud.com/sub_tests/97489ff0-8dc8-11e2-bb99-52540035b04c

Comment by Di Wang [ 19/Mar/13 ]

sanity test_17a tries to create a symlink pointing to "/mnt/lustre/d0.sanity/d17/f.sanity.17a". Since that is only 38 bytes, it should be written to i_data, but according to Andreas's comment i_file_acl is somehow being overwritten. The interesting thing is that i_data is 60 bytes long, so I do not know how a 38-byte string could run past it.

struct ldiskfs_inode_info {
        __le32  i_data[15];     /* unconverted */
        __u32   i_dtime;
        ldiskfs_fsblk_t i_file_acl;      
.....
}
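
To make the layout question concrete, the sketch below models only the three quoted members with Python's ctypes and checks where a 38-byte string would have to start for its tail to land on i_file_acl. It assumes ldiskfs_fsblk_t is a 64-bit integer and that nothing precedes i_data. The rest of the structure is elided above, so this illustrates the arithmetic and is not a diagnosis.

```python
import ctypes

class InodeInfoPrefix(ctypes.Structure):
    """Only the members quoted above; the rest of the structure is omitted."""
    _fields_ = [
        ("i_data", ctypes.c_uint32 * 15),   # 60 bytes
        ("i_dtime", ctypes.c_uint32),
        ("i_file_acl", ctypes.c_uint64),    # assumption: ldiskfs_fsblk_t is 64-bit
    ]

target = b"/mnt/lustre/d0.sanity/d17/f.sanity.17a"
fragment = b"ty.17a"                        # bytes seen in the bad block number

acl_offset = InodeInfoPrefix.i_file_acl.offset
fragment_pos = target.rfind(fragment)       # where the fragment sits in the target
shift = acl_offset - fragment_pos           # start offset that would put it on i_file_acl

print(len(target), acl_offset, fragment_pos, shift)
```

With these assumptions the target fits in i_data with 22 bytes to spare when written at offset 0, and its tail would reach i_file_acl only if the string started 32 bytes further into the structure.
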
Comment by Andreas Dilger [ 19/Mar/13 ]

If this can be reproduced, it would be useful to dump the inode contents, either with debugfs, or "od -tx4" to see what else is being written into the i_data field to offset the symlink data.
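
As a rough guide to reading such a dump, the sketch below prints what an intact 60-byte i_data holding the test_17a symlink target would look like as 32-bit words. The buffer is hypothetical (the target padded with NULs), not data captured from a failing MDS, and the helper words_x4 only imitates the 4-byte grouping of "od -tx4" on a little-endian host, not its exact output format.

```python
import struct

def words_x4(data: bytes) -> list:
    """Format data as 32-bit little-endian hex words, in the spirit of `od -tx4`."""
    padded = data + b"\x00" * (-len(data) % 4)
    count = len(padded) // 4
    return [f"{w:08x}" for w in struct.unpack(f"<{count}I", padded)]

target = b"/mnt/lustre/d0.sanity/d17/f.sanity.17a"
i_data = target.ljust(60, b"\x00")          # hypothetical intact 60-byte i_data
words = words_x4(i_data)

# Print four words per row, prefixed with the byte offset of the row.
for i in range(0, len(words), 4):
    print(f"{i * 4:04d}", " ".join(words[i:i + 4]))
```

In this intact layout the words at byte offsets 32 and 36 hold the "ty.17a" fragment, so finding the same two words at a different offset in a real dump would show how far the symlink data had been displaced.
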

Comment by Andreas Dilger [ 19/Mar/13 ]

Maybe add a patch to test_17a() that checks whether "ls -l" works, and/or something like:

        # the +1 converts the 0-based MDT index into the 1-based number mdsdevname takes
        local mds_dev=$(mdsdevname $(($($LFS getstripe -M $DIR/$tdir/$tfile) + 1)))
        do_facet $SINGLEMDS debugfs -c -R "stat /ROOT/$tdir/$tfile" $mds_dev
Comment by Hongchao Zhang [ 21/Mar/13 ]

No related patch has landed on "ldiskfs", but there are two patches which could be related to the issue:
http://git.whamcloud.com/?p=fs/lustre-release.git;a=commitdiff;h=c8d5aa14e50be2a85491783f169a8f4e646b9594
http://git.whamcloud.com/?p=fs/lustre-release.git;a=commitdiff;h=d7ac66d2fddc3b2a6fb91b6421f9a15b80c8d10a

However, I could not find any place in them that looks related to the issue.

How about printing "i_data" and "i_dtime" alongside "i_file_acl" in ldiskfs_xattr_delete_inode (http://review.whamcloud.com/#change,5798)?

Comment by Hongchao Zhang [ 21/Mar/13 ]

I could not reproduce this issue by running "sanity.sh" (from subtest 0 to subtest 17) repeatedly for a long time.

Comment by Zhenyu Xu [ 22/Mar/13 ]

I think LU-3005 has the same issue.

Comment by James A Simmons [ 22/Mar/13 ]

Yes, it is a duplicate. I have also noticed that it is a very difficult bug to reproduce.

Comment by James A Simmons [ 22/Mar/13 ]

If I encounter this bug again, what data should I collect?

Comment by Peter Jones [ 25/Mar/13 ]

Hongchao

I see that you have created a debug patch - http://review.whamcloud.com/#change,5798 - is your intention to land this so if anyone hits this issue again then we have more info to go on?

Peter

Comment by Hongchao Zhang [ 27/Mar/13 ]

Currently, this bug only occurs during testing of patches under review for LU-1812, LU-1199 and LU-2473, all of which modify ldiskfs.
So this ticket could be related to the corresponding patches.

Comment by Hongchao Zhang [ 27/Mar/13 ]

By the way, when this issue occurs, most of the tests in sanity.sh fail, e.g.:
https://maloo.whamcloud.com/test_sets/355feee0-8cf3-11e2-af77-52540035b04c
https://maloo.whamcloud.com/test_sets/d5a0d9e8-8bcf-11e2-aa89-52540035b04c
https://maloo.whamcloud.com/test_sets/2bc5a18e-8c5f-11e2-af77-52540035b04c
https://maloo.whamcloud.com/test_sets/ff2eefc8-9455-11e2-93c6-52540035b04c
https://maloo.whamcloud.com/test_sets/0e9fde50-9419-11e2-89cc-52540035b04c
https://maloo.whamcloud.com/test_sets/f822c5c0-9431-11e2-8809-52540035b04c
https://maloo.whamcloud.com/test_sets/4c7555b2-9427-11e2-8809-52540035b04c
https://maloo.whamcloud.com/test_sets/b6637c0a-94c3-11e2-93c6-52540035b04c
https://maloo.whamcloud.com/test_sets/4309922a-9339-11e2-b06e-52540035b04c

Comment by James A Simmons [ 27/Mar/13 ]

It would be really nice if maloo reported which patch was being tested in its subtest logs, to avoid thinking that this bug was on the master branch. The failures started on March 12th, which is when I introduced the ldiskfs-config.h version of the patch. This makes sense if the source of the problem was the patch from LU-1199. The reason is that ldiskfs-config.h contains the configuration for all of the CONFIG_LDISKFS* settings for ldiskfs. These values need to be placed before the ldiskfs/ext4 headers, because those CONFIG_* settings have an impact on how the ext4/ldiskfs headers influence the code. In osd_io.c the configuration header was being placed after the ldiskfs.h headers, which was what was causing the breakage. The latest patch properly places ldiskfs-config.h before all ext4/ldiskfs-specific headers. If this is correct, then we should not see this problem going forward. Let's keep an eye out to see if this problem goes away.
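
The include-order problem described above can be hard to picture, so here is a small model of it in Python's ctypes. It is not the actual ldiskfs code. It assumes a hypothetical 8-byte member (opt_member) that exists only when a build option is visible to the header; two "compilation units" that disagree about the option then disagree about every later offset.

```python
import ctypes

def make_layout(option_enabled):
    """Build a struct whose first member exists only when a build option is set."""
    fields = []
    if option_enabled:
        fields.append(("opt_member", ctypes.c_uint64))  # hypothetical CONFIG-dependent member
    fields.append(("data", ctypes.c_char * 8))
    fields.append(("acl", ctypes.c_uint64))

    class Layout(ctypes.Structure):
        _fields_ = fields

    return Layout

WithOption = make_layout(True)      # unit that saw the option before the header
WithoutOption = make_layout(False)  # unit that included the header first

# Both layouts are views onto the same zero-filled memory.
buf = ctypes.create_string_buffer(ctypes.sizeof(WithOption))
WithOption.from_buffer(buf).data = b"ty.17a"     # written through one layout
misread = WithoutOption.from_buffer(buf).acl     # read back through the other

print(WithOption.data.offset, WithoutOption.acl.offset, misread)
```

On a little-endian host both offsets are 8, so the string written through one layout is read back through the other as the integer 106889676224884, the value in the LDISKFS error. Which unit plays which role in the real failure, and which members actually differ, is not established by this model.
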

Comment by Peter Jones [ 27/Mar/13 ]

Dropping priority given the latest information. Please close this ticket if no further work needs to be tracked here.

Comment by James A Simmons [ 09/Apr/13 ]

I assume we haven't seen this bug in some time. If that is the case, we can close this ticket and reopen it if the bug reappears for some reason.

Comment by James A Simmons [ 20/May/13 ]

Peter, can you close this ticket? Thanks.

Comment by Peter Jones [ 20/May/13 ]

ok thanks

Generated at Sat Feb 10 01:29:55 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.