[LU-6096] sanity test_17m: e2fsck Inode 32775, i_size is 0, should be 4096 Created: 08/Jan/15 Updated: 02/Nov/16 Resolved: 05/Aug/15 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.7.0 |
| Fix Version/s: | Lustre 2.8.0 |
| Type: | Bug | Priority: | Blocker |
| Reporter: | Maloo | Assignee: | Alex Zhuravlev |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | HB |
| Issue Links: |
|
| Severity: | 3 |
| Rank (Obsolete): | 16971 |
| Description |
|
This issue was created by maloo for Bob Glossman <bob.glossman@intel.com>

This issue relates to the following test suite run of review-ldiskfs: https://testing.hpdd.intel.com/test_sets/277e606e-976d-11e4-bafa-5254006e85c2. I note that several other recent similar failures have been marked as

The sub-test test_17m failed with the following error:

e2fsck -fnvd /dev/lvm-Role_MDS/P1
e2fsck 1.42.12.wc1 (15-Sep-2014)
shadow-26vm8: check_blocks:2814: increase inode 32775 badness 0 to 1
shadow-26vm8: check_blocks:2814: increase inode 32776 badness 0 to 1
Pass 1: Checking inodes, blocks, and sizes
Inode 32775, i_size is 0, should be 4096.  Fix? no
Inode 32776, i_size is 0, should be 4096.  Fix? no

Please provide additional information about the failure here.

Info required for matching: sanity 17m |
| Comments |
| Comment by Andreas Dilger [ 09/Jan/15 ] |
|
This is just a regular review-ldiskfs test run, not even review-dne, and it means that the MDT filesystem is corrupt at this point in the test, which isn't very far along. Bob, could you please run this test locally with RHEL 7 and find out which files the affected inodes belong to (32775 and 32776 in this case)? Something like the following will find the pathnames for those inodes, as well as dump their attributes:

debugfs -c /dev/lvm-Role_MDS/P1
debugfs> ncheck 32775 32776
debugfs> stat <32775>
debugfs> stat <32776>

Then we need to check why that subtest is causing problems, but this will at least narrow it down to a single subtest. |
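For repeated triage runs, the interactive session above can also be driven non-interactively with debugfs's `-R` option (one command per invocation). A minimal sketch, assuming the device path from this report (substitute the local MDT device) and guarding against nodes where it is absent:

```shell
# Hypothetical device path taken from this ticket; adjust for the local setup.
DEV=${DEV:-/dev/lvm-Role_MDS/P1}
if [ -b "$DEV" ]; then
    # -c opens the device read-only ("catastrophic" mode); -R runs one command.
    debugfs -c -R "ncheck 32775 32776" "$DEV"
    debugfs -c -R "stat <32775>" "$DEV"
    debugfs -c -R "stat <32776>" "$DEV"
    status=checked
else
    status=skipped    # MDT device not present on this node
fi
echo "debugfs triage: $status"
```

Run against the unmounted device only, since debugfs reads raw filesystem metadata.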
| Comment by Andreas Dilger [ 13/Jan/15 ] |
|
Bob, any chance to run this test on RHEL7 again? |
| Comment by Bob Glossman (Inactive) [ 13/Jan/15 ] |
|
Not yet; hoping to get to it today.
| Comment by Bob Glossman (Inactive) [ 13/Jan/15 ] |
|
It doesn't seem to need an el7 server to fail. I see the following in a local test of only sanity 17m, with an el7 client on el6 servers. The exact error looks different, but it still fails. I may have something wrong in my local environment.

== sanity test 17m: run e2fsck against MDT which contains short/long symlink ========================= 11:05:36 (1421175936)
create 512 short and long symlink files under /mnt/lustre/d17m.sanitym
erase them
Waiting for local destroys to complete
recreate the 512 symlink files with a shorter string
stop and checking mds1: e2fsck -fnvd /dev/sdb
Stopping /mnt/mds1 (opts:) on centos2
pdsh@centos7: centos2: ssh exited with exit code 1
centos2: e2fsck 1.42.12.wc1 (15-Sep-2014)
centos2: [QUOTA WARNING] Usage inconsistent for ID 0:actual (3624960, 1237) != expected (3899392, 2307)
centos2: [QUOTA WARNING] Usage inconsistent for ID 0:actual (3624960, 1237) != expected (3899392, 2306)
pdsh@centos7: centos2: ssh exited with exit code 4
Pass 1: Checking inodes, blocks, and sizes
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information
Update quota info for quota type 0? no
Update quota info for quota type 1? no
lustre-MDT0000: ********** WARNING: Filesystem still has errors **********
1246 inodes used (1.25%, out of 100000)
6 non-contiguous files (0.5%)
1 non-contiguous directory (0.1%)
# of inodes with ind/dind/tind blocks: 5/0/0
17580 blocks used (35.16%, out of 50000)
0 bad blocks
1 large file
98 regular files
115 directories
0 character device files
0 block device files
0 fifos
0 links
1024 symbolic links (512 fast symbolic links)
0 sockets
------------
1237 files
Starting mds1: /dev/sdb /mnt/mds1
Started lustre-MDT0000
sanity test_17m: @@@@@@ FAIL: e2fsck should not report error upon short/long symlink MDT: rc=4
Trace dump:
= /usr/lib64/lustre/tests/test-framework.sh:4677:error_noexit()
= /usr/lib64/lustre/tests/test-framework.sh:4708:error()
= /usr/lib64/lustre/tests/sanity.sh:611:test_17m()
= /usr/lib64/lustre/tests/test-framework.sh:4955:run_one()
= /usr/lib64/lustre/tests/test-framework.sh:4992:run_one_logged()
= /usr/lib64/lustre/tests/test-framework.sh:4809:run_test()
= /usr/lib64/lustre/tests/sanity.sh:614:main()
Dumping lctl log to /tmp/test_logs/2015-01-13/110504/sanity.test_17m.*.1421176025.log
FAIL 17m (94s)
resend_count is set to 4 4
resend_count is set to 4 4
resend_count is set to 4 4
resend_count is set to 4 4
resend_count is set to 4 4
== sanity test complete, duration 119 sec ============================================================ 11:07:17 (1421176037)
sanity: FAIL: test_17m e2fsck should not report error upon short/long symlink MDT: rc=4
debug=super ioctl neterror warning dlmtrace error emerg ha rpctrace vfstrace config console lfsck
sanity returned 0
Finished at Tue Jan 13 11:07:31 PST 2015 in 147s
/usr/lib64/lustre/tests/auster: completed with rc 0
|
| Comment by Bob Glossman (Inactive) [ 13/Jan/15 ] |
|
I see a similar error with an el6.6 client, so whatever problem I'm seeing isn't el7-client-only.
| Comment by Bob Glossman (Inactive) [ 13/Jan/15 ] |
|
All the rest of the test 17 subtests work fine; it looks like only 17m has problems.
| Comment by Bob Glossman (Inactive) [ 13/Jan/15 ] |
|
The errors reported above were, I think, due to a missing quota-devel rpm on all my local test nodes. With that rpm installed on client & servers I'm no longer seeing any error in sanity 17m. That's with el6 servers. Still trying to reproduce the original error in the ticket.
| Comment by Bob Glossman (Inactive) [ 13/Jan/15 ] |
|
I tried the simplest possible config, client & server on one node. The problem reproduces.

== sanity test 17m: run e2fsck against MDT which contains short/long symlink ========================= 14:21:42 (1421187702)
create 512 short and long symlink files under /mnt/lustre/d17m.sanitym
erase them
Waiting for local destroys to complete
recreate the 512 symlink files with a shorter string
stop and checking mds1: e2fsck -fnvd /dev/sdb
Stopping /mnt/mds1 (opts:) on centos7
e2fsck 1.42.12.wc1 (15-Sep-2014)
check_blocks:2814: increase inode 32777 badness 0 to 1
check_blocks:2814: increase inode 32779 badness 0 to 1
Pass 1: Checking inodes, blocks, and sizes
Inode 32777, i_size is 0, should be 4096.  Fix? no
Inode 32779, i_size is 0, should be 4096.  Fix? no
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information
lustre-MDT0000: ********** WARNING: Filesystem still has errors **********

debugfs output:

[root@centos7 bogl]# umount /dev/sdb
[root@centos7 bogl]# debugfs /dev/sdb
debugfs 1.42.12.wc1 (15-Sep-2014)
debugfs:  ncheck 32777 32779
Inode   Pathname
32779   //PENDING
32777   /ROOT/.lustre/fid
debugfs:  stat PENDING
Inode: 32779   Type: directory    Mode:  0755   Flags: 0x0
Generation: 512259630    Version: 0x00000000:00000000
User:     0   Group:     0   Size: 0
File ACL: 0    Directory ACL: 0
Links: 2   Blockcount: 8
Fragment:  Address: 0    Number: 0    Size: 0
 ctime: 0x54b599b4:7dbd4a3c -- Tue Jan 13 14:18:28 2015
 atime: 0x54b599b4:7dbd4a3c -- Tue Jan 13 14:18:28 2015
 mtime: 0x54b599b4:7dbd4a3c -- Tue Jan 13 14:18:28 2015
crtime: 0x54b599b4:7dbd4a3c -- Tue Jan 13 14:18:28 2015
Size of extra inode fields: 28
Extended attributes stored in inode body:
  lma = "00 00 00 00 00 00 00 00 03 00 00 00 02 00 00 00 0a 00 00 00 00 00 00 00 " (24)
  lma: fid=[0x200000003:0xa:0x0] compat=0 incompat=0
BLOCKS:
(0):16620
TOTAL: 1
debugfs:  stat /ROOT/.lustre/fid
Inode: 32777   Type: directory    Mode:  0100   Flags: 0x0
Generation: 512259628    Version: 0x00000000:00000000
User:     0   Group:     0   Size: 0
File ACL: 0    Directory ACL: 0
Links: 2   Blockcount: 8
Fragment:  Address: 0    Number: 0    Size: 0
 ctime: 0x54b599b4:7dbd4a3c -- Tue Jan 13 14:18:28 2015
 atime: 0x54b599b4:7dbd4a3c -- Tue Jan 13 14:18:28 2015
 mtime: 0x54b599b4:7dbd4a3c -- Tue Jan 13 14:18:28 2015
crtime: 0x54b599b4:7dbd4a3c -- Tue Jan 13 14:18:28 2015
Size of extra inode fields: 28
Extended attributes stored in inode body:
  lma = "00 00 00 00 00 00 00 00 02 00 00 00 02 00 00 00 02 00 00 00 00 00 00 00 " (24)
  lma: fid=[0x200000002:0x2:0x0] compat=0 incompat=0
BLOCKS:
(0):16618
TOTAL: 1 |
| Comment by Bob Glossman (Inactive) [ 13/Jan/15 ] |
|
Non-zero Blockcount with 0 size on those entries looks kind of suspicious, but what do I know.
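As a cross-check on that observation: debugfs reports Blockcount in 512-byte sectors, so the "Blockcount: 8" in the stat output above is exactly one 4 KiB block, matching the 4096 bytes that e2fsck says i_size should be. A trivial arithmetic sketch:

```shell
# debugfs's Blockcount is in 512-byte sectors; 8 sectors = one 4 KiB block.
blockcount=8
sector_size=512
expected_size=$((blockcount * sector_size))
echo "expected i_size: $expected_size"
```

So the block allocation for these directories is consistent; only the recorded i_size is wrong.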
| Comment by Andreas Dilger [ 14/Jan/15 ] |
|
The client version doesn't matter, only the server version, since clients don't interact with the backing disk filesystem directly. The affected files (PENDING, .lustre/fid) are internally created, so this is a bug in the osd-ldiskfs or MDD code.
| Comment by Alex Zhuravlev [ 23/Jan/15 ] |
|
Bob, do I understand correctly that you can reproduce the issue with el7 server only? I can't reproduce with el6 locally. |
| Comment by Bob Glossman (Inactive) [ 23/Jan/15 ] |
|
Alex, yes exactly. Appears 100% reproducible but only with el7 server. |
| Comment by Alex Zhuravlev [ 23/Jan/15 ] |
|
Then can I guess this was introduced by one of the patches adding el7 support?
| Comment by Bob Glossman (Inactive) [ 23/Jan/15 ] |
|
That's possible but not certain. The el7 ldiskfs patch is http://review.whamcloud.com/#/c/10249. It could also be some other non-obvious diff in a kernel-internal API, the vfs for example.
| Comment by Alex Zhuravlev [ 23/Jan/15 ] |
|
oh, yes, for sure.. we know, linux kernel api is super stable |
| Comment by James A Simmons [ 04/Feb/15 ] |
|
Actually, I have seen this error as well.
| Comment by Alex Zhuravlev [ 05/Feb/15 ] |
|
James, with el7 only? |
| Comment by James A Simmons [ 14/Apr/15 ] |
|
I got around to updating the SLES12 ldiskfs support to see if I can reproduce this issue. I saw it before, but now I can't seem to reproduce it with the latest ldiskfs patch.
| Comment by Peter Jones [ 15/Apr/15 ] |
|
Is it still happening on EL7? |
| Comment by Gerrit Updater [ 15/Apr/15 ] |
|
Bob Glossman (bob.glossman@intel.com) uploaded a new patch: http://review.whamcloud.com/14469

The quickest way to prove it's still a problem is to let autotest tell us. |
| Comment by Bob Glossman (Inactive) [ 15/Apr/15 ] |
|
Yes, it still fails. See https://testing.hpdd.intel.com/test_sets/32d9d04a-e38c-11e4-bb64-5254006e85c2
| Comment by Andreas Dilger [ 11/May/15 ] |
|
Alex, can you please take a look at this? |
| Comment by Alex Zhuravlev [ 12/May/15 ] |
|
I pushed a debug patch. |
| Comment by Bob Glossman (Inactive) [ 12/May/15 ] |
|
I think there may be a typo in that debug patch. The title in the commit header refers to 6069, not 6096.
| Comment by Alex Zhuravlev [ 12/May/15 ] |
|
OK, thanks. That's just to try to catch the cause, not to fix the issue.
| Comment by Andreas Dilger [ 20/May/15 ] |
|
Patch is http://review.whamcloud.com/14772 |
| Comment by Andreas Dilger [ 20/May/15 ] |
|
It looks like the requested sanity test_17m test was run, but none of the added LASSERTs/BUG were hit. Alex, what is the next step? |
| Comment by Alex Zhuravlev [ 20/May/15 ] |
|
Going to reproduce locally; it will take a bit.
| Comment by Alex Zhuravlev [ 25/May/15 ] |
|
The patch referenced above should address the issue.
| Comment by Bob Glossman (Inactive) [ 26/May/15 ] |
|
In local test on el7 servers the fix works. The problem which was 100% reproducible now doesn't appear even once. |
| Comment by Gerrit Updater [ 26/May/15 ] |
|
Andreas Dilger (andreas.dilger@intel.com) merged in patch http://review.whamcloud.com/14772/ |
| Comment by Andreas Dilger [ 26/May/15 ] |
|
Patch landed to master for 2.8.0. |
| Comment by Sarah Liu [ 29/Jun/15 ] |
|
Hit this error again on build #3071 EL7 DNE mode https://testing.hpdd.intel.com/test_sets/683aadc2-1311-11e5-8d21-5254006e85c2 |
| Comment by Andreas Dilger [ 03/Jul/15 ] |
|
It looks like this is still being hit. The symptom is slightly different - instead of only the PENDING and .lustre/fid directories being affected, it looks like a larger number of directories are having problems:

onyx-41vm7: e2fsck 1.42.12.wc1 (15-Sep-2014)
onyx-41vm7: check_blocks:2814: increase inode 139 badness 0 to 1
onyx-41vm7: check_blocks:2814: increase inode 140 badness 0 to 1
onyx-41vm7: check_blocks:2814: increase inode 142 badness 0 to 1
onyx-41vm7: check_blocks:2814: increase inode 143 badness 0 to 1
onyx-41vm7: check_blocks:2814: increase inode 145 badness 0 to 1
onyx-41vm7: check_blocks:2814: increase inode 146 badness 0 to 1
onyx-41vm7: check_blocks:2814: increase inode 148 badness 0 to 1
onyx-41vm7: check_blocks:2814: increase inode 524394 badness 0 to 1
onyx-41vm7: check_blocks:2814: increase inode 524397 badness 0 to 1
onyx-41vm7: check_blocks:2814: increase inode 524402 badness 0 to 1
onyx-41vm7: check_blocks:2814: increase inode 524405 badness 0 to 1
onyx-41vm7: check_blocks:2814: increase inode 524408 badness 0 to 1
onyx-41vm7: check_blocks:2814: increase inode 524411 badness 0 to 1
onyx-41vm7: check_blocks:2814: increase inode 524416 badness 0 to 1
onyx-41vm7: check_blocks:2814: increase inode 524421 badness 0 to 1
onyx-41vm7: check_blocks:2814: increase inode 524424 badness 0 to 1
onyx-41vm7: check_blocks:2814: increase inode 524429 badness 0 to 1
onyx-41vm7: check_blocks:2814: increase inode 524433 badness 0 to 1
Pass 1: Checking inodes, blocks, and sizes
Inode 139, i_size is 0, should be 4096.  Fix? no
Inode 140, i_size is 0, should be 4096.  Fix? no
Inode 142, i_size is 0, should be 4096.  Fix? no
Inode 143, i_size is 0, should be 4096.  Fix? no
Inode 145, i_size is 0, should be 4096.  Fix? no
Inode 146, i_size is 0, should be 4096.  Fix? no
Inode 148, i_size is 0, should be 4096.  Fix? no
Inode 524394, i_size is 0, should be 4096.  Fix? no
Inode 524397, i_size is 0, should be 4096.  Fix? no
Inode 524402, i_size is 0, should be 4096.  Fix? no
Inode 524405, i_size is 0, should be 4096.  Fix? no
Inode 524408, i_size is 0, should be 4096.  Fix? no
Inode 524411, i_size is 0, should be 4096.  Fix? no
Inode 524416, i_size is 0, should be 4096.  Fix? no
Inode 524421, i_size is 0, should be 4096.  Fix? no
Inode 524424, i_size is 0, should be 4096.  Fix? no
Inode 524429, i_size is 0, should be 4096.  Fix? no
Inode 524433, i_size is 0, should be 4096.  Fix? no

It still always seems related to EL7 though. |
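When triaging logs like this, the affected inode numbers can be pulled out mechanically for feeding back into debugfs ncheck. A small sketch, with two sample lines from the output above inlined for illustration:

```shell
# Sample e2fsck complaints, as they appear in the log above.
log='Inode 139, i_size is 0, should be 4096.  Fix? no
Inode 140, i_size is 0, should be 4096.  Fix? no'
# Keep only the inode number from each "i_size is 0" complaint.
inodes=$(printf '%s\n' "$log" | sed -n 's/^Inode \([0-9]*\), i_size is 0.*/\1/p')
echo "$inodes"
```

In practice the extracted list would be passed as arguments to a single `debugfs -c -R "ncheck ..."` invocation to map all the inodes to pathnames at once.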
| Comment by Andreas Dilger [ 07/Jul/15 ] |
|
Alex, any comment on this? Is it possible the filesystem is being unmounted without flushing those updates to disk?
| Comment by Alex Zhuravlev [ 07/Jul/15 ] |
|
I'm trying to reproduce this locally.. will take some time. |
| Comment by Gerrit Updater [ 13/Jul/15 ] |
|
Alex Zhuravlev (alexey.zhuravlev@intel.com) uploaded a new patch: http://review.whamcloud.com/15581 |
| Comment by James A Simmons [ 28/Jul/15 ] |
|
Bob is this also a problem for SLES12? |
| Comment by Bob Glossman (Inactive) [ 28/Jul/15 ] |
|
James, I have never seen it on sles12, only on el7. I can't promise it's not in sles12 too.
| Comment by Gerrit Updater [ 05/Aug/15 ] |
|
Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/15581/ |
| Comment by Peter Jones [ 05/Aug/15 ] |
|
Landed for 2.8. Resolving ticket. If this ever crops up on SLES12 we can always open a ticket for it then.