[LU-6096] sanity test_17m: e2fsck Inode 32775, i_size is 0, should be 4096 Created: 08/Jan/15  Updated: 02/Nov/16  Resolved: 05/Aug/15

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.7.0
Fix Version/s: Lustre 2.8.0

Type: Bug Priority: Blocker
Reporter: Maloo Assignee: Alex Zhuravlev
Resolution: Fixed Votes: 0
Labels: HB

Issue Links:
Duplicate
is duplicated by LU-7094 object UID not being set correctly fo... Resolved
Related
is related to LU-5022 support for 3.10 rhel7 linux kernel Resolved
Severity: 3
Rank (Obsolete): 16971

 Description   

This issue was created by maloo for Bob Glossman <bob.glossman@intel.com>

This issue relates to the following test suite run of review-ldiskfs: https://testing.hpdd.intel.com/test_sets/277e606e-976d-11e4-bafa-5254006e85c2.

I note that several other recent, similar failures have been marked as LU-3534.
As I'm unsure of the reasoning for that, and this one is seen on el7, I've raised it as a new issue.
Somebody more expert may decide it's a duplicate after looking it over.

The sub-test test_17m failed with the following error:

e2fsck -fnvd /dev/lvm-Role_MDS/P1
e2fsck 1.42.12.wc1 (15-Sep-2014)
shadow-26vm8: check_blocks:2814: increase inode 32775 badness 0 to 1
shadow-26vm8: check_blocks:2814: increase inode 32776 badness 0 to 1
Pass 1: Checking inodes, blocks, and sizes
Inode 32775, i_size is 0, should be 4096.  Fix? no

Inode 32776, i_size is 0, should be 4096.  Fix? no

Please provide additional information about the failure here.

Info required for matching: sanity 17m



 Comments   
Comment by Andreas Dilger [ 09/Jan/15 ]

This is just a regular review-ldiskfs test run, not even review-dne, and it means that the MDT filesystem is corrupt at this point in the test, which isn't very far along.

Bob, could you please run this test locally with RHEL 7 and find out which files the affected inodes belong to (32775 and 32776 in this case)? That can be done with something like the following to find the pathnames for those inodes, as well as dump their attributes:

debugfs -c /dev/lvm-Role_MDS/P1
debugfs> ncheck 32775 32776
debugfs> stat <32775>
debugfs> stat <32776>

Then we need to check why that subtest is causing problems, but this will at least narrow it down to a single subtest.
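
For reference, the same lookup can be scripted non-interactively (a minimal sketch, assuming the device path from the failure log above and that the MDT has been stopped first):

# Sketch: map the reported inode numbers to pathnames and dump their
# attributes in one shot; -R runs a single debugfs request per invocation.
DEV=/dev/lvm-Role_MDS/P1
debugfs -c -R 'ncheck 32775 32776' "$DEV"
debugfs -c -R 'stat <32775>' "$DEV"
debugfs -c -R 'stat <32776>' "$DEV"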

Comment by Andreas Dilger [ 13/Jan/15 ]

Bob, any chance to run this test on RHEL7 again?

Comment by Bob Glossman (Inactive) [ 13/Jan/15 ]

not yet, hoping to get to it today

Comment by Bob Glossman (Inactive) [ 13/Jan/15 ]

It doesn't seem to need an el7 server to fail. I see the following in a local test of only sanity 17m, with an el7 client on el6 servers. The exact error looks different, but it still fails. I may have something wrong in my local environment.

== sanity test 17m: run e2fsck against MDT which contains short/long symlink ========================= 11:05:36 (1421175936)
create 512 short and long symlink files under /mnt/lustre/d17m.sanitym
erase them
Waiting for local destroys to complete
recreate the 512 symlink files with a shorter string
stop and checking mds1: e2fsck -fnvd /dev/sdb
Stopping /mnt/mds1 (opts:) on centos2
pdsh@centos7: centos2: ssh exited with exit code 1
centos2: e2fsck 1.42.12.wc1 (15-Sep-2014)
centos2: [QUOTA WARNING] Usage inconsistent for ID 0:actual (3624960, 1237) != expected (3899392, 2307)
centos2: [QUOTA WARNING] Usage inconsistent for ID 0:actual (3624960, 1237) != expected (3899392, 2306)
pdsh@centos7: centos2: ssh exited with exit code 4
Pass 1: Checking inodes, blocks, and sizes
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information
Update quota info for quota type 0? no

Update quota info for quota type 1? no


lustre-MDT0000: ********** WARNING: Filesystem still has errors **********


        1246 inodes used (1.25%, out of 100000)
           6 non-contiguous files (0.5%)
           1 non-contiguous directory (0.1%)
             # of inodes with ind/dind/tind blocks: 5/0/0
       17580 blocks used (35.16%, out of 50000)
           0 bad blocks
           1 large file

          98 regular files
         115 directories
           0 character device files
           0 block device files
           0 fifos
           0 links
        1024 symbolic links (512 fast symbolic links)
           0 sockets
------------
        1237 files
Starting mds1:   /dev/sdb /mnt/mds1
Started lustre-MDT0000
 sanity test_17m: @@@@@@ FAIL: e2fsck should not report error upon  short/long symlink MDT: rc=4 
  Trace dump:
  = /usr/lib64/lustre/tests/test-framework.sh:4677:error_noexit()
  = /usr/lib64/lustre/tests/test-framework.sh:4708:error()
  = /usr/lib64/lustre/tests/sanity.sh:611:test_17m()
  = /usr/lib64/lustre/tests/test-framework.sh:4955:run_one()
  = /usr/lib64/lustre/tests/test-framework.sh:4992:run_one_logged()
  = /usr/lib64/lustre/tests/test-framework.sh:4809:run_test()
  = /usr/lib64/lustre/tests/sanity.sh:614:main()
Dumping lctl log to /tmp/test_logs/2015-01-13/110504/sanity.test_17m.*.1421176025.log
FAIL 17m (94s)
resend_count is set to 4 4
resend_count is set to 4 4
resend_count is set to 4 4
resend_count is set to 4 4
resend_count is set to 4 4
== sanity test complete, duration 119 sec ============================================================ 11:07:17 (1421176037)
sanity: FAIL: test_17m e2fsck should not report error upon  short/long symlink MDT: rc=4
debug=super ioctl neterror warning dlmtrace error emerg ha rpctrace vfstrace config console lfsck
sanity returned 0
Finished at Tue Jan 13 11:07:31 PST 2015 in 147s
/usr/lib64/lustre/tests/auster: completed with rc 0
Comment by Bob Glossman (Inactive) [ 13/Jan/15 ]

I see a similar error with an el6.6 client. Whatever problem I'm seeing isn't el7-only.

Comment by Bob Glossman (Inactive) [ 13/Jan/15 ]

All the rest of the test 17 subtests work fine. It looks like only 17m has problems.

Comment by Bob Glossman (Inactive) [ 13/Jan/15 ]

I think the errors reported above were due to a missing quota-devel rpm on all my local test nodes. With that rpm installed on client & servers I'm no longer seeing any error in sanity 17m. That's with el6 servers. Still trying to reproduce the original error in the ticket.
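
As an aside, a quick way to confirm that on each node is a one-liner like the following (a sketch, assuming yum-based test nodes):

# Report whether the quota-devel rpm is installed, and install it if missing.
rpm -q quota-devel || yum install -y quota-devel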

Comment by Bob Glossman (Inactive) [ 13/Jan/15 ]

I tried the simplest possible config, client & server on one node. The problem reproduces.

== sanity test 17m: run e2fsck against MDT which contains short/long symlink ========================= 14:21:42 (1421187702)
create 512 short and long symlink files under /mnt/lustre/d17m.sanitym
erase them
Waiting for local destroys to complete
recreate the 512 symlink files with a shorter string
stop and checking mds1: e2fsck -fnvd /dev/sdb
Stopping /mnt/mds1 (opts:) on centos7
e2fsck 1.42.12.wc1 (15-Sep-2014)
check_blocks:2814: increase inode 32777 badness 0 to 1
check_blocks:2814: increase inode 32779 badness 0 to 1
Pass 1: Checking inodes, blocks, and sizes
Inode 32777, i_size is 0, should be 4096.  Fix? no

Inode 32779, i_size is 0, should be 4096.  Fix? no

Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information

lustre-MDT0000: ********** WARNING: Filesystem still has errors **********

debugfs output

[root@centos7 bogl]# umount /dev/sdb
[root@centos7 bogl]# debugfs /dev/sdb
debugfs 1.42.12.wc1 (15-Sep-2014)
debugfs:  ncheck 32777 32779
Inode	Pathname
32779	//PENDING
32777	/ROOT/.lustre/fid
debugfs:  stat PENDING
Inode: 32779   Type: directory    Mode:  0755   Flags: 0x0
Generation: 512259630    Version: 0x00000000:00000000
User:     0   Group:     0   Size: 0
File ACL: 0    Directory ACL: 0
Links: 2   Blockcount: 8
Fragment:  Address: 0    Number: 0    Size: 0
 ctime: 0x54b599b4:7dbd4a3c -- Tue Jan 13 14:18:28 2015
 atime: 0x54b599b4:7dbd4a3c -- Tue Jan 13 14:18:28 2015
 mtime: 0x54b599b4:7dbd4a3c -- Tue Jan 13 14:18:28 2015
crtime: 0x54b599b4:7dbd4a3c -- Tue Jan 13 14:18:28 2015
Size of extra inode fields: 28
Extended attributes stored in inode body: 
  lma = "00 00 00 00 00 00 00 00 03 00 00 00 02 00 00 00 0a 00 00 00 00 00 00 00
 " (24)
  lma: fid=[0x200000003:0xa:0x0] compat=0 incompat=0
BLOCKS:
(0):16620
TOTAL: 1

debugfs:  stat /ROOT/.lustre/fid
Inode: 32777   Type: directory    Mode:  0100   Flags: 0x0
Generation: 512259628    Version: 0x00000000:00000000
User:     0   Group:     0   Size: 0
File ACL: 0    Directory ACL: 0
Links: 2   Blockcount: 8
Fragment:  Address: 0    Number: 0    Size: 0
 ctime: 0x54b599b4:7dbd4a3c -- Tue Jan 13 14:18:28 2015
 atime: 0x54b599b4:7dbd4a3c -- Tue Jan 13 14:18:28 2015
 mtime: 0x54b599b4:7dbd4a3c -- Tue Jan 13 14:18:28 2015
crtime: 0x54b599b4:7dbd4a3c -- Tue Jan 13 14:18:28 2015
Size of extra inode fields: 28
Extended attributes stored in inode body: 
  lma = "00 00 00 00 00 00 00 00 02 00 00 00 02 00 00 00 02 00 00 00 00 00 00 00
 " (24)
  lma: fid=[0x200000002:0x2:0x0] compat=0 incompat=0
BLOCKS:
(0):16618
TOTAL: 1
Comment by Bob Glossman (Inactive) [ 13/Jan/15 ]

The non-zero Blockcount with a size of 0 on those entries looks kind of suspicious, but what do I know.
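
That mismatch can be pulled out directly (a minimal sketch, reusing the single-node device path from the debugfs session above):

# Show the stat lines containing the size and block-count fields for the two
# internal directories; a directory with allocated blocks but i_size 0 is
# exactly what e2fsck is flagging.
debugfs -c -R 'stat PENDING' /dev/sdb | grep -E 'Size:|Blockcount'
debugfs -c -R 'stat /ROOT/.lustre/fid' /dev/sdb | grep -E 'Size:|Blockcount'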

Comment by Andreas Dilger [ 14/Jan/15 ]

The client version doesn't matter, only the server version, since clients don't interact with the on-disk filesystem directly.

The affected files (PENDING, .lustre/fid) are internally created, so this is a bug in the osd-ldiskfs or MDD code.

Comment by Alex Zhuravlev [ 23/Jan/15 ]

Bob, do I understand correctly that you can reproduce the issue with el7 server only? I can't reproduce with el6 locally.

Comment by Bob Glossman (Inactive) [ 23/Jan/15 ]

Alex, yes, exactly. It appears 100% reproducible, but only with an el7 server.
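
For anyone else trying to hit it, a minimal sketch of running just this subtest with the in-tree test framework (using the paths shown in the trace above; a configured test cluster is assumed):

# Restrict sanity.sh to subtest 17m; the ONLY variable limits which subtests run.
cd /usr/lib64/lustre/tests
ONLY=17m sh sanity.sh
# or, via the auster wrapper seen in the earlier log:
# ./auster -v sanity --only 17m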

Comment by Alex Zhuravlev [ 23/Jan/15 ]

Then can I guess this was introduced by one of the patches adding el7 support?

Comment by Bob Glossman (Inactive) [ 23/Jan/15 ]

That's possible but not certain. The el7 ldiskfs patch is http://review.whamcloud.com/#/c/10249

It could also be some other non-obvious difference in the kernel's internal VFS API, for example,
or even an e2fsck flaw specific to el7.

Comment by Alex Zhuravlev [ 23/Jan/15 ]

oh, yes, for sure.. we know, linux kernel api is super stable

Comment by James A Simmons [ 04/Feb/15 ]

Actually, I have seen this error as well.

Comment by Alex Zhuravlev [ 05/Feb/15 ]

James, with el7 only?

Comment by James A Simmons [ 14/Apr/15 ]

I got around to updating the SLES12 ldiskfs support to see if I could reproduce this issue. I saw this issue before, but now I can't seem to reproduce it with the latest ldiskfs patch.

Comment by Peter Jones [ 15/Apr/15 ]

Is it still happening on EL7?

Comment by Gerrit Updater [ 15/Apr/15 ]

Bob Glossman (bob.glossman@intel.com) uploaded a new patch: http://review.whamcloud.com/14469
Subject: LU-6096 test: force test of known failure
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 747eeed7e4eec9ea5c900886d97709b7d9ef9e0b

The quickest way to prove it's still a problem is to let autotest tell us.

Comment by Bob Glossman (Inactive) [ 15/Apr/15 ]

Yes, it still fails. See https://testing.hpdd.intel.com/test_sets/32d9d04a-e38c-11e4-bb64-5254006e85c2

Comment by Andreas Dilger [ 11/May/15 ]

Alex, can you please take a look at this?

Comment by Alex Zhuravlev [ 12/May/15 ]

I pushed a debug patch.

Comment by Bob Glossman (Inactive) [ 12/May/15 ]

I think there may be a typo in that debug patch. The title in the commit header refers to 6069, not 6096.

Comment by Alex Zhuravlev [ 12/May/15 ]

OK, thanks. That's just to try to catch the cause, not to fix the issue.

Comment by Andreas Dilger [ 20/May/15 ]

Patch is http://review.whamcloud.com/14772

Comment by Andreas Dilger [ 20/May/15 ]

It looks like the requested sanity test_17m test was run, but none of the added LASSERTs/BUG were hit.

Alex, what is the next step?

Comment by Alex Zhuravlev [ 20/May/15 ]

Going to reproduce locally; it will take a bit.

Comment by Alex Zhuravlev [ 25/May/15 ]

The patch referenced above should address the issue.

Comment by Bob Glossman (Inactive) [ 26/May/15 ]

In local testing on el7 servers the fix works. The problem, which was 100% reproducible, now doesn't appear even once.
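
The check itself is the same one the test performs; a sketch of verifying it by hand on the single-node setup used above:

# With the MDT stopped, a forced read-only e2fsck should now exit 0 and
# report no "i_size is 0" errors for PENDING or /ROOT/.lustre/fid.
umount /mnt/mds1
e2fsck -fnv /dev/sdb; echo "e2fsck rc=$?"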

Comment by Gerrit Updater [ 26/May/15 ]

Andreas Dilger (andreas.dilger@intel.com) merged in patch http://review.whamcloud.com/14772/
Subject: LU-6096 osd: mark directory's inode dirty
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 52437f5218fc911cc0b6b9f36e8a143bf54f1024

Comment by Andreas Dilger [ 26/May/15 ]

Patch landed to master for 2.8.0.

Comment by Sarah Liu [ 29/Jun/15 ]

Hit this error again on build #3071 in EL7 DNE mode:

https://testing.hpdd.intel.com/test_sets/683aadc2-1311-11e5-8d21-5254006e85c2

Comment by Andreas Dilger [ 03/Jul/15 ]

It looks like this is still being hit. The symptom is slightly different - instead of only the PENDING and .lustre/fid directories being affected, it looks like a larger number of directories are having problems:

onyx-41vm7: e2fsck 1.42.12.wc1 (15-Sep-2014)
onyx-41vm7: check_blocks:2814: increase inode 139 badness 0 to 1
onyx-41vm7: check_blocks:2814: increase inode 140 badness 0 to 1
onyx-41vm7: check_blocks:2814: increase inode 142 badness 0 to 1
onyx-41vm7: check_blocks:2814: increase inode 143 badness 0 to 1
onyx-41vm7: check_blocks:2814: increase inode 145 badness 0 to 1
onyx-41vm7: check_blocks:2814: increase inode 146 badness 0 to 1
onyx-41vm7: check_blocks:2814: increase inode 148 badness 0 to 1
onyx-41vm7: check_blocks:2814: increase inode 524394 badness 0 to 1
onyx-41vm7: check_blocks:2814: increase inode 524397 badness 0 to 1
onyx-41vm7: check_blocks:2814: increase inode 524402 badness 0 to 1
onyx-41vm7: check_blocks:2814: increase inode 524405 badness 0 to 1
onyx-41vm7: check_blocks:2814: increase inode 524408 badness 0 to 1
onyx-41vm7: check_blocks:2814: increase inode 524411 badness 0 to 1
onyx-41vm7: check_blocks:2814: increase inode 524416 badness 0 to 1
onyx-41vm7: check_blocks:2814: increase inode 524421 badness 0 to 1
onyx-41vm7: check_blocks:2814: increase inode 524424 badness 0 to 1
onyx-41vm7: check_blocks:2814: increase inode 524429 badness 0 to 1
onyx-41vm7: check_blocks:2814: increase inode 524433 badness 0 to 1
Pass 1: Checking inodes, blocks, and sizes
Inode 139, i_size is 0, should be 4096.  Fix? no
Inode 140, i_size is 0, should be 4096.  Fix? no
Inode 142, i_size is 0, should be 4096.  Fix? no
Inode 143, i_size is 0, should be 4096.  Fix? no
Inode 145, i_size is 0, should be 4096.  Fix? no
Inode 146, i_size is 0, should be 4096.  Fix? no
Inode 148, i_size is 0, should be 4096.  Fix? no
Inode 524394, i_size is 0, should be 4096.  Fix? no
Inode 524397, i_size is 0, should be 4096.  Fix? no
Inode 524402, i_size is 0, should be 4096.  Fix? no
Inode 524405, i_size is 0, should be 4096.  Fix? no
Inode 524408, i_size is 0, should be 4096.  Fix? no
Inode 524411, i_size is 0, should be 4096.  Fix? no
Inode 524416, i_size is 0, should be 4096.  Fix? no
Inode 524421, i_size is 0, should be 4096.  Fix? no
Inode 524424, i_size is 0, should be 4096.  Fix? no
Inode 524429, i_size is 0, should be 4096.  Fix? no
Inode 524433, i_size is 0, should be 4096.  Fix? no

It still always seems related to EL7 though.

Comment by Andreas Dilger [ 07/Jul/15 ]

Alex, any comment on this? Is it possible the filesystem is being unmounted without flushing those updates to disk?

Comment by Alex Zhuravlev [ 07/Jul/15 ]

I'm trying to reproduce this locally; it will take some time.

Comment by Gerrit Updater [ 13/Jul/15 ]

Alex Zhuravlev (alexey.zhuravlev@intel.com) uploaded a new patch: http://review.whamcloud.com/15581
Subject: LU-6096 osd: mark dir's inode dirty
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: d62ecb63fb898caa8c83f99ea325d5ad3fb96e00

Comment by James A Simmons [ 28/Jul/15 ]

Bob, is this also a problem for SLES12?

Comment by Bob Glossman (Inactive) [ 28/Jul/15 ]

James, I have never seen it on sles12, only on el7. I can't promise it's not in sles12 too.

Comment by Gerrit Updater [ 05/Aug/15 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/15581/
Subject: LU-6096 ldiskfs: mark dir's inode dirty
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: bad49e39e301d4367eaead5ee566f5dcacfde8f6

Comment by Peter Jones [ 05/Aug/15 ]

Landed for 2.8. Resolving ticket. If this ever crops up on SLES12, we can always open a ticket for it then.
