[LU-1540] e2fsck remove too many symlinks Created: 19/Jun/12  Updated: 03/Aug/22  Resolved: 24/Aug/12

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.2.0, Lustre 2.3.0, Lustre 2.1.1, Lustre 2.1.2, Lustre 2.1.3
Fix Version/s: Lustre 2.3.0, Lustre 2.4.0, Lustre 2.1.3

Type: Bug Priority: Major
Reporter: Alexandre Louvet Assignee: Zhenyu Xu
Resolution: Fixed Votes: 0
Labels: None

Attachments: File gen.sh     File log.txt.gz     Text File src.txt    
Issue Links:
Related
is related to LU-1366 getting "dirdata length set incorrect... Closed
is related to LU-1774 fsck -fD corrupts filesystem Resolved
is related to LU-16060 e2fsck reported "symlink missing NUL ... Resolved
Severity: 3
Rank (Obsolete): 4418

 Description   

Each time we run fsck on a MDS, we lose thousand of symlinks (which is a kind of corruption).
To figure out a little what was wrong with those inodes, I did a small instrumentation of e2fsck_pass1_check_symlink and started to print various information about the inode (modification are attached).

Here is what I got :

[root@gaia14 ~]# ./e2fsck -n -f -v -d /dev/mapper/vg_mdt_work1-mdt_work1 | grep -B2 -A1 "at pass1.c:246"
e2fsck 1.41.90.wc4 (01-Sep-2011)
blocks != 0 && fs->blocksize = 4096, buf = %/home/cont001/segura/BIN/ELSA/CHAINE_V2/MODULES_PYTHON/GENERIQUES/Block.py_oldaux.py%
len = 84, inode->i_size = 78
at pass1.c:246 : offending inode 16819164 found !
e2fsck_pass1:1416: increase inode 16819164 badness 0 to 1

note the content of buf (result of io_channel_read_blk64), and specially the latest characters 'aux.py' (% was in the printf format).

Looking at this inode using debugfs return

debugfs: stat <16819164>
Inode: 16819164 Type: symlink Mode: 0777 Flag
Generation: 1029089099 Version: 0x0000004e:1a78c70
User: 11876 Group: 1850 Size: 78
File ACL: 0 Directory ACL: 0
Links: 1 Blockcount: 8
Fragment: Address: 0 Number: 0 Size: 0
 ctime: 0x4fd1a998:00000000 -- Fri Jun 8 09:28:24 20
 atime: 0x4fd1a998:00000000 -- Fri Jun 8 09:28:24 20
 mtime: 0x4fd1a998:00000000 -- Fri Jun 8 09:28:24 20
crtime: 0x4fd1a998:e6373924 -- Fri Jun 8 09:28:24 20
Size of extra inode fields: 28
Extended attributes stored in inode body:
  lma = "00 00 00 00 00 00 00 00 56 fa 1b 0d 02 00 00
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  link = "df f1 ea 11 01 00 00 00 36 00 00 00 00 00 0
 5c 00 00 00 00 42 6c 6f 63 6b 2e 70 79 5f 6f 6c 64 "
BLOCKS:
(0):17832933
TOTAL: 1

debugfs: cat <16819164>
/home/cont001/segura/BIN/ELSA/CHAINE_V2/MODULES_PYTHON/GENERIQUES/Block.py_old

The string is the same except that it doesn't display the 'aux.py', and the targeted file is what was expected.

Accessing the filesystem through mount doesn't display any issue. Accessing the symlink works and the content is valid (the expected link has a length of inode->i_size).
The buffer retrieved in e2fsck looks wrong and the end of the string (starting inode->i_size) always contain garbage.

So who is wrong ? Is it correct for e2fsprog to guess that a symlink is always terminated by a '\0', giving strnlen a chance to return the right length ? Or is ldiskfs wrong in not enforcing the '\0' at inode->i_size position ? ... or something else ...



 Comments   
Comment by Peter Jones [ 19/Jun/12 ]

Bobijam

Could you please look into this one?

Thanks

Peter

Comment by Andreas Dilger [ 20/Jun/12 ]

I believe this symlink problem is caused by the MDS being formatted with the "extents" option, which is not the default (see discussion in LU-1366 for a similar issue).

The removal of symlinks should be fixed in the recently-released e2fsprogs-1.42.3.wc1 version of e2fsck. Please see http://downloads.whamlcoud.com/public/e2fsprogs/latest/ for packages.

Comment by Alexandre Louvet [ 20/Jun/12 ]

Unfortunately not, it isn't (unless it's not the good way to check).

debugfs: params
Open mode: read-only
Filesystem in use: /dev/mapper/vg_mdt_store1-mdt_store1
debugfs: features
Filesystem features: has_journal ext_attr resize_inode dir_index filetype
needs_recovery mmp flex_bg dirdata sparse_super large_file uninit_bg

I'll double check, but we also give a try to e2fsprogs-1.42.3.wc1 yesterday and it didn't change anything.

Alex.

Comment by Alexandre Louvet [ 20/Jun/12 ]

Checked that e2fsprogs-1.42.3.wc1 doesn't change the unexpected behaviour.

Alex.

Comment by Andreas Dilger [ 20/Jun/12 ]

I wonder if the Lustre osd-ldiskfs code is not NUL terminating the long symlinks? I'm not sure if that is required, or if the fault is in e2fsck for assuming that they will be NUL terminated?

Comment by Andreas Dilger [ 20/Jun/12 ]

In either case, it seems we will also need to make a patch for e2fsck so that it doesn't consider such symlinks as corrupt, but instead fixes them up, since this will be the case for all 2.x Lustre MDTs.

Comment by Zhenyu Xu [ 21/Jun/12 ]

I tried b1_2 branch, lustre doesn't create size-mismatched long symbolic link file.

create a long symlink

# ls -l /mnt/lustre
lrwxrwxrwx 1 root root 84 Jun 21 16:13 long-symlink -> /home/cont001/segura/BIN/ELSA/CHAINE_V2/MODULES_PYTHON/GENERIQUES/Block.py_oldaux.py
its size is 84 == length of the string

umount MDT, and check the long symlink file.

e2fsck MDT device

# e2fsck -fnvd /dev/sdb
e2fsck 1.41.90.wc4 (01-Sep-2011)
Pass 1: Checking inodes, blocks, and sizes
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information

39 inodes used (0.04%)
3 non-contiguous files (7.7%)
0 non-contiguous directories (0.0%)

  1. of inodes with ind/dind/tind blocks: 0/0/0
    16734 blocks used (33.47%)
    0 bad blocks
    1 large file

21 regular files
8 directories
0 character device files
0 block device files
0 fifos
0 links
1 symbolic link (0 fast symbolic links) ===> this long symlink file
0 sockets
--------
30 files

stat the file

# debugfs -R "stat /ROOT/long-symlink" /dev/sdb
debugfs 1.41.90.wc4 (01-Sep-2011)
Inode: 34 Type: symlink Mode: 0777 Flags: 0x0
Generation: 759704194 Version: 0x00000001:00000002
User: 0 Group: 0 Size: 84 =====> size matched
File ACL: 0 Directory ACL: 0
Links: 1 Blockcount: 8
Fragment: Address: 0 Number: 0 Size: 0
ctime: 0x4fe2d7c2:00000000 – Thu Jun 21 16:13:54 2012
atime: 0x4fe2d7c2:00000000 – Thu Jun 21 16:13:54 2012
mtime: 0x4fe2d7c2:00000000 – Thu Jun 21 16:13:54 2012
crtime: 0x4fe2d7c2:b3456314 – Thu Jun 21 16:13:54 2012
Size of extra inode fields: 28
Extended attributes stored in inode body:
lma = "00 00 00 00 00 00 00 00 00 04 00 00 02 00 00 00 01 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 0
0 00 00 00 00 00 00 00 00 00 00 00 00 00 " (64)
link = "df f1 ea 11 01 00 00 00 36 00 00 00 00 00 00 00 00 00 00 00 00 00 00 0
0 00 1e 00 00 00 00 00 00 61 ab 2d 48 2a 75 00 00 00 00 6c 6f 6e 67 2d 73 79 6d
6c 69 6e 6b " (54)
BLOCKS:
(0):57
TOTAL: 1

cat it

# debugfs -R "cat <34>" /dev/sdb
debugfs 1.41.90.wc4 (01-Sep-2011)
/home/cont001/segura/BIN/ELSA/CHAINE_V2/MODULES_PYTHON/GENERIQUES/Block.py_oldaux.py

Comment by Alexandre Louvet [ 21/Jun/12 ]

This is not that simple as all symlinks are not in this case. We have a huge number of symlinks that remain untouched during fsck. Only a subset are affected.

Comment by Alexandre Louvet [ 21/Jun/12 ]

Ok, so I'm now able to provide a (more or less) simple reproducer. It was run on a cluster running lustre 2.1.2 on a freshly created filesystem (mounted in /scratch on the client).
On the client, after the new filesystem has been mounted, simply run 'gen.sh'. When the script did complete, umount the filesystem and run fsck on the MDT.

Attached gen.sh (the reproducer) & log.txt (output of e2fsck -v -n /dev/sdb2 in my case).

Comment by Zhenyu Xu [ 21/Jun/12 ]

found the lustre code corresponding to the issue, osd_write does not write the end NULL symlink string to its data block area.

osd_write()
        /* Write small symlink to inode body as we need to maintain correct
         * on-disk symlinks for ldiskfs.
         */
        if(S_ISLNK(obj->oo_dt.do_lu.lo_header->loh_attr) &&
           (buf->lb_len < sizeof (LDISKFS_I(inode)->i_data)))
                result = osd_ldiskfs_writelink(inode, buf->lb_buf, buf->lb_len);
        else
                result = osd_ldiskfs_write_record(inode, buf->lb_buf,
                                                  buf->lb_len, pos,
                                                  oh->ot_handle);

I did a hack to verify it, with following patch, fsck passed the MDT device filled with gen.sh reproducer.

diff --git a/lustre/osd-ldiskfs/osd_handler.c b/lustre/osd-ldiskfs/osd_handler.c
index 9a052be..457090c 100644
--- a/lustre/osd-ldiskfs/osd_handler.c
+++ b/lustre/osd-ldiskfs/osd_handler.c
@@ -2555,6 +2555,8 @@ static int osd_ldiskfs_write_record(struct inode *inode, v
         int boffs;
         int dirty_inode = 0;

+        *((char *)buf + bufsize) = '\0';
+        ++bufsize;
         while (bufsize > 0) {
                 if (bh != NULL)
                         brelse(bh);
@@ -2593,6 +2595,7 @@ static int osd_ldiskfs_write_record(struct inode *inode, v

         /* correct in-core and on-disk sizes */
         if (new_size > i_size_read(inode)) {
+                --new_size;
                 spin_lock(&inode->i_lock);
                 if (new_size > i_size_read(inode))
                         i_size_write(inode, new_size);
Comment by Zhenyu Xu [ 21/Jun/12 ]

I think we need change 2 parts:

1. include the '\0' in mdd_create() when writing symlink data (passing buflen + 1), thus inode->i_size also include this NULL string terminator.
2. in e2fsck_pass1_check_symlink(), change the long symlink check to make sure the string length of its data block is one less than the inode's i_size.

Comment by Andreas Dilger [ 21/Jun/12 ]

Bobijam,
it appears that the NUL terminator for the symlink is at buf[inode->i_size], so for your example i_size = 84, which is also the number of characters in the string. That means e2fsck shouldn't check that the strlen(symlink) is (i_size - 1), but rather that it should add a NUL terminator at buf[i_size] (assuming this fits within the blocksize).

Comment by Zhenyu Xu [ 21/Jun/12 ]
mdd_create
        if (S_ISLNK(attr->la_mode)) {
                struct md_ucred  *uc = md_ucred(env);
                struct dt_object *dt = mdd_object_child(son);
                const char *target_name = spec->u.sp_symname;
                int sym_len = strlen(target_name);
                const struct lu_buf *buf;
                loff_t pos = 0;

                buf = mdd_buf_get_const(env, target_name, sym_len);
                rc = dt->do_body_ops->dbo_write(env, dt, buf, &pos, handle,
                                                mdd_object_capa(env, son),
                                                uc->mu_cap &
                                                CFS_CAP_SYS_RESOURCE_MASK);

                if (rc == sym_len)
                        rc = 0;
                else
                        GOTO(cleanup, rc = -EFAULT);
        }

spec->u.sp_symname contains the NUL terminator, while sym_len does not count it, so osd_write does not write the NUL terminator to the inode's data block and also set i_size to sym_len. So the string length e2fsck_pass1_check_symlink() reads from the inode's data block cannot gurantee to match with the inode's i_size.

Comment by Zhenyu Xu [ 22/Jun/12 ]

As the original case shows:

debugfs: cat <16819164>
/home/cont001/segura/BIN/ELSA/CHAINE_V2/MODULES_PYTHON/GENERIQUES/Block.py_old

this is what the symlink's target string, and e2fsck read more than that

[root@gaia14 ~]# ./e2fsck -n -f -v -d /dev/mapper/vg_mdt_work1-mdt_work1 | grep -B2 -A1 "at pass1.c:246"
e2fsck 1.41.90.wc4 (01-Sep-2011)
blocks != 0 && fs->blocksize = 4096, buf = %/home/cont001/segura/BIN/ELSA/CHAINE_V2/MODULES_PYTHON/GENERIQUES/Block.py_oldaux.py%
len = 84, inode->i_size = 78

Comment by Zhenyu Xu [ 22/Jun/12 ]

http://review.whamcloud.com/3171

LU-1540 e2fsprogs: symlink check handle string length properly

The symbolic link target string is written into inode data block
without NUL terminator, its length is decided by the inode's size,
so this fix makes symlink check take heed of it.

Comment by Andreas Dilger [ 22/Jun/12 ]

There must be some way to write a block with NULs beyond i_size. When writing a normal data file, it has to fill in zeroes after i_size fit the rest of the block to ensure that sparse writes work properly. Why can't we do the same for symlinks? One bad way I can think to do this are to write the full block, then truncate to i_size, but I suspect there is a better way to do this.

Alex, could you please comment?

Comment by Andreas Dilger [ 27/Jun/12 ]

The patch to e2fsck in http://review.whamcloud.com/3171 can be used as a temporary workaround to the removal of corrupt symlinks, but it doesn't actually fix the problem on disk, only prevents e2fsck from deleting the symlinks. This is why I want an updated version of that patch landed to e2fsck.

There is also as yet no patch to fix the osd-ldiskfs code to prevent further bad symlinks from being created in the future.

Both of these issues need to be addressed before this bug can be closed.

Comment by Andreas Dilger [ 27/Jun/12 ]

Since this is now known to be a bug in osd-ldiskfs and not e2fsck, I'm elevating this to be a 2.1.3 and 2.3 blocker.

Comment by Andreas Dilger [ 28/Jun/12 ]

Alex, can you please comment on the osd-ldiskfs issues here. What is the best way to write out a NUL-terminated symlink without changing the file size? This shouldn't be any different than writing out a partial-page buffer (which should be zero-filled at the end).

Comment by Alex Zhuravlev [ 28/Jun/12 ]

I think osd-ldiskfs can check the object's type (=symlink), then either append a zero and decrement i_size or just use trivial copy of osd_ldiskfs_write_record() ?

Comment by Andreas Dilger [ 30/Jun/12 ]

How does osd-ldiskfs handle the case of a normal 80-byte write to a new file? The rest of the disk block beyond 80 bytes needs to be zero-filled, but I don't think any tricks are played with the file size. Can we not just hook into the ldiskfs symlink code?

Cheers, Andreas

Comment by Build Master (Inactive) [ 16/Jul/12 ]

Integrated in e2fsprogs-master » x86_64,el6 #226
LU-1540 e2fsck: add missing symlink NUL terminator (Revision 311939a654b4085aef9559f21e91d7205150950f)

Result = SUCCESS
Andreas Dilger : 311939a654b4085aef9559f21e91d7205150950f
Files :

  • tests/f_badsymlinks/expect.2
  • tests/f_badsymlinks/expect.1
  • e2fsck/e2fsck.h
  • e2fsck/problem.c
  • e2fsck/pass1.c
  • e2fsck/problem.h
  • e2fsck/pass2.c
Comment by Build Master (Inactive) [ 16/Jul/12 ]

Integrated in e2fsprogs-master » i686,el6 #226
LU-1540 e2fsck: add missing symlink NUL terminator (Revision 311939a654b4085aef9559f21e91d7205150950f)

Result = SUCCESS
Andreas Dilger : 311939a654b4085aef9559f21e91d7205150950f
Files :

  • e2fsck/problem.c
  • tests/f_badsymlinks/expect.1
  • e2fsck/problem.h
  • e2fsck/pass1.c
  • tests/f_badsymlinks/expect.2
  • e2fsck/pass2.c
  • e2fsck/e2fsck.h
Comment by Build Master (Inactive) [ 16/Jul/12 ]

Integrated in e2fsprogs-master » x86_64,el5 #226
LU-1540 e2fsck: add missing symlink NUL terminator (Revision 311939a654b4085aef9559f21e91d7205150950f)

Result = SUCCESS
Andreas Dilger : 311939a654b4085aef9559f21e91d7205150950f
Files :

  • e2fsck/pass2.c
  • tests/f_badsymlinks/expect.2
  • e2fsck/pass1.c
  • tests/f_badsymlinks/expect.1
  • e2fsck/e2fsck.h
  • e2fsck/problem.h
  • e2fsck/problem.c
Comment by Build Master (Inactive) [ 16/Jul/12 ]

Integrated in e2fsprogs-master » i686,el5 #226
LU-1540 e2fsck: add missing symlink NUL terminator (Revision 311939a654b4085aef9559f21e91d7205150950f)

Result = SUCCESS
Andreas Dilger : 311939a654b4085aef9559f21e91d7205150950f
Files :

  • tests/f_badsymlinks/expect.2
  • e2fsck/e2fsck.h
  • e2fsck/pass2.c
  • e2fsck/pass1.c
  • e2fsck/problem.c
  • tests/f_badsymlinks/expect.1
  • e2fsck/problem.h
Comment by Peter Jones [ 07/Aug/12 ]

Bobijam

I spoke with Andreas about this issue and he thinks that it makes sense for you to talk directly with Alex Z about ways of approaching this problem and that we should aim to get this fixed for 2.3

Thanks

Peter

Comment by Peter Jones [ 08/Aug/12 ]

http://review.whamcloud.com/#change,3560

Comment by Andreas Dilger [ 09/Aug/12 ]

The e2fsck part of this change is included into the e2fsprogs-1.42.3.wc3 release.

Comment by Peter Jones [ 21/Aug/12 ]

Fix landed for 2.3

Comment by Diego Moreno (Inactive) [ 23/Aug/12 ]

Could it be possible to have the ldiskfs patch (LU-1540 osd: add NUL terminator for long symlink) ported to b2_1?

Thanks,

Diego

Comment by Zhenyu Xu [ 23/Aug/12 ]

b2_1 port of ldiskfs patch tracking at http://review.whamcloud.com/3765

patch description
LU-1540 osd: add NUL terminator for long symlink

Add NUL terminator for long symlink to ldiskfs inode on-disk data.

Signed-off-by: Bobi Jam <bobijam@whamcloud.com>
Comment by Diego Moreno (Inactive) [ 24/Aug/12 ]

That's great, we'll try the patch.

Thanks guys,

Generated at Sat Feb 10 01:17:32 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.