[LU-5584] EL7 client: Test failure on test suite runtests, subtest test_1 Created: 04/Sep/14  Updated: 14/May/15  Resolved: 31/Oct/14

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: Lustre 2.7.0

Type: Bug Priority: Critical
Reporter: Maloo Assignee: Yang Sheng
Resolution: Fixed Votes: 0
Labels: None

Attachments: HTML File after-umount-log     HTML File copy-end-log    
Severity: 3
Rank (Obsolete): 15577

 Description   

This issue was created by maloo for Minh Diep <minh.diep@intel.com>

This issue relates to the following test suite run: https://testing.hpdd.intel.com/test_sets/d54472f6-2fc6-11e4-957a-5254006e85c2.

The sub-test test_1 failed with the following error:

old and new files are different: rc=22

Info required for matching: runtests 1



 Comments   
Comment by Peter Jones [ 15/Sep/14 ]

Yang Sheng

Could you please advise on this one?

Thanks

Peter

Comment by Yang Sheng [ 22/Sep/14 ]

http://review.whamcloud.com/12009
This is not a final solution but a necessary part.

Comment by Yang Sheng [ 26/Sep/14 ]

Hi, PengTao. Could you please help verify upstream lustre also exist this problem? TIA.

Comment by Peter Jones [ 26/Sep/14 ]

Oleg may be better placed to comment

Comment by Yang Sheng [ 29/Sep/14 ]

http://review.whamcloud.com/#/c/12103/

Comment by Yang Sheng [ 29/Sep/14 ]

Hi, JinShan, Could you please comment on this patch? TIA.

Comment by Yang Sheng [ 20/Oct/14 ]

As JinShan asked in gerrit. Attached logs for investigation. The runtests can hardly to collect log since it copy too much files. So i use a small script to reproduce this issue. It can produce log more easy to collection and analyzed.

modprobe mbcache
sh llmount.sh
#echo  "mdc osc class log llite lov mgc" >/proc/sys/lnet/subsystem_debug
echo  "class osc mdc rpc lov llite" >/proc/sys/lnet/subsystem_debug
echo  -1 >/proc/sys/lnet/debug
lctl set_param osc/*/max_rpcs_in_flight=1

while : ; do
        rm -rf /mnt/lustre/ddd
        mkdir - p /mnt/lustre/ddd
        lctl mark "copy start"
        cp /aaa/* /mnt/lustre/ddd/
        lctl mark "copy end"
        lctl dk >copy-end-log
        nm=`ls /proc/fs/lustre/llite/`
        umount /mnt/lustre
        lctl dk > after-umount-log
        mount -t lustre 192.168.122.111@tcp:/lustre /mnt/lustre
        ls -lR /mnt/lustre/ddd |grep " 0 " && exit
        echo "==========================new loop============="
done                                          

This script just copy files from /aaa to /mnt/lustre/ddd. The file list is

-rw-r--r-- 1 root root     16 9月  21 13:57 adjtime
-rw-r--r-- 1 root root   1518 9月  21 13:57 aliases
-rw-r--r-- 1 root root  12288 9月  21 13:57 aliases.db
-rw------- 1 root root    541 9月  21 13:57 anacrontab
-rw-r--r-- 1 root root     55 9月  21 13:57 asound.conf
-rw-r--r-- 1 root root      1 9月  21 13:57 at.deny
-rw-r--r-- 1 root root   2835 9月  21 13:57 bashrc
-rw-r--r-- 1 root root   1157 9月  21 13:57 chrony.conf
-rw-r----- 1 root root     62 9月  21 13:57 chrony.keys
-rw-r--r-- 1 root root    451 9月  21 13:57 crontab
-rw-r--r-- 1 root root   1602 9月  21 13:57 csh.cshrc
-rw-r--r-- 1 root root    841 9月  21 13:57 csh.login
-rw-r--r-- 1 root root   5090 9月  21 13:57 DIR_COLORS
-rw-r--r-- 1 root root   5725 9月  21 13:57 DIR_COLORS.256color
-rw-r--r-- 1 root root   4669 9月  21 13:57 DIR_COLORS.lightbgcolor
-rw-r--r-- 1 root root  25213 9月  21 13:57 dnsmasq.conf
-rw-r--r-- 1 root root   1285 9月  21 13:57 dracut.conf
-rw-r--r-- 1 root root    112 9月  21 13:57 e2fsck.conf
-rw-r--r-- 1 root root   1317 9月  21 13:57 ethertypes
-rw-r--r-- 1 root root    730 9月  21 13:57 favicon.png
-rw-r--r-- 1 root root     64 9月  21 13:57 filesystems
-rw-r--r-- 1 root root     20 9月  21 13:57 fprintd.conf
-rw-r--r-- 1 root root    501 9月  21 13:57 fstab
-rw-r--r-- 1 root root    265 9月  21 13:57 gdbinit
-rw-r--r-- 1 root root     94 9月  21 13:57 GREP_COLORS
-rw-r--r-- 1 root root    601 9月  21 13:57 group
-rw-r--r-- 1 root root    573 9月  21 13:57 group-
-rw-r--r-- 1 root root   4672 9月  21 13:57 grub2.cfg
---------- 1 root root    467 9月  21 13:57 gshadow
---------- 1 root root    458 9月  21 13:57 gshadow-
-rw-r--r-- 1 root root      9 9月  21 13:57 host.conf
-rw-r--r-- 1 root root      3 9月  21 13:57 hostname
-rw-r--r-- 1 root root    194 9月  21 13:57 hosts
-rw-r--r-- 1 root root    370 9月  21 13:57 hosts.allow
-rw-r--r-- 1 root root    460 9月  21 13:57 hosts.deny
-rw-r--r-- 1 root root    491 9月  21 13:57 inittab
-rw-r--r-- 1 root root    942 9月  21 13:57 inputrc
-rw-r--r-- 1 root root     23 9月  21 13:57 issue
-rw-r--r-- 1 root root     22 9月  21 13:57 issue.net
-rw-r--r-- 1 root root   6088 9月  21 13:57 kdump.conf
-rw-r--r-- 1 root root    495 9月  21 13:57 krb5.conf
-rw-r--r-- 1 root root    548 9月  21 13:57 ldev.conf
-rw-r--r-- 1 root root  29632 9月  21 13:57 ld.so.cache
-rw-r--r-- 1 root root     28 9月  21 13:57 ld.so.conf
-rw-r----- 1 root root    191 9月  21 13:57 libaudit.conf
-rw-r--r-- 1 root root   2391 9月  21 13:57 libuser.conf
-rw-r--r-- 1 root root     19 9月  21 13:57 locale.conf
-rw-r--r-- 1 root root   3519 9月  21 13:57 localtime
-rw-r--r-- 1 root root   2028 9月  21 13:57 login.defs
-rw-r--r-- 1 root root    662 9月  21 13:57 logrotate.conf
-r--r--r-- 1 root root     33 9月  21 13:57 machine-id
-rw-r--r-- 1 root root    111 9月  21 13:57 magic
-rw-r--r-- 1 root root   1968 9月  21 13:57 mail.rc
-rw-r--r-- 1 root root   5122 9月  21 13:57 makedumpfile.conf.sample
-rw-r--r-- 1 root root   5171 9月  21 13:57 man_db.conf
-rw-r--r-- 1 root root    956 9月  21 13:57 mke2fs.conf
-r--r--r-- 1 root root   2140 9月  21 13:57 mtab
-rw-r--r-- 1 root root    570 9月  21 13:57 my.cnf
-rw-r--r-- 1 root root   8892 9月  21 13:57 nanorc
-rw-r--r-- 1 root root     58 9月  21 13:57 networks
-rw-r--r-- 1 root root   1717 9月  21 13:57 nsswitch.conf
-rw-r--r-- 1 root root   1726 9月  21 13:57 nsswitch.conf.bak
-rw-r--r-- 1 root root    493 9月  21 13:57 os-release
-rw-r--r-- 1 root root   1163 9月  21 13:57 passwd
-rw-r--r-- 1 root root   1126 9月  21 13:57 passwd-
-rw-r--r-- 1 root root   2872 9月  21 13:57 pinforc
-rw-r--r-- 1 root root    233 9月  21 13:57 printcap
-rw-r--r-- 1 root root   1750 9月  21 13:57 profile
-rw-r--r-- 1 root root   6545 9月  21 13:57 protocols
-rw-r--r-- 1 root root    477 9月  21 13:57 rc.local
-rw-r--r-- 1 root root     52 9月  21 13:57 redhat-release
-rw-r--r-- 1 root root     55 9月  21 13:57 resolv.conf
-rw-r--r-- 1 root root   1634 9月  21 13:57 rpc
-rw-r--r-- 1 root root    458 9月  21 13:57 rsyncd.conf
-rw-r--r-- 1 root root   3232 9月  21 13:57 rsyslog.conf
-rw-r--r-- 1 root root    927 9月  21 13:57 rwtab
-rw------- 1 root root    221 9月  21 13:57 securetty
-rw-r--r-- 1 root root 670293 9月  21 13:57 services
-rw-r--r-- 1 root root    216 9月  21 13:57 sestatus.conf
---------- 1 root root    728 9月  21 13:57 shadow
---------- 1 root root    701 9月  21 13:57 shadow-
-rw-r--r-- 1 root root     95 9月  21 13:57 shells
-rw-r--r-- 1 root root    256 9月  21 13:57 sos.conf
-rw-r--r-- 1 root root    212 9月  21 13:57 statetab
-rw-r----- 1 root root   1786 9月  21 13:57 sudo.conf
-r--r----- 1 root root   4000 9月  21 13:57 sudoers
-rw-r----- 1 root root   3181 9月  21 13:57 sudo-ldap.conf
-rw-r--r-- 1 root root    225 9月  21 13:57 sysctl.conf
-rw-r--r-- 1 root root     52 9月  21 13:57 system-release
-rw-r--r-- 1 root root     45 9月  21 13:57 system-release-cpe
-rw-r--r-- 1 root root    375 9月  21 13:57 trusted-key.key
-rw-r--r-- 1 root root    508 9月  21 13:57 updatedb.conf
-rw-r--r-- 1 root root    889 9月  21 13:57 usb_modeswitch.conf
-rw-r--r-- 1 root root     37 9月  21 13:57 vconsole.conf
-rw-r--r-- 1 root root   1982 9月  21 13:57 vimrc
-rw-r--r-- 1 root root   1982 9月  21 13:57 virc
-rw-r--r-- 1 root root   4479 9月  21 13:57 wgetrc
-rw-r--r-- 1 root root    813 9月  21 13:57 yum.conf

Below is problem file list(zero size after umount & mount):

-rw-r----- 1 root root      0 10月 20 10:06 chrony.keys
-rw-r--r-- 1 root root      0 10月 20 10:06 DIR_COLORS.lightbgcolor
-rw-r--r-- 1 root root      0 10月 20 10:06 ethertypes
-rw-r--r-- 1 root root      0 10月 20 10:06 filesystems
-rw-r--r-- 1 root root      0 10月 20 10:06 GREP_COLORS
-rw-r--r-- 1 root root      0 10月 20 10:06 group-
---------- 1 root root      0 10月 20 10:06 gshadow
-rw-r--r-- 1 root root      0 10月 20 10:06 host.conf
-rw-r--r-- 1 root root      0 10月 20 10:06 hosts
-rw-r--r-- 1 root root      0 10月 20 10:06 inputrc
-rw-r--r-- 1 root root      0 10月 20 10:06 issue.net
-rw-r--r-- 1 root root      0 10月 20 10:06 krb5.conf
-rw-r--r-- 1 root root      0 10月 20 10:06 ld.so.cache
-rw-r----- 1 root root      0 10月 20 10:06 libaudit.conf
-rw-r--r-- 1 root root      0 10月 20 10:06 libuser.conf
-rw-r--r-- 1 root root      0 10月 20 10:06 locale.conf
-r--r--r-- 1 root root      0 10月 20 10:06 machine-id
-r--r--r-- 1 root root      0 10月 20 10:06 mtab
-rw-r--r-- 1 root root      0 10月 20 10:06 nsswitch.conf
Comment by Jinshan Xiong (Inactive) [ 20/Oct/14 ]

Hi Yang Sheng, Thank you very much for the reproduce script and log, I'm looking at it now.

Comment by Jinshan Xiong (Inactive) [ 21/Oct/14 ]

When a filesystem is being umounted, it will sync dirty data by calling sync_filesystem(). Then it calls __sync_filesystem() twice: the first time just starts the write back and the 2nd time is to wait for the write back to finish. Both call should reach ll_writepages() but for unknown reason, the 2nd call didn't reach it.

My temporary solution is to make ll_writepages() to wait for write back to finish anyway:

diff --git a/lustre/llite/rw.c b/lustre/llite/rw.c
index ff25525..778609b 100644
--- a/lustre/llite/rw.c
+++ b/lustre/llite/rw.c
@@ -1080,7 +1080,7 @@ int ll_writepages(struct address_space *mapping, struct writeback_control *wbc)
                }
        }
 
-       mode = CL_FSYNC_NONE;
+       mode = CL_FSYNC_LOCAL;
        if (wbc->sync_mode == WB_SYNC_ALL)
                mode = CL_FSYNC_LOCAL;
 

I can't see this issue any more after applying the above patch.

Yang Sheng, if you have any time, can you please investigate it further to figure out why the 2nd __sync_filesystem() didn't reach ll_writepages()?

Thanks,
Jinshan

Comment by Jinshan Xiong (Inactive) [ 21/Oct/14 ]

It looks like the inode is called with redirty_tail() and then put in the list of b_dirty with newer timestamp. This is why it was skipped in the 2nd __sync_filesystem() therefore dirty data is not flushed. I think Yang Sheng's patch is good to fix the problem.

Comment by Yang Sheng [ 25/Oct/14 ]

The umount will hang while invoke with CL_FSYNC_ALL in replay-single test_89.

{no format}
[ 8119.181431] umount D ffff88003fc14580 0 4225 3612 0x00000080
[ 8119.181431] ffff880020cb7ae0 0000000000000086 ffff880020cb7fd8 0000000000014580
[ 8119.181431] ffff880020cb7fd8 0000000000014580 ffff880012100b60 ffff880009f50e80
[ 8119.181431] ffff880009f50e88 7fffffffffffffff ffff880012100b60 0000000000000000
[ 8119.181431] Call Trace:
[ 8119.181431] [<ffffffff815e71b9>] schedule+0x29/0x70
[ 8119.181431] [<ffffffff815e50b9>] schedule_timeout+0x209/0x2d0
[ 8119.181431] [<ffffffffa0783d60>] ? ptlrpcd_add_req+0x210/0x300 [ptlrpc]
[ 8119.181431] [<ffffffffa0764370>] ? lustre_swab_obdo+0x100/0x100 [ptlrpc]
[ 8119.181431] [<ffffffff815e76e6>] wait_for_completion+0x116/0x170
[ 8119.181431] [<ffffffff81097700>] ? wake_up_state+0x20/0x20
[ 8119.181431] [<ffffffffa096a944>] osc_io_fsync_end+0x74/0xa0 [osc]
[ 8119.181431] [<ffffffffa0572a2d>] cl_io_end+0x5d/0x150 [obdclass]
[ 8119.181431] [<ffffffffa09b61fb>] lov_io_end_wrapper+0xdb/0xe0 [lov]
[ 8119.181431] [<ffffffffa09b63a4>] lov_io_fsync_end+0x84/0x1c0 [lov]
[ 8119.181431] [<ffffffffa0572a2d>] cl_io_end+0x5d/0x150 [obdclass]
[ 8119.181431] [<ffffffffa0576733>] cl_io_loop+0xb3/0x190 [obdclass]
[ 8119.181431] [<ffffffffa0f0022b>] cl_sync_file_range+0x40b/0x610 [lustre]
[ 8119.181431] [<ffffffffa0f13fba>] ll_delete_inode+0x10a/0x230 [lustre]
[ 8119.181431] [<ffffffff811db91e>] ? inode_wait_for_writeback+0x2e/0x40
[ 8119.181431] [<ffffffff811cae27>] evict+0xa7/0x170
[ 8119.181431] [<ffffffff811caf2e>] dispose_list+0x3e/0x50
[ 8119.181431] [<ffffffff811cbb14>] evict_inodes+0x114/0x140
[ 8119.181431] [<ffffffff811b1fc8>] generic_shutdown_super+0x48/0xe0
[ 8119.181431] [<ffffffff811b2242>] kill_anon_super+0x12/0x20
[ 8119.181431] [<ffffffffa055880a>] lustre_kill_super+0x7a/0x80 [obdclass]
[ 8119.181431] [<ffffffff811b265d>] deactivate_locked_super+0x3d/0x60
[ 8119.181431] [<ffffffff811b26c6>] deactivate_super+0x46/0x60
[ 8119.181431] [<ffffffff811cf455>] mntput_no_expire+0xc5/0x120
[ 8119.181431] [<ffffffff811d058f>] SyS_umount+0x9f/0x3c0
[ 8119.181431] [<ffffffff815f2119>] system_call_fastpath+0x16/0x1b{no format}

But CL_FSYNC_LOCAL not. I have updated patch.

Comment by Bob Glossman (Inactive) [ 31/Oct/14 ]

I've verified that with the recent landing of http://review.whamcloud.com/#/c/12103 in master I no longer see this failure when running the test on el7 clients. I think this ticket can be marked Fixed.

Comment by Peter Jones [ 31/Oct/14 ]

Landed for 2.7

Generated at Sat Feb 10 01:52:45 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.