[LU-5584] EL7 client: Test failure on test suite runtests, subtest test_1 Created: 04/Sep/14 Updated: 14/May/15 Resolved: 31/Oct/14 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | Lustre 2.7.0 |
| Type: | Bug | Priority: | Critical |
| Reporter: | Maloo | Assignee: | Yang Sheng |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Attachments: |
|
| Severity: | 3 |
| Rank (Obsolete): | 15577 |
| Description |
|
This issue was created by maloo for Minh Diep <minh.diep@intel.com> This issue relates to the following test suite run: https://testing.hpdd.intel.com/test_sets/d54472f6-2fc6-11e4-957a-5254006e85c2. The sub-test test_1 failed with the following error:
Info required for matching: runtests 1 |
| Comments |
| Comment by Peter Jones [ 15/Sep/14 ] |
|
Yang Sheng Could you please advise on this one? Thanks Peter |
| Comment by Yang Sheng [ 22/Sep/14 ] |
|
http://review.whamcloud.com/12009 |
| Comment by Yang Sheng [ 26/Sep/14 ] |
|
Hi, PengTao. Could you please help verify upstream lustre also exist this problem? TIA. |
| Comment by Peter Jones [ 26/Sep/14 ] |
|
Oleg may be better placed to comment |
| Comment by Yang Sheng [ 29/Sep/14 ] |
| Comment by Yang Sheng [ 29/Sep/14 ] |
|
Hi, JinShan, Could you please comment on this patch? TIA. |
| Comment by Yang Sheng [ 20/Oct/14 ] |
|
As JinShan asked in gerrit. Attached logs for investigation. The runtests can hardly to collect log since it copy too much files. So i use a small script to reproduce this issue. It can produce log more easy to collection and analyzed. modprobe mbcache
sh llmount.sh
#echo "mdc osc class log llite lov mgc" >/proc/sys/lnet/subsystem_debug
echo "class osc mdc rpc lov llite" >/proc/sys/lnet/subsystem_debug
echo -1 >/proc/sys/lnet/debug
lctl set_param osc/*/max_rpcs_in_flight=1
while : ; do
rm -rf /mnt/lustre/ddd
mkdir - p /mnt/lustre/ddd
lctl mark "copy start"
cp /aaa/* /mnt/lustre/ddd/
lctl mark "copy end"
lctl dk >copy-end-log
nm=`ls /proc/fs/lustre/llite/`
umount /mnt/lustre
lctl dk > after-umount-log
mount -t lustre 192.168.122.111@tcp:/lustre /mnt/lustre
ls -lR /mnt/lustre/ddd |grep " 0 " && exit
echo "==========================new loop============="
done
This script just copy files from /aaa to /mnt/lustre/ddd. The file list is -rw-r--r-- 1 root root 16 9月 21 13:57 adjtime -rw-r--r-- 1 root root 1518 9月 21 13:57 aliases -rw-r--r-- 1 root root 12288 9月 21 13:57 aliases.db -rw------- 1 root root 541 9月 21 13:57 anacrontab -rw-r--r-- 1 root root 55 9月 21 13:57 asound.conf -rw-r--r-- 1 root root 1 9月 21 13:57 at.deny -rw-r--r-- 1 root root 2835 9月 21 13:57 bashrc -rw-r--r-- 1 root root 1157 9月 21 13:57 chrony.conf -rw-r----- 1 root root 62 9月 21 13:57 chrony.keys -rw-r--r-- 1 root root 451 9月 21 13:57 crontab -rw-r--r-- 1 root root 1602 9月 21 13:57 csh.cshrc -rw-r--r-- 1 root root 841 9月 21 13:57 csh.login -rw-r--r-- 1 root root 5090 9月 21 13:57 DIR_COLORS -rw-r--r-- 1 root root 5725 9月 21 13:57 DIR_COLORS.256color -rw-r--r-- 1 root root 4669 9月 21 13:57 DIR_COLORS.lightbgcolor -rw-r--r-- 1 root root 25213 9月 21 13:57 dnsmasq.conf -rw-r--r-- 1 root root 1285 9月 21 13:57 dracut.conf -rw-r--r-- 1 root root 112 9月 21 13:57 e2fsck.conf -rw-r--r-- 1 root root 1317 9月 21 13:57 ethertypes -rw-r--r-- 1 root root 730 9月 21 13:57 favicon.png -rw-r--r-- 1 root root 64 9月 21 13:57 filesystems -rw-r--r-- 1 root root 20 9月 21 13:57 fprintd.conf -rw-r--r-- 1 root root 501 9月 21 13:57 fstab -rw-r--r-- 1 root root 265 9月 21 13:57 gdbinit -rw-r--r-- 1 root root 94 9月 21 13:57 GREP_COLORS -rw-r--r-- 1 root root 601 9月 21 13:57 group -rw-r--r-- 1 root root 573 9月 21 13:57 group- -rw-r--r-- 1 root root 4672 9月 21 13:57 grub2.cfg ---------- 1 root root 467 9月 21 13:57 gshadow ---------- 1 root root 458 9月 21 13:57 gshadow- -rw-r--r-- 1 root root 9 9月 21 13:57 host.conf -rw-r--r-- 1 root root 3 9月 21 13:57 hostname -rw-r--r-- 1 root root 194 9月 21 13:57 hosts -rw-r--r-- 1 root root 370 9月 21 13:57 hosts.allow -rw-r--r-- 1 root root 460 9月 21 13:57 hosts.deny -rw-r--r-- 1 root root 491 9月 21 13:57 inittab -rw-r--r-- 1 root root 942 9月 21 13:57 inputrc -rw-r--r-- 1 root root 23 9月 21 13:57 issue -rw-r--r-- 1 root root 22 9月 21 13:57 issue.net -rw-r--r-- 1 root root 6088 9月 21 13:57 kdump.conf -rw-r--r-- 1 root root 495 9月 21 13:57 krb5.conf -rw-r--r-- 1 root root 548 9月 21 13:57 ldev.conf -rw-r--r-- 1 root root 29632 9月 21 13:57 ld.so.cache -rw-r--r-- 1 root root 28 9月 21 13:57 ld.so.conf -rw-r----- 1 root root 191 9月 21 13:57 libaudit.conf -rw-r--r-- 1 root root 2391 9月 21 13:57 libuser.conf -rw-r--r-- 1 root root 19 9月 21 13:57 locale.conf -rw-r--r-- 1 root root 3519 9月 21 13:57 localtime -rw-r--r-- 1 root root 2028 9月 21 13:57 login.defs -rw-r--r-- 1 root root 662 9月 21 13:57 logrotate.conf -r--r--r-- 1 root root 33 9月 21 13:57 machine-id -rw-r--r-- 1 root root 111 9月 21 13:57 magic -rw-r--r-- 1 root root 1968 9月 21 13:57 mail.rc -rw-r--r-- 1 root root 5122 9月 21 13:57 makedumpfile.conf.sample -rw-r--r-- 1 root root 5171 9月 21 13:57 man_db.conf -rw-r--r-- 1 root root 956 9月 21 13:57 mke2fs.conf -r--r--r-- 1 root root 2140 9月 21 13:57 mtab -rw-r--r-- 1 root root 570 9月 21 13:57 my.cnf -rw-r--r-- 1 root root 8892 9月 21 13:57 nanorc -rw-r--r-- 1 root root 58 9月 21 13:57 networks -rw-r--r-- 1 root root 1717 9月 21 13:57 nsswitch.conf -rw-r--r-- 1 root root 1726 9月 21 13:57 nsswitch.conf.bak -rw-r--r-- 1 root root 493 9月 21 13:57 os-release -rw-r--r-- 1 root root 1163 9月 21 13:57 passwd -rw-r--r-- 1 root root 1126 9月 21 13:57 passwd- -rw-r--r-- 1 root root 2872 9月 21 13:57 pinforc -rw-r--r-- 1 root root 233 9月 21 13:57 printcap -rw-r--r-- 1 root root 1750 9月 21 13:57 profile -rw-r--r-- 1 root root 6545 9月 21 13:57 protocols -rw-r--r-- 1 root root 477 9月 21 13:57 rc.local -rw-r--r-- 1 root root 52 9月 21 13:57 redhat-release -rw-r--r-- 1 root root 55 9月 21 13:57 resolv.conf -rw-r--r-- 1 root root 1634 9月 21 13:57 rpc -rw-r--r-- 1 root root 458 9月 21 13:57 rsyncd.conf -rw-r--r-- 1 root root 3232 9月 21 13:57 rsyslog.conf -rw-r--r-- 1 root root 927 9月 21 13:57 rwtab -rw------- 1 root root 221 9月 21 13:57 securetty -rw-r--r-- 1 root root 670293 9月 21 13:57 services -rw-r--r-- 1 root root 216 9月 21 13:57 sestatus.conf ---------- 1 root root 728 9月 21 13:57 shadow ---------- 1 root root 701 9月 21 13:57 shadow- -rw-r--r-- 1 root root 95 9月 21 13:57 shells -rw-r--r-- 1 root root 256 9月 21 13:57 sos.conf -rw-r--r-- 1 root root 212 9月 21 13:57 statetab -rw-r----- 1 root root 1786 9月 21 13:57 sudo.conf -r--r----- 1 root root 4000 9月 21 13:57 sudoers -rw-r----- 1 root root 3181 9月 21 13:57 sudo-ldap.conf -rw-r--r-- 1 root root 225 9月 21 13:57 sysctl.conf -rw-r--r-- 1 root root 52 9月 21 13:57 system-release -rw-r--r-- 1 root root 45 9月 21 13:57 system-release-cpe -rw-r--r-- 1 root root 375 9月 21 13:57 trusted-key.key -rw-r--r-- 1 root root 508 9月 21 13:57 updatedb.conf -rw-r--r-- 1 root root 889 9月 21 13:57 usb_modeswitch.conf -rw-r--r-- 1 root root 37 9月 21 13:57 vconsole.conf -rw-r--r-- 1 root root 1982 9月 21 13:57 vimrc -rw-r--r-- 1 root root 1982 9月 21 13:57 virc -rw-r--r-- 1 root root 4479 9月 21 13:57 wgetrc -rw-r--r-- 1 root root 813 9月 21 13:57 yum.conf Below is problem file list(zero size after umount & mount): -rw-r----- 1 root root 0 10月 20 10:06 chrony.keys -rw-r--r-- 1 root root 0 10月 20 10:06 DIR_COLORS.lightbgcolor -rw-r--r-- 1 root root 0 10月 20 10:06 ethertypes -rw-r--r-- 1 root root 0 10月 20 10:06 filesystems -rw-r--r-- 1 root root 0 10月 20 10:06 GREP_COLORS -rw-r--r-- 1 root root 0 10月 20 10:06 group- ---------- 1 root root 0 10月 20 10:06 gshadow -rw-r--r-- 1 root root 0 10月 20 10:06 host.conf -rw-r--r-- 1 root root 0 10月 20 10:06 hosts -rw-r--r-- 1 root root 0 10月 20 10:06 inputrc -rw-r--r-- 1 root root 0 10月 20 10:06 issue.net -rw-r--r-- 1 root root 0 10月 20 10:06 krb5.conf -rw-r--r-- 1 root root 0 10月 20 10:06 ld.so.cache -rw-r----- 1 root root 0 10月 20 10:06 libaudit.conf -rw-r--r-- 1 root root 0 10月 20 10:06 libuser.conf -rw-r--r-- 1 root root 0 10月 20 10:06 locale.conf -r--r--r-- 1 root root 0 10月 20 10:06 machine-id -r--r--r-- 1 root root 0 10月 20 10:06 mtab -rw-r--r-- 1 root root 0 10月 20 10:06 nsswitch.conf |
| Comment by Jinshan Xiong (Inactive) [ 20/Oct/14 ] |
|
Hi Yang Sheng, Thank you very much for the reproduce script and log, I'm looking at it now. |
| Comment by Jinshan Xiong (Inactive) [ 21/Oct/14 ] |
|
When a filesystem is being umounted, it will sync dirty data by calling sync_filesystem(). Then it calls __sync_filesystem() twice: the first time just starts the write back and the 2nd time is to wait for the write back to finish. Both call should reach ll_writepages() but for unknown reason, the 2nd call didn't reach it. My temporary solution is to make ll_writepages() to wait for write back to finish anyway: diff --git a/lustre/llite/rw.c b/lustre/llite/rw.c
index ff25525..778609b 100644
--- a/lustre/llite/rw.c
+++ b/lustre/llite/rw.c
@@ -1080,7 +1080,7 @@ int ll_writepages(struct address_space *mapping, struct writeback_control *wbc)
}
}
- mode = CL_FSYNC_NONE;
+ mode = CL_FSYNC_LOCAL;
if (wbc->sync_mode == WB_SYNC_ALL)
mode = CL_FSYNC_LOCAL;
I can't see this issue any more after applying the above patch. Yang Sheng, if you have any time, can you please investigate it further to figure out why the 2nd __sync_filesystem() didn't reach ll_writepages()? Thanks, |
| Comment by Jinshan Xiong (Inactive) [ 21/Oct/14 ] |
|
It looks like the inode is called with redirty_tail() and then put in the list of b_dirty with newer timestamp. This is why it was skipped in the 2nd __sync_filesystem() therefore dirty data is not flushed. I think Yang Sheng's patch is good to fix the problem. |
| Comment by Yang Sheng [ 25/Oct/14 ] |
|
The umount will hang while invoke with CL_FSYNC_ALL in replay-single test_89. {no format}[ 8119.181431] umount D ffff88003fc14580 0 4225 3612 0x00000080 [ 8119.181431] ffff880020cb7ae0 0000000000000086 ffff880020cb7fd8 0000000000014580 [ 8119.181431] ffff880020cb7fd8 0000000000014580 ffff880012100b60 ffff880009f50e80 [ 8119.181431] ffff880009f50e88 7fffffffffffffff ffff880012100b60 0000000000000000 [ 8119.181431] Call Trace: [ 8119.181431] [<ffffffff815e71b9>] schedule+0x29/0x70 [ 8119.181431] [<ffffffff815e50b9>] schedule_timeout+0x209/0x2d0 [ 8119.181431] [<ffffffffa0783d60>] ? ptlrpcd_add_req+0x210/0x300 [ptlrpc] [ 8119.181431] [<ffffffffa0764370>] ? lustre_swab_obdo+0x100/0x100 [ptlrpc] [ 8119.181431] [<ffffffff815e76e6>] wait_for_completion+0x116/0x170 [ 8119.181431] [<ffffffff81097700>] ? wake_up_state+0x20/0x20 [ 8119.181431] [<ffffffffa096a944>] osc_io_fsync_end+0x74/0xa0 [osc] [ 8119.181431] [<ffffffffa0572a2d>] cl_io_end+0x5d/0x150 [obdclass] [ 8119.181431] [<ffffffffa09b61fb>] lov_io_end_wrapper+0xdb/0xe0 [lov] [ 8119.181431] [<ffffffffa09b63a4>] lov_io_fsync_end+0x84/0x1c0 [lov] [ 8119.181431] [<ffffffffa0572a2d>] cl_io_end+0x5d/0x150 [obdclass] [ 8119.181431] [<ffffffffa0576733>] cl_io_loop+0xb3/0x190 [obdclass] [ 8119.181431] [<ffffffffa0f0022b>] cl_sync_file_range+0x40b/0x610 [lustre] [ 8119.181431] [<ffffffffa0f13fba>] ll_delete_inode+0x10a/0x230 [lustre] [ 8119.181431] [<ffffffff811db91e>] ? inode_wait_for_writeback+0x2e/0x40 [ 8119.181431] [<ffffffff811cae27>] evict+0xa7/0x170 [ 8119.181431] [<ffffffff811caf2e>] dispose_list+0x3e/0x50 [ 8119.181431] [<ffffffff811cbb14>] evict_inodes+0x114/0x140 [ 8119.181431] [<ffffffff811b1fc8>] generic_shutdown_super+0x48/0xe0 [ 8119.181431] [<ffffffff811b2242>] kill_anon_super+0x12/0x20 [ 8119.181431] [<ffffffffa055880a>] lustre_kill_super+0x7a/0x80 [obdclass] [ 8119.181431] [<ffffffff811b265d>] deactivate_locked_super+0x3d/0x60 [ 8119.181431] [<ffffffff811b26c6>] deactivate_super+0x46/0x60 [ 8119.181431] [<ffffffff811cf455>] mntput_no_expire+0xc5/0x120 [ 8119.181431] [<ffffffff811d058f>] SyS_umount+0x9f/0x3c0 [ 8119.181431] [<ffffffff815f2119>] system_call_fastpath+0x16/0x1b{no format} But CL_FSYNC_LOCAL not. I have updated patch. |
| Comment by Bob Glossman (Inactive) [ 31/Oct/14 ] |
|
I've verified that with the recent landing of http://review.whamcloud.com/#/c/12103 in master I no longer see this failure when running the test on el7 clients. I think this ticket can be marked Fixed. |
| Comment by Peter Jones [ 31/Oct/14 ] |
|
Landed for 2.7 |