[LU-3573] lustre-rsync-test test_8: @@@@@@ FAIL: Failure in replication; differences found. Created: 10/Jul/13 Updated: 20/Jul/15 Resolved: 12/Dec/14 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.6.0, Lustre 2.5.1, Lustre 2.7.0, Lustre 2.5.3 |
| Fix Version/s: | Lustre 2.7.0, Lustre 2.5.4 |
| Type: | Bug | Priority: | Critical |
| Reporter: | Keith Mannthey (Inactive) | Assignee: | Nathaniel Clark |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | prz, zfs | ||
| Environment: | a patch pushed to autotest |
| Issue Links: |
|
| Severity: | 3 |
| Rank (Obsolete): | 9026 |
| Description |
|
On ZFS an error was seen with lustre-rsync-test test_8; the logs are here: I don't really know how to read lrsync_log. The test error reports a very basic:

== lustre-rsync-test test 8: Replicate multiple file/directory moves == 16:00:59 (1373410859)
CMD: wtm-10vm7 lctl --device lustre-MDT0000 changelog_register -n
lustre-MDT0000: Registered changelog user cl13
CMD: wtm-10vm7 lctl get_param -n mdd.lustre-MDT0000.changelog_users
Lustre filesystem: lustre
MDT device: lustre-MDT0000
Source: /mnt/lustre
Target: /tmp/target
Statuslog: /tmp/lustre_rsync.log
Changelog registration: cl13
Starting changelog record: 0
Clear changelog after use: no
Errors: 0
lustre_rsync took 107 seconds
Changelog records consumed: 1881
Only in /tmp/target/d0.lustre-rsync-test/d8/d08/d083: a3
 lustre-rsync-test test_8: @@@@@@ FAIL: Failure in replication; differences found.
  Trace dump:
  = /usr/lib64/lustre/tests/test-framework.sh:4066:error_noexit()
  = /usr/lib64/lustre/tests/test-framework.sh:4093:error()
  .....

Out of the last 100 runs it reports 1 error, so it could be related to the base patch or a rare error. |
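For reference, the replication step behind the output above is a lustre_rsync run. A minimal sketch of an equivalent invocation, assuming the source, target, MDT, changelog user, and statuslog reported in the log (the exact options used by lustre-rsync-test.sh are not shown in this excerpt):

# Sketch only: parameters are taken from the log above, not from the test script itself.
lustre_rsync --source=/mnt/lustre --target=/tmp/target \
             --mdt=lustre-MDT0000 --user=cl13 \
             --statuslog=/tmp/lustre_rsync.log --verbose

The "Only in /tmp/target/..." line is the source-vs-target difference that triggers the failure: the target copy contains a file that the source no longer lists.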
| Comments |
| Comment by Jodi Levi (Inactive) [ 11/Jul/13 ] |
|
Nathaniel, |
| Comment by Andreas Dilger [ 11/Jul/13 ] |
|
It appears that this has been failing for several weeks. I searched for "lustre-rsync-test test8" in Maloo Results->Search->SubTest and found quite a few hits, dating back to 2013-03-25 after https://maloo.whamcloud.com/sub_tests/1d923f9a-959d-11e2-bc9e-52540035b04c. The actual file reported changes from failure to failure ("a8", "a9", etc). The Maloo search also confirmed that none of the recent test_8 failures happened on ldiskfs, only on ZFS. |
| Comment by nasf (Inactive) [ 27/Dec/13 ] |
|
Another failure instance: https://maloo.whamcloud.com/test_sets/395e5f0a-6ec4-11e3-a57c-52540035b04c |
| Comment by Jian Yu [ 13/Mar/14 ] |
|
Lustre Build: http://build.whamcloud.com/job/lustre-b2_5/41/ (2.5.1 RC3)

The same failure occurred: |
| Comment by Nathaniel Clark [ 03/Apr/14 ] |
|
This bug may be fixed by |
| Comment by Peter Jones [ 24/Apr/14 ] |
|
Nathaniel

Could you please prove/disprove this theory by porting the

Thanks

Peter |
| Comment by Nathaniel Clark [ 24/Jul/14 ] |
|
|
| Comment by Jian Yu [ 23/Sep/14 ] |
|
While testing patch http://review.whamcloud.com/12012 on Lustre b2_5 branch with FSTYPE=zfs, the same failure occurred: |
| Comment by Nathaniel Clark [ 02/Oct/14 ] |
|
I've been running this on my local test cluster (tests 1-8 repeatedly) and have not been able to reproduce the failure locally, so I have had no system in the failed state to debug. This failure has only happened 4 times out of 121 test runs (zfs on master) in the last week (about 3%). |
| Comment by Nathaniel Clark [ 03/Oct/14 ] |
|
Finally reproduced: the file reported as missing from lustre but present in the rsync'd dir is in fact missing from lustre. Remounting the client and then the MDT did not cause the file to re-appear, and there is no log of the file being deleted.

This behaviour can be reproduced by just running:

for i in 1 2 3 4 5 6 7 8 9; do
	mkdir $DIR/$tdir/d$i
	for j in 1 2 3 4 5 6 7 8 9; do
		mkdir $DIR/$tdir/d$i/d$i$j
		createmany -o $DIR/$tdir/d$i/d$i$j/a 10 > /dev/null
		mv $DIR/$tdir/d$i/d$i$j $DIR/$tdir/d$i/d0$i$j
		createmany -o $DIR/$tdir/d$i/d0$i$j/b 10 > /dev/null
		mv $DIR/$tdir/d$i/d0$i$j/a0 $DIR/$tdir/d$i/d0$i$j/c0
	done
	mv $DIR/$tdir/d$i $DIR/$tdir/d0$i
done

and comparing the created files to what would be expected:

for i in 1 2 3 4 5 6 7 8 9; do
	for j in 1 2 3 4 5 6 7 8 9; do
		for k in 1 2 3 4 5 6 7 8 9; do
			echo $D/d0$i/d0$i$j/a$k >> $F
		done
		for k in 0 1 2 3 4 5 6 7 8 9; do
			echo $D/d0$i/d0$i$j/b$k >> $F
		done
		echo $D/d0$i/d0$i$j/c0 >> $F
	done
done
find $DIR/$tdir -type f | sort | diff $F - || error "Differences Found"
| Comment by Jodi Levi (Inactive) [ 05/Oct/14 ] |
|
Nathaniel, |
| Comment by Nathaniel Clark [ 06/Oct/14 ] |
|
The file actually does exist in the MDT's zfs filesystem, and can be accessed directly from the client, but ls does not show it in the directory. This reminds me of a "recent" bug, but I'm having trouble locating it. |
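The symptom can be illustrated with a quick check from a client; the path below is only an example constructed from the reproduction script ($DIR=/mnt/lustre, $tdir=d0.lustre-rsync-test), not a path captured from the failed run:

# The directory listing is missing the entry...
ls /mnt/lustre/d0.lustre-rsync-test/d07/d074
# ...yet naming the file explicitly still succeeds, so the object itself is intact.
stat /mnt/lustre/d0.lustre-rsync-test/d07/d074/a9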
| Comment by Nathaniel Clark [ 16/Oct/14 ] |
|
Status of current testing: |
| Comment by Andreas Dilger [ 16/Oct/14 ] |
|
Nathaniel, does this problem disappear if the client is unmounted/remounted, or is it a function of the directory itself? Is it possible that there are hash collisions in the name entries that are not being handled properly? |
| Comment by Nathaniel Clark [ 21/Oct/14 ] |
|
Andreas, |
| Comment by Nathaniel Clark [ 22/Oct/14 ] |
|
Not sure if this is related, but I was trying to see if it was something in the file structures on disk vs. something under the hood for ZFS, so I followed the directions for Backup and Restore of a File-Level backup. I restored and then tried to mount the MDT, resulting in this:

LustreError: 46380:0:(osd_oi.c:232:osd_fld_lookup()) ASSERTION( ss != ((void *)0) ) failed:
LustreError: 46380:0:(osd_oi.c:232:osd_fld_lookup()) LBUG
Kernel panic - not syncing: LBUG
Pid: 46380, comm: mount.lustre Tainted: P --------------- 2.6.32-431.20.3.el6_lustre.g5a7c614.x86_64 #1
Call Trace:
 [<ffffffff8152859c>] ? panic+0xa7/0x16f
 [<ffffffffa0665eeb>] ? lbug_with_loc+0x9b/0xb0 [libcfs]
 [<ffffffffa0e9c78f>] ? osd_fld_lookup+0xcf/0xd0 [osd_zfs]
 [<ffffffffa0e9c88e>] ? fid_is_on_ost+0xfe/0x310 [osd_zfs]
 [<ffffffffa0392869>] ? dbuf_rele_and_unlock+0x169/0x1e0 [zfs]
 [<ffffffffa0e9cbac>] ? osd_get_name_n_idx+0x4c/0xe80 [osd_zfs]
 [<ffffffffa03b3f53>] ? dsl_dataset_block_freeable+0x43/0x60 [zfs]
 [<ffffffffa03a8d93>] ? dmu_tx_hold_zap+0x1a3/0x200 [zfs]
 [<ffffffffa0e9dd1a>] ? osd_convert_root_to_new_seq+0x33a/0x670 [osd_zfs]
 [<ffffffffa0e8c4b8>] ? osd_mount+0xbc8/0xf60 [osd_zfs]
 [<ffffffffa0e8ed66>] ? osd_device_alloc+0x2a6/0x3b0 [osd_zfs]
 [<ffffffffa07ebf0f>] ? obd_setup+0x1bf/0x290 [obdclass]
 [<ffffffffa07ec1e8>] ? class_setup+0x208/0x870 [obdclass]
 [<ffffffffa07f438c>] ? class_process_config+0xc5c/0x1ac0 [obdclass]
 [<ffffffffa07f94b5>] ? lustre_cfg_new+0x4f5/0x6f0 [obdclass]
 [<ffffffffa07f9808>] ? do_lcfg+0x158/0x450 [obdclass]
 [<ffffffffa07f9b94>] ? lustre_start_simple+0x94/0x200 [obdclass]
 [<ffffffffa0832df1>] ? server_fill_super+0x1061/0x1720 [obdclass]
 [<ffffffffa07ff6b8>] ? lustre_fill_super+0x1d8/0x550 [obdclass]
 [<ffffffffa07ff4e0>] ? lustre_fill_super+0x0/0x550 [obdclass]
 [<ffffffff8118c01f>] ? get_sb_nodev+0x5f/0xa0
 [<ffffffffa07f72b5>] ? lustre_get_sb+0x25/0x30 [obdclass]
 [<ffffffff8118b67b>] ? vfs_kern_mount+0x7b/0x1b0
 [<ffffffff8118b822>] ? do_kern_mount+0x52/0x130
 [<ffffffff8119e422>] ? vfs_ioctl+0x22/0xa0
 [<ffffffff811ad1fb>] ? do_mount+0x2fb/0x930
 [<ffffffff811ad8c0>] ? sys_mount+0x90/0xe0
 [<ffffffff8100b072>] ? system_call_fastpath+0x16/0x1b

I had fully and cleanly unmounted everything prior to creating the backup; I then reformatted the ZFS filesystem.

update |
| Comment by Andreas Dilger [ 24/Oct/14 ] |
|
I don't think it is supported to do file-level backup and restore for ZFS. For ZFS you need to do "zfs send" and "zfs recv" or else the OI files will be broken, and there is currently no OI Scrub functionality for osd-zfs. If that isn't clearly documented in the Lustre User Manual's Backup and Restore section, then that is a defect in the manual that should be fixed. |
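A minimal sketch of the "zfs send"/"zfs recv" approach, assuming an unmounted MDT dataset (the pool/dataset names and backup path are placeholders):

# Snapshot the unmounted MDT dataset.
zfs snapshot mdspool/mdt1@backup
# Stream the snapshot to a file (or pipe it straight into "zfs recv" on another host).
zfs send mdspool/mdt1@backup > /backup/mdt1.zsnap
# Restore into a new dataset; unlike a file-level copy, this preserves the OI files.
zfs receive mdspool/mdt1-restore < /backup/mdt1.zsnap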
| Comment by Nathaniel Clark [ 24/Oct/14 ] |
|
Doing update restore using |
| Comment by Jian Yu [ 25/Oct/14 ] |
|
One more instance on master branch: https://testing.hpdd.intel.com/test_sets/a2161baa-5bea-11e4-a35f-5254006e85c2 |
| Comment by Nathaniel Clark [ 27/Oct/14 ] |
|
The file itself isn't hidden; the name is "cloaked".

[root@lubuilder d074]# ls
a1  a2  a3  a4  a5  a6  a7  a8  b0  b1  b2  b3  b4  b5  b6  b7  b8  b9  c0
[root@lubuilder d074]# mv a9 a9a
[root@lubuilder d074]# ls
a1  a2  a3  a4  a5  a6  a7  a8  a9a  b0  b1  b2  b3  b4  b5  b6  b7  b8  b9  c0
[root@lubuilder d074]# mv a9a a9
[root@lubuilder d074]# ls
a1  a2  a3  a4  a5  a6  a7  a8  b0  b1  b2  b3  b4  b5  b6  b7  b8  b9  c0
[root@lubuilder d074]# mv a9 a9a
[root@lubuilder d074]# ls
a1  a2  a3  a4  a5  a6  a7  a8  a9a  b0  b1  b2  b3  b4  b5  b6  b7  b8  b9  c0
[root@lubuilder d074]# mv a8 a9
[root@lubuilder d074]# ls
a1  a2  a3  a4  a5  a6  a7  a9a  b0  b1  b2  b3  b4  b5  b6  b7  b8  b9  c0
[root@lubuilder d074]# mv a9 a8
[root@lubuilder d074]# ls
a1  a2  a3  a4  a5  a6  a7  a8  a9a  b0  b1  b2  b3  b4  b5  b6  b7  b8  b9  c0
| Comment by Andreas Dilger [ 29/Oct/14 ] |
|
Does ZFS have whiteout entries, or is there some problem with the hashing that prevents the directory entry from appearing? I know that "ls" will not show entries that have inode == 0, but the inode number should be independent of the filename, so that wouldn't behave in this manner. Maybe there is a ZFS "hidden" flag that is not being initialized correctly and in some cases this flag is set? Does anything appear with "strace" or with "zdb"? |
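The checks suggested here might look roughly like the following; the client path, pool/dataset name, and object id are placeholders rather than values from the failed run:

# On the client: capture exactly what the readdir syscalls return for the directory.
strace -e trace=getdents,getdents64 ls /mnt/lustre/d0.lustre-rsync-test/d07/d074
# On the MDS: dump the directory object, including its ZAP name entries, with zdb.
# The object id would first be located by walking the dataset, e.g. "zdb -dddd mdspool/mdt1".
zdb -dddd mdspool/mdt1 <object-id>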
| Comment by Nathaniel Clark [ 30/Oct/14 ] |
|
The inode is not 0; the inode is tied to the file, and this issue is with the name in that directory. |
| Comment by Jian Yu [ 03/Nov/14 ] |
|
One more instance on Lustre b2_5 branch with FSTYPE=zfs: https://testing.hpdd.intel.com/test_sets/5310f46c-61fd-11e4-bd1f-5254006e85c2 |
| Comment by Nathaniel Clark [ 05/Nov/14 ] |
| Comment by Peter Jones [ 10/Nov/14 ] |
|
Fix landed for 2.7 |
| Comment by Bob Glossman (Inactive) [ 10/Nov/14 ] |
|
in b2_5: |
| Comment by Andreas Dilger [ 14/Nov/14 ] |
|
Unfortunately, version 2 of the http://review.whamcloud.com/12582 patch failed review-zfs conf-sanity test_32b with a real problem that is now causing serious test failures on master. The fact that it failed on the unlanded patch, when no other tests were failing before that time, makes it very likely that this patch caused the problem. I've opened |
| Comment by Jodi Levi (Inactive) [ 17/Nov/14 ] |
|
Patch landed to Master. b2_5 patch ready to land and tracked outside of this ticket. |
| Comment by Gerrit Updater [ 01/Dec/14 ] |
|
Andreas Dilger (andreas.dilger@intel.com) uploaded a new patch: http://review.whamcloud.com/12887 |
| Comment by Gerrit Updater [ 02/Dec/14 ] |
|
Nathaniel Clark (nathaniel.l.clark@intel.com) uploaded a new patch: http://review.whamcloud.com/12904 |
| Comment by Nathaniel Clark [ 02/Dec/14 ] |
|
This should be a fix for this bug without re-introducing |
| Comment by Nathaniel Clark [ 02/Dec/14 ] |
|
When http://review.whamcloud.com/12887 is landed, this bug will be open again. |
| Comment by Gerrit Updater [ 02/Dec/14 ] |
|
Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/12887/ |
| Comment by Gerrit Updater [ 10/Dec/14 ] |
|
Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/12904/ |
| Comment by Jodi Levi (Inactive) [ 12/Dec/14 ] |
|
Patches landed to Master. |
| Comment by Gerrit Updater [ 12/Jan/15 ] |
|
Bob Glossman (bob.glossman@intel.com) uploaded a new patch: http://review.whamcloud.com/13357 |
| Comment by Gerrit Updater [ 27/Jan/15 ] |
|
Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/13357/ |