
lustre-rsync-test test_8: @@@@@@ FAIL: Failure in replication; differences found.

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Critical
    • Fix Version/s: Lustre 2.7.0, Lustre 2.5.4
    • Affects Version/s: Lustre 2.6.0, Lustre 2.5.1, Lustre 2.7.0, Lustre 2.5.3
    • a patch pushed to autotest
    • 3
    • 9026

    Description

      On ZFS, an error was seen with lustre-rsync-test test 8. The logs are here:
      https://maloo.whamcloud.com/test_sets/e3492890-e901-11e2-ae91-52540035b04c

      I don't really know how to read lrsync_log.

      The test error report is fairly basic:

      == lustre-rsync-test test 8: Replicate multiple file/directory moves == 16:00:59 (1373410859)
      CMD: wtm-10vm7 lctl --device lustre-MDT0000 changelog_register -n
      lustre-MDT0000: Registered changelog user cl13
      CMD: wtm-10vm7 lctl get_param -n mdd.lustre-MDT0000.changelog_users
      Lustre filesystem: lustre
      MDT device: lustre-MDT0000
      Source: /mnt/lustre
      Target: /tmp/target
      Statuslog: /tmp/lustre_rsync.log
      Changelog registration: cl13
      Starting changelog record: 0
      Clear changelog after use: no
      Errors: 0
      lustre_rsync took 107 seconds
      Changelog records consumed: 1881
      Only in /tmp/target/d0.lustre-rsync-test/d8/d08/d083: a3
       lustre-rsync-test test_8: @@@@@@ FAIL: Failure in replication; differences found. 
        Trace dump:
        = /usr/lib64/lustre/tests/test-framework.sh:4066:error_noexit()
        = /usr/lib64/lustre/tests/test-framework.sh:4093:error()
        .....
      

      Out of the last 100 runs only 1 error was reported, so this could be related to the base patch or it could simply be a rare failure.
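
      For reference, the parameters shown in the lustre_rsync output above correspond roughly to a manual invocation like the following (a sketch only; the exact options the test script passes may differ):

      lustre_rsync --source=/mnt/lustre --target=/tmp/target \
                   --mdt=lustre-MDT0000 --user=cl13 \
                   --statuslog=/tmp/lustre_rsync.log --verbose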

      Attachments

        Issue Links

          Activity

            [LU-3573] lustre-rsync-test test_8: @@@@@@ FAIL: Failure in replication; differences found.

            utopiabound Nathaniel Clark added a comment -

            Doing the restore using the LUDOC-161 information (zfs send/recv) instead of tar works much better. It shows no change after the restore; the file is still missing from the lustre client.

            adilger Andreas Dilger added a comment -

            I don't think it is supported to do file-level backup and restore for ZFS. For ZFS you need to do "zfs send" and "zfs recv", or else the OI files will be broken, and there is currently no OI Scrub functionality for osd-zfs.

            If that isn't clearly documented in the Lustre User Manual's Backup and Restore section, then that is a defect in the manual that should be fixed.
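
            For reference, a minimal sketch of the snapshot-based approach (zfs send/recv, as described in LUDOC-161); the pool/dataset names (lustre-mdt1/mdt1), backup file, and mount point below are illustrative only:

                # On the MDS, with the MDT unmounted: snapshot the dataset
                # and stream it out to a backup file.
                zfs snapshot lustre-mdt1/mdt1@backup
                zfs send lustre-mdt1/mdt1@backup > /backup/lustre-mdt1.zsend

                # Restore by streaming the snapshot back into the pool
                # (recreating the dataset), then mount the target again.
                zfs recv -F lustre-mdt1/mdt1 < /backup/lustre-mdt1.zsend
                mount -t lustre lustre-mdt1/mdt1 /mnt/mds1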
            utopiabound Nathaniel Clark added a comment - - edited

            Not sure if this is related, but I was trying to see whether it was something in the file structures on disk vs. something under the hood in ZFS, so I followed the directions for Backup and Restore of a File-Level Backup. I restored and then tried to mount the MDT, resulting in this:

            LustreError: 46380:0:(osd_oi.c:232:osd_fld_lookup()) ASSERTION( ss != ((void *)0) ) failed: 
            LustreError: 46380:0:(osd_oi.c:232:osd_fld_lookup()) LBUG
            Kernel panic - not syncing: LBUG
            Pid: 46380, comm: mount.lustre Tainted: P           ---------------    2.6.32-431.20.3.el6_lustre.g5a7c614.x86_64 #1
            Call Trace:
             [<ffffffff8152859c>] ? panic+0xa7/0x16f
             [<ffffffffa0665eeb>] ? lbug_with_loc+0x9b/0xb0 [libcfs]
             [<ffffffffa0e9c78f>] ? osd_fld_lookup+0xcf/0xd0 [osd_zfs]
             [<ffffffffa0e9c88e>] ? fid_is_on_ost+0xfe/0x310 [osd_zfs]
             [<ffffffffa0392869>] ? dbuf_rele_and_unlock+0x169/0x1e0 [zfs]
             [<ffffffffa0e9cbac>] ? osd_get_name_n_idx+0x4c/0xe80 [osd_zfs]
             [<ffffffffa03b3f53>] ? dsl_dataset_block_freeable+0x43/0x60 [zfs]
             [<ffffffffa03a8d93>] ? dmu_tx_hold_zap+0x1a3/0x200 [zfs]
             [<ffffffffa0e9dd1a>] ? osd_convert_root_to_new_seq+0x33a/0x670 [osd_zfs]
             [<ffffffffa0e8c4b8>] ? osd_mount+0xbc8/0xf60 [osd_zfs]
             [<ffffffffa0e8ed66>] ? osd_device_alloc+0x2a6/0x3b0 [osd_zfs]
             [<ffffffffa07ebf0f>] ? obd_setup+0x1bf/0x290 [obdclass]
             [<ffffffffa07ec1e8>] ? class_setup+0x208/0x870 [obdclass]
             [<ffffffffa07f438c>] ? class_process_config+0xc5c/0x1ac0 [obdclass]
             [<ffffffffa07f94b5>] ? lustre_cfg_new+0x4f5/0x6f0 [obdclass]
             [<ffffffffa07f9808>] ? do_lcfg+0x158/0x450 [obdclass]
             [<ffffffffa07f9b94>] ? lustre_start_simple+0x94/0x200 [obdclass]
             [<ffffffffa0832df1>] ? server_fill_super+0x1061/0x1720 [obdclass]
             [<ffffffffa07ff6b8>] ? lustre_fill_super+0x1d8/0x550 [obdclass]
             [<ffffffffa07ff4e0>] ? lustre_fill_super+0x0/0x550 [obdclass]
             [<ffffffff8118c01f>] ? get_sb_nodev+0x5f/0xa0
             [<ffffffffa07f72b5>] ? lustre_get_sb+0x25/0x30 [obdclass]
             [<ffffffff8118b67b>] ? vfs_kern_mount+0x7b/0x1b0
             [<ffffffff8118b822>] ? do_kern_mount+0x52/0x130
             [<ffffffff8119e422>] ? vfs_ioctl+0x22/0xa0
             [<ffffffff811ad1fb>] ? do_mount+0x2fb/0x930
             [<ffffffff811ad8c0>] ? sys_mount+0x90/0xe0
             [<ffffffff8100b072>] ? system_call_fastpath+0x16/0x1b
            

            I had fully and cleanly unmounted everything prior to creating the backup; I then reformatted the ZFS filesystem.

            Update:
            This happens when seq is FID_SEQ_ROOT.

            utopiabound Nathaniel Clark added a comment - - edited

            Andreas,
            It is a function of the directory itself. The client can be unmounted and then remounted without any change, and the MDT can be unmounted, exported, imported, and remounted without any change. It feels like it might be an issue in ZFS itself, but I don't have a smoking gun.
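
            For reference, a minimal sketch of the remount cycle described above (device, pool, and mount-point names are illustrative only):

                # Client side: unmount and remount; no change.
                umount /mnt/lustre
                mount -t lustre mgs@tcp:/lustre /mnt/lustre

                # MDS side: unmount the MDT, export/import the pool,
                # and remount; still no change.
                umount /mnt/mds1
                zpool export lustre-mdt1
                zpool import lustre-mdt1
                mount -t lustre lustre-mdt1/mdt1 /mnt/mds1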


            adilger Andreas Dilger added a comment -

            Nathaniel, does this problem disappear if the client is unmounted/remounted, or is it a function of the directory itself? Is it possible that there are hash collisions in the name entries that are not being handled properly?

            utopiabound Nathaniel Clark added a comment -

            Status of current testing:
            The issue appears to be in the iteration of the directory. All the files are correctly created on the MDT and are visible (as far as I can tell by mounting the MDT as ZFS), but iterating across that directory using zap_cursor_advance() and zap_cursor_retrieve() somehow misses that one file. No errors are returned when creating the object or when adding the object to the directory; the object is just missing during iteration.
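
            For reference, a rough sketch of how the directory's on-disk ZAP can be inspected outside of Lustre (pool/dataset names are illustrative, and <dnode> stands for the directory's object number, which has to be located first, e.g. by walking zdb output from the dataset root):

                # With Lustre stopped on the MDS, import the pool read-only
                # and dump the suspect directory object, including its ZAP
                # name entries, directly from disk.
                zpool import -o readonly=on lustre-mdt1
                zdb -dddd lustre-mdt1/mdt1 <dnode>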
            utopiabound Nathaniel Clark added a comment - - edited

            The file actually does exist in the MDT's zfs filesystem, and can be accessed directly from the client, but ls does not show it in the directory. This reminds me of a "recent" bug, but I'm having trouble locating it.
            Update:
            Similar bugs are LU-5475 and LU-5254
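
            A minimal illustration of the symptom, using the path and file name from the original failure as an example (the client mount point /mnt/lustre is assumed):

                # The entry is missing from readdir output on the client...
                ls /mnt/lustre/d0.lustre-rsync-test/d8/d08/d083 | grep a3   # no output
                # ...but a direct lookup by name still succeeds.
                stat /mnt/lustre/d0.lustre-rsync-test/d8/d08/d083/a3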


            jlevi Jodi Levi (Inactive) added a comment -

            Nathaniel,
            Is the solution to fix this clear, or do we need to involve anyone else?
            utopiabound Nathaniel Clark added a comment - - edited

            Finally reproduced:

            The file reported as missing from lustre but present in the rsync'd dir is in fact missing from lustre. Remounting the client and then the MDT did not cause the file to reappear, and there is no log of the file being deleted (see the sketch after the listing below).

            This behaviour can be reproduced by just running:

                for i in 1 2 3 4 5 6 7 8 9; do
            	mkdir $DIR/$tdir/d$i
            	for j in 1 2 3 4 5 6 7 8 9; do
            	    mkdir $DIR/$tdir/d$i/d$i$j
            	    createmany -o $DIR/$tdir/d$i/d$i$j/a 10 \
            		> /dev/null
            	    mv $DIR/$tdir/d$i/d$i$j $DIR/$tdir/d$i/d0$i$j
            	    createmany -o $DIR/$tdir/d$i/d0$i$j/b 10 \
            	            > /dev/null
            	    mv $DIR/$tdir/d$i/d0$i$j/a0 $DIR/$tdir/d$i/d0$i$j/c0
            	done
            	mv $DIR/$tdir/d$i $DIR/$tdir/d0$i
                done
            

            and comparing the created files to what would be expected.

                 for i in 1 2 3 4 5 6 7 8 9; do
            	for j in 1 2 3 4 5 6 7 8 9; do
            	    for k in 1 2 3 4 5 6 7 8 9; do
            		echo $D/d0$i/d0$i$j/a$k >> $F
            	    done
            	    for k in 0 1 2 3 4 5 6 7 8 9; do
            		echo $D/d0$i/d0$i$j/b$k >> $F
            	    done
            	    echo $D/d0$i/d0$i$j/c0 >> $F
            	done
                done
            
                find $DIR/$tdir -type f | sort | diff $F - || error "Differences Found"
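
            One way to confirm that no unlink was ever recorded for the missing file is to dump the MDT changelog and look for UNLNK records (a sketch; it assumes a changelog user such as cl13 is still registered so that records have been retained, and "a3" is just the file name from the original failure):

                # Dump all changelog records from the MDT and look for an
                # unlink of the missing file.
                lfs changelog lustre-MDT0000 | grep -E 'UNLNK.*a3'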
            

            utopiabound Nathaniel Clark added a comment -

            I've been running this on my local test cluster (tests 1-8, repeatedly) and have not had the failure happen locally, so I have not been able to debug the system in the failed state. This failure has only happened 4 times out of 121 test runs (zfs on master) in the last week (about 3%).
            yujian Jian Yu added a comment -

            While testing patch http://review.whamcloud.com/12012 on Lustre b2_5 branch with FSTYPE=zfs, the same failure occurred:
            https://testing.hpdd.intel.com/test_sets/983a9d84-431e-11e4-bf48-5254006e85c2


            People

              Assignee: utopiabound Nathaniel Clark
              Reporter: keith Keith Mannthey (Inactive)
              Votes: 0
              Watchers: 11
