
lustre-rsync-test test_8: @@@@@@ FAIL: Failure in replication; differences found.

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Critical
    • Fix Version/s: Lustre 2.7.0, Lustre 2.5.4
    • Affects Version/s: Lustre 2.6.0, Lustre 2.5.1, Lustre 2.7.0, Lustre 2.5.3
    • a patch pushed to autotest
    • 3
    • 9026

    Description

      On ZFS, an error was seen with lustre-rsync-test test 8. The logs are here:
      https://maloo.whamcloud.com/test_sets/e3492890-e901-11e2-ae91-52540035b04c

      I don't really know how to read lrsync_log.

      The test error report is fairly basic:

      == lustre-rsync-test test 8: Replicate multiple file/directory moves == 16:00:59 (1373410859)
      CMD: wtm-10vm7 lctl --device lustre-MDT0000 changelog_register -n
      lustre-MDT0000: Registered changelog user cl13
      CMD: wtm-10vm7 lctl get_param -n mdd.lustre-MDT0000.changelog_users
      Lustre filesystem: lustre
      MDT device: lustre-MDT0000
      Source: /mnt/lustre
      Target: /tmp/target
      Statuslog: /tmp/lustre_rsync.log
      Changelog registration: cl13
      Starting changelog record: 0
      Clear changelog after use: no
      Errors: 0
      lustre_rsync took 107 seconds
      Changelog records consumed: 1881
      Only in /tmp/target/d0.lustre-rsync-test/d8/d08/d083: a3
       lustre-rsync-test test_8: @@@@@@ FAIL: Failure in replication; differences found. 
        Trace dump:
        = /usr/lib64/lustre/tests/test-framework.sh:4066:error_noexit()
        = /usr/lib64/lustre/tests/test-framework.sh:4093:error()
        .....
      

      Out of the last 100 runs only 1 error was reported, so this could be related to the base patch or it could simply be a rare failure.
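
      For reference, the parameters shown in the lustre_rsync output above correspond roughly to a manual invocation like the following (a sketch only; the exact options the test script passes may differ):

      lustre_rsync --source=/mnt/lustre --target=/tmp/target \
                   --mdt=lustre-MDT0000 --user=cl13 \
                   --statuslog=/tmp/lustre_rsync.log --verbose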

      Attachments

        Issue Links

          Activity

            [LU-3573] lustre-rsync-test test_8: @@@@@@ FAIL: Failure in replication; differences found.

            utopiabound Nathaniel Clark added a comment -

            Doing the restore using the LUDOC-161 information (zfs send/recv) instead of tar works much better. It shows no change after the restore; the file is still missing from the lustre client.

            adilger Andreas Dilger added a comment -

            I don't think it is supported to do file-level backup and restore for ZFS. For ZFS you need to do "zfs send" and "zfs recv", or else the OI files will be broken, and there is currently no OI Scrub functionality for osd-zfs.

            If that isn't clearly documented in the Lustre User Manual's Backup and Restore section, then that is a defect in the manual that should be fixed.
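
            For reference, a minimal sketch of the snapshot-based approach (zfs send/recv, as described in LUDOC-161); the pool/dataset names (lustre-mdt1/mdt1), backup file, and mount point below are illustrative only:

                # On the MDS, with the MDT unmounted: snapshot the dataset
                # and stream it out to a backup file.
                zfs snapshot lustre-mdt1/mdt1@backup
                zfs send lustre-mdt1/mdt1@backup > /backup/lustre-mdt1.zsend

                # Restore by streaming the snapshot back into the pool
                # (recreating the dataset), then mount the target again.
                zfs recv -F lustre-mdt1/mdt1 < /backup/lustre-mdt1.zsend
                mount -t lustre lustre-mdt1/mdt1 /mnt/mds1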
            utopiabound Nathaniel Clark added a comment - - edited

            Not sure if this is related, but I was trying to see whether it was something in the file structures on disk vs. something under the hood in ZFS, so I followed the directions for Backup and Restore of a File-Level Backup. I restored and then tried to mount the MDT, resulting in this:

            LustreError: 46380:0:(osd_oi.c:232:osd_fld_lookup()) ASSERTION( ss != ((void *)0) ) failed: 
            LustreError: 46380:0:(osd_oi.c:232:osd_fld_lookup()) LBUG
            Kernel panic - not syncing: LBUG
            Pid: 46380, comm: mount.lustre Tainted: P           ---------------    2.6.32-431.20.3.el6_lustre.g5a7c614.x86_64 #1
            Call Trace:
             [<ffffffff8152859c>] ? panic+0xa7/0x16f
             [<ffffffffa0665eeb>] ? lbug_with_loc+0x9b/0xb0 [libcfs]
             [<ffffffffa0e9c78f>] ? osd_fld_lookup+0xcf/0xd0 [osd_zfs]
             [<ffffffffa0e9c88e>] ? fid_is_on_ost+0xfe/0x310 [osd_zfs]
             [<ffffffffa0392869>] ? dbuf_rele_and_unlock+0x169/0x1e0 [zfs]
             [<ffffffffa0e9cbac>] ? osd_get_name_n_idx+0x4c/0xe80 [osd_zfs]
             [<ffffffffa03b3f53>] ? dsl_dataset_block_freeable+0x43/0x60 [zfs]
             [<ffffffffa03a8d93>] ? dmu_tx_hold_zap+0x1a3/0x200 [zfs]
             [<ffffffffa0e9dd1a>] ? osd_convert_root_to_new_seq+0x33a/0x670 [osd_zfs]
             [<ffffffffa0e8c4b8>] ? osd_mount+0xbc8/0xf60 [osd_zfs]
             [<ffffffffa0e8ed66>] ? osd_device_alloc+0x2a6/0x3b0 [osd_zfs]
             [<ffffffffa07ebf0f>] ? obd_setup+0x1bf/0x290 [obdclass]
             [<ffffffffa07ec1e8>] ? class_setup+0x208/0x870 [obdclass]
             [<ffffffffa07f438c>] ? class_process_config+0xc5c/0x1ac0 [obdclass]
             [<ffffffffa07f94b5>] ? lustre_cfg_new+0x4f5/0x6f0 [obdclass]
             [<ffffffffa07f9808>] ? do_lcfg+0x158/0x450 [obdclass]
             [<ffffffffa07f9b94>] ? lustre_start_simple+0x94/0x200 [obdclass]
             [<ffffffffa0832df1>] ? server_fill_super+0x1061/0x1720 [obdclass]
             [<ffffffffa07ff6b8>] ? lustre_fill_super+0x1d8/0x550 [obdclass]
             [<ffffffffa07ff4e0>] ? lustre_fill_super+0x0/0x550 [obdclass]
             [<ffffffff8118c01f>] ? get_sb_nodev+0x5f/0xa0
             [<ffffffffa07f72b5>] ? lustre_get_sb+0x25/0x30 [obdclass]
             [<ffffffff8118b67b>] ? vfs_kern_mount+0x7b/0x1b0
             [<ffffffff8118b822>] ? do_kern_mount+0x52/0x130
             [<ffffffff8119e422>] ? vfs_ioctl+0x22/0xa0
             [<ffffffff811ad1fb>] ? do_mount+0x2fb/0x930
             [<ffffffff811ad8c0>] ? sys_mount+0x90/0xe0
             [<ffffffff8100b072>] ? system_call_fastpath+0x16/0x1b
            

            I had fully and cleanly unmounted everything prior to creating the backup; I then reformatted the ZFS filesystem.

            Update:
            This happens when seq is FID_SEQ_ROOT.

            utopiabound Nathaniel Clark added a comment - - edited

            Andreas,
            It is a function of the directory itself. The client can be unmounted and then remounted without any change, and the MDT can be unmounted, exported, imported, and remounted without any change. It feels like it might be an issue in ZFS itself, but I don't have a smoking gun.
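
            For reference, a minimal sketch of the remount cycle described above (device, pool, and mount-point names are illustrative only):

                # Client side: unmount and remount; no change.
                umount /mnt/lustre
                mount -t lustre mgs@tcp:/lustre /mnt/lustre

                # MDS side: unmount the MDT, export/import the pool,
                # and remount; still no change.
                umount /mnt/mds1
                zpool export lustre-mdt1
                zpool import lustre-mdt1
                mount -t lustre lustre-mdt1/mdt1 /mnt/mds1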


            adilger Andreas Dilger added a comment -

            Nathaniel, does this problem disappear if the client is unmounted/remounted, or is it a function of the directory itself? Is it possible that there are hash collisions in the name entries that are not being handled properly?

            utopiabound Nathaniel Clark added a comment -

            Status of current testing:
            The issue appears to be in the iteration of the directory. All the files are correctly created on the MDT and are visible (as far as I can tell by mounting the MDT as ZFS), but iterating across that directory using zap_cursor_advance() and zap_cursor_retrieve() somehow misses that one file. No errors are returned when creating the object or when adding the object to the directory; the object is just missing during iteration.
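
            For reference, a rough sketch of how the directory's on-disk ZAP can be inspected outside of Lustre (pool/dataset names are illustrative, and <dnode> stands for the directory's object number, which has to be located first, e.g. by walking zdb output from the dataset root):

                # With Lustre stopped on the MDS, import the pool read-only
                # and dump the suspect directory object, including its ZAP
                # name entries, directly from disk.
                zpool import -o readonly=on lustre-mdt1
                zdb -dddd lustre-mdt1/mdt1 <dnode>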
            utopiabound Nathaniel Clark added a comment - - edited

            The file actually does exist in the MDT's zfs filesystem, and can be accessed directly from the client, but ls does not show it in the directory. This reminds me of a "recent" bug, but I'm having trouble locating it.
            Update:
            Similar bugs are LU-5475 and LU-5254
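
            A minimal illustration of the symptom, using the path and file name from the original failure as an example (the client mount point /mnt/lustre is assumed):

                # The entry is missing from readdir output on the client...
                ls /mnt/lustre/d0.lustre-rsync-test/d8/d08/d083 | grep a3   # no output
                # ...but a direct lookup by name still succeeds.
                stat /mnt/lustre/d0.lustre-rsync-test/d8/d08/d083/a3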


            jlevi Jodi Levi (Inactive) added a comment -

            Nathaniel,
            Is the solution to fix this clear, or do we need to involve anyone else?
            utopiabound Nathaniel Clark added a comment - - edited

            Finally reproduced:

            The file reported as missing from lustre but present in the rsync'd dir is in fact missing from lustre. Remounting the client and then the MDT did not cause the file to reappear, and there is no log of the file being deleted (see the sketch after the listing below).

            This behaviour can be reproduced by just running:

                for i in 1 2 3 4 5 6 7 8 9; do
            	mkdir $DIR/$tdir/d$i
            	for j in 1 2 3 4 5 6 7 8 9; do
            	    mkdir $DIR/$tdir/d$i/d$i$j
            	    createmany -o $DIR/$tdir/d$i/d$i$j/a 10 \
            		> /dev/null
            	    mv $DIR/$tdir/d$i/d$i$j $DIR/$tdir/d$i/d0$i$j
            	    createmany -o $DIR/$tdir/d$i/d0$i$j/b 10 \
            	            > /dev/null
            	    mv $DIR/$tdir/d$i/d0$i$j/a0 $DIR/$tdir/d$i/d0$i$j/c0
            	done
            	mv $DIR/$tdir/d$i $DIR/$tdir/d0$i
                done
            

            and comparing the created files to what would be expected.

                 for i in 1 2 3 4 5 6 7 8 9; do
            	for j in 1 2 3 4 5 6 7 8 9; do
            	    for k in 1 2 3 4 5 6 7 8 9; do
            		echo $D/d0$i/d0$i$j/a$k >> $F
            	    done
            	    for k in 0 1 2 3 4 5 6 7 8 9; do
            		echo $D/d0$i/d0$i$j/b$k >> $F
            	    done
            	    echo $D/d0$i/d0$i$j/c0 >> $F
            	done
                done
            
                find $DIR/$tdir -type f | sort | diff $F - || error "Differences Found"
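
            One way to confirm that no unlink was ever recorded for the missing file is to dump the MDT changelog and look for UNLNK records (a sketch; it assumes a changelog user such as cl13 is still registered so that records have been retained, and "a3" is just the file name from the original failure):

                # Dump all changelog records from the MDT and look for an
                # unlink of the missing file.
                lfs changelog lustre-MDT0000 | grep -E 'UNLNK.*a3'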
            

            utopiabound Nathaniel Clark added a comment -

            I've been running this on my local test cluster (tests 1-8, repeatedly) and have not had the failure happen locally, so I have not been able to debug the system in the failed state. This failure has only happened 4 times out of 121 test runs (zfs on master) in the last week (about 3%).
            yujian Jian Yu added a comment -

            While testing patch http://review.whamcloud.com/12012 on Lustre b2_5 branch with FSTYPE=zfs, the same failure occurred:
            https://testing.hpdd.intel.com/test_sets/983a9d84-431e-11e4-bf48-5254006e85c2


            People

              Assignee: utopiabound Nathaniel Clark
              Reporter: keith Keith Mannthey (Inactive)
              Votes: 0
              Watchers: 11
