LU-3573

lustre-rsync-test test_8: @@@@@@ FAIL: Failure in replication; differences found.

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Critical
    • Fix Version/s: Lustre 2.7.0, Lustre 2.5.4
    • Affects Version/s: Lustre 2.6.0, Lustre 2.5.1, Lustre 2.7.0, Lustre 2.5.3
    • Environment: a patch pushed to autotest
    • Severity: 3
    • 9026

    Description

      On ZFS, an error was seen with lustre-rsync-test test_8. The logs are here:
      https://maloo.whamcloud.com/test_sets/e3492890-e901-11e2-ae91-52540035b04c

      I don't really know how to read lrsync_log.

      The test error report is very basic:

      == lustre-rsync-test test 8: Replicate multiple file/directory moves == 16:00:59 (1373410859)
      CMD: wtm-10vm7 lctl --device lustre-MDT0000 changelog_register -n
      lustre-MDT0000: Registered changelog user cl13
      CMD: wtm-10vm7 lctl get_param -n mdd.lustre-MDT0000.changelog_users
      Lustre filesystem: lustre
      MDT device: lustre-MDT0000
      Source: /mnt/lustre
      Target: /tmp/target
      Statuslog: /tmp/lustre_rsync.log
      Changelog registration: cl13
      Starting changelog record: 0
      Clear changelog after use: no
      Errors: 0
      lustre_rsync took 107 seconds
      Changelog records consumed: 1881
      Only in /tmp/target/d0.lustre-rsync-test/d8/d08/d083: a3
       lustre-rsync-test test_8: @@@@@@ FAIL: Failure in replication; differences found. 
        Trace dump:
        = /usr/lib64/lustre/tests/test-framework.sh:4066:error_noexit()
        = /usr/lib64/lustre/tests/test-framework.sh:4093:error()
        .....
      

      Out of the last 100 runs, this error was reported once, so it could be related to the base patch or it could be a rare intermittent failure.
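
      For context, the replication and comparison steps behind the log above correspond roughly to the following (values taken from the output above; a sketch of the test flow, not the exact commands in lustre-rsync-test.sh):

      # register a changelog user on the MDS (the test does this via lctl)
      lctl --device lustre-MDT0000 changelog_register -n   # prints e.g. cl13

      # replicate the source tree, then compare the two trees
      lustre_rsync --source=/mnt/lustre --target=/tmp/target \
                   --mdt=lustre-MDT0000 --user=cl13 \
                   --statuslog=/tmp/lustre_rsync.log
      diff -r /mnt/lustre /tmp/target   # the test fails when this reports differences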

      Activity

            bogl Bob Glossman (Inactive) added a comment - in b2_5: http://review.whamcloud.com/12649
            pjones Peter Jones added a comment -

            Fix landed for 2.7

            utopiabound Nathaniel Clark added a comment - http://review.whamcloud.com/12582
            yujian Jian Yu added a comment -

            One more instance on Lustre b2_5 branch with FSTYPE=zfs: https://testing.hpdd.intel.com/test_sets/5310f46c-61fd-11e4-bd1f-5254006e85c2


            utopiabound Nathaniel Clark added a comment -

            The inode is not 0; the inode is tied to the file, and this issue is with the name in that directory.
            I can delete the file and create a new one, and it will still be hidden. It must be in how ZFS stores the name; I'm trying to debug that now.
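
            A quick way to confirm that split behaviour (lookup by name works while readdir omits the name) with standard tools; d074/a9 is the path from the transcript further down:

            stat d074/a9            # expected to succeed, since rename-by-name works in the transcript
            ls -i d074 | grep a9    # prints nothing: readdir never returns the name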

            adilger Andreas Dilger added a comment -

            Does ZFS have whiteout entries, or is there some problem with the hashing that prevents the directory entry from appearing? I know that "ls" will not show entries that have inode == 0, but the inode number should be independent of the filename, so that wouldn't behave in this manner. Maybe there is a ZFS "hidden" flag that is not being initialized correctly, and in some cases this flag is set? Does anything appear with "strace" or with "zdb"?
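
            For the zdb route, one can dump the directory's ZAP directly from the backing dataset; the dataset name and object number below are placeholders, not values from this report:

            # dump a single object, including its ZAP entries and their hash values
            zdb -dddd lustre-mdt1/mdt1 <obj>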

            utopiabound Nathaniel Clark added a comment -

            The file itself isn't hidden; the name is "cloaked".
            In this example d074/a9 has been "lost"; notice that a9 never appears to ls:

            [root@lubuilder d074]# ls
            a1  a2  a3  a4  a5  a6  a7  a8  b0  b1  b2  b3  b4  b5  b6  b7  b8  b9  c0
            [root@lubuilder d074]# mv a9 a9a
            [root@lubuilder d074]# ls
            a1  a2  a3  a4  a5  a6  a7  a8  a9a  b0  b1  b2  b3  b4  b5  b6  b7  b8  b9  c0
            [root@lubuilder d074]# mv a9a a9
            [root@lubuilder d074]# ls
            a1  a2  a3  a4  a5  a6  a7  a8  b0  b1  b2  b3  b4  b5  b6  b7  b8  b9  c0
            [root@lubuilder d074]# mv a9 a9a
            [root@lubuilder d074]# ls
            a1  a2  a3  a4  a5  a6  a7  a8  a9a  b0  b1  b2  b3  b4  b5  b6  b7  b8  b9  c0
            [root@lubuilder d074]# mv a8 a9
            [root@lubuilder d074]# ls
            a1  a2  a3  a4  a5  a6  a7  a9a  b0  b1  b2  b3  b4  b5  b6  b7  b8  b9  c0
            [root@lubuilder d074]# mv a9 a8
            [root@lubuilder d074]# ls
            a1  a2  a3  a4  a5  a6  a7  a8  a9a  b0  b1  b2  b3  b4  b5  b6  b7  b8  b9  c0
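
            Following up on the "strace" suggestion above, one way to rule out ls-side filtering is to check whether the name ever comes back from the kernel's readdir; the path is from the transcript above:

            strace -v -e trace=getdents,getdents64 -o /tmp/ls.trace ls d074
            grep -c '"a9"' /tmp/ls.trace   # 0 means readdir never returned the name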
            
            yujian Jian Yu added a comment - One more instance on master branch: https://testing.hpdd.intel.com/test_sets/a2161baa-5bea-11e4-a35f-5254006e85c2

            utopiabound Nathaniel Clark added a comment -

            Doing the restore using the LUDOC-161 information (zfs send/recv) instead of tar works much better. It shows no change after the restore; the file is still missing from the Lustre client.

            adilger Andreas Dilger added a comment -

            I don't think file-level backup and restore is supported for ZFS. For ZFS you need to do "zfs send" and "zfs recv", or else the OI files will be broken, and there is currently no OI Scrub functionality for osd-zfs.

            If that isn't clearly documented in the Lustre User Manual's Backup and Restore section, then that is a defect in the manual that should be fixed.
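
            For reference, the zfs-level approach from LUDOC-161 is roughly the following; pool/dataset names and the backup path are placeholders, not the exact commands from that document:

            # back up the MDT dataset with a snapshot and a full send stream
            zfs snapshot lustre-mdt1/mdt1@backup
            zfs send lustre-mdt1/mdt1@backup > /backup/mdt1.zsnap

            # restore the stream into the (re)created dataset
            zfs receive -F lustre-mdt1/mdt1 < /backup/mdt1.zsnap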
            utopiabound Nathaniel Clark added a comment (edited) -

            Not sure if this is related, but I was trying to see whether it was something in the file structures on disk vs. something under the hood in ZFS, so I followed the directions for Backup and Restore of a File-Level Backup. I restored and then tried to mount the MDT, resulting in this:

            LustreError: 46380:0:(osd_oi.c:232:osd_fld_lookup()) ASSERTION( ss != ((void *)0) ) failed: 
            LustreError: 46380:0:(osd_oi.c:232:osd_fld_lookup()) LBUG
            Kernel panic - not syncing: LBUG
            Pid: 46380, comm: mount.lustre Tainted: P           ---------------    2.6.32-431.20.3.el6_lustre.g5a7c614.x86_64 #1
            Call Trace:
             [<ffffffff8152859c>] ? panic+0xa7/0x16f
             [<ffffffffa0665eeb>] ? lbug_with_loc+0x9b/0xb0 [libcfs]
             [<ffffffffa0e9c78f>] ? osd_fld_lookup+0xcf/0xd0 [osd_zfs]
             [<ffffffffa0e9c88e>] ? fid_is_on_ost+0xfe/0x310 [osd_zfs]
             [<ffffffffa0392869>] ? dbuf_rele_and_unlock+0x169/0x1e0 [zfs]
             [<ffffffffa0e9cbac>] ? osd_get_name_n_idx+0x4c/0xe80 [osd_zfs]
             [<ffffffffa03b3f53>] ? dsl_dataset_block_freeable+0x43/0x60 [zfs]
             [<ffffffffa03a8d93>] ? dmu_tx_hold_zap+0x1a3/0x200 [zfs]
             [<ffffffffa0e9dd1a>] ? osd_convert_root_to_new_seq+0x33a/0x670 [osd_zfs]
             [<ffffffffa0e8c4b8>] ? osd_mount+0xbc8/0xf60 [osd_zfs]
             [<ffffffffa0e8ed66>] ? osd_device_alloc+0x2a6/0x3b0 [osd_zfs]
             [<ffffffffa07ebf0f>] ? obd_setup+0x1bf/0x290 [obdclass]
             [<ffffffffa07ec1e8>] ? class_setup+0x208/0x870 [obdclass]
             [<ffffffffa07f438c>] ? class_process_config+0xc5c/0x1ac0 [obdclass]
             [<ffffffffa07f94b5>] ? lustre_cfg_new+0x4f5/0x6f0 [obdclass]
             [<ffffffffa07f9808>] ? do_lcfg+0x158/0x450 [obdclass]
             [<ffffffffa07f9b94>] ? lustre_start_simple+0x94/0x200 [obdclass]
             [<ffffffffa0832df1>] ? server_fill_super+0x1061/0x1720 [obdclass]
             [<ffffffffa07ff6b8>] ? lustre_fill_super+0x1d8/0x550 [obdclass]
             [<ffffffffa07ff4e0>] ? lustre_fill_super+0x0/0x550 [obdclass]
             [<ffffffff8118c01f>] ? get_sb_nodev+0x5f/0xa0
             [<ffffffffa07f72b5>] ? lustre_get_sb+0x25/0x30 [obdclass]
             [<ffffffff8118b67b>] ? vfs_kern_mount+0x7b/0x1b0
             [<ffffffff8118b822>] ? do_kern_mount+0x52/0x130
             [<ffffffff8119e422>] ? vfs_ioctl+0x22/0xa0
             [<ffffffff811ad1fb>] ? do_mount+0x2fb/0x930
             [<ffffffff811ad8c0>] ? sys_mount+0x90/0xe0
             [<ffffffff8100b072>] ? system_call_fastpath+0x16/0x1b
            

            I had fully and cleanly unmounted everything prior to creating the backup, and I then reformatted the ZFS filesystem.

            Update:
            This happens when seq is FID_SEQ_ROOT.


            People

              Assignee: utopiabound Nathaniel Clark
              Reporter: keith Keith Mannthey (Inactive)
              Votes: 0
              Watchers: 11
