
md-raid corruptions for zero copy patch.

Details

    • Type: Bug
    • Resolution: Duplicate
    • Priority: Critical
    • Affects Version: Lustre 2.1.0
    • Fix Version: Lustre 2.1.0
    • Labels: None
    • Environment: RHEL6
    • 3
    • 6592

    Description

      While porting the zero-copy patch to RHEL6, we found data corruption during I/O while raid5/6 reconstruction was in progress. I think it should affect RHEL5 as well.

      It is easy to reproduce by:

      echo 32 > /sys/block/md0/md/stripe_cache_size
      echo 0 > /proc/fs/lustre/obdfilter/<ost_name>/writethrough_cache_enable
      echo 0 > /proc/fs/lustre/obdfilter/<ost_name>/read_cache_enable

      and then fail one of the disks with
      mdadm /dev/mdX --fail /dev/....

      Afterwards, verify that the data is correct.

      [root@sjlustre1-o1 ~]# dd if=/dev/urandom of=test.1 oflag=direct bs=128k count=8
      8+0 records in
      8+0 records out
      1048576 bytes (1.0 MB) copied, 0.157819 seconds, 6.6 MB/s
      [root@sjlustre1-o1 ~]# md5sum test.1
      4ec4d0b67a2b3341795706605e0b0a28 test.1
      [root@sjlustre1-o1 ~]# md5sum test.1 > test.1.md5
      [root@sjlustre1-o1 ~]# dd if=test.1 iflag=direct of=/lustre/stry/test.1 oflag=direct bs=128k
      8+0 records in
      8+0 records out
      1048576 bytes (1.0 MB) copied, 0.319458 seconds, 3.3 MB/s

      [root@sjlustre1-o1 ~]# dd if=/lustre/stry/test.1 iflag=direct of=test.2 oflag=direct bs=128k
      8+0 records in
      8+0 records out
      1048576 bytes (1.0 MB) copied, 0.114691 seconds, 9.1 MB/s
      [root@sjlustre1-o1 ~]# md5sum test.1 test.2
      4ec4d0b67a2b3341795706605e0b0a28 test.1
      426c976b75fa3ce5b5ae22b5195f85fd test.2
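
      The steps above can be combined into one script. This is a minimal sketch only, not the original tooling; the device, failed disk, OST name, and mount point (/dev/md0, /dev/sdb7, temp-OST0000, /lustre/stry) are example values taken from elsewhere in this ticket and must be adjusted for the local setup.

      #!/bin/bash
      # Hedged reproduction sketch for LU-468; all names below are placeholders.
      MD=/dev/md0                  # md raid5/6 device backing the OST
      FAIL_DISK=/dev/sdb7          # member disk to mark as failed
      OST=temp-OST0000             # obdfilter name of the OST
      MNT=/lustre/stry             # Lustre client mount point

      # Shrink the md stripe cache and disable the OST caches so direct I/O
      # exercises the zero-copy path.
      echo 32 > /sys/block/$(basename $MD)/md/stripe_cache_size
      echo 0 > /proc/fs/lustre/obdfilter/$OST/writethrough_cache_enable
      echo 0 > /proc/fs/lustre/obdfilter/$OST/read_cache_enable

      # Write a random 1 MB file through Lustre with O_DIRECT and record its md5.
      dd if=/dev/urandom of=/root/test.1 oflag=direct bs=128k count=8
      md5sum /root/test.1 > /root/test.1.md5
      dd if=/root/test.1 iflag=direct of=$MNT/test.1 oflag=direct bs=128k

      # Fail one member so the read back goes through parity reconstruction.
      mdadm $MD --fail $FAIL_DISK

      # Read the file back and compare checksums; a mismatch reproduces the bug.
      dd if=$MNT/test.1 iflag=direct of=/root/test.2 oflag=direct bs=128k
      md5sum /root/test.1 /root/test.2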

      After further work, the problem was identified as two bugs in the zero-copy patch:
      1) raid5 sets the UPTODATE flag on a stripe that still holds stale page pointers from direct I/O, and then tries to copy data from those pointers during the READ phase.

      2) an issue restoring pages from the stripe cache.

      Please verify whether this issue also exists in a RHEL5 environment (we don't have one at the moment).


          Activity


            nrutman Nathan Rutman added a comment -

            Xyratex: MRP-158
            pjones Peter Jones added a comment -

            As I understand it, the bug in the zero-copy patch has been fixed in the version contributed under LU-535.


            rhenwood Richard Henwood (Inactive) added a comment -

            Apologies, my previous comment contained an inaccuracy:

            I have been able to reliably generate zero-copy writes by writing after a drive has failed. However, I am still unable to observe data corruption with RHEL5.

            pjones Peter Jones added a comment -

            Dropping the priority so this is no longer a blocker. If there is any evidence that this affects the master code on either RHEL5 or RHEL6, then it can be raised in priority again.


            rhenwood Richard Henwood (Inactive) added a comment -

            UPDATE:

            There currently is no zero-copy patch for RHEL6 in the Lustre source. As a result, this bug should not be reproducible on RHEL6.

            There is a zero-copy patch for RHEL5 in the Lustre source. I have been unable to reliably generate zero-copy writes by writing after a drive has failed. However, I am still unable to observe data corruption with RHEL5.

            Because the zero copy patch is not available for RHEL6, I recommend this issue be CLOSED: Can't reproduce.
            A new Jira issue can be created for the RHEL6 zero-copy patch.


            ericm Eric Mei (Inactive) added a comment -

            Richard, I don't have the environment to test it, and I'll be away for two weeks starting tomorrow, so sorry I couldn't be more helpful. I just read your steps again, and they all seem correct to me. So please get zero-copy writes working on RHEL6 as the first step.


            rhenwood Richard Henwood (Inactive) added a comment -

            llmount.sh uses loopback devices (even with OST_MOUNT_OPTS/MDS_MOUNT_OPTS cleared, as suggested by Alexey). These devices create indirection that may obscure the problem.
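
            One way to rule out that indirection is to check what actually backs the mounted OST. A minimal check, assuming the OST is mounted at /mnt/ost1 on /dev/md0 as in the steps below:

            # confirm the OST is mounted directly on the md array, not a loop device
            mount | grep lustre
            # list loop devices in use; none should be backing the OST
            losetup -a
            # show md array membership and state
            cat /proc/mdstat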

            As an alternative to llmount.sh, I am creating the filesystem manually. I have used the following steps on RHEL5 and on my RHEL6 system, and I have been unable to recreate the bug reliably. As you suggest, I am working on a method to predictably perform zero-copy writes.

            Eric, can you run these commands on your RHEL6 environment to confirm whether these instructions reproduce this bug on RHEL6?

            Create an MD device.

            mdadm --create --verbose /dev/md0 --chunk=64 --level=5 --raid-devices=3 /dev/sdb5 /dev/sdb6 /dev/sdb7
            

            Create MDS/MDT and mount.

            # mkfs.lustre --fsname=temp --mgs --mdt /dev/sdb11
            ...
            # mount -t lustre /dev/sdb11 /mnt/mdt
            

            Create OST on the MD device and mount on OSS.

            # mkfs.lustre --ost --fsname=temp --mgsnode=10.0.0.1@tcp0 /dev/md0
            ...
            # mount -t lustre /dev/md0 /mnt/ost1
            

            Mount the Lustre fs.

            # mount -t lustre 10.0.0.1@tcp0:/temp /mnt/lustre
            ...
            # mount
            /dev/sda1 on / type ext3 (rw)
            proc on /proc type proc (rw)
            sysfs on /sys type sysfs (rw)
            devpts on /dev/pts type devpts (rw,gid=5,mode=620)
            tmpfs on /dev/shm type tmpfs (rw)
            none on /proc/sys/fs/binfmt_misc type binfmt_misc (rw)
            sunrpc on /var/lib/nfs/rpc_pipefs type rpc_pipefs (rw)
            /dev/sdb11 on /mnt/mdt type lustre (rw)
            /dev/md0 on /mnt/ost1 type lustre (rw)
            10.0.0.1@tcp0:/temp on /mnt/lustre type lustre (rw)
            

            Shrink the md stripe cache and turn off the OST caches.

            # echo 32 > /sys/block/md0/md/stripe_cache_size
            # echo 0 > /proc/fs/lustre/obdfilter/temp-OST0000/writethrough_cache_enable
            # echo 0 > /proc/fs/lustre/obdfilter/temp-OST0000/read_cache_enable
            

            Copy a file onto Lustre, fail a drive, and copy the file back off.

            # dd if=/dev/urandom of=/root/test.1 oflag=direct bs=128k count=8
            8+0 records in
            8+0 records out
            1048576 bytes (1.0 MB) copied, 0.230175 seconds, 4.6 MB/s
            
            # md5sum test.1
            d02213ae420e043d42688874a93c7e1b  test.1
            
            # dd if=/root/test.1 of=/mnt/lustre/test.1 oflag=direct bs=128k
            8+0 records in
            8+0 records out
            1048576 bytes (1.0 MB) copied, 0.080452 seconds, 13.0 MB/s
            
            # mdadm /dev/md0 --fail /dev/sdb7
            mdadm: set /dev/sdb7 faulty in /dev/md0
            
            # dd if=/mnt/lustre/test.1 iflag=direct of=/root/test.2 oflag=direct bs=128k
            8+0 records in
            8+0 records out
            1048576 bytes (1.0 MB) copied, 0.0758 seconds, 13.8 MB/s
            
            # md5sum test.1 test.2
            bf4d5039cb2c7acd744d119a262bc90b  test.1
            bf4d5039cb2c7acd744d119a262bc90b  test.2
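
            Since the corruption in the original report was seen while the raid5/6 array was reconstructing, it may also be worth confirming that the array is actually degraded at the moment the file is read back. A small check, assuming the same /dev/md0 and /dev/sdb7 as above:

            # verify the member is marked faulty and the array is degraded/rebuilding
            cat /proc/mdstat
            mdadm --detail /dev/md0 | grep -E 'State|Failed|Rebuild'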
            

            ericm Eric Mei (Inactive) added a comment -

            Richard, I think you first need to figure out why no zero-copy writes happened on RHEL6, and then move on to RHEL5. I don't know your exact environment; maybe you should consult an MD expert in WC.


            rhenwood Richard Henwood (Inactive) added a comment -

            Hi Alexey, I have tried clearing OST_MOUNT_OPTS / MDS_MOUNT_OPTS as you suggest.

            STDEV1="/dev/md0" OSTDEV2="/dev/md127" OST_MOUNT_OPTS="" MDS_MOUNT_OPTS="" /usr/lib64/lustre/tests/llmount.sh
            

            No difference.


            shadow Alexey Lyashkov added a comment -

            2Peter: yes, I'm busy with a different issue that will be reported later: the OOM killer can kill OST_IO threads, which blocks the client from reconnecting until the node is rebooted.

            2Richard: it looks like you forgot to clear OST_MOUNT_OPTS / MDS_MOUNT_OPTS.


            People

              Assignee: rhenwood Richard Henwood (Inactive)
              Reporter: shadow Alexey Lyashkov
              Votes: 0
              Watchers: 11
