[LU-468] md-raid corruptions for zero copy patch. Created: 27/Jun/11 Updated: 20/Nov/12 Resolved: 26/Jul/11 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.1.0 |
| Fix Version/s: | Lustre 2.1.0 |
| Type: | Bug | Priority: | Critical |
| Reporter: | Alexey Lyashkov | Assignee: | Richard Henwood (Inactive) |
| Resolution: | Duplicate | Votes: | 0 |
| Labels: | None | ||
| Environment: |
RHEL6 |
||
| Attachments: |
|
| Severity: | 3 |
| Rank (Obsolete): | 6592 |
| Description |
|
While porting the zero copy patch to RHEL6 we found data corruption during I/O while a raid5/6 array is reconstructing. I think this should affect RHEL5 as well. It is easy to reproduce: echo 32 > /sys/block/md0/md/stripe_cache_size, fail one of the disks, and afterwards verify that the data is still correct.
[root@sjlustre1-o1 ~]# dd if=/dev/urandom of=test.1 oflag=direct bs=128k
[root@sjlustre1-o1 ~]# dd if=/lustre/stry/test.1 iflag=direct of=test.2
After further work the problem was identified as two bugs in the zcopy patch; 2) an issue restoring pages from the stripe cache. Please verify whether this is an issue in a RHEL5 environment (we don't have one at the moment).
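For reference, a condensed sketch of the reproduction described above, assuming an md RAID5/6 device at /dev/md0 backing an OST and a client mount at /mnt/lustre (device names, paths, and the failed member are placeholders based on the commands quoted later in this ticket):
# shrink the stripe cache so the reconstruction path is exercised
echo 32 > /sys/block/md0/md/stripe_cache_size
# write a test file through Lustre with direct I/O
dd if=/dev/urandom of=/root/test.1 oflag=direct bs=128k count=8
dd if=/root/test.1 of=/mnt/lustre/test.1 oflag=direct bs=128k
# fail one array member to force degraded/reconstructing reads
mdadm /dev/md0 --fail /dev/sdb7
# read the file back with direct I/O and compare checksums
dd if=/mnt/lustre/test.1 iflag=direct of=/root/test.2 oflag=direct bs=128k
md5sum /root/test.1 /root/test.2
|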
| Comments |
| Comment by Jinshan Xiong (Inactive) [ 27/Jun/11 ] |
|
Hi Shadow, can you please show me the problematic code? |
| Comment by Peter Jones [ 27/Jun/11 ] |
|
Richard is going to try and repro this issue |
| Comment by Alexey Lyashkov [ 27/Jun/11 ] |
|
These are the two patches which fix the issue for the RHEL6 port. |
| Comment by Peter Jones [ 28/Jun/11 ] |
|
Alexey, could you please upload these patches into Gerrit? Thanks, Peter |
| Comment by Richard Henwood (Inactive) [ 30/Jun/11 ] |
|
I have been looking at this issue on CentOS 5.6 with software RAID on a Sun machine. An initial attempt did not reproduce the issue; however, there are a number of factors that may be in play, so this result isn't conclusive. Work continues on reproducing it on both RHEL5 and RHEL6. I am now reserving resources to more accurately identify the scope of this issue. |
| Comment by Richard Henwood (Inactive) [ 01/Jul/11 ] |
|
Hi Alexey, I've been working on this bug today. Can you clarify which kernel you used to get the corruption, including the zero copy patch? |
| Comment by Richard Henwood (Inactive) [ 08/Jul/11 ] |
|
Alexey, can you please review the steps I'm taking (below) to verify that I'm not missing something when trying to reproduce this issue. Thanks.
1. Provision a test machine.
2. Build a RAID 5 set (taken from the RAID wiki).
3. Build a Lustre filesystem on the md0 device.
The Lustre fs is now available at /mnt/lustre/. (A sketch of the underlying commands follows.)
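A rough sketch of what those steps look like on the command line; the device names, fsname, and MGS NID are placeholders consistent with those used elsewhere in this ticket, not necessarily what was used here:
# build the RAID 5 set
mdadm --create --verbose /dev/md0 --chunk=64 --level=5 --raid-devices=3 /dev/sdb5 /dev/sdb6 /dev/sdb7
# format and mount the MDS/MDT
mkfs.lustre --fsname=temp --mgs --mdt /dev/sdb11
mount -t lustre /dev/sdb11 /mnt/mdt
# format the OST on the md device and mount it on the OSS
mkfs.lustre --ost --fsname=temp --mgsnode=10.0.0.1@tcp0 /dev/md0
mount -t lustre /dev/md0 /mnt/ost1
# mount the client
mount -t lustre 10.0.0.1@tcp0:/temp /mnt/lustre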
|
| Comment by Peter Jones [ 11/Jul/11 ] |
|
Vitaly, it seems that Alexey is unavailable to answer even the simplest question about this ticket so that we can establish the scope of the issue and whether or not it impacts RHEL5. Are you able to assist in this matter? If not, could you please advise who at Xyratex could? Thanks, Peter |
| Comment by Eric Mei (Inactive) [ 11/Jul/11 ] |
|
Richard, there are several issues in your test:
I have no idea whether RHEL5 has this problem (RHEL5 differs from RHEL6 as far as MD is concerned; I didn't check the details). I tend to think the bugs were introduced by porting the patch to RHEL6, so if in the end you can't reproduce this on RHEL5, that probably means RHEL5 is safe. |
| Comment by Eric Mei (Inactive) [ 11/Jul/11 ] |
|
Richard, you updated your previous comment; do you mean you followed the right steps but still can't reproduce it? |
| Comment by Richard Henwood (Inactive) [ 11/Jul/11 ] |
|
I've updated the reproducer above to include Eric's suggestions. I am not able to reproduce this on RHEL5. However, I'm reluctant to assert that this isn't a problem on RHEL5, since the reproducer above doesn't trigger the bug on RHEL6 either. I would appreciate further feedback on the reproducer; maybe Oleg can comment? |
| Comment by Eric Mei (Inactive) [ 11/Jul/11 ] |
|
I noticed that in your /proc/mdstat output the zcopy write count is 0, so no zero-copy writes actually happened in your test. I'm not sure why... I don't know whether zcopy in RHEL5 works the same way as in RHEL6, but one thing you can try is using bs=256K in the dd write, which generates full-stripe writes (with a 128K chunk size). I have no other ideas beyond this. If you manage to get zero-copy writes and see no data corruption, then RHEL5 is probably fine.
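A minimal sketch of that check, assuming the zero-copy patch exposes its write counter through /proc/mdstat as referenced above (the counter name and grep pattern are assumptions, and the file path on the Lustre mount is a placeholder):
# full-stripe direct-I/O write: 256K = 128K chunk size x 2 data disks
dd if=/dev/urandom of=/mnt/lustre/fullstripe.1 oflag=direct bs=256k count=16
# check whether the patched MD layer counted any zero-copy writes
grep -i zcopy /proc/mdstat
|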
| Comment by Richard Henwood (Inactive) [ 11/Jul/11 ] |
|
Thanks for the suggestions; I've tried with the 256k blocksize, and increasing the size of the file that was being shifted around. The zcopy value stayed at 0. These changes also did not reproduce the bug on my RHEL6. Can you confirm that the reproducer above reproduces on your RHEL6 testbed? |
| Comment by Eric Mei (Inactive) [ 11/Jul/11 ] |
|
Did you actually run Lustre on top of the RAID? I noticed the following line in the mount output:
/dev/loop1 on /mnt/ost1 type lustre (rw) |
| Comment by Richard Henwood (Inactive) [ 11/Jul/11 ] |
# losetup /dev/loop1
/dev/loop1: [0011]:6353 (/dev/md0)
# losetup /dev/loop2
/dev/loop2: [0011]:13978 (/dev/md127)
I'm reading this as loop devices on top of the md devices. |
| Comment by Alexey Lyashkov [ 12/Jul/11 ] |
|
To Peter: yes, I'm busy with a different issue that will be reported later.
To Richard: it looks like you forgot to clear OST_MOUNT_OPTS / MDS_MOUNT_OPTS. |
| Comment by Richard Henwood (Inactive) [ 12/Jul/11 ] |
|
Hi Alexey, I have tried clearing OST_MOUNT_OPTS / MDS_MOUNT_OPTS as you suggest:
OSTDEV1="/dev/md0"
OSTDEV2="/dev/md127"
OST_MOUNT_OPTS=""
MDS_MOUNT_OPTS=""
/usr/lib64/lustre/tests/llmount.sh
No difference. |
| Comment by Eric Mei (Inactive) [ 12/Jul/11 ] |
|
Richard, I think you first need to figure out why no zero-copy writes happened on RHEL6, then move on to RHEL5. I don't know your exact environment; maybe you should consult an MD expert in WC. |
| Comment by Richard Henwood (Inactive) [ 12/Jul/11 ] |
|
llmount.sh uses loopback devices (even with OST_MOUNT_OPTS/MDS_MOUNT_OPTS cleared as suggested by Alexey). These devices add indirection that may obscure the problem, so as an alternative to llmount.sh I'm creating the filesystem manually. I have used the following steps on RHEL5 and on my RHEL6, and have been unable to recreate the bug reliably. As you suggest, I am working on a method to predictably perform zero-copy writes. Eric, can you run these commands in your RHEL6 environment to confirm that these instructions reproduce the bug on RHEL6?
Create an MD device:
# mdadm --create --verbose /dev/md0 --chunk=64 --level=5 --raid-devices=3 /dev/sdb5 /dev/sdb6 /dev/sdb7
Create the MDS/MDT and mount it:
# mkfs.lustre --fsname=temp --mgs --mdt /dev/sdb11
...
# mount -t lustre /dev/sdb11 /mnt/mdt
Create the OST on the MD device and mount it on the OSS:
# mkfs.lustre --ost --fsname=temp --mgsnode=10.0.0.1@tcp0 /dev/md0
...
# mount -t lustre /dev/md0 /mnt/ost1
Mount the Lustre fs:
# mount -t lustre 10.0.0.1@tcp0:/temp /mnt/lustre
...
# mount
/dev/sda1 on / type ext3 (rw)
proc on /proc type proc (rw)
sysfs on /sys type sysfs (rw)
devpts on /dev/pts type devpts (rw,gid=5,mode=620)
tmpfs on /dev/shm type tmpfs (rw)
none on /proc/sys/fs/binfmt_misc type binfmt_misc (rw)
sunrpc on /var/lib/nfs/rpc_pipefs type rpc_pipefs (rw)
/dev/sdb11 on /mnt/mdt type lustre (rw)
/dev/md0 on /mnt/ost1 type lustre (rw)
10.0.0.1@tcp0:/temp on /mnt/lustre type lustre (rw)
Shrink the stripe cache and disable the OST caches:
# echo 32 > /sys/block/md0/md/stripe_cache_size
# echo 0 > /proc/fs/lustre/obdfilter/temp-OST0000/writethrough_cache_enable
# echo 0 > /proc/fs/lustre/obdfilter/temp-OST0000/read_cache_enable
Copy a file onto Lustre, fail a drive, and copy it back off:
# dd if=/dev/urandom of=/root/test.1 oflag=direct bs=128k count=8
8+0 records in
8+0 records out
1048576 bytes (1.0 MB) copied, 0.230175 seconds, 4.6 MB/s
# md5sum test.1
d02213ae420e043d42688874a93c7e1b  test.1
# dd if=/root/test.1 of=/mnt/lustre/test.1 oflag=direct bs=128k
8+0 records in
8+0 records out
1048576 bytes (1.0 MB) copied, 0.080452 seconds, 13.0 MB/s
# mdadm /dev/md0 --fail /dev/sdb7
mdadm: set /dev/sdb7 faulty in /dev/md0
# dd if=/mnt/lustre/test.1 iflag=direct of=/root/test.2 oflag=direct bs=128k
8+0 records in
8+0 records out
1048576 bytes (1.0 MB) copied, 0.0758 seconds, 13.8 MB/s
# md5sum test.1 test.2
bf4d5039cb2c7acd744d119a262bc90b  test.1
bf4d5039cb2c7acd744d119a262bc90b  test.2
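The steps above could also be wrapped in a small script to reduce manual error between runs; this is only a sketch using the device, OST name, and mount point from this comment (adjust for the actual test setup, and note the failed member must be re-added and resynced before rerunning):
#!/bin/sh
# Sketch: automate the write / fail-drive / read-back / checksum comparison above.
MDNAME=md0
FAIL_DEV=/dev/sdb7
MNT=/mnt/lustre
OST=temp-OST0000

echo 32 > /sys/block/$MDNAME/md/stripe_cache_size
echo 0 > /proc/fs/lustre/obdfilter/$OST/writethrough_cache_enable
echo 0 > /proc/fs/lustre/obdfilter/$OST/read_cache_enable

dd if=/dev/urandom of=/root/test.1 oflag=direct bs=128k count=8
dd if=/root/test.1 of=$MNT/test.1 oflag=direct bs=128k

mdadm /dev/$MDNAME --fail $FAIL_DEV

dd if=$MNT/test.1 iflag=direct of=/root/test.2 oflag=direct bs=128k

# Differing checksums indicate the degraded read returned corrupted data.
md5sum /root/test.1 /root/test.2
|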
| Comment by Eric Mei (Inactive) [ 12/Jul/11 ] |
|
Richard, I don't have the environment to test it, and I'll be away for two weeks from tomorrow, so I'm sorry I couldn't be more helpful... I just read your steps again and they all seem correct to me. So please get zero-copy writes working on RHEL6 as the first step. |
| Comment by Richard Henwood (Inactive) [ 13/Jul/11 ] |
|
UPDATE: There currently is no zero-copy patch for RHEL6 in the Lustre source. As a result, this bug should not be reproducible on RHEL6. There is a zero-copy patch for RHEL5 in the Lustre source. I have been unable to reliably generate zero-copy writes by writing after a drive has failed. However, I am still unable to observe data corruption with RHEL5. Because the zero copy patch is not available for RHEL6, I recommend this issue be CLOSED: Can't reproduce. |
| Comment by Peter Jones [ 13/Jul/11 ] |
|
Dropping priority so this is no longer a blocker. If there is any evidence that this affects the master code on either RHEL5 or RHEL6, then it can be raised in priority again. |
| Comment by Richard Henwood (Inactive) [ 13/Jul/11 ] |
|
Apologies, my previous comment contained an inaccuracy: I have been able to reliably generate zero-copy writes by writing after a drive has failed. However, I am still unable to observe data corruption with RHEL5. |
| Comment by Peter Jones [ 26/Jul/11 ] |
|
As I understand it, the bug in the zero copy patch has been fixed in the version contributed under LU-535. |
| Comment by Nathan Rutman [ 20/Nov/12 ] |
|
Xyratex: MRP-158 |