Details
- Type: Bug
- Resolution: Duplicate
- Priority: Critical
- Affects Version: Lustre 2.1.0
- Fix Version: None
- Environment: RHEL6
- Severity: 3
- Rank (Obsolete): 6592
Description
While porting the zero-copy patch to RHEL6 we found data corruption during IO while raid5/6 reconstruction was in progress. I think it should affect RHEL5 as well.
It is easy to reproduce with:
echo 32 > /sys/block/md0/md/stripe_cache_size
echo 0 > /proc/fs/lustre/obdfilter/<ost_name>/writethrough_cache_enable
echo 0 > /proc/fs/lustre/obdfilter/<ost_name>/read_cache_enable
and then fail one of the disks with
mdadm /dev/mdX --fail /dev/....
After that, verify that the data is correct.
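Taken together, the steps above can be sketched as a small script. The device name md0, the failing member disk, and the OST name are placeholders (the original report elides the exact disk), so adjust them for your setup:

```shell
#!/bin/sh
# Sketch of the reproduction steps above. MD_DEV, FAIL_DEV and OST are
# hypothetical placeholders -- substitute your own device and OST names.
MD_DEV=md0
FAIL_DEV=/dev/sdb        # member disk to fail (placeholder)
OST=lustre-OST0000       # obdfilter name of the OST (placeholder)

# Shrink the raid5/6 stripe cache so reconstruction exercises it heavily
echo 32 > /sys/block/$MD_DEV/md/stripe_cache_size

# Disable the OST read/writethrough caches so IO bypasses the page cache
echo 0 > /proc/fs/lustre/obdfilter/$OST/writethrough_cache_enable
echo 0 > /proc/fs/lustre/obdfilter/$OST/read_cache_enable

# Fail one member disk to force degraded-mode reconstruction during IO
mdadm /dev/$MD_DEV --fail $FAIL_DEV
```

With the array degraded, writing a file through Lustre and reading it back (as in the transcript below) should show mismatched checksums when the bug is present.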
[root@sjlustre1-o1 ~]# dd if=/dev/urandom of=test.1 oflag=direct bs=128k count=8
8+0 records in
8+0 records out
1048576 bytes (1.0 MB) copied, 0.157819 seconds, 6.6 MB/s
[root@sjlustre1-o1 ~]# md5sum test.1
4ec4d0b67a2b3341795706605e0b0a28 test.1
[root@sjlustre1-o1 ~]# md5sum test.1 > test.1.md5
[root@sjlustre1-o1 ~]# dd if=test.1 iflag=direct of=/lustre/stry/test.1 oflag=direct bs=128k
8+0 records in
8+0 records out
1048576 bytes (1.0 MB) copied, 0.319458 seconds, 3.3 MB/s
[root@sjlustre1-o1 ~]# dd if=/lustre/stry/test.1 iflag=direct of=test.2 oflag=direct bs=128k
8+0 records in
8+0 records out
1048576 bytes (1.0 MB) copied, 0.114691 seconds, 9.1 MB/s
[root@sjlustre1-o1 ~]# md5sum test.1 test.2
4ec4d0b67a2b3341795706605e0b0a28 test.1
426c976b75fa3ce5b5ae22b5195f85fd test.2
After further work, the problem was identified as two bugs in the zero-copy patch:
1) raid5 sets the UPTODATE flag on a stripe that holds stale pointers from DIO, then tries to copy data from those pointers during the READ phase.
2) an issue restoring pages from the stripe cache.
Please verify whether this is an issue in a RHEL5 environment (we do not have one at the moment).
llmount.sh uses loopback devices (even with OST_MOUNT_OPTS/MDS_MOUNT_OPTS cleared, as suggested by Alexey). These devices add a layer of indirection that may obscure the problem.
As an alternative to llmount.sh, I am creating the filesystem manually. I have used the following steps on RHEL5 and on my RHEL6 system, but have been unable to reproduce the bug reliably. As you suggest, I am working on a method to predictably perform zero-copy writes.
Eric, can you run these commands on your RHEL6 environment to confirm that these instructions reproduce the bug there?
Create a MD device.
Create MDS/MDT and mount.
Create OST on the MD device and mount on OSS.
Mount the Lustre fs.
# mount -t lustre 10.0.0.1@tcp0:/temp /mnt/lustre
...
# mount
/dev/sda1 on / type ext3 (rw)
proc on /proc type proc (rw)
sysfs on /sys type sysfs (rw)
devpts on /dev/pts type devpts (rw,gid=5,mode=620)
tmpfs on /dev/shm type tmpfs (rw)
none on /proc/sys/fs/binfmt_misc type binfmt_misc (rw)
sunrpc on /var/lib/nfs/rpc_pipefs type rpc_pipefs (rw)
/dev/sdb11 on /mnt/mdt type lustre (rw)
/dev/md0 on /mnt/ost1 type lustre (rw)
10.0.0.1@tcp0:/temp on /mnt/lustre type lustre (rw)
Reduce the stripe cache size and disable the OST caches, as in the description.
Copy a file onto Lustre, fail a drive, and copy the file back off.
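The setup steps above can be sketched as follows. The NID 10.0.0.1@tcp0, fsname temp, and the /dev/sdb11, /dev/md0, /mnt/mdt, /mnt/ost1 paths are taken from the mount output in this comment; the MD member disks and the mkfs.lustre option set are hypothetical, not the exact commands used:

```shell
#!/bin/sh
# Sketch of the manual setup above; member disks are placeholders.

# 1) Create an MD raid5 device (placeholder member partitions)
mdadm --create /dev/md0 --level=5 --raid-devices=3 \
    /dev/sdc1 /dev/sdd1 /dev/sde1

# 2) Create the combined MGS/MDT and mount it
mkfs.lustre --fsname=temp --mgs --mdt --index=0 /dev/sdb11
mkdir -p /mnt/mdt
mount -t lustre /dev/sdb11 /mnt/mdt

# 3) Create the OST on the MD device and mount it on the OSS
mkfs.lustre --fsname=temp --ost --index=0 --mgsnode=10.0.0.1@tcp0 /dev/md0
mkdir -p /mnt/ost1
mount -t lustre /dev/md0 /mnt/ost1

# 4) Mount the Lustre filesystem on the client
mkdir -p /mnt/lustre
mount -t lustre 10.0.0.1@tcp0:/temp /mnt/lustre
```

From here, apply the stripe_cache_size and cache tunables from the description, copy a file in, fail a member disk with mdadm --fail, and compare checksums after copying it back out.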