[LU-468] md-raid corruptions for zero copy patch. Created: 27/Jun/11  Updated: 20/Nov/12  Resolved: 26/Jul/11

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.1.0
Fix Version/s: Lustre 2.1.0

Type: Bug Priority: Critical
Reporter: Alexey Lyashkov Assignee: Richard Henwood (Inactive)
Resolution: Duplicate Votes: 0
Labels: None
Environment:

RHEL6


Attachments: Text File 01.fix_uptodate_flag.patch     Text File 02.switch_page.patch    
Severity: 3
Rank (Obsolete): 6592

 Description   

While porting the zero copy patch to RHEL6 we found data corruption during IO while a raid5/6 array is reconstructing. I think it may affect RHEL5 as well.

It is easy to replicate:

echo 32 > /sys/block/md0/md/stripe_cache_size
echo 0 > /proc/fs/lustre/obdfilter/<ost_name>/writethrough_cache_enable
echo 0 > /proc/fs/lustre/obdfilter/<ost_name>/read_cache_enable

then fail one of the disks with
mdadm /dev/mdX --fail /dev/....

and afterwards verify that the data is still correct.

[root@sjlustre1-o1 ~]# dd if=/dev/urandom of=test.1 oflag=direct bs=128k count=8
8+0 records in
8+0 records out
1048576 bytes (1.0 MB) copied, 0.157819 seconds, 6.6 MB/s
[root@sjlustre1-o1 ~]# md5sum test.1
4ec4d0b67a2b3341795706605e0b0a28 test.1
[root@sjlustre1-o1 ~]# md5sum test.1 > test.1.md5
[root@sjlustre1-o1 ~]# dd if=test.1 iflag=direct of=/lustre/stry/test.1 oflag=direct bs=128k
8+0 records in
8+0 records out
1048576 bytes (1.0 MB) copied, 0.319458 seconds, 3.3 MB/s

[root@sjlustre1-o1 ~]# dd if=/lustre/stry/test.1 iflag=direct of=test.2 oflag=direct bs=128k
8+0 records in
8+0 records out
1048576 bytes (1.0 MB) copied, 0.114691 seconds, 9.1 MB/s
[root@sjlustre1-o1 ~]# md5sum test.1 test.2
4ec4d0b67a2b3341795706605e0b0a28 test.1
426c976b75fa3ce5b5ae22b5195f85fd test.2

After further work the problem was identified as two bugs in the zcopy patch:
1) raid5 sets the UPTODATE flag on a stripe that still holds stale page pointers from direct IO, and then tries to copy data from those pointers during the READ phase.

2) an issue with restoring pages from the stripe cache.

Please verify whether this is an issue on a RHEL5 environment (we don't have one at the moment).
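For reference, a consolidated sketch of the reproducer above (the device names, OST name, and mount points are placeholders; adjust them to the actual setup):

    # Shrink the stripe cache and disable OSS caching so IO exercises the
    # raid5/6 reconstruction path.
    echo 32 > /sys/block/md0/md/stripe_cache_size
    echo 0 > /proc/fs/lustre/obdfilter/lustre-OST0000/writethrough_cache_enable
    echo 0 > /proc/fs/lustre/obdfilter/lustre-OST0000/read_cache_enable

    # Write a random file to Lustre with direct IO.
    dd if=/dev/urandom of=/root/test.1 oflag=direct bs=128k count=8
    dd if=/root/test.1 iflag=direct of=/mnt/lustre/test.1 oflag=direct bs=128k

    # Fail one member so the read back is served via parity reconstruction.
    mdadm /dev/md0 --fail /dev/sdb7

    # Read the file back and compare checksums; a mismatch means corruption.
    dd if=/mnt/lustre/test.1 iflag=direct of=/root/test.2 oflag=direct bs=128k
    md5sum /root/test.1 /root/test.2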



 Comments   
Comment by Jinshan Xiong (Inactive) [ 27/Jun/11 ]

Hi Shadow, can you please show me the problematic code?

Comment by Peter Jones [ 27/Jun/11 ]

Richard is going to try and repro this issue

Comment by Alexey Lyashkov [ 27/Jun/11 ]

These are the two patches that fix the issue for the RHEL6 port.

Comment by Peter Jones [ 28/Jun/11 ]

Alexey

Could you please upload these patches into gerrit?

Thanks

Peter

Comment by Richard Henwood (Inactive) [ 30/Jun/11 ]

I have been looking at this issue on CentOS 5.6, with software RAID on a Sun machine. An initial attempt did not reproduce the issue; however, a number of factors may be in play and this result isn't conclusive.

Work continues to reproduce on both RHEL5 and RHEL6. I am now reserving resources to more accurately identify the scope of this issue.

Comment by Richard Henwood (Inactive) [ 01/Jul/11 ]

Hi Alexey,

I've been working on this bug today. Can you clarify which kernel you used to get the corruption, including the zero copy patch?

Comment by Richard Henwood (Inactive) [ 08/Jul/11 ]

Alexey,

Can you please review the steps I'm taking (below) to verify that I'm not missing something when trying to reproduce this issue?

Thanks

Provisioning a test machine

  • Provision AMD64 with CentOS 5.6 latest build
  • Slice up sdb into 10G chunks: One extended partition over the whole drive, sliced up into six 10G logical partitions.
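One possible way to create that layout (a sketch only; the device name, partition boundaries, and parted invocation are assumptions, not the exact commands used):

    # Create an extended partition over the whole drive, then six ~10G
    # logical partitions; they will appear as sdb5..sdb10.
    parted -s /dev/sdb mklabel msdos
    parted -s /dev/sdb mkpart extended 1MB 100%
    parted -s /dev/sdb mkpart logical 1GB 11GB      # sdb5
    parted -s /dev/sdb mkpart logical 12GB 22GB     # sdb6
    parted -s /dev/sdb mkpart logical 23GB 33GB     # sdb7
    parted -s /dev/sdb mkpart logical 34GB 44GB     # sdb8
    parted -s /dev/sdb mkpart logical 45GB 55GB     # sdb9
    parted -s /dev/sdb mkpart logical 56GB 66GB     # sdb10
    partprobe /dev/sdb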

Building a RAID 5 set

Taken from RAID wiki

  1. load the raid 5 module
    # modprobe raid456
    
  2. Create the raid array:
    mdadm --create --verbose /dev/md0 --chunk=64 --level=5 --raid-devices=3 /dev/sdb5 /dev/sdb6 /dev/sdb7
    mdadm --create --verbose /dev/md127 --chunk=64 --level=5 --raid-devices=3 /dev/sdb8 /dev/sdb9 /dev/sdb10
    

Build a Lustre filesystem on the md0 device.

  1. Run the following:
    OSTDEV1="/dev/md0" OSTDEV2="/dev/md127" /usr/lib64/lustre/tests/llmount.sh
    

The Lustre fs is now available at /mnt/lustre/

Following LU-468

  1. Fix the stripe cache size, disable writethrough/read caching:
    echo 32 > /sys/block/md0/md/stripe_cache_size
    echo 32 > /sys/block/md127/md/stripe_cache_size
    echo 0 > /proc/fs/lustre/obdfilter/lustre-OST0000/writethrough_cache_enable
    echo 0 > /proc/fs/lustre/obdfilter/lustre-OST0001/writethrough_cache_enable
    echo 0 > /proc/fs/lustre/obdfilter/lustre-OST0000/read_cache_enable
    echo 0 > /proc/fs/lustre/obdfilter/lustre-OST0001/read_cache_enable
    

Create a file, not on the Lustre fs:

dd if=/dev/urandom of=/root/test.1 oflag=direct bs=128k count=8

result:

8+0 records in
8+0 records out
1048576 bytes (1.0 MB) copied, 0.23174 seconds, 4.5 MB/s

md5sum /root/test.1

# md5sum test.1
2cb6571392d5ba2e0bd34e3a33f35a43  test.1

dd the file onto the Lustre fs:

# dd if=/root/test.1 of=/mnt/lustre/test.1 oflag=direct bs=128k
2048+0 records in
2048+0 records out
1048576 bytes (1.0 MB) copied, 0.029338 seconds, 45.5 MB/s

Fail a drive in each array to be on the safe side:

mdadm /dev/md0 --fail /dev/sdb7
mdadm /dev/md127 --fail /dev/sdb10

dd the file off the Lustre fs:

# dd if=/mnt/lustre/test.1 iflag=direct of=/root/test.2 oflag=direct bs=128k
8+0 records in
8+0 records out
1048576 bytes (1.0 MB) copied, 0.012411 seconds, 84.5 MB/s

md5sum the test.1 and test.2

# md5sum test.1 test.2
2cb6571392d5ba2e0bd34e3a33f35a43  test.1
2cb6571392d5ba2e0bd34e3a33f35a43  test.2

Additional info (observed at the end of test).

# lfs getstripe /mnt/lustre/test.1 
/mnt/lustre/test.1
lmm_stripe_count:   1
lmm_stripe_size:    1048576
lmm_stripe_offset:  1
	obdidx		 objid		objid		 group
	     1	             3	          0x3	             0

# mount
/dev/sda1 on / type ext3 (rw)
proc on /proc type proc (rw)
sysfs on /sys type sysfs (rw)
devpts on /dev/pts type devpts (rw,gid=5,mode=620)
tmpfs on /dev/shm type tmpfs (rw)
none on /proc/sys/fs/binfmt_misc type binfmt_misc (rw)
sunrpc on /var/lib/nfs/rpc_pipefs type rpc_pipefs (rw)
/dev/loop0 on /mnt/mds1 type lustre (rw,user_xattr,acl)
/dev/loop1 on /mnt/ost1 type lustre (rw)
/dev/loop2 on /mnt/ost2 type lustre (rw)
fat-amd-2.lab.whamcloud.com@tcp:/lustre on /mnt/lustre type lustre (rw,user_xattr,acl,flock)
# losetup /dev/loop1
/dev/loop1: [0011]:6353 (/dev/md0)
# cat /proc/mdstat 
Personalities : [raid6] [raid5] [raid4] 
md127 : active raid5 sdb10[3](F) sdb9[1] sdb8[0]
      19550848 blocks level 5, 64k chunk, algorithm 2 [3/2] [UU_]
		in: 217 reads, 9466 writes; out: 17107851 reads, 4902113 writes
		14669311 in raid5d, 156389 out of stripes, 22010562 handle called
		reads: 269 for rmw, 407 for rcw. zcopy writes: 0, copied writes: 9466
		0 delayed, 0 bit delayed, 0 active, queues: 2 in, 0 out
		0 expanding overlap

      
md0 : active raid5 sdb7[3](F) sdb6[1] sdb5[0]
      19550848 blocks level 5, 64k chunk, algorithm 2 [3/2] [UU_]
		in: 321 reads, 17595 writes; out: 17108767 reads, 4914447 writes
		14674850 in raid5d, 156647 out of stripes, 22024334 handle called
		reads: 677 for rmw, 824 for rcw. zcopy writes: 0, copied writes: 17595
		0 delayed, 0 bit delayed, 0 active, queues: 4 in, 0 out
		0 expanding overlap

      
unused devices: <none>
Comment by Peter Jones [ 11/Jul/11 ]

Vitaly

It seems that Alexey is unavailable to answer even the simplest question about this ticket so that we can establish the scope of the issue and whether it impacts RHEL5 or not. Are you able to assist in this matter? If not, could you please advise who at Xyratex could?

Thanks

Peter

Comment by Eric Mei (Inactive) [ 11/Jul/11 ]

Richard, there are several issues in your test:

  • In your test the IO size is 128K. In that case, when you create the md array, you should specify a chunk size of 128K or less.
  • When writing the file to Lustre, you should use "oflag=direct" instead of iflag.
  • Before reading the file back from Lustre, you should fail a drive first.

I have no idea whether RHEL5 has this problem (RHEL5 differs from RHEL6 in the MD code, and I didn't check the details). I tend to think the bugs were introduced by porting the patch to RHEL6. So if in the end you can't reproduce this on RHEL5, that probably means RHEL5 is safe.
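A sketch of what the corrected steps would look like (using the device names and paths already quoted above; illustrative only, not a verified command set):

    # Recreate the array with a 128K chunk so a bs=128k direct write
    # covers a whole chunk.
    mdadm --create --verbose /dev/md0 --chunk=128 --level=5 --raid-devices=3 \
        /dev/sdb5 /dev/sdb6 /dev/sdb7

    # Write to Lustre with oflag=direct (direct output, not just direct input).
    dd if=/root/test.1 of=/mnt/lustre/test.1 oflag=direct bs=128k

    # Fail a member drive *before* reading the file back, then read and compare.
    mdadm /dev/md0 --fail /dev/sdb7
    dd if=/mnt/lustre/test.1 iflag=direct of=/root/test.2 bs=128k
    md5sum /root/test.1 /root/test.2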

Comment by Eric Mei (Inactive) [ 11/Jul/11 ]

Richard, you updated your previous comment; do you mean you followed the right steps but still can't reproduce it?

Comment by Richard Henwood (Inactive) [ 11/Jul/11 ]

I've updated the reproducer above to include Eric's suggestions.

I am not able to reproduce this on RHEL5.

However, I'm reluctant to assert that this isn't a problem on RHEL5, since the above reproducer does not reproduce the issue on RHEL6 either.

I would appreciate further feedback on the reproducer, maybe Oleg can comment?

Comment by Eric Mei (Inactive) [ 11/Jul/11 ]

I noticed in your /proc/mdstat output that the zcopy write count is 0, so no zero-copy writes actually happened in your test. I'm not sure why...

I don't know whether zcopy in RHEL5 works the same way as in RHEL6. But one thing you can try is using bs=256K in the dd write, which generates a full stripe write (with a 128K chunk size). I have no ideas beyond this. If you manage to get zero-copy writes and no data corruption, then RHEL5 is probably fine.
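A quick way to confirm whether zero-copy writes are happening (this relies on the extra counters the zero-copy patch adds to /proc/mdstat, as seen in the output quoted above; exact field names may differ):

    # Check the per-array "zcopy writes" counter before and after the write;
    # it should increase if the zero-copy path is taken.
    grep -A 5 '^md0' /proc/mdstat | grep 'zcopy writes'

    # With a 128K chunk and a 3-disk raid5 (2 data + 1 parity), bs=256k
    # direct writes cover a full stripe, which is what zero-copy targets.
    dd if=/root/test.1 of=/mnt/lustre/test.1 oflag=direct bs=256k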

Comment by Richard Henwood (Inactive) [ 11/Jul/11 ]

Thanks for the suggestions;

I've tried with the 256k blocksize, and increasing the size of the file that was being shifted around.

The zcopy value stayed at 0.

These changes also did not reproduce the bug on my RHEL6. Can you confirm that the reproducer above reproduces on your RHEL6 testbed?

Comment by Eric Mei (Inactive) [ 11/Jul/11 ]

Did you actually run Lustre on top of the RAID? I noticed the following lines in the mount output:

/dev/loop1 on /mnt/ost1 type lustre (rw)
/dev/loop2 on /mnt/ost2 type lustre (rw)

Comment by Richard Henwood (Inactive) [ 11/Jul/11 ]
# losetup /dev/loop1
/dev/loop1: [0011]:6353 (/dev/md0)
# losetup /dev/loop2
/dev/loop2: [0011]:13978 (/dev/md127)

I'm reading this as loop devices on top of the md devices.

Comment by Alexey Lyashkov [ 12/Jul/11 ]

2Peter: yes, I'm busy with a different issue that will be reported later.
The OOM killer can kill OST_IO threads, which blocks clients from reconnecting until the node is rebooted.

2Richard: it looks like you forgot to clear OST_MOUNT_OPTS / MDS_MOUNT_OPTS.

Comment by Richard Henwood (Inactive) [ 12/Jul/11 ]

Hi Alexey, I have tried clearing OST_MOUNT_OPTS / MDS_MOUNT_OPTS as you suggest.

OSTDEV1="/dev/md0" OSTDEV2="/dev/md127" OST_MOUNT_OPTS="" MDS_MOUNT_OPTS="" /usr/lib64/lustre/tests/llmount.sh

No difference.

Comment by Eric Mei (Inactive) [ 12/Jul/11 ]

Richard, I think you first need to figure out why no zero-copy writes happened on RHEL6, then move on to RHEL5. I don't know your exact environment; maybe you should consult an MD expert at WC.

Comment by Richard Henwood (Inactive) [ 12/Jul/11 ]

llmount.sh uses loopback devices (even when clearing OST_MOUNT_OPTS/MDS_MOUNT_OPTS as suggested by Alexey). These devices add a layer of indirection that may obscure the problem.

As an alternative to llmount.sh I'm manually creating the filesystem. I have used the following steps on RHEL5 and my RHEL6. I have been unable to recreate the bug reliably. As you suggest, I am working on a method to predictably perform zerocopy writes.

Eric, can you run these commands on your RHEL6 environment to confirm that these instructions reproduce this bug on RHEL6?

Create a MD device.

mdadm --create --verbose /dev/md0 --chunk=64 --level=5 --raid-devices=3 /dev/sdb5 /dev/sdb6 /dev/sdb7

Create MDS/MDT and mount.

# mkfs.lustre --fsname=temp --mgs --mdt /dev/sdb11
...
# mount -t lustre /dev/sdb11 /mnt/mdt

Create OST on the MD device and mount on OSS.

# mkfs.lustre --ost --fsname=temp --mgsnode=10.0.0.1@tcp0 /dev/md0
...
# mount -t lustre /dev/md0 /mnt/ost1

Mount the Lustre fs.

# mount -t lustre 10.0.0.1@tcp0:/temp /mnt/lustre
...
# mount
/dev/sda1 on / type ext3 (rw)
proc on /proc type proc (rw)
sysfs on /sys type sysfs (rw)
devpts on /dev/pts type devpts (rw,gid=5,mode=620)
tmpfs on /dev/shm type tmpfs (rw)
none on /proc/sys/fs/binfmt_misc type binfmt_misc (rw)
sunrpc on /var/lib/nfs/rpc_pipefs type rpc_pipefs (rw)
/dev/sdb11 on /mnt/mdt type lustre (rw)
/dev/md0 on /mnt/ost1 type lustre (rw)
10.0.0.1@tcp0:/temp on /mnt/lustre type lustre (rw)

Shrink the stripe cache and turn off the OSS caches.

# echo 32 > /sys/block/md0/md/stripe_cache_size
# echo 0 > /proc/fs/lustre/obdfilter/temp-OST0000/writethrough_cache_enable
# echo 0 > /proc/fs/lustre/obdfilter/temp-OST0000/read_cache_enable

Copy file onto Lustre, fail drive and copy off.

# dd if=/dev/urandom of=/root/test.1 oflag=direct bs=128k count=8
8+0 records in
8+0 records out
1048576 bytes (1.0 MB) copied, 0.230175 seconds, 4.6 MB/s

# md5sum test.1
d02213ae420e043d42688874a93c7e1b  test.1

# dd if=/root/test.1 of=/mnt/lustre/test.1 oflag=direct bs=128k
8+0 records in
8+0 records out
1048576 bytes (1.0 MB) copied, 0.080452 seconds, 13.0 MB/s

# mdadm /dev/md0 --fail /dev/sdb7
mdadm: set /dev/sdb7 faulty in /dev/md0

# dd if=/mnt/lustre/test.1 iflag=direct of=/root/test.2 oflag=direct bs=128k
8+0 records in
8+0 records out
1048576 bytes (1.0 MB) copied, 0.0758 seconds, 13.8 MB/s

# md5sum test.1 test.2
bf4d5039cb2c7acd744d119a262bc90b  test.1
bf4d5039cb2c7acd744d119a262bc90b  test.2
Comment by Eric Mei (Inactive) [ 12/Jul/11 ]

Richard, I don't have the env to test it, and I'll be away for two weeks from tomorrow, so sorry I couldn't be more helpful... I just read your steps again and they all seem correct to me. So please get zero-copy writes working on RHEL6 as the first step.

Comment by Richard Henwood (Inactive) [ 13/Jul/11 ]

UPDATE:

There currently is no zero-copy patch for RHEL6 in the Lustre source. As a result, this bug should not be reproducible on RHEL6.

There is a zero-copy patch for RHEL5 in the Lustre source. I have been unable to reliably generate zero-copy writes by writing after a drive has failed. However, I am still unable to observe data corruption with RHEL5.

Because the zero copy patch is not available for RHEL6, I recommend this issue be CLOSED: Can't reproduce.
A new Jira issue can be created for the RHEL6 zero-copy patch.

Comment by Peter Jones [ 13/Jul/11 ]

Dropping priority so this is no longer a blocker. If there is any evidence that this affects the master code on either RHEL5 or RHEL6 then it can be raised in priority again.

Comment by Richard Henwood (Inactive) [ 13/Jul/11 ]

Apologies, my previous comment contained an inaccuracy:

I have been able to reliably generate zero-copy writes by writing after a drive has failed. However, I am still unable to observe data corruption with RHEL5.

Comment by Peter Jones [ 26/Jul/11 ]

As I understand it, the bug in the zero copy patch has been fixed in the version contributed under LU-535.

Comment by Nathan Rutman [ 20/Nov/12 ]

Xyratex: MRP-158
