[LU-279] Test failure on test suite sanity, subtest test_17k Created: 04/May/11 Updated: 22/Apr/13 Resolved: 14/May/11 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | Lustre 2.1.0, Lustre 2.4.0 |
| Type: | Bug | Priority: | Minor |
| Reporter: | Maloo | Assignee: | Zhenyu Xu |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Severity: | 3 |
| Rank (Obsolete): | 5038 |
| Description |
|
This issue was created by maloo for Prakash Surya <surya1@llnl.gov> This issue relates to the following test suite run: https://maloo.whamcloud.com/test_sets/899e29aa-76a6-11e0-a1b3-52540025f9af. The sub-test test_17k failed with the following error:
Rsync is failing when xattrs are enable due to the lgetxattr function return inconsistent information about the xattr size. According to lgetxattr's man page:
It doesn't seem like these statements are being upheld. The behavior I am seeing is: // ... // path is a pointer to a sting with the pathname to a directory // on a lustre filesystem ssize_t ret1 = lgetxattr(path, "lustre.lov", NULL, 0); // returns 96 char *buf = (char *)malloc(ret1); ssize_t ret2 = lgetxattr(path, "lustre.lov", buf, ret1); // returns 56 // Thus at this point ret1 equals 96 and ret2 equals 56 // Shouldn't ret1 and ret2 be the same value?? // ... Shouldn't both invocations of lgetxattr return the same value? |
| Comments |
| Comment by Andreas Dilger [ 04/May/11 ] |
|
I don't know the details, but I suspect that the first call is somehow just returning the generic maximum size of the lustre.lov xattr, while the second call is returning the actual size of the lustre.lov xattr for that particular file. lmmsize = 96 = 32 + 3 * 24, so it looks like your filesystem has 3 OSTs, while lmmsize = 56 = 32 + 1 * 24, so the file actually has 1 stripe allocated. I don't know if this should be considered a bug or not to return an actual xattr size that is smaller than the reported size, though I agree it is a bit surprising, and unfortunate that rsync really cares that much about it. Having rsync call lgetxattr() twice on every file it is processing, for every xattr on the file seems quite inefficient, and regardless of whether this bug is fixed in Lustre or not, I'd also recommend filing a bug with the rsync folks to just have them allocate a decent-sized buffer for reading all xattrs (can be enlarged if they every get ERANGE back) and then malloc() and memcpy() the returned xattr in userspace, which is probably much more efficient than calling into the kernel (and in some cases over the network) twice for each xattr. |
| Comment by Peter Jones [ 06/May/11 ] |
|
Bobijam Could you please review this in more detail Thanks Peter |
| Comment by Zhenyu Xu [ 08/May/11 ] |
|
I tried rsync (version 2.6.8), it does not call lgetxattr() twice on every file, and it just use a 100 length buffer for reading xattrs. And lustre handles xattr retrival in this way: if size 0 is passed, lustre just returns the maximum buffer size which could hold the value. For "lustre.lov", it is 32(basic lmm) + 16(pool name) + max_stripecount * 24(lov ost data). Checking the lgetxattr manual, it states
and since the purpose of this behaviour is to estimate sufficient buffer size, IMHO Lustre's xattr handling does not count as break the statement. |
| Comment by Prakash Surya (Inactive) [ 09/May/11 ] |
Here is what I am seeing on my end using rsync v3.0.6 on a RHEL6 based system ... # /mnt/lustre > rsync --version | head -1 rsync version 3.0.6 protocol version 30 # /mnt/lustre > ls -lR .: total 0 # /mnt/lustre > mkdir -p ./dir ./dir.new # /mnt/lustre > touch ./dir/file # /mnt/lustre > ln -s file ./dir/file.lnk # /mnt/lustre > ls -lR .: total 8 drwx------ 2 root root 4096 May 9 11:35 dir drwx------ 2 root root 4096 May 9 11:35 dir.new ./dir: total 0 -rw------- 1 root root 0 May 9 11:35 file lrwxrwxrwx 1 root root 4 May 9 11:35 file.lnk -> file ./dir.new: total 0 # /mnt/lustre > strace rsync -av -X ./dir/ ./dir.new/ > strace.txt 2>&1 # /mnt/lustre > ls -lR .: total 8 drwx------ 2 root root 4096 May 9 11:35 dir drwx------ 2 root root 4096 May 9 11:35 dir.new -rw------- 1 root root 10941 May 9 11:36 strace.txt ./dir: total 0 -rw------- 1 root root 0 May 9 11:35 file lrwxrwxrwx 1 root root 4 May 9 11:35 file.lnk -> file ./dir.new: total 0 # /mnt/lustre > cat strace.txt # ... Output abbreviated to highlight lgetxattr calls ... llistxattr(".", 0x6797f0, 1024) = 36 lgetxattr(".", "trusted.lma", 0x0, 0) = 64 lgetxattr(".", "trusted.lma", "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x04\x00\x00\x02\x00\x00\x00\x0f\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00", 64) = 64 lgetxattr(".", "trusted.link", 0x0, 0) = 45 lgetxattr(".", "trusted.link", "\xdf\xf1\xea\x11\x01\x00\x00\x00-\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x15\x00\x00\x00\x00\x00\x00\x00\x13\x03D\x8d\x8d\x00\x00\x00\x00dir", 45) = 45 lgetxattr(".", "lustre.lov", 0x0, 0) = 96 lgetxattr(".", "lustre.lov", "\xd0\x0b\xd1\x0b\x01\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\xff\xff\xff\xff\xff\xff\xff\xff\x00\x00\x10\x00\x01\x00\xff\xff", 96) = 32 # ... # /mnt/lustre > rsync -av -X ./dir/ ./dir.new/ sending incremental file list get_xattr_data: lgetxattr(".","lustre.lov",96) returned 32 sent 18 bytes received 12 bytes 60.00 bytes/sec total size is 0 speedup is 0.00 rsync error: some files/attrs were not transferred (see previous errors) (code 23) at main.c(1039) [sender=3.0.6] # /mnt/lustre > ls -lR .: total 20 drwx------ 2 root root 4096 May 9 11:35 dir drwx------ 2 root root 4096 May 9 11:35 dir.new -rw------- 1 root root 287 May 9 11:46 strace.txt ./dir: total 0 -rw------- 1 root root 0 May 9 11:35 file lrwxrwxrwx 1 root root 4 May 9 11:35 file.lnk -> file ./dir.new: total 0 Looking at the output from strace, it definitely appears to be calling lgetxattr twice for every xattr. And from digging into the v3.0.8 rsync source, I believe the relevant code is found in xattrs.c static char *get_xattr_data(const char *fname, const char *name, size_t *len_ptr, int no_missing_error) { size_t datum_len = sys_lgetxattr(fname, name, NULL, 0); // ... if (datum_len) { size_t len = sys_lgetxattr(fname, name, ptr, datum_len); if (len != datum_len) { if (len == (size_t)-1) { // ... } else { rprintf(FERROR_XFER, "get_xattr_data: lgetxattr(\"%s\",\"%s\",%ld)" " returned %ld\n", full_fname(fname), name, (long)datum_len, (long)len); } // ... |
| Comment by Zhenyu Xu [ 09/May/11 ] |
|
confirmed that rsync v3.0.x changed its behavior. patch tracking at http://review.whamcloud.com/524 to return exact xattr value size when passing 0 size for determing buffer size to hold the xattr value. |
| Comment by Prakash Surya (Inactive) [ 10/May/11 ] |
|
Great! Maloo results for test 17k with patch 524: https://maloo.whamcloud.com/test_sets/e51aa7e4-7b4d-11e0-b5bf-52540025f9af |
| Comment by Build Master (Inactive) [ 13/May/11 ] |
|
Integrated in Oleg Drokin : 701627cb7c17dfd206ea65aaadab3a109047a4fd
|
| Comment by Build Master (Inactive) [ 13/May/11 ] |
|
Integrated in Oleg Drokin : 701627cb7c17dfd206ea65aaadab3a109047a4fd
|
| Comment by Build Master (Inactive) [ 13/May/11 ] |
|
Integrated in Oleg Drokin : 701627cb7c17dfd206ea65aaadab3a109047a4fd
|
| Comment by Build Master (Inactive) [ 13/May/11 ] |
|
Integrated in Oleg Drokin : 701627cb7c17dfd206ea65aaadab3a109047a4fd
|
| Comment by Build Master (Inactive) [ 13/May/11 ] |
|
Integrated in Oleg Drokin : 701627cb7c17dfd206ea65aaadab3a109047a4fd
|
| Comment by Build Master (Inactive) [ 13/May/11 ] |
|
Integrated in Oleg Drokin : 701627cb7c17dfd206ea65aaadab3a109047a4fd
|
| Comment by Build Master (Inactive) [ 13/May/11 ] |
|
Integrated in Oleg Drokin : 701627cb7c17dfd206ea65aaadab3a109047a4fd
|
| Comment by Build Master (Inactive) [ 13/May/11 ] |
|
Integrated in Oleg Drokin : 701627cb7c17dfd206ea65aaadab3a109047a4fd
|
| Comment by Build Master (Inactive) [ 13/May/11 ] |
|
Integrated in Oleg Drokin : 701627cb7c17dfd206ea65aaadab3a109047a4fd
|
| Comment by Build Master (Inactive) [ 13/May/11 ] |
|
Integrated in Oleg Drokin : 701627cb7c17dfd206ea65aaadab3a109047a4fd
|
| Comment by Build Master (Inactive) [ 13/May/11 ] |
|
Integrated in Oleg Drokin : 701627cb7c17dfd206ea65aaadab3a109047a4fd
|
| Comment by Build Master (Inactive) [ 13/May/11 ] |
|
Integrated in Oleg Drokin : 701627cb7c17dfd206ea65aaadab3a109047a4fd
|
| Comment by Build Master (Inactive) [ 13/May/11 ] |
|
Integrated in Oleg Drokin : 701627cb7c17dfd206ea65aaadab3a109047a4fd
|
| Comment by Build Master (Inactive) [ 13/May/11 ] |
|
Integrated in Oleg Drokin : 701627cb7c17dfd206ea65aaadab3a109047a4fd
|
| Comment by Peter Jones [ 14/May/11 ] |
|
Landed for 2.1. Please reopen if issue reoccurs or more work is required |