[LU-5637] sanity test_130a: FIEMAP on 1-stripe file(/mnt/lustre/f130a.sanity) failed Created: 17/Sep/14 Updated: 04/Jan/18 Resolved: 04/Jan/18 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.7.0 |
| Fix Version/s: | Lustre 2.11.0 |
| Type: | Bug | Priority: | Major |
| Reporter: | Maloo | Assignee: | Minh Diep |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None |
| Environment: | client: lustre-master RHEL7 |
| Issue Links: | |
| Severity: | 3 |
| Rank (Obsolete): | 15784 |
| Description |
|
This issue was created by maloo for sarah <sarah@whamcloud.com>.

This issue relates to the following test suite run: https://testing.hpdd.intel.com/test_sets/bfcd5a82-3610-11e4-8a7f-5254006e85c2.

The sub-test test_130a failed with the following error:

several FIEMAP tests failed from test 130a to 130e

== sanity test 130a: FIEMAP (1-stripe file) ========================================================== 09:49:12 (1409935752)
1+0 records in
1+0 records out
65536 bytes (66 kB) copied, 0.00196579 s, 33.3 MB/s
Filesystem type is: bd00bd0
File size of /mnt/lustre/f130a.sanity is 65536 (16 blocks of 4096 bytes)
 ext: logical_offset: physical_offset: length: expected: flags:
   0: 0.. 15: 40923.. 40938: 16: eof
/mnt/lustre/f130a.sanity: 1 extent found
sanity test_130a: @@@@@@ FAIL: FIEMAP on 1-stripe file(/mnt/lustre/f130a.sanity) failed |
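For context, a minimal sketch of what the test exercises, reconstructed from the log above (the `lfs setstripe` call is an assumption based on the test name "1-stripe file"; the real sanity.sh logic differs):

```bash
# Hypothetical reproduction of the test_130a steps, based on the log above.
# Assumes a Lustre client mounted at /mnt/lustre and filefrag from e2fsprogs.
lfs setstripe -c 1 /mnt/lustre/f130a.sanity                  # create a 1-stripe file
dd if=/dev/zero of=/mnt/lustre/f130a.sanity bs=64k count=1   # the 65536-byte write seen in the log
sync
filefrag -v /mnt/lustre/f130a.sanity                         # FIEMAP-based extent listing
```

The test then parses the extent table printed by filefrag; the FAIL above comes from that parse not matching what was written.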
| Comments |
| Comment by Peter Jones [ 17/Sep/14 ] |
|
Bob

Could you please assist with this issue?

Thanks

Peter |
| Comment by Bob Glossman (Inactive) [ 18/Sep/14 ] |
|
I haven't been able to reproduce this failure manually yet; running sanity test 130a alone doesn't fail. I see different output from the filefrag command, as follows:

Filesystem type is: bd00bd0
File size of /mnt/lustre/f130a.sanity is 65536 (64 blocks of 1024 bytes)
 ext: device_logical: physical_offset: length: dev: flags:
   0: 0.. 63: 137344.. 137407: 64: 0000: last,net,eof
/mnt/lustre/f130a.sanity: 1 extent found

The values from my manual test look more like what I would expect than the ones in the failed test log. I can't yet explain the difference. |
| Comment by Andreas Dilger [ 19/Sep/14 ] |
|
The failed test run was printing "expected:" (i.e. the next expected physical block number) instead of "dev:" (i.e. the Lustre OST number). This looks like the node did not have the correct Lustre e2fsprogs installed.

Similarly, on the next test_130b "FIEMAP (2-stripe file)", filefrag printed:

File size of /mnt/lustre/f130b.sanity is 2097152 (512 blocks of 4096 bytes)
/mnt/lustre/f130b.sanity: FIBMAP unsupported

which also means that the Lustre e2fsprogs wasn't installed, since Lustre doesn't allow normal FIEMAP for multi-stripe files. |
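A quick way to check which filefrag a node has, as a sketch (the "dev:" column test is an inference from the two outputs quoted in this ticket, not an official check):

```bash
# Distinguish the Lustre-patched filefrag from the stock one.
# Assumption: the patched build prints a "dev:" column (see Bob's output above),
# while the stock build prints "expected:" (see the failed test log).
f=/mnt/lustre/f130a.sanity
if filefrag -v "$f" 2>/dev/null | grep -q 'dev:'; then
    echo "Lustre-patched filefrag (per-device FIEMAP output)"
else
    echo "stock filefrag"
fi
rpm -q e2fsprogs    # a .wc suffix in the version indicates the Lustre build
```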
| Comment by Bob Glossman (Inactive) [ 19/Sep/14 ] |
|
That could explain my inability to reproduce the failure. I have the Lustre e2fsprogs installed everywhere, including on the client, where it may not even be needed. In fact I have the new 1.42.12.wc1 version installed. |
| Comment by Bob Glossman (Inactive) [ 19/Sep/14 ] |
|
Looking at the test nodes on Onyx, I can confirm that only the native e2fsprogs is installed on client nodes, not the Lustre e2fsprogs. However, this is the case for el6 as well as el7, so it's not clear why it's only broken on el7. The native e2fsprogs is e2fsprogs-1.42.9-4.el7.x86_64 on el7 and e2fsprogs-1.41.12-18.el6_5.1.x86_64 on el6. |
| Comment by Bob Glossman (Inactive) [ 19/Sep/14 ] |
|
Maybe the proper remedy is to just install the Lustre e2fsprogs on all our test nodes, regardless of client/server role or version. However, I don't think we require it for client-only Lustre installs, i.e. those done with lustre-client-* rpms on unpatched kernels. If that's so, installing it everywhere would mean we aren't testing the configuration we say we support. |
| Comment by Andreas Dilger [ 20/Sep/14 ] |
|
It would be best to install the Lustre e2fsprogs on the clients. The current test is skipped if filefrag doesn't support any FIEMAP functionality. Most of the FIEMAP code was landed into upstream e2fsprogs, but not the Lustre-specific part that handles files striped across multiple devices. The best solution would be to allow the test to be run, but skip the parts that depend on the "device:" being returned. The stock filefrag will also not run on multi-striped files, so those tests also need to be skipped. |
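A sketch of the skip logic described here, with assumed helper names (not the actual sanity.sh code):

```bash
# Run the FIEMAP tests with any filefrag, but skip the Lustre-specific
# checks when the stock (unpatched) utility is installed.
filefrag_has_device_column() {
    # Assumption: only the Lustre-patched filefrag prints a "dev:" column.
    filefrag -v "$1" 2>/dev/null | grep -q 'dev:'
}

test_130_check() {
    local file=$1 stripes=$2
    if ! filefrag_has_device_column "$file"; then
        if [ "$stripes" -gt 1 ]; then
            echo "SKIP: stock filefrag cannot map multi-stripe files"
            return 0
        fi
        echo "running single-stripe checks only (no device: verification)"
    fi
    filefrag -v "$file"
}
```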
| Comment by Bob Glossman (Inactive) [ 12/May/15 ] |
|
I think the proper fix is to start building e2fsprogs for sles12 in our tools/e2fsprogs builds, then installing that on sles12 clients, so the fix(es) are outside of Lustre, in e2fsprogs and TEI. FWIW, I have built and installed the most recent 1.42.12-wc1 version of e2fsprogs locally on sles12 clients and servers without any problems. |
| Comment by Sarah Liu [ 05/Jun/17 ] |
|
In the past 7 days (from 5/30 to 6/5) of lustre-review testing, sanity 130a/b/c/d/e each failed 21 times. The error was only seen on review-ldiskfs (10 failures) and review-dne-part1 (11 failures). A total of 244 review-ldiskfs and review-dne-part1 sessions ran last week, so the failure rate is about 10%. |
| Comment by Peter Jones [ 06/Jun/17 ] |
|
Niu

Discussing this on the triage call, we thought that this rare issue occurring more frequently might be due to the PFL changes. Could you please investigate?

Thanks

Peter |
| Comment by John Hammond [ 06/Jun/17 ] |
|
It seems like the wrong e2fsprogs is being installed again. In the recent failures, the extent length reported by filefrag is in terms of 4096-byte blocks, whereas the test expects it in terms of 1K blocks, and 16 != 64 (65536 bytes is 16 blocks of 4096 bytes but 64 blocks of 1024 bytes). |
| Comment by Niu Yawei (Inactive) [ 07/Jun/17 ] |
|
Minh, could you help verify whether the wrong e2fsprogs is installed on the clients again? Thanks in advance. |
| Comment by Andreas Dilger [ 07/Jun/17 ] |
|
It would also be possible to update the test to use "filefrag -k" to print the blocks in 1KB units (this is something we added to the upstream filefrag utility). If the test is on single-stripe files, the unpatched filefrag utility should still work, but it doesn't understand multi-stripe files. Improving upstream filefrag to include support for multi-device filesystems like Lustre, Btrfs, and ZFS, as well as compression, is on my long list of things to do that I just never have time for. |
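For illustration, how the `-k` option sidesteps the unit mismatch John described (`filefrag -k` reports lengths in 1024-byte blocks regardless of the filesystem block size):

```bash
# With -k, a 65536-byte file is reported as 64 blocks whether the
# underlying filesystem uses 1024-byte or 4096-byte blocks.
filefrag -vk /mnt/lustre/f130a.sanity
# Without -k, filefrag reports in filesystem blocks:
#   65536 bytes = 16 blocks of 4096 bytes = 64 blocks of 1024 bytes
```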
| Comment by Peter Jones [ 08/Jun/17 ] |
|
Removing the 2.10 fix version because, IIUC, the high occurrence of this issue has been addressed by using the correct e2fsprogs, and all that is left is the long-standing, rare way of triggering this failure. |
| Comment by Gerrit Updater [ 06/Dec/17 ] |
|
Sergey Cheremencev (c17829@cray.com) uploaded a new patch: https://review.whamcloud.com/30391 |
| Comment by Gerrit Updater [ 04/Jan/18 ] |
|
Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/30391/ |
| Comment by Peter Jones [ 04/Jan/18 ] |
|
Landed for 2.11 |