[LU-10686] sanity-pfl test 9 fails with “[0x100010000:0x6025:0x0] != “ Created: 20/Feb/18 Updated: 26/Aug/19 Resolved: 25/Sep/18 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.11.0, Lustre 2.12.0, Lustre 2.10.4, Lustre 2.10.5, Lustre 2.10.6, Lustre 2.10.7 |
| Fix Version/s: | Lustre 2.12.0 |
| Type: | Bug | Priority: | Minor |
| Reporter: | James Nunez (Inactive) | Assignee: | Zhenyu Xu |
| Resolution: | Duplicate | Votes: | 0 |
| Labels: | None | | |
| Issue Links: | |
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
Since lustre-master build # 3703, 2.10.57.57, on 2018-01-31, we see sanity-pfl test_9 failing to get and compare the FID of the file's second component, with the error

[0x100010000:0x6025:0x0] !=

The FID of the second component of the file after MDS failover should be on the right-hand side of the "!=".

Looking at the suite_log, we see that there is some issue writing to the file:

dd: error writing '/mnt/lustre/d9.sanity-pfl/f9.sanity-pfl': No data available
1+0 records in
0+0 records out
0 bytes copied, 0.000605975 s, 0.0 kB/s

We know the file system isn't full because, earlier in the test, 'lfs df' is printed and shows the file system only 2% full. I see this 'No data available' message when trying to reproduce this issue outside of autotest, even without the replay-barrier and when the test succeeds, so this is probably not the cause of the failure.

Right after the failed write and prior to the MDS failover, we get the FID of the second component:

before MDS recovery, the ost fid of 2nd component is [0x100010000:0x6025:0x0]

We then fail over the MDS and, although it looks like it is back on-line, we can't get the FID of the second component:

onyx-32vm2: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 16 sec
onyx-32vm1: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 16 sec
after MDS recovery, the ost fid of 2nd component is
sanity-pfl test_9: @@@@@@ FAIL: [0x100010000:0x6025:0x0] !=

There doesn't seem to be anything enlightening in the console and dmesg logs. Looking at the MDS log, we see the second component created:

00020000:01000000:0.0:1518721834.723830:0:29881:0:(lod_pool.c:919:lod_find_pool()) lustre-MDT0000-osd: request for an unknown pool (test_85b)
00000004:00080000:0.0:1518721834.723866:0:29881:0:(osp_object.c:1546:osp_create()) lustre-OST0001-osc-MDT0000: Wrote last used FID: [0x100010000:0x6025:0x0], index 1: 0

Logs for this failure are at |
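For orientation, here is a rough sketch of the flow the test exercises (this is not the exact test_9 code; the layout sizes, paths, and lfs option spellings are assumptions and may differ between Lustre versions):

F=/mnt/lustre/d9.sanity-pfl/f9.sanity-pfl

# two-component PFL layout: component 1 covers [0, 1M), component 2 covers [1M, 2M)
lfs setstripe -E 1M -c 1 -E 2M -c 1 $F

# write inside the second component's extent so the MDS instantiates it
dd if=/dev/zero of=$F bs=1M count=1 seek=1

# OST FID of the second component before the MDS failover
fid_before=$(lfs getstripe -I2 $F | grep l_fid)

# ... replay-barrier, fail over the MDS, wait for clients to report FULL ...

# the replayed instantiation should report the same FID; in this failure the
# right-hand side comes back empty
fid_after=$(lfs getstripe -I2 $F | grep l_fid)
[ "$fid_before" == "$fid_after" ] || echo "FAIL: $fid_before != $fid_after"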
| Comments |
| Comment by Peter Jones [ 28/Feb/18 ] |
|
Bobijam, could you please investigate? Thanks, Peter |
| Comment by James Nunez (Inactive) [ 08/Mar/18 ] |
|
We only see this during full test session testing and not in review testing (the testing we do for every patch). |
| Comment by Minh Diep [ 09/Apr/18 ] |
|
+1 on 2.10 https://testing.hpdd.intel.com/test_sets/816b22e2-3aa3-11e8-8f8a-52540065bddc |
| Comment by Sarah Liu [ 23/Apr/18 ] |
|
+1 on master https://testing.hpdd.intel.com/test_sets/d761fec6-471b-11e8-95c0-52540065bddc

I did some searching on Maloo; it seems the failure is only seen on a single-MDT config, which is why review testing passes, since sanity-pfl is run there with a DNE config. |
| Comment by Gerrit Updater [ 06/Aug/18 ] |
|
James Nunez (jnunez@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/32945 |
| Comment by Gerrit Updater [ 18/Aug/18 ] |
|
Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/32945/ |
| Comment by Patrick Farrell (Inactive) [ 11/Sep/18 ] |
|
Well, I'm not sure why it doesn't pass in the single MDT config, but the problem with the test is pretty simple: we're writing beyond the defined layout for the file, so the test doesn't actually instantiate the layout as intended. The layout ends at 2 MiB, but the dd write starts at offset 2 MiB, which gets ENODATA because it's beyond the end of the layout. I'll push a patch. |
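Concretely (a minimal sketch, assuming a layout whose last component ends at 2 MiB; the path and sizes are illustrative, not the exact test code):

F=/mnt/lustre/d9.sanity-pfl/f9.sanity-pfl

# the layout covers [0, 2M); there is no component for offsets at or past 2M
lfs setstripe -E 1M -c 1 -E 2M -c 1 $F

# offset 2 MiB is outside the layout, so dd reports "No data available" (ENODATA)
dd if=/dev/zero of=$F bs=1M count=1 seek=2

# a write inside [1M, 2M) would instantiate the second component as the test intends
dd if=/dev/zero of=$F bs=1M count=1 seek=1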
| Comment by Gerrit Updater [ 11/Sep/18 ] |
|
Patrick Farrell (paf@cray.com) uploaded a new patch: https://review.whamcloud.com/33137 |
| Comment by Zhenyu Xu [ 11/Sep/18 ] |
|
I've verified that |
| Comment by Patrick Farrell (Inactive) [ 11/Sep/18 ] |
|
Have you confirmed the write is no longer returning ENODATA? Given that the write in the test is beyond the end of the specified layout, I think we've still got a problem. |
| Comment by Zhenyu Xu [ 12/Sep/18 ] |
|
Yes, the write returns ENODATA, but the test just intends to instantiate the 2nd component and verify that recovery replays the 2nd component instantiation. |
| Comment by Patrick Farrell (Inactive) [ 12/Sep/18 ] |
|
If the write returns ENODATA, how is the component getting instantiated? Does it still hit with the first byte...? |
| Comment by Zhenyu Xu [ 12/Sep/18 ] |
|
Even if it hits ENODATA, the MDS will still instantiate the available component. I don't mean that changing the 2nd component end to EOF is wrong; the essential issue here is that the component instantiation replay has a bug, and |
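One way to see this (a rough sketch; the exact lfs getstripe output fields are assumed here and vary between Lustre versions) is to compare the component flags and object FIDs across the failover:

F=/mnt/lustre/d9.sanity-pfl/f9.sanity-pfl

# before failover: component 2 should carry the "init" flag and list an OST object
# FID, even though the dd write itself returned ENODATA
lfs getstripe -v $F | grep -E 'lcme_id|lcme_flags|l_fid'

# ... fail over the MDS and wait for recovery ...

# after recovery the same flags and FID should be reported; if the instantiation
# replay is broken, the FID for component 2 comes back empty (the bare "!=" above)
lfs getstripe -v $F | grep -E 'lcme_id|lcme_flags|l_fid'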
| Comment by Peter Jones [ 25/Sep/18 ] |
|
It sounds like this is believed to be a duplicate of |
| Comment by Andreas Dilger [ 26/Aug/19 ] |
|
This test was removed from ALWAYS_EXCEPT by patch https://review.whamcloud.com/32847 " |