[LU-701] parallel-scale test_write_disjoint fails due to invalid file size

Created: 21/Sep/11  Updated: 06/Sep/13  Resolved: 06/Sep/13
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.1.0, Lustre 2.4.0, Lustre 1.8.7, Lustre 2.5.0 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Critical |
| Reporter: | Minh Diep | Assignee: | WC Triage |
| Resolution: | Duplicate | Votes: | 0 |
| Labels: | None |
| Environment: | Lustre Clients:  Lustre Servers: |
| Issue Links: | |
| Severity: | 3 |
| Rank (Obsolete): | 5514 |
| Description |

v2_1_0_0_RC2 testing: MPI_ABORT for an unknown reason. There is no console or syslog output at all in the report (Maloo bug?).

Report: https://maloo.whamcloud.com/test_sets/44dc4934-e440-11e0-9909-52540025f9af

== parallel-scale test write_disjoint: write_disjoint == 14:43:05 (1316554985)
filesystem summary: 5000040 86 4999954 0% /mnt/lustre
+ chmod 0777 /mnt/lustre
NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
filesystem summary: 5000040 87 4999953 0% /mnt/lustre
parallel-scale test_write_disjoint: @@@@@@ FAIL: write_disjoint failed! 1
| Comments |
| Comment by Jian Yu [ 23/Sep/11 ] |
Lustre Clients:
Lustre Servers:

The write_disjoint test passed in a manual run: https://maloo.whamcloud.com/test_sets/af1b916c-e5bf-11e0-9909-52540025f9af
| Comment by Andreas Dilger [ 30/May/13 ] |
I just noticed in the "full" runs that test_write_disjoint is one of the few tests that is consistently failing, and this bug is listed as the cause.

The MPI_ABORT is not the cause of this problem, just a symptom. When write_disjoint detects a data consistency error, it prints an error message and then calls MPI_Abort() to exit. The real problem is that either the output file was not written correctly, or the DLM locks are caching the file size incorrectly, resulting in an inconsistent file size being reported to the application:

loop 90: chunk_size 62460, file size was 499680
rank 4, loop 91: invalid file size 203532 instead of 232608 = 29076 * 8
rank 2, loop 91: invalid file size 203532 instead of 232608 = 29076 * 8
rank 6, loop 91: invalid file size 203532 instead of 232608 = 29076 * 8
rank 0, loop 91: invalid file size 203532 instead of 232608 = 29076 * 8
--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD

In this case the MPI_ABORT is expected, and we need to find out why the test is failing. For better or worse, it seems to fail on virtually every test run, so it will hopefully not be too complex to debug. Almost certainly we will need to gather more debug logs from the client nodes (lctl set_param debug="+vfstrace +rpctrace +dlmtrace" at a minimum).
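For reference, the failing check follows the pattern sketched below: every rank writes a disjoint chunk of a shared file, then each rank compares the file size against chunk_size * nranks, exactly the arithmetic shown in the error lines above (232608 = 29076 * 8). This is only a minimal illustration of that pattern, not the actual test source in the Lustre tree; the file path and chunk size are assumed values for illustration.

```c
/* Minimal sketch of the write_disjoint consistency check (an assumption
 * reconstructed from the error messages above, not the real test source). */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/stat.h>
#include <unistd.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
	/* Hypothetical path used for illustration. */
	const char *path = "/mnt/lustre/write_disjoint_file";
	size_t chunk_size = 29076;	/* example value from the failing loop */
	int rank, nproc, fd;
	struct stat st;
	char *buf;

	MPI_Init(&argc, &argv);
	MPI_Comm_rank(MPI_COMM_WORLD, &rank);
	MPI_Comm_size(MPI_COMM_WORLD, &nproc);

	buf = malloc(chunk_size);
	memset(buf, 'A' + rank, chunk_size);

	/* Each rank writes its own disjoint chunk of the shared file. */
	fd = open(path, O_CREAT | O_WRONLY, 0666);
	if (pwrite(fd, buf, chunk_size, (off_t)rank * chunk_size) !=
	    (ssize_t)chunk_size)
		MPI_Abort(MPI_COMM_WORLD, 1);
	close(fd);

	/* Wait until every rank has finished writing ... */
	MPI_Barrier(MPI_COMM_WORLD);

	/* ... then verify the size covers all ranks' chunks. A stale value
	 * here implicates cached attributes on the client (DLM locks)
	 * rather than the writes themselves. */
	stat(path, &st);
	if (st.st_size != (off_t)chunk_size * nproc) {
		fprintf(stderr,
			"rank %d: invalid file size %lld instead of %lld = %zu * %d\n",
			rank, (long long)st.st_size,
			(long long)((off_t)chunk_size * nproc),
			chunk_size, nproc);
		MPI_Abort(MPI_COMM_WORLD, 1);
	}

	free(buf);
	MPI_Finalize();
	return 0;
}
```

If a reproducer like this fails only on some ranks, raising the debug mask with the lctl set_param command above before the run and dumping the client debug logs afterwards should show whether the stale size comes from DLM lock caching on the client.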
| Comment by Andreas Dilger [ 06/Sep/13 ] |
Duplicate of