[LU-3027] Failure on test suite parallel-scale test_write_disjoint: invalid file size 140329 instead of 160376 = 20047 * 8 Created: 25/Mar/13 Updated: 08/Nov/13 Resolved: 11/Oct/13 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.4.0, Lustre 2.4.1, Lustre 2.5.0 |
| Fix Version/s: | Lustre 2.4.2 |
| Type: | Bug | Priority: | Blocker |
| Reporter: | Maloo | Assignee: | Oleg Drokin |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None |
| Issue Links: |
|
| Severity: | 3 |
| Rank (Obsolete): | 7390 |
| Description |
|
This issue was created by maloo for sarah <sarah@whamcloud.com> This issue relates to the following test suite run: https://maloo.whamcloud.com/test_sets/2ed1fef2-94bd-11e2-93c6-52540035b04c. The sub-test test_write_disjoint failed with the following error:
test log shows:

  librdmacm: Fatal: no RDMA devices found
  librdmacm: Fatal: no RDMA devices found
  librdmacm: Fatal: no RDMA devices found
  librdmacm: Fatal: no RDMA devices found
  librdmacm: Fatal: no RDMA devices found
  librdmacm: Fatal: no RDMA devices found
  librdmacm: Fatal: no RDMA devices found
  loop 0: chunk_size 103399
  [client-27vm6.lab.whamcloud.com:00935] 7 more processes have sent help message help-mpi-btl-base.txt / btl:no-nics
  [client-27vm6.lab.whamcloud.com:00935] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
  loop 79: chunk_size 71702, file size was 573616
  rank 2, loop 80: invalid file size 140329 instead of 160376 = 20047 * 8
  loop 79: chunk_size 71702, file size was 573616
  rank 4, loop 80: invalid file size 140329 instead of 160376 = 20047 * 8
  loop 79: chunk_size 71702, file size was 573616
  rank 6, loop 80: invalid file size 140329 instead of 160376 = 20047 * 8
  loop 79: chunk_size 71702, file size was 573616
  rank 0, loop 80: invalid file size 140329 instead of 160376 = 20047 * 8
  --------------------------------------------------------------------------
  MPI_ABORT was invoked on rank 4 in communicator MPI_COMM_WORLD with errorcode -1.

Looks like |
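For context, the failing check is a simple size comparison: in each loop every one of the N MPI ranks writes a chunk_size-byte chunk at its own disjoint offset in a shared file, so afterwards a stat of the file must report chunk_size * N bytes (here 20047 * 8 = 160376, while only 140329 bytes were visible). A minimal sketch of that check, using hypothetical variable and function names rather than the exact write_disjoint source:

```c
/* Minimal sketch of the write_disjoint size check; names are illustrative,
 * not the actual test source. */
#include <mpi.h>
#include <stdio.h>
#include <sys/stat.h>

void check_file_size(const char *fname, int rank, int loop,
                     long chunk_size, int nprocs)
{
        struct stat st;

        if (stat(fname, &st) < 0) {
                perror("stat");
                MPI_Abort(MPI_COMM_WORLD, -1);
        }
        /* Each of the nprocs ranks wrote one chunk_size-byte chunk at a
         * disjoint offset, so the file must be exactly chunk_size * nprocs. */
        if (st.st_size != chunk_size * nprocs) {
                fprintf(stderr,
                        "rank %d, loop %d: invalid file size %ld instead of %ld = %ld * %d\n",
                        rank, loop, (long)st.st_size,
                        chunk_size * nprocs, chunk_size, nprocs);
                MPI_Abort(MPI_COMM_WORLD, -1);
        }
}
```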
| Comments |
| Comment by Jian Yu [ 09/Aug/13 ] |
|
The failure occurred regularly on Lustre b2_4 branch: |
| Comment by Peter Jones [ 12/Aug/13 ] |
|
Lai, could you please look at this failure on b2_4? Thanks, Peter |
| Comment by Lai Siyao [ 15/Aug/13 ] |
|
I couldn't reproduce this on my local test environment with 2.4 and master code; I will look more into the Maloo logs. |
| Comment by Jian Yu [ 16/Aug/13 ] |
|
Hi Lai, I can manually reproduce this failure on Rosso cluster: Lustre build: http://build.whamcloud.com/job/lustre-b2_4/32/ |
| Comment by Lai Siyao [ 17/Aug/13 ] |
|
ahh, yes, after increasing wdisjoint_REP, I can reproduce it now, and I'm looking into debug logs. |
| Comment by Oleg Drokin [ 22/Aug/13 ] |
| Comment by Oleg Drokin [ 26/Aug/13 ] |
|
So, after some digging, the reason for the failure is a lock that somehow survived the truncate to 0 on the client running the rank 0 process, e.g. in this test run: https://maloo.whamcloud.com/test_sets/1bec6152-fcb8-11e2-9222-52540035b04c Sadly, the Maloo report for this failure has a hole in the Lustre debug log right around where the truncate happened, so I cannot see why this lock did not get cancelled. And Maloo is down at the moment, so I cannot try my luck with a different report. |
| Comment by Oleg Drokin [ 26/Aug/13 ] |
|
Actually I just found that the lock in question is created after the truncate.

  00020000:00200000:0.0:1377200002.830736:0:29294:0:(lov_io.c:415:lov_io_iter_init()) shrink: 0 [26754, 40131)
  00000080:00200000:0.0:1377200002.830740:0:29294:0:(lcommon_cl.c:728:ccc_io_one_lock_index()) lock: 2 [6, 9]
  00020000:00010000:0.0:1377200002.830743:0:29294:0:(lov_lock.c:1057:lov_lock_fits_into()) W(2):[41, 61]/W(2):[0, 18446744073709551615] 0 1/1: 1
  00000020:00010000:0.0:1377200002.830744:0:29294:0:(cl_lock.c:514:cl_lock_lookup()) has: W(2):[0, 18446744073709551615](5) need: W(2):[6, 9]: 1

So we matched a lock that was originally requested for pages 41-61 but was granted as a whole-file lock, or so it seems. But while we were deciding this, a parallel process cancels that lock (in fact the cancel was already happening when we did the match):

  00000008:00010000:0.0:1377200002.838156:0:28077:0:(osc_lock.c:1288:osc_lock_flush()) ### lock ffff8800411caa98: 4 pages were written. ns: lustre-OST0003-osc-ffff88005b040000 lock: ffff880042609740/0x695594a2b763446f lrc: 3/0,0 mode: PW/PW res: [0x65f71:0x0:0x0].0 rrc: 1 type: EXT [0->18446744073709551615] (req 167936->253951) flags: 0x420400020000 nid: local remote: 0x433a6203c6e8086c expref: -99 pid: 29295 timeout: 0 lvb_type: 1
  ...
  00010000:00010000:0.0:1377200002.838194:0:28077:0:(ldlm_request.c:1117:ldlm_cli_cancel_local()) ### client-side cancel ns: lustre-OST0003-osc-ffff88005b040000 lock: ffff880042609740/0x695594a2b763446f lrc: 4/0,0 mode: PW/PW res: [0x65f71:0x0:0x0].0 rrc: 1 type: EXT [0->18446744073709551615] (req 167936->253951) flags: 0x428400020000 nid: local remote: 0x433a6203c6e8086c expref: -99 pid: 29295 timeout: 0 lvb_type: 1

Now the original process continues to ask for the actual DLM lock:

  00000008:00010000:0.0:1377200002.838278:0:29294:0:(osc_lock.c:1674:osc_lock_init()) ### lock ffff8800411caa98, osc lock ffff880042618a98, flags 40000000
  ...
  00010000:00010000:0.0:1377200002.838296:0:29294:0:(ldlm_request.c:916:ldlm_cli_enqueue()) ### client-side enqueue START, flags 40000000 ns: lustre-OST0003-osc-ffff88005b040000 lock: ffff880042609540/0x695594a2b7634476 lrc: 3/0,1 mode: --/PW res: [0x65f71:0x0:0x0].0 rrc: 1 type: EXT [167936->253951] (req 167936->253951) flags: 0x0 nid: local remote: 0x0 expref: -99 pid: 29294 timeout: 0 lvb_type: 1

And here we requested the wrong lock! (This log snippet is from a different log file than the one above, but you can see the same picture with the previous log too; just search for that stray lock handle.) |
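For readers following the extent arithmetic: the CLIO messages use page indices while the LDLM messages use byte ranges, and with a standard 4096-byte client page the conversion is start = first_page * 4096, end = (last_page + 1) * 4096 - 1. A quick check (assuming that 4 KiB page size) shows that the enqueued extent corresponds to the stale original page range [41, 61], not the pages [6, 9] the I/O actually needed:

```c
/* Quick check of the page-to-byte extent conversion seen in the logs,
 * assuming a 4096-byte client page size. */
#include <stdio.h>

#define PAGE_SIZE 4096UL

int main(void)
{
        /* Page range [41, 61] from lov_lock_fits_into() ... */
        printf("[%lu -> %lu]\n", 41 * PAGE_SIZE, (61 + 1) * PAGE_SIZE - 1);
        /* ... matches the "req 167936->253951" extent in the ldlm_cli_enqueue
         * message, while the range actually needed, pages [6, 9], would be: */
        printf("[%lu -> %lu]\n", 6 * PAGE_SIZE, (9 + 1) * PAGE_SIZE - 1);
        /* prints [167936 -> 253951] and [24576 -> 40959] */
        return 0;
}
```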
| Comment by Zhenyu Xu [ 28/Aug/13 ] |
|
Patch tracking at http://review.whamcloud.com/7477

Commit message:

  Change subject: LU-3027 clio: properly expand lov parent lock descr
  ......................................................................

  LU-3027 clio: properly expand lov orig lock descr

  * If a client requests a cl_lock with original extent region [A, B], its
    lov->lls_orig saves the [A, B] pair, while the cl_lock's cll_descr has
    been modified to whatever was granted via osc_lock, say [C, D]. When a
    new extent lock request [E, F] arrives from the same client, it uses
    the existing cl_lock's cll_descr to match the request, so even though
    [E, F] is not contained in [A, B], it may be contained in [C, D] and
    this cl_lock could be returned. When the ldlm_lock [C, D] is cancelled,
    the second lock request then enqueues an ldlm_lock using the returned
    cl_lock's original extent, i.e. [A, B], which is wrong.

  * Put the -machinefile option ahead of the other mpiexec options, since
    some versions of mpiexec require global options to be placed before
    local options, and -machinefile is a global option. |
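To illustrate the matching problem described in the first bullet, here is a minimal sketch using hypothetical types and function names (not the real cl_lock/lov_lock interfaces): the lookup tests the new request against the granted descriptor [C, D], but a later re-enqueue after cancellation falls back to the saved original extent [A, B], which need not cover the request.

```c
/* Illustrative sketch of the descriptor mismatch; struct and function names
 * are hypothetical, not the actual Lustre CLIO code. */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

struct ext { uint64_t start, end; };            /* inclusive byte extent */

struct lov_lock_sketch {
        struct ext orig;    /* [A, B]: extent the client originally asked for */
        struct ext descr;   /* [C, D]: extent actually granted (may be wider) */
};

static bool ext_contains(const struct ext *big, const struct ext *small)
{
        return big->start <= small->start && small->end <= big->end;
}

int main(void)
{
        struct lov_lock_sketch lck = {
                .orig  = { 41 * 4096, 62 * 4096 - 1 },  /* pages [41, 61] */
                .descr = { 0, UINT64_MAX },             /* granted whole-file */
        };
        struct ext need = { 6 * 4096, 10 * 4096 - 1 };  /* pages [6, 9] */

        /* The lookup matches the new request against the granted descriptor
         * [C, D], so the existing lock is reused ... */
        printf("match against descr: %d\n", ext_contains(&lck.descr, &need));
        /* ... but after the ldlm lock is cancelled, the re-enqueue falls back
         * to the saved original extent [A, B], which does not cover it: */
        printf("covered by orig:     %d\n", ext_contains(&lck.orig, &need));
        return 0;
}
```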
| Comment by Oleg Drokin [ 28/Aug/13 ] |
|
After discussing with Jinshan, I have these two alternative approaches for a fix:

http://review.whamcloud.com/7481 - I like this one more.
http://review.whamcloud.com/7482 - I fear this one has the potential to waste more memory for sublocks, so I like it less. |
| Comment by Jian Yu [ 29/Aug/13 ] |
|
Hi Bobi and Oleg, while creating/updating the patches, could you please add the following test parameters to the commit message so that the write_disjoint test is verified automatically?

  Test-Parameters: envdefinitions=SLOW=yes,ONLY=write_disjoint \
    clientdistro=el6 serverdistro=el6 clientarch=x86_64 \
    serverarch=x86_64 testlist=parallel-scale |
| Comment by Oleg Drokin [ 30/Aug/13 ] |
|
patch landed to master. |
| Comment by Jian Yu [ 30/Aug/13 ] |
|
Hi Oleg, the patch http://review.whamcloud.com/7481 can be applied to the Lustre b2_4 branch without conflict. Could you please cherry-pick it to that branch? Thanks. |
| Comment by Peter Jones [ 30/Aug/13 ] |
|
Landed for 2.5 |
| Comment by Jian Yu [ 09/Sep/13 ] |
|
Lustre build: http://build.whamcloud.com/job/lustre-b2_4/45/ (2.4.1 RC2)

Since patch http://review.whamcloud.com/7481 was reverted from the Lustre b2_4 branch, the failure occurred again: |
| Comment by Oleg Drokin [ 10/Sep/13 ] |
|
There's a different patch from Jinshan here: http://review.whamcloud.com/7569 |
| Comment by Sarah Liu [ 20/Sep/13 ] |
|
I will work on the NFS testing. parallel-scale-nfsv3 failed; the suite log shows:

  == parallel-scale-nfsv3 test iorssf: iorssf == 20:53:59 (1380167639)
  + 6 * 1024 * 1024 * 4 + 26722304 * 9/10 / 1024 / 1024 / 2 / 2
  free space=26722304, Need: 2 x 2 x 3 GB (blockSize reduced to 3 Gb)
  OPTIONS:
  IOR=/usr/bin/IOR
  ior_THREADS=2
  ior_DURATION=30
  MACHINEFILE=/tmp/parallel-scale-nfs.machines
  wtm-75
  wtm-76
  mkdir: cannot create directory `/mnt/lustre/d0.ior.ssf': Permission denied
  chmod: cannot access `/mnt/lustre/d0.ior.ssf': No such file or directory

The test log shows:

  ** error ** ERROR in aiori-POSIX.c (line 105): cannot open file. ERROR: Not a directory ** exiting **
  ** error ** ERROR in aiori-POSIX.c (line 105): cannot open file. ERROR: Not a directory ** exiting **
  ** error ** ERROR in aiori-POSIX.c (line 105): cannot open file. ERROR: Not a directory ** exiting **
  ** error ** ERROR in aiori-POSIX.c (line 105): cannot open file. ERROR: Not a directory ** exiting ** |
| Comment by Peter Jones [ 24/Sep/13 ] |
|
Patch landed for 2.5.0 |
| Comment by Sarah Liu [ 01/Oct/13 ] |
|
hmm for https://maloo.whamcloud.com/test_sets/9a628942-272b-11e3-88c6-52540035b04c |
| Comment by Jinshan Xiong (Inactive) [ 03/Oct/13 ] |
|
The patch is at http://review.whamcloud.com/7841; please give it a try. |
| Comment by Sarah Liu [ 08/Oct/13 ] |
|
Here is the result of parallel-scale test_write_disjoint: https://maloo.whamcloud.com/test_sessions/5602b272-303b-11e3-b28a-52540035b04c |
| Comment by Peter Jones [ 11/Oct/13 ] |
|
Landed for 2.5 |
| Comment by Jian Yu [ 02/Nov/13 ] |
|
Lustre build: http://build.whamcloud.com/job/lustre-b2_4/47/

The failure still occurred regularly on the Lustre b2_4 branch: |
| Comment by Jian Yu [ 05/Nov/13 ] |
|
Patches http://review.whamcloud.com/7569 and http://review.whamcloud.com/7841 landed on Lustre b2_4 branch. The failure was fixed. |
| Comment by Patrick Farrell (Inactive) [ 07/Nov/13 ] |
|
Jian - Are you saying that when you added 7569 and 7841 to b2_4 you no longer see the failure? I do not see those patches as landed to b2_4 in Gerrit, and Cray is still seeing the closely related |
| Comment by Jian Yu [ 08/Nov/13 ] |
|
Hi Patrick, Lustre b2_4 build #48 contains those two patches: http://build.whamcloud.com/job/lustre-b2_4/48/. And since that build, parallel-scale test write_disjoint has passed: |
| Comment by Patrick Farrell (Inactive) [ 08/Nov/13 ] |
|
Interesting - Thanks, Jian. |