[LU-3027] Failure on test suite parallel-scale test_write_disjoint: invalid file size 140329 instead of 160376 = 20047 * 8 Created: 25/Mar/13  Updated: 08/Nov/13  Resolved: 11/Oct/13

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.4.0, Lustre 2.4.1, Lustre 2.5.0
Fix Version/s: Lustre 2.4.2

Type: Bug Priority: Blocker
Reporter: Maloo Assignee: Oleg Drokin
Resolution: Fixed Votes: 0
Labels: None

Issue Links:
Duplicate
is duplicated by LU-701 parallel-scale test_write_disjoint fa... Resolved
Related
is related to LU-3889 LBUG: (osc_lock.c:497:osc_lock_upcal... Resolved
is related to LU-3874 parallel-scale-nfsv3 test_iorfpp: ERR... Closed
Severity: 3
Rank (Obsolete): 7390

 Description   

This issue was created by maloo for sarah <sarah@whamcloud.com>

This issue relates to the following test suite run: https://maloo.whamcloud.com/test_sets/2ed1fef2-94bd-11e2-93c6-52540035b04c.

The sub-test test_write_disjoint failed with the following error:

write_disjoint failed! 1

The test log shows:

librdmacm: Fatal: no RDMA devices found
librdmacm: Fatal: no RDMA devices found
librdmacm: Fatal: no RDMA devices found
librdmacm: Fatal: no RDMA devices found
librdmacm: Fatal: no RDMA devices found
librdmacm: Fatal: no RDMA devices found
librdmacm: Fatal: no RDMA devices found
loop 0: chunk_size 103399
[client-27vm6.lab.whamcloud.com:00935] 7 more processes have sent help message help-mpi-btl-base.txt / btl:no-nics
[client-27vm6.lab.whamcloud.com:00935] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
loop 79: chunk_size 71702, file size was 573616
rank 2, loop 80: invalid file size 140329 instead of 160376 = 20047 * 8
loop 79: chunk_size 71702, file size was 573616
rank 4, loop 80: invalid file size 140329 instead of 160376 = 20047 * 8
loop 79: chunk_size 71702, file size was 573616
rank 6, loop 80: invalid file size 140329 instead of 160376 = 20047 * 8
loop 79: chunk_size 71702, file size was 573616
rank 0, loop 80: invalid file size 140329 instead of 160376 = 20047 * 8
--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 4 in communicator MPI_COMM_WORLD 
with errorcode -1.

Looks like LU-2453 is a similar issue seen on the b2_1 branch.
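
For context, a minimal sketch (plain C under MPI, not the actual test source) of the kind of check write_disjoint performs; the file path below is hypothetical and the real test randomizes chunk_size each loop. Each loop truncates the shared file to zero, every rank writes its own disjoint chunk, and every rank then expects the file size to be nproc * chunk_size (8 * 20047 = 160376 in the failure above).

#include <mpi.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/stat.h>

int main(int argc, char **argv)
{
    int rank, nproc;
    char buf[20047];
    const char *file = "/mnt/lustre/write_disjoint"; /* hypothetical path */

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nproc);

    for (int loop = 0; loop < 100; loop++) {
        off_t chunk_size = sizeof(buf);      /* the real test picks a random size */
        int fd = open(file, O_RDWR | O_CREAT, 0666);

        if (rank == 0)
            ftruncate(fd, 0);                /* start each loop from size 0 */
        MPI_Barrier(MPI_COMM_WORLD);

        /* every rank writes its own disjoint chunk */
        memset(buf, 'a' + rank, sizeof(buf));
        pwrite(fd, buf, chunk_size, (off_t)rank * chunk_size);
        MPI_Barrier(MPI_COMM_WORLD);

        /* every rank must now see size == nproc * chunk_size */
        struct stat st;
        fstat(fd, &st);
        if (st.st_size != (off_t)nproc * chunk_size) {
            fprintf(stderr, "rank %d, loop %d: invalid file size %lld "
                    "instead of %lld = %lld * %d\n", rank, loop,
                    (long long)st.st_size, (long long)(nproc * chunk_size),
                    (long long)chunk_size, nproc);
            MPI_Abort(MPI_COMM_WORLD, -1);
        }
        close(fd);
    }
    MPI_Finalize();
    return 0;
}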



 Comments   
Comment by Jian Yu [ 09/Aug/13 ]

The failure occurred regularly on the Lustre b2_4 branch:
https://maloo.whamcloud.com/test_sets/b571c2f0-fdac-11e2-9fd5-52540035b04c
https://maloo.whamcloud.com/test_sets/d70a40f0-fcd9-11e2-b90c-52540035b04c
https://maloo.whamcloud.com/test_sets/dc877024-fcc1-11e2-9fdb-52540035b04c
https://maloo.whamcloud.com/test_sets/1bec6152-fcb8-11e2-9222-52540035b04c

Comment by Peter Jones [ 12/Aug/13 ]

Lai

Could you please look at this failure on b2_4?

Thanks

Peter

Comment by Lai Siyao [ 15/Aug/13 ]

I couldn't reproduce this on my local test environment with either the 2.4 or master code; I will look further into the Maloo logs.

Comment by Jian Yu [ 16/Aug/13 ]

Hi Lai,

I can manually reproduce this failure on Rosso cluster:

Lustre build: http://build.whamcloud.com/job/lustre-b2_4/32/
Maloo report: https://maloo.whamcloud.com/test_sets/0d0b5720-0676-11e3-87de-52540035b04c

Comment by Lai Siyao [ 17/Aug/13 ]

ahh, yes, after increasing wdisjoint_REP, I can reproduce it now, and I'm looking into debug logs.

Comment by Oleg Drokin [ 22/Aug/13 ]

This issue is also present in master:
https://maloo.whamcloud.com/sub_tests/query?commit=Update+results&page=1&sub_test[query_bugs]=&sub_test[status]=FAIL&sub_test[sub_test_script_id]=6c1d247c-55d0-11e0-bb3d-52540025f9af&test_node[architecture_type_id]=&test_node[distribution_type_id]=&test_node[file_system_type_id]=&test_node[lustre_branch_id]=&test_node[os_type_id]=&test_node_network[network_type_id]=&test_session[query_date]=&test_session[query_recent_period]=&test_session[test_group]=&test_session[test_host]=&test_session[user_id]=&test_set[test_set_script_id]=b10ed7ea-55b4-11e0-bb3d-52540025f9af&utf8=%E2%9C%93

Comment by Oleg Drokin [ 26/Aug/13 ]

So, after some digging, the reason for the failure is a lock that somehow survived a truncate to 0 on the client running the rank 0 process.
So later on, when a glimpse is done, this lock happens to have the highest offset and is consulted, yet it is on the wrong node.

E.g. in this test run: https://maloo.whamcloud.com/test_sets/1bec6152-fcb8-11e2-9222-52540035b04c
We can see that the last client to write held a lock ffff88006a606000/0xab2c500485c68cac covering offsets [20480->184319], with the highest file size at 63248.
But when the other (rank 0) client does a glimpse to determine the file size, there is another lock covering [184320->EOF] with handle 0x39ce525e4ad00966 that is a leftover from a previous iteration. This is the higher-offset lock, so it gets the glimpse, but this client's highest write is only at offset 55342.
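
A toy illustration (plain C, not Lustre code; the client labels are made up, the extents and sizes are the ones quoted above) of why that stale lock breaks the size glimpse: the size is obtained from the client holding the extent lock with the highest end offset, so if that lock is a leftover on a client that wrote less, the reported size is too small.

#include <stdio.h>

struct ext_lock {
    const char   *holder;        /* which client holds the lock */
    unsigned long start, end;    /* extent covered by the lock */
    unsigned long known_size;    /* highest file size that client knows about */
};

int main(void)
{
    struct ext_lock locks[] = {
        { "writing client",             20480,  184319,          63248 },
        { "rank 0 client (stale lock)", 184320, ~0UL /* EOF */,  55342 },
    };
    int n = sizeof(locks) / sizeof(locks[0]), best = 0;

    /* glimpse: consult the holder of the highest-offset lock */
    for (int i = 1; i < n; i++)
        if (locks[i].end > locks[best].end)
            best = i;

    printf("glimpse consults the %s, which reports size %lu (too small)\n",
           locks[best].holder, locks[best].known_size);
    return 0;
}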

Sadly the Maloo report for this failure has a hole in the Lustre debug log right around where the truncate happened, so I cannot see why this lock did not get cancelled. And Maloo is down at the moment, so I cannot try my luck with a different report.

Comment by Oleg Drokin [ 26/Aug/13 ]

Actually I just found that the lock in question is created after the truncate.
Apparently the bug is that we match the wrong lock:

00020000:00200000:0.0:1377200002.830736:0:29294:0:(lov_io.c:415:lov_io_iter_init()) shrink: 0 [26754, 40131)
00000080:00200000:0.0:1377200002.830740:0:29294:0:(lcommon_cl.c:728:ccc_io_one_lock_index()) lock: 2 [6, 9]
00020000:00010000:0.0:1377200002.830743:0:29294:0:(lov_lock.c:1057:lov_lock_fits_into()) W(2):[41, 61]/W(2):[0, 18446744073709551615] 0 1/1: 1
00000020:00010000:0.0:1377200002.830744:0:29294:0:(cl_lock.c:514:cl_lock_lookup()) has: W(2):[0, 18446744073709551615](5) need: W(2):[6, 9]: 1

So we matched a lock that was originally for pages 41-61 but was granted as a whole-file lock, or so it seems.

But while we were deciding this, a parallel process was cancelling this lock (in fact the cancel was already happening when we did the match):

00000008:00010000:0.0:1377200002.838156:0:28077:0:(osc_lock.c:1288:osc_lock_flush()) ### lock ffff8800411caa98: 4 pages were written.
 ns: lustre-OST0003-osc-ffff88005b040000 lock: ffff880042609740/0x695594a2b763446f lrc: 3/0,0 mode: PW/PW res: [0x65f71:0x0:0x0].0 rrc: 1 type: EXT [0->18446744073709551615] (req 167936->253951) flags: 0x420400020000 nid: local remote: 0x433a6203c6e8086c expref: -99 pid: 29295 timeout: 0 lvb_type: 1
...
00010000:00010000:0.0:1377200002.838194:0:28077:0:(ldlm_request.c:1117:ldlm_cli_cancel_local()) ### client-side cancel ns: lustre-OST0003-osc-ffff88005b040000 lock: ffff880042609740/0x695594a2b763446f lrc: 4/0,0 mode: PW/PW res: [0x65f71:0x0:0x0].0 rrc: 1 type: EXT [0->18446744073709551615] (req 167936->253951) flags: 0x428400020000 nid: local remote: 0x433a6203c6e8086c expref: -99 pid: 29295 timeout: 0 lvb_type: 1

Now the original process continues to ask for the actual dlm lock:

00000008:00010000:0.0:1377200002.838278:0:29294:0:(osc_lock.c:1674:osc_lock_init()) ### lock ffff8800411caa98, osc lock ffff880042618a98, flags 40000000
...
00010000:00010000:0.0:1377200002.838296:0:29294:0:(ldlm_request.c:916:ldlm_cli_enqueue()) ### client-side enqueue START, flags 40000000
 ns: lustre-OST0003-osc-ffff88005b040000 lock: ffff880042609540/0x695594a2b7634476 lrc: 3/0,1 mode: --/PW res: [0x65f71:0x0:0x0].0 rrc: 1 type: EXT [167936->253951] (req 167936->253951) flags: 0x0 nid: local remote: 0x0 expref: -99 pid: 29294 timeout: 0 lvb_type: 1

And here we requested the wrong lock!

(This log snippet is from a different log file than the one above, but you can see the same picture in the previous log too; just search for that stray lock handle.)

Comment by Zhenyu Xu [ 28/Aug/13 ]

Patch tracking at http://review.whamcloud.com/7477

Commit message:
Change subject: LU-3027 clio: properly expand lov parent lock descr
...................................................................... 

LU-3027 clio: properly expand lov orig lock descr 

* If a client requests a cl_lock with original ext region [A, B],
its lov->lls_orig saves the [A, B] pair, while its cl_lock's
cll_descr has been modified to what it gets from osc_lock, such
as [C, D].

When a new ext lock request [E, F] arrives from the same client, it uses
the existing cl_lock's cll_descr to match the request; even though
[E, F] is not contained by [A, B], it could be contained by
[C, D], and this cl_lock could be returned.

When ldlm_lock [C, D] is cancelled, the second lock request would
enqueue an ldlm_lock using the returned cl_lock's original ext, i.e.
[A, B], which is wrong.

* Put the -machinefile option ahead of other mpiexec options, since some
versions of mpiexec require global options to be placed before their local
options, and -machinefile is a global option.
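
A toy model (plain C, not Lustre code) of the mismatch described in the commit message, using the page numbers from the earlier log excerpt: the cached lock is matched against its expanded extent [C, D], but the re-enqueue uses the original extent [A, B], which does not cover the new request [E, F].

#include <stdbool.h>
#include <stdio.h>

struct extent { unsigned long start, end; };

static bool contains(struct extent big, struct extent small)
{
    return big.start <= small.start && small.end <= big.end;
}

int main(void)
{
    struct extent orig    = { 41, 61 };   /* [A, B]: extent the client originally asked for */
    struct extent granted = { 0, ~0UL };  /* [C, D]: extent the lock was actually granted as */
    struct extent request = { 6, 9 };     /* [E, F]: the new lock request */

    /* the match is done against the expanded (granted) extent ... */
    bool matched = contains(granted, request);

    /* ... but when the underlying ldlm lock is cancelled, the re-enqueue
     * used the original extent, which does not cover the new request */
    bool covered = contains(orig, request);

    printf("matched=%d, covered after re-enqueue=%d\n", matched, covered);
    return 0;
}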

Comment by Oleg Drokin [ 28/Aug/13 ]

After discussing with Jinshan I have these two alternative approaches for a fix:

http://review.whamcloud.com/7481 - I like this one more

http://review.whamcloud.com/7482 - this one has the potential to waste more memory for sublocks, I fear, so I don't like it as much.

Comment by Jian Yu [ 29/Aug/13 ]

Hi Bobi and Oleg,

While creating/updating the patches, could you please add the following test parameters to the commit message, so that the write_disjoint test is verified automatically?

Test-Parameters: envdefinitions=SLOW=yes,ONLY=write_disjoint \
clientdistro=el6 serverdistro=el6 clientarch=x86_64 \
serverarch=x86_64 testlist=parallel-scale

Comment by Oleg Drokin [ 30/Aug/13 ]

patch landed to master.

Comment by Jian Yu [ 30/Aug/13 ]

Hi Oleg,

The patch http://review.whamcloud.com/7481 can be applied to Lustre b2_4 branch without conflict. Could you please cherry-pick it to Lustre b2_4 branch? Thanks.

Comment by Peter Jones [ 30/Aug/13 ]

Landed for 2.5

Comment by Jian Yu [ 09/Sep/13 ]

Lustre build: http://build.whamcloud.com/job/lustre-b2_4/45/ (2.4.1 RC2)

Since patch http://review.whamcloud.com/7481 was reverted from Lustre b2_4 branch, the failure occurred again:
https://maloo.whamcloud.com/test_sets/51ed9046-1919-11e3-aa6c-52540035b04c

Comment by Oleg Drokin [ 10/Sep/13 ]

There's a different patch from Jinshan here: http://review.whamcloud.com/7569
This works well in my testing, but it would be great if somebody could do NFS testing of it (note that it requires the other LU-3027 patch to be reverted, which is already true for b2_4 but not for master).

Comment by Sarah Liu [ 20/Sep/13 ]

I will work on the NFS testing.
Here is the patch including 7569 and the revert of the LU-3027 patch: http://review.whamcloud.com/#/c/7733/

parallel-scale-nfsv3 failed as follows.

The suite log shows:

== parallel-scale-nfsv3 test iorssf: iorssf == 20:53:59 (1380167639)
+ 6 * 1024 * 1024 * 4
+ 26722304 * 9/10 / 1024 / 1024 / 2 / 2
free space=26722304, Need: 2 x 2 x 3 GB
(blockSize reduced to 3 Gb)
OPTIONS:
IOR=/usr/bin/IOR
ior_THREADS=2
ior_DURATION=30
MACHINEFILE=/tmp/parallel-scale-nfs.machines
wtm-75
wtm-76
mkdir: cannot create directory `/mnt/lustre/d0.ior.ssf': Permission denied
chmod: cannot access `/mnt/lustre/d0.ior.ssf': No such file or directory

The test log shows:

** error **
ERROR in aiori-POSIX.c (line 105): cannot open file.
ERROR: Not a directory
** exiting **
** error **
ERROR in aiori-POSIX.c (line 105): cannot open file.
ERROR: Not a directory
** exiting **
** error **
ERROR in aiori-POSIX.c (line 105): cannot open file.
ERROR: Not a directory
** exiting **
** error **
ERROR in aiori-POSIX.c (line 105): cannot open file.
ERROR: Not a directory
** exiting **

Comment by Peter Jones [ 24/Sep/13 ]

Patch landed for 2.5.0

Comment by Sarah Liu [ 01/Oct/13 ]

Hmm, for LU-3027, tag-2.4.93 (http://build.whamcloud.com/job/lustre-master/1687/) should have the fix, but I still hit this error:

https://maloo.whamcloud.com/test_sets/9a628942-272b-11e3-88c6-52540035b04c

Comment by Jinshan Xiong (Inactive) [ 03/Oct/13 ]

patch is at: http://review.whamcloud.com/7841, please give it a try.

Comment by Sarah Liu [ 08/Oct/13 ]

Here is the result of parallel-scale test_write_disjoint:

https://maloo.whamcloud.com/test_sessions/5602b272-303b-11e3-b28a-52540035b04c

Comment by Peter Jones [ 11/Oct/13 ]

Landed for 2.5

Comment by Jian Yu [ 02/Nov/13 ]

Lustre build: http://build.whamcloud.com/job/lustre-b2_4/47/

The failure still occurred regularly on the Lustre b2_4 branch:
https://maloo.whamcloud.com/test_sets/0039b106-4332-11e3-9490-52540035b04c
https://maloo.whamcloud.com/test_sets/762409ba-4333-11e3-8676-52540035b04c

Comment by Jian Yu [ 05/Nov/13 ]

Patches http://review.whamcloud.com/7569 and http://review.whamcloud.com/7841 landed on the Lustre b2_4 branch. The failure was fixed.

Comment by Patrick Farrell (Inactive) [ 07/Nov/13 ]

Jian - Are you saying that when you added 7569 and 7841 to b2_4 you no longer see the failure?

I do not see those patches as landed to b2_4 in Gerrit, and Cray is still seeing the closely related LU-3889 with both patches landed in 2.4.

Comment by Jian Yu [ 08/Nov/13 ]

Hi Patrick,

Lustre b2_4 build #48 contains those two patches: http://build.whamcloud.com/job/lustre-b2_4/48/. Since that build, the parallel-scale write_disjoint test has passed:
https://maloo.whamcloud.com/test_sets/35b41124-45a7-11e3-b22a-52540035b04c
https://maloo.whamcloud.com/test_sets/484f9162-4590-11e3-8713-52540035b04c
https://maloo.whamcloud.com/test_sets/c5f43484-4602-11e3-b5e8-52540035b04c
https://maloo.whamcloud.com/test_sets/1c537816-47f3-11e3-bc81-52540035b04c

Comment by Patrick Farrell (Inactive) [ 08/Nov/13 ]

Interesting - Thanks, Jian.
