[LU-2767] Interop 2.1.4<->2.4 failure on test suite parallel-scale test_compilebench Created: 06/Feb/13  Updated: 22/Mar/13  Resolved: 22/Mar/13

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.1.4
Fix Version/s: None

Type: Bug Priority: Blocker
Reporter: Maloo Assignee: Keith Mannthey (Inactive)
Resolution: Fixed Votes: 0
Labels: LB
Environment:

2.1.4 server with 2.4 client


Severity: 3
Rank (Obsolete): 6707

 Description   

This issue was created by maloo for sarah <sarah@whamcloud.com>

This issue relates to the following test suite run: https://maloo.whamcloud.com/test_sets/82e9c508-6d38-11e2-9f92-52540035b04c.

The sub-test test_compilebench failed with the following error:

test failed to respond and timed out

OSS dmesg shows:

InfLustre: lustre-OST0001: Client lustre-MDT0000-mdtlov_UUID (at 10.10.4.190@tcp) reconnecting
Lustre: 27717:0:(ldlm_lib.c:952:target_handle_connect()) lustre-OST0002: connection from lustre-MDT0000-mdtlov_UUID@10.10.4.190@tcp t0 exp ffff880050d33400 cur 1359779279 last 1359779279
Lustre: lustre-OST0000: received MDS connection from 10.10.4.190@tcp
Lustre: 27719:0:(llog_net.c:168:llog_receptor_accept()) changing the import ffff880050457800 - ffff8800305ca000
Lustre: 27719:0:(llog_net.c:168:llog_receptor_accept()) Skipped 1 previous similar message
Lustre: Skipped 6 previous similar messages
__ratelimit: 44 callbacks suppressed
cannot allocate a tage (259)
cannot allocate a tage (259)
cannot allocate a tage (259)


 Comments   
Comment by Keith Mannthey (Inactive) [ 06/Feb/13 ]

From the syslog of the OST (the 2 clients look the same):

Feb  1 18:27:28 client-31vm4 kernel: Lustre: DEBUG MARKER: ./compilebench -D /mnt/lustre/d0.compilebench -i 2 -r 2 --makej
Feb  1 18:27:28 client-31vm4 xinetd[2383]: EXIT: mshell status=0 pid=10351 duration=0(sec)
Feb  1 20:27:49 client-31vm4 xinetd[2383]: START: shell pid=10818 from=::ffff:10.10.4.193
Feb  1 20:27:49 client-31vm4 rshd[10819]: autotest@client-31vm6.lab.whamcloud.com as root: cmd='/home/autotest/.autotest/dynamic_bash/70344784267080+1359779269.10599'
Feb  1 20:27:59 client-31vm4 kernel: 8800759c5df0 ffffffff8104e379
Feb  1 20:27:59 client-31vm4 kernel: ffff880037a925f8 ffff8800759c5fd8 000000000000fb88 ffff880037a925f8

From the MDS:

Feb  1 18:27:28 client-31vm3 kernel: Lustre: DEBUG MARKER: ./compilebench -D /mnt/lustre/d0.compilebench -i 2 -r 2 --makej
Feb  1 18:27:28 client-31vm3 xinetd[2370]: EXIT: mshell status=0 pid=10343 duration=0(sec)
Feb  1 20:27:59 client-31vm3 kernel: Lustre: 3498:0:(client.c:1817:ptlrpc_expire_one_request()) @@@ Request  sent has timed out for slow reply: [sent 1359779272/real 1359779272]  req@ffff880074c48000 x1425812505459431/t0(0) o400->lustre-OST0000-osc-MDT0000@10.10.4.191@tcp:28/4 lens 192/192 e 0 to 1 dl 1359779279 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
Feb  1 20:27:59 client-31vm3 kernel: Lustre: 3498:0:(client.c:1817:ptlrpc_expire_one_request()) Skipped 6 previous similar messages

A magic 2-hour jump. I really think the time was somehow adjusted on the systems. This may be some sort of TT issue; it is not clear. The console outputs all look clean.
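
As a quick sanity check on the size of that jump, the sketch below computes the gap between the two OST syslog timestamps quoted above and converts the epoch values from the MDS ptlrpc_expire_one_request() line to UTC. The year 2013 is an assumption taken from the ticket date (syslog lines carry no year), and the variable names are illustrative only.

```python
from datetime import datetime, timezone

# OST syslog timestamps from the excerpt above.
# Assumption: year 2013, taken from the ticket date; syslog omits the year.
last_before_gap = datetime(2013, 2, 1, 18, 27, 28)   # xinetd EXIT line
first_after_gap = datetime(2013, 2, 1, 20, 27, 49)   # xinetd START line

gap = first_after_gap - last_before_gap
gap_seconds = int(gap.total_seconds())
print("gap in syslog:", gap, f"({gap_seconds} s)")

# Epoch values from the MDS ptlrpc_expire_one_request() message.
sent_epoch = 1359779272      # "sent 1359779272"
deadline_epoch = 1359779279  # "dl 1359779279"

sent_utc = datetime.fromtimestamp(sent_epoch, tz=timezone.utc)
deadline_utc = datetime.fromtimestamp(deadline_epoch, tz=timezone.utc)
print("request sent (UTC):    ", sent_utc.isoformat())
print("request deadline (UTC):", deadline_utc.isoformat())
print("deadline - sent:", deadline_epoch - sent_epoch, "s")
```

The gap comes out to 7221 seconds (2:00:21), and the request deadline is 7 seconds after the send time.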

Comment by Keith Mannthey (Inactive) [ 07/Feb/13 ]

It seems like 2.1.4 needs this patch: http://review.whamcloud.com/5053. LU-1018 resolves this same issue for the other branches, but the 2.1 patch did not land.

Can you retest with http://review.whamcloud.com/5053 ?

Comment by Keith Mannthey (Inactive) [ 11/Feb/13 ]

Sarah,
It seems http://review.whamcloud.com/5053 has been landed for 2.1. Can you retest and let me know if you still have the issue?

Comment by Peter Jones [ 21/Feb/13 ]

Should be fixed in 2.1.5. Please reopen if this still occurs after we switch testing to 2.1.5.

Comment by Peter Jones [ 22/Mar/13 ]

duplicate of LU-1018

Generated at Sat Feb 10 01:28:01 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.