[LU-2767] Interop 2.1.4<->2.4 failure on test suite parallel-scale test_compilebench Created: 06/Feb/13 Updated: 22/Mar/13 Resolved: 22/Mar/13 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.1.4 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Blocker |
| Reporter: | Maloo | Assignee: | Keith Mannthey (Inactive) |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | LB | ||
| Environment: |
2.1.4 server with 2.4 client |
||
| Severity: | 3 |
| Rank (Obsolete): | 6707 |
| Description |
|
This issue was created by maloo for sarah <sarah@whamcloud.com>. It relates to the following test suite run: https://maloo.whamcloud.com/test_sets/82e9c508-6d38-11e2-9f92-52540035b04c. The sub-test test_compilebench failed with the following error:
OSS dmesg shows:
InfLustre: lustre-OST0001: Client lustre-MDT0000-mdtlov_UUID (at 10.10.4.190@tcp) reconnecting
Lustre: 27717:0:(ldlm_lib.c:952:target_handle_connect()) lustre-OST0002: connection from lustre-MDT0000-mdtlov_UUID@10.10.4.190@tcp t0 exp ffff880050d33400 cur 1359779279 last 1359779279
Lustre: lustre-OST0000: received MDS connection from 10.10.4.190@tcp
Lustre: 27719:0:(llog_net.c:168:llog_receptor_accept()) changing the import ffff880050457800 - ffff8800305ca000
Lustre: 27719:0:(llog_net.c:168:llog_receptor_accept()) Skipped 1 previous similar message
Lustre: Skipped 6 previous similar messages
__ratelimit: 44 callbacks suppressed
cannot allocate a tage (259)
cannot allocate a tage (259)
cannot allocate a tage (259) |
| Comments |
| Comment by Keith Mannthey (Inactive) [ 06/Feb/13 ] |
|
From the syslog of the OST (the 2 clients look the same):
Feb 1 18:27:28 client-31vm4 kernel: Lustre: DEBUG MARKER: ./compilebench -D /mnt/lustre/d0.compilebench -i 2 -r 2 --makej
Feb 1 18:27:28 client-31vm4 xinetd[2383]: EXIT: mshell status=0 pid=10351 duration=0(sec)
Feb 1 20:27:49 client-31vm4 xinetd[2383]: START: shell pid=10818 from=::ffff:10.10.4.193
Feb 1 20:27:49 client-31vm4 rshd[10819]: autotest@client-31vm6.lab.whamcloud.com as root: cmd='/home/autotest/.autotest/dynamic_bash/70344784267080+1359779269.10599'
Feb 1 20:27:59 client-31vm4 kernel: 8800759c5df0 ffffffff8104e379
Feb 1 20:27:59 client-31vm4 kernel: ffff880037a925f8 ffff8800759c5fd8 000000000000fb88 ffff880037a925f8
From the MDS:
Feb 1 18:27:28 client-31vm3 kernel: Lustre: DEBUG MARKER: ./compilebench -D /mnt/lustre/d0.compilebench -i 2 -r 2 --makej
Feb 1 18:27:28 client-31vm3 xinetd[2370]: EXIT: mshell status=0 pid=10343 duration=0(sec)
Feb 1 20:27:59 client-31vm3 kernel: Lustre: 3498:0:(client.c:1817:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1359779272/real 1359779272] req@ffff880074c48000 x1425812505459431/t0(0) o400->lustre-OST0000-osc-MDT0000@10.10.4.191@tcp:28/4 lens 192/192 e 0 to 1 dl 1359779279 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
Feb 1 20:27:59 client-31vm3 kernel: Lustre: 3498:0:(client.c:1817:ptlrpc_expire_one_request()) Skipped 6 previous similar messages
A magic 2-hour jump. I really think the time was somehow adjusted on the systems. This may be some sort of TT issue; it is not clear. The console outputs all look clean. |
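One way to check other nodes for the same kind of jump is to scan each syslog for consecutive timestamps that are far apart. The following is only a rough sketch, not part of the test framework: the helper names `parse_stamp` and `find_gaps`, the 30-minute threshold, and the assumed year (syslog lines carry none) are all illustrative choices.

```python
import re
from datetime import datetime, timedelta

# Matches a classic syslog prefix such as "Feb 1 18:27:28 " (no year field).
STAMP = re.compile(r"^([A-Z][a-z]{2})\s+(\d{1,2})\s+(\d{2}:\d{2}:\d{2})\s")


def parse_stamp(line, year=2013):
    """Return the line's timestamp, or None if it has no syslog prefix.

    The year is an assumption supplied by the caller.
    """
    m = STAMP.match(line)
    if m is None:
        return None
    month, day, clock = m.groups()
    return datetime.strptime(f"{year} {month} {day} {clock}", "%Y %b %d %H:%M:%S")


def find_gaps(lines, threshold=timedelta(minutes=30), year=2013):
    """Return (previous, current, delta) for each pair of consecutive
    timestamped lines that are at least `threshold` apart, in either direction."""
    gaps = []
    prev = None
    for line in lines:
        ts = parse_stamp(line, year)
        if ts is None:
            continue  # skip lines without a timestamp prefix
        if prev is not None and abs(ts - prev) >= threshold:
            gaps.append((prev, ts, ts - prev))
        prev = ts
    return gaps


if __name__ == "__main__":
    ost_log = [
        "Feb 1 18:27:28 client-31vm4 xinetd[2383]: EXIT: mshell status=0 pid=10351 duration=0(sec)",
        "Feb 1 20:27:49 client-31vm4 xinetd[2383]: START: shell pid=10818 from=::ffff:10.10.4.193",
    ]
    for before, after, delta in find_gaps(ost_log):
        print(f"{before} -> {after}: gap of {delta}")
```

A gap reported this way only shows that nothing was logged in between; it cannot by itself tell a clock adjustment apart from a node that was simply hung for that period.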
| Comment by Keith Mannthey (Inactive) [ 07/Feb/13 ] |
|
It seems like 2.1.4 needs this patch: http://review.whamcloud.com/5053
Can you retest with http://review.whamcloud.com/5053 applied? |
| Comment by Keith Mannthey (Inactive) [ 11/Feb/13 ] |
|
Sarah, |
| Comment by Peter Jones [ 21/Feb/13 ] |
|
Should be fixed in 2.1.5. Please reopen if this still occurs after we switch testing to 2.1.5. |
| Comment by Peter Jones [ 22/Mar/13 ] |
|
duplicate of |