[LU-3749] Failure on test suite replay-dual test_8: test_8 failed with 2
Created: 13/Aug/13  Updated: 02/Oct/13  Resolved: 02/Oct/13

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.5.0
Fix Version/s: Lustre 2.5.0

Type: Bug
Priority: Critical
Reporter: Maloo
Assignee: Mikhail Pershin
Resolution: Fixed
Votes: 0
Labels: None
Environment: server and client: tag-2.4.90, RHEL6
Issue Links:
Related
is related to LU-1012 replay-vbr: test_1b failure. Resolved
Severity: 3
Rank (Obsolete): 9672

Description

This issue was created by maloo for sarah <sarah@whamcloud.com>

This issue relates to the following test suite run: http://maloo.whamcloud.com/test_sets/fd1709ea-02c9-11e3-b384-52540035b04c.

The sub-test test_8 failed with the following error:

test_8 failed with 2

MDS console:

22:33:28:Lustre: DEBUG MARKER: == replay-dual test 8: replay of resent request == 22:33:24 (1376199204)
22:33:28:Lustre: DEBUG MARKER: sync; sync; sync
22:33:28:Lustre: DEBUG MARKER: /usr/sbin/lctl --device lustre-MDT0000 notransno
22:33:28:Lustre: DEBUG MARKER: /usr/sbin/lctl --device lustre-MDT0000 readonly
22:33:28:LustreError: 14713:0:(osd_handler.c:1194:osd_ro()) *** setting lustre-MDT0000 read-only ***
22:33:28:LustreError: 14713:0:(osd_handler.c:1194:osd_ro()) Skipped 3 previous similar messages
22:33:28:Turning device dm-0 (0xfd00000) read-only
22:33:28:Lustre: DEBUG MARKER: /usr/sbin/lctl mark mds1 REPLAY BARRIER on lustre-MDT0000
22:33:28:Lustre: DEBUG MARKER: mds1 REPLAY BARRIER on lustre-MDT0000
22:33:28:Lustre: DEBUG MARKER: lctl set_param fail_loc=0x119
22:33:29:Lustre: *** cfs_fail_loc=119, val=2147483648***
22:33:29:LustreError: 14098:0:(ldlm_lib.c:2409:target_send_reply_msg()) @@@ dropping reply  req@ffff88005c311c00 x1443048921768268/t498216206346(0) o36->c0662422-4c9b-ce8c-77e9-eb8caf0a37b9@10.10.4.206@tcp:0/0 lens 496/448 e 0 to 0 dl 1376199231 ref 1 fl Interpret:/0/0 rc 0/0
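
The raw numbers in the console output above are easier to read once decoded. The following is a minimal sketch, not part of the original report. It converts the test start time and the request deadline (`dl`) from epoch seconds to UTC, prints the `val` from the `cfs_fail_loc` line in hex, and splits the transaction number into 32-bit halves. Treating the upper half as an epoch and the lower half as a counter is an assumption made here for illustration; the log itself does not say so.

```python
from datetime import datetime, timezone

# Values copied verbatim from the MDS console log above.
TEST_START = 1376199204      # from the "replay-dual test 8" DEBUG MARKER
REQ_DEADLINE = 1376199231    # "dl" field of the dropped-reply request
FAIL_VAL = 2147483648        # "val" from the cfs_fail_loc line
TRANSNO = 498216206346       # "t" field of the dropped-reply request

start = datetime.fromtimestamp(TEST_START, tz=timezone.utc)
deadline = datetime.fromtimestamp(REQ_DEADLINE, tz=timezone.utc)

# Pure arithmetic: show the fail value as a bit pattern.
val_hex = hex(FAIL_VAL)

# Assumption (not stated in the log): split the 64-bit transno into
# an upper and a lower 32-bit half.
upper, lower = TRANSNO >> 32, TRANSNO & 0xFFFFFFFF

print("test start (UTC):", start.isoformat())
print("request deadline:", (deadline - start).total_seconds(), "s after start")
print("fail val:", val_hex)
print("transno halves:", upper, lower)
```

The console timestamps (22:33:24) are in the test node's local time, so they differ from the UTC value printed here by the node's timezone offset.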


Comments
Comment by Sarah Liu [ 13/Aug/13 ]

SLES11 SP2 client also hit this issue:
https://maloo.whamcloud.com/test_sets/85b954d8-029d-11e3-b384-52540035b04c

Comment by Keith Mannthey (Inactive) [ 13/Aug/13 ]

The client says this:

22:34:35:LustreError: 166-1: MGC10.10.4.208@tcp: Connection to MGS (at 10.10.4.208@tcp) was lost; in progress operations using this service will fail
22:34:35:Lustre: Evicted from MGS (at 10.10.4.208@tcp) after server handle changed from 0x2a56bfb73f955d8a to 0x2a56bfb73f95659b
22:35:07:Lustre: 19509:0:(client.c:2652:ptlrpc_replay_interpret()) @@@ Version mismatch during replay
22:35:07:  req@ffff88007c91a800 x1443048921768268/t498216206346(498216206346) o36->lustre-MDT0000-mdc-ffff880062d92800@10.10.4.208@tcp:12/10 lens 496/416 e 0 to 0 dl 1376199343 ref 2 fl Interpret:R/4/0 rc -75/-75
22:35:07:Lustre: lustre-MDT0000-mdc-ffff880065203400: Connection restored to lustre-MDT0000 (at 10.10.4.208@tcp)
22:35:07:Lustre: Skipped 6 previous similar messages
22:35:48:Lustre: 19509:0:(import.c:1209:completed_replay_interpret()) lustre-MDT0000-mdc-ffff880062d92800: version recovery fails, reconnecting

I wonder why we get a "Version mismatch during replay" message. The client and server appear to be running the same version.
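
To check that the request failing replay on the client is the same one whose reply the MDS dropped, the two request lines can be compared field by field. The following is a minimal sketch, not taken from the ticket or from any Lustre tool. The helper names (`parse_req`, `same_request`, `describe_rc`) are hypothetical, and the regular expression only covers the fields visible in these two log lines. The errno table is a one-entry lookup for the Linux value 75 (EOVERFLOW), hard-coded so that the result does not depend on the platform running the script.

```python
import re
from typing import NamedTuple


class ReqInfo(NamedTuple):
    xid: int      # "x" field: request id
    transno: int  # "t" field: transaction number
    opcode: int   # "o" field: request opcode
    rc: int       # first value of the "rc a/b" pair


# Matches only the fields needed here; other fields are skipped.
REQ_RE = re.compile(
    r"x(?P<xid>\d+)/t(?P<transno>\d+)\(\d+\) o(?P<opc>\d+)->"
    r".* rc (?P<rc>-?\d+)/-?\d+"
)

# Linux errno value relevant to this log (hard-coded on purpose).
LINUX_ERRNO = {75: "EOVERFLOW"}


def parse_req(line: str) -> ReqInfo:
    """Extract xid, transno, opcode and rc from a request debug line."""
    m = REQ_RE.search(line)
    if m is None:
        raise ValueError("not a request debug line: %r" % line)
    return ReqInfo(int(m["xid"]), int(m["transno"]),
                   int(m["opc"]), int(m["rc"]))


def same_request(a: ReqInfo, b: ReqInfo) -> bool:
    """Two lines describe the same request if xid, transno, opcode match."""
    return (a.xid, a.transno, a.opcode) == (b.xid, b.transno, b.opcode)


def describe_rc(rc: int) -> str:
    """Name a negative return code using the small Linux errno table."""
    return LINUX_ERRNO.get(-rc, "unknown") if rc < 0 else "OK"


# Request fields copied from the MDS and client console logs in this ticket.
SERVER_LINE = (
    "req@ffff88005c311c00 x1443048921768268/t498216206346(0) "
    "o36->c0662422-4c9b-ce8c-77e9-eb8caf0a37b9@10.10.4.206@tcp:0/0 "
    "lens 496/448 e 0 to 0 dl 1376199231 ref 1 fl Interpret:/0/0 rc 0/0"
)
CLIENT_LINE = (
    "req@ffff88007c91a800 x1443048921768268/t498216206346(498216206346) "
    "o36->lustre-MDT0000-mdc-ffff880062d92800@10.10.4.208@tcp:12/10 "
    "lens 496/416 e 0 to 0 dl 1376199343 ref 2 fl Interpret:R/4/0 rc -75/-75"
)

server = parse_req(SERVER_LINE)
client = parse_req(CLIENT_LINE)
print("same request:", same_request(server, client))
print("client rc:", client.rc, describe_rc(client.rc))
```

The xid, transaction number and opcode are identical in both lines, so the replay that returned -75 on the client is the request whose reply was dropped on the MDS after `fail_loc` was set.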

Comment by Oleg Drokin [ 14/Aug/13 ]

Keith, it's a file data version mismatch.

Comment by Jodi Levi (Inactive) [ 14/Aug/13 ]

Mike,
Would you be able to comment on this one?
Thank you!

Comment by Andreas Dilger [ 24/Sep/13 ]

Mike, any chance to look at this?

Comment by Mikhail Pershin [ 26/Sep/13 ]

Yes, I am looking at it.

Comment by Mikhail Pershin [ 27/Sep/13 ]

http://review.whamcloud.com/7786 - patch to fix this problem

Comment by Peter Jones [ 02/Oct/13 ]

Landed for 2.5.0
