[LU-3279] Interop 2.3.0 <-> 2.4 failure on test suite lustre-rsync-test test_7: Failure in replication; differences found Created: 06/May/13  Updated: 26/Jun/13  Resolved: 20/May/13

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.4.0
Fix Version/s: Lustre 2.4.0, Lustre 2.1.6

Type: Bug
Priority: Major
Reporter: Maloo
Assignee: Mikhail Pershin
Resolution: Fixed
Votes: 0
Labels: MB
Environment:

client: 2.3.0
server: http://review.whamcloud.com/#change,6252 patch set 5
client uses the t-f.sh from server


Issue Links:
Related
is related to LU-3190 Interop 2.3.0<->2.4 Failed on lustre-... Resolved
Severity: 3
Rank (Obsolete): 8116

 Description   

This issue was created by maloo for sarah <sarah@whamcloud.com>

This issue relates to the following test suite run: http://maloo.whamcloud.com/test_sets/5584e96e-b5de-11e2-9d08-52540035b04c.

The sub-test test_7 failed with the following error:

Failure in replication; differences found.

Info required for matching: lustre-rsync-test 7



 Comments   
Comment by Jodi Levi (Inactive) [ 06/May/13 ]

Mike,
Could you please comment on this one?
Thank you!

Comment by Andreas Dilger [ 07/May/13 ]

Di, Mike,
the problem here appears to be that the ChangeLog user doesn't see any log records. Is there some change in the protocol that breaks interoperability with a 2.3 client? Does this bug need to be a blocker for 2.4.0?

Comment by Di Wang [ 07/May/13 ]

It seems 2.4 adds a layout change record to the changelog:

50596 20LAYOUT 06:38:24.898596932 2013.05.07 0x0 t=[0x200000401:0x2e:0x0]

which 2.3 clients do not recognize, and which causes test_7 to fail. So it might not be a blocker.

Lustre: DEBUG MARKER: == lustre-rsync-test test 7: lustre_rsync stripesize == 22:54:01 (1367906041)
LustreError: 9793:0:(mdc_request.c:1236:changelog_show_cb()) Not a changelog rec 275120128/20
LustreError: 9801:0:(mdc_request.c:1236:changelog_show_cb()) Not a changelog rec 275120128/20

Jinshan, could you please comment?

Comment by Oleg Drokin [ 08/May/13 ]

It should not be too hard to filter out layout changes from the log for a client that does not understand layout lock?

Comment by Jinshan Xiong (Inactive) [ 08/May/13 ]

It seems there is nothing we can do other than require customers to use an up-to-date client for rsync.

Comment by Jinshan Xiong (Inactive) [ 08/May/13 ]

I don't think filtering out the unknown records is an option, because that would make rsync unusable in practice.

Comment by Andreas Dilger [ 10/May/13 ]

Jinshan - that depends on what records are being filtered out. 275120128 = 0x10660000 = CHANGELOG_REC, and 20 = CL_LAYOUT. The CL_LAYOUT records do not affect the actual data in the file, which is what lustre_rsync cares about.
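As a minimal sketch of the decoding above: the two numbers in the console message "Not a changelog rec 275120128/20" can be checked against the values quoted in this comment. The `EX_` names and helper functions are hypothetical and exist only for illustration; they are not the real Lustre macros or functions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Values quoted in this ticket; the EX_ names are illustrative only. */
#define EX_CHANGELOG_REC 0x10660000u /* llog record type for changelog records */
#define EX_CL_LAYOUT     20u         /* CL_LAYOUT as numbered on the 2.4 server under test */

/* The decimal value printed in the console log is the same number. */
static_assert(EX_CHANGELOG_REC == 275120128u, "275120128 == 0x10660000");

/* True when the first number in the message is the changelog record type. */
bool ex_is_changelog_rec(uint32_t rec_type)
{
    return rec_type == EX_CHANGELOG_REC;
}

/* True when the pair "rec_type/cr_type" describes a layout change record. */
bool ex_is_layout_rec(uint32_t rec_type, uint32_t cr_type)
{
    return ex_is_changelog_rec(rec_type) && cr_type == EX_CL_LAYOUT;
}
```

Calling `ex_is_layout_rec(275120128u, 20)` with the pair from the error message returns true, which matches the reading that the rejected record is a valid changelog record of a type the 2.3 client does not know.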

To be honest, I don't think the kernel should actually care about what is in the ChangeLog, it should only be the consumer that needs to check what the records are. changelog_show_cb() should IMHO be modified to remove the CL_LAST check, and just pass all of the records on to the consumer. Otherwise, we will have all sorts of "compatibility" problems here for no real benefit.
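A rough sketch of that division of labour, under stated assumptions: the kernel hands every record up unchanged, and the consumer skips the types it does not understand instead of failing. `struct ex_rec`, `ex_consume()` and the `consumer_last` parameter (the highest record type the consumer knows) are hypothetical and do not reflect the actual lustre_rsync code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified changelog record: only the type field matters here. */
struct ex_rec {
    uint32_t cr_type;
};

/*
 * Consumer-side loop: handle the record types this consumer understands and
 * skip newer ones rather than aborting. Returns the number of records
 * handled; *skipped receives the number of unknown records passed over.
 */
size_t ex_consume(const struct ex_rec *recs, size_t n,
                  uint32_t consumer_last, size_t *skipped)
{
    size_t handled = 0;

    assert(recs != NULL || n == 0);
    assert(skipped != NULL);

    *skipped = 0;
    for (size_t i = 0; i < n; i++) {
        if (recs[i].cr_type > consumer_last) {
            (*skipped)++; /* unknown to this consumer: ignore, do not fail */
            continue;
        }
        handled++; /* a real consumer would replicate the change here */
    }
    return handled;
}
```

With this shape, a newer server adding record types does not break an older consumer; whether skipping a given type is safe is a separate question, which the flag idea below addresses.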

If necessary, we might consider splitting up the cr_type field to carry a few bits of information, like:

#define CL_FLAG_OPTIONAL 0xff000000 /* does not affect data content/layout, can be skipped */
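One way this split could work, purely as a sketch: the low bits of cr_type hold the record type and the bits under CL_FLAG_OPTIONAL mark records that a consumer may safely skip. Only the CL_FLAG_OPTIONAL mask comes from the line above; `EX_CL_TYPE_MASK`, `enum ex_action`, `ex_classify()` and `consumer_last` are hypothetical, and nothing like this was implemented in the patch.

```c
#include <assert.h>
#include <stdint.h>

#define CL_FLAG_OPTIONAL 0xff000000u /* mask proposed in the comment above */
#define EX_CL_TYPE_MASK  0x00ffffffu /* hypothetical: remaining bits hold the type */

static_assert((CL_FLAG_OPTIONAL & EX_CL_TYPE_MASK) == 0,
              "flag bits and type bits must not overlap");

enum ex_action { EX_HANDLE, EX_SKIP, EX_REJECT };

/*
 * Decide what a consumer that understands types up to consumer_last should
 * do with a record: handle known types, skip unknown types flagged as
 * optional, and reject unknown types that are not flagged.
 */
enum ex_action ex_classify(uint32_t cr_type, uint32_t consumer_last)
{
    uint32_t type = cr_type & EX_CL_TYPE_MASK;

    if (type <= consumer_last)
        return EX_HANDLE;
    if (cr_type & CL_FLAG_OPTIONAL)
        return EX_SKIP;
    return EX_REJECT;
}
```

The point of the flag is that an old consumer can tell "unknown but harmless" apart from "unknown and possibly important" without knowing the new type itself.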

For now, to keep things simple, I've pushed http://review.whamcloud.com/6308 to fix this interop problem. It moves CL_LAYOUT into the place of CL_IOCTL (which was never used, and was IMHO a pointless record). This avoids the problem of CL_LAYOUT > CL_LAST(2.3). It also removes the check for cr_type > CL_LAST and just passes the records up to the caller, which can decide what to do with them. It does not do anything like CL_FLAG_OPTIONAL, but we might consider that in the future.

Comment by Oleg Drokin [ 13/May/13 ]

Just a heads up to CEA and LLNL:

The current patch in review changes the LAYOUT changelog record type to an incompatible value. Do you currently monitor it from your tools, and would you be affected by the change?

Comment by Prakash Surya (Inactive) [ 13/May/13 ]

I appreciate the heads up! AFAIK, we do not use any tools external to the Lustre tree that make use of changelogs, except robinhood. Can one of the robinhood developers comment on this? I'm not familiar enough with its internals to know whether it would be affected.

Comment by Henri Doreau (Inactive) [ 14/May/13 ]

Thanks for the heads up. The change doesn't affect our tools.

We have concerns about the comment that has been added to lustre/lustre_user.h:650 though. As Jinshan Xiong noted, stating that CL_LAYOUT doesn't reflect any actual data change is misleading.

Comment by Sarah Liu [ 14/May/13 ]

Interop between a 2.1.5 client and a 2.4 server also hit this issue in tag-2.3.65 testing:

https://maloo.whamcloud.com/test_sets/3e989b54-b9a5-11e2-875f-52540035b04c

Comment by Andreas Dilger [ 14/May/13 ]

"We have concerns about the comment that has been added to lustre/lustre_user.h:650 though. As Jinshan Xiong noted, stating that CL_LAYOUT doesn't reflect any actual data change is misleading."

My thought is that the content of the file is not changing, in either the file release or the file migrate case. In the case of file release, the content is still in the archive. Even if it is not in Lustre, any application accessing that file would get the same data back after it is restored from the archive.

There is of course some concern that a poorly-written layout swap could change the content of the file (e.g. source and target file do not contain the same data), but I can't see any reason to do that. Only the file owner could do this migrate, and they could just as easily write some other data directly into the file if they want.

In any case, I'll remove the comment from the CL_LAYOUT description.

Comment by Andreas Dilger [ 14/May/13 ]

I've pushed http://review.whamcloud.com/6338 as a follow-on patch for master. This changes the comment for CL_LAYOUT, and also changes the output string for this record type to match the 5-character convention used by other names.

I pushed http://review.whamcloud.com/6335 as a combined patch for b2_1.

Comment by Peter Jones [ 20/May/13 ]

Andreas

All the present patches in flight have landed. Is any further work required or can this ticket be marked as resolved?

Peter

Generated at Sat Feb 10 01:32:30 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.