[LU-2866] Interop 2.1.4<->2.4 failure on test suite lustre-rsync-test test_1: Failure in replication; differences found Created: 26/Feb/13  Updated: 11/Jul/13  Resolved: 24/Mar/13

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.4.0
Fix Version/s: None

Type: Bug Priority: Blocker
Reporter: Maloo Assignee: Yang Sheng
Resolution: Duplicate Votes: 0
Labels: MB
Environment:

client: 2.1.4 RHEL6
server: lustre-master build #


Issue Links:
Duplicate
duplicates LU-1331 changelogs: RNMTO record not always a... Resolved
Related
is related to LU-1237 2.1.1<->2.2 interop Test failure on t... Resolved
Severity: 3
Rank (Obsolete): 6932

 Description   

This issue was created by maloo for sarah <sarah@whamcloud.com>

This issue relates to the following test suite run: https://maloo.whamcloud.com/test_sets/2277b80e-7d52-11e2-a108-52540035b04c.

The sub-test test_1 failed with the following error:

Failure in replication; differences found.

client-1 test log shows:

== lustre-rsync-test test 1: Simple Replication == 04:47:54 (1361537274)
CMD: client-24vm3 lctl --device lustre-MDT0000 changelog_register -n
lustre-MDT0000: Registered changelog user cl1
CMD: client-24vm3 lctl get_param -n mdd.lustre-MDT0000.changelog_users
Replication #1
Replication of operation failed(-21): 15 UNLNK (6) [0x200000400:0x13:0x0] [0x200000400:0xd:0x0] file4
Lustre filesystem: lustre
MDT device: lustre-MDT0000
Source: /mnt/lustre
Target: /tmp/target
Target: /tmp/target2
Statuslog: /tmp/lustre_rsync.log
Changelog registration: cl1
Starting changelog record: 0
Errors: 1
lustre_rsync took 1 seconds
Changelog records consumed: 18
Replication #2
Lustre filesystem: lustre
MDT device: lustre-MDT0000
Source: /mnt/lustre
Target: /tmp/target
Target: /tmp/target2
Statuslog: /tmp/lustre_rsync.log
Changelog registration: cl1
Starting changelog record: 18
Errors: 0
lustre_rsync took 0 seconds
Changelog records consumed: 4
Only in /tmp/target/d0.lustre-rsync-test/d1/d1: file4
Only in /mnt/lustre/d0.lustre-rsync-test/d1: d2
 lustre-rsync-test test_1: @@@@@@ FAIL: Failure in replication; differences found. 
Dumping lctl log to /logdir/test_logs/2013-02-21/lustre-master-el6-x86_64-vs-lustre-b2_1-el6-x86_64--full--2_5_1__1269__-70264313904840-171517/lustre-rsync-

client console shows:

04:48:00:Lustre: DEBUG MARKER: == lustre-rsync-test test 1: Simple Replication == 04:47:54 (1361537274)
04:48:00:LustreError: 23091:0:(lmv_obd.c:855:lmv_iocontrol()) error: iocontrol MDC lustre-MDT0000_UUID on MDTidx 0 cmd c0086696: err = -2
04:48:00:LustreError: 23091:0:(lmv_obd.c:855:lmv_iocontrol()) error: iocontrol MDC lustre-MDT0000_UUID on MDTidx 0 cmd c0086696: err = -2
04:48:00:Lustre: DEBUG MARKER: /usr/sbin/lctl mark  lustre-rsync-test test_1: @@@@@@ FAIL: Failure in replication\; differences found. 
04:48:00:Lustre: DEBUG MARKER: lustre-rsync-test test_1: @@@@@@ FAIL: Failure in replication; differences found.


 Comments   
Comment by Sarah Liu [ 26/Feb/13 ]

Also found following errors in subtest test_2b

== lustre-rsync-test test 2b: Replicate files changed by dbench. == 04:49:54 (1361537394)
CMD: client-24vm3 lctl --device lustre-MDT0000 changelog_register -n
lustre-MDT0000: Registered changelog user cl3
CMD: client-24vm3 lctl get_param -n mdd.lustre-MDT0000.changelog_users
Other changelog users present (3)
copying /usr/share/dbench/client.txt to /mnt/lustre/d0.lustre-rsync-test/d2/client.txt
copying necessary libs to /mnt/lustre/d0.lustre-rsync-test/d2
lib64/libpopt.so.0
lib64/libc.so.6
lib64/ld-linux-x86-64.so.2
running 'dbench 2 -t 60' in /mnt/lustre/d0.lustre-rsync-test/d2 at Fri Feb 22 04:49:56 PST 2013
dbench PID=24008
dbench version 4.00 - Copyright Andrew Tridgell 1999-2004

Running for 60 seconds with load 'client.txt' and minimum warmup 12 secs
0 of 2 processes prepared for launch   0 sec
2 of 2 processes prepared for launch   0 sec
releasing clients
   2       453    31.81 MB/sec  warmup   1 sec  latency 14.306 ms
   2       833    21.56 MB/sec  warmup   2 sec  latency 202.565 ms
   2       960    14.56 MB/sec  warmup   3 sec  latency 741.947 ms
   2      1450    13.01 MB/sec  warmup   4 sec  latency 79.735 ms
   2      1977    10.58 MB/sec  warmup   5 sec  latency 154.946 ms
   2      2451     9.37 MB/sec  warmup   6 sec  latency 76.113 ms
   2      3097     9.66 MB/sec  warmup   7 sec  latency 87.938 ms
   2      3686     9.68 MB/sec  warmup   8 sec  latency 29.801 ms
   2      4002     8.90 MB/sec  warmup   9 sec  latency 13.199 ms
   2      4303     8.09 MB/sec  warmup  10 sec  latency 106.175 ms
   2      4618     7.64 MB/sec  warmup  11 sec  latency 127.999 ms
   2      5662     1.07 MB/sec  execute   1 sec  latency 62.161 ms
   2      6086     2.70 MB/sec  execute   2 sec  latency 118.249 ms
   2      6839     6.15 MB/sec  execute   3 sec  latency 35.459 ms
   2      7348     6.82 MB/sec  execute   4 sec  latency 14.272 ms
   2      7648     5.59 MB/sec  execute   5 sec  latency 13.883 ms
   2      7960     4.86 MB/sec  execute   6 sec  latency 126.676 ms
PIDs: 24008 24009 24010
Stopping dbench
Starting replication
Replication of operation failed(-39): 5489 RNMFM (8) [0:0x0:0x0] [0x200000400:0x9a8:0x0] 
Replication of operation failed(-39): 5499 RNMFM (8) [0:0x0:0x0] [0x200000400:0x9a9:0x0] 
Replication of operation failed(-22): 5510 RNMFM (8) [0:0x0:0x0] [0x200000400:0x9a8:0x0] 
Replication of operation failed(-21): 5520 UNLNK (6) [0x200000400:0xa25:0x0] [0x200000400:0x9a8:0x0] PPTC321.TMP
Replication of operation failed(-22): 5522 RNMFM (8) [0:0x0:0x0] [0x200000400:0x9a8:0x0] 
Replication of operation failed(-21): 5525 UNLNK (6) [0x200000400:0xa21:0x0] [0x200000400:0x9a8:0x0] PPTC321.TMP
Replication of operation failed(-22): 5534 RNMFM (8) [0:0x0:0x0] [0x200000400:0x99a:0x0] 
Replication of operation failed(-39): 5548 RNMFM (8) [0:0x0:0x0] [0x200000400:0x99b:0x0] 
Replication of operation failed(-22): 5600 RNMFM (8) [0:0x0:0x0] [0x200000400:0x9dd:0x0] 
Replication of operation failed(-22): 5610 RNMFM (8) [0:0x0:0x0] [0x200000400:0x9dc:0x0] 
Replication of operation failed(-39): 5643 RNMFM (8) [0:0x0:0x0] [0x200000400:0x9a2:0x0] 
Replication of operation failed(-39): 5649 RNMFM (8) [0:0x0:0x0] [0x200000400:0x9a3:0x0] 
Replication of operation failed(-39): 5661 RNMFM (8) [0:0x0:0x0] [0x200000400:0x9a2:0x0] 
Replication of operation failed(-39): 5710 RNMFM (8) [0:0x0:0x0] [0x200000400:0x9cc:0x0] 
Replication of operation failed(-22): 5729 RNMFM (8) [0:0x0:0x0] [0x200000400:0x9cc:0x0] 
Replication of operation failed(-21): 5776 UNLNK (6) [0x200000400:0xa9c:0x0] [0x200000400:0xa07:0x0] FASTENER.LDB
Replication of operation failed(-22): 5799 RNMFM (8) [0:0x0:0x0] [0x200000400:0x9b4:0x0] 
Replication of operation failed(-22): 5801 RNMFM (8) [0:0x0:0x0] [0x200000400:0x9b4:0x0] 
Replication of operation failed(-22): 5814 RNMFM (8) [0:0x0:0x0] [0x200000400:0x9b4:0x0] 
Replication of operation failed(-22): 5817 RNMFM (8) [0:0x0:0x0] [0x200000400:0x9b4:0x0] 
Replication of operation failed(-21): 5819 UNLNK (6) [0x200000400:0xabd:0x0] [0x200000400:0x9b4:0x0] @@@CDRW.TMP
Replication of operation failed(-39): 6087 RNMFM (8) [0:0x0:0x0] [0x200000400:0x99a:0x0] 
Replication of operation failed(-39): 6093 RNMFM (8) [0:0x0:0x0] [0x200000400:0x99a:0x0] 
Replication of operation failed(-39): 6095 RNMFM (8) [0:0x0:0x0] [0x200000400:0x99a:0x0] 
Replication of operation failed(-39): 6105 RNMFM (8) [0:0x0:0x0] [0x200000400:0x99b:0x0] 
Replication of operation failed(-39): 6111 RNMFM (8) [0:0x0:0x0] [0x200000400:0x99b:0x0] 
Replication of operation failed(-39): 6113 RNMFM (8) [0:0x0:0x0] [0x200000400:0x99b:0x0] 
Replication of operation failed(-21): 6118 UNLNK (6) [0x200000400:0xb33:0x0] [0x200000400:0x99a:0x0] FSA4230.TMP
Replication of operation failed(-39): 6153 RNMFM (8) [0:0x0:0x0] [0x200000400:0x9dd:0x0] 
Replication of operation failed(-39): 6307 RNMFM (8) [0:0x0:0x0] [0x200000400:0x9cd:0x0] 
Replication of operation failed(-21): 6329 UNLNK (6) [0x200000400:0xb95:0x0] [0x200000400:0xa07:0x0] FASTENER.LDB
Replication of operation failed(-21): 6336 UNLNK (6) [0x200000400:0xba4:0x0] [0x200000400:0x9b4:0x0] @@@CDRW.TMP
Replication of operation failed(-21): 6340 UNLNK (6) [0x200000400:0xba7:0x0] [0x200000400:0x9b4:0x0] @@@CDRW.TMP
Replication of operation failed(-21): 6343 UNLNK (6) [0x200000400:0xba9:0x0] [0x200000400:0x9b4:0x0] @@@CDRW.TMP
Replication of operation failed(-21): 6371 UNLNK (6) [0x200000400:0xbb6:0x0] [0x200000400:0x9d4:0x0] PMC162.TMP
Lustre filesystem: lustre
MDT device: lustre-MDT0000
Source: /mnt/lustre
Target: /tmp/target
Target: /tmp/target2
Statuslog: /tmp/lustre_rsync.log
Changelog registration: cl3
Starting changelog record: 0
Errors: 35
lustre_rsync took 108 seconds
Changelog records consumed: 6602
Only in /mnt/lustre/d0.lustre-rsync-test/d2/clients/client0/~dmtmp: COREL
Only in /mnt/lustre/d0.lustre-rsync-test/d2/clients/client0/~dmtmp: EXCEL
Only in /tmp/target/d0.lustre-rsync-test/d2/clients/client0/~dmtmp/PM: PMC162.TMP
Only in /mnt/lustre/d0.lustre-rsync-test/d2/clients/client0/~dmtmp: PWRPNT
Only in /mnt/lustre/d0.lustre-rsync-test/d2/clients/client0/~dmtmp: WORD
Only in /mnt/lustre/d0.lustre-rsync-test/d2/clients/client0/~dmtmp: WORDPRO
Only in /tmp/target/d0.lustre-rsync-test/d2/clients/client1/~dmtmp/ACCESS: FASTENER.LDB
Only in /mnt/lustre/d0.lustre-rsync-test/d2/clients/client1/~dmtmp: COREL
Only in /mnt/lustre/d0.lustre-rsync-test/d2/clients/client1/~dmtmp: EXCEL
Only in /mnt/lustre/d0.lustre-rsync-test/d2/clients/client1/~dmtmp: PWRPNT
Only in /tmp/target/d0.lustre-rsync-test/d2/clients/client1/~dmtmp/WORD: FSA4230.TMP
Only in /mnt/lustre/d0.lustre-rsync-test/d2/clients/client1/~dmtmp: WORDPRO
 lustre-rsync-test test_2b: @@@@@@ FAIL: Failure in replication; differences found. 
Comment by Sarah Liu [ 13/Mar/13 ]

another instance: https://maloo.whamcloud.com/test_sets/bf2638e6-8906-11e2-b643-52540035b04c

Comment by Yang Sheng [ 13/Mar/13 ]

I am still trying to reproduce it.

Comment by Yang Sheng [ 15/Mar/13 ]

Hi, Sarah, have any special while the test running? I cannot reproduce it anyway.

Comment by Sarah Liu [ 17/Mar/13 ]

Hi Yang Sheng,

I can reproduce it on my test nodes, I just copy the test-framework.sh from server(2.4) to client(2.1.4), otherwise mkfs.lustre cannot work:
https://maloo.whamcloud.com/test_sets/403f19da-8ec4-11e2-92ff-52540035b04c

Comment by Yang Sheng [ 19/Mar/13 ]

Ah, Obviously it isn't standarad operation, has a ticket for mkfs.lustre issue?

Comment by Sarah Liu [ 19/Mar/13 ]

Yang Sheng, I think this is the ticket LU-1526 and this issue should be fixed in 2.1.5

Comment by Oleg Drokin [ 19/Mar/13 ]

So, I see LU-1526 is actually part of 2.1.4, then how will it be fixed in 2.1.5, by what patch?

Comment by Yang Sheng [ 20/Mar/13 ]

I think this issue should cause by 2.1.4-rc2 hasn't included http://review.whamcloud.com/#change,3835

So rename changelog is different between 2.1 & 2.4. Just include this patch else we have to stop the rsync in interop situation.

Comment by Sarah Liu [ 21/Mar/13 ]

verified with the latest b2_1 build #191 and it passed:

https://maloo.whamcloud.com/test_sessions/f2de5456-9244-11e2-b7d5-52540035b04c

Comment by Peter Jones [ 24/Mar/13 ]

ok so I am closing this as a duplicate of LU-1331 and it will be fixed once b2_1 interop switches to 2.1.5

Generated at Sat Feb 10 01:28:53 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.