[LU-9234] replay-single test_70f: checksum doesn't match Created: 20/Mar/17  Updated: 27/Mar/18  Resolved: 27/Mar/18

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.10.0, Lustre 2.11.0
Fix Version/s: Lustre 2.10.0

Type: Bug Priority: Critical
Reporter: James Casper Assignee: Sarah Liu
Resolution: Fixed Votes: 0
Labels: None
Environment:

onyx-35vm3 thru 6, Interop test,
RHEL7.3/ldiskfs, branch master, v2.9.54, b3541, 2.9 Lustre,
Client 2.10 Lustre


Issue Links:
Related
is related to LU-10702 replay-single test_87a: checksum does... Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

https://testing.hpdd.intel.com/test_sessions/1a6bc6e8-0a05-11e7-9053-5254006e85c2

After an OST is unmounted and remounted, the client detects a checksum mismatch:

test_log:

CMD: onyx-35vm4 umount -d /mnt/lustre-ost1
CMD: onyx-35vm4 lsmod | grep lnet > /dev/null && lctl dl | grep ' ST '
reboot facets: ost1
Failover ost1 to onyx-35vm4
03:10:44 (1489572644) waiting for onyx-35vm4 network 900 secs ...
03:10:44 (1489572644) network interface is UP
CMD: onyx-35vm4 hostname
mount facets: ost1
CMD: onyx-35vm4 test -b /dev/lvm-Role_OSS/P1
CMD: onyx-35vm4 e2label /dev/lvm-Role_OSS/P1
Starting ost1:   /dev/lvm-Role_OSS/P1 /mnt/lustre-ost1
CMD: onyx-35vm4 mkdir -p /mnt/lustre-ost1; mount -t lustre

followed by:

CMD: onyx-35vm6 md5sum /mnt/lustre/d70f.replay-single/f70f.replay-single.onyx-35vm5.onyx.hpdd.intel.com
onyx-35vm5: osc.lustre-OST0000-osc-*.ost_server_uuid in FULL state after 3 sec
onyx-35vm6: osc.lustre-OST0000-osc-*.ost_server_uuid in FULL state after 3 sec
 replay-single test_70f: @@@@@@ FAIL: /mnt/lustre/d70f.replay-single/f70f.replay-single.onyx-35vm5.onyx.hpdd.intel.com: checksum doesn't match on onyx-35vm6 
  Trace dump:
  = /usr/lib64/lustre/tests/test-framework.sh:4841:error()
  = /usr/lib64/lustre/tests/replay-single.sh:2334:test_70f_write_and_read()
  = /usr/lib64/lustre/tests/replay-single.sh:2350:test_70f_loop()
  = /usr/lib64/lustre/tests/replay-single.sh:2394:test_70f()
  = /usr/lib64/lustre/tests/test-framework.sh:5117:run_one()
  = /usr/lib64/lustre/tests/test-framework.sh:5156:run_one_logged()
  = /usr/lib64/lustre/tests/test-framework.sh:5003:run_test()
  = /usr/lib64/lustre/tests/replay-single.sh:2415:main()
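
The failing check can be pictured with a minimal sketch. It assumes, based only on the md5sum command and the error message above, that the test compares an MD5 digest of the data that was written with the digest of the file as read back on each client. The helper names and temporary files below are hypothetical and are not taken from replay-single.sh.

```python
import hashlib
import os
import tempfile


def md5_of(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of the file at `path`, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def checksums_match(reference_path, copy_path):
    """True when both files hash to the same MD5 digest."""
    return md5_of(reference_path) == md5_of(copy_path)


def demo():
    """Compare a reference file with an intact copy and a corrupted copy.

    The files are hypothetical stand-ins for the source data and the
    copy read back from the Lustre mount.
    """
    with tempfile.TemporaryDirectory() as tmp:
        payload = os.urandom(4096)
        # Flip every bit of the last byte to simulate corrupted data.
        corrupted = payload[:-1] + bytes([payload[-1] ^ 0xFF])
        files = {
            "reference": payload,
            "good_copy": payload,
            "bad_copy": corrupted,
        }
        paths = {}
        for name, data in files.items():
            paths[name] = os.path.join(tmp, name)
            with open(paths[name], "wb") as fh:
                fh.write(data)
        return (
            checksums_match(paths["reference"], paths["good_copy"]),
            checksums_match(paths["reference"], paths["bad_copy"]),
        )
```

In this sketch a False result from checksums_match() corresponds to the "checksum doesn't match" failure reported by the test.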


 Comments   
Comment by James Nunez (Inactive) [ 21/Mar/17 ]

In the console for the OSTs, we see:

03:11:42:[18142.619725] Lustre: lustre-OST0000: Will be in recovery for at least 1:00, or until 3 clients reconnect
03:11:42:[18144.679276] LustreError: 32511:0:(ofd_grant.c:686:ofd_grant_check()) lustre-OST0000: cli df47be9e-3378-d13b-7555-c154ed48e9ba is replaying OST_WRITE while one rnb hasn't OBD_BRW_FROM_GRANT set (0x108)
03:11:42:[18144.711883] LustreError: 168-f: BAD WRITE CHECKSUM: lustre-OST0000 from 12345-10.2.4.142@tcp inode [0x20001a213:0xb88d:0x0] object 0x0:9508 extent [0-1048575]: client csum 32a1611e, server csum 66771e72
03:11:42:[18144.873216] Lustre: lustre-OST0000: Recovery over after 0:03, of 3 clients 3 recovered and 0 were evicted.
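
When triaging many console logs like the one above, it can help to pull the fields out of the BAD WRITE CHECKSUM message. The sketch below is not a Lustre tool; the regular expression is derived only from the single line quoted in this comment, and the extent is assumed to be an inclusive byte range.

```python
import re

# Pattern inferred from the one log line in this ticket (an assumption,
# not an official description of the message format).
BAD_CSUM_RE = re.compile(
    r"BAD WRITE CHECKSUM: (?P<target>\S+) from (?P<peer>\S+) "
    r"inode (?P<fid>\[[^\]]+\]) object (?P<object>\S+) "
    r"extent \[(?P<start>\d+)-(?P<end>\d+)\]: "
    r"client csum (?P<client>[0-9a-f]+), server csum (?P<server>[0-9a-f]+)"
)


def parse_bad_write_checksum(line):
    """Return a dict of fields from a BAD WRITE CHECKSUM line, or None."""
    match = BAD_CSUM_RE.search(line)
    if match is None:
        return None
    info = match.groupdict()
    info["start"] = int(info["start"])
    info["end"] = int(info["end"])
    # Assumes the extent end offset is inclusive.
    info["length"] = info["end"] - info["start"] + 1
    return info


SAMPLE = (
    "03:11:42:[18144.711883] LustreError: 168-f: BAD WRITE CHECKSUM: "
    "lustre-OST0000 from 12345-10.2.4.142@tcp inode "
    "[0x20001a213:0xb88d:0x0] object 0x0:9508 extent [0-1048575]: "
    "client csum 32a1611e, server csum 66771e72"
)

if __name__ == "__main__":
    fields = parse_bad_write_checksum(SAMPLE)
    print(fields["target"], fields["client"], fields["server"], fields["length"])
```

For the line in this ticket, the parsed extent covers 1048576 bytes (1 MiB) and the client and server checksums differ.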

According to Maloo, replay-single test 70f started failing at the beginning of February 2017 and has failed 60 times in the full test group since then. All the failures I've seen occurred during interop testing.

Some early failure logs are at:
2017-02-03 - (upstream client + master servers) https://testing.hpdd.intel.com/test_sets/766efa26-eaf4-11e6-af25-5254006e85c2
2017-02-04 - (master clients + b2_8 servers) https://testing.hpdd.intel.com/test_sets/2122fcde-eba8-11e6-848c-5254006e85c2
2017-02-04 - (master clients + b2_9 servers) https://testing.hpdd.intel.com/test_sets/3e6136de-eba9-11e6-9bb9-5254006e85c2
2017-02-07 - (master clients + b2_8 servers) https://testing.hpdd.intel.com/test_sets/deae1082-ee27-11e6-bbfe-5254006e85c2
2017-02-07 - (master clients + b2_8 servers) https://testing.hpdd.intel.com/test_sets/bf86630e-edf9-11e6-8f6d-5254006e85c2

Comment by Peter Jones [ 21/Mar/17 ]

Bobijam

Could you please advise on this one?

Thanks

Peter

Comment by Zhenyu Xu [ 01/Apr/17 ]

replay-single test_70f only works with OSS versions after 2.9.52.60 (the relevant issue is LU-1573, patch #16680, git commit 1d2fbade1b658db4386091e7938d9483f7aa4a05), so 2.8 and 2.9 servers do not contain this fix.

As for the failures with master servers, I checked the Maloo reports: their OSS version is 2.9.52.54.gc6f5e81, which does not have this fix either.

Comment by Gerrit Updater [ 19/Apr/17 ]

Wei Liu (wei3.liu@intel.com) uploaded a new patch: https://review.whamcloud.com/26739
Subject: LU-9234 test: Skip test_70f if OSS version is older than 2.9.53
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 70d77276b26a9a92621bc4aef4f3f04cb6da310f
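
The skip condition named in the patch subject can be sketched as a plain version comparison. This is an illustration only, not the code from the patch: the function names are hypothetical, and the parser assumes that a version string is a dotted list of integers optionally followed by a non-numeric suffix such as "gc6f5e81".

```python
def version_tuple(version, width=4):
    """Convert '2.9.52.54.gc6f5e81' to (2, 9, 52, 54).

    Parsing stops at the first non-numeric component; the result is
    padded with zeros to `width` components so that '2.9.53' and
    '2.9.53.0' compare as equal.
    """
    parts = []
    for part in version.split("."):
        if not part.isdigit():
            break
        parts.append(int(part))
    parts = parts[:width]
    parts.extend([0] * (width - len(parts)))
    return tuple(parts)


def should_skip_test_70f(oss_version, minimum="2.9.53"):
    """True when the OSS version is older than `minimum`."""
    return version_tuple(oss_version) < version_tuple(minimum)


if __name__ == "__main__":
    for version in ("2.8.0", "2.9.52.54.gc6f5e81", "2.9.53", "2.10.0"):
        print(version, should_skip_test_70f(version))
```

With this comparison, the 2.9.52.54.gc6f5e81 servers mentioned in the earlier comment fall below the 2.9.53 threshold, so the test would be skipped on them. Comparing integer tuples rather than strings matters here, because a string comparison would order "2.10.0" before "2.9.53".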

Comment by Gerrit Updater [ 01/May/17 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/26739/
Subject: LU-9234 test: Skip test_70f if OSS version is older than 2.9.53
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 00db1ffe72bc1f4504adfaba539a1ec4f0fde74b

Comment by Peter Jones [ 01/May/17 ]

Landed for 2.10

Comment by James Casper [ 23/Aug/17 ]

Seen again in 2.10.51 (b3620):

https://testing.hpdd.intel.com/test_sessions/c4874eda-04c9-40c5-9e92-b8e7574bd5fe

Comment by James Nunez (Inactive) [ 19/Jan/18 ]

There still seems to be an issue with checksums, which shows up in replay-single test 87a. I'll open a new ticket for that issue.
