[LU-9234] replay-single test_70f: checksum doesn't match Created: 20/Mar/17 Updated: 27/Mar/18 Resolved: 27/Mar/18 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.10.0, Lustre 2.11.0 |
| Fix Version/s: | Lustre 2.10.0 |
| Type: | Bug | Priority: | Critical |
| Reporter: | James Casper | Assignee: | Sarah Liu |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Environment: |
onyx-35vm3 thru 6, Interop test, |
||
| Severity: | 3 | ||||||||
| Rank (Obsolete): | 9223372036854775807 | ||||||||
| Description |
|
https://testing.hpdd.intel.com/test_sessions/1a6bc6e8-0a05-11e7-9053-5254006e85c2

After unmounting/mounting an OST, client detects a checksum mismatch:

test_log:
CMD: onyx-35vm4 umount -d /mnt/lustre-ost1
CMD: onyx-35vm4 lsmod | grep lnet > /dev/null && lctl dl | grep ' ST '
reboot facets: ost1
Failover ost1 to onyx-35vm4
03:10:44 (1489572644) waiting for onyx-35vm4 network 900 secs ...
03:10:44 (1489572644) network interface is UP
CMD: onyx-35vm4 hostname
mount facets: ost1
CMD: onyx-35vm4 test -b /dev/lvm-Role_OSS/P1
CMD: onyx-35vm4 e2label /dev/lvm-Role_OSS/P1
Starting ost1: /dev/lvm-Role_OSS/P1 /mnt/lustre-ost1
CMD: onyx-35vm4 mkdir -p /mnt/lustre-ost1; mount -t lustre

followed by:

CMD: onyx-35vm6 md5sum /mnt/lustre/d70f.replay-single/f70f.replay-single.onyx-35vm5.onyx.hpdd.intel.com
onyx-35vm5: osc.lustre-OST0000-osc-*.ost_server_uuid in FULL state after 3 sec
onyx-35vm6: osc.lustre-OST0000-osc-*.ost_server_uuid in FULL state after 3 sec
replay-single test_70f: @@@@@@ FAIL: /mnt/lustre/d70f.replay-single/f70f.replay-single.onyx-35vm5.onyx.hpdd.intel.com: checksum doesn't match on onyx-35vm6
Trace dump:
= /usr/lib64/lustre/tests/test-framework.sh:4841:error()
= /usr/lib64/lustre/tests/replay-single.sh:2334:test_70f_write_and_read()
= /usr/lib64/lustre/tests/replay-single.sh:2350:test_70f_loop()
= /usr/lib64/lustre/tests/replay-single.sh:2394:test_70f()
= /usr/lib64/lustre/tests/test-framework.sh:5117:run_one()
= /usr/lib64/lustre/tests/test-framework.sh:5156:run_one_logged()
= /usr/lib64/lustre/tests/test-framework.sh:5003:run_test()
= /usr/lib64/lustre/tests/replay-single.sh:2415:main() |
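The log above suggests the failing check is an md5sum of a file read back on a second client and compared against an expected value. The following is only a minimal sketch of that kind of read-back check, not the code in replay-single.sh; the helper names `md5_of_file` and `verify_readback` are hypothetical.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_readback(path, expected_md5):
    """Raise if the file's MD5 differs from the digest recorded at write time.

    Hypothetical helper: the error text mirrors the message in the test log,
    but the real test's logic may differ.
    """
    actual = md5_of_file(path)
    if actual != expected_md5:
        raise RuntimeError(
            f"{path}: checksum doesn't match "
            f"(expected {expected_md5}, got {actual})"
        )
    return actual
```

In a two-client setup, the expected digest would be computed from the data on the writing node and `verify_readback` would run on the reading node after the OST failover.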
| Comments |
| Comment by James Nunez (Inactive) [ 21/Mar/17 ] |
|
In the console for the OSTs, we see:

03:11:42:[18142.619725] Lustre: lustre-OST0000: Will be in recovery for at least 1:00, or until 3 clients reconnect
03:11:42:[18144.679276] LustreError: 32511:0:(ofd_grant.c:686:ofd_grant_check()) lustre-OST0000: cli df47be9e-3378-d13b-7555-c154ed48e9ba is replaying OST_WRITE while one rnb hasn't OBD_BRW_FROM_GRANT set (0x108)
03:11:42:[18144.711883] LustreError: 168-f: BAD WRITE CHECKSUM: lustre-OST0000 from 12345-10.2.4.142@tcp inode [0x20001a213:0xb88d:0x0] object 0x0:9508 extent [0-1048575]: client csum 32a1611e, server csum 66771e72
03:11:42:[18144.873216] Lustre: lustre-OST0000: Recovery over after 0:03, of 3 clients 3 recovered and 0 were evicted.

Looking through Maloo, replay-single test 70f started failing at the beginning of February of this year (2017) and has failed 60 times for the full test group since that time. All failures I've seen are during interop testing. Some early failure logs are at: |
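When scanning many OST console logs for this failure, it can help to pull the two checksums out of the BAD WRITE CHECKSUM line. The sketch below is an assumption-laden helper: the regular expression is derived only from the single console line quoted in this ticket, and the name `parse_bad_write_checksum` is hypothetical, so other Lustre versions may format the message differently.

```python
import re

# Pattern inferred from the one console line quoted in this ticket;
# it is not taken from the Lustre source.
BAD_CSUM_RE = re.compile(
    r"BAD WRITE CHECKSUM: (?P<target>\S+) from (?P<peer>\S+) .*?"
    r"extent \[(?P<start>\d+)-(?P<end>\d+)\]: "
    r"client csum (?P<client>[0-9a-f]+), server csum (?P<server>[0-9a-f]+)"
)


def parse_bad_write_checksum(line):
    """Return the fields of a BAD WRITE CHECKSUM line, or None if no match."""
    m = BAD_CSUM_RE.search(line)
    if m is None:
        return None
    return {
        "target": m["target"],
        "peer": m["peer"],
        # The extent is an inclusive byte range, hence the +1.
        "extent_bytes": int(m["end"]) - int(m["start"]) + 1,
        "client_csum": int(m["client"], 16),
        "server_csum": int(m["server"], 16),
    }
```

For the line quoted above, the extent `[0-1048575]` works out to a 1 MiB write, and the client and server checksums differ.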
| Comment by Peter Jones [ 21/Mar/17 ] |
|
Bobijam,

Could you please advise on this one?

Thanks,
Peter |
| Comment by Zhenyu Xu [ 01/Apr/17 ] |
|
The replay-single test 70f only works with OSS server versions after 2.9.52.60 (the issue is

As for the master server failure case, I checked the Maloo report: the OSS version there is 2.9.52.54.gc6f5e81, so it does not have this fix either. |
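The version reasoning in this comment can be checked mechanically. The sketch below is not the Lustre test framework's own version helper; `version_tuple` and `server_newer_than` are hypothetical names, and reading "after 2.9.52.60" as strictly greater is an assumption.

```python
def version_tuple(version):
    """Convert '2.9.52.54.gc6f5e81' to (2, 9, 52, 54).

    Parsing stops at the first non-numeric field, so a trailing git
    hash suffix such as 'gc6f5e81' is ignored.
    """
    parts = []
    for field in version.split("."):
        if not field.isdecimal():
            break
        parts.append(int(field))
    return tuple(parts)


def server_newer_than(version, minimum="2.9.52.60"):
    """True if `version` is strictly newer than `minimum` (assumed reading)."""
    return version_tuple(version) > version_tuple(minimum)
```

With these definitions, `server_newer_than("2.9.52.54.gc6f5e81")` is false, which matches the observation that the failing master OSS predates the fix.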
| Comment by Gerrit Updater [ 19/Apr/17 ] |
|
Wei Liu (wei3.liu@intel.com) uploaded a new patch: https://review.whamcloud.com/26739 |
| Comment by Gerrit Updater [ 01/May/17 ] |
|
Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/26739/ |
| Comment by Peter Jones [ 01/May/17 ] |
|
Landed for 2.10 |
| Comment by James Casper [ 23/Aug/17 ] |
|
Seen again in 2.10.51 (b3620): https://testing.hpdd.intel.com/test_sessions/c4874eda-04c9-40c5-9e92-b8e7574bd5fe |
| Comment by James Nunez (Inactive) [ 19/Jan/18 ] |
|
There seems to still be an issue with checksums that shows up in replay-single test 87a, but I'll open a new ticket for that issue. |