[LU-10702] replay-single test_87a: checksum doesn't match Created: 23/Feb/18 Updated: 14/Dec/21 Resolved: 14/Dec/21 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.10.1, Lustre 2.11.0, Lustre 2.10.2, Lustre 2.10.3 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Minor |
| Reporter: | Maloo | Assignee: | Hongchao Zhang |
| Resolution: | Duplicate | Votes: | 0 |
| Labels: | suse | ||
| Environment: |
Hard Failover |
||
| Issue Links: |
|
||||||||||||
| Severity: | 3 | ||||||||||||
| Rank (Obsolete): | 9223372036854775807 | ||||||||||||
| Description |
|
replay-single test_87a - New checksum d41d8cd98f00b204e9800998ecf8427e does not match original 258b70206dfda5af3d4dfe6946e0adb8
This issue was created by maloo for Saurabh Tandan <saurabh.tandan@intel.com>
This issue relates to the following test suite run: https://testing.hpdd.intel.com/test_sets/5874e3d8-12d9-11e8-bd00-52540065bddc
test_87a failed with the following error:
New checksum d41d8cd98f00b204e9800998ecf8427e does not match original 258b70206dfda5af3d4dfe6946e0adb8
Test_logs:
trevis-27vm3: CMD: trevis-27vm3 lctl get_param -n at_max
trevis-27vm1: CMD: trevis-27vm1 lctl get_param -n at_max
trevis-27vm3: osc.lustre-OST0000-osc-ffff*.ost_server_uuid in FULL state after 0 sec
trevis-27vm1: osc.lustre-OST0000-osc-ffff*.ost_server_uuid in FULL state after 0 sec
trevis-27vm4: CMD: trevis-27vm4 lctl get_param -n at_max
trevis-27vm4: osc.lustre-OST0000-osc-ffff*.ost_server_uuid in FULL state after 0 sec
trevis-27vm3: osc.lustre-OST0001-osc-ffff*.ost_server_uuid in FULL state after 0 sec
trevis-27vm1: osc.lustre-OST0001-osc-ffff*.ost_server_uuid in FULL state after 0 sec
trevis-27vm4: osc.lustre-OST0001-osc-ffff*.ost_server_uuid in FULL state after 0 sec
trevis-27vm3: osc.lustre-OST0002-osc-ffff*.ost_server_uuid in FULL state after 0 sec
trevis-27vm4: osc.lustre-OST0002-osc-ffff*.ost_server_uuid in FULL state after 0 sec
trevis-27vm1: osc.lustre-OST0002-osc-ffff*.ost_server_uuid in FULL state after 0 sec
trevis-27vm3: osc.lustre-OST0003-osc-ffff*.ost_server_uuid in FULL state after 0 sec
trevis-27vm4: osc.lustre-OST0003-osc-ffff*.ost_server_uuid in FULL state after 0 sec
trevis-27vm1: osc.lustre-OST0003-osc-ffff*.ost_server_uuid in FULL state after 0 sec
trevis-27vm3: osc.lustre-OST0004-osc-ffff*.ost_server_uuid in FULL state after 0 sec
trevis-27vm1: osc.lustre-OST0004-osc-ffff*.ost_server_uuid in FULL state after 0 sec
trevis-27vm4: osc.lustre-OST0004-osc-ffff*.ost_server_uuid in FULL state after 0 sec
trevis-27vm3: osc.lustre-OST0005-osc-ffff*.ost_server_uuid in FULL state after 0 sec
trevis-27vm1: osc.lustre-OST0005-osc-ffff*.ost_server_uuid in FULL state after 0 sec
trevis-27vm4: osc.lustre-OST0005-osc-ffff*.ost_server_uuid in FULL state after 0 sec
trevis-27vm1: osc.lustre-OST0006-osc-ffff*.ost_server_uuid in FULL state after 3 sec
trevis-27vm3: osc.lustre-OST0006-osc-ffff*.ost_server_uuid in FULL state after 4 sec
trevis-27vm4: osc.lustre-OST0006-osc-ffff*.ost_server_uuid in FULL state after 4 sec
0+0 records in
0+0 records out
0 bytes copied, 0.0018678 s, 0.0 kB/s
replay-single test_87a: @@@@@@ FAIL: New checksum d41d8cd98f00b204e9800998ecf8427e does not match original 258b70206dfda5af3d4dfe6946e0adb8
Trace dump: |
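A note on the "new" checksum: d41d8cd98f00b204e9800998ecf8427e is the MD5 digest of empty input, which fits the `0+0 records in` / `0 bytes copied` output in the test log. The sketch below checks this. It assumes the test compares MD5 digests; the 32-hex-digit values suggest so, but the report itself does not name the algorithm.

```python
import hashlib

# Checksums copied verbatim from the failure message in this report.
reported_new = "d41d8cd98f00b204e9800998ecf8427e"
reported_original = "258b70206dfda5af3d4dfe6946e0adb8"

# MD5 of zero bytes (assumption: the test uses MD5 for its checksums).
empty_md5 = hashlib.md5(b"").hexdigest()

# True means the file read back as empty after failover.
new_is_empty_file = (empty_md5 == reported_new)
print(new_is_empty_file)  # prints True
```

If this reading is right, the mismatch reflects a file that came back with no data after the OST failover, not one with corrupted contents.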
| Comments |
| Comment by James Nunez (Inactive) [ 28/Feb/18 ] |
|
This test started failing with checksum errors between 2017-09-13 and 2017-09-26, at master tag 2.10.53, during SLES client and SLES client/server testing.
See the following logs for some early failures:
Also seen in b2_10 starting with 2.10.1 on 2017-10-05; |
| Comment by Peter Jones [ 28/Feb/18 ] |
|
Hongchao,
Any idea why this is failing for SLES?
Peter |
| Comment by Hongchao Zhang [ 12/Mar/18 ] |
|
I searched Maloo for the failed test_87a runs and found that all of them are caused by client eviction, for there is no corresponding
[40129.638078] LustreError: 167-0: lustre-OST0000-osc-ffff880076344000: This client was evicted by lustre-OST0000; in progress operations using this service will fail.
[40170.533253] LustreError: 167-0: lustre-OST0001-osc-ffff880076344000: This client was evicted by lustre-OST0001; in progress operations using this service will fail.
[40210.533458] LustreError: 167-0: lustre-OST0002-osc-ffff880076344000: This client was evicted by lustre-OST0002; in progress operations using this service will fail.
[40258.273807] LustreError: 167-0: lustre-OST0003-osc-ffff880076344000: This client was evicted by lustre-OST0003; in progress operations using this service will fail.
00000100:00080000:1.0:1520181481.496098:0:29559:0:(import.c:675:ptlrpc_connect_import()) ffff88007b158800 lustre-OST0000_UUID: changing import state from DISCONN to CONNECTING
00000100:00080000:1.0:1520181481.496100:0:29559:0:(import.c:519:import_select_connection()) lustre-OST0000-osc-ffff880076344000: connect to NID 10.9.6.83@tcp last attempt 0
00000100:00080000:1.0:1520181481.496103:0:29559:0:(import.c:589:import_select_connection()) lustre-OST0000-osc-ffff880076344000: Connection changing to lustre-OST0000 (at 10.9.6.83@tcp)
00000100:00080000:1.0:1520181481.496105:0:29559:0:(import.c:597:import_select_connection()) lustre-OST0000-osc-ffff880076344000: import ffff88007b158800 using connection 10.9.6.83@tcp/10.9.6.83@tcp
00000008:00100000:1.0:1520181481.496108:0:29559:0:(osc_request.c:2652:osc_reconnect()) ocd_connect_flags: 0x20445af0e3640478 ocd_version: 34210560 ocd_grant: 8388608, lost: 0.
00000100:00100000:1.0:1520181481.496114:0:29559:0:(import.c:763:ptlrpc_connect_import()) @@@ (re)connect request (timeout 5) req@ffff880061791cc0 x1594011648012240/t0(0) o8->lustre-OST0000-osc-ffff880076344000@10.9.6.83@tcp:28/4 lens 520/544 e 0 to 0 dl 0 ref 1 fl New:N/0/ffffffff rc 0/-1
00000020:01000000:1.0:1520181481.496126:0:29559:0:(obd_config.c:1370:class_process_proc_param()) lustre-OST0000-osc-ffff880076344000: set parameter 'import=connection=10.9.6.83@tcp::60'
10000000:01000000:1.0:1520181481.496137:0:29559:0:(mgc_request.c:2151:mgc_process_log()) MGC10.9.6.85@tcp: configuration from log 'lustre-cliir' succeeded (0).
00000100:00100000:0.0:1520181481.496137:0:31324:0:(client.c:1620:ptlrpc_send_new_req()) Sending RPC pname:cluuid:pid:xid:nid:opc ptlrpcd_rcv:51f0d3b2-a534-84b0-eaaa-ade7c191b8ad:31324:1594011648012240:10.9.6.83@tcp:8
00010000:00010000:1.0:1520181481.496140:0:29559:0:(ldlm_lock.c:797:ldlm_lock_decref_internal_nolock()) ### ldlm_lock_decref(CR) ns: ?? lock: ffff88006109c000/0xdfe6b47beb46e4d3 lrc: 3/1,0 mode: CR/CR res: ?? rrc=?? type: ??? flags: 0x11000000000000 nid: local remote: 0xfafbc8b63ef58e4d expref: -99 pid: 29559 timeout: 0 lvb_type: 0
00010000:00010000:1.0:1520181481.496143:0:29559:0:(ldlm_lock.c:887:ldlm_lock_decref_internal()) ### do not add lock into lru list ns: ?? lock: ffff88006109c000/0xdfe6b47beb46e4d3 lrc: 2/0,0 mode: CR/CR res: ?? rrc=?? type: ??? flags: 0x11000000000000 nid: local remote: 0xfafbc8b63ef58e4d expref: -99 pid: 29559 timeout: 0 lvb_type: 0
00000100:00080000:0.0:1520181481.496831:0:31324:0:(import.c:1041:ptlrpc_connect_interpret()) lustre-OST0000-osc-ffff880076344000: connect to target with instance 60
00000080:00000004:0.0:1520181481.496841:0:31324:0:(lcommon_misc.c:97:cl_ocd_update()) Changing connect_flags: 0x20405af0e3440478 -> 0x20405af0e3440478
00000080:00080000:0.0:1520181481.496845:0:31324:0:(lcommon_misc.c:71:cl_init_ea_size()) updating def/max_easize: 72/216
00000100:00080000:0.0:1520181481.496850:0:31324:0:(import.c:1239:ptlrpc_connect_interpret()) @@@ lustre-OST0000-osc-ffff880076344000: evicting (reconnect/recover flags not set: 4) req@ffff880061791cc0 x1594011648012240/t0(0) o8->lustre-OST0000-osc-ffff880076344000@10.9.6.83@tcp:28/4 lens 520/384 e 0 to 0 dl 1520181506 ref 1 fl Interpret:RN/0/0 rc 0/0
00000100:00080000:0.0:1520181481.496854:0:31324:0:(import.c:1242:ptlrpc_connect_interpret()) ffff88007b158800 lustre-OST0000_UUID: changing import state from CONNECTING to EVICTED
00000100:02020000:0.0:1520181481.496857:0:31324:0:(import.c:1471:ptlrpc_import_recovery_state_machine()) 167-0: lustre-OST0000-osc-ffff880076344000: This client was evicted by lustre-OST0000; in progress operations using this service will fail.
00000100:00080000:0.0:1520181481.496873:0:31324:0:(import.c:1475:ptlrpc_import_recovery_state_machine()) evicted from lustre-OST0000_UUID@10.9.6.83@tcp; invalidating
00000100:00100000:0.0:1520181481.501384:0:31324:0:(client.c:2045:ptlrpc_check_set()) Completed RPC pname:cluuid:pid:xid:nid:opc ptlrpcd_rcv:51f0d3b2-a534-84b0-eaaa-ade7c191b8ad:31324:1594011648012240:10.9.6.83@tcp:8
00000100:00080000:1.0:1520181481.501394:0:30975:0:(import.c:1417:ptlrpc_invalidate_import_thread()) thread invalidate import lustre-OST0000-osc-ffff880076344000 to lustre-OST0000_UUID@10.9.6.83@tcp
00000100:00080000:1.0:1520181481.501397:0:30975:0:(import.c:214:ptlrpc_deactivate_and_unlock_import()) setting import lustre-OST0000_UUID INVALID
00000100:00100000:1.0:1520181481.501398:0:30975:0:(client.c:2713:ptlrpc_free_committed()) lustre-OST0000-osc-ffff880076344000: committing for last_committed 0 gen 2
00000100:00100000:1.0:1520181481.501402:0:30975:0:(client.c:2733:ptlrpc_free_committed()) @@@ free request with old gen req@ffff880078cd53c0 x1594011648010144/t253403070465(253403070465) o10->lustre-OST0000-osc-ffff880076344000@10.9.6.83@tcp:6/4 lens 560/400 e 0 to 0 dl 1520181403 ref 1 fl Complete:R/4/0 rc 0/0
00000100:00100000:1.0:1520181481.501409:0:30975:0:(client.c:2733:ptlrpc_free_committed()) @@@ free request with old gen req@ffff880017b143c0 x1594011648010176/t253403070466(253403070466) o4->lustre-OST0000-osc-ffff880076344000@10.9.6.83@tcp:6/4 lens 608/416 e 0 to 0 dl 1520181398 ref 1 fl Complete:R/4/0 rc 0/0
00000100:00100000:1.0:1520181481.501723:0:30975:0:(client.c:2733:ptlrpc_free_committed()) @@@ free request with old gen req@ffff880017b146c0 x1594011648010208/t253403070467(253403070467) o4->lustre-OST0000-osc-ffff880076344000@10.9.6.83@tcp:6/4 lens 608/416 e 0 to 0 dl 1520181398 ref 1 fl Complete:R/4/0 rc 0/0
00020000:01000000:1.0:1520181481.501988:0:30975:0:(lov_obd.c:425:lov_set_osc_active()) Marking OSC lustre-OST0000_UUID inactive
00000080:00000004:1.0:1520181481.501990:0:30975:0:(lcommon_misc.c:97:cl_ocd_update()) Changing connect_flags: 0x20405af0e3440478 -> 0x20405af0e3440478
00000080:00080000:1.0:1520181481.501992:0:30975:0:(lcommon_misc.c:71:cl_init_ea_size()) updating def/max_easize: 72/192
00000100:00100000:1.0:1520181481.501994:0:30975:0:(import.c:318:ptlrpc_invalidate_import()) Sleeping 20 sec for inflight to error out
00000100:00080000:1.0:1520181481.502288:0:30975:0:(import.c:1426:ptlrpc_invalidate_import_thread()) ffff88007b158800 lustre-OST0000_UUID: changing import state from EVICTED to RECOVER
00000100:00100000:1.0:1520181481.497603:0:2470:0:(events.c:350:request_in_callback()) peer: 12345-10.9.6.78@tcp (source: 12345-10.9.6.78@tcp)
00000100:00100000:0.0:1520181481.497642:0:2506:0:(service.c:1939:ptlrpc_server_handle_req_in()) got req x1594011648012240
00000100:00100000:0.0:1520181481.497663:0:2506:0:(nrs_fifo.c:179:nrs_fifo_req_get()) NRS start fifo request from 12345-10.9.6.78@tcp, seq: 54
00000100:00100000:0.0:1520181481.497669:0:2506:0:(service.c:2089:ptlrpc_server_handle_request()) Handling RPC pname:cluuid+ref:pid:xid:nid:opc ll_ost00_001:0+-99:31324:x1594011648012240:12345-10.9.6.78@tcp:8
00010000:00080000:0.0:1520181481.497685:0:2506:0:(ldlm_lib.c:1229:target_handle_connect()) lustre-OST0000: connection from 51f0d3b2-a534-84b0-eaaa-ade7c191b8ad@10.9.6.78@tcp t253403070465 exp (null) cur 1520181481 last 0
00000020:00000080:0.0:1520181481.497698:0:2506:0:(genops.c:1379:class_connect()) connect: client 51f0d3b2-a534-84b0-eaaa-ade7c191b8ad, cookie 0xe0f23292040a3e10
00002000:00100000:0.0:1520181481.497704:0:2506:0:(ofd_obd.c:157:ofd_parse_connect_data()) lustre-OST0000: cli 51f0d3b2-a534-84b0-eaaa-ade7c191b8ad/ffff88006a23d400 ocd_connect_flags: 0x20445af0e3640478 ocd_version: 20a0300 ocd_grant: 8388608 ocd_index: 0 ocd_group 0
00002000:00100000:0.0:1520181481.497712:0:2506:0:(ofd_obd.c:264:ofd_parse_connect_data()) lustre-OST0000: cli (no nid) supports cksum type 7, return 7
00000020:01000000:0.0:1520181481.497737:0:2506:0:(lprocfs_status_server.c:342:lprocfs_exp_setup()) using hash ffff88007a296c40
00002000:00080000:0.0:1520181481.497753:0:2506:0:(ofd_obd.c:397:ofd_obd_connect()) lustre-OST0000: get connection from MDS 0
00000100:00080000:0.0:1520181481.497762:0:2506:0:(import.c:104:ptlrpc_import_enter_resend()) ffff88007c65d800 : changing import state from NEW to RECOVER
00000100:00080000:0.0:1520181481.497766:0:2506:0:(import.c:1536:ptlrpc_import_recovery_state_machine()) ffff88007c65d800 : changing import state from RECOVER to FULL
00000100:02000000:0.0:1520181481.497768:0:2506:0:(import.c:1542:ptlrpc_import_recovery_state_machine()) lustre-OST0000: Connection restored to (at 10.9.6.78@tcp)
00000100:00100000:0.0:1520181481.497802:0:2506:0:(service.c:2135:ptlrpc_server_handle_request()) Handled RPC pname:cluuid+ref:pid:xid:nid:opc ll_ost00_001:51f0d3b2-a534-84b0-eaaa-ade7c191b8ad+6:31324:x1594011648012240:12345-10.9.6.78@tcp:8 Request procesed in 133us (206us total) trans 0 rc 0/0
Could it be that the client connection was not committed before failover? |
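To make the eviction sequence in the debug log easier to follow, the import state changes can be pulled out with a short script. This is a minimal sketch, not a Lustre tool: `import_transitions` is a hypothetical helper, and it assumes the timestamp is the fourth colon-separated field of each debug line, which is inferred from the excerpt above and not taken from Lustre documentation.

```python
import re

# Message text as it appears in the debug log excerpt above.
STATE_RE = re.compile(r"changing import state from (\w+) to (\w+)")

def import_transitions(lines):
    """Return (timestamp, old_state, new_state) for each import state change.

    Assumption: the timestamp is the 4th colon-separated field of a
    debug log line, as in the excerpt quoted in this ticket.
    """
    found = []
    for line in lines:
        match = STATE_RE.search(line)
        if match is None:
            continue
        fields = line.split(":", 7)
        # Keep the timestamp as a string to avoid float rounding.
        timestamp = fields[3] if len(fields) > 3 else ""
        found.append((timestamp, match.group(1), match.group(2)))
    return found

# Three client-side lines copied from the log above.
sample = [
    "00000100:00080000:1.0:1520181481.496098:0:29559:0:(import.c:675:ptlrpc_connect_import()) ffff88007b158800 lustre-OST0000_UUID: changing import state from DISCONN to CONNECTING",
    "00000100:00080000:0.0:1520181481.496854:0:31324:0:(import.c:1242:ptlrpc_connect_interpret()) ffff88007b158800 lustre-OST0000_UUID: changing import state from CONNECTING to EVICTED",
    "00000100:00080000:1.0:1520181481.502288:0:30975:0:(import.c:1426:ptlrpc_invalidate_import_thread()) ffff88007b158800 lustre-OST0000_UUID: changing import state from EVICTED to RECOVER",
]

transitions = import_transitions(sample)
for timestamp, old, new in transitions:
    print(timestamp, old, "->", new)
```

On the client-side lines this gives DISCONN -> CONNECTING -> EVICTED -> RECOVER for the lustre-OST0000 import, which matches the eviction described in the comment.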
| Comment by Artem Blagodarenko (Inactive) [ 23/Nov/21 ] |
|
+1 https://testing.whamcloud.com/test_sets/0b6cbfe1-ab6f-4ab1-adea-6d8d959d05f9 |