[LU-6670] Hard Failover recovery-small test_28: post-failover df: 1 Created: 01/Jun/15  Updated: 29/Jan/22

Status: Open
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.8.0
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: Maloo Assignee: Hongchao Zhang
Resolution: Unresolved Votes: 0
Labels: patch
Environment:

lustre-master build #3029 SLES11 SP3


Issue Links:
Duplicate
is duplicated by LU-391 land bz24092 (build src.rpm for lustr... Resolved
Related
is related to LU-8544 recovery-double-scale test_pairwise_f... Resolved
is related to LU-5115 replay-single test_73b: @@@@@@ FAIL: ... Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

This issue was created by maloo for sarah_lw <wei3.liu@intel.com>

This issue relates to the following test suite run: https://testing.hpdd.intel.com/test_sets/20c5cc8e-ff3d-11e4-a4ed-5254006e85c2.

The sub-test test_28 failed with the following error:

post-failover df: 1
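For context, the failure message means the client-side `df` check that recovery-small runs after the failover exited with status 1, i.e. the client could not statfs the filesystem once the MDS came back. A minimal sketch of that kind of post-failover health check (a paraphrase for illustration, not the actual recovery-small.sh code; `/tmp` stands in for `/mnt/lustre` here):

```python
import os

def post_failover_df(mountpoint):
    """Return (rc, message): rc 0 if statfs on the mount succeeds,
    nonzero otherwise, mirroring the 'post-failover df: <rc>' check."""
    try:
        os.statvfs(mountpoint)  # same syscall df uses to probe the fs
        return 0, "post-failover df ok"
    except OSError as e:
        # On a Lustre client evicted mid-recovery this surfaces as EIO (-5)
        rc = 1
        return rc, "post-failover df: %d" % rc

rc, msg = post_failover_df("/tmp")  # stand-in for /mnt/lustre
print(msg)
```

In the real test the statfs travels to the MDS (see the `lmv_statfs`/`ll_statfs_internal` errors at -5 in the dmesg below), so the check fails whenever replay has not completed.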

client dmesg

[192506.215454] Lustre: DEBUG MARKER: == recovery-small test 28: handle error adding new clients (bug 6086) ================================ 04:02:23 (1432119743)
[192506.310965] Lustre: DEBUG MARKER: mcreate /mnt/lustre/f28.recovery-small
[192506.324746] Lustre: DEBUG MARKER: lctl set_param ldlm.namespaces.*.early_lock_cancel=0
[192506.331351] Lustre: DEBUG MARKER: lctl set_param fail_loc=0x80000305
[192506.338932] Lustre: DEBUG MARKER: chmod 0777 /mnt/lustre/f28.recovery-small
[192506.349402] Lustre: *** cfs_fail_loc=305, val=0***
[192506.349407] Lustre: Skipped 2 previous similar messages
[192506.377737] Lustre: DEBUG MARKER: lctl set_param fail_loc=0
[192506.388914] Lustre: DEBUG MARKER: lctl set_param fail_val=0
[192506.398826] Lustre: DEBUG MARKER: lctl set_param ldlm.namespaces.*.early_lock_cancel=1
[192506.635438] LustreError: 167-0: lustre-MDT0000-mdc-ffff880070b36000: This client was evicted by lustre-MDT0000; in progress operations using this service will fail.
[192506.635441] LustreError: Skipped 8 previous similar messages
[192506.635853] LustreError: 7641:0:(vvp_io.c:1444:vvp_io_init()) lustre: refresh file layout [0x200029440:0x28af:0x0] error -5.
[192506.655993] LustreError: 15736:0:(ldlm_resource.c:776:ldlm_resource_complain()) lustre-MDT0000-mdc-ffff880070b36000: namespace resource [0x200029440:0x299b:0x0].0 (ffff88005f6ee1c0) refcount nonzero (1) after lock cleanup; forcing cleanup.
[192506.655998] LustreError: 15736:0:(ldlm_resource.c:1369:ldlm_resource_dump()) --- Resource: [0x200029440:0x299b:0x0].0 (ffff88005f6ee1c0) refcount = 2
[192517.136125] LustreError: 166-1: MGC10.1.6.246@tcp: Connection to MGS (at 10.1.6.250@tcp) was lost; in progress operations using this service will fail
[192537.136108] LustreError: 23405:0:(mgc_request.c:527:do_requeue()) failed processing log: -5
[192557.136202] LustreError: 23405:0:(mgc_request.c:527:do_requeue()) failed processing log: -5
[192565.200662] Lustre: DEBUG MARKER: PATH=/usr/lib64/lustre/tests:/usr/lib/lustre/tests:/usr/lib64/lustre/tests:/opt/iozone/bin:/opt/iozone/bin:/usr/lib64/mpi/gcc/openmpi/bin:/usr/lib64/lustre/tests/mpi:/usr/lib64/lustre/tests/racer:/usr/lib64/lustre/../lustre-iokit/sgpdd-survey:/usr/lib64/l
[192565.453472] Lustre: DEBUG MARKER: lctl get_param -n at_max
[192565.516470] Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec
[192565.692596] Lustre: DEBUG MARKER: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec
[192577.140250] Lustre: Evicted from MGS (at 10.1.6.246@tcp) after server handle changed from 0x7860df5639e24149 to 0xd44806558079509b
[192582.157200] Lustre: 17892:0:(client.c:2755:ptlrpc_replay_interpret()) @@@ Version mismatch during replay
[192582.157203]   req@ffff88005f4a1cc0 x1501485024619144/t665719930885(665719930885) o36->lustre-MDT0000-mdc-ffff880070b36000@10.1.6.246@tcp:12/10 lens 504/424 e 0 to 0 dl 1432119950 ref 2 fl Interpret:R/4/0 rc -75/-75
[192713.156228] Lustre: 17892:0:(import.c:1293:completed_replay_interpret()) lustre-MDT0000-mdc-ffff880070b36000: version recovery fails, reconnecting
[192713.158878] LustreError: 16488:0:(lmv_obd.c:1474:lmv_statfs()) can't stat MDS #0 (lustre-MDT0000-mdc-ffff880070b36000), error -5
[192713.158895] LustreError: 16488:0:(llite_lib.c:1762:ll_statfs_internal()) md_statfs fails: rc = -5
[192713.159013] LustreError: 7641:0:(vvp_io.c:1444:vvp_io_init()) lustre: refresh file layout [0x200029440:0x2e5a:0x0] error -5.
[192713.159022] LustreError: 7641:0:(vvp_io.c:1444:vvp_io_init()) Skipped 115 previous similar messages
[192713.184101] LustreError: 16492:0:(ldlm_resource.c:776:ldlm_resource_complain()) lustre-MDT0000-mdc-ffff880070b36000: namespace resource [0x200029440:0x2f6b:0x0].0 (ffff88005f6f1300) refcount nonzero (1) after lock cleanup; forcing cleanup.
[192713.184105] LustreError: 16492:0:(ldlm_resource.c:1369:ldlm_resource_dump()) --- Resource: [0x200029440:0x2f6b:0x0].0 (ffff88005f6f1300) refcount = 1
[192713.184261] Lustre: lustre-MDT0000-mdc-ffff880070b36000: Connection restored to lustre-MDT0000 (at 10.1.6.246@tcp)
[192713.184263] Lustre: Skipped 20 previous similar messages
[192713.272121] Lustre: DEBUG MARKER: /usr/sbin/lctl mark  recovery-small test_28: @@@@@@ FAIL: post-failover df: 1 
[192713.469987] Lustre: DEBUG MARKER: recovery-small test_28: @@@@@@ FAIL: post-failover df: 1
[192713.728348] Lustre: DEBUG MARKER: /usr/sbin/lctl dk > /logdir/test_logs/2015-05-18/lustre-master-el6_6-x86_64-vs-lustre-master-sles11sp3-x86_64--failover--2_9_1__3029__-70061678429280-004551/recovery-small.test_28.debug_log.$(hostname -s).1432119951.log;
[192713.728351]          dmesg > /logdir/test_logs
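For readers decoding the rc values in the log above: the standard Linux errno mapping puts -5 at EIO (the statfs and layout-refresh failures) and -75 at EOVERFLOW, which ptlrpc uses for the "Version mismatch during replay" case. A quick way to confirm the mapping:

```python
import errno
import os

# rc values seen in the client dmesg above, negated to plain errno codes
for code in (5, 75):
    print(code, errno.errorcode[code], os.strerror(code))
# 5  -> EIO       "Input/output error"
# 75 -> EOVERFLOW "Value too large for defined data type"
```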



 Comments   
Comment by Gerrit Updater [ 25/Jul/16 ]

Hongchao Zhang (hongchao.zhang@intel.com) uploaded a new patch: http://review.whamcloud.com/21497
Subject: LU-6670 ptlrpc: commit the first req with transno
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: a51eb89fc94041da92d07dfa3d82c544d720587a

Comment by Saurabh Tandan (Inactive) [ 06/Sep/16 ]

This issue has occurred about 32 times in the past 30 days.

Comment by Niu Yawei (Inactive) [ 07/Sep/16 ]

This looks similar to an issue I found earlier: LU-5115. I had a different take on it there: I suspected a test script problem. Hongchao, could you take a look at the initial analysis on LU-5115 to see if it's correct?

Comment by Hongchao Zhang [ 26/Mar/18 ]

The patch http://review.whamcloud.com/21497 has been updated.

Generated at Sat Feb 10 02:02:14 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.