[LU-10950] replay-single test_0c: post-failover df failed Created: 24/Apr/18  Updated: 14/Jun/22

Status: Reopened
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.12.0, Lustre 2.13.0, Lustre 2.12.2, Lustre 2.12.5, Lustre 2.12.6
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: Maloo Assignee: WC Triage
Resolution: Unresolved Votes: 0
Labels: None

Issue Links:
Related
is related to LU-12769 replay-dual test 0b hangs in client m... Resolved
is related to LU-11762 replay-single test 0d fails with 'po... Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

This issue was created by maloo for sarah_lw <wei3.liu@intel.com>

This issue relates to the following test suite run: https://testing.whamcloud.com/test_sets/3f06735e-47a3-11e8-960d-52540065bddc

test_0c failed with the following error:

post-failover df failed
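
For context, test_0c ("check replay-barrier") boils down to roughly the following sequence. This is a paraphrased sketch using the usual test-framework.sh helper names, not the exact replay-single.sh code:

  replay_barrier $SINGLEMDS                    # stop committing transactions on the MDT
  mcreate $DIR/$tfile                          # uncommitted create that should be lost on replay
  umount $MOUNT
  facet_failover $SINGLEMDS                    # restart / fail over the MDS
  zconf_mount $(hostname) $MOUNT || error "mount fails"
  client_up || error "post-failover df failed"
  rm $DIR/$tfile && error "File exists and it shouldn't"

In other words, the reported failure means the client could not stat the filesystem after the MDS came back, i.e. recovery following the failover did not complete cleanly.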

MDS console

[54749.278470] Lustre: DEBUG MARKER: dmesg
[54750.063053] Lustre: DEBUG MARKER: /usr/sbin/lctl mark == replay-single test 0c: check replay-barrier ======================================================= 12:47:53 \(1524487673\)
[54750.285298] Lustre: DEBUG MARKER: == replay-single test 0c: check replay-barrier ======================================================= 12:47:53 (1524487673)
[54771.807392] Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-3vm4.trevis.hpdd.intel.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4
[54772.067740] Lustre: DEBUG MARKER: trevis-3vm4.trevis.hpdd.intel.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4
[54774.441472] Lustre: Evicted from MGS (at 10.9.4.19@tcp) after server handle changed from 0x3be7d72ea5628d3a to 0x3be7d72ea5629b9c
[54774.447132] LustreError: 167-0: lustre-MDT0000-lwp-MDT0001: This client was evicted by lustre-MDT0000; in progress operations using this service will fail.
[54774.451853] LustreError: Skipped 1 previous similar message
[54811.662136] LustreError: 11-0: lustre-MDT0000-lwp-MDT0001: operation quota_acquire to node 10.9.4.19@tcp failed: rc = -11
[54811.665037] LustreError: Skipped 1 previous similar message
[54871.629282] LustreError: 11-0: lustre-MDT0000-lwp-MDT0001: operation quota_acquire to node 10.9.4.19@tcp failed: rc = -11
[54931.629799] LustreError: 11-0: lustre-MDT0000-lwp-MDT0001: operation quota_acquire to node 10.9.4.19@tcp failed: rc = -11
[54953.483279] LustreError: 11-0: lustre-MDT0000-osp-MDT0001: operation ldlm_enqueue to node 10.9.4.19@tcp failed: rc = -107
[54954.441836] LustreError: 167-0: lustre-MDT0000-osp-MDT0001: This client was evicted by lustre-MDT0000; in progress operations using this service will fail.
[54954.447019] LustreError: Skipped 1 previous similar message
[54955.493750] Lustre: DEBUG MARKER: /usr/sbin/lctl mark replay-single test_0c: @@@@@@ FAIL: post-failover df failed 
[54955.689555] Lustre: DEBUG MARKER: replay-single test_0c: @@@@@@ FAIL: post-failover df failed
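
A note on the return codes in the console log above: rc = -11 is -EAGAIN (the quota_acquire RPCs keep being retried while MDT0000 is unreachable) and rc = -107 is -ENOTCONN (the osp/lwp connections to MDT0000 were dropped by the eviction). A hypothetical way to inspect those inter-MDT connections from the surviving MDS while this is happening; the get_param path is an assumption and may differ between Lustre versions and target types:

  lctl dl                                           # list local obd devices, including the *-osp-* and *-lwp-* connections
  lctl get_param osp.lustre-MDT0000-osp-MDT0001.*   # assumed path: dump all parameters of the osp device that links MDT0001 to MDT0000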

VVVVVVV DO NOT REMOVE LINES BELOW, Added by Maloo for auto-association VVVVVVV
replay-single test_0c - post-failover df failed



 Comments   
Comment by Jian Yu [ 04/Nov/18 ]

+1 on master branch: https://testing.whamcloud.com/test_sets/a167544c-dff0-11e8-a251-52540065bddc

Comment by James A Simmons [ 23/Sep/19 ]

The potential fix for LU-12769 should resolve this issue.

Comment by James A Simmons [ 10/Oct/19 ]

This should be fixed by https://review.whamcloud.com/#/c/36274. If not, we can reopen.

Comment by James Nunez (Inactive) [ 05/Nov/19 ]

It looks like we are still seeing this issue even after patch https://review.whamcloud.com/36274/ landed to master. Please see https://testing.whamcloud.com/test_sets/42920818-fe12-11e9-8e77-52540065bddc for a recent failure.

Comment by James A Simmons [ 05/Nov/19 ]

Can you try https://review.whamcloud.com/#/c/35627?
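
If it helps, the usual way to pull a Gerrit change like that onto a local branch for testing is shown below; this assumes the fs/lustre-release project path, and the trailing /1 is a placeholder patch set number, so substitute the latest one shown on the review page:

  git fetch https://review.whamcloud.com/fs/lustre-release refs/changes/27/35627/1 && \
      git cherry-pick FETCH_HEAD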

Comment by James Nunez (Inactive) [ 02/Jun/20 ]

It looks like we are still seeing this issue on the b2_12 branch (2.12.5 RC1) at https://testing.whamcloud.com/test_sets/b1f9a784-03da-4f5d-8586-8f22b0ec1803. In the MDS 2, 4 console, we don't see the "operation quota_acquire to node 10.9.4.19@tcp failed: rc = -11" messages, but we do see the ldlm_enqueue error:

[68319.207306] Lustre: DEBUG MARKER: == replay-single test 0c: check replay-barrier ======================================================= 16:40:20 (1590856820)
[68341.027124] Lustre: DEBUG MARKER: /usr/sbin/lctl mark == rpc test complete, duration -o sec ================================================================ 16:40:42 \(1590856842\)
[68341.238283] Lustre: DEBUG MARKER: == rpc test complete, duration -o sec ================================================================ 16:40:42 (1590856842)
[68341.608159] Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-57vm4.trevis.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4
[68341.811858] Lustre: DEBUG MARKER: trevis-57vm4.trevis.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4
[68342.615134] LustreError: 167-0: lustre-MDT0000-lwp-MDT0001: This client was evicted by lustre-MDT0000; in progress operations using this service will fail.
[68342.617641] LustreError: Skipped 1 previous similar message
[68342.618663] Lustre: Evicted from MGS (at 10.9.1.247@tcp) after server handle changed from 0x6113ba905ec7fb33 to 0x6113ba905ec808b5
[68521.699520] LustreError: 11-0: lustre-MDT0000-osp-MDT0003: operation ldlm_enqueue to node 10.9.1.247@tcp failed: rc = -107
[68522.897958] LustreError: 167-0: lustre-MDT0000-osp-MDT0001: This client was evicted by lustre-MDT0000; in progress operations using this service will fail.
[68522.900333] LustreError: Skipped 1 previous similar message
[68527.244961] Lustre: DEBUG MARKER: /usr/sbin/lctl mark  replay-single test_0c: @@@@@@ FAIL: post-failover df failed 
[68527.479305] Lustre: DEBUG MARKER: replay-single test_0c: @@@@@@ FAIL: post-failover df failed

On the MDS1, 3 console, we see

[68493.204660] Lustre: lustre-MDT0000: Denying connection for new client 00c98053-78c2-d29f-2efa-cfe5ab2d84b0 (at 10.9.1.244@tcp), waiting for 5 known clients (0 recovered, 4 in progress, and 0 evicted) to recover in 0:28
[68493.208019] Lustre: Skipped 12 previous similar messages
[68521.668216] Lustre: lustre-MDT0000: recovery is timed out, evict stale exports
[68521.669631] Lustre: lustre-MDT0000: disconnecting 1 stale clients
[68521.670781] Lustre: 824:0:(ldlm_lib.c:1782:extend_recovery_timer()) lustre-MDT0000: extended recovery timer reached hard limit: 180, extend: 1
[68521.673037] Lustre: 824:0:(ldlm_lib.c:2063:target_recovery_overseer()) lustre-MDT0000 recovery is aborted by hard timeout
[68521.674849] Lustre: 824:0:(ldlm_lib.c:2073:target_recovery_overseer()) recovery is aborted, evict exports in recovery
[68521.676769] LustreError: 824:0:(tgt_grant.c:248:tgt_grant_sanity_check()) mdt_obd_disconnect: tot_granted 0 != fo_tot_granted 2097152
[68521.678875] Lustre: 824:0:(ldlm_lib.c:1616:abort_req_replay_queue()) @@@ aborted:  req@ffff8a16576ce880 x1668123554855936/t0(8589934655) o36->c051fb07-a55c-c0a1-229e-7286a90a1044@10.9.1.245@tcp:309/0 lens 536/0 e 2 to 0 dl 1590857034 ref 1 fl Complete:/4/ffffffff rc 0/-1
[68521.683316] LustreError: 824:0:(ldlm_lib.c:1637:abort_lock_replay_queue()) @@@ aborted:  req@ffff8a164e504050 x1668118635732736/t0(0) o101->lustre-MDT0003-mdtlov_UUID@10.9.1.253@tcp:312/0 lens 328/0 e 6 to 0 dl 1590857037 ref 1 fl Complete:/40/ffffffff rc 0/-1
[68521.687295] LustreError: 11-0: lustre-MDT0000-osp-MDT0002: operation ldlm_enqueue to node 0@lo failed: rc = -107
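
The "recovery is timed out, evict stale exports" and "extended recovery timer reached hard limit: 180" lines show that MDT0000 gave up on recovery at the hard limit rather than completing replay for all clients, which matches the evictions and the failed post-failover df on the other console. A hypothetical way to watch this from the recovering MDS; the parameter names are assumed from the standard Lustre recovery tunables and may vary by version:

  lctl get_param mdt.lustre-MDT0000.recovery_status        # recovery state, connected/evicted client counts, time remaining
  lctl get_param mdt.lustre-MDT0000.recovery_time_hard     # assumed source of the 180s hard limit seen in the log
  lctl get_param mdt.lustre-MDT0000.recovery_time_soft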

Is this the same issue as described in this ticket?
