| [LU-10950] replay-single test_0c: post-failover df failed | Created: 24/Apr/18 | Updated: 14/Jun/22 |
|
| Status: | Reopened |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.12.0, Lustre 2.13.0, Lustre 2.12.2, Lustre 2.12.5, Lustre 2.12.6 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Minor |
| Reporter: | Maloo | Assignee: | WC Triage |
| Resolution: | Unresolved | Votes: | 0 |
| Labels: | None |
| Issue Links: |
|
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
This issue was created by maloo for sarah_lw <wei3.liu@intel.com>

This issue relates to the following test suite run: https://testing.whamcloud.com/test_sets/3f06735e-47a3-11e8-960d-52540065bddc

test_0c failed with the following error:

post-failover df failed

MDS console:

[54749.278470] Lustre: DEBUG MARKER: dmesg
[54750.063053] Lustre: DEBUG MARKER: /usr/sbin/lctl mark == replay-single test 0c: check replay-barrier ======================================================= 12:47:53 \(1524487673\)
[54750.285298] Lustre: DEBUG MARKER: == replay-single test 0c: check replay-barrier ======================================================= 12:47:53 (1524487673)
[54771.807392] Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-3vm4.trevis.hpdd.intel.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4
[54772.067740] Lustre: DEBUG MARKER: trevis-3vm4.trevis.hpdd.intel.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4
[54774.441472] Lustre: Evicted from MGS (at 10.9.4.19@tcp) after server handle changed from 0x3be7d72ea5628d3a to 0x3be7d72ea5629b9c
[54774.447132] LustreError: 167-0: lustre-MDT0000-lwp-MDT0001: This client was evicted by lustre-MDT0000; in progress operations using this service will fail.
[54774.451853] LustreError: Skipped 1 previous similar message
[54811.662136] LustreError: 11-0: lustre-MDT0000-lwp-MDT0001: operation quota_acquire to node 10.9.4.19@tcp failed: rc = -11
[54811.665037] LustreError: Skipped 1 previous similar message
[54871.629282] LustreError: 11-0: lustre-MDT0000-lwp-MDT0001: operation quota_acquire to node 10.9.4.19@tcp failed: rc = -11
[54931.629799] LustreError: 11-0: lustre-MDT0000-lwp-MDT0001: operation quota_acquire to node 10.9.4.19@tcp failed: rc = -11
[54953.483279] LustreError: 11-0: lustre-MDT0000-osp-MDT0001: operation ldlm_enqueue to node 10.9.4.19@tcp failed: rc = -107
[54954.441836] LustreError: 167-0: lustre-MDT0000-osp-MDT0001: This client was evicted by lustre-MDT0000; in progress operations using this service will fail.
[54954.447019] LustreError: Skipped 1 previous similar message
[54955.493750] Lustre: DEBUG MARKER: /usr/sbin/lctl mark replay-single test_0c: @@@@@@ FAIL: post-failover df failed
[54955.689555] Lustre: DEBUG MARKER: replay-single test_0c: @@@@@@ FAIL: post-failover df failed
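For context, "post-failover df failed" is the final check of the replay-barrier test. The sketch below shows the general pattern these replay-single tests follow, assuming the standard test-framework.sh helpers (replay_barrier, facet_failover, zconf_mount, client_up); it is a minimal sketch, not the literal body of test_0c, which may differ by branch.

test_0c() {
        # Suspend commits on the MDS so outstanding requests must be replayed
        replay_barrier $SINGLEMDS
        # Drop the client mount, then fail the MDS over
        umount $MOUNT
        facet_failover $SINGLEMDS
        # Remount the client and verify the filesystem is reachable again
        zconf_mount $(hostname) $MOUNT || error "mount fails"
        client_up || error "post-failover df failed"
}

The reported failure means the client could not complete a df/stat of the filesystem after the MDS restarted, which is consistent with the eviction and quota_acquire/ldlm_enqueue errors in the console log above.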
| Comments |
| Comment by Jian Yu [ 04/Nov/18 ] |
|
+1 on master branch: https://testing.whamcloud.com/test_sets/a167544c-dff0-11e8-a251-52540065bddc |
| Comment by James A Simmons [ 23/Sep/19 ] |
|
Potential fix for |
| Comment by James A Simmons [ 10/Oct/19 ] |
|
This should be fixed by https://review.whamcloud.com/#/c/36274. If not, we can reopen.
| Comment by James Nunez (Inactive) [ 05/Nov/19 ] |
|
It looks like we are still seeing this issue even after patch https://review.whamcloud.com/36274/ landed to master. Please see https://testing.whamcloud.com/test_sets/42920818-fe12-11e9-8e77-52540065bddc for a recent failure. |
| Comment by James A Simmons [ 05/Nov/19 ] |
|
Can you try https://review.whamcloud.com/#/c/35627?
| Comment by James Nunez (Inactive) [ 02/Jun/20 ] |
|
It looks like we are still seeing this issue on the b2_12 branch (2.12.5 RC1) at https://testing.whamcloud.com/test_sets/b1f9a784-03da-4f5d-8586-8f22b0ec1803. In the MDS 2, 4 console, we don't see the "operation quota_acquire to node 10.9.4.19@tcp failed: rc = -11" error, but we do see the ldlm_enqueue error:

[68319.207306] Lustre: DEBUG MARKER: == replay-single test 0c: check replay-barrier ======================================================= 16:40:20 (1590856820)
[68341.027124] Lustre: DEBUG MARKER: /usr/sbin/lctl mark == rpc test complete, duration -o sec ================================================================ 16:40:42 \(1590856842\)
[68341.238283] Lustre: DEBUG MARKER: == rpc test complete, duration -o sec ================================================================ 16:40:42 (1590856842)
[68341.608159] Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-57vm4.trevis.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4
[68341.811858] Lustre: DEBUG MARKER: trevis-57vm4.trevis.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4
[68342.615134] LustreError: 167-0: lustre-MDT0000-lwp-MDT0001: This client was evicted by lustre-MDT0000; in progress operations using this service will fail.
[68342.617641] LustreError: Skipped 1 previous similar message
[68342.618663] Lustre: Evicted from MGS (at 10.9.1.247@tcp) after server handle changed from 0x6113ba905ec7fb33 to 0x6113ba905ec808b5
[68521.699520] LustreError: 11-0: lustre-MDT0000-osp-MDT0003: operation ldlm_enqueue to node 10.9.1.247@tcp failed: rc = -107
[68522.897958] LustreError: 167-0: lustre-MDT0000-osp-MDT0001: This client was evicted by lustre-MDT0000; in progress operations using this service will fail.
[68522.900333] LustreError: Skipped 1 previous similar message
[68527.244961] Lustre: DEBUG MARKER: /usr/sbin/lctl mark replay-single test_0c: @@@@@@ FAIL: post-failover df failed
[68527.479305] Lustre: DEBUG MARKER: replay-single test_0c: @@@@@@ FAIL: post-failover df failed

On the MDS 1, 3 console, we see:

[68493.204660] Lustre: lustre-MDT0000: Denying connection for new client 00c98053-78c2-d29f-2efa-cfe5ab2d84b0 (at 10.9.1.244@tcp), waiting for 5 known clients (0 recovered, 4 in progress, and 0 evicted) to recover in 0:28
[68493.208019] Lustre: Skipped 12 previous similar messages
[68521.668216] Lustre: lustre-MDT0000: recovery is timed out, evict stale exports
[68521.669631] Lustre: lustre-MDT0000: disconnecting 1 stale clients
[68521.670781] Lustre: 824:0:(ldlm_lib.c:1782:extend_recovery_timer()) lustre-MDT0000: extended recovery timer reached hard limit: 180, extend: 1
[68521.673037] Lustre: 824:0:(ldlm_lib.c:2063:target_recovery_overseer()) lustre-MDT0000 recovery is aborted by hard timeout
[68521.674849] Lustre: 824:0:(ldlm_lib.c:2073:target_recovery_overseer()) recovery is aborted, evict exports in recovery
[68521.676769] LustreError: 824:0:(tgt_grant.c:248:tgt_grant_sanity_check()) mdt_obd_disconnect: tot_granted 0 != fo_tot_granted 2097152
[68521.678875] Lustre: 824:0:(ldlm_lib.c:1616:abort_req_replay_queue()) @@@ aborted: req@ffff8a16576ce880 x1668123554855936/t0(8589934655) o36->c051fb07-a55c-c0a1-229e-7286a90a1044@10.9.1.245@tcp:309/0 lens 536/0 e 2 to 0 dl 1590857034 ref 1 fl Complete:/4/ffffffff rc 0/-1
[68521.683316] LustreError: 824:0:(ldlm_lib.c:1637:abort_lock_replay_queue()) @@@ aborted: req@ffff8a164e504050 x1668118635732736/t0(0) o101->lustre-MDT0003-mdtlov_UUID@10.9.1.253@tcp:312/0 lens 328/0 e 6 to 0 dl 1590857037 ref 1 fl Complete:/40/ffffffff rc 0/-1
[68521.687295] LustreError: 11-0: lustre-MDT0000-osp-MDT0002: operation ldlm_enqueue to node 0@lo failed: rc = -107

Is this the same issue as described in this ticket?
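The recovery abort in the MDS 1, 3 console ("recovery is aborted by hard timeout", followed by evictions) is what leaves the client unable to complete the post-failover df. One way to observe this while reproducing is to poll the MDT recovery state on the restarted MDS; mdt.*.recovery_status is a standard Lustre parameter, though the exact fields printed vary by release, so the comments below are indicative only.

# On the restarted MDS, poll recovery progress for the failed-over MDT
lctl get_param mdt.lustre-MDT0000.recovery_status
# Fields typically include status (RECOVERING/COMPLETE), time_remaining,
# completed_clients and evicted_clients; a run ending with aborted recovery
# and evicted clients matches the MDS 1, 3 console output above.
# The "hard limit: 180" in the log corresponds to the recovery_time_hard
# tunable (recovery_time_soft is the soft limit).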