[LU-11762] replay-single test 0d fails with 'post-failover df failed' - Whamcloud Community JIRA

Details

Type: Bug
Resolution: Fixed
Priority: Minor
Fix Version/s: Lustre 2.14.0
Affects Version/s: Lustre 2.12.0, Lustre 2.12.1, Lustre 2.12.2
Labels: None
Severity: 3
Rank (Obsolete): 9223372036854775807

Description

replay-single test_0d fails with 'post-failover df failed' because all clients are evicted and do not recover. The logs from a recent failure ( https://testing.whamcloud.com/test_sets/d34a9c44-fd82-11e8-b970-52540065bddc ) show, in the client test_log, a problem mounting the file system on the second client (vm4)

Started lustre-MDT0000
Starting client: trevis-26vm3:  -o user_xattr,flock trevis-26vm6@tcp:/lustre /mnt/lustre
CMD: trevis-26vm3 mkdir -p /mnt/lustre
CMD: trevis-26vm3 mount -t lustre -o user_xattr,flock trevis-26vm6@tcp:/lustre /mnt/lustre
trevis-26vm4: error: invalid path '/mnt/lustre': Input/output error
 replay-single test_0d: @@@@@@ FAIL: post-failover df failed

Looking at the dmesg log from client 2 (vm4), we see the following errors

[44229.221245] LustreError: 166-1: MGC10.9.5.67@tcp: Connection to MGS (at 10.9.5.67@tcp) was lost; in progress operations using this service will fail
[44254.268743] Lustre: Evicted from MGS (at 10.9.5.67@tcp) after server handle changed from 0x306f28dc59d36b9 to 0x306f28dc59d3cc4
[44425.483787] LustreError: 11-0: lustre-MDT0000-mdc-ffff88007a5ac800: operation mds_reint to node 10.9.5.67@tcp failed: rc = -107
[44429.540695] LustreError: 167-0: lustre-MDT0000-mdc-ffff88007a5ac800: This client was evicted by lustre-MDT0000; in progress operations using this service will fail.
[44429.542381] LustreError: 29222:0:(file.c:4393:ll_inode_revalidate_fini()) lustre: revalidate FID [0x200000007:0x1:0x0] error: rc = -5
[44429.542384] LustreError: 29222:0:(file.c:4393:ll_inode_revalidate_fini()) Skipped 15 previous similar messages
[44429.547526] Lustre: lustre-MDT0000-mdc-ffff88007a5ac800: Connection restored to 10.9.5.67@tcp (at 10.9.5.67@tcp)
[44429.547533] Lustre: Skipped 1 previous similar message
[44429.613758] Lustre: DEBUG MARKER: /usr/sbin/lctl mark  replay-single test_0d: @@@@@@ FAIL: post-failover df failed
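
When comparing several failed runs, it can help to pull the eviction events out of a client dmesg log mechanically. The following is a minimal sketch, assuming the `167-0` message format shown in the excerpt above. The function name `find_evictions` and the regular expression are illustrative only and are not part of the Lustre test framework.

```python
import re

# Assumed format, taken from the client dmesg excerpt above:
# [<ts>] LustreError: 167-0: <device>: This client was evicted by <target>; ...
EVICTED_RE = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s+LustreError: 167-0: (?P<device>\S+): "
    r"This client was evicted by (?P<target>[^;\s]+)"
)


def find_evictions(dmesg_text):
    """Return (timestamp, device, target) for each eviction message found."""
    return [
        (float(m.group("ts")), m.group("device"), m.group("target"))
        for m in EVICTED_RE.finditer(dmesg_text)
    ]


if __name__ == "__main__":
    sample = (
        "[44429.540695] LustreError: 167-0: lustre-MDT0000-mdc-ffff88007a5ac800: "
        "This client was evicted by lustre-MDT0000; in progress operations "
        "using this service will fail."
    )
    for ts, device, target in find_evictions(sample):
        print(ts, device, target)
```

Lines that do not match the assumed format are simply skipped, so the helper can be run over a whole dmesg dump.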

In the dmesg log for the MDS (vm6), we see

[44131.617072] Lustre: lustre-MDT0000: Will be in recovery for at least 1:00, or until 2 clients reconnect
[44135.460894] Lustre: 2440:0:(ldlm_lib.c:1771:extend_recovery_timer()) lustre-MDT0000: extended recovery timer reaching hard limit: 180, extend: 0
[44196.726935] Lustre: lustre-MDT0000: Denying connection for new client f33a3fe0-b38c-7f20-7b19-3c32e6a1bff3(at 10.9.5.64@tcp), waiting for 2 known clients (0 recovered, 1 in progress, and 0 evicted) already passed deadline 3:05
[44196.728849] Lustre: Skipped 21 previous similar messages
[44311.673038] Lustre: lustre-MDT0000: recovery is timed out, evict stale exports
[44311.673797] Lustre: lustre-MDT0000: disconnecting 1 stale clients
[44311.674391] Lustre: Skipped 1 previous similar message
[44311.675031] Lustre: 2500:0:(ldlm_lib.c:1771:extend_recovery_timer()) lustre-MDT0000: extended recovery timer reaching hard limit: 180, extend: 1
[44311.676331] Lustre: 2500:0:(ldlm_lib.c:2048:target_recovery_overseer()) lustre-MDT0000 recovery is aborted by hard timeout
[44311.677355] Lustre: 2500:0:(ldlm_lib.c:2048:target_recovery_overseer()) Skipped 2 previous similar messages
[44311.678318] Lustre: 2500:0:(ldlm_lib.c:2058:target_recovery_overseer()) recovery is aborted, evict exports in recovery
[44311.679301] Lustre: 2500:0:(ldlm_lib.c:2058:target_recovery_overseer()) Skipped 2 previous similar messages
[44311.680369] LustreError: 2500:0:(tgt_grant.c:248:tgt_grant_sanity_check()) mdt_obd_disconnect: tot_granted 0 != fo_tot_granted 2097152
[44311.681531] Lustre: 2500:0:(ldlm_lib.c:1617:abort_req_replay_queue()) @@@ aborted:  req@ffff922b644d6400 x1619523210909360/t0(12884901890) o36->94cd1843-54cb-a4d4-a0d3-b3519f2b7d2a@10.9.5.65@tcp:356/0 lens 512/0 e 3 to 0 dl 1544506121 ref 1 fl Complete:/4/ffffffff rc 0/-1
[44311.739670] Lustre: lustre-MDT0000: Recovery over after 3:00, of 2 clients 0 recovered and 2 were evicted.
[44311.930592] Lustre: lustre-MDT0000: Connection restored to e9848982-35c6-9607-086a-2eb07fd9bf44 (at 10.9.5.64@tcp)
[44311.931571] Lustre: Skipped 46 previous similar messages
[44315.952804] Lustre: DEBUG MARKER: /usr/sbin/lctl mark  replay-single test_0d: @@@@@@ FAIL: post-failover df failed
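
The "Recovery over after ..." line in the MDS log summarizes the outcome of recovery, and in this failure it reports 0 of 2 clients recovered. Below is a minimal sketch for extracting those counts from a dmesg line, assuming the message format shown in the excerpt above. The helper name `parse_recovery_summary` is hypothetical and not part of any Lustre tool.

```python
import re

# Assumed format, taken from the MDS dmesg excerpt above:
# [<ts>] Lustre: <target>: Recovery over after M:SS, of N clients R recovered and E were evicted.
RECOVERY_RE = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s+Lustre:\s+(?P<target>\S+):\s+"
    r"Recovery over after (?P<min>\d+):(?P<sec>\d+), of (?P<total>\d+) clients "
    r"(?P<recovered>\d+) recovered and (?P<evicted>\d+) (?:was|were) evicted"
)


def parse_recovery_summary(line):
    """Return a dict of recovery counts, or None if the line does not match."""
    m = RECOVERY_RE.search(line)
    if m is None:
        return None
    return {
        "target": m.group("target"),
        "duration_s": int(m.group("min")) * 60 + int(m.group("sec")),
        "total": int(m.group("total")),
        "recovered": int(m.group("recovered")),
        "evicted": int(m.group("evicted")),
    }


if __name__ == "__main__":
    line = (
        "[44311.739670] Lustre: lustre-MDT0000: Recovery over after 3:00, "
        "of 2 clients 0 recovered and 2 were evicted."
    )
    summary = parse_recovery_summary(line)
    # A run where every known client was evicted matches the failure described here.
    print(summary, summary["evicted"] == summary["total"])
```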

replay-single test 0c also fails with similar messages in the logs: https://testing.whamcloud.com/test_sets/d20239e0-fd79-11e8-a97c-52540065bddc

More logs for these failures are at
https://testing.whamcloud.com/test_sets/ea4338ea-fd67-11e8-8a18-52540065bddc
https://testing.whamcloud.com/test_sets/9efcb22c-f712-11e8-815b-52540065bddc

Attachments

Issue Links

is related to

LU-12769 replay-dual test 0b hangs in client mount

Resolved

LU-13614 replay-single test_117: LBUG: ASSERTION( atomic_read(&obd->obd_req_replay_clients) == 0 ) failed

Resolved

LU-11771 bad output in target_handle_reconnect: Recovery already passed deadline 71578:57

Resolved

LU-13339 patch for LU-11762 causes an assertion in replay-dual

Resolved

LU-9019 Migrate lustre to standard 64 bit time kernel API

Resolved

is related to

LU-10950 replay-single test_0c: post-failover df failed

Reopened

LU-12340 replay-dual test 0b timeouts

Resolved
