Details
- Type: Bug
- Resolution: Fixed
- Priority: Minor
- Affects Versions: Lustre 2.12.0, Lustre 2.12.1, Lustre 2.12.2
- Severity: 3
Description
replay-single test_0d fails with 'post-failover df failed' because all clients are evicted and do not recover. Looking at the logs from a recent failure, https://testing.whamcloud.com/test_sets/d34a9c44-fd82-11e8-b970-52540065bddc , in the client test_log we see there is a problem mounting the file system on the second client (vm4):
Started lustre-MDT0000
Starting client: trevis-26vm3: -o user_xattr,flock trevis-26vm6@tcp:/lustre /mnt/lustre
CMD: trevis-26vm3 mkdir -p /mnt/lustre
CMD: trevis-26vm3 mount -t lustre -o user_xattr,flock trevis-26vm6@tcp:/lustre /mnt/lustre
trevis-26vm4: error: invalid path '/mnt/lustre': Input/output error
 replay-single test_0d: @@@@@@ FAIL: post-failover df failed
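For reference, a minimal sketch of how the failing subtest can be rerun on its own with the test framework; the install path below is an assumption (the standard lustre-tests location), and the default config is used rather than the Autotest setup from the logs:

  # Rerun only replay-single test_0d; ONLY= is the standard test-framework
  # variable for selecting a single subtest. /usr/lib64/lustre/tests is an
  # assumed install path, not confirmed in the logs above.
  cd /usr/lib64/lustre/tests
  ONLY=0d ./replay-single.sh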
Looking at the dmesg log from client 2 (vm4), we see the following errors:
[44229.221245] LustreError: 166-1: MGC10.9.5.67@tcp: Connection to MGS (at 10.9.5.67@tcp) was lost; in progress operations using this service will fail
[44254.268743] Lustre: Evicted from MGS (at 10.9.5.67@tcp) after server handle changed from 0x306f28dc59d36b9 to 0x306f28dc59d3cc4
[44425.483787] LustreError: 11-0: lustre-MDT0000-mdc-ffff88007a5ac800: operation mds_reint to node 10.9.5.67@tcp failed: rc = -107
[44429.540695] LustreError: 167-0: lustre-MDT0000-mdc-ffff88007a5ac800: This client was evicted by lustre-MDT0000; in progress operations using this service will fail.
[44429.542381] LustreError: 29222:0:(file.c:4393:ll_inode_revalidate_fini()) lustre: revalidate FID [0x200000007:0x1:0x0] error: rc = -5
[44429.542384] LustreError: 29222:0:(file.c:4393:ll_inode_revalidate_fini()) Skipped 15 previous similar messages
[44429.547526] Lustre: lustre-MDT0000-mdc-ffff88007a5ac800: Connection restored to 10.9.5.67@tcp (at 10.9.5.67@tcp)
[44429.547533] Lustre: Skipped 1 previous similar message
[44429.613758] Lustre: DEBUG MARKER: /usr/sbin/lctl mark replay-single test_0d: @@@@@@ FAIL: post-failover df failed
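To confirm on the client whether the MDC import actually reconnects after the eviction above, the import can be dumped with lctl; a minimal sketch, assuming the client is still mounted and using the device name from the log (lustre-MDT0000-mdc-*):

  # Full MDC import, including current connection state and eviction count.
  lctl get_param mdc.lustre-MDT0000-mdc-*.import
  # Recent state history of the same import (DISCONN, CONNECTING, FULL, ...).
  lctl get_param mdc.lustre-MDT0000-mdc-*.state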
In the dmesg log for the MDS (vm6), we see:
[44131.617072] Lustre: lustre-MDT0000: Will be in recovery for at least 1:00, or until 2 clients reconnect
[44135.460894] Lustre: 2440:0:(ldlm_lib.c:1771:extend_recovery_timer()) lustre-MDT0000: extended recovery timer reaching hard limit: 180, extend: 0
[44196.726935] Lustre: lustre-MDT0000: Denying connection for new client f33a3fe0-b38c-7f20-7b19-3c32e6a1bff3(at 10.9.5.64@tcp), waiting for 2 known clients (0 recovered, 1 in progress, and 0 evicted) already passed deadline 3:05
[44196.728849] Lustre: Skipped 21 previous similar messages
[44311.673038] Lustre: lustre-MDT0000: recovery is timed out, evict stale exports
[44311.673797] Lustre: lustre-MDT0000: disconnecting 1 stale clients
[44311.674391] Lustre: Skipped 1 previous similar message
[44311.675031] Lustre: 2500:0:(ldlm_lib.c:1771:extend_recovery_timer()) lustre-MDT0000: extended recovery timer reaching hard limit: 180, extend: 1
[44311.676331] Lustre: 2500:0:(ldlm_lib.c:2048:target_recovery_overseer()) lustre-MDT0000 recovery is aborted by hard timeout
[44311.677355] Lustre: 2500:0:(ldlm_lib.c:2048:target_recovery_overseer()) Skipped 2 previous similar messages
[44311.678318] Lustre: 2500:0:(ldlm_lib.c:2058:target_recovery_overseer()) recovery is aborted, evict exports in recovery
[44311.679301] Lustre: 2500:0:(ldlm_lib.c:2058:target_recovery_overseer()) Skipped 2 previous similar messages
[44311.680369] LustreError: 2500:0:(tgt_grant.c:248:tgt_grant_sanity_check()) mdt_obd_disconnect: tot_granted 0 != fo_tot_granted 2097152
[44311.681531] Lustre: 2500:0:(ldlm_lib.c:1617:abort_req_replay_queue()) @@@ aborted: req@ffff922b644d6400 x1619523210909360/t0(12884901890) o36->94cd1843-54cb-a4d4-a0d3-b3519f2b7d2a@10.9.5.65@tcp:356/0 lens 512/0 e 3 to 0 dl 1544506121 ref 1 fl Complete:/4/ffffffff rc 0/-1
[44311.739670] Lustre: lustre-MDT0000: Recovery over after 3:00, of 2 clients 0 recovered and 2 were evicted.
[44311.930592] Lustre: lustre-MDT0000: Connection restored to e9848982-35c6-9607-086a-2eb07fd9bf44 (at 10.9.5.64@tcp)
[44311.931571] Lustre: Skipped 46 previous similar messages
[44315.952804] Lustre: DEBUG MARKER: /usr/sbin/lctl mark replay-single test_0d: @@@@@@ FAIL: post-failover df failed
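The same recovery information can be read back on the MDS itself; a minimal sketch, using the target name from the log (recovery_status is a standard per-target parameter; the recovery_time_* names are the tunables I would expect on a 2.12 server and should be checked against the local tree):

  # Per-target recovery status: connected vs. evicted clients, recovery result.
  lctl get_param mdt.lustre-MDT0000.recovery_status
  # Soft/hard recovery limits; the hard limit of 180s is what was hit above.
  # (Parameter names assumed, not confirmed in the logs.)
  lctl get_param mdt.lustre-MDT0000.recovery_time_soft
  lctl get_param mdt.lustre-MDT0000.recovery_time_hard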
We see replay-single test_0c also fail with similar messages in its logs: https://testing.whamcloud.com/test_sets/d20239e0-fd79-11e8-a97c-52540065bddc .
More logs for these failures are at
https://testing.whamcloud.com/test_sets/ea4338ea-fd67-11e8-8a18-52540065bddc
https://testing.whamcloud.com/test_sets/9efcb22c-f712-11e8-815b-52540065bddc
Attachments
Issue Links
- is related to
  - LU-12769 replay-dual test 0b hangs in client mount (Resolved)
  - LU-13614 replay-single test_117: LBUG: ASSERTION( atomic_read(&obd->obd_req_replay_clients) == 0 ) failed (Resolved)
  - LU-11771 bad output in target_handle_reconnect: Recovery already passed deadline 71578:57 (Resolved)
  - LU-13339 patch for LU-11762 causes an assertion in replay-dual (Resolved)
  - LU-9019 Migrate lustre to standard 64 bit time kernel API (Resolved)
- is related to
  - LU-10950 replay-single test_0c: post-failover df failed (Reopened)
  - LU-12340 replay-dual test 0b timeouts (Resolved)