[LU-7675] replay-single test_101 times out after aborting recovery on mount of the mds1
Created: 15/Jan/16  Updated: 14/Dec/21  Resolved: 14/Dec/21

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.8.0
Fix Version/s: None

Type: Bug
Priority: Minor
Reporter: James Nunez (Inactive)
Assignee: WC Triage
Resolution: Cannot Reproduce
Votes: 0
Labels: None
Environment:

autotest review-dne-part-2


Issue Links:
Related
is related to LU-8753 Recovery already passed deadline with... Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

replay-single test_101 times out while mounting mds1 with the abort_recovery option. The last lines in the test_log are:

01:57:12 (1452765432) waiting for onyx-34vm7 network 900 secs ...
01:57:12 (1452765432) network interface is UP
CMD: onyx-34vm7 hostname
CMD: onyx-34vm7 test -b /dev/lvm-Role_MDS/P1
Starting mds1:  -o abort_recovery /dev/lvm-Role_MDS/P1 /mnt/mds1
CMD: onyx-34vm7 mkdir -p /mnt/mds1; mount -t lustre  -o abort_recovery 		                   /dev/lvm-Role_MDS/P1 /mnt/mds1

From the MDS1 console, we see:

01:57:22:LDISKFS-fs (dm-0): mounted filesystem with ordered data mode. quota=on. Opts: 
01:57:22:LustreError: 14301:0:(mdt_handler.c:5605:mdt_iocontrol()) lustre-MDT0000: Aborting recovery for device
01:57:44:LustreError: 14301:0:(ldlm_lib.c:2479:target_stop_recovery_thread()) lustre-MDT0000: Aborting recovery
01:57:44:Lustre: 14377:0:(ldlm_lib.c:1945:target_recovery_overseer()) recovery is aborted, evict exports in recovery
01:57:44:Lustre: 14377:0:(ldlm_lib.c:1945:target_recovery_overseer()) Skipped 2 previous similar messages
01:57:44:Lustre: lustre-MDT0000: disconnecting 5 stale clients
01:57:44:LustreError: 14377:0:(update_records.c:72:update_records_dump()) master transno = 382252089401 batchid = 373662154835 flags = 0 ops = 19 params = 9
01:57:44:LustreError: 14377:0:(update_records.c:72:update_records_dump()) master transno = 382252089401 batchid = 373662154836 flags = 0 ops = 28 params = 24
01:57:44:LustreError: 14377:0:(update_records.c:72:update_records_dump()) master transno = 382252089401 batchid = 377957122268 flags = 0 ops = 19 params = 9
01:57:44:
Press any key to continue.
01:57:44:
Press any key to continue.
01:57:44:
Press any key to continue.
01:57:44:
Press any key to continue.
01:57:44:
Press any key to continue.
01:57:44:
01:57:44:    GNU GRUB  version 0.97  (631K lower / 2096116K upper memory)
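
When triaging the console output above, it can help to pull the numeric fields out of the update_records_dump() lines so they can be compared across failures. The following is only a rough sketch, not part of the Lustre test framework: the function name parse_update_records and the regular expression are hypothetical, and the line format is assumed from the three lines quoted in this ticket.

```python
import re

# The three update_records_dump() lines copied from the MDS1 console above.
CONSOLE = """\
01:57:44:LustreError: 14377:0:(update_records.c:72:update_records_dump()) master transno = 382252089401 batchid = 373662154835 flags = 0 ops = 19 params = 9
01:57:44:LustreError: 14377:0:(update_records.c:72:update_records_dump()) master transno = 382252089401 batchid = 373662154836 flags = 0 ops = 28 params = 24
01:57:44:LustreError: 14377:0:(update_records.c:72:update_records_dump()) master transno = 382252089401 batchid = 377957122268 flags = 0 ops = 19 params = 9
"""

# Assumed line format, inferred only from the lines quoted in this ticket.
PATTERN = re.compile(
    r"update_records_dump\(\)\) master transno = (?P<transno>\d+) "
    r"batchid = (?P<batchid>\d+) flags = (?P<flags>\d+) "
    r"ops = (?P<ops>\d+) params = (?P<params>\d+)"
)


def parse_update_records(text):
    """Return one dict of integer fields per update_records_dump() line."""
    records = []
    for line in text.splitlines():
        match = PATTERN.search(line)
        if match:
            records.append({key: int(value) for key, value in match.groupdict().items()})
    return records


records = parse_update_records(CONSOLE)
for record in records:
    print(record)
```

For the lines in this ticket, all three dumped records carry the same master transno and differ only in batchid, ops and params.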

We’ve seen this error four times in the past two months during review-dne-part-2 testing. Logs are at:
2015-11-27 03:10:27 - https://testing.hpdd.intel.com/test_sets/874faa9a-9503-11e5-bdeb-5254006e85c2
2015-12-12 02:31:59 - https://testing.hpdd.intel.com/test_sets/77362cfc-a0e2-11e5-9d88-5254006e85c2
2016-01-02 08:22:17 - https://testing.hpdd.intel.com/test_sets/102b7ef4-b177-11e5-bf32-5254006e85c2
2016-01-14 08:30:36 - https://testing.hpdd.intel.com/test_sets/4723f9d4-bae8-11e5-87b4-5254006e85c2



 Comments   
Comment by Di Wang [ 20/Jan/16 ]

This will probably be fixed by http://review.whamcloud.com/#/c/17885/ (LU-7638).

Comment by James Nunez (Inactive) [ 20/Jan/16 ]

More failures on master:
2016-01-19 20:37:06 - https://testing.hpdd.intel.com/test_sets/83512d64-bf39-11e5-90a1-5254006e85c2
2016-01-26 19:47:25 - https://testing.hpdd.intel.com/test_sets/199bf244-c4ae-11e5-9fd1-5254006e85c2
