Details
-
Bug
-
Resolution: Fixed
-
Blocker
-
Lustre 2.8.0
-
lola
build: 2.7.63-28-g5fda01f, 5fda01f3002e7e742a206ce149652c6b78356828 + patches
-
3
Description
The error occurred during soak testing of build '20151201.1' (see https://wiki.hpdd.intel.com/pages/viewpage.action?title=Soak+Testing+on+Lola&spaceKey=Releases#SoakTestingonLola-20151201.1). DNE is enabled. The MDSes are set up in an active-active HA failover configuration.
The MDT recovery process stalls on the primary node if the recovery process on the secondary node is interrupted by failing back the resources immediately. This affects all running and new jobs that use the remote MDTs.
Sequence of events:
- 2015-12-09 04:35:10 - Failover MDTs owned by lola-9 --> lola-8
- 2015-12-09 04:43:52 - MDTs mounted successfully on secondary (lola-8)
- 2015-12-09 04:44:13 - Stopped recovery process (incomplete at that time, see soak.log) and initiated failback
- 2015-12-09 04:44:25 - mds_failover (failback) completed
Double-checked that the MDTs are active and mounted on lola-9:
[root@lola-16 lola]# ssh lola-9 'lctl dl | grep " mdt "' | less -i
  4 UP mdt soaked-MDT0003 soaked-MDT0003_UUID 67
 32 UP mdt soaked-MDT0002 soaked-MDT0002_UUID 63
[root@lola-16 lola]# ssh lola-9 'mount | grep lustre' | less -i
/dev/mapper/360080e50002ffd8200000251520130a4p1 on /mnt/soaked-mdt3 type lustre (rw,user_xattr)
/dev/mapper/360080e50002ff4f00000026d52013098p1 on /mnt/soaked-mdt2 type lustre (rw,user_xattr)
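The device check above could also be scripted. The following is a minimal Python sketch, not part of the soak test tooling, that reads `lctl dl` output and reports which of the expected MDTs are listed as UP. The function name `mdts_up` is hypothetical, and the column layout (index, state, type, name, UUID, refcount) is assumed from the output quoted in this ticket.

```python
def mdts_up(lctl_dl_output, expected):
    """Return the subset of `expected` MDT names listed as UP mdt devices.

    Assumes each device line has the form
        <index> <state> <type> <name> <uuid> <refcount>
    as in the `lctl dl` output quoted in this ticket.
    """
    up = set()
    for line in lctl_dl_output.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        state, dev_type, name = fields[1], fields[2], fields[3]
        if state == "UP" and dev_type == "mdt":
            up.add(name)
    return up & set(expected)


# Device lines copied from the ticket.
sample_dl = """\
  4 UP mdt soaked-MDT0003 soaked-MDT0003_UUID 67
 32 UP mdt soaked-MDT0002 soaked-MDT0002_UUID 63
"""
print(sorted(mdts_up(sample_dl, ["soaked-MDT0002", "soaked-MDT0003"])))
```

Note that an UP device only shows the target is mounted and set up; it says nothing about whether recovery has finished.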
Recovery is still ongoing after ~50 minutes:
[root@lola-9 ~]# date
Wed Dec 9 05:30:04 PST 2015
[root@lola-9 ~]# lctl get_param mdt.*.recovery_status
mdt.soaked-MDT0002.recovery_status=
status: RECOVERING
recovery_start: 1449667442
time_remaining: 0
connected_clients: 16/16
req_replay_clients: 5
lock_repay_clients: 5
completed_clients: 11
evicted_clients: 0
replayed_requests: 0
queued_requests: 4
next_transno: 1090929750241
mdt.soaked-MDT0003.recovery_status=
status: RECOVERING
recovery_start: 1449667442
time_remaining: 0
connected_clients: 16/16
req_replay_clients: 5
lock_repay_clients: 5
completed_clients: 11
evicted_clients: 0
replayed_requests: 0
queued_requests: 4
next_transno: 1047980457114
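To spot this condition without reading the output by eye, the status text can be parsed. The following is a minimal Python sketch under stated assumptions: the helper names `parse_recovery_status` and `looks_stalled` are hypothetical, and the stall heuristic (status RECOVERING while time_remaining is 0) is an illustration derived from the values reported here, not a condition defined by Lustre.

```python
import re


def parse_recovery_status(text):
    """Parse `lctl get_param mdt.*.recovery_status` output into
    {target_name: {field: value}}.

    Tokenises on whitespace, so it accepts both one-field-per-line
    output and output flattened onto a single line.
    """
    targets = {}
    current = None
    tokens = text.split()
    i = 0
    while i < len(tokens):
        tok = tokens[i]
        header = re.fullmatch(r"mdt\.(.+)\.recovery_status=", tok)
        if header:
            current = targets.setdefault(header.group(1), {})
            i += 1
        elif tok.endswith(":") and current is not None and i + 1 < len(tokens):
            current[tok[:-1]] = tokens[i + 1]  # values are kept as strings
            i += 2
        else:
            i += 1
    return targets


def looks_stalled(fields):
    """Heuristic (an assumption, not a Lustre-defined state): the target is
    still RECOVERING although its recovery timer has run out."""
    return (fields.get("status") == "RECOVERING"
            and fields.get("time_remaining") == "0")


# Values copied from the ticket (abbreviated to the fields used here).
sample_status = """\
mdt.soaked-MDT0002.recovery_status=
status: RECOVERING
time_remaining: 0
connected_clients: 16/16
completed_clients: 11
mdt.soaked-MDT0003.recovery_status=
status: RECOVERING
time_remaining: 0
connected_clients: 16/16
completed_clients: 11
"""
for name, fields in sorted(parse_recovery_status(sample_status).items()):
    print(name, "stalled" if looks_stalled(fields) else "ok")
```

In both targets above, 11 of 16 clients have completed while 5 remain in request and lock replay, which is consistent with recovery waiting on clients that never finish.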
Attached: messages file, console log file of the MDT (lola-8), debug log file created manually at 04:55, and the soak.log file.