[LU-7531] MDT recovery stalled if resources are failed back immediately Created: 09/Dec/15  Updated: 21/Dec/15  Resolved: 21/Dec/15

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.8.0
Fix Version/s: Lustre 2.8.0

Type: Bug Priority: Blocker
Reporter: Frank Heckes (Inactive) Assignee: Di Wang
Resolution: Fixed Votes: 0
Labels: soak
Environment:

lola
build: 2.7.63-28-g5fda01f, 5fda01f3002e7e742a206ce149652c6b78356828 + patches


Attachments: File console-lola-9.log.bz2     File lola-9-lustre-log-recovery-stalled-2015-12-09-0455.log.bz2     File messages+stack-trace-lola-10.log.bz2     File messages+stack-trace-lola-11.log.bz2     File messages+stack-trace-lola-9.log.bz2     File messages-lola-9.log.bz2     File soak.log.bz2    
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

The error occurred during soak testing of build '20151201.1' (see https://wiki.hpdd.intel.com/pages/viewpage.action?title=Soak+Testing+on+Lola&spaceKey=Releases#SoakTestingonLola-20151201.1). DNE is enabled. The MDSes are set up in an active-active HA failover configuration.

The MDT recovery process stalls on the primary node if the recovery process on the secondary node is interrupted by failing back the resources immediately. This affects all running and new jobs using the remote MDTs.

Sequence of events:

  • 2015-12-09 04:35:10 - Failover MDTs owned by lola-9 --> lola-8
  • 2015-12-09 04:43:52 - MDTs mounted successfully on secondary (lola-8)
  • 2015-12-09 04:44:13 - Stopped recovery process (incomplete at that time, see soak.log) and initiated failback
  • 2015-12-09 04:44:25 - mds_failover (failback) completed

Double-checked that the MDTs are active and mounted:

[root@lola-16 lola]# ssh lola-9 'lctl dl | grep " mdt "' | less -i
  4 UP mdt soaked-MDT0003 soaked-MDT0003_UUID 67
 32 UP mdt soaked-MDT0002 soaked-MDT0002_UUID 63

[root@lola-16 lola]# ssh lola-9 'mount | grep lustre' | less -i
/dev/mapper/360080e50002ffd8200000251520130a4p1 on /mnt/soaked-mdt3 type lustre (rw,user_xattr)
/dev/mapper/360080e50002ff4f00000026d52013098p1 on /mnt/soaked-mdt2 type lustre (rw,user_xattr)

Recovery still ongoing after ~50 minutes:

[root@lola-9 ~]# date
Wed Dec  9 05:30:04 PST 2015
[root@lola-9 ~]# lctl get_param mdt.*.recovery_status
mdt.soaked-MDT0002.recovery_status=
status: RECOVERING
recovery_start: 1449667442
time_remaining: 0
connected_clients: 16/16
req_replay_clients: 5
lock_repay_clients: 5
completed_clients: 11
evicted_clients: 0
replayed_requests: 0
queued_requests: 4
next_transno: 1090929750241
mdt.soaked-MDT0003.recovery_status=
status: RECOVERING
recovery_start: 1449667442
time_remaining: 0
connected_clients: 16/16
req_replay_clients: 5
lock_repay_clients: 5
completed_clients: 11
evicted_clients: 0
replayed_requests: 0
queued_requests: 4
next_transno: 1047980457114

Attached are the messages and console log files of the MDS (lola-8), a debug log file created manually at 04:55, and the soak.log file.



 Comments   
Comment by Joseph Gmitter (Inactive) [ 09/Dec/15 ]

Assigning to Di, who reported he will be looking into it shortly to assess.

Comment by Di Wang [ 10/Dec/15 ]

I just checked the debug log, and it looks like the recovery process is stuck during update recovery, though I cannot figure out why it is stuck from the current log.

Frank, could you please re-run the test and get me the thread stack traces when the endless recovery happens? Unfortunately lola-9 has already been restarted. Thanks.

Comment by Di Wang [ 10/Dec/15 ]

I saw a similar issue in soak testing and got this trace:

tgt_recov     S 0000000000000013     0 13622      2 0x00000080
 ffff8803f5fb57e0 0000000000000046 0000000000000000 ffff8803f5fb57a4
 ffffffff00000000 ffff88043fe84400 000003162493ac87 0000000000000286
 ffff8803f5fb5780 ffffffff8108742c ffff8803f5fad068 ffff8803f5fb5fd8
Call Trace:
 [<ffffffff8108742c>] ? lock_timer_base+0x3c/0x70
 [<ffffffff8152b222>] schedule_timeout+0x192/0x2e0
 [<ffffffff81087540>] ? process_timeout+0x0/0x10
 [<ffffffffa0bad779>] ptlrpc_set_wait+0x319/0xa20 [ptlrpc]
 [<ffffffffa0967344>] ? lprocfs_stats_alloc_one+0x344/0x360 [obdclass]
 [<ffffffffa0ba2c20>] ? ptlrpc_interrupted_set+0x0/0x110 [ptlrpc]
 [<ffffffff81064c00>] ? default_wake_function+0x0/0x20
 [<ffffffffa0bb9b15>] ? lustre_msg_set_jobid+0xf5/0x130 [ptlrpc]
 [<ffffffffa0badf01>] ptlrpc_queue_wait+0x81/0x220 [ptlrpc]
 [<ffffffffa0b85f6e>] ldlm_cli_enqueue+0x37e/0x870 [ptlrpc]
 [<ffffffffa0b8b050>] ? ldlm_completion_ast+0x0/0x9b0 [ptlrpc]
 [<ffffffffa122b390>] ? mdt_remote_blocking_ast+0x0/0x210 [mdt]
 [<ffffffffa140f615>] osp_md_object_lock+0x185/0x240 [osp]
 [<ffffffffa131c557>] lod_object_lock+0x147/0x860 [lod]
 [<ffffffffa0855109>] ? cfs_hash_bd_add_locked+0x29/0x90 [libcfs]
 [<ffffffffa099161f>] ? lu_object_find_try+0x1df/0x260 [obdclass]
 [<ffffffffa139e90b>] mdd_object_lock+0x3b/0xd0 [mdd]
 [<ffffffffa1238bba>] mdt_remote_object_lock+0x14a/0x310 [mdt]
 [<ffffffffa1238ea9>] mdt_object_lock_internal+0x129/0x2d0 [mdt]
 [<ffffffffa1239111>] mdt_object_lock+0x11/0x20 [mdt]
 [<ffffffffa1250474>] mdt_reint_unlink+0x204/0xff0 [mdt]
 [<ffffffffa1247a2d>] mdt_reint_rec+0x5d/0x200 [mdt]
 [<ffffffffa123381b>] mdt_reint_internal+0x62b/0xb80 [mdt]
 [<ffffffffa123420b>] mdt_reint+0x6b/0x120 [mdt]
 [<ffffffffa0c2350c>] tgt_request_handle+0x8ec/0x1470 [ptlrpc]
 [<ffffffffa0c22c20>] ? tgt_request_handle+0x0/0x1470 [ptlrpc]
 [<ffffffffa0b741eb>] handle_recovery_req+0x16b/0x290 [ptlrpc]
 [<ffffffffa0b7b17c>] target_recovery_thread+0x10dc/0x2550 [ptlrpc]
 [<ffffffff81064c12>] ? default_wake_function+0x12/0x20
 [<ffffffffa0b7a0a0>] ? target_recovery_thread+0x0/0x2550 [ptlrpc]
 [<ffffffff8109e78e>] kthread+0x9e/0xc0
 [<ffffffff8100c28a>] child_rip+0xa/0x20
 [<ffffffff8109e6f0>] ? kthread+0x0/0xc0
 [<ffffffff8100c280>] ? child_rip+0x0/0x20

Ah, we should allow the lock enqueue during recovery; I will cook a patch.
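For context, a minimal sketch of that idea (my reading of it, not necessarily the landed patch): the enqueue RPC that osp_md_object_lock() builds for the remote MDT could be marked with rq_allow_replay, so ptlrpc sends it even while the import to that MDT is still recovering, instead of parking the recovery thread in ptlrpc_set_wait() as in the trace above. The local variables (einfo, res_id, policy, flags, lockh) are assumed to be the usual ones in that function; see the Gerrit change below for the actual fix.

	/* Sketch only: pack the DLM enqueue for the remote MDT as usual ... */
	req = ldlm_enqueue_pack(exp, 0);
	if (IS_ERR(req))
		RETURN(PTR_ERR(req));

	/* ... then allow this request through a still-recovering import,
	 * so ptlrpc_import_delay_req() does not hold it back and the
	 * tgt_recov thread is not blocked forever (assumption on where
	 * the flag belongs). */
	req->rq_allow_replay = 1;

	rc = ldlm_cli_enqueue(exp, &req, einfo, &res_id, policy, &flags,
			      NULL, 0, LVB_T_NONE, &lockh, 0);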

Comment by Gerrit Updater [ 10/Dec/15 ]

wangdi (di.wang@intel.com) uploaded a new patch: http://review.whamcloud.com/17539
Subject: LU-7531 osp: allow enqueue during recovery
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 240fc3cf7188c4291085862225cf469ac3d29850

Comment by Frank Heckes (Inactive) [ 10/Dec/15 ]

The error happens again, even with a recovery time of 1 h. Six MDTs have now been stalled for over 1 h:

----------------
lola-9
----------------
mdt.soaked-MDT0002.recovery_status=
status: RECOVERING
recovery_start: 1449724636
time_remaining: 0
connected_clients: 9/16
req_replay_clients: 1
lock_repay_clients: 1
completed_clients: 8
evicted_clients: 7
replayed_requests: 10
queued_requests: 0
next_transno: 1090929751042
mdt.soaked-MDT0003.recovery_status=
status: RECOVERING
recovery_start: 1449724648
time_remaining: 0
connected_clients: 9/16
req_replay_clients: 1
lock_repay_clients: 1
completed_clients: 8
evicted_clients: 7
replayed_requests: 0
queued_requests: 0
next_transno: 1047980457124
----------------
lola-10
----------------
mdt.soaked-MDT0004.recovery_status=
status: RECOVERING
recovery_start: 1449735425
time_remaining: 0
connected_clients: 14/14
req_replay_clients: 0
lock_repay_clients: 0
completed_clients: 14
evicted_clients: 0
replayed_requests: 0
queued_requests: 0
next_transno: 1129576470250
mdt.soaked-MDT0005.recovery_status=
status: RECOVERING
recovery_start: 1449735422
time_remaining: 0
connected_clients: 14/14
req_replay_clients: 0
lock_repay_clients: 0
completed_clients: 14
evicted_clients: 0
replayed_requests: 0
queued_requests: 0
next_transno: 1138166481892
----------------
lola-11
----------------
mdt.soaked-MDT0006.recovery_status=
status: RECOVERING
recovery_start: 1449729112
time_remaining: 0
connected_clients: 14/14
req_replay_clients: 0
lock_repay_clients: 0
completed_clients: 14
evicted_clients: 0
replayed_requests: 0
queued_requests: 0
next_transno: 1116691658301
mdt.soaked-MDT0007.recovery_status=
status: RECOVERING
recovery_start: 1449729069
time_remaining: 0
connected_clients: 14/14
req_replay_clients: 0
lock_repay_clients: 0
completed_clients: 14
evicted_clients: 0
replayed_requests: 0
queued_requests: 0
next_transno: 1125281595927

I attached the messages files containing the stack traces of the affected nodes, as requested above.

Comment by Gerrit Updater [ 21/Dec/15 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/17539/
Subject: LU-7531 osp: allow few requests during recovery
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 2fe2d1e82005746180309d9b79057a418a729e54

Comment by Joseph Gmitter (Inactive) [ 21/Dec/15 ]

Landed for 2.8
