[LU-7531] MDT recovery stalled if resources are failed back immediately Created: 09/Dec/15 Updated: 21/Dec/15 Resolved: 21/Dec/15 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.8.0 |
| Fix Version/s: | Lustre 2.8.0 |
| Type: | Bug | Priority: | Blocker |
| Reporter: | Frank Heckes (Inactive) | Assignee: | Di Wang |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | soak | ||
| Environment: | lola |
| Attachments: | |
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
The error occurred during soak testing of build '20151201.1' (see https://wiki.hpdd.intel.com/pages/viewpage.action?title=Soak+Testing+on+Lola&spaceKey=Releases#SoakTestingonLola-20151201.1). DNE is enabled. The MDSes are set up in an active-active HA failover configuration. MDT recovery stalls on the primary node when the recovery process on the secondary node is interrupted by failing the resources back immediately. Sequence of events:
Double-checked that the MDTs are active and mounted:

```
[root@lola-16 lola]# ssh lola-9 'lctl dl | grep " mdt "' | less -i
  4 UP mdt soaked-MDT0003 soaked-MDT0003_UUID 67
 32 UP mdt soaked-MDT0002 soaked-MDT0002_UUID 63
[root@lola-16 lola]# ssh lola-9 'mount | grep lustre' | less -i
/dev/mapper/360080e50002ffd8200000251520130a4p1 on /mnt/soaked-mdt3 type lustre (rw,user_xattr)
/dev/mapper/360080e50002ff4f00000026d52013098p1 on /mnt/soaked-mdt2 type lustre (rw,user_xattr)
```

Recovery still ongoing after ~50 minutes:

```
[root@lola-9 ~]# date
Wed Dec 9 05:30:04 PST 2015
[root@lola-9 ~]# lctl get_param mdt.*.recovery_status
mdt.soaked-MDT0002.recovery_status=
status: RECOVERING
recovery_start: 1449667442
time_remaining: 0
connected_clients: 16/16
req_replay_clients: 5
lock_repay_clients: 5
completed_clients: 11
evicted_clients: 0
replayed_requests: 0
queued_requests: 4
next_transno: 1090929750241
mdt.soaked-MDT0003.recovery_status=
status: RECOVERING
recovery_start: 1449667442
time_remaining: 0
connected_clients: 16/16
req_replay_clients: 5
lock_repay_clients: 5
completed_clients: 11
evicted_clients: 0
replayed_requests: 0
queued_requests: 4
next_transno: 1047980457114
```

Attached: the messages and console log files of the MDT (lola-8), a debug log file created manually at 04:55, and the soak.log file.
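For reference, a stall like this is easy to catch early by watching the same parameter in a loop. This is only a sketch, not part of the original report; the node name and the 60-second interval are arbitrary choices:

```sh
#!/bin/sh
# Poll recovery status on one MDS until all of its MDTs leave RECOVERING.
node=lola-9   # hypothetical choice; use the MDS being failed over
while ssh "$node" 'lctl get_param -n mdt.*.recovery_status' | grep -q RECOVERING; do
    date
    ssh "$node" 'lctl get_param mdt.*.recovery_status' \
        | grep -E 'recovery_status|status:|queued_requests|next_transno'
    sleep 60
done
echo "recovery complete on $node"
```
|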
| Comments |
| Comment by Joseph Gmitter (Inactive) [ 09/Dec/15 ] |
|
Assigning to Di, who reported that he will be looking into it shortly to assess it. |
| Comment by Di Wang [ 10/Dec/15 ] |
|
I just checked the debug log, and it looks like the recovery process is stuck during update recovery, though I cannot tell from the current log why it is stuck. Frank, could you please re-run the test and get me the thread stack traces when the endless recovery happens? Unfortunately lola-9 has already been restarted. Thanks.
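One way to collect such traces (a sketch, not the procedure used on lola; it assumes sysrq is available so the task dump lands in the console/messages log, and it uses lctl dk to flush the Lustre debug buffer for the same window):

```sh
# Dump all kernel task stacks via sysrq, then save the Lustre debug log.
ssh lola-9 'echo t > /proc/sysrq-trigger'
ssh lola-9 'lctl dk /tmp/lustre-debug-$(date +%s).log'
```
|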
| Comment by Di Wang [ 10/Dec/15 ] |
|
I saw a similar issue in soak testing, and got this trace:

```
tgt_recov     S 0000000000000013     0 13622      2 0x00000080
 ffff8803f5fb57e0 0000000000000046 0000000000000000 ffff8803f5fb57a4
 ffffffff00000000 ffff88043fe84400 000003162493ac87 0000000000000286
 ffff8803f5fb5780 ffffffff8108742c ffff8803f5fad068 ffff8803f5fb5fd8
Call Trace:
 [<ffffffff8108742c>] ? lock_timer_base+0x3c/0x70
 [<ffffffff8152b222>] schedule_timeout+0x192/0x2e0
 [<ffffffff81087540>] ? process_timeout+0x0/0x10
 [<ffffffffa0bad779>] ptlrpc_set_wait+0x319/0xa20 [ptlrpc]
 [<ffffffffa0967344>] ? lprocfs_stats_alloc_one+0x344/0x360 [obdclass]
 [<ffffffffa0ba2c20>] ? ptlrpc_interrupted_set+0x0/0x110 [ptlrpc]
 [<ffffffff81064c00>] ? default_wake_function+0x0/0x20
 [<ffffffffa0bb9b15>] ? lustre_msg_set_jobid+0xf5/0x130 [ptlrpc]
 [<ffffffffa0badf01>] ptlrpc_queue_wait+0x81/0x220 [ptlrpc]
 [<ffffffffa0b85f6e>] ldlm_cli_enqueue+0x37e/0x870 [ptlrpc]
 [<ffffffffa0b8b050>] ? ldlm_completion_ast+0x0/0x9b0 [ptlrpc]
 [<ffffffffa122b390>] ? mdt_remote_blocking_ast+0x0/0x210 [mdt]
 [<ffffffffa140f615>] osp_md_object_lock+0x185/0x240 [osp]
 [<ffffffffa131c557>] lod_object_lock+0x147/0x860 [lod]
 [<ffffffffa0855109>] ? cfs_hash_bd_add_locked+0x29/0x90 [libcfs]
 [<ffffffffa099161f>] ? lu_object_find_try+0x1df/0x260 [obdclass]
 [<ffffffffa139e90b>] mdd_object_lock+0x3b/0xd0 [mdd]
 [<ffffffffa1238bba>] mdt_remote_object_lock+0x14a/0x310 [mdt]
 [<ffffffffa1238ea9>] mdt_object_lock_internal+0x129/0x2d0 [mdt]
 [<ffffffffa1239111>] mdt_object_lock+0x11/0x20 [mdt]
 [<ffffffffa1250474>] mdt_reint_unlink+0x204/0xff0 [mdt]
 [<ffffffffa1247a2d>] mdt_reint_rec+0x5d/0x200 [mdt]
 [<ffffffffa123381b>] mdt_reint_internal+0x62b/0xb80 [mdt]
 [<ffffffffa123420b>] mdt_reint+0x6b/0x120 [mdt]
 [<ffffffffa0c2350c>] tgt_request_handle+0x8ec/0x1470 [ptlrpc]
 [<ffffffffa0c22c20>] ? tgt_request_handle+0x0/0x1470 [ptlrpc]
 [<ffffffffa0b741eb>] handle_recovery_req+0x16b/0x290 [ptlrpc]
 [<ffffffffa0b7b17c>] target_recovery_thread+0x10dc/0x2550 [ptlrpc]
 [<ffffffff81064c12>] ? default_wake_function+0x12/0x20
 [<ffffffffa0b7a0a0>] ? target_recovery_thread+0x0/0x2550 [ptlrpc]
 [<ffffffff8109e78e>] kthread+0x9e/0xc0
 [<ffffffff8100c28a>] child_rip+0xa/0x20
 [<ffffffff8109e6f0>] ? kthread+0x0/0xc0
 [<ffffffff8100c280>] ? child_rip+0x0/0x20
```

The recovery thread is replaying a cross-MDT unlink and is blocked in ldlm_cli_enqueue, waiting for a remote lock from a peer MDT that is itself still in recovery. Ah, we should allow the lock enqueue during recovery; I will cook a patch.
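If the hang reproduces, the same signature can be confirmed on a live node without a full sysrq dump by reading the recovery thread's kernel stack directly (a sketch; it assumes the kernel exposes /proc/<pid>/stack, i.e. CONFIG_STACKTRACE is set):

```sh
# Print the current kernel stack of the target recovery thread.
pid=$(pgrep tgt_recov | head -n 1)
if [ -n "$pid" ]; then
    cat "/proc/$pid/stack"
else
    echo "no tgt_recov thread running" >&2
fi
```
|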
| Comment by Gerrit Updater [ 10/Dec/15 ] |
|
wangdi (di.wang@intel.com) uploaded a new patch: http://review.whamcloud.com/17539 |
| Comment by Frank Heckes (Inactive) [ 10/Dec/15 ] |
|
The error happens again, even with a recovery window of 1 h. Six MDTs have now been stalled for more than an hour:

```
---------------- lola-9 ----------------
mdt.soaked-MDT0002.recovery_status=
status: RECOVERING
recovery_start: 1449724636
time_remaining: 0
connected_clients: 9/16
req_replay_clients: 1
lock_repay_clients: 1
completed_clients: 8
evicted_clients: 7
replayed_requests: 10
queued_requests: 0
next_transno: 1090929751042
mdt.soaked-MDT0003.recovery_status=
status: RECOVERING
recovery_start: 1449724648
time_remaining: 0
connected_clients: 9/16
req_replay_clients: 1
lock_repay_clients: 1
completed_clients: 8
evicted_clients: 7
replayed_requests: 0
queued_requests: 0
next_transno: 1047980457124
---------------- lola-10 ----------------
mdt.soaked-MDT0004.recovery_status=
status: RECOVERING
recovery_start: 1449735425
time_remaining: 0
connected_clients: 14/14
req_replay_clients: 0
lock_repay_clients: 0
completed_clients: 14
evicted_clients: 0
replayed_requests: 0
queued_requests: 0
next_transno: 1129576470250
mdt.soaked-MDT0005.recovery_status=
status: RECOVERING
recovery_start: 1449735422
time_remaining: 0
connected_clients: 14/14
req_replay_clients: 0
lock_repay_clients: 0
completed_clients: 14
evicted_clients: 0
replayed_requests: 0
queued_requests: 0
next_transno: 1138166481892
---------------- lola-11 ----------------
mdt.soaked-MDT0006.recovery_status=
status: RECOVERING
recovery_start: 1449729112
time_remaining: 0
connected_clients: 14/14
req_replay_clients: 0
lock_repay_clients: 0
completed_clients: 14
evicted_clients: 0
replayed_requests: 0
queued_requests: 0
next_transno: 1116691658301
mdt.soaked-MDT0007.recovery_status=
status: RECOVERING
recovery_start: 1449729069
time_remaining: 0
connected_clients: 14/14
req_replay_clients: 0
lock_repay_clients: 0
completed_clients: 14
evicted_clients: 0
replayed_requests: 0
queued_requests: 0
next_transno: 1125281595927
```

I attached the messages files containing the stack traces of the affected nodes, as requested above.
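Per-node blocks like the above can be regenerated with a loop of this shape (a sketch; the node list is the one from this report):

```sh
# Collect MDT recovery status from each MDS node in the report layout.
for node in lola-9 lola-10 lola-11; do
    echo "---------------- $node ----------------"
    ssh "$node" 'lctl get_param mdt.*.recovery_status'
done
```
|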
| Comment by Gerrit Updater [ 21/Dec/15 ] |
|
Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/17539/ |
| Comment by Joseph Gmitter (Inactive) [ 21/Dec/15 ] |
|
Landed for 2.8 |