[LU-7848] Recovery process on MDS stalled Created: 04/Mar/16 Updated: 19/Mar/19 Resolved: 31/May/16 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.8.0 |
| Fix Version/s: | Lustre 2.9.0 |
| Type: | Bug | Priority: | Blocker |
| Reporter: | Frank Heckes (Inactive) | Assignee: | Di Wang |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | dne2, soak |
| Environment: | lola |
| Severity: | 3 |
| Description |
|
The error occurred during soak testing of build '20160302' (b2_8 RC4) (see also: https://wiki.hpdd.intel.com/display/Releases/Soak+Testing+on+Lola#SoakTestingonLola-20160302). DNE is enabled. MDTs had been formatted using ldiskfs, OSTs using zfs. MDS nodes are configured in an active-active HA failover configuration. (For the test set-up configuration see https://wiki.hpdd.intel.com/display/Releases/Soak+Testing+on+Lola#SoakTestingonLola-Configuration) The following effects can be observed:
Here are the failover and restart events listed for MDS node lola-11. The same 'structure' can be found for the other nodes:

mds_failover : 2016-03-03 10:24:12,345 - 2016-03-03 10:32:12,647 lola-10
Mar 3 10:31:58 lola-11 kernel: Lustre: soaked-MDT0005: Recovery over after 2:14, of 16 clients 0 recovered and 16 were evicted.
Mar 3 10:32:06 lola-11 kernel: Lustre: soaked-MDT0004: Recovery over after 0:20, of 16 clients 0 recovered and 16 were evicted.

mds_failover : 2016-03-03 18:11:42,958 - 2016-03-03 18:18:17,112 lola-10
Mar 3 18:18:03 lola-11 kernel: Lustre: soaked-MDT0005: Recovery over after 1:03, of 16 clients 0 recovered and 16 were evicted.
Mar 3 18:18:10 lola-11 kernel: Lustre: soaked-MDT0004: Recovery over after 0:08, of 16 clients 0 recovered and 16 were evicted.

mds_failover : 2016-03-03 22:04:51,554 - 2016-03-03 22:12:03,652 lola-10
Mar 3 22:11:50 lola-11 kernel: Lustre: soaked-MDT0005: Recovery over after 1:36, of 16 clients 0 recovered and 16 were evicted.
Mar 3 22:11:57 lola-11 kernel: Lustre: soaked-MDT0004: Recovery over after 0:22, of 16 clients 0 recovered and 16 were evicted.

mds_failover : 2016-03-04 00:11:27,161 - 2016-03-04 00:18:36,686 lola-10
Mar 4 00:18:23 lola-11 kernel: Lustre: soaked-MDT0005: Recovery over after 1:23, of 5 clients 0 recovered and 5 were evicted.
Mar 4 00:18:30 lola-11 kernel: Lustre: soaked-MDT0004: Recovery over after 0:23, of 6 clients 0 recovered and 6 were evicted.

mds_failover : 2016-03-04 01:51:11,775 - 2016-03-04 01:58:40,927 lola-10
Mar 4 01:58:27 lola-11 kernel: Lustre: soaked-MDT0005: Recovery over after 1:41, of 16 clients 0 recovered and 16 were evicted.
Mar 4 01:58:34 lola-11 kernel: Lustre: soaked-MDT0004: Recovery over after 0:23, of 16 clients 0 recovered and 16 were evicted.

mds_failover : 2016-03-04 02:54:18,928 - 2016-03-04 03:01:00,519 lola-10
Mar 4 03:00:47 lola-11 kernel: Lustre: soaked-MDT0005: Recovery over after 1:05, of 16 clients 0 recovered and 16 were evicted.
Mar 4 03:00:54 lola-11 kernel: Lustre: soaked-MDT0004: Recovery over after 0:09, of 16 clients 0 recovered and 16 were evicted.
------------------
mds_failover : 2016-03-03 09:36:44,457 - 2016-03-03 09:43:43,316 lola-11
Mar 3 09:50:42 lola-11 kernel: Lustre: soaked-MDT0007: Recovery over after 6:59, of 16 clients 16 recovered and 0 were evicted.
Mar 3 09:51:14 lola-11 kernel: Lustre: soaked-MDT0006: Recovery over after 7:31, of 16 clients 8 recovered and 8 were evicted.

mds_failover : 2016-03-03 13:06:05,210 - 2016-03-03 13:13:33,003 lola-11
Mar 3 14:13:46 lola-11 kernel: Lustre: soaked-MDT0006: Recovery over after 40:56, of 16 clients 16 recovered and 0 were evicted.
Mar 3 14:13:50 lola-11 kernel: Lustre: soaked-MDT0007: Recovery over after 41:50, of 16 clients 16 recovered and 0 were evicted.

mds_restart : 2016-03-03 13:26:05,005 - 2016-03-03 13:32:48,359 lola-11
Mar 3 14:13:46 lola-11 kernel: Lustre: soaked-MDT0006: Recovery over after 40:56, of 16 clients 16 recovered and 0 were evicted.
Mar 3 14:13:50 lola-11 kernel: Lustre: soaked-MDT0007: Recovery over after 41:50, of 16 clients 16 recovered and 0 were evicted.

mds_restart : 2016-03-03 20:14:23,309 - 2016-03-03 20:24:56,044 lola-11
Mar 3 20:37:51 lola-11 kernel: Lustre: soaked-MDT0006: Recovery over after 12:50, of 16 clients 16 recovered and 0 were evicted.
---> MDT0007 never recovered

mds_failover : 2016-03-03 22:15:27,654 - 2016-03-03 22:23:34,982 lola-11
Mar 4 01:03:03 lola-11 kernel: Lustre: soaked-MDT0007: Recovery over after 159:29, of 16 clients 14 recovered and 2 were evicted.
Mar 4 01:03:05 lola-11 kernel: Lustre: soaked-MDT0006: Recovery over after 159:30, of 16 clients 14 recovered and 2 were evicted.

mds_failover : 2016-03-04 05:10:37,638 - 2016-03-04 05:17:48,193 lola-11
---> MDT0006 never recovered
---> MDT0007 never recovered

mds_failover : 2016-03-04 05:35:12,194 - 2016-03-04 05:41:56,320 lola-11
---> MDT0006 never recovered
---> MDT0007 never recovered

mds_restart : 2016-03-04 06:53:30,098 - 2016-03-04 07:03:06,783 lola-11
---> MDT0006 never recovered
---> MDT0007 never recovered

Attached message, console and debug log files (with mask '
Same situation ended once with start of oom-killer (see |
| Comments |
| Comment by Frank Heckes (Inactive) [ 04/Mar/16 ] |
|
The recovery process cannot be aborted: the command lctl --device 4 abort_recovery blocks on I/O. |
| Comment by Di Wang [ 04/Mar/16 ] |
|
It looks like the update recovery thread is stuck retrieving the update log.

lod0007_rec00 S 0000000000000016 0 4584 2 0x00000080
 ffff880408c0f900 0000000000000046 ffff880408c0f8f0 ffffffffa0a5ad2c
 000001fc91878c1b 0000000000000000 0000000000000000 0000000000000286
 ffff880408c0f8a0 ffffffff8108742c ffff880408c065f8 ffff880408c0ffd8
Call Trace:
 [<ffffffffa0a5ad2c>] ? ptlrpc_unregister_reply+0x6c/0x810 [ptlrpc]
 [<ffffffff8108742c>] ? lock_timer_base+0x3c/0x70
 [<ffffffff8152b222>] schedule_timeout+0x192/0x2e0
 [<ffffffff81087540>] ? process_timeout+0x0/0x10
 [<ffffffffa0a5db61>] ptlrpc_set_wait+0x321/0x960 [ptlrpc]
 [<ffffffffa0a533f0>] ? ptlrpc_interrupted_set+0x0/0x120 [ptlrpc]
 [<ffffffff81064c00>] ? default_wake_function+0x0/0x20
 [<ffffffffa0a69e55>] ? lustre_msg_set_jobid+0xf5/0x130 [ptlrpc]
 [<ffffffffa0a5e221>] ptlrpc_queue_wait+0x81/0x220 [ptlrpc]
 [<ffffffffa13378d9>] osp_remote_sync+0x129/0x190 [osp]
 [<ffffffffa131ba07>] osp_attr_get+0x417/0x700 [osp]
 [<ffffffffa131d457>] osp_object_init+0x1c7/0x330 [osp]
 [<ffffffffa084c1e8>] lu_object_alloc+0xd8/0x320 [obdclass]
 [<ffffffffa084d5d1>] lu_object_find_try+0x151/0x260 [obdclass]
 [<ffffffffa084d791>] lu_object_find_at+0xb1/0xe0 [obdclass]
 [<ffffffff81174450>] ? cache_alloc_refill+0x1c0/0x240
 [<ffffffffa084e52c>] dt_locate_at+0x1c/0xa0 [obdclass]
 [<ffffffffa0813dce>] llog_osd_get_cat_list+0x8e/0xcd0 [obdclass]
 [<ffffffffa1255750>] lod_sub_prep_llog+0x110/0x7a0 [lod]
 [<ffffffffa12309c0>] ? lod_sub_recovery_thread+0x0/0xaf0 [lod]
 [<ffffffff8105bd83>] ? __wake_up+0x53/0x70
 [<ffffffffa1230c04>] lod_sub_recovery_thread+0x244/0xaf0 [lod]
 [<ffffffffa12309c0>] ? lod_sub_recovery_thread+0x0/0xaf0 [lod]
 [<ffffffff8109e78e>] kthread+0x9e/0xc0
 [<ffffffff8100c28a>] child_rip+0xa/0x20
 [<ffffffff8109e6f0>] ? kthread+0x0/0xc0
 [<ffffffff8100c280>] ? child_rip+0x0/0x20

Hmm, this kind of RPC should be allowed during recovery. Sigh, there are not enough logs to tell what happened; will check more. |
| Comment by Di Wang [ 04/Mar/16 ] |
|
Ah, it seems that after we disabled MDT-MDT eviction with the recent patch http://review.whamcloud.com/#/c/18676/, the new connection from the restarted MDT is always denied because of this, so retrieving the update log is blocked. I will cook a patch.

00010000:02000400:27.0:1457125392.284909:0:6144:0:(ldlm_lib.c:751:target_handle_reconnect()) soaked-MDT0004: already connected client soaked-MDT0003-mdtlov_UUID (at 192.168.1.109@o2ib10) with handle 0xb3e905baff54ad2d. Rejecting client with the same UUID trying to reconnect with handle 0x91819165c59714eb
00010000:00000001:27.0:1457125392.284913:0:6144:0:(ldlm_lib.c:756:target_handle_reconnect()) Process leaving (rc=18446744073709551502 : -114 : ffffffffffffff8e)
00010000:00000001:27.0:1457125392.284916:0:6144:0:(ldlm_lib.c:1190:target_handle_connect()) Process leaving via out (rc=18446744073709551502 : -114 : 0xffffffffffffff8e)
00000020:00000040:27.0:1457125392.284920:0:6144:0:(genops.c:817:class_export_put()) PUTting export ffff8807f3ca9c00 : new refcount 5
00000020:00000040:27.0:1457125392.284922:0:6144:0:(obd_config.c:740:class_decref()) Decref soaked-MDT0004 (ffff880834f44078) now 67
00010000:00000001:27.0:1457125392.284924:0:6144:0:(ldlm_lib.c:1410:target_handle_connect()) Process leaving (rc=18446744073709551502 : -114 : ffffffffffffff8e) |
| Comment by Gerrit Updater [ 05/Mar/16 ] |
|
wangdi (di.wang@intel.com) uploaded a new patch: http://review.whamcloud.com/18800 |
| Comment by Gerrit Updater [ 07/Mar/16 ] |
|
wangdi (di.wang@intel.com) uploaded a new patch: http://review.whamcloud.com/18813 |
| Comment by Frank Heckes (Inactive) [ 08/Mar/16 ] |
|
The problem persists after installing the build associated with change #18813.

lola-10.log:Mar 8 02:28:45 lola-10 kernel: Lustre: soaked-MDT0005: Recovery over after 183:52, of 16 clients 16 recovered and 0 were evicted.
lola-10.log:Mar 8 02:28:46 lola-10 kernel: Lustre: soaked-MDT0004: Recovery over after 183:04, of 16 clients 16 recovered and 0 were evicted.
lola-10.log:Mar 8 05:01:17 lola-10 kernel: Lustre: soaked-MDT0004: Recovery over after 14:17, of 16 clients 13 recovered and 3 were evicted.
lola-11.log:Mar 8 02:33:20 lola-11 kernel: Lustre: soaked-MDT0006: Recovery over after 112:56, of 16 clients 16 recovered and 0 were evicted.
lola-11.log:Mar 8 02:33:21 lola-11 kernel: Lustre: soaked-MDT0007: Recovery over after 113:44, of 16 clients 16 recovered and 0 were evicted.
lola-8.log:Mar 8 02:27:59 lola-8 kernel: Lustre: soaked-MDT0003: Recovery over after 0:45, of 16 clients 0 recovered and 16 were evicted.
lola-8.log:Mar 8 02:28:10 lola-8 kernel: Lustre: soaked-MDT0002: Recovery over after 0:26, of 16 clients 0 recovered and 16 were evicted.
lola-8.log:Mar 8 02:37:39 lola-8 kernel: Lustre: soaked-MDT0000: Recovery over after 63:43, of 16 clients 16 recovered and 0 were evicted.
lola-9.log:Mar 8 01:32:31 lola-9 kernel: Lustre: soaked-MDT0001: Recovery over after 1:03, of 16 clients 0 recovered and 16 were evicted.
lola-9.log:Mar 8 01:33:23 lola-9 kernel: Lustre: soaked-MDT0000: Recovery over after 1:10, of 16 clients 0 recovered and 16 were evicted.
lola-9.log:Mar 8 02:41:56 lola-9 kernel: Lustre: soaked-MDT0003: Recovery over after 13:41, of 16 clients 16 recovered and 0 were evicted.
lola-9.log:Mar 8 02:41:58 lola-9 kernel: Lustre: soaked-MDT0002: Recovery over after 13:43, of 16 clients 16 recovered and 0 were evicted. |
| Comment by Di Wang [ 09/Mar/16 ] |
|
[3/8/16, 8:05:50 PM] wangdi: it seems the old code is running |
| Comment by Frank Heckes (Inactive) [ 09/Mar/16 ] |
|
The effect still remains:

mds_restart : 2016-03-09 03:44:21,690 - 2016-03-09 03:55:17,693 lola-8
lola-8.log:Mar 9 03:56:08 lola-8 kernel: Lustre: soaked-MDT0000: Recovery over after 0:49, of 16 clients 16 recovered and 0 were evicted.
lola-8.log:Mar 9 03:56:09 lola-8 kernel: Lustre: soaked-MDT0001: Recovery over after 2:29, of 16 clients 16 recovered and 0 were evicted.

mds_restart : 2016-03-09 04:54:33,695 - 2016-03-09 05:02:29,259 lola-10
lola-10.log:Mar 9 05:16:28 lola-10 kernel: Lustre: soaked-MDT0007: Recovery over after 1:42, of 13 clients 0 recovered and 13 were evicted.
lola-10.log:Mar 9 05:16:44 lola-10 kernel: Lustre: soaked-MDT0006: Recovery over after 0:32, of 7 clients 0 recovered and 7 were evicted.

mds_failover : 2016-03-09 05:08:46,261 - 2016-03-09 05:16:50,083 lola-11
lola-11.log:Mar 9 05:47:15 lola-11 kernel: Lustre: soaked-MDT0006: Recovery over after 30:24, of 10 clients 10 recovered and 0 were evicted.
lola-11.log:Mar 9 05:47:23 lola-11 kernel: Lustre: soaked-MDT0007: Recovery over after 30:33, of 13 clients 12 recovered and 1 was evicted.

This is for build https://build.hpdd.intel.com/job/lustre-reviews/37662/ |
| Comment by Frank Heckes (Inactive) [ 09/Mar/16 ] |
|
All files (messages, debug logs) have been copied to lhn.lola.hpdd.intel.com:/scratch/crashdumps/lu-7848. The test session was executed between Mar 9th, 02:52 and 06:19. |
| Comment by Frank Heckes (Inactive) [ 14/Mar/16 ] |
|
The problem also occurs during soak testing of b2_8 RC5 (see https://wiki.hpdd.intel.com/display/Releases/Soak+Testing+on+Lola#SoakTestingonLola-20160309).
|
| Comment by Di Wang [ 14/Mar/16 ] |
|
Ah, the recovery process on the remote target is stuck in lu_object_find().

mdt_out03_021 D 000000000000000e 0 121458 2 0x00000080
 ffff88081be7bb50 0000000000000046 0000000000000000 0000000000000000
 ffff880378ae49c0 ffff88081ec00118 0000cf356d708513 ffff88081ec00118
 ffff88081be7bb50 000000010d91895e ffff8808161145f8 ffff88081be7bfd8
Call Trace:
 [<ffffffffa088b71d>] lu_object_find_at+0x3d/0xe0 [obdclass]
 [<ffffffffa0acd342>] ? __req_capsule_get+0x162/0x6e0 [ptlrpc]
 [<ffffffff81064c00>] ? default_wake_function+0x0/0x20
 [<ffffffffa0aa2dd0>] ? lustre_swab_object_update_reply+0x0/0xc0 [ptlrpc]
 [<ffffffffa088c52c>] dt_locate_at+0x1c/0xa0 [obdclass]
 [<ffffffffa0b13970>] out_handle+0x1030/0x1880 [ptlrpc]
 [<ffffffff8105872d>] ? check_preempt_curr+0x6d/0x90
 [<ffffffff8152b83e>] ? mutex_lock+0x1e/0x50
 [<ffffffffa0b031ca>] ? req_can_reconstruct+0x6a/0x120 [ptlrpc]
 [<ffffffffa0b0ac2c>] tgt_request_handle+0x8ec/0x1440 [ptlrpc]
 [<ffffffffa0ab7c61>] ptlrpc_main+0xd21/0x1800 [ptlrpc]
 [<ffffffff8152a39e>] ? thread_return+0x4e/0x7d0
 [<ffffffffa0ab6f40>] ? ptlrpc_main+0x0/0x1800 [ptlrpc]
 [<ffffffff8109e78e>] kthread+0x9e/0xc0
 [<ffffffff8100c28a>] child_rip+0xa/0x20
 [<ffffffff8109e6f0>] ? kthread+0x0/0xc0
 [<ffffffff8100c280>] ? child_rip+0x0/0x20

mdt_out03_022 D 000000000000001c 0 121824 2 0x00000080
 ffff88081c4afb50 0000000000000046 0000000000000000 0000000000000000
 ffff880378ae49c0 ffff8807e4900118 0000d18fc8bbd702 ffff8807e4900118
 ffff88081c4afb50 000000010db905ef ffff8807e0153ad8 ffff88081c4affd8
Call Trace:
 [<ffffffffa088b71d>] lu_object_find_at+0x3d/0xe0 [obdclass]
 [<ffffffffa0acd342>] ? __req_capsule_get+0x162/0x6e0 [ptlrpc]
 [<ffffffff81064c00>] ? default_wake_function+0x0/0x20
 [<ffffffffa0aa2dd0>] ? lustre_swab_object_update_reply+0x0/0xc0 [ptlrpc]
 [<ffffffffa088c52c>] dt_locate_at+0x1c/0xa0 [obdclass]
 [<ffffffffa0b13970>] out_handle+0x1030/0x1880 [ptlrpc]
 [<ffffffff8105872d>] ? check_preempt_curr+0x6d/0x90
 [<ffffffff8152b83e>] ? mutex_lock+0x1e/0x50
 [<ffffffffa0b031ca>] ? req_can_reconstruct+0x6a/0x120 [ptlrpc]
 [<ffffffffa0b0ac2c>] tgt_request_handle+0x8ec/0x1440 [ptlrpc]
 [<ffffffffa0ab7c61>] ptlrpc_main+0xd21/0x1800 [ptlrpc]
 [<ffffffff8152a39e>] ? thread_return+0x4e/0x7d0
 [<ffffffffa0ab6f40>] ? ptlrpc_main+0x0/0x1800 [ptlrpc]
 [<ffffffff8109e78e>] kthread+0x9e/0xc0
 [<ffffffff8100c28a>] child_rip+0xa/0x20
 [<ffffffff8109e6f0>] ? kthread+0x0/0xc0
 [<ffffffff8100c280>] ? child_rip+0x0/0x20 |
| Comment by Frank Heckes (Inactive) [ 17/Mar/16 ] |
|
Soak testing continued with the b2_8 RC5 build on a freshly reformatted Lustre FS. The recovery process no longer stalls, for either MDS restarts or failovers. All |
| Comment by Frank Heckes (Inactive) [ 18/Mar/16 ] |
|
After ~73 hours the recovery process stalled again and led to a continuously increasing allocation of slabs.
Sequence of events
MDT stalled after time_remaining is 0:

---------------- lola-8 ----------------
mdt.soaked-MDT0000.recovery_status=
status: RECOVERING
recovery_start: 1458286947
time_remaining: 0
connected_clients: 12/12
req_replay_clients: 2
lock_replay_clients: 4
completed_clients: 8
evicted_clients: 0
replayed_requests: 0
queued_requests: 2
next_transno: 167504104141

The recovery process can't be interrupted:

[root@lola-8 ~]# ps aux -L | grep lctl | grep -v grep
root 8740 8740 0.0 1 0.0 15268 732 pts/1 D+ 02:38 0:00 lctl --device 5 abort_recovery
[root@lola-8 ~]# cat /proc/8740/wchan
target_stop_recovery_thread

For this event the debug log file lola-8-lustre-log-20160318-0240 has been attached. |
| Comment by Frank Heckes (Inactive) [ 18/Mar/16 ] |
|
Same for umount:

[root@lola-8 ~]# ps aux | grep umount
root 21524 0.0 0.0 105184 776 pts/0 D+ 08:45 0:00 umount /mnt/soaked-mdt0
[root@lola-8 ~]# cat /proc/21524/wchan
target_stop_recovery_thread |
| Comment by Di Wang [ 22/Mar/16 ] |
|
In the newest run, MDT failover seems stuck again. Here is what happened:

1. lola-9 is restarted, and MDT0001 fails over to lola-8.

2. MDT0001 (now on lola-8) then tries to reconnect to MDT0002, and the connect is refused:

LustreError: 11-0: soaked-MDT0002-osp-MDT0001: operation mds_connect to node 192.168.1.110@o2ib10 failed: rc = -114
LustreError: 11-0: soaked-MDT0002-osp-MDT0001: operation mds_connect to node 192.168.1.110@o2ib10 failed: rc = -114

3. MDT0002 (on lola-10) denies the new connection because of:

Lustre: soaked-MDT0002: Client soaked-MDT0001-mdtlov_UUID seen on new nid 192.168.1.108@o2ib10 when existing nid 192.168.1.109@o2ib10 is already connected

Since an MDT always uses the same UUID to connect to other MDTs, I think we need to remove the old export when the target finds the same export coming from a different NID. A sketch of that idea follows below.
| Comment by Cliff White (Inactive) [ 25/Mar/16 ] |
|
As of Friday morning, we have had 12 MDS failovers without any issues and 11 MDS restarts. Average recovery times are low.

lola-11.log:Mar 24 15:14:37 lola-11 kernel: Lustre: soaked-MDT0002: Recovery over after 1:11, of 12 clients 12 recovered and 0 were evicted.
lola-11.log:Mar 24 17:12:16 lola-11 kernel: Lustre: soaked-MDT0002: Recovery over after 0:55, of 12 clients 12 recovered and 0 were evicted.
lola-11.log:Mar 24 18:53:29 lola-11 kernel: Lustre: soaked-MDT0002: Recovery over after 1:09, of 12 clients 12 recovered and 0 were evicted.
lola-11.log:Mar 24 20:37:20 lola-11 kernel: Lustre: soaked-MDT0002: Recovery over after 0:48, of 12 clients 12 recovered and 0 were evicted.
lola-11.log:Mar 24 23:50:03 lola-11 kernel: Lustre: soaked-MDT0003: Recovery over after 0:55, of 12 clients 12 recovered and 0 were evicted.
lola-11.log:Mar 25 01:05:09 lola-11 kernel: Lustre: soaked-MDT0003: Recovery over after 1:09, of 12 clients 12 recovered and 0 were evicted.
lola-11.log:Mar 25 02:05:57 lola-11 kernel: Lustre: soaked-MDT0003: Recovery over after 1:11, of 12 clients 12 recovered and 0 were evicted.
lola-11.log:Mar 25 07:16:49 lola-11 kernel: Lustre: soaked-MDT0003: Recovery over after 1:05, of 12 clients 12 recovered and 0 were evicted.
lola-11.log:Mar 25 08:37:15 lola-11 kernel: Lustre: soaked-MDT0003: Recovery over after 0:20, of 12 clients 12 recovered and 0 were evicted.
lola-8.log:Mar 24 08:56:21 lola-8 kernel: Lustre: soaked-MDT0000: Recovery over after 5:00, of 12 clients 3 recovered and 9 were evicted.
lola-8.log:Mar 24 14:14:03 lola-8 kernel: Lustre: soaked-MDT0000: Recovery over after 1:25, of 12 clients 12 recovered and 0 were evicted.
lola-8.log:Mar 24 18:07:07 lola-8 kernel: Lustre: soaked-MDT0000: Recovery over after 0:24, of 12 clients 12 recovered and 0 were evicted.
lola-8.log:Mar 24 19:16:10 lola-8 kernel: Lustre: soaked-MDT0000: Recovery over after 1:06, of 12 clients 12 recovered and 0 were evicted.
lola-8.log:Mar 25 00:49:16 lola-8 kernel: Lustre: soaked-MDT0000: Recovery over after 1:24, of 12 clients 12 recovered and 0 were evicted.
lola-8.log:Mar 25 02:17:26 lola-8 kernel: Lustre: soaked-MDT0000: Recovery over after 0:28, of 12 clients 12 recovered and 0 were evicted.
lola-8.log:Mar 25 03:03:24 lola-8 kernel: Lustre: soaked-MDT0000: Recovery over after 0:18, of 12 clients 12 recovered and 0 were evicted.
lola-8.log:Mar 25 06:09:20 lola-8 kernel: Lustre: soaked-MDT0001: Recovery over after 0:45, of 12 clients 12 recovered and 0 were evicted.
lola-8.log:Mar 25 08:00:11 lola-8 kernel: Lustre: soaked-MDT0000: Recovery over after 1:13, of 12 clients 12 recovered and 0 were evicted.
lola-8.log:Mar 25 09:23:57 lola-8 kernel: Lustre: soaked-MDT0000: Recovery over after 0:28, of 12 clients 12 recovered and 0 were evicted.
lola-9.log:Mar 24 08:56:21 lola-9 kernel: Lustre: soaked-MDT0001: Recovery over after 5:00, of 12 clients 3 recovered and 9 were evicted.
lola-9.log:Mar 24 10:58:12 lola-9 kernel: Lustre: soaked-MDT0001: Recovery over after 0:48, of 12 clients 12 recovered and 0 were evicted. |
| Comment by Cliff White (Inactive) [ 28/Mar/16 ] |
|
Soak hit the OOM killer again after ~48 hours of running.

2016-03-25 23:35:03,097:fsmgmt.fsmgmt:INFO Mounting soaked-MDT0003 on lola-10 ...
2016-03-25 23:37:36,548:fsmgmt.fsmgmt:INFO ... soaked-MDT0003 mounted successfully on lola-10
2016-03-25 23:37:36,548:fsmgmt.fsmgmt:INFO ... soaked-MDT0003 failed over
2016-03-25 23:37:36,549:fsmgmt.fsmgmt:INFO Wait for recovery to complete
Errors reported to syslog during the recovery:

Mar 26 00:00:54 lola-10 kernel: LustreError: 4923:0:(ldlm_lib.c:2773:target_queue_recovery_request()) @@@ dropping resent queued req req@ffff8803efc440c0 x1529699890967980/t0(270583579075) o36->df764ab7-38a0-d041-b3b3-6cbdb5966d59@192.168.1.131@o2ib100:70/0 lens 768/0 e 0 to 0 dl 1458975660 ref 2 fl Interpret:/6/ffffffff rc 0/-1
Mar 26 00:00:54 lola-10 kernel: LustreError: 4923:0:(ldlm_lib.c:2773:target_queue_recovery_request()) Skipped 320 previous similar messages
Mar 26 00:02:21 lola-10 kernel: LustreError: 3413:0:(import.c:675:ptlrpc_connect_import()) already connecting
Mar 26 00:02:21 lola-10 kernel: LustreError: 3413:0:(import.c:675:ptlrpc_connect_import()) Skipped 3 previous similar messages
Mar 27 03:15:21 lola-10 kernel: Lustre: soaked-MDT0003: Recovery already passed deadline 1650:08, It is most likely due to DNE recovery is failed or stuck, please wait a few more minutes or abort the recovery.

Due to the weekend, the recovery hung for quite a long time; the last console message is dated Mar 27 03:19:15. |
| Comment by Cliff White (Inactive) [ 31/Mar/16 ] |
|
The same failure recurred on Mar 31. It seems to be very regular, but takes a day or so to hit.

Lustre: Skipped 1164 previous similar messages
Lustre: soaked-MDT0000: Received new LWP connection from 192.168.1.110@o2ib10, removing former export from same NID
Lustre: Skipped 287 previous similar messages
LustreError: 4406:0:(ldlm_lib.c:2773:target_queue_recovery_request()) @@@ dropping resent queued req req@ffff88040c491850 x1530264857542592/t0(399435290195) o1000->soaked-MDT0001-mdtlov_UUID@0@lo:409/0 lens 344/0 e 0 to 0 dl 1459462974 ref 2 fl Interpret:/6/ffffffff rc 0/-1
LustreError: 4406:0:(ldlm_lib.c:2773:target_queue_recovery_request()) Skipped 263 previous similar messages
LustreError: 4304:0:(client.c:2874:ptlrpc_replay_interpret()) @@@ request replay timed out.
 req@ffff8807008016c0 x1530264857542592/t399435290195(399435290195) o1000->soaked-MDT0000-osp-MDT0001@0@lo:24/4 lens 344/4288 e 1146 to 1 dl 1459462988 ref 2 fl Interpret:EX/6/ffffffff rc -110/-1
LustreError: 4304:0:(client.c:2874:ptlrpc_replay_interpret()) Skipped 87 previous similar messages
Mar 31 15:25:01 lola-9 TIME: Time stamp for console
Lustre: soaked-MDT0003-osp-MDT0000: Connection to soaked-MDT0003 (at 192.168.1.111@o2ib10) was lost; in progress operations using this service will wait for recovery to complete
Lustre: soaked-MDT0000: Recovery already passed deadline 477:37, It is most likely due to DNE recovery is failed or stuck, please wait a few more minutes or abort the recovery.
Lustre: Skipped 875 previous similar messages |
| Comment by Frank Heckes (Inactive) [ 14/Apr/16 ] |
|
The error also appears during soak testing of build '20160413' (see https://wiki.hpdd.intel.com/display/Releases/Soak+Testing+on+Lola#SoakTestingonLola-20160413). The test configuration is the same as specified above. After MDS restarts and failovers of 3 MDTs on lola-[8,10,11], the recovery process is stuck. Sequence of events:
Attached files: message, console and debug_kernel files of lola-[8,10,11]. Debug files contain |
| Comment by Di Wang [ 14/Apr/16 ] |
|
You can ignore these console messages for now; this is a known issue, see:

soaked-MDT0003: waking for gap in transno, VBR is OFF (skip: 12890084916, ql: 5, comp: 15, conn: 20, next: 12890084917, next_update 12890084974 last_committed: 12890084882)
...

(A field-by-field reading of this message follows below.) Hmm, I did not see the console message from lola-9. And it also seems lola-10 crashed around 04-14 6am because of |
| Comment by Cliff White (Inactive) [ 14/Apr/16 ] |
|
Frank can confirm, but I believe the stuck recovery happened on the first mds_failover after we restarted. The restart was to recover from the lola-10 crash. So, not related. |
| Comment by Gerrit Updater [ 31/May/16 ] |
|
Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/18800/ |
| Comment by Peter Jones [ 31/May/16 ] |
|
Landed for 2.9 |