Sep 05 05:45:10 fir-md1-s1 kernel: LNet: HW NUMA nodes: 4, HW CPU cores: 48, npartitions: 4
Sep 05 05:45:10 fir-md1-s1 kernel: alg: No test for adler32 (adler32-zlib)
Sep 05 05:45:11 fir-md1-s1 kernel: Lustre: Lustre: Build Version: 2.12.2_119_g2d4809a
Sep 05 05:45:11 fir-md1-s1 kernel: LNet: 22224:0:(config.c:1626:lnet_inet_enumerate()) lnet: Ignoring interface em2: it's down
Sep 05 05:45:11 fir-md1-s1 kernel: LNet: Using FastReg for registration
Sep 05 05:45:11 fir-md1-s1 kernel: LNet: Added LNI 10.0.10.51@o2ib7 [8/256/0/180]
Sep 05 05:45:12 fir-md1-s1 kernel: LNet: 22268:0:(o2iblnd_cb.c:3381:kiblnd_check_conns()) Timed out tx for 10.0.10.202@o2ib7: 2268 seconds
Sep 05 05:45:12 fir-md1-s1 kernel: LDISKFS-fs (dm-2): mounted filesystem with ordered data mode. Opts: user_xattr,errors=remount-ro,no_mbcache,nodelalloc
Sep 05 05:45:12 fir-md1-s1 kernel: Lustre: MGS: Connection restored to b92d4030-d279-928f-b696-75826ee7c592 (at 0@lo)
Sep 05 05:45:15 fir-md1-s1 kernel: Lustre: MGS: Connection restored to 408c8c85-aca5-7ce2-7b02-e750452f6017 (at 10.8.18.21@o2ib6)
Sep 05 05:45:15 fir-md1-s1 kernel: Lustre: Skipped 1 previous similar message
Sep 05 05:45:17 fir-md1-s1 kernel: Lustre: MGS: Connection restored to dc1ee954-dfc3-5c2d-e823-d294d4ae61e4 (at 10.9.112.12@o2ib4)
Sep 05 05:45:20 fir-md1-s1 kernel: Lustre: MGS: Connection restored to 29d81558-4c1b-13d3-e3e1-806fb30c371f (at 10.9.105.46@o2ib4)
Sep 05 05:45:26 fir-md1-s1 kernel: Lustre: MGS: Connection restored to 4f91a6da-6194-b1e8-5362-485e544ba5ec (at 10.8.20.13@o2ib6)
Sep 05 05:45:26 fir-md1-s1 kernel: Lustre: Skipped 2 previous similar messages
Sep 05 05:45:34 fir-md1-s1 kernel: Lustre: MGS: Connection restored to 64b06c12-ce32-7428-d275-e4aa1cc89770 (at 10.9.101.8@o2ib4)
Sep 05 05:45:34 fir-md1-s1 kernel: Lustre: Skipped 13 previous similar messages
Sep 05 05:45:51 fir-md1-s1 kernel: Lustre: MGS: Connection restored to ee55bad1-01b6-ecb0-a263-9bee4ab8ca2a (at 10.8.23.33@o2ib6)
Sep 05 05:45:51 fir-md1-s1 kernel: Lustre: Skipped 4 previous similar messages
Sep 05 05:46:23 fir-md1-s1 kernel: Lustre: MGS: Connection restored to f87b07ba-93fc-f590-97d2-380e3737e466 (at 10.9.109.7@o2ib4)
Sep 05 05:46:23 fir-md1-s1 kernel: Lustre: Skipped 197 previous similar messages
Sep 05 05:46:25 fir-md1-s1 kernel: Lustre: MGS: Received new LWP connection from 10.9.110.8@o2ib4, removing former export from same NID
Sep 05 05:46:28 fir-md1-s1 kernel: Lustre: 22431:0:(client.c:2134:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1567687581/real 1567687581] req@ffff90a45823f980 x1643839499535792/t0(0) o104->MGS@10.9.108.44@o2ib4:15/16 lens 296/224 e 0 to 1 dl 1567687588 ref 1 fl Rpc:X/0/ffffffff rc 0/-1
Sep 05 05:46:33 fir-md1-s1 kernel: Lustre: MGS: Received new LWP connection from 10.9.103.42@o2ib4, removing former export from same NID
Sep 05 05:47:16 fir-md1-s1 kernel: LustreError: 22454:0:(ldlm_lib.c:3269:target_bulk_io()) @@@ truncated bulk READ 0(64) req@ffff90d47f27a050 x1631593079798720/t0(0) o256->b768c225-a390-a49b-7e5f-8edc021e781a@10.9.103.38@o2ib4:363/0 lens 304/240 e 1 to 0 dl 1567687648 ref 1 fl Interpret:/0/0 rc 0/0
Sep 05 05:47:23 fir-md1-s1 kernel: Lustre: MGS: Received new LWP connection from 10.9.115.1@o2ib4, removing former export from same NID
Sep 05 05:47:23 fir-md1-s1 kernel: Lustre: Skipped 3 previous similar messages
Sep 05 05:47:29 fir-md1-s1 kernel: Lustre: MGS: Connection restored to 0e575d38-5ae9-d68c-4105-0d5b3367ebcb (at 10.9.106.55@o2ib4)
Sep 05 05:47:29 fir-md1-s1 kernel: Lustre: Skipped 1152 previous similar messages
Sep 05 05:47:36 fir-md1-s1 kernel: Lustre: MGS: Received new LWP connection from 10.9.106.10@o2ib4, removing former export from same NID
Sep 05 05:47:44 fir-md1-s1 kernel: Lustre: MGS: Received new LWP connection from 10.9.115.1@o2ib4, removing former export from same NID
Sep 05 05:48:37 fir-md1-s1 kernel: LDISKFS-fs (dm-3): file extents enabled
Sep 05 05:48:37 fir-md1-s1 kernel: LDISKFS-fs (dm-0): file extents enabled
Sep 05 05:48:37 fir-md1-s1 kernel: , maximum tree depth=5
Sep 05 05:48:37 fir-md1-s1 kernel: , maximum tree depth=5
Sep 05 05:48:37 fir-md1-s1 kernel: LDISKFS-fs (dm-0): mounted filesystem with ordered data mode. Opts: user_xattr,errors=remount-ro,acl,no_mbcache,nodelalloc
Sep 05 05:48:37 fir-md1-s1 kernel: LDISKFS-fs (dm-3): mounted filesystem with ordered data mode. Opts: user_xattr,errors=remount-ro,acl,no_mbcache,nodelalloc
Sep 05 05:48:38 fir-md1-s1 kernel: LustreError: 137-5: fir-MDT0001_UUID: not available for connect from 10.8.27.13@o2ib6 (no target). If you are running an HA pair check that the target is mounted on the other server.
Sep 05 05:48:38 fir-md1-s1 kernel: Lustre: fir-MDT0002: Not available for connect from 10.0.10.52@o2ib7 (not set up)
Sep 05 05:48:39 fir-md1-s1 kernel: LustreError: 137-5: fir-MDT0003_UUID: not available for connect from 10.9.110.23@o2ib4 (no target). If you are running an HA pair check that the target is mounted on the other server.
Sep 05 05:48:39 fir-md1-s1 kernel: LustreError: Skipped 3 previous similar messages
Sep 05 05:48:39 fir-md1-s1 kernel: Lustre: fir-MDT0002: Not available for connect from 10.9.101.66@o2ib4 (not set up)
Sep 05 05:48:40 fir-md1-s1 kernel: LustreError: 11-0: fir-OST000c-osc-MDT0002: operation ost_connect to node 10.0.10.103@o2ib7 failed: rc = -16
Sep 05 05:48:40 fir-md1-s1 kernel: LustreError: 137-5: fir-MDT0003_UUID: not available for connect from 10.8.11.36@o2ib6 (no target). If you are running an HA pair check that the target is mounted on the other server.
Sep 05 05:48:40 fir-md1-s1 kernel: LustreError: Skipped 6 previous similar messages
Sep 05 05:48:40 fir-md1-s1 kernel: LustreError: 11-0: fir-OST000e-osc-MDT0002: operation ost_connect to node 10.0.10.103@o2ib7 failed: rc = -16
Sep 05 05:48:40 fir-md1-s1 kernel: LustreError: Skipped 3 previous similar messages
Sep 05 05:48:43 fir-md1-s1 kernel: LustreError: 137-5: fir-MDT0000_UUID: not available for connect from 10.8.13.26@o2ib6 (no target). If you are running an HA pair check that the target is mounted on the other server.
Sep 05 05:48:43 fir-md1-s1 kernel: LustreError: Skipped 16 previous similar messages
Sep 05 05:48:47 fir-md1-s1 kernel: LustreError: 137-5: fir-MDT0000_UUID: not available for connect from 10.0.10.102@o2ib7 (no target). If you are running an HA pair check that the target is mounted on the other server.
Sep 05 05:48:47 fir-md1-s1 kernel: LustreError: Skipped 29 previous similar messages
Sep 05 05:48:52 fir-md1-s1 kernel: Lustre: MGS: Received new LWP connection from 10.9.102.39@o2ib4, removing former export from same NID
Sep 05 05:48:52 fir-md1-s1 kernel: Lustre: Skipped 2 previous similar messages
Sep 05 05:48:57 fir-md1-s1 kernel: LustreError: 137-5: fir-MDT0003_UUID: not available for connect from 10.8.30.36@o2ib6 (no target). If you are running an HA pair check that the target is mounted on the other server.
Sep 05 05:48:57 fir-md1-s1 kernel: LustreError: Skipped 49 previous similar messages
Sep 05 05:49:05 fir-md1-s1 kernel: LustreError: 11-0: fir-OST000e-osc-MDT0002: operation ost_connect to node 10.0.10.103@o2ib7 failed: rc = -16
Sep 05 05:49:05 fir-md1-s1 kernel: LustreError: Skipped 43 previous similar messages
Sep 05 05:49:13 fir-md1-s1 kernel: LustreError: 137-5: fir-MDT0000_UUID: not available for connect from 10.9.106.40@o2ib4 (no target). If you are running an HA pair check that the target is mounted on the other server.
Sep 05 05:49:13 fir-md1-s1 kernel: LustreError: Skipped 606 previous similar messages
Sep 05 05:49:30 fir-md1-s1 kernel: LustreError: 11-0: fir-OST000c-osc-MDT0002: operation ost_connect to node 10.0.10.103@o2ib7 failed: rc = -16
Sep 05 05:49:30 fir-md1-s1 kernel: LustreError: Skipped 46 previous similar messages
Sep 05 05:49:45 fir-md1-s1 kernel: LustreError: 137-5: fir-MDT0000_UUID: not available for connect from 10.9.102.7@o2ib4 (no target). If you are running an HA pair check that the target is mounted on the other server.
Sep 05 05:49:45 fir-md1-s1 kernel: LustreError: Skipped 1955 previous similar messages
Sep 05 05:49:55 fir-md1-s1 kernel: LustreError: 11-0: fir-OST000c-osc-MDT0002: operation ost_connect to node 10.0.10.103@o2ib7 failed: rc = -16
Sep 05 05:49:55 fir-md1-s1 kernel: LustreError: Skipped 35 previous similar messages
Sep 05 05:50:17 fir-md1-s1 kernel: LNet: 22268:0:(o2iblnd_cb.c:3381:kiblnd_check_conns()) Timed out tx for 10.0.10.202@o2ib7: 0 seconds
Sep 05 05:50:17 fir-md1-s1 kernel: LNet: 22268:0:(o2iblnd_cb.c:3381:kiblnd_check_conns()) Skipped 1 previous similar message
Sep 05 05:50:21 fir-md1-s1 kernel: LustreError: 11-0: fir-OST000c-osc-MDT0002: operation ost_connect to node 10.0.10.103@o2ib7 failed: rc = -16
Sep 05 05:50:21 fir-md1-s1 kernel: LustreError: Skipped 35 previous similar messages
Sep 05 05:50:46 fir-md1-s1 kernel: LustreError: 11-0: fir-OST000c-osc-MDT0002: operation ost_connect to node 10.0.10.103@o2ib7 failed: rc = -16
Sep 05 05:50:46 fir-md1-s1 kernel: LustreError: Skipped 35 previous similar messages
Sep 05 05:50:49 fir-md1-s1 kernel: LustreError: 137-5: fir-MDT0003_UUID: not available for connect from 10.8.25.21@o2ib6 (no target). If you are running an HA pair check that the target is mounted on the other server.
Sep 05 05:50:49 fir-md1-s1 kernel: LustreError: Skipped 3414 previous similar messages
Sep 05 05:51:21 fir-md1-s1 kernel: Lustre: MGS: Received new LWP connection from 10.9.103.36@o2ib4, removing former export from same NID
Sep 05 05:51:21 fir-md1-s1 kernel: Lustre: MGS: Connection restored to 6b80f591-7f85-840f-613a-8029966db0a5 (at 10.9.103.36@o2ib4)
Sep 05 05:51:21 fir-md1-s1 kernel: Lustre: Skipped 14 previous similar messages
Sep 05 05:51:24 fir-md1-s1 kernel: INFO: task mount.lustre:22672 blocked for more than 120 seconds.
Sep 05 05:51:24 fir-md1-s1 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Sep 05 05:51:24 fir-md1-s1 kernel: mount.lustre D ffff90b467585140 0 22672 22670 0x00000082
Sep 05 05:51:24 fir-md1-s1 kernel: Call Trace:
Sep 05 05:51:24 fir-md1-s1 kernel: [] schedule_preempt_disabled+0x29/0x70
Sep 05 05:51:24 fir-md1-s1 kernel: [] __mutex_lock_slowpath+0xc7/0x1d0
Sep 05 05:51:24 fir-md1-s1 kernel: [] mutex_lock+0x1f/0x2f
Sep 05 05:51:24 fir-md1-s1 kernel: [] mgc_set_info_async+0xa98/0x15f0 [mgc]
Sep 05 05:51:24 fir-md1-s1 kernel: [] ? keys_fill+0xfc/0x180 [obdclass]
Sep 05 05:51:24 fir-md1-s1 kernel: [] server_start_targets+0x31a/0x2a20 [obdclass]
Sep 05 05:51:24 fir-md1-s1 kernel: [] ? lustre_start_mgc+0x260/0x2510 [obdclass]
Sep 05 05:51:24 fir-md1-s1 kernel: [] ? do_lcfg+0x2f0/0x500 [obdclass]
Sep 05 05:51:24 fir-md1-s1 kernel: [] server_fill_super+0x10cc/0x1890 [obdclass]
Sep 05 05:51:24 fir-md1-s1 kernel: [] lustre_fill_super+0x328/0x950 [obdclass]
Sep 05 05:51:24 fir-md1-s1 kernel: [] ? lustre_common_put_super+0x270/0x270 [obdclass]
Sep 05 05:51:24 fir-md1-s1 kernel: [] mount_nodev+0x4f/0xb0
Sep 05 05:51:24 fir-md1-s1 kernel: [] lustre_mount+0x38/0x60 [obdclass]
Sep 05 05:51:24 fir-md1-s1 kernel: [] mount_fs+0x3e/0x1b0
Sep 05 05:51:24 fir-md1-s1 kernel: [] vfs_kern_mount+0x67/0x110
Sep 05 05:51:24 fir-md1-s1 kernel: [] do_mount+0x1ef/0xce0
Sep 05 05:51:24 fir-md1-s1 kernel: [] ? __check_object_size+0x1ca/0x250
Sep 05 05:51:24 fir-md1-s1 kernel: [] ? kmem_cache_alloc_trace+0x3c/0x200
Sep 05 05:51:24 fir-md1-s1 kernel: [] SyS_mount+0x83/0xd0
Sep 05 05:51:24 fir-md1-s1 kernel: [] system_call_fastpath+0x22/0x27
Sep 05 05:51:28 fir-md1-s1 kernel: LustreError: 22431:0:(ldlm_request.c:129:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1567687588, 300s ago); not entering recovery in server code, just going back to sleep ns: MGS lock: ffff90a456f4b600/0x98816ce1399337c3 lrc: 3/0,1 mode: --/EX res: [0x726966:0x2:0x0].0x0 rrc: 1386 type: PLN flags: 0x40210400000020 nid: local remote: 0x0 expref: -99 pid: 22431 timeout: 0 lvb_type: 0
Sep 05 05:51:28 fir-md1-s1 kernel: LustreError: dumping log to /tmp/lustre-log.1567687888.22431
Sep 05 05:51:36 fir-md1-s1 kernel: LustreError: 11-0: fir-OST002e-osc-MDT0002: operation ost_connect to node 10.0.10.107@o2ib7 failed: rc = -16
Sep 05 05:51:36 fir-md1-s1 kernel: LustreError: Skipped 49 previous similar messages
Sep 05 05:51:58 fir-md1-s1 kernel: Lustre: MGS: Received new LWP connection from 10.8.27.14@o2ib6, removing former export from same NID
Sep 05 05:51:58 fir-md1-s1 kernel: Lustre: Skipped 356 previous similar messages
Sep 05 05:52:51 fir-md1-s1 kernel: LustreError: 11-0: fir-OST002e-osc-MDT0002: operation ost_connect to node 10.0.10.107@o2ib7 failed: rc = -16
Sep 05 05:52:51 fir-md1-s1 kernel: LustreError: Skipped 5 previous similar messages
Sep 05 05:52:57 fir-md1-s1 kernel: LustreError: 137-5: fir-MDT0000_UUID: not available for connect from 10.9.106.60@o2ib4 (no target). If you are running an HA pair check that the target is mounted on the other server.
Sep 05 05:52:57 fir-md1-s1 kernel: LustreError: Skipped 6985 previous similar messages
Sep 05 05:53:24 fir-md1-s1 kernel: INFO: task mount.lustre:22672 blocked for more than 120 seconds.
Sep 05 05:53:24 fir-md1-s1 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Sep 05 05:53:24 fir-md1-s1 kernel: mount.lustre D ffff90b467585140 0 22672 22670 0x00000082
Sep 05 05:53:24 fir-md1-s1 kernel: Call Trace:
Sep 05 05:53:24 fir-md1-s1 kernel: [] schedule_preempt_disabled+0x29/0x70
Sep 05 05:53:24 fir-md1-s1 kernel: [] __mutex_lock_slowpath+0xc7/0x1d0
Sep 05 05:53:24 fir-md1-s1 kernel: [] mutex_lock+0x1f/0x2f
Sep 05 05:53:24 fir-md1-s1 kernel: [] mgc_set_info_async+0xa98/0x15f0 [mgc]
Sep 05 05:53:24 fir-md1-s1 kernel: [] ? keys_fill+0xfc/0x180 [obdclass]
Sep 05 05:53:24 fir-md1-s1 kernel: [] server_start_targets+0x31a/0x2a20 [obdclass]
Sep 05 05:53:24 fir-md1-s1 kernel: [] ? lustre_start_mgc+0x260/0x2510 [obdclass]
Sep 05 05:53:24 fir-md1-s1 kernel: [] ? do_lcfg+0x2f0/0x500 [obdclass]
Sep 05 05:53:24 fir-md1-s1 kernel: [] server_fill_super+0x10cc/0x1890 [obdclass]
Sep 05 05:53:24 fir-md1-s1 kernel: [] lustre_fill_super+0x328/0x950 [obdclass]
Sep 05 05:53:24 fir-md1-s1 kernel: [] ? lustre_common_put_super+0x270/0x270 [obdclass]
Sep 05 05:53:24 fir-md1-s1 kernel: [] mount_nodev+0x4f/0xb0
Sep 05 05:53:24 fir-md1-s1 kernel: [] lustre_mount+0x38/0x60 [obdclass]
Sep 05 05:53:24 fir-md1-s1 kernel: [] mount_fs+0x3e/0x1b0
Sep 05 05:53:24 fir-md1-s1 kernel: [] vfs_kern_mount+0x67/0x110
Sep 05 05:53:24 fir-md1-s1 kernel: [] do_mount+0x1ef/0xce0
Sep 05 05:53:24 fir-md1-s1 kernel: [] ? __check_object_size+0x1ca/0x250
Sep 05 05:53:24 fir-md1-s1 kernel: [] ? kmem_cache_alloc_trace+0x3c/0x200
Sep 05 05:53:24 fir-md1-s1 kernel: [] SyS_mount+0x83/0xd0
Sep 05 05:53:24 fir-md1-s1 kernel: [] system_call_fastpath+0x22/0x27
Sep 05 05:53:38 fir-md1-s1 kernel: Lustre: MGS: Received new LWP connection from 10.0.10.52@o2ib7, removing former export from same NID
Sep 05 05:53:38 fir-md1-s1 kernel: Lustre: Skipped 1016 previous similar messages
Sep 05 05:53:40 fir-md1-s1 kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 0@lo) was lost; in progress operations using this service will fail
Sep 05 05:53:40 fir-md1-s1 kernel: LustreError: 22671:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1567687720, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff90a44ec24800/0x98816ce13993ad5e lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0x98816ce13993ad65 expref: -99 pid: 22671 timeout: 0 lvb_type: 0
Sep 05 05:53:40 fir-md1-s1 kernel: LustreError: 22950:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff90a421fbb980) refcount nonzero (1) after lock cleanup; forcing cleanup.
Sep 05 05:53:40 fir-md1-s1 kernel: Lustre: fir-MDT0002: Imperative Recovery enabled, recovery window shrunk from 300-900 down to 150-900
Sep 05 05:53:41 fir-md1-s1 kernel: Lustre: fir-MDD0002: changelog on
Sep 05 05:53:41 fir-md1-s1 kernel: Lustre: fir-MDT0002: in recovery but waiting for the first client to connect
Sep 05 05:53:41 fir-md1-s1 kernel: Lustre: fir-MDT0000: Not available for connect from 10.8.11.8@o2ib6 (not set up)
Sep 05 05:53:41 fir-md1-s1 kernel: Lustre: fir-MDT0002: Will be in recovery for at least 2:30, or until 1377 clients reconnect
Sep 05 05:54:41 fir-md1-s1 kernel: Lustre: 22978:0:(ldlm_lib.c:1763:extend_recovery_timer()) fir-MDT0002: extended recovery timer reaching hard limit: 900, extend: 1
Sep 05 05:55:41 fir-md1-s1 kernel: Lustre: 22978:0:(ldlm_lib.c:1763:extend_recovery_timer()) fir-MDT0002: extended recovery timer reaching hard limit: 900, extend: 1
Sep 05 05:55:41 fir-md1-s1 kernel: Lustre: 22978:0:(ldlm_lib.c:1763:extend_recovery_timer()) Skipped 1 previous similar message
Sep 05 05:56:11 fir-md1-s1 kernel: Lustre: 22978:0:(ldlm_lib.c:1763:extend_recovery_timer()) fir-MDT0002: extended recovery timer reaching hard limit: 900, extend: 1
Sep 05 05:56:11 fir-md1-s1 kernel: Lustre: 22978:0:(ldlm_lib.c:1763:extend_recovery_timer()) Skipped 2 previous similar messages
Sep 05 05:56:27 fir-md1-s1 kernel: Lustre: MGS: Received new LWP connection from 10.8.23.17@o2ib6, removing former export from same NID
Sep 05 05:56:27 fir-md1-s1 kernel: Lustre: Skipped 2 previous similar messages
Sep 05 05:56:27 fir-md1-s1 kernel: Lustre: MGS: Connection restored to b59cba3c-65eb-14f7-6f1c-4bd005d05348 (at 10.8.23.17@o2ib6)
Sep 05 05:56:27 fir-md1-s1 kernel: Lustre: Skipped 2804 previous similar messages
Sep 05 05:56:37 fir-md1-s1 kernel: LustreError: 11-0: fir-MDT0000-osp-MDT0002: operation mds_connect to node 0@lo failed: rc = -11
Sep 05 05:56:37 fir-md1-s1 kernel: LustreError: Skipped 3 previous similar messages
Sep 05 05:56:41 fir-md1-s1 kernel: Lustre: 22978:0:(ldlm_lib.c:1763:extend_recovery_timer()) fir-MDT0002: extended recovery timer reaching hard limit: 900, extend: 1
Sep 05 05:57:41 fir-md1-s1 kernel: Lustre: 22978:0:(ldlm_lib.c:1763:extend_recovery_timer()) fir-MDT0002: extended recovery timer reaching hard limit: 900, extend: 1
Sep 05 05:57:41 fir-md1-s1 kernel: Lustre: 22978:0:(ldlm_lib.c:1763:extend_recovery_timer()) Skipped 2 previous similar messages
Sep 05 05:58:41 fir-md1-s1 kernel: Lustre: 22978:0:(ldlm_lib.c:1763:extend_recovery_timer()) fir-MDT0002: extended recovery timer reaching hard limit: 900, extend: 1
Sep 05 05:58:41 fir-md1-s1 kernel: Lustre: 22978:0:(ldlm_lib.c:1763:extend_recovery_timer()) Skipped 2 previous similar messages
Sep 05 05:58:42 fir-md1-s1 kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 0@lo) was lost; in progress operations using this service will fail
Sep 05 05:58:42 fir-md1-s1 kernel: LustreError: 22672:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1567688022, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff90a41afa5100/0x98816ce139946e7f lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 3 type: PLN flags: 0x1000000000000 nid: local remote: 0x98816ce139946e86 expref: -99 pid: 22672 timeout: 0 lvb_type: 0
Sep 05 05:58:42 fir-md1-s1 kernel: LustreError: 23377:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff90a4156c6780) refcount nonzero (2) after lock cleanup; forcing cleanup.
Sep 05 05:58:42 fir-md1-s1 kernel: Lustre: fir-MDT0000: Imperative Recovery enabled, recovery window shrunk from 300-900 down to 150-900
Sep 05 05:58:42 fir-md1-s1 kernel: Lustre: fir-MDD0000: changelog on
Sep 05 05:58:42 fir-md1-s1 kernel: Lustre: fir-MDT0000: in recovery but waiting for the first client to connect
Sep 05 05:58:42 fir-md1-s1 kernel: Lustre: fir-MDT0000: Will be in recovery for at least 2:30, or until 1377 clients reconnect
Sep 05 05:58:44 fir-md1-s1 kernel: LustreError: 22717:0:(tgt_handler.c:525:tgt_filter_recovery_request()) @@@ not permitted during recovery req@ffff90d47b7fd100 x1643839542539440/t0(0) o601->fir-MDT0000-lwp-OST000c_UUID@10.0.10.103@o2ib7:290/0 lens 336/0 e 0 to 0 dl 1567688330 ref 1 fl Interpret:/0/ffffffff rc 0/-1
Sep 05 05:58:44 fir-md1-s1 kernel: LustreError: 22717:0:(tgt_handler.c:525:tgt_filter_recovery_request()) Skipped 100 previous similar messages
Sep 05 05:58:49 fir-md1-s1 kernel: LustreError: 23418:0:(tgt_handler.c:525:tgt_filter_recovery_request()) @@@ not permitted during recovery req@ffff90d46a1af980 x1643839542539376/t0(0) o601->fir-MDT0000-lwp-OST002d_UUID@10.0.10.108@o2ib7:295/0 lens 336/0 e 0 to 0 dl 1567688335 ref 1 fl Interpret:/0/ffffffff rc 0/-1
Sep 05 05:58:49 fir-md1-s1 kernel: LustreError: 23418:0:(tgt_handler.c:525:tgt_filter_recovery_request()) Skipped 1152 previous similar messages
Sep 05 05:58:50 fir-md1-s1 kernel: LustreError: 23430:0:(tgt_handler.c:525:tgt_filter_recovery_request()) @@@ not permitted during recovery req@ffff90a4127a1f80 x1643839542540304/t0(0) o601->fir-MDT0000-lwp-OST0007_UUID@10.0.10.102@o2ib7:296/0 lens 336/0 e 0 to 0 dl 1567688336 ref 1 fl Interpret:/0/ffffffff rc 0/-1
Sep 05 05:58:50 fir-md1-s1 kernel: LustreError: 22711:0:(tgt_handler.c:525:tgt_filter_recovery_request()) @@@ not permitted during recovery req@ffff90a4222b9200 x1643839542540320/t0(0) o601->fir-MDT0000-lwp-OST0007_UUID@10.0.10.102@o2ib7:296/0 lens 336/0 e 0 to 0 dl 1567688336 ref 1 fl Interpret:/0/ffffffff rc 0/-1
Sep 05 05:58:50 fir-md1-s1 kernel: LustreError: 22711:0:(tgt_handler.c:525:tgt_filter_recovery_request()) Skipped 1041 previous similar messages
Sep 05 05:58:50 fir-md1-s1 kernel: LustreError: 23430:0:(tgt_handler.c:525:tgt_filter_recovery_request()) Skipped 91 previous similar messages
Sep 05 05:58:54 fir-md1-s1 kernel: LustreError: 23493:0:(tgt_handler.c:525:tgt_filter_recovery_request()) @@@ not permitted during recovery req@ffff90c46d4fc800 x1643839542554608/t0(0) o601->fir-MDT0000-lwp-OST000a_UUID@10.0.10.101@o2ib7:300/0 lens 336/0 e 0 to 0 dl 1567688340 ref 1 fl Interpret:/0/ffffffff rc 0/-1
Sep 05 05:58:54 fir-md1-s1 kernel: LustreError: 23493:0:(tgt_handler.c:525:tgt_filter_recovery_request()) Skipped 1608 previous similar messages
Sep 05 05:59:04 fir-md1-s1 kernel: LustreError: 23500:0:(tgt_handler.c:525:tgt_filter_recovery_request()) @@@ not permitted during recovery req@ffff90b472ffd100 x1643839742677680/t0(0) o601->fir-MDT0000-lwp-MDT0001_UUID@10.0.10.52@o2ib7:310/0 lens 336/0 e 0 to 0 dl 1567688350 ref 1 fl Interpret:/0/ffffffff rc 0/-1
Sep 05 05:59:04 fir-md1-s1 kernel: LustreError: 23500:0:(tgt_handler.c:525:tgt_filter_recovery_request()) Skipped 5312 previous similar messages
Sep 05 05:59:07 fir-md1-s1 kernel: Lustre: fir-MDT0002: recovery is timed out, evict stale exports
Sep 05 05:59:07 fir-md1-s1 kernel: Lustre: 22978:0:(ldlm_lib.c:1763:extend_recovery_timer()) fir-MDT0002: extended recovery timer reaching hard limit: 900, extend: 1
Sep 05 05:59:07 fir-md1-s1 kernel: Lustre: 22978:0:(ldlm_lib.c:1763:extend_recovery_timer()) Skipped 2 previous similar messages
Sep 05 05:59:07 fir-md1-s1 kernel: Lustre: fir-MDT0002: Recovery already passed deadline 2:56. If you do not want to wait more, you may force taget eviction via 'lctl --device fir-MDT0002 abort_recovery.
Sep 05 05:59:07 fir-md1-s1 kernel: Lustre: fir-MDT0002: Recovery over after 5:26, of 1379 clients 1379 recovered and 0 were evicted.
Sep 05 05:59:20 fir-md1-s1 kernel: LustreError: 23421:0:(tgt_handler.c:525:tgt_filter_recovery_request()) @@@ not permitted during recovery req@ffff90d468a20480 x1643839542663616/t0(0) o601->fir-MDT0000-lwp-OST002b_UUID@10.0.10.108@o2ib7:326/0 lens 336/0 e 0 to 0 dl 1567688366 ref 1 fl Interpret:/0/ffffffff rc 0/-1
Sep 05 05:59:20 fir-md1-s1 kernel: LustreError: 23421:0:(tgt_handler.c:525:tgt_filter_recovery_request()) Skipped 2561 previous similar messages
Sep 05 05:59:25 fir-md1-s1 kernel: Lustre: fir-MDT0000: Recovery over after 0:43, of 1379 clients 1379 recovered and 0 were evicted.
Sep 05 06:01:33 fir-md1-s1 kernel: Lustre: MGS: Received new LWP connection from 10.9.107.39@o2ib4, removing former export from same NID
Sep 05 06:01:33 fir-md1-s1 kernel: Lustre: Skipped 1377 previous similar messages
Sep 05 06:03:49 fir-md1-s1 kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 0@lo) was lost; in progress operations using this service will fail
Sep 05 06:03:49 fir-md1-s1 kernel: LustreError: 22397:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1567688329, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff90b468cead00/0x98816ce1399728f7 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0x98816ce1399728fe expref: -99 pid: 22397 timeout: 0 lvb_type: 0
Sep 05 06:03:49 fir-md1-s1 kernel: LustreError: 24840:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff90b46daa2540) refcount nonzero (1) after lock cleanup; forcing cleanup.
Sep 05 06:06:40 fir-md1-s1 kernel: Lustre: MGS: Connection restored to bcc37d55-9177-b56c-8520-c6a26701976e (at 10.9.112.7@o2ib4)
Sep 05 06:06:40 fir-md1-s1 kernel: Lustre: Skipped 4195 previous similar messages
Sep 05 06:08:49 fir-md1-s1 kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 0@lo) was lost; in progress operations using this service will fail
Sep 05 06:08:49 fir-md1-s1 kernel: LustreError: 22397:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1567688629, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff90b425d5c800/0x98816ce13afb4ef4 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0x98816ce13afb4efb expref: -99 pid: 22397 timeout: 0 lvb_type: 0
Sep 05 06:08:49 fir-md1-s1 kernel: LustreError: 24931:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff90b425fffec0) refcount nonzero (1) after lock cleanup; forcing cleanup.
Sep 05 06:11:46 fir-md1-s1 kernel: Lustre: MGS: Received new LWP connection from 10.8.25.32@o2ib6, removing former export from same NID
Sep 05 06:11:46 fir-md1-s1 kernel: Lustre: Skipped 2755 previous similar messages
Sep 05 06:13:59 fir-md1-s1 kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 0@lo) was lost; in progress operations using this service will fail
Sep 05 06:13:59 fir-md1-s1 kernel: LustreError: 22397:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1567688939, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff90b417be69c0/0x98816ce1404b70ca lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0x98816ce1404b70d8 expref: -99 pid: 22397 timeout: 0 lvb_type: 0
Sep 05 06:13:59 fir-md1-s1 kernel: LustreError: 25007:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff90b4177820c0) refcount nonzero (1) after lock cleanup; forcing cleanup.
Sep 05 06:16:52 fir-md1-s1 kernel: Lustre: MGS: Connection restored to 489d3ef6-62df-7dc5-e7ae-6fd4b28cf491 (at 10.8.25.32@o2ib6)
Sep 05 06:16:52 fir-md1-s1 kernel: Lustre: Skipped 2757 previous similar messages
Sep 05 06:18:59 fir-md1-s1 kernel: LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 0@lo) was lost; in progress operations using this service will fail
Sep 05 06:18:59 fir-md1-s1 kernel: LustreError: 22397:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1567689239, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff90c3ef810900/0x98816ce146f4034d lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0x98816ce146f40370 expref: -99 pid: 22397 timeout: 0 lvb_type: 0
Sep 05 06:18:59 fir-md1-s1 kernel: LustreError: 25081:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff90b413e012c0) refcount nonzero (1) after lock cleanup; forcing cleanup.
Sep 05 06:22:01 fir-md1-s1 kernel: Lustre: MGS: Received new LWP connection from 10.8.25.32@o2ib6, removing former export from same NID
Sep 05 06:22:01 fir-md1-s1 kernel: Lustre: Skipped 2755 previous similar messages
Sep 05 06:23:27 fir-md1-s1 kernel: Lustre: DEBUG MARKER: Thu Sep 5 06:23:27 2019