Apr 26 06:56:01 fir-md1-s2 kernel: LNet: HW NUMA nodes: 4, HW CPU cores: 48, npartitions: 4
Apr 26 06:56:01 fir-md1-s2 kernel: alg: No test for adler32 (adler32-zlib)
Apr 26 06:56:02 fir-md1-s2 kernel: Lustre: Lustre: Build Version: 2.12.0.pl7
Apr 26 06:56:02 fir-md1-s2 kernel: LNet: Using FastReg for registration
Apr 26 06:56:02 fir-md1-s2 kernel: LNet: Added LNI 10.0.10.52@o2ib7 [8/256/0/180]
Apr 26 06:57:04 fir-md1-s2 kernel: LNetError: 98926:0:(lib-msg.c:811:lnet_is_health_check()) Msg is in inconsistent state, don't perform health checking (-103, 0)
Apr 26 06:57:04 fir-md1-s2 kernel: LNetError: 98928:0:(lib-msg.c:811:lnet_is_health_check()) Msg is in inconsistent state, don't perform health checking (-103, 0)
Apr 26 06:57:05 fir-md1-s2 kernel: LDISKFS-fs (dm-2): file extents enabled
Apr 26 06:57:05 fir-md1-s2 kernel: LDISKFS-fs (dm-3): file extents enabled
Apr 26 06:57:05 fir-md1-s2 kernel: , maximum tree depth=5
Apr 26 06:57:05 fir-md1-s2 kernel: , maximum tree depth=5
Apr 26 06:57:05 fir-md1-s2 kernel: LDISKFS-fs (dm-2): mounted filesystem with ordered data mode. Opts: user_xattr,errors=remount-ro,acl,no_mbcache,nodelalloc
Apr 26 06:57:05 fir-md1-s2 kernel: LDISKFS-fs (dm-3): mounted filesystem with ordered data mode. Opts: user_xattr,errors=remount-ro,acl,no_mbcache,nodelalloc
Apr 26 06:57:05 fir-md1-s2 kernel: LustreError: 137-5: fir-MDT0001_UUID: not available for connect from 10.8.24.14@o2ib6 (no target). If you are running an HA pair check that the target is mounted on the other server.
Apr 26 06:57:05 fir-md1-s2 kernel: LustreError: Skipped 5 previous similar messages
Apr 26 06:57:06 fir-md1-s2 kernel: Lustre: fir-MDT0001: Not available for connect from 10.8.23.21@o2ib6 (not set up)
Apr 26 06:57:06 fir-md1-s2 kernel: LustreError: 137-5: fir-MDT0003_UUID: not available for connect from 10.9.108.72@o2ib4 (no target). If you are running an HA pair check that the target is mounted on the other server.
Apr 26 06:57:06 fir-md1-s2 kernel: LustreError: Skipped 33 previous similar messages
Apr 26 06:57:07 fir-md1-s2 kernel: LustreError: 137-5: fir-MDT0003_UUID: not available for connect from 10.9.108.6@o2ib4 (no target). If you are running an HA pair check that the target is mounted on the other server.
Apr 26 06:57:07 fir-md1-s2 kernel: LustreError: Skipped 65 previous similar messages
Apr 26 06:57:09 fir-md1-s2 kernel: LustreError: 137-5: fir-MDT0003_UUID: not available for connect from 10.8.21.32@o2ib6 (no target). If you are running an HA pair check that the target is mounted on the other server.
Apr 26 06:57:09 fir-md1-s2 kernel: LustreError: Skipped 138 previous similar messages
Apr 26 06:57:09 fir-md1-s2 kernel: Lustre: fir-MDT0001: Imperative Recovery enabled, recovery window shrunk from 300-900 down to 150-900
Apr 26 06:57:09 fir-md1-s2 kernel: Lustre: fir-MDD0001: changelog on
Apr 26 06:57:09 fir-md1-s2 kernel: Lustre: fir-MDT0001: in recovery but waiting for the first client to connect
Apr 26 06:57:09 fir-md1-s2 kernel: Lustre: fir-MDT0001: Will be in recovery for at least 2:30, or until 1330 clients reconnect
Apr 26 06:57:09 fir-md1-s2 kernel: Lustre: fir-MDT0003: Not available for connect from 10.9.115.6@o2ib4 (not set up)
Apr 26 06:57:09 fir-md1-s2 kernel: Lustre: Skipped 15 previous similar messages
Apr 26 06:57:10 fir-md1-s2 kernel: LustreError: 11-0: fir-MDT0001-osp-MDT0003: operation mds_connect to node 0@lo failed: rc = -114
Apr 26 06:57:10 fir-md1-s2 kernel: Lustre: fir-MDT0003: Imperative Recovery enabled, recovery window shrunk from 300-900 down to 150-900
Apr 26 06:57:10 fir-md1-s2 kernel: Lustre: fir-MDD0003: changelog on
Apr 26 06:57:10 fir-md1-s2 kernel: Lustre: fir-MDT0003: in recovery but waiting for the first client to connect
Apr 26 06:57:10 fir-md1-s2 kernel: Lustre: fir-MDT0003: Will be in recovery for at least 2:30, or until 1330 clients reconnect
Apr 26 06:57:10 fir-md1-s2 kernel: Lustre: fir-MDT0001: Connection restored to (at 10.9.104.50@o2ib4)
Apr 26 06:57:10 fir-md1-s2 kernel: Lustre: Skipped 69 previous similar messages
Apr 26 06:57:11 fir-md1-s2 kernel: Lustre: fir-MDT0001: Connection restored to 7488b122-2e25-f42e-2c39-dfd38301e617 (at 10.9.102.12@o2ib4)
Apr 26 06:57:11 fir-md1-s2 kernel: Lustre: Skipped 52 previous similar messages
Apr 26 06:57:12 fir-md1-s2 kernel: Lustre: fir-MDT0003: Connection restored to 578f5a4f-2859-cb4e-4904-9e418412cd99 (at 10.8.7.14@o2ib6)
Apr 26 06:57:12 fir-md1-s2 kernel: Lustre: Skipped 109 previous similar messages
Apr 26 06:57:14 fir-md1-s2 kernel: Lustre: fir-MDT0003: Connection restored to 3f8d56d1-76ff-3834-a514-770cf38ca107 (at 10.9.105.1@o2ib4)
Apr 26 06:57:14 fir-md1-s2 kernel: Lustre: Skipped 250 previous similar messages
Apr 26 06:57:18 fir-md1-s2 kernel: Lustre: fir-MDT0003: Connection restored to (at 10.9.0.64@o2ib4)
Apr 26 06:57:18 fir-md1-s2 kernel: Lustre: Skipped 1587 previous similar messages
Apr 26 06:57:26 fir-md1-s2 kernel: Lustre: fir-MDT0003: Connection restored to ac9cd631-a534-1fba-753c-5069b079d1ad (at 10.8.24.16@o2ib6)
Apr 26 06:57:26 fir-md1-s2 kernel: Lustre: Skipped 605 previous similar messages
Apr 26 06:57:46 fir-md1-s2 kernel: Lustre: fir-MDT0001: Connection restored to 10.0.10.51@o2ib7 (at 10.0.10.51@o2ib7)
Apr 26 06:57:46 fir-md1-s2 kernel: Lustre: Skipped 35 previous similar messages
Apr 26 06:57:59 fir-md1-s2 kernel: Lustre: fir-MDT0001: Denying connection for new client 53385b7b-a550-b1a8-0abe-3b8ac836eb95(at 10.8.10.20@o2ib6), waiting for 1330 known clients (1203 recovered, 125 in progress, and 0 evicted) already passed deadline 3:18
Apr 26 06:57:59 fir-md1-s2 kernel: Lustre: Skipped 1 previous similar message
Apr 26 06:58:24 fir-md1-s2 kernel: Lustre: fir-MDT0003: Denying connection for new client 53385b7b-a550-b1a8-0abe-3b8ac836eb95(at 10.8.10.20@o2ib6), waiting for 1330 known clients (1238 recovered, 91 in progress, and 0 evicted) already passed deadline 3:43
Apr 26 06:58:24 fir-md1-s2 kernel: Lustre: Skipped 1 previous similar message
Apr 26 06:58:36 fir-md1-s2 kernel: Lustre: fir-MDT0002-osp-MDT0001: Connection restored to 10.0.10.51@o2ib7 (at 10.0.10.51@o2ib7)
Apr 26 06:58:36 fir-md1-s2 kernel: Lustre: Skipped 32 previous similar messages
Apr 26 06:58:49 fir-md1-s2 kernel: Lustre: fir-MDT0003: Denying connection for new client 53385b7b-a550-b1a8-0abe-3b8ac836eb95(at 10.8.10.20@o2ib6), waiting for 1330 known clients (1238 recovered, 91 in progress, and 0 evicted) already passed deadline 4:08
Apr 26 06:58:49 fir-md1-s2 kernel: Lustre: Skipped 1 previous similar message
Apr 26 06:59:16 fir-md1-s2 kernel: Lustre: fir-MDT0001: Denying connection for new client 53385b7b-a550-b1a8-0abe-3b8ac836eb95(at 10.8.10.20@o2ib6), waiting for 1330 known clients (1204 recovered, 125 in progress, and 0 evicted) already passed deadline 4:36
Apr 26 06:59:16 fir-md1-s2 kernel: Lustre: Skipped 1 previous similar message
Apr 26 06:59:40 fir-md1-s2 kernel: Lustre: fir-MDT0001: recovery is timed out, evict stale exports
Apr 26 06:59:40 fir-md1-s2 kernel: Lustre: fir-MDT0001: disconnecting 1 stale clients
Apr 26 06:59:40 fir-md1-s2 kernel: Lustre: fir-MDT0001: Recovery already passed deadline 4:00, It is most likely due to DNE recovery is failed or stuck, please wait a few more minutes or abort the recovery.
Apr 26 06:59:40 fir-md1-s2 kernel: Lustre: fir-MDT0001: Connection restored to 9134d6ed-2ca0-ddb8-9f7a-4d783ed8d98e (at 10.9.101.49@o2ib4)
Apr 26 06:59:40 fir-md1-s2 kernel: Lustre: Skipped 3 previous similar messages
Apr 26 06:59:40 fir-md1-s2 kernel: Lustre: Skipped 3 previous similar messages
Apr 26 06:59:40 fir-md1-s2 kernel: Lustre: fir-MDT0001: Recovery over after 2:31, of 1330 clients 1329 recovered and 1 was evicted.
Apr 26 06:59:40 fir-md1-s2 kernel: Lustre: fir-MDT0003: recovery is timed out, evict stale exports
Apr 26 06:59:40 fir-md1-s2 kernel: Lustre: fir-MDT0003: disconnecting 1 stale clients
Apr 26 06:59:40 fir-md1-s2 kernel: Lustre: fir-MDT0003: Recovery already passed deadline 4:00, It is most likely due to DNE recovery is failed or stuck, please wait a few more minutes or abort the recovery.
Apr 26 06:59:40 fir-md1-s2 kernel: Lustre: Skipped 6 previous similar messages
Apr 26 06:59:41 fir-md1-s2 kernel: Lustre: fir-MDT0003: Recovery over after 2:31, of 1330 clients 1329 recovered and 1 was evicted.
Apr 26 07:09:08 fir-md1-s2 kernel: Lustre: 100172:0:(mdd_device.c:1794:mdd_changelog_clear()) fir-MDD0001: Failure to clear the changelog for user 1: -22
Apr 26 07:28:54 fir-md1-s2 kernel: Lustre: fir-MDT0001: haven't heard from client 96991bbd-d088-ba9b-da44-41ffcb01dcb5 (at 10.8.15.10@o2ib6) in 227 seconds. I think it's dead, and I am evicting it. exp ffff987b13346000, cur 1556288934 expire 1556288784 last 1556288707
Apr 26 07:39:20 fir-md1-s2 kernel: Lustre: fir-MDT0001: Connection restored to c0a509a5-2c41-578e-21fb-9d567e5f805f (at 10.8.26.8@o2ib6)
Apr 26 07:39:20 fir-md1-s2 kernel: Lustre: Skipped 26 previous similar messages
Apr 26 07:40:42 fir-md1-s2 kernel: Lustre: fir-MDT0001: haven't heard from client c0a509a5-2c41-578e-21fb-9d567e5f805f (at 10.8.26.8@o2ib6) in 227 seconds. I think it's dead, and I am evicting it. exp ffff987abe6fc000, cur 1556289642 expire 1556289492 last 1556289415
Apr 26 07:40:42 fir-md1-s2 kernel: Lustre: Skipped 1 previous similar message
Apr 26 07:41:12 fir-md1-s2 kernel: Lustre: 100362:0:(mdd_device.c:1794:mdd_changelog_clear()) fir-MDD0001: Failure to clear the changelog for user 1: -22
Apr 26 07:51:36 fir-md1-s2 kernel: Lustre: fir-MDT0001: Connection restored to b8566a76-ed42-2ee8-d9fd-567ffce8f1d3 (at 10.8.9.8@o2ib6)
Apr 26 07:51:36 fir-md1-s2 kernel: Lustre: Skipped 1 previous similar message
Apr 26 07:52:23 fir-md1-s2 kernel: Lustre: fir-MDT0001: haven't heard from client b8566a76-ed42-2ee8-d9fd-567ffce8f1d3 (at 10.8.9.8@o2ib6) in 227 seconds. I think it's dead, and I am evicting it. exp ffff986977e43c00, cur 1556290343 expire 1556290193 last 1556290116
Apr 26 07:52:23 fir-md1-s2 kernel: Lustre: Skipped 1 previous similar message
Apr 26 08:25:10 fir-md1-s2 kernel: Lustre: 100396:0:(mdd_device.c:1794:mdd_changelog_clear()) fir-MDD0001: Failure to clear the changelog for user 1: -22
Apr 26 08:25:10 fir-md1-s2 kernel: Lustre: 100396:0:(mdd_device.c:1794:mdd_changelog_clear()) Skipped 5 previous similar messages
Apr 26 08:43:05 fir-md1-s2 kernel: Lustre: fir-MDT0001: Connection restored to (at 10.8.19.8@o2ib6)
Apr 26 08:45:01 fir-md1-s2 kernel: Lustre: fir-MDT0001: haven't heard from client 95e6fd6a-706d-ff18-fa02-0b0e9d53d014 (at 10.8.19.8@o2ib6) in 227 seconds. I think it's dead, and I am evicting it. exp ffff98688475d000, cur 1556293501 expire 1556293351 last 1556293274
Apr 26 08:45:01 fir-md1-s2 kernel: Lustre: Skipped 1 previous similar message
Apr 26 08:50:09 fir-md1-s2 kernel: Lustre: fir-MDT0001: haven't heard from client 820961b3-2494-2624-1872-4e954309f717 (at 10.8.7.23@o2ib6) in 227 seconds. I think it's dead, and I am evicting it. exp ffff987a71f4c000, cur 1556293809 expire 1556293659 last 1556293582
Apr 26 08:50:09 fir-md1-s2 kernel: Lustre: Skipped 1 previous similar message
Apr 26 08:51:48 fir-md1-s2 kernel: Lustre: fir-MDT0001: Connection restored to 5b36977e-2123-3c92-aece-1cc3fbfc3aea (at 10.8.14.6@o2ib6)
Apr 26 08:51:48 fir-md1-s2 kernel: Lustre: Skipped 2 previous similar messages
Apr 26 08:51:56 fir-md1-s2 kernel: Lustre: fir-MDT0001: Connection restored to 820961b3-2494-2624-1872-4e954309f717 (at 10.8.7.23@o2ib6)
Apr 26 08:51:56 fir-md1-s2 kernel: Lustre: Skipped 1 previous similar message
Apr 26 08:52:01 fir-md1-s2 kernel: Lustre: fir-MDT0001: Connection restored to (at 10.8.14.4@o2ib6)
Apr 26 08:52:01 fir-md1-s2 kernel: Lustre: Skipped 1 previous similar message
Apr 26 08:52:58 fir-md1-s2 kernel: Lustre: fir-MDT0001: haven't heard from client 1ea1acc1-3fe4-ba60-b1d5-137c8d4a178d (at 10.8.14.4@o2ib6) in 227 seconds. I think it's dead, and I am evicting it. exp ffff986b1e6c4400, cur 1556293978 expire 1556293828 last 1556293751
Apr 26 08:52:58 fir-md1-s2 kernel: Lustre: Skipped 1 previous similar message
Apr 26 08:54:41 fir-md1-s2 kernel: Lustre: fir-MDT0001: Connection restored to 38012cd9-c129-8cc6-54ac-519b05aa44c7 (at 10.8.11.14@o2ib6)
Apr 26 08:54:41 fir-md1-s2 kernel: Lustre: Skipped 1 previous similar message
Apr 26 09:00:28 fir-md1-s2 kernel: Lustre: fir-MDT0001: haven't heard from client d4240347-d03f-94ef-97f8-f6139e140ab0 (at 10.8.20.5@o2ib6) in 227 seconds. I think it's dead, and I am evicting it. exp ffff9866c9ebe400, cur 1556294428 expire 1556294278 last 1556294201
Apr 26 09:00:28 fir-md1-s2 kernel: Lustre: Skipped 1 previous similar message
Apr 26 09:01:40 fir-md1-s2 kernel: Lustre: 100454:0:(mdd_device.c:1794:mdd_changelog_clear()) fir-MDD0001: Failure to clear the changelog for user 1: -22
Apr 26 09:01:40 fir-md1-s2 kernel: Lustre: 100454:0:(mdd_device.c:1794:mdd_changelog_clear()) Skipped 1 previous similar message
Apr 26 09:16:03 fir-md1-s2 kernel: Lustre: 100187:0:(mdd_device.c:1794:mdd_changelog_clear()) fir-MDD0001: Failure to clear the changelog for user 1: -22
Apr 26 09:16:03 fir-md1-s2 kernel: Lustre: 100187:0:(mdd_device.c:1794:mdd_changelog_clear()) Skipped 5 previous similar messages
Apr 26 09:22:23 fir-md1-s2 kernel: Lustre: fir-MDT0001: Connection restored to b0ace5e9-c2a4-c49d-1c2e-c6c1f30dfaa4 (at 10.9.106.14@o2ib4)
Apr 26 09:22:23 fir-md1-s2 kernel: Lustre: Skipped 1 previous similar message
Apr 26 09:27:22 fir-md1-s2 kernel: Lustre: fir-MDT0001: Connection restored to 11c0ad46-dc6c-2ef5-3cdb-fbfdb747c28e (at 10.8.14.5@o2ib6)
Apr 26 09:27:22 fir-md1-s2 kernel: Lustre: Skipped 1 previous similar message
Apr 26 09:27:33 fir-md1-s2 kernel: Lustre: fir-MDT0001: Connection restored to 7f823a2b-8dc1-fc3e-31c6-2badf11024f4 (at 10.8.14.9@o2ib6)
Apr 26 09:27:33 fir-md1-s2 kernel: Lustre: Skipped 1 previous similar message
Apr 26 09:29:00 fir-md1-s2 kernel: Lustre: fir-MDT0001: Connection restored to (at 10.8.20.5@o2ib6)
Apr 26 09:29:00 fir-md1-s2 kernel: Lustre: Skipped 1 previous similar message
Apr 26 09:30:06 fir-md1-s2 kernel: Lustre: fir-MDT0001: Connection restored to 385ad677-c9f9-e2fe-015c-c152ce073ee0 (at 10.9.102.57@o2ib4)
Apr 26 09:30:06 fir-md1-s2 kernel: Lustre: Skipped 1 previous similar message
Apr 26 09:33:41 fir-md1-s2 kernel: Lustre: 100415:0:(mdd_device.c:1794:mdd_changelog_clear()) fir-MDD0001: Failure to clear the changelog for user 1: -22
Apr 26 09:35:16 fir-md1-s2 kernel: Lustre: fir-MDT0001: haven't heard from client d16504c2-48e2-66bb-8872-c6000ccd4b69 (at 10.8.26.28@o2ib6) in 227 seconds. I think it's dead, and I am evicting it. exp ffff987b13347c00, cur 1556296516 expire 1556296366 last 1556296289
Apr 26 09:35:16 fir-md1-s2 kernel: Lustre: Skipped 1 previous similar message
Apr 26 09:35:17 fir-md1-s2 kernel: Lustre: fir-MDT0001: Connection restored to d16504c2-48e2-66bb-8872-c6000ccd4b69 (at 10.8.26.28@o2ib6)
Apr 26 09:35:17 fir-md1-s2 kernel: Lustre: Skipped 1 previous similar message
Apr 26 09:38:00 fir-md1-s2 kernel: Lustre: 100448:0:(mdd_device.c:1794:mdd_changelog_clear()) fir-MDD0001: Failure to clear the changelog for user 1: -22
Apr 26 09:39:01 fir-md1-s2 kernel: Lustre: 100415:0:(mdd_device.c:1794:mdd_changelog_clear()) fir-MDD0001: Failure to clear the changelog for user 1: -22
Apr 26 09:41:25 fir-md1-s2 kernel: Lustre: 100362:0:(mdd_device.c:1794:mdd_changelog_clear()) fir-MDD0001: Failure to clear the changelog for user 1: -22
Apr 26 09:43:09 fir-md1-s2 kernel: Lustre: 100454:0:(mdd_device.c:1794:mdd_changelog_clear()) fir-MDD0001: Failure to clear the changelog for user 1: -22
Apr 26 10:06:54 fir-md1-s2 kernel: Lustre: 100437:0:(mdd_device.c:1794:mdd_changelog_clear()) fir-MDD0001: Failure to clear the changelog for user 1: -22
Apr 26 10:06:54 fir-md1-s2 kernel: Lustre: 100437:0:(mdd_device.c:1794:mdd_changelog_clear()) Skipped 1 previous similar message
Apr 26 10:14:30 fir-md1-s2 kernel: Lustre: fir-MDT0003: haven't heard from client c4f3aa61-62f8-6b68-bd9d-2e7b54f4f422 (at 10.8.0.66@o2ib6) in 227 seconds. I think it's dead, and I am evicting it. exp ffff9875d737e000, cur 1556298870 expire 1556298720 last 1556298643
Apr 26 10:14:30 fir-md1-s2 kernel: Lustre: Skipped 1 previous similar message
Apr 26 10:17:34 fir-md1-s2 kernel: Lustre: fir-MDT0001: Connection restored to c4f3aa61-62f8-6b68-bd9d-2e7b54f4f422 (at 10.8.0.66@o2ib6)
Apr 26 10:17:34 fir-md1-s2 kernel: Lustre: Skipped 1 previous similar message
Apr 26 10:20:28 fir-md1-s2 kernel: Lustre: fir-MDT0001: Connection restored to (at 10.8.27.23@o2ib6)
Apr 26 10:20:28 fir-md1-s2 kernel: Lustre: Skipped 1 previous similar message
Apr 26 10:21:19 fir-md1-s2 kernel: Lustre: fir-MDT0003: haven't heard from client d2a4a3c8-307a-b27b-571e-a59089481ceb (at 10.8.27.23@o2ib6) in 227 seconds. I think it's dead, and I am evicting it. exp ffff987324715000, cur 1556299279 expire 1556299129 last 1556299052
Apr 26 10:21:19 fir-md1-s2 kernel: Lustre: Skipped 1 previous similar message
Apr 26 10:27:30 fir-md1-s2 kernel: Lustre: fir-MDT0001: Connection restored to 77874b32-e186-7da5-3231-7675bcd6ec17 (at 10.9.102.6@o2ib4)
Apr 26 10:27:30 fir-md1-s2 kernel: Lustre: Skipped 1 previous similar message
Apr 26 10:39:55 fir-md1-s2 kernel: Lustre: 100187:0:(mdd_device.c:1794:mdd_changelog_clear()) fir-MDD0001: Failure to clear the changelog for user 1: -22
Apr 26 10:39:55 fir-md1-s2 kernel: Lustre: 100187:0:(mdd_device.c:1794:mdd_changelog_clear()) Skipped 4 previous similar messages
Apr 26 10:54:17 fir-md1-s2 kernel: Lustre: 100213:0:(mdd_device.c:1794:mdd_changelog_clear()) fir-MDD0001: Failure to clear the changelog for user 1: -22
Apr 26 10:54:17 fir-md1-s2 kernel: Lustre: 100213:0:(mdd_device.c:1794:mdd_changelog_clear()) Skipped 9 previous similar messages
Apr 26 10:54:40 fir-md1-s2 kernel: Lustre: 100141:0:(mdd_device.c:1794:mdd_changelog_clear()) fir-MDD0001: Failure to clear the changelog for user 1: -22
Apr 26 10:54:40 fir-md1-s2 kernel: Lustre: 100141:0:(mdd_device.c:1794:mdd_changelog_clear()) Skipped 1 previous similar message
Apr 26 11:04:55 fir-md1-s2 kernel: Lustre: 100437:0:(mdd_device.c:1794:mdd_changelog_clear()) fir-MDD0001: Failure to clear the changelog for user 1: -22
Apr 26 11:05:50 fir-md1-s2 kernel: Lustre: 100448:0:(mdd_device.c:1794:mdd_changelog_clear()) fir-MDD0001: Failure to clear the changelog for user 1: -22
Apr 26 11:05:50 fir-md1-s2 kernel: Lustre: 100448:0:(mdd_device.c:1794:mdd_changelog_clear()) Skipped 10 previous similar messages
Apr 26 11:08:34 fir-md1-s2 kernel: Lustre: fir-MDT0003: haven't heard from client 3ffe46ec-c7a3-b9d8-f46a-3b5970e85cf0 (at 10.8.27.23@o2ib6) in 227 seconds. I think it's dead, and I am evicting it. exp ffff985b004ab800, cur 1556302114 expire 1556301964 last 1556301887
Apr 26 11:08:34 fir-md1-s2 kernel: Lustre: Skipped 1 previous similar message
Apr 26 11:08:50 fir-md1-s2 kernel: Lustre: fir-MDT0001: Connection restored to (at 10.8.27.23@o2ib6)
Apr 26 11:08:50 fir-md1-s2 kernel: Lustre: Skipped 1 previous similar message
Apr 26 11:11:09 fir-md1-s2 kernel: Lustre: 99509:0:(mdd_device.c:1794:mdd_changelog_clear()) fir-MDD0001: Failure to clear the changelog for user 1: -22
Apr 26 11:24:26 fir-md1-s2 kernel: Lustre: 100437:0:(mdd_device.c:1794:mdd_changelog_clear()) fir-MDD0001: Failure to clear the changelog for user 1: -22
Apr 26 11:24:26 fir-md1-s2 kernel: Lustre: 100437:0:(mdd_device.c:1794:mdd_changelog_clear()) Skipped 3 previous similar messages
Apr 26 11:31:17 fir-md1-s2 kernel: Lustre: 100466:0:(mdd_device.c:1794:mdd_changelog_clear()) fir-MDD0001: Failure to clear the changelog for user 1: -22
Apr 26 11:31:17 fir-md1-s2 kernel: Lustre: 100466:0:(mdd_device.c:1794:mdd_changelog_clear()) Skipped 2 previous similar messages
Apr 26 11:34:59 fir-md1-s2 kernel: Lustre: 99509:0:(mdd_device.c:1794:mdd_changelog_clear()) fir-MDD0001: Failure to clear the changelog for user 1: -22
Apr 26 11:34:59 fir-md1-s2 kernel: Lustre: 99509:0:(mdd_device.c:1794:mdd_changelog_clear()) Skipped 7 previous similar messages
Apr 26 11:35:58 fir-md1-s2 kernel: Lustre: 100466:0:(mdd_device.c:1794:mdd_changelog_clear()) fir-MDD0001: Failure to clear the changelog for user 1: -22
Apr 26 11:35:58 fir-md1-s2 kernel: Lustre: 100466:0:(mdd_device.c:1794:mdd_changelog_clear()) Skipped 1 previous similar message
Apr 26 11:37:16 fir-md1-s2 kernel: Lustre: 100214:0:(mdd_device.c:1794:mdd_changelog_clear()) fir-MDD0001: Failure to clear the changelog for user 1: -22
Apr 26 11:37:16 fir-md1-s2 kernel: Lustre: 100214:0:(mdd_device.c:1794:mdd_changelog_clear()) Skipped 3 previous similar messages
Apr 26 11:38:14 fir-md1-s2 kernel: Lustre: 100214:0:(mdd_device.c:1794:mdd_changelog_clear()) fir-MDD0001: Failure to clear the changelog for user 1: -22
Apr 26 11:38:14 fir-md1-s2 kernel: Lustre: 100214:0:(mdd_device.c:1794:mdd_changelog_clear()) Skipped 9 previous similar messages
Apr 26 11:38:46 fir-md1-s2 kernel: Lustre: 100466:0:(mdd_device.c:1794:mdd_changelog_clear()) fir-MDD0001: Failure to clear the changelog for user 1: -22
Apr 26 11:38:46 fir-md1-s2 kernel: Lustre: 100466:0:(mdd_device.c:1794:mdd_changelog_clear()) Skipped 5 previous similar messages
Apr 26 11:48:45 fir-md1-s2 kernel: Lustre: 100357:0:(mdd_device.c:1794:mdd_changelog_clear()) fir-MDD0001: Failure to clear the changelog for user 1: -22
Apr 26 11:48:45 fir-md1-s2 kernel: Lustre: 100357:0:(mdd_device.c:1794:mdd_changelog_clear()) Skipped 1 previous similar message
Apr 26 11:51:18 fir-md1-s2 kernel: Lustre: 100213:0:(mdd_device.c:1794:mdd_changelog_clear()) fir-MDD0001: Failure to clear the changelog for user 1: -22
Apr 26 11:51:18 fir-md1-s2 kernel: Lustre: 100213:0:(mdd_device.c:1794:mdd_changelog_clear()) Skipped 8 previous similar messages
Apr 26 12:08:06 fir-md1-s2 kernel: Lustre: 100141:0:(mdd_device.c:1794:mdd_changelog_clear()) fir-MDD0001: Failure to clear the changelog for user 1: -22
Apr 26 12:08:06 fir-md1-s2 kernel: Lustre: 100141:0:(mdd_device.c:1794:mdd_changelog_clear()) Skipped 1 previous similar message
Apr 26 12:10:37 fir-md1-s2 kernel: Lustre: 99547:0:(mdd_device.c:1794:mdd_changelog_clear()) fir-MDD0001: Failure to clear the changelog for user 1: -22
Apr 26 12:14:10 fir-md1-s2 kernel: Lustre: fir-MDT0001: haven't heard from client c41887d8-667a-fcc3-3801-53e405eea2a0 (at 10.8.30.34@o2ib6) in 227 seconds. I think it's dead, and I am evicting it. exp ffff98798a6a7400, cur 1556306050 expire 1556305900 last 1556305823
Apr 26 12:14:10 fir-md1-s2 kernel: Lustre: Skipped 1 previous similar message
Apr 26 12:14:20 fir-md1-s2 kernel: Lustre: fir-MDT0001: Connection restored to c41887d8-667a-fcc3-3801-53e405eea2a0 (at 10.8.30.34@o2ib6)
Apr 26 12:14:20 fir-md1-s2 kernel: Lustre: Skipped 1 previous similar message
Apr 26 12:14:54 fir-md1-s2 kernel: Lustre: 100466:0:(mdd_device.c:1794:mdd_changelog_clear()) fir-MDD0001: Failure to clear the changelog for user 1: -22
Apr 26 12:14:54 fir-md1-s2 kernel: Lustre: 100466:0:(mdd_device.c:1794:mdd_changelog_clear()) Skipped 3 previous similar messages
Apr 26 12:22:58 fir-md1-s2 kernel: LNetError: 98921:0:(lib-msg.c:811:lnet_is_health_check()) Msg is in inconsistent state, don't perform health checking (-125, 0)
Apr 26 12:23:05 fir-md1-s2 kernel: Lustre: fir-MDT0003: Client 78ab2c22-394d-bdd4-0b8e-3553d6a47e28 (at 10.8.17.2@o2ib6) reconnecting
Apr 26 12:23:05 fir-md1-s2 kernel: Lustre: fir-MDT0003: Connection restored to 78ab2c22-394d-bdd4-0b8e-3553d6a47e28 (at 10.8.17.2@o2ib6)
Apr 26 12:26:28 fir-md1-s2 kernel: LNetError: 98919:0:(o2iblnd_cb.c:3324:kiblnd_check_txs_locked()) Timed out tx: active_txs, 0 seconds
Apr 26 12:26:28 fir-md1-s2 kernel: LNetError: 98919:0:(o2iblnd_cb.c:3399:kiblnd_check_conns()) Timed out RDMA with 10.0.10.3@o2ib7 (7): c: 4, oc: 0, rc: 8
Apr 26 12:26:28 fir-md1-s2 kernel: Lustre: 100425:0:(client.c:2132:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1556306782/real 1556306788] req@ffff988b214b2700 x1631885367955696/t0(0) o104->fir-MDT0003@10.0.10.3@o2ib7:15/16 lens 296/224 e 0 to 1 dl 1556306789 ref 1 fl Rpc:eX/0/ffffffff rc 0/-1
Apr 26 12:26:28 fir-md1-s2 kernel: Lustre: 100425:0:(client.c:2132:ptlrpc_expire_one_request()) Skipped 306 previous similar messages
Apr 26 12:26:29 fir-md1-s2 kernel: Lustre: 100425:0:(client.c:2132:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1556306789/real 1556306789] req@ffff988b214b2700 x1631885367955696/t0(0) o104->fir-MDT0003@10.0.10.3@o2ib7:15/16 lens 296/224 e 0 to 1 dl 1556306796 ref 1 fl Rpc:eX/2/ffffffff rc 0/-1
Apr 26 12:26:29 fir-md1-s2 kernel: Lustre: 100425:0:(client.c:2132:ptlrpc_expire_one_request()) Skipped 7756 previous similar messages
Apr 26 12:26:30 fir-md1-s2 kernel: Lustre: 99522:0:(client.c:2132:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1556306790/real 1556306790] req@ffff9867e2fea400 x1631885368035728/t0(0) o104->fir-MDT0003@10.0.10.3@o2ib7:15/16 lens 296/224 e 0 to 1 dl 1556306797 ref 1 fl Rpc:eX/2/ffffffff rc 0/-1
Apr 26 12:26:30 fir-md1-s2 kernel: Lustre: 99522:0:(client.c:2132:ptlrpc_expire_one_request()) Skipped 15549 previous similar messages
Apr 26 12:26:32 fir-md1-s2 kernel: Lustre: 99522:0:(client.c:2132:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1556306792/real 1556306792] req@ffff9867e2fea400 x1631885368035728/t0(0) o104->fir-MDT0003@10.0.10.3@o2ib7:15/16 lens 296/224 e 0 to 1 dl 1556306799 ref 1 fl Rpc:eX/2/ffffffff rc 0/-1
Apr 26 12:26:32 fir-md1-s2 kernel: Lustre: 99522:0:(client.c:2132:ptlrpc_expire_one_request()) Skipped 31593 previous similar messages
Apr 26 12:26:36 fir-md1-s2 kernel: Lustre: 99158:0:(client.c:2132:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1556306796/real 1556306796] req@ffff984f987d1800 x1631885368101776/t0(0) o104->fir-MDT0003@10.0.10.3@o2ib7:15/16 lens 296/224 e 0 to 1 dl 1556306803 ref 1 fl Rpc:eX/2/ffffffff rc 0/-1
Apr 26 12:26:36 fir-md1-s2 kernel: Lustre: 99158:0:(client.c:2132:ptlrpc_expire_one_request()) Skipped 81946 previous similar messages
Apr 26 12:26:37 fir-md1-s2 kernel: Lustre: 100490:0:(service.c:1372:ptlrpc_at_send_early_reply()) @@@ Couldn't add any time (5/5), not sending early reply req@ffff988a82273600 x1631588137131600/t0(0) o101->90d81c86-5db8-d29f-71be-9c3030e109bc@10.9.102.49@o2ib4:12/0 lens 480/568 e 1 to 0 dl 1556306802 ref 2 fl Interpret:/0/0 rc 0/0
Apr 26 12:26:38 fir-md1-s2 kernel: Lustre: 100187:0:(service.c:1372:ptlrpc_at_send_early_reply()) @@@ Couldn't add any time (5/5), not sending early reply req@ffff98502ce68000 x1631532741432784/t0(0) o101->3f0e14d1-f8f7-87d0-dc33-650c4734f9c2@10.9.104.10@o2ib4:13/0 lens 376/1600 e 1 to 0 dl 1556306803 ref 2 fl Interpret:/0/0 rc 0/0
Apr 26 12:26:43 fir-md1-s2 kernel: Lustre: 100300:0:(service.c:1372:ptlrpc_at_send_early_reply()) @@@ Couldn't add any time (5/5), not sending early reply req@ffff986a8f0b0c00 x1631596147305200/t0(0) o101->d3e22dd2-d25d-28e8-5f86-5d27043eaa8d@10.8.7.18@o2ib6:18/0 lens 480/568 e 1 to 0 dl 1556306808 ref 2 fl Interpret:/0/0 rc 0/0
Apr 26 12:26:44 fir-md1-s2 kernel: Lustre: fir-MDT0003: Client 90d81c86-5db8-d29f-71be-9c3030e109bc (at 10.9.102.49@o2ib4) reconnecting
Apr 26 12:26:44 fir-md1-s2 kernel: Lustre: fir-MDT0003: Connection restored to 90d81c86-5db8-d29f-71be-9c3030e109bc (at 10.9.102.49@o2ib4)
Apr 26 12:26:44 fir-md1-s2 kernel: Lustre: 100448:0:(client.c:2132:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1556306804/real 1556306804] req@ffff9859bf774b00 x1631885368144032/t0(0) o104->fir-MDT0003@10.0.10.3@o2ib7:15/16 lens 296/224 e 0 to 1 dl 1556306811 ref 1 fl Rpc:eX/2/ffffffff rc 0/-1
Apr 26 12:26:44 fir-md1-s2 kernel: Lustre: 100448:0:(client.c:2132:ptlrpc_expire_one_request()) Skipped 254150 previous similar messages
Apr 26 12:26:49 fir-md1-s2 kernel: Lustre: fir-MDT0003: Client d3e22dd2-d25d-28e8-5f86-5d27043eaa8d (at 10.8.7.18@o2ib6) reconnecting
Apr 26 12:26:49 fir-md1-s2 kernel: Lustre: Skipped 1 previous similar message
Apr 26 12:26:49 fir-md1-s2 kernel: Lustre: fir-MDT0003: Connection restored to (at 10.8.7.18@o2ib6)
Apr 26 12:26:49 fir-md1-s2 kernel: Lustre: Skipped 1 previous similar message
Apr 26 12:26:53 fir-md1-s2 kernel: LustreError: 100425:0:(ldlm_lockd.c:682:ldlm_handle_ast_error()) ### client (nid 10.0.10.3@o2ib7) failed to reply to blocking AST (req@ffff988b214b2700 x1631885367955696 status 0 rc -110), evict it ns: mdt-fir-MDT0003_UUID lock: ffff985aa8702640/0x4f3cef65e16dc8a2 lrc: 4/0,0 mode: PR/PR res: [0x28001b6c4:0x9a:0x0].0x0 bits 0x40/0x0 rrc: 9 type: IBT flags: 0x60000400000020 nid: 10.0.10.3@o2ib7 remote: 0xbbb5b46b2935fbb1 expref: 1984118 pid: 100415 timeout: 306236 lvb_type: 0
Apr 26 12:26:53 fir-md1-s2 kernel: LustreError: 138-a: fir-MDT0003: A client on nid 10.0.10.3@o2ib7 was evicted due to a lock blocking callback time out: rc -110
Apr 26 12:26:53 fir-md1-s2 kernel: LustreError: 99143:0:(ldlm_lockd.c:256:expired_lock_main()) ### lock callback timer expired after 31s: evicting client at 10.0.10.3@o2ib7 ns: mdt-fir-MDT0003_UUID lock: ffff985aa8702640/0x4f3cef65e16dc8a2 lrc: 3/0,0 mode: PR/PR res: [0x28001b6c4:0x9a:0x0].0x0 bits 0x40/0x0 rrc: 9 type: IBT flags: 0x60000400000020 nid: 10.0.10.3@o2ib7 remote: 0xbbb5b46b2935fbb1 expref: 1984119 pid: 100415 timeout: 0 lvb_type: 0
Apr 26 12:26:53 fir-md1-s2 kernel: LustreError: 100172:0:(client.c:1175:ptlrpc_import_delay_req()) @@@ IMP_CLOSED req@ffff985954f2f800 x1631885368408896/t0(0) o104->fir-MDT0003@10.0.10.3@o2ib7:15/16 lens 296/224 e 0 to 0 dl 0 ref 1 fl Rpc:/0/ffffffff rc 0/-1
Apr 26 12:26:54 fir-md1-s2 kernel: LustreError: 99402:0:(client.c:1175:ptlrpc_import_delay_req()) @@@ IMP_CLOSED req@ffff987a813a1500 x1631885368425280/t0(0) o104->fir-MDT0003@10.0.10.3@o2ib7:15/16 lens 296/224 e 0 to 0 dl 0 ref 1 fl Rpc:/0/ffffffff rc 0/-1
Apr 26 12:26:54 fir-md1-s2 kernel: LustreError: 99402:0:(client.c:1175:ptlrpc_import_delay_req()) Skipped 3 previous similar messages
Apr 26 12:27:00 fir-md1-s2 kernel: LustreError: 100020:0:(client.c:1175:ptlrpc_import_delay_req()) @@@ IMP_CLOSED req@ffff9872cc240f00 x1631885368487232/t0(0) o104->fir-MDT0003@10.0.10.3@o2ib7:15/16 lens 296/224 e 0 to 0 dl 0 ref 1 fl Rpc:/0/ffffffff rc 0/-1
Apr 26 12:27:00 fir-md1-s2 kernel: LustreError: 100020:0:(client.c:1175:ptlrpc_import_delay_req()) Skipped 2 previous similar messages
Apr 26 12:27:04 fir-md1-s2 kernel: LustreError: 100228:0:(client.c:1175:ptlrpc_import_delay_req()) @@@ IMP_CLOSED req@ffff988826f96f00 x1631885368553984/t0(0) o104->fir-MDT0003@10.0.10.3@o2ib7:15/16 lens 296/224 e 0 to 0 dl 0 ref 1 fl Rpc:/0/ffffffff rc 0/-1
Apr 26 12:27:04 fir-md1-s2 kernel: LustreError: 100228:0:(client.c:1175:ptlrpc_import_delay_req()) Skipped 1 previous similar message
Apr 26 12:27:05 fir-md1-s2 kernel: Lustre: fir-MDT0003: Client 90d81c86-5db8-d29f-71be-9c3030e109bc (at 10.9.102.49@o2ib4) reconnecting
Apr 26 12:27:05 fir-md1-s2 kernel: Lustre: fir-MDT0003: Connection restored to 90d81c86-5db8-d29f-71be-9c3030e109bc (at 10.9.102.49@o2ib4)
Apr 26 12:27:11 fir-md1-s2 kernel: LustreError: 100198:0:(client.c:1175:ptlrpc_import_delay_req()) @@@ IMP_CLOSED req@ffff986ab1834e00 x1631885368648752/t0(0) o104->fir-MDT0003@10.0.10.3@o2ib7:15/16 lens 296/224 e 0 to 0 dl 0 ref 1 fl Rpc:/0/ffffffff rc 0/-1
Apr 26 12:27:11 fir-md1-s2 kernel: LustreError: 100198:0:(client.c:1175:ptlrpc_import_delay_req()) Skipped 2 previous similar messages
Apr 26 12:27:18 fir-md1-s2 kernel: Lustre: 100403:0:(service.c:1372:ptlrpc_at_send_early_reply()) @@@ Couldn't add any time (5/-5), not sending early reply req@ffff988a4ea20c00 x1631535279942864/t270596782450(0) o36->b5280270-3b22-224e-0daa-bad5776be543@10.9.103.24@o2ib4:23/0 lens 488/3152 e 0 to 0 dl 1556306843 ref 2 fl Interpret:/0/0 rc 0/0
Apr 26 12:27:21 fir-md1-s2 kernel: LustreError: 100466:0:(client.c:1175:ptlrpc_import_delay_req()) @@@ IMP_CLOSED req@ffff985ae06dd400 x1631885368778128/t0(0) o104->fir-MDT0003@10.0.10.3@o2ib7:15/16 lens 296/224 e 0 to 0 dl 0 ref 1 fl Rpc:/0/ffffffff rc 0/-1
Apr 26 12:27:23 fir-md1-s2 kernel: LustreError: 99143:0:(ldlm_lockd.c:256:expired_lock_main()) ### lock callback timer expired after 30s: evicting client at 10.0.10.3@o2ib7 ns: mdt-fir-MDT0003_UUID lock: ffff985ad2f557c0/0x4f3cef65c8d97a8f lrc: 3/0,0 mode: PR/PR res: [0x28001a57e:0x231b:0x0].0x0 bits 0x1b/0x0 rrc: 6 type: IBT flags: 0x60200400000020 nid: 10.0.10.3@o2ib7 remote: 0xbbb5b46b24220451 expref: 1645176 pid: 100213 timeout: 306237 lvb_type: 0
Apr 26 12:27:24 fir-md1-s2 kernel: Lustre: fir-MDT0003: Client b5280270-3b22-224e-0daa-bad5776be543 (at 10.9.103.24@o2ib4) reconnecting
Apr 26 12:27:24 fir-md1-s2 kernel: Lustre: fir-MDT0003: Connection restored to (at 10.9.103.24@o2ib4)
Apr 26 12:27:26 fir-md1-s2 kernel: Lustre: 99921:0:(service.c:1372:ptlrpc_at_send_early_reply()) @@@ Couldn't add any time (5/-5), not sending early reply req@ffff987a9f8c1e00 x1631293233408080/t0(0) o101->766c6e9e-6589-78e8-fb69-8836dc850825@10.8.28.2@o2ib6:0/0 lens 480/568 e 0 to 0 dl 1556306850 ref 2 fl Interpret:/0/0 rc 0/0
Apr 26 12:27:26 fir-md1-s2 kernel: Lustre: 99921:0:(service.c:1372:ptlrpc_at_send_early_reply()) Skipped 1 previous similar message
Apr 26 12:27:29 fir-md1-s2 kernel: LustreError: 99143:0:(ldlm_lockd.c:256:expired_lock_main()) ### lock callback timer expired after 29s: evicting client at 10.0.10.3@o2ib7 ns: mdt-fir-MDT0003_UUID lock: ffff9868c5207500/0x4f3cef65e0462566 lrc: 3/0,0 mode: PR/PR res: [0x2800166fc:0xd6:0x0].0x0 bits 0x40/0x0 rrc: 9 type: IBT flags: 0x60000400000020 nid: 10.0.10.3@o2ib7 remote: 0xbbb5b46b28ec57e6 expref: 1612249 pid: 99158 timeout: 306243 lvb_type: 0
Apr 26 12:27:29 fir-md1-s2 kernel: LustreError: 99143:0:(ldlm_lockd.c:256:expired_lock_main()) Skipped 3 previous similar messages
Apr 26 12:27:32 fir-md1-s2 kernel: Lustre: fir-MDT0003: Client 0b49eccd-cda4-7bac-8560-4f28415786a3 (at 10.9.0.62@o2ib4) reconnecting
Apr 26 12:27:32 fir-md1-s2 kernel: Lustre: Skipped 3 previous similar messages
Apr 26 12:27:34 fir-md1-s2 kernel: LustreError: 99143:0:(ldlm_lockd.c:256:expired_lock_main()) ### lock callback timer expired after 30s: evicting client at 10.0.10.3@o2ib7 ns: mdt-fir-MDT0003_UUID lock: ffff98545d68ec00/0x4f3cef65e225471e lrc: 3/0,0 mode: PR/PR res: [0x28001b767:0x25:0x0].0x0 bits 0x1b/0x0 rrc: 14 type: IBT flags: 0x60200400000020 nid: 10.0.10.3@o2ib7 remote: 0xbbb5b46b295c3548 expref: 1586794 pid: 99158 timeout: 306248 lvb_type: 0
Apr 26 12:27:36 fir-md1-s2 kernel: Lustre: 100113:0:(service.c:1372:ptlrpc_at_send_early_reply()) @@@ Couldn't add any time (5/-5), not sending early reply req@ffff9869d7353600 x1631642407568800/t0(0) o101->8290d58b-0905-6161-be47-84efd8d09138@10.9.108.18@o2ib4:11/0 lens 376/1600 e 0 to 0 dl 1556306861 ref 2 fl Interpret:/0/0 rc 0/0
Apr 26 12:27:36 fir-md1-s2 kernel: Lustre: 100113:0:(service.c:1372:ptlrpc_at_send_early_reply()) Skipped 3 previous similar messages
Apr 26 12:27:37 fir-md1-s2 kernel: LustreError: 100259:0:(client.c:1175:ptlrpc_import_delay_req()) @@@ IMP_CLOSED req@ffff987362e2f200 x1631885368992128/t0(0) o104->fir-MDT0003@10.0.10.3@o2ib7:15/16 lens 296/224 e 0 to 0 dl 0 ref 1 fl Rpc:/0/ffffffff rc 0/-1
Apr 26 12:27:37 fir-md1-s2 kernel: LustreError: 100259:0:(client.c:1175:ptlrpc_import_delay_req()) Skipped 5 previous similar messages
Apr 26 12:27:41 fir-md1-s2 kernel: LustreError: 99143:0:(ldlm_lockd.c:256:expired_lock_main()) ### lock callback timer expired after 30s: evicting client at 10.0.10.3@o2ib7 ns: mdt-fir-MDT0003_UUID lock: ffff9868a4afd340/0x4f3cef65e2b3e1d7 lrc: 3/0,0 mode: PR/PR res: [0x28001b608:0x32:0x0].0x0 bits 0x1b/0x0 rrc: 7 type: IBT flags: 0x60200400000020 nid: 10.0.10.3@o2ib7 remote: 0xbbb5b46b2979c513 expref: 1553661 pid: 99158 timeout: 306255 lvb_type: 0
Apr 26 12:27:41 fir-md1-s2 kernel: LustreError: 99143:0:(ldlm_lockd.c:256:expired_lock_main()) Skipped 1 previous similar message
Apr 26 12:27:42 fir-md1-s2 kernel: Lustre: fir-MDT0003: Connection restored to (at 10.9.108.18@o2ib4)
Apr 26 12:27:42 fir-md1-s2 kernel: Lustre: Skipped 5 previous similar messages
Apr 26 12:27:50 fir-md1-s2 kernel: LustreError: 99143:0:(ldlm_lockd.c:256:expired_lock_main()) ### lock callback timer expired after 29s: evicting client at 10.0.10.3@o2ib7 ns: mdt-fir-MDT0003_UUID lock: ffff985b126fc5c0/0x4f3cef65e27daeb8 lrc: 3/0,0 mode: PR/PR res: [0x28001b54b:0x30e6:0x0].0x0 bits 0x13/0x0 rrc: 5 type: IBT flags: 0x60200400000020 nid: 10.0.10.3@o2ib7 remote: 0xbbb5b46b296de4b2 expref: 1514410 pid: 100415 timeout: 306264 lvb_type: 0
Apr 26 12:27:52 fir-md1-s2 kernel: Lustre: fir-MDT0003: Client d3ad20a5-81a2-915a-58a3-1542c85784cf (at 10.9.107.53@o2ib4) reconnecting
Apr 26 12:27:52 fir-md1-s2 kernel: Lustre: Skipped 3 previous similar messages
Apr 26 12:27:53 fir-md1-s2 kernel: Lustre: 100172:0:(service.c:1372:ptlrpc_at_send_early_reply()) @@@ Couldn't add any time (5/-5), not sending early reply req@ffff985ac5afc800 x1631641796829440/t0(0) o36->e0e6d63f-0238-284f-ef41-faf2bb976ece@10.9.108.52@o2ib4:28/0 lens 496/448 e 0 to 0 dl 1556306878 ref 2 fl Interpret:/0/0 rc 0/0
Apr 26 12:27:53 fir-md1-s2 kernel: Lustre: 100172:0:(service.c:1372:ptlrpc_at_send_early_reply()) Skipped 3 previous similar messages
Apr 26 12:28:06 fir-md1-s2 kernel: LustreError: 99143:0:(ldlm_lockd.c:256:expired_lock_main()) ### lock callback timer expired after 29s: evicting client at 10.0.10.3@o2ib7 ns: mdt-fir-MDT0003_UUID lock: ffff98699f35c5c0/0x4f3cef65e1986edb lrc: 3/0,0 mode: PR/PR res: [0x28001a6a3:0x1c:0x0].0x0 bits 0x40/0x0 rrc: 8 type: IBT flags: 0x60000400000020 nid: 10.0.10.3@o2ib7 remote: 0xbbb5b46b293de440 expref: 1451654 pid: 99547 timeout: 306280 lvb_type: 0
Apr 26 12:28:06 fir-md1-s2 kernel: LustreError: 99143:0:(ldlm_lockd.c:256:expired_lock_main()) Skipped 5 previous similar messages
Apr 26 12:28:12 fir-md1-s2 kernel: LustreError: 99386:0:(client.c:1175:ptlrpc_import_delay_req()) @@@ IMP_CLOSED
req@ffff986ab1834b00 x1631885369446256/t0(0) o104->fir-MDT0003@10.0.10.3@o2ib7:15/16 lens 296/224 e 0 to 0 dl 0 ref 1 fl Rpc:/0/ffffffff rc 0/-1 Apr 26 12:28:12 fir-md1-s2 kernel: LustreError: 99386:0:(client.c:1175:ptlrpc_import_delay_req()) Skipped 12 previous similar messages Apr 26 12:28:15 fir-md1-s2 kernel: Lustre: fir-MDT0003: Connection restored to c4a32940-a9be-512a-496b-f65411562f7a (at 10.9.106.43@o2ib4) Apr 26 12:28:15 fir-md1-s2 kernel: Lustre: Skipped 16 previous similar messages Apr 26 12:28:23 fir-md1-s2 kernel: LustreError: 100425:0:(ldlm_request.c:129:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1556306813, 90s ago); not entering recovery in server code, just going back to sleep ns: mdt-fir-MDT0003_UUID lock: ffff988863715c40/0x4f3cef65e51250b0 lrc: 3/0,1 mode: --/PW res: [0x28001b6c4:0x9a:0x0].0x0 bits 0x40/0x0 rrc: 9 type: IBT flags: 0x40210000000000 nid: local remote: 0x0 expref: -99 pid: 100425 timeout: 0 lvb_type: 0 Apr 26 12:28:23 fir-md1-s2 kernel: LustreError: dumping log to /tmp/lustre-log.1556306903.100425 Apr 26 12:28:25 fir-md1-s2 kernel: LustreError: 100186:0:(ldlm_request.c:129:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1556306814, 90s ago); not entering recovery in server code, just going back to sleep ns: mdt-fir-MDT0003_UUID lock: ffff986a007eb3c0/0x4f3cef65e52a5160 lrc: 3/0,1 mode: --/EX res: [0x28001ad52:0x203:0x0].0x0 bits 0x3/0x0 rrc: 6 type: IBT flags: 0x40210000000000 nid: local remote: 0x0 expref: -99 pid: 100186 timeout: 0 lvb_type: 0 Apr 26 12:28:25 fir-md1-s2 kernel: LustreError: 100186:0:(ldlm_request.c:129:ldlm_expired_completion_wait()) Skipped 1 previous similar message Apr 26 12:28:26 fir-md1-s2 kernel: Lustre: fir-MDT0003: Client f4957443-90e7-a7b5-e3db-c45f8726f1c2 (at 10.9.102.13@o2ib4) reconnecting Apr 26 12:28:26 fir-md1-s2 kernel: Lustre: Skipped 17 previous similar messages Apr 26 12:28:27 fir-md1-s2 kernel: Lustre: 100107:0:(service.c:1372:ptlrpc_at_send_early_reply()) 
@@@ Couldn't add any time (5/-5), not sending early reply req@ffff986abd912400 x1631534643110416/t0(0) o101->4fd3697b-8ac3-d03c-d547-c2a2aae5b292@10.8.28.8@o2ib6:2/0 lens 1784/3288 e 0 to 0 dl 1556306912 ref 2 fl Interpret:/0/0 rc 0/0 Apr 26 12:28:27 fir-md1-s2 kernel: Lustre: 100107:0:(service.c:1372:ptlrpc_at_send_early_reply()) Skipped 7 previous similar messages Apr 26 12:28:30 fir-md1-s2 kernel: LustreError: 100020:0:(ldlm_request.c:129:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1556306820, 90s ago); not entering recovery in server code, just going back to sleep ns: mdt-fir-MDT0003_UUID lock: ffff9878eda45100/0x4f3cef65e52df428 lrc: 3/0,1 mode: --/PW res: [0x2800166fc:0xd6:0x0].0x0 bits 0x40/0x0 rrc: 9 type: IBT flags: 0x40210000000000 nid: local remote: 0x0 expref: -99 pid: 100020 timeout: 0 lvb_type: 0 Apr 26 12:28:33 fir-md1-s2 kernel: LustreError: 100439:0:(ldlm_request.c:129:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1556306823, 90s ago); not entering recovery in server code, just going back to sleep ns: mdt-fir-MDT0003_UUID lock: ffff985ab92c5e80/0x4f3cef65e530fd6e lrc: 3/1,0 mode: --/PR res: [0x28001a57e:0x4bf8:0x0].0x0 bits 0x13/0x0 rrc: 11 type: IBT flags: 0x40210000000000 nid: local remote: 0x0 expref: -99 pid: 100439 timeout: 0 lvb_type: 0 Apr 26 12:28:33 fir-md1-s2 kernel: LustreError: 100439:0:(ldlm_request.c:129:ldlm_expired_completion_wait()) Skipped 1 previous similar message Apr 26 12:28:41 fir-md1-s2 kernel: LustreError: 99143:0:(ldlm_lockd.c:256:expired_lock_main()) ### lock callback timer expired after 29s: evicting client at 10.0.10.3@o2ib7 ns: mdt-fir-MDT0003_UUID lock: ffff98649bb3b3c0/0x4f3cef65e2b232a1 lrc: 3/0,0 mode: PR/PR res: [0x28001b6b8:0x4a:0x0].0x0 bits 0x12/0x0 rrc: 5 type: IBT flags: 0x60200400000020 nid: 10.0.10.3@o2ib7 remote: 0xbbb5b46b297976fb expref: 1335151 pid: 100466 timeout: 306315 lvb_type: 0 Apr 26 12:28:41 fir-md1-s2 kernel: LustreError: 
99143:0:(ldlm_lockd.c:256:expired_lock_main()) Skipped 12 previous similar messages Apr 26 12:28:41 fir-md1-s2 kernel: LustreError: 100198:0:(ldlm_request.c:129:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1556306831, 90s ago); not entering recovery in server code, just going back to sleep ns: mdt-fir-MDT0003_UUID lock: ffff986aa371d7c0/0x4f3cef65e537204d lrc: 3/0,1 mode: --/EX res: [0x28001b608:0x32:0x0].0x0 bits 0x8/0x0 rrc: 7 type: IBT flags: 0x40210000000000 nid: local remote: 0x0 expref: -99 pid: 100198 timeout: 0 lvb_type: 0 Apr 26 12:28:41 fir-md1-s2 kernel: LustreError: 100198:0:(ldlm_request.c:129:ldlm_expired_completion_wait()) Skipped 1 previous similar message Apr 26 12:28:51 fir-md1-s2 kernel: LustreError: 100466:0:(ldlm_request.c:129:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1556306841, 90s ago); not entering recovery in server code, just going back to sleep ns: mdt-fir-MDT0003_UUID lock: ffff985112fe2880/0x4f3cef65e53dbc24 lrc: 3/0,1 mode: --/CW res: [0x28001b54b:0x30e6:0x0].0x0 bits 0x2/0x0 rrc: 5 type: IBT flags: 0x40210000000000 nid: local remote: 0x0 expref: -99 pid: 100466 timeout: 0 lvb_type: 0 Apr 26 12:29:07 fir-md1-s2 kernel: LustreError: 100259:0:(ldlm_request.c:129:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1556306857, 90s ago); not entering recovery in server code, just going back to sleep ns: mdt-fir-MDT0003_UUID lock: ffff9872e7f08fc0/0x4f3cef65e5495590 lrc: 3/0,1 mode: --/PW res: [0x28001a6a3:0x1c:0x0].0x0 bits 0x40/0x0 rrc: 8 type: IBT flags: 0x40210000000000 nid: local remote: 0x0 expref: -99 pid: 100259 timeout: 0 lvb_type: 0 Apr 26 12:29:07 fir-md1-s2 kernel: LustreError: 100259:0:(ldlm_request.c:129:ldlm_expired_completion_wait()) Skipped 1 previous similar message Apr 26 12:29:20 fir-md1-s2 kernel: Lustre: fir-MDT0003: Connection restored to (at 10.9.108.72@o2ib4) Apr 26 12:29:20 fir-md1-s2 kernel: Lustre: Skipped 43 previous similar messages Apr 26 12:29:24 fir-md1-s2 
kernel: LustreError: 100507:0:(client.c:1175:ptlrpc_import_delay_req()) @@@ IMP_CLOSED req@ffff98882f7fa100 x1631885370066432/t0(0) o104->fir-MDT0003@10.0.10.3@o2ib7:15/16 lens 296/224 e 0 to 0 dl 0 ref 1 fl Rpc:/0/ffffffff rc 0/-1 Apr 26 12:29:24 fir-md1-s2 kernel: LustreError: 100507:0:(client.c:1175:ptlrpc_import_delay_req()) Skipped 13 previous similar messages Apr 26 12:29:31 fir-md1-s2 kernel: Lustre: 99921:0:(service.c:1372:ptlrpc_at_send_early_reply()) @@@ Couldn't add any time (5/-9), not sending early reply req@ffff987a75d6dd00 x1631526800400224/t0(0) o101->0b49eccd-cda4-7bac-8560-4f28415786a3@10.9.0.62@o2ib4:6/0 lens 576/3264 e 0 to 0 dl 1556306976 ref 2 fl Interpret:/0/0 rc 0/0 Apr 26 12:29:31 fir-md1-s2 kernel: Lustre: 99921:0:(service.c:1372:ptlrpc_at_send_early_reply()) Skipped 10 previous similar messages Apr 26 12:29:32 fir-md1-s2 kernel: Lustre: fir-MDT0003: Client 90d81c86-5db8-d29f-71be-9c3030e109bc (at 10.9.102.49@o2ib4) reconnecting Apr 26 12:29:32 fir-md1-s2 kernel: Lustre: Skipped 49 previous similar messages Apr 26 12:29:42 fir-md1-s2 kernel: LustreError: 99386:0:(ldlm_request.c:129:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1556306892, 90s ago); not entering recovery in server code, just going back to sleep ns: mdt-fir-MDT0003_UUID lock: ffff9869862bbcc0/0x4f3cef65e562724c lrc: 3/0,1 mode: --/EX res: [0x28001b6b8:0x4a:0x0].0x0 bits 0x3/0x0 rrc: 5 type: IBT flags: 0x40210000000000 nid: local remote: 0x0 expref: -99 pid: 99386 timeout: 0 lvb_type: 0 Apr 26 12:29:42 fir-md1-s2 kernel: LustreError: 99386:0:(ldlm_request.c:129:ldlm_expired_completion_wait()) Skipped 9 previous similar messages Apr 26 12:29:43 fir-md1-s2 kernel: LNet: Service thread pid 100425 was inactive for 200.05s. The thread might be hung, or it might only be slow and will resume later. 
Dumping the stack trace for debugging purposes: Apr 26 12:29:43 fir-md1-s2 kernel: Pid: 100425, comm: mdt03_069 3.10.0-957.1.3.el7_lustre.x86_64 #1 SMP Fri Dec 7 14:50:35 PST 2018 Apr 26 12:29:43 fir-md1-s2 kernel: Call Trace: Apr 26 12:29:43 fir-md1-s2 kernel: [] ldlm_completion_ast+0x4e5/0x890 [ptlrpc] Apr 26 12:29:43 fir-md1-s2 kernel: [] ldlm_cli_enqueue_local+0x23c/0x870 [ptlrpc] Apr 26 12:29:43 fir-md1-s2 kernel: [] mdt_object_local_lock+0x50b/0xb20 [mdt] Apr 26 12:29:43 fir-md1-s2 kernel: [] mdt_object_lock_internal+0x70/0x3e0 [mdt] Apr 26 12:29:43 fir-md1-s2 kernel: [] mdt_object_lock+0x20/0x30 [mdt] Apr 26 12:29:43 fir-md1-s2 kernel: [] mdt_brw_enqueue+0x44b/0x760 [mdt] Apr 26 12:29:43 fir-md1-s2 kernel: [] mdt_intent_brw+0x1f/0x30 [mdt] Apr 26 12:29:43 fir-md1-s2 kernel: [] mdt_intent_policy+0x2e8/0xd00 [mdt] Apr 26 12:29:43 fir-md1-s2 kernel: [] ldlm_lock_enqueue+0x366/0xa60 [ptlrpc] Apr 26 12:29:43 fir-md1-s2 kernel: [] ldlm_handle_enqueue0+0xa47/0x15a0 [ptlrpc] Apr 26 12:29:43 fir-md1-s2 kernel: [] tgt_enqueue+0x62/0x210 [ptlrpc] Apr 26 12:29:43 fir-md1-s2 kernel: [] tgt_request_handle+0xaea/0x1580 [ptlrpc] Apr 26 12:29:43 fir-md1-s2 kernel: [] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc] Apr 26 12:29:43 fir-md1-s2 kernel: [] ptlrpc_main+0xafc/0x1fc0 [ptlrpc] Apr 26 12:29:43 fir-md1-s2 kernel: [] kthread+0xd1/0xe0 Apr 26 12:29:43 fir-md1-s2 kernel: [] ret_from_fork_nospec_begin+0xe/0x21 Apr 26 12:29:43 fir-md1-s2 kernel: [] 0xffffffffffffffff Apr 26 12:29:43 fir-md1-s2 kernel: LustreError: dumping log to /tmp/lustre-log.1556306983.100425 Apr 26 12:29:53 fir-md1-s2 kernel: Lustre: fir-MDT0001: haven't heard from client bc889374-b0ed-2371-0c2c-d84fc0dd852e (at 10.0.10.3@o2ib7) in 227 seconds. I think it's dead, and I am evicting it. 
exp ffff98798a6a3800, cur 1556306993 expire 1556306843 last 1556306766 Apr 26 12:29:53 fir-md1-s2 kernel: Lustre: Skipped 1 previous similar message Apr 26 12:29:53 fir-md1-s2 kernel: LustreError: 99143:0:(ldlm_lockd.c:256:expired_lock_main()) ### lock callback timer expired after 29s: evicting client at 10.0.10.3@o2ib7 ns: mdt-fir-MDT0003_UUID lock: ffff98545ee0f980/0x4f3cef65e2d63195 lrc: 3/0,0 mode: PR/PR res: [0x28001a6d5:0x3de:0x0].0x0 bits 0x1b/0x0 rrc: 7 type: IBT flags: 0x60200400000020 nid: 10.0.10.3@o2ib7 remote: 0xbbb5b46b2981fe29 expref: 1144786 pid: 100415 timeout: 306387 lvb_type: 0 Apr 26 12:29:53 fir-md1-s2 kernel: LustreError: 99143:0:(ldlm_lockd.c:256:expired_lock_main()) Skipped 12 previous similar messages Apr 26 12:30:13 fir-md1-s2 kernel: LNet: Service thread pid 100166 was inactive for 200.28s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes: Apr 26 12:30:13 fir-md1-s2 kernel: Pid: 100166, comm: mdt03_026 3.10.0-957.1.3.el7_lustre.x86_64 #1 SMP Fri Dec 7 14:50:35 PST 2018 Apr 26 12:30:13 fir-md1-s2 kernel: Call Trace: Apr 26 12:30:13 fir-md1-s2 kernel: [] ldlm_completion_ast+0x4e5/0x890 [ptlrpc] Apr 26 12:30:13 fir-md1-s2 kernel: [] ldlm_cli_enqueue_local+0x23c/0x870 [ptlrpc] Apr 26 12:30:13 fir-md1-s2 kernel: [] mdt_object_local_lock+0x50b/0xb20 [mdt] Apr 26 12:30:13 fir-md1-s2 kernel: [] mdt_object_lock_internal+0x70/0x3e0 [mdt] Apr 26 12:30:13 fir-md1-s2 kernel: [] mdt_reint_object_lock+0x2c/0x60 [mdt] Apr 26 12:30:13 fir-md1-s2 kernel: [] mdt_reint_striped_lock+0x8c/0x510 [mdt] Apr 26 12:30:13 fir-md1-s2 kernel: [] mdt_reint_setattr+0x6c8/0x1340 [mdt] Apr 26 12:30:13 fir-md1-s2 kernel: [] mdt_reint_rec+0x83/0x210 [mdt] Apr 26 12:30:13 fir-md1-s2 kernel: [] mdt_reint_internal+0x6e3/0xaf0 [mdt] Apr 26 12:30:13 fir-md1-s2 kernel: [] mdt_reint+0x67/0x140 [mdt] Apr 26 12:30:13 fir-md1-s2 kernel: [] tgt_request_handle+0xaea/0x1580 [ptlrpc] Apr 26 12:30:13 fir-md1-s2 
kernel: [] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc] Apr 26 12:30:13 fir-md1-s2 kernel: [] ptlrpc_main+0xafc/0x1fc0 [ptlrpc] Apr 26 12:30:13 fir-md1-s2 kernel: [] kthread+0xd1/0xe0 Apr 26 12:30:13 fir-md1-s2 kernel: [] ret_from_fork_nospec_begin+0xe/0x21 Apr 26 12:30:13 fir-md1-s2 kernel: [] 0xffffffffffffffff Apr 26 12:30:13 fir-md1-s2 kernel: LustreError: dumping log to /tmp/lustre-log.1556307013.100166 Apr 26 12:30:15 fir-md1-s2 kernel: LNet: Service thread pid 100186 was inactive for 200.30s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes: Apr 26 12:30:15 fir-md1-s2 kernel: Pid: 100186, comm: mdt01_060 3.10.0-957.1.3.el7_lustre.x86_64 #1 SMP Fri Dec 7 14:50:35 PST 2018 Apr 26 12:30:15 fir-md1-s2 kernel: Call Trace: Apr 26 12:30:15 fir-md1-s2 kernel: [] ldlm_completion_ast+0x4e5/0x890 [ptlrpc] Apr 26 12:30:15 fir-md1-s2 kernel: [] ldlm_cli_enqueue_local+0x23c/0x870 [ptlrpc] Apr 26 12:30:15 fir-md1-s2 kernel: [] mdt_object_local_lock+0x50b/0xb20 [mdt] Apr 26 12:30:15 fir-md1-s2 kernel: [] mdt_object_lock_internal+0x70/0x3e0 [mdt] Apr 26 12:30:15 fir-md1-s2 kernel: [] mdt_reint_object_lock+0x2c/0x60 [mdt] Apr 26 12:30:15 fir-md1-s2 kernel: [] mdt_reint_striped_lock+0x8c/0x510 [mdt] Apr 26 12:30:15 fir-md1-s2 kernel: [] mdt_reint_unlink+0x704/0x1430 [mdt] Apr 26 12:30:15 fir-md1-s2 kernel: [] mdt_reint_rec+0x83/0x210 [mdt] Apr 26 12:30:15 fir-md1-s2 kernel: [] mdt_reint_internal+0x6e3/0xaf0 [mdt] Apr 26 12:30:15 fir-md1-s2 kernel: [] mdt_reint+0x67/0x140 [mdt] Apr 26 12:30:15 fir-md1-s2 kernel: [] tgt_request_handle+0xaea/0x1580 [ptlrpc] Apr 26 12:30:15 fir-md1-s2 kernel: [] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc] Apr 26 12:30:15 fir-md1-s2 kernel: [] ptlrpc_main+0xafc/0x1fc0 [ptlrpc] Apr 26 12:30:15 fir-md1-s2 kernel: [] kthread+0xd1/0xe0 Apr 26 12:30:15 fir-md1-s2 kernel: [] ret_from_fork_nospec_begin+0xe/0x21 Apr 26 12:30:15 fir-md1-s2 kernel: [] 0xffffffffffffffff Apr 26 
12:30:15 fir-md1-s2 kernel: LustreError: dumping log to /tmp/lustre-log.1556307015.100186 Apr 26 12:30:20 fir-md1-s2 kernel: LNet: Service thread pid 100020 was inactive for 200.35s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes: Apr 26 12:30:20 fir-md1-s2 kernel: Pid: 100020, comm: mdt02_027 3.10.0-957.1.3.el7_lustre.x86_64 #1 SMP Fri Dec 7 14:50:35 PST 2018 Apr 26 12:30:20 fir-md1-s2 kernel: Call Trace: Apr 26 12:30:20 fir-md1-s2 kernel: [] ldlm_completion_ast+0x4e5/0x890 [ptlrpc] Apr 26 12:30:20 fir-md1-s2 kernel: [] ldlm_cli_enqueue_local+0x23c/0x870 [ptlrpc] Apr 26 12:30:20 fir-md1-s2 kernel: [] mdt_object_local_lock+0x50b/0xb20 [mdt] Apr 26 12:30:20 fir-md1-s2 kernel: [] mdt_object_lock_internal+0x70/0x3e0 [mdt] Apr 26 12:30:20 fir-md1-s2 kernel: [] mdt_object_lock+0x20/0x30 [mdt] Apr 26 12:30:20 fir-md1-s2 kernel: [] mdt_brw_enqueue+0x44b/0x760 [mdt] Apr 26 12:30:20 fir-md1-s2 kernel: [] mdt_intent_brw+0x1f/0x30 [mdt] Apr 26 12:30:20 fir-md1-s2 kernel: [] mdt_intent_policy+0x2e8/0xd00 [mdt] Apr 26 12:30:20 fir-md1-s2 kernel: [] ldlm_lock_enqueue+0x366/0xa60 [ptlrpc] Apr 26 12:30:20 fir-md1-s2 kernel: [] ldlm_handle_enqueue0+0xa47/0x15a0 [ptlrpc] Apr 26 12:30:20 fir-md1-s2 kernel: [] tgt_enqueue+0x62/0x210 [ptlrpc] Apr 26 12:30:20 fir-md1-s2 kernel: [] tgt_request_handle+0xaea/0x1580 [ptlrpc] Apr 26 12:30:20 fir-md1-s2 kernel: [] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc] Apr 26 12:30:20 fir-md1-s2 kernel: [] ptlrpc_main+0xafc/0x1fc0 [ptlrpc] Apr 26 12:30:20 fir-md1-s2 kernel: [] kthread+0xd1/0xe0 Apr 26 12:30:20 fir-md1-s2 kernel: [] ret_from_fork_nospec_begin+0xe/0x21 Apr 26 12:30:20 fir-md1-s2 kernel: [] 0xffffffffffffffff Apr 26 12:30:20 fir-md1-s2 kernel: LustreError: dumping log to /tmp/lustre-log.1556307020.100020 Apr 26 12:30:22 fir-md1-s2 kernel: Pid: 99402, comm: mdt02_007 3.10.0-957.1.3.el7_lustre.x86_64 #1 SMP Fri Dec 7 14:50:35 PST 2018 Apr 26 12:30:22 fir-md1-s2 
kernel: Call Trace: Apr 26 12:30:22 fir-md1-s2 kernel: [] ldlm_completion_ast+0x4e5/0x890 [ptlrpc] Apr 26 12:30:22 fir-md1-s2 kernel: [] ldlm_cli_enqueue_local+0x23c/0x870 [ptlrpc] Apr 26 12:30:22 fir-md1-s2 kernel: [] mdt_object_local_lock+0x50b/0xb20 [mdt] Apr 26 12:30:22 fir-md1-s2 kernel: [] mdt_object_lock_internal+0x70/0x3e0 [mdt] Apr 26 12:30:22 fir-md1-s2 kernel: [] mdt_getattr_name_lock+0x90a/0x1c30 [mdt] Apr 26 12:30:22 fir-md1-s2 kernel: [] mdt_intent_getattr+0x2b5/0x480 [mdt] Apr 26 12:30:22 fir-md1-s2 kernel: [] mdt_intent_policy+0x2e8/0xd00 [mdt] Apr 26 12:30:22 fir-md1-s2 kernel: [] ldlm_lock_enqueue+0x366/0xa60 [ptlrpc] Apr 26 12:30:22 fir-md1-s2 kernel: [] ldlm_handle_enqueue0+0xa47/0x15a0 [ptlrpc] Apr 26 12:30:22 fir-md1-s2 kernel: [] tgt_enqueue+0x62/0x210 [ptlrpc] Apr 26 12:30:22 fir-md1-s2 kernel: [] tgt_request_handle+0xaea/0x1580 [ptlrpc] Apr 26 12:30:22 fir-md1-s2 kernel: [] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc] Apr 26 12:30:22 fir-md1-s2 kernel: [] ptlrpc_main+0xafc/0x1fc0 [ptlrpc] Apr 26 12:30:22 fir-md1-s2 kernel: [] kthread+0xd1/0xe0 Apr 26 12:30:22 fir-md1-s2 kernel: [] ret_from_fork_nospec_begin+0xe/0x21 Apr 26 12:30:22 fir-md1-s2 kernel: [] 0xffffffffffffffff Apr 26 12:30:22 fir-md1-s2 kernel: LustreError: dumping log to /tmp/lustre-log.1556307022.99402 Apr 26 12:30:23 fir-md1-s2 kernel: LNet: Service thread pid 100439 was inactive for 200.49s. Watchdog stack traces are limited to 3 per 300 seconds, skipping this one. Apr 26 12:30:23 fir-md1-s2 kernel: LustreError: dumping log to /tmp/lustre-log.1556307023.100439 Apr 26 12:30:25 fir-md1-s2 kernel: LNet: Service thread pid 100228 was inactive for 200.23s. Watchdog stack traces are limited to 3 per 300 seconds, skipping this one. Apr 26 12:30:25 fir-md1-s2 kernel: LustreError: dumping log to /tmp/lustre-log.1556307025.100228 Apr 26 12:30:27 fir-md1-s2 kernel: LNet: Service thread pid 100425 completed after 244.42s. 
This indicates the system was overloaded (too many service threads, or there were not enough hardware resources). Apr 26 12:30:32 fir-md1-s2 kernel: LNet: Service thread pid 100198 was inactive for 200.50s. Watchdog stack traces are limited to 3 per 300 seconds, skipping this one. Apr 26 12:30:32 fir-md1-s2 kernel: LustreError: dumping log to /tmp/lustre-log.1556307032.100198 Apr 26 12:30:41 fir-md1-s2 kernel: LNet: Service thread pid 100466 was inactive for 200.20s. Watchdog stack traces are limited to 3 per 300 seconds, skipping this one. Apr 26 12:30:41 fir-md1-s2 kernel: LustreError: dumping log to /tmp/lustre-log.1556307041.100466 Apr 26 12:30:44 fir-md1-s2 kernel: LustreError: dumping log to /tmp/lustre-log.1556307044.99920 Apr 26 12:30:54 fir-md1-s2 kernel: LustreError: 100507:0:(ldlm_request.c:129:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1556306964, 90s ago); not entering recovery in server code, just going back to sleep ns: mdt-fir-MDT0003_UUID lock: ffff988992ab4a40/0x4f3cef65e582ebf2 lrc: 3/0,1 mode: --/EX res: [0x28001a6d5:0x3de:0x0].0x0 bits 0x8/0x0 rrc: 7 type: IBT flags: 0x40210000000000 nid: local remote: 0x0 expref: -99 pid: 100507 timeout: 0 lvb_type: 0 Apr 26 12:30:54 fir-md1-s2 kernel: LustreError: 100507:0:(ldlm_request.c:129:ldlm_expired_completion_wait()) Skipped 9 previous similar messages Apr 26 12:30:57 fir-md1-s2 kernel: LNet: Service thread pid 100259 was inactive for 200.54s. Watchdog stack traces are limited to 3 per 300 seconds, skipping this one. 
Apr 26 12:30:57 fir-md1-s2 kernel: LNet: Skipped 1 previous similar message Apr 26 12:30:57 fir-md1-s2 kernel: LustreError: dumping log to /tmp/lustre-log.1556307057.100259 Apr 26 12:31:00 fir-md1-s2 kernel: LustreError: dumping log to /tmp/lustre-log.1556307060.99895 Apr 26 12:31:02 fir-md1-s2 kernel: LustreError: dumping log to /tmp/lustre-log.1556307062.99974 Apr 26 12:31:04 fir-md1-s2 kernel: LustreError: dumping log to /tmp/lustre-log.1556307064.100200 Apr 26 12:31:07 fir-md1-s2 kernel: LNet: Service thread pid 100129 was inactive for 200.02s. Watchdog stack traces are limited to 3 per 300 seconds, skipping this one. Apr 26 12:31:07 fir-md1-s2 kernel: LNet: Skipped 4 previous similar messages Apr 26 12:31:07 fir-md1-s2 kernel: LustreError: dumping log to /tmp/lustre-log.1556307067.100129 Apr 26 12:31:16 fir-md1-s2 kernel: LustreError: dumping log to /tmp/lustre-log.1556307076.99165 Apr 26 12:31:22 fir-md1-s2 kernel: LustreError: dumping log to /tmp/lustre-log.1556307082.99383 Apr 26 12:31:23 fir-md1-s2 kernel: LNet: Service thread pid 100308 was inactive for 200.69s. Watchdog stack traces are limited to 3 per 300 seconds, skipping this one. 
Apr 26 12:31:23 fir-md1-s2 kernel: LNet: Skipped 2 previous similar messages Apr 26 12:31:23 fir-md1-s2 kernel: LustreError: dumping log to /tmp/lustre-log.1556307083.100308 Apr 26 12:31:24 fir-md1-s2 kernel: LustreError: dumping log to /tmp/lustre-log.1556307084.100238 Apr 26 12:31:28 fir-md1-s2 kernel: Lustre: fir-MDT0003: Connection restored to 8b33fbf1-f2ea-97c7-949f-7519ee33fba7 (at 10.8.2.26@o2ib6) Apr 26 12:31:28 fir-md1-s2 kernel: Lustre: Skipped 123 previous similar messages Apr 26 12:31:32 fir-md1-s2 kernel: LustreError: dumping log to /tmp/lustre-log.1556307092.99386 Apr 26 12:31:35 fir-md1-s2 kernel: LustreError: 100425:0:(client.c:1175:ptlrpc_import_delay_req()) @@@ IMP_CLOSED req@ffff9889dbedda00 x1631885371083120/t0(0) o104->fir-MDT0003@10.0.10.3@o2ib7:15/16 lens 296/224 e 0 to 0 dl 0 ref 1 fl Rpc:/0/ffffffff rc 0/-1 Apr 26 12:31:35 fir-md1-s2 kernel: LustreError: 100425:0:(client.c:1175:ptlrpc_import_delay_req()) Skipped 30 previous similar messages Apr 26 12:31:36 fir-md1-s2 kernel: LNet: Service thread pid 100259 completed after 239.63s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources). Apr 26 12:31:37 fir-md1-s2 kernel: LustreError: dumping log to /tmp/lustre-log.1556307097.100175 Apr 26 12:31:41 fir-md1-s2 kernel: Lustre: fir-MDT0003: Client 0b49eccd-cda4-7bac-8560-4f28415786a3 (at 10.9.0.62@o2ib4) reconnecting Apr 26 12:31:41 fir-md1-s2 kernel: Lustre: Skipped 132 previous similar messages Apr 26 12:31:44 fir-md1-s2 kernel: LustreError: dumping log to /tmp/lustre-log.1556307104.99522 Apr 26 12:31:45 fir-md1-s2 kernel: LNet: Service thread pid 100166 completed after 291.91s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources). 
Apr 26 12:31:49 fir-md1-s2 kernel: Lustre: 99967:0:(service.c:1372:ptlrpc_at_send_early_reply()) @@@ Couldn't add any time (5/-5), not sending early reply req@ffff9889dbedd400 x1631588137168768/t270596893122(0) o36->90d81c86-5db8-d29f-71be-9c3030e109bc@10.9.102.49@o2ib4:24/0 lens 488/3152 e 0 to 0 dl 1556307114 ref 2 fl Interpret:/0/0 rc 0/0 Apr 26 12:31:49 fir-md1-s2 kernel: Lustre: 99967:0:(service.c:1372:ptlrpc_at_send_early_reply()) Skipped 19 previous similar messages Apr 26 12:31:56 fir-md1-s2 kernel: LNet: Service thread pid 100238 completed after 232.84s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources). Apr 26 12:32:04 fir-md1-s2 kernel: LustreError: 99143:0:(ldlm_lockd.c:256:expired_lock_main()) ### lock callback timer expired after 29s: evicting client at 10.0.10.3@o2ib7 ns: mdt-fir-MDT0003_UUID lock: ffff985a03bedc40/0x4f3cef65e05c2e7f lrc: 3/0,0 mode: PR/PR res: [0x28001b779:0x1e:0x0].0x0 bits 0x40/0x0 rrc: 7 type: IBT flags: 0x60000400000020 nid: 10.0.10.3@o2ib7 remote: 0xbbb5b46b28f225fa expref: 879584 pid: 100172 timeout: 306518 lvb_type: 0 Apr 26 12:32:04 fir-md1-s2 kernel: LustreError: 99143:0:(ldlm_lockd.c:256:expired_lock_main()) Skipped 27 previous similar messages Apr 26 12:32:06 fir-md1-s2 kernel: LNet: Service thread pid 100086 was inactive for 200.74s. Watchdog stack traces are limited to 3 per 300 seconds, skipping this one. 
Apr 26 12:32:06 fir-md1-s2 kernel: LNet: Skipped 4 previous similar messages Apr 26 12:32:06 fir-md1-s2 kernel: LustreError: dumping log to /tmp/lustre-log.1556307126.100086 Apr 26 12:32:12 fir-md1-s2 kernel: LustreError: dumping log to /tmp/lustre-log.1556307132.99167 Apr 26 12:32:13 fir-md1-s2 kernel: LustreError: dumping log to /tmp/lustre-log.1556307133.100081 Apr 26 12:32:22 fir-md1-s2 kernel: LustreError: dumping log to /tmp/lustre-log.1556307142.100320 Apr 26 12:32:31 fir-md1-s2 kernel: LustreError: dumping log to /tmp/lustre-log.1556307151.99547 Apr 26 12:32:39 fir-md1-s2 kernel: LNet: Service thread pid 100466 completed after 318.09s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources). Apr 26 12:32:44 fir-md1-s2 kernel: LustreError: dumping log to /tmp/lustre-log.1556307164.100507 Apr 26 12:32:47 fir-md1-s2 kernel: LustreError: dumping log to /tmp/lustre-log.1556307167.100214 Apr 26 12:32:50 fir-md1-s2 kernel: LustreError: dumping log to /tmp/lustre-log.1556307170.100172 Apr 26 12:32:53 fir-md1-s2 kernel: LustreError: dumping log to /tmp/lustre-log.1556307173.99199 Apr 26 12:33:03 fir-md1-s2 kernel: LustreError: 100089:0:(ldlm_request.c:129:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1556307093, 90s ago); not entering recovery in server code, just going back to sleep ns: mdt-fir-MDT0003_UUID lock: ffff986a42609440/0x4f3cef65e5b5d049 lrc: 3/1,0 mode: --/PR res: [0x28001a57e:0x4bee:0x0].0x0 bits 0x13/0x0 rrc: 7 type: IBT flags: 0x40210400000020 nid: local remote: 0x0 expref: -99 pid: 100089 timeout: 0 lvb_type: 0 Apr 26 12:33:03 fir-md1-s2 kernel: LustreError: 100089:0:(ldlm_request.c:129:ldlm_expired_completion_wait()) Skipped 18 previous similar messages Apr 26 12:33:10 fir-md1-s2 kernel: LNet: Service thread pid 100086 completed after 264.47s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources). 
Apr 26 12:33:10 fir-md1-s2 kernel: LNet: Skipped 1 previous similar message Apr 26 12:33:23 fir-md1-s2 kernel: LNet: Service thread pid 100261 was inactive for 200.59s. Watchdog stack traces are limited to 3 per 300 seconds, skipping this one. Apr 26 12:33:23 fir-md1-s2 kernel: LNet: Skipped 9 previous similar messages Apr 26 12:33:23 fir-md1-s2 kernel: LustreError: dumping log to /tmp/lustre-log.1556307203.100261 Apr 26 12:33:37 fir-md1-s2 kernel: LNet: Service thread pid 100198 completed after 385.53s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources). Apr 26 12:33:37 fir-md1-s2 kernel: LNet: Skipped 1 previous similar message Apr 26 12:33:54 fir-md1-s2 kernel: LustreError: dumping log to /tmp/lustre-log.1556307234.99549 Apr 26 12:34:08 fir-md1-s2 kernel: LustreError: dumping log to /tmp/lustre-log.1556307248.99892 Apr 26 12:34:10 fir-md1-s2 kernel: LNet: Service thread pid 100228 completed after 425.49s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources). Apr 26 12:34:10 fir-md1-s2 kernel: LNet: Skipped 1 previous similar message Apr 26 12:34:22 fir-md1-s2 kernel: LustreError: dumping log to /tmp/lustre-log.1556307262.99509 Apr 26 12:34:31 fir-md1-s2 kernel: LustreError: dumping log to /tmp/lustre-log.1556307271.100141 Apr 26 12:34:45 fir-md1-s2 kernel: LNet: Service thread pid 100353 was inactive for 200.21s. The thread might be hung, or it might only be slow and will resume later. 
Dumping the stack trace for debugging purposes: Apr 26 12:34:45 fir-md1-s2 kernel: LNet: Skipped 1 previous similar message Apr 26 12:34:45 fir-md1-s2 kernel: Pid: 100353, comm: mdt03_059 3.10.0-957.1.3.el7_lustre.x86_64 #1 SMP Fri Dec 7 14:50:35 PST 2018 Apr 26 12:34:45 fir-md1-s2 kernel: Call Trace: Apr 26 12:34:45 fir-md1-s2 kernel: [] ldlm_completion_ast+0x4e5/0x890 [ptlrpc] Apr 26 12:34:45 fir-md1-s2 kernel: [] ldlm_cli_enqueue_local+0x23c/0x870 [ptlrpc] Apr 26 12:34:45 fir-md1-s2 kernel: [] mdt_object_local_lock+0x50b/0xb20 [mdt] Apr 26 12:34:45 fir-md1-s2 kernel: [] mdt_object_lock_internal+0x70/0x3e0 [mdt] Apr 26 12:34:45 fir-md1-s2 kernel: [] mdt_reint_object_lock+0x2c/0x60 [mdt] Apr 26 12:34:45 fir-md1-s2 kernel: [] mdt_reint_striped_lock+0x8c/0x510 [mdt] Apr 26 12:34:45 fir-md1-s2 kernel: [] mdt_reint_setattr+0x6c8/0x1340 [mdt] Apr 26 12:34:45 fir-md1-s2 kernel: [] mdt_reint_rec+0x83/0x210 [mdt] Apr 26 12:34:45 fir-md1-s2 kernel: [] mdt_reint_internal+0x6e3/0xaf0 [mdt] Apr 26 12:34:45 fir-md1-s2 kernel: [] mdt_reint+0x67/0x140 [mdt] Apr 26 12:34:45 fir-md1-s2 kernel: [] tgt_request_handle+0xaea/0x1580 [ptlrpc] Apr 26 12:34:45 fir-md1-s2 kernel: [] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc] Apr 26 12:34:45 fir-md1-s2 kernel: [] ptlrpc_main+0xafc/0x1fc0 [ptlrpc] Apr 26 12:34:45 fir-md1-s2 kernel: [] kthread+0xd1/0xe0 Apr 26 12:34:45 fir-md1-s2 kernel: [] ret_from_fork_nospec_begin+0xe/0x21 Apr 26 12:34:45 fir-md1-s2 kernel: [] 0xffffffffffffffff Apr 26 12:34:45 fir-md1-s2 kernel: LustreError: dumping log to /tmp/lustre-log.1556307285.100353 Apr 26 12:34:46 fir-md1-s2 kernel: Pid: 99395, comm: mdt01_016 3.10.0-957.1.3.el7_lustre.x86_64 #1 SMP Fri Dec 7 14:50:35 PST 2018 Apr 26 12:34:46 fir-md1-s2 kernel: Call Trace: Apr 26 12:34:46 fir-md1-s2 kernel: [] ldlm_completion_ast+0x4e5/0x890 [ptlrpc] Apr 26 12:34:46 fir-md1-s2 kernel: [] ldlm_cli_enqueue_local+0x23c/0x870 [ptlrpc] Apr 26 12:34:46 fir-md1-s2 kernel: [] mdt_object_local_lock+0x50b/0xb20 [mdt] 
Apr 26 12:34:46 fir-md1-s2 kernel: [] mdt_object_lock_internal+0x70/0x3e0 [mdt] Apr 26 12:34:46 fir-md1-s2 kernel: [] mdt_object_lock+0x20/0x30 [mdt] Apr 26 12:34:46 fir-md1-s2 kernel: [] mdt_brw_enqueue+0x44b/0x760 [mdt] Apr 26 12:34:46 fir-md1-s2 kernel: [] mdt_intent_brw+0x1f/0x30 [mdt] Apr 26 12:34:46 fir-md1-s2 kernel: [] mdt_intent_policy+0x2e8/0xd00 [mdt] Apr 26 12:34:46 fir-md1-s2 kernel: [] ldlm_lock_enqueue+0x366/0xa60 [ptlrpc] Apr 26 12:34:46 fir-md1-s2 kernel: [] ldlm_handle_enqueue0+0xa47/0x15a0 [ptlrpc] Apr 26 12:34:46 fir-md1-s2 kernel: [] tgt_enqueue+0x62/0x210 [ptlrpc] Apr 26 12:34:46 fir-md1-s2 kernel: [] tgt_request_handle+0xaea/0x1580 [ptlrpc] Apr 26 12:34:46 fir-md1-s2 kernel: [] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc] Apr 26 12:34:46 fir-md1-s2 kernel: [] ptlrpc_main+0xafc/0x1fc0 [ptlrpc] Apr 26 12:34:46 fir-md1-s2 kernel: [] kthread+0xd1/0xe0 Apr 26 12:34:46 fir-md1-s2 kernel: [] ret_from_fork_nospec_begin+0xe/0x21 Apr 26 12:34:46 fir-md1-s2 kernel: [] 0xffffffffffffffff Apr 26 12:34:51 fir-md1-s2 kernel: Pid: 100212, comm: mdt03_034 3.10.0-957.1.3.el7_lustre.x86_64 #1 SMP Fri Dec 7 14:50:35 PST 2018 Apr 26 12:34:51 fir-md1-s2 kernel: Call Trace: Apr 26 12:34:51 fir-md1-s2 kernel: [] ldlm_completion_ast+0x4e5/0x890 [ptlrpc] Apr 26 12:34:51 fir-md1-s2 kernel: [] ldlm_cli_enqueue_local+0x23c/0x870 [ptlrpc] Apr 26 12:34:51 fir-md1-s2 kernel: [] mdt_object_local_lock+0x50b/0xb20 [mdt] Apr 26 12:34:51 fir-md1-s2 kernel: [] mdt_object_lock_internal+0x70/0x3e0 [mdt] Apr 26 12:34:51 fir-md1-s2 kernel: [] mdt_reint_object_lock+0x2c/0x60 [mdt] Apr 26 12:34:51 fir-md1-s2 kernel: [] mdt_reint_striped_lock+0x8c/0x510 [mdt] Apr 26 12:34:51 fir-md1-s2 kernel: [] mdt_reint_unlink+0x704/0x1430 [mdt] Apr 26 12:34:51 fir-md1-s2 kernel: [] mdt_reint_rec+0x83/0x210 [mdt] Apr 26 12:34:51 fir-md1-s2 kernel: [] mdt_reint_internal+0x6e3/0xaf0 [mdt] Apr 26 12:34:51 fir-md1-s2 kernel: [] mdt_reint+0x67/0x140 [mdt] Apr 26 12:34:51 fir-md1-s2 kernel: [] 
tgt_request_handle+0xaea/0x1580 [ptlrpc] Apr 26 12:34:51 fir-md1-s2 kernel: [] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc] Apr 26 12:34:51 fir-md1-s2 kernel: [] ptlrpc_main+0xafc/0x1fc0 [ptlrpc] Apr 26 12:34:51 fir-md1-s2 kernel: [] kthread+0xd1/0xe0 Apr 26 12:34:51 fir-md1-s2 kernel: [] ret_from_fork_nospec_begin+0xe/0x21 Apr 26 12:34:51 fir-md1-s2 kernel: [] 0xffffffffffffffff Apr 26 12:34:51 fir-md1-s2 kernel: LustreError: dumping log to /tmp/lustre-log.1556307291.100212 Apr 26 12:34:53 fir-md1-s2 kernel: LNet: Service thread pid 100089 was inactive for 200.02s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes: Apr 26 12:34:53 fir-md1-s2 kernel: LNet: Skipped 2 previous similar messages Apr 26 12:34:53 fir-md1-s2 kernel: Pid: 100089, comm: mdt01_045 3.10.0-957.1.3.el7_lustre.x86_64 #1 SMP Fri Dec 7 14:50:35 PST 2018 Apr 26 12:34:53 fir-md1-s2 kernel: Call Trace: Apr 26 12:34:53 fir-md1-s2 kernel: [] ldlm_completion_ast+0x4e5/0x890 [ptlrpc] Apr 26 12:34:53 fir-md1-s2 kernel: [] ldlm_cli_enqueue_local+0x23c/0x870 [ptlrpc] Apr 26 12:34:53 fir-md1-s2 kernel: [] mdt_object_local_lock+0x50b/0xb20 [mdt] Apr 26 12:34:53 fir-md1-s2 kernel: [] mdt_object_lock_internal+0x70/0x3e0 [mdt] Apr 26 12:34:53 fir-md1-s2 kernel: [] mdt_getattr_name_lock+0x90a/0x1c30 [mdt] Apr 26 12:34:53 fir-md1-s2 kernel: [] mdt_intent_getattr+0x2b5/0x480 [mdt] Apr 26 12:34:53 fir-md1-s2 kernel: [] mdt_intent_policy+0x2e8/0xd00 [mdt] Apr 26 12:34:53 fir-md1-s2 kernel: [] ldlm_lock_enqueue+0x366/0xa60 [ptlrpc] Apr 26 12:34:53 fir-md1-s2 kernel: [] ldlm_handle_enqueue0+0xa47/0x15a0 [ptlrpc] Apr 26 12:34:53 fir-md1-s2 kernel: [] tgt_enqueue+0x62/0x210 [ptlrpc] Apr 26 12:34:53 fir-md1-s2 kernel: [] tgt_request_handle+0xaea/0x1580 [ptlrpc] Apr 26 12:34:53 fir-md1-s2 kernel: [] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc] Apr 26 12:34:53 fir-md1-s2 kernel: [] ptlrpc_main+0xafc/0x1fc0 [ptlrpc] Apr 26 12:34:53 fir-md1-s2 
kernel: [] kthread+0xd1/0xe0 Apr 26 12:34:53 fir-md1-s2 kernel: [] ret_from_fork_nospec_begin+0xe/0x21 Apr 26 12:34:53 fir-md1-s2 kernel: [] 0xffffffffffffffff Apr 26 12:34:53 fir-md1-s2 kernel: LustreError: dumping log to /tmp/lustre-log.1556307293.100089 Apr 26 12:34:55 fir-md1-s2 kernel: Pid: 100425, comm: mdt03_069 3.10.0-957.1.3.el7_lustre.x86_64 #1 SMP Fri Dec 7 14:50:35 PST 2018 Apr 26 12:34:55 fir-md1-s2 kernel: Call Trace: Apr 26 12:34:55 fir-md1-s2 kernel: [] ldlm_completion_ast+0x4e5/0x890 [ptlrpc] Apr 26 12:34:55 fir-md1-s2 kernel: [] ldlm_cli_enqueue_local+0x23c/0x870 [ptlrpc] Apr 26 12:34:55 fir-md1-s2 kernel: [] mdt_object_local_lock+0x50b/0xb20 [mdt] Apr 26 12:34:55 fir-md1-s2 kernel: [] mdt_object_lock_internal+0x70/0x3e0 [mdt] Apr 26 12:34:55 fir-md1-s2 kernel: [] mdt_object_lock+0x20/0x30 [mdt] Apr 26 12:34:55 fir-md1-s2 kernel: [] mdt_brw_enqueue+0x44b/0x760 [mdt] Apr 26 12:34:55 fir-md1-s2 kernel: [] mdt_intent_brw+0x1f/0x30 [mdt] Apr 26 12:34:55 fir-md1-s2 kernel: [] mdt_intent_policy+0x2e8/0xd00 [mdt] Apr 26 12:34:55 fir-md1-s2 kernel: [] ldlm_lock_enqueue+0x366/0xa60 [ptlrpc] Apr 26 12:34:55 fir-md1-s2 kernel: [] ldlm_handle_enqueue0+0xa47/0x15a0 [ptlrpc] Apr 26 12:34:55 fir-md1-s2 kernel: [] tgt_enqueue+0x62/0x210 [ptlrpc] Apr 26 12:34:55 fir-md1-s2 kernel: [] tgt_request_handle+0xaea/0x1580 [ptlrpc] Apr 26 12:34:55 fir-md1-s2 kernel: [] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc] Apr 26 12:34:55 fir-md1-s2 kernel: [] ptlrpc_main+0xafc/0x1fc0 [ptlrpc] Apr 26 12:34:55 fir-md1-s2 kernel: [] kthread+0xd1/0xe0 Apr 26 12:34:55 fir-md1-s2 kernel: [] ret_from_fork_nospec_begin+0xe/0x21 Apr 26 12:34:55 fir-md1-s2 kernel: [] 0xffffffffffffffff Apr 26 12:34:55 fir-md1-s2 kernel: LustreError: dumping log to /tmp/lustre-log.1556307295.100425 Apr 26 12:35:00 fir-md1-s2 kernel: LustreError: dumping log to /tmp/lustre-log.1556307300.100483 Apr 26 12:35:21 fir-md1-s2 kernel: LNet: Service thread pid 99165 completed after 444.58s. 
This indicates the system was overloaded (too many service threads, or there were not enough hardware resources). Apr 26 12:35:30 fir-md1-s2 kernel: LustreError: dumping log to /tmp/lustre-log.1556307330.100415 Apr 26 12:35:43 fir-md1-s2 kernel: LNet: Service thread pid 100193 was inactive for 200.13s. Watchdog stack traces are limited to 3 per 300 seconds, skipping this one. Apr 26 12:35:43 fir-md1-s2 kernel: LNet: Skipped 7 previous similar messages Apr 26 12:35:43 fir-md1-s2 kernel: LustreError: dumping log to /tmp/lustre-log.1556307343.100193 Apr 26 12:35:44 fir-md1-s2 kernel: Lustre: fir-MDT0003: Connection restored to (at 10.9.108.59@o2ib4) Apr 26 12:35:44 fir-md1-s2 kernel: Lustre: Skipped 322 previous similar messages Apr 26 12:35:57 fir-md1-s2 kernel: Lustre: fir-MDT0003: Client 5af85e95-71ec-5689-9879-f126f8845b44 (at 10.8.27.1@o2ib6) reconnecting Apr 26 12:35:57 fir-md1-s2 kernel: Lustre: Skipped 319 previous similar messages Apr 26 12:35:57 fir-md1-s2 kernel: LustreError: dumping log to /tmp/lustre-log.1556307357.100248 Apr 26 12:36:05 fir-md1-s2 kernel: LustreError: 100228:0:(client.c:1175:ptlrpc_import_delay_req()) @@@ IMP_CLOSED req@ffff9889c87b9200 x1631885376015424/t0(0) o104->fir-MDT0003@10.0.10.3@o2ib7:15/16 lens 296/224 e 0 to 0 dl 0 ref 1 fl Rpc:/0/ffffffff rc 0/-1 Apr 26 12:36:05 fir-md1-s2 kernel: LustreError: 100228:0:(client.c:1175:ptlrpc_import_delay_req()) Skipped 45 previous similar messages Apr 26 12:36:05 fir-md1-s2 kernel: LustreError: dumping log to /tmp/lustre-log.1556307365.100073 Apr 26 12:36:15 fir-md1-s2 kernel: LustreError: dumping log to /tmp/lustre-log.1556307375.100300 Apr 26 12:36:30 fir-md1-s2 kernel: LustreError: dumping log to /tmp/lustre-log.1556307390.99168 Apr 26 12:36:31 fir-md1-s2 kernel: LustreError: dumping log to /tmp/lustre-log.1556307391.100473 Apr 26 12:36:35 fir-md1-s2 kernel: LustreError: 99143:0:(ldlm_lockd.c:256:expired_lock_main()) ### lock callback timer expired after 30s: evicting client at 
10.0.10.3@o2ib7 ns: mdt-fir-MDT0003_UUID lock: ffff985032f5da00/0x4f3cef65e276a9ef lrc: 3/0,0 mode: PR/PR res: [0x28001b678:0x3c:0x0].0x0 bits 0x40/0x0 rrc: 3 type: IBT flags: 0x60000400010020 nid: 10.0.10.3@o2ib7 remote: 0xbbb5b46b296c76f9 expref: 472223 pid: 100396 timeout: 306789 lvb_type: 0 Apr 26 12:36:35 fir-md1-s2 kernel: LustreError: 99143:0:(ldlm_lockd.c:256:expired_lock_main()) Skipped 34 previous similar messages Apr 26 12:36:35 fir-md1-s2 kernel: LustreError: dumping log to /tmp/lustre-log.1556307395.100187 Apr 26 12:36:38 fir-md1-s2 kernel: LustreError: dumping log to /tmp/lustre-log.1556307398.99265 Apr 26 12:36:54 fir-md1-s2 kernel: Lustre: 99549:0:(service.c:1372:ptlrpc_at_send_early_reply()) @@@ Couldn't add any time (5/-5), not sending early reply req@ffff9868997acb00 x1631535279996032/t270597000539(0) o36->b5280270-3b22-224e-0daa-bad5776be543@10.9.103.24@o2ib4:29/0 lens 488/3152 e 0 to 0 dl 1556307419 ref 2 fl Interpret:/0/0 rc 0/0 Apr 26 12:36:54 fir-md1-s2 kernel: Lustre: 99549:0:(service.c:1372:ptlrpc_at_send_early_reply()) Skipped 36 previous similar messages Apr 26 12:37:02 fir-md1-s2 kernel: LustreError: dumping log to /tmp/lustre-log.1556307422.100259 Apr 26 12:37:08 fir-md1-s2 kernel: LustreError: dumping log to /tmp/lustre-log.1556307428.100362 Apr 26 12:37:10 fir-md1-s2 kernel: LustreError: dumping log to /tmp/lustre-log.1556307430.99472 Apr 26 12:37:18 fir-md1-s2 kernel: LustreError: dumping log to /tmp/lustre-log.1556307438.100396 Apr 26 12:37:21 fir-md1-s2 kernel: LustreError: dumping log to /tmp/lustre-log.1556307441.100143 Apr 26 12:37:29 fir-md1-s2 kernel: LNet: Service thread pid 100259 completed after 227.52s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources). 
Apr 26 12:37:29 fir-md1-s2 kernel: LNet: Skipped 10 previous similar messages Apr 26 12:37:33 fir-md1-s2 kernel: LustreError: dumping log to /tmp/lustre-log.1556307453.99546 Apr 26 12:37:53 fir-md1-s2 kernel: LustreError: dumping log to /tmp/lustre-log.1556307473.100106 Apr 26 12:37:59 fir-md1-s2 kernel: LustreError: 100112:0:(ldlm_request.c:129:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1556307389, 90s ago); not entering recovery in server code, just going back to sleep ns: mdt-fir-MDT0003_UUID lock: ffff986b209ce9c0/0x4f3cef65e780a6d4 lrc: 3/0,1 mode: --/PW res: [0x28001a57e:0x2320:0x0].0x0 bits 0x2/0x0 rrc: 6 type: IBT flags: 0x40210000000000 nid: local remote: 0x0 expref: -99 pid: 100112 timeout: 0 lvb_type: 0 Apr 26 12:37:59 fir-md1-s2 kernel: LustreError: 100112:0:(ldlm_request.c:129:ldlm_expired_completion_wait()) Skipped 27 previous similar messages Apr 26 12:38:29 fir-md1-s2 kernel: LustreError: dumping log to /tmp/lustre-log.1556307509.100103 Apr 26 12:38:44 fir-md1-s2 kernel: LustreError: dumping log to /tmp/lustre-log.1556307524.100026 Apr 26 12:40:00 fir-md1-s2 kernel: Lustre: 100507:0:(service.c:2165:ptlrpc_server_handle_request()) @@@ Request took longer than estimated (154:1s); client may timeout. req@ffff988b1ca34b00 x1631558211119824/t270597073983(0) o101->ed4bb535-6b9d-701d-993b-133faa2d1314@10.9.105.25@o2ib4:25/0 lens 376/1568 e 0 to 0 dl 1556307599 ref 1 fl Complete:/0/0 rc 0/0 Apr 26 12:40:13 fir-md1-s2 kernel: LNet: Service thread pid 100353 was inactive for 200.43s. The thread might be hung, or it might only be slow and will resume later. 
Dumping the stack trace for debugging purposes: Apr 26 12:40:13 fir-md1-s2 kernel: LNet: Skipped 1 previous similar message Apr 26 12:40:13 fir-md1-s2 kernel: Pid: 100353, comm: mdt03_059 3.10.0-957.1.3.el7_lustre.x86_64 #1 SMP Fri Dec 7 14:50:35 PST 2018 Apr 26 12:40:13 fir-md1-s2 kernel: Call Trace: Apr 26 12:40:13 fir-md1-s2 kernel: [] ldlm_completion_ast+0x4e5/0x890 [ptlrpc] Apr 26 12:40:13 fir-md1-s2 kernel: [] ldlm_cli_enqueue_local+0x23c/0x870 [ptlrpc] Apr 26 12:40:13 fir-md1-s2 kernel: [] mdt_object_local_lock+0x50b/0xb20 [mdt] Apr 26 12:40:13 fir-md1-s2 kernel: [] mdt_object_lock_internal+0x70/0x3e0 [mdt] Apr 26 12:40:13 fir-md1-s2 kernel: [] mdt_object_lock+0x20/0x30 [mdt] Apr 26 12:40:13 fir-md1-s2 kernel: [] mdt_brw_enqueue+0x44b/0x760 [mdt] Apr 26 12:40:13 fir-md1-s2 kernel: [] mdt_intent_brw+0x1f/0x30 [mdt] Apr 26 12:40:13 fir-md1-s2 kernel: [] mdt_intent_policy+0x2e8/0xd00 [mdt] Apr 26 12:40:13 fir-md1-s2 kernel: [] ldlm_lock_enqueue+0x366/0xa60 [ptlrpc] Apr 26 12:40:13 fir-md1-s2 kernel: [] ldlm_handle_enqueue0+0xa47/0x15a0 [ptlrpc] Apr 26 12:40:13 fir-md1-s2 kernel: [] tgt_enqueue+0x62/0x210 [ptlrpc] Apr 26 12:40:13 fir-md1-s2 kernel: [] tgt_request_handle+0xaea/0x1580 [ptlrpc] Apr 26 12:40:13 fir-md1-s2 kernel: [] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc] Apr 26 12:40:13 fir-md1-s2 kernel: [] ptlrpc_main+0xafc/0x1fc0 [ptlrpc] Apr 26 12:40:13 fir-md1-s2 kernel: [] kthread+0xd1/0xe0 Apr 26 12:40:13 fir-md1-s2 kernel: [] ret_from_fork_nospec_begin+0xe/0x21 Apr 26 12:40:13 fir-md1-s2 kernel: [] 0xffffffffffffffff Apr 26 12:40:13 fir-md1-s2 kernel: LustreError: dumping log to /tmp/lustre-log.1556307613.100353 Apr 26 12:40:39 fir-md1-s2 kernel: Pid: 99530, comm: mdt00_014 3.10.0-957.1.3.el7_lustre.x86_64 #1 SMP Fri Dec 7 14:50:35 PST 2018 Apr 26 12:40:39 fir-md1-s2 kernel: Call Trace: Apr 26 12:40:39 fir-md1-s2 kernel: [] ldlm_completion_ast+0x4e5/0x890 [ptlrpc] Apr 26 12:40:39 fir-md1-s2 kernel: [] ldlm_cli_enqueue_local+0x23c/0x870 [ptlrpc] 
Apr 26 12:40:39 fir-md1-s2 kernel: [] mdt_object_local_lock+0x50b/0xb20 [mdt] Apr 26 12:40:39 fir-md1-s2 kernel: [] mdt_object_lock_internal+0x70/0x3e0 [mdt] Apr 26 12:40:39 fir-md1-s2 kernel: [] mdt_object_lock+0x20/0x30 [mdt] Apr 26 12:40:39 fir-md1-s2 kernel: [] mdt_brw_enqueue+0x44b/0x760 [mdt] Apr 26 12:40:39 fir-md1-s2 kernel: [] mdt_intent_brw+0x1f/0x30 [mdt] Apr 26 12:40:39 fir-md1-s2 kernel: [] mdt_intent_policy+0x2e8/0xd00 [mdt] Apr 26 12:40:39 fir-md1-s2 kernel: [] ldlm_lock_enqueue+0x366/0xa60 [ptlrpc] Apr 26 12:40:39 fir-md1-s2 kernel: [] ldlm_handle_enqueue0+0xa47/0x15a0 [ptlrpc] Apr 26 12:40:39 fir-md1-s2 kernel: [] tgt_enqueue+0x62/0x210 [ptlrpc] Apr 26 12:40:40 fir-md1-s2 kernel: [] tgt_request_handle+0xaea/0x1580 [ptlrpc] Apr 26 12:40:40 fir-md1-s2 kernel: [] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc] Apr 26 12:40:40 fir-md1-s2 kernel: [] ptlrpc_main+0xafc/0x1fc0 [ptlrpc] Apr 26 12:40:40 fir-md1-s2 kernel: [] kthread+0xd1/0xe0 Apr 26 12:40:40 fir-md1-s2 kernel: [] ret_from_fork_nospec_begin+0xe/0x21 Apr 26 12:40:40 fir-md1-s2 kernel: [] 0xffffffffffffffff Apr 26 12:40:40 fir-md1-s2 kernel: LustreError: dumping log to /tmp/lustre-log.1556307640.99530 Apr 26 12:40:47 fir-md1-s2 kernel: Lustre: 100454:0:(service.c:2165:ptlrpc_server_handle_request()) @@@ Request took longer than estimated (30:1s); client may timeout. req@ffff984b0f119500 x1631686176801424/t270597078705(0) o36->e069e613-f413-14c2-adc9-8bb2c0565535@10.8.20.30@o2ib6:16/0 lens 488/424 e 0 to 0 dl 1556307646 ref 1 fl Complete:/0/0 rc 0/0 Apr 26 12:41:48 fir-md1-s2 kernel: LNet: Service thread pid 100187 completed after 512.84s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources). Apr 26 12:41:48 fir-md1-s2 kernel: LNet: Skipped 23 previous similar messages Apr 26 12:41:53 fir-md1-s2 kernel: Lustre: 100141:0:(service.c:2165:ptlrpc_server_handle_request()) @@@ Request took longer than estimated (61:1s); client may timeout. 
req@ffff98680d7ddd00 x1631588137219376/t0(0) o101->90d81c86-5db8-d29f-71be-9c3030e109bc@10.9.102.49@o2ib4:21/0 lens 480/536 e 0 to 0 dl 1556307712 ref 1 fl Complete:/0/0 rc 0/0