Details
- Type: Bug
- Resolution: Unresolved
- Priority: Major
- Affects Version: Lustre 2.12.5
- Environment: Lustre server 2.12.5_7.srcc (https://github.com/stanford-rc/lustre/commits/b2_12_5), changelog reader client: 2.12.5; server OS: CentOS 7.6, client OS: CentOS 7.8
Description
Following our upgrade of Oak from 2.10 to 2.12, changelogs are no longer working on any of Oak's 4 MDTs. I tried to deregister and register a new reader (cl2) for each MDT, but the problem persists.
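For reference, this is a sketch of the deregister/register cycle using the standard lctl changelog commands, run on the MDS hosting the MDT (device name taken from this ticket; adjust per MDT):

```shell
# Drop the existing changelog reader for this MDT.
lctl --device oak-MDT0001 changelog_deregister cl2

# Register a new changelog user; lctl prints the new reader id (e.g. cl2).
lctl --device oak-MDT0001 changelog_register
```

These commands must run against a live Lustre MDS, so they are shown as an operational fragment only.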
Worth noting: when starting MDTs, we can see the following warning messages:
Oct 19 15:26:37 oak-md1-s1 kernel: Lustre: 15665:0:(mdd_device.c:545:mdd_changelog_llog_init()) oak-MDD0002 : orphan changelog records found, starting from index 0 to index 22264575363, being cleared now
Oct 21 08:32:23 oak-md1-s1 kernel: Lustre: 41861:0:(mdd_device.c:545:mdd_changelog_llog_init()) oak-MDD0001 : orphan changelog records found, starting from index 0 to index 9422682426, being cleared now
Config for oak-MDT0001 (example, but it's the same with the 3 other MDTs):
[root@oak-md1-s1 ~]# lctl get_param mdd.oak-MDT0001.changelog_*
mdd.oak-MDT0001.changelog_deniednext=60
mdd.oak-MDT0001.changelog_gc=0
mdd.oak-MDT0001.changelog_max_idle_indexes=2097446912
mdd.oak-MDT0001.changelog_max_idle_time=2592000
mdd.oak-MDT0001.changelog_min_free_cat_entries=2
mdd.oak-MDT0001.changelog_min_gc_interval=3600
mdd.oak-MDT0001.changelog_size=4169832
mdd.oak-MDT0001.changelog_mask=
CREAT MKDIR HLINK SLINK MKNOD UNLNK RMDIR RENME RNMTO CLOSE LYOUT TRUNC SATTR HSM MTIME CTIME MIGRT FLRW RESYNC
mdd.oak-MDT0001.changelog_users=
current index: 9422682426
ID    index (idle seconds)
cl2   9422682426 (38908)
We notice that the "current index" is not increasing, while the idle time is. Note that there is activity on the filesystem (it is in production).
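To detect this condition automatically, a small parser over the `changelog_users` output can compare two samples and flag readers whose consumed index has not moved. This is a hypothetical sketch (not from the ticket); the field layout follows the output shown above and may differ across Lustre versions:

```python
# Sketch: parse `lctl get_param mdd.<MDT>.changelog_users` output and flag
# a reader whose index did not advance between two samples.
import re

def parse_changelog_users(text):
    """Return (current_index, {reader_id: (index, idle_seconds)})."""
    current = int(re.search(r"current index:\s*(\d+)", text).group(1))
    readers = {}
    # Reader lines look like: "cl2 9422682426 (38908)"
    for m in re.finditer(r"^(cl\d+)\s+(\d+)\s+\((\d+)\)", text, re.M):
        readers[m.group(1)] = (int(m.group(2)), int(m.group(3)))
    return current, readers

def stalled_readers(sample_a, sample_b):
    """Reader ids whose consumed index is identical in both samples."""
    _, readers_a = parse_changelog_users(sample_a)
    _, readers_b = parse_changelog_users(sample_b)
    return [rid for rid in readers_a
            if rid in readers_b and readers_b[rid][0] == readers_a[rid][0]]
```

In this ticket's situation the producer-side "current index" is itself frozen, so the same comparison applied to the current index would also flag the MDT, not just the reader.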
When trying to read changelogs from the client:
[root@oak-rbh01 robinhood]# lfs changelog oak-MDT0001
[root@oak-rbh01 robinhood]# echo $?
0
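The empty output with exit code 0 means `lfs changelog` succeeded but returned no records, which is worth distinguishing from an actual command failure when scripting around this. A hypothetical helper (names and structure are my own, not from the ticket):

```python
# Sketch: run `lfs changelog <MDT>` and classify the outcome.
import subprocess

def classify(returncode, stdout):
    """Classify an `lfs changelog` result: 'error', 'empty', or 'records'."""
    if returncode != 0:
        return "error"
    return "records" if stdout.strip() else "empty"

def read_changelog(mdt):
    """Invoke `lfs changelog` for one MDT; return (status, record lines)."""
    proc = subprocess.run(["lfs", "changelog", mdt],
                          capture_output=True, text=True)
    lines = [l for l in proc.stdout.splitlines() if l.strip()]
    return classify(proc.returncode, proc.stdout), lines
```

In the case above, `read_changelog("oak-MDT0001")` would report "empty" despite ongoing filesystem activity, consistent with the frozen current index.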
When I tried to restart oak-MDT0001 this morning, it did the following:
Oct 21 08:32:11 oak-md1-s1 kernel: Lustre: server umount oak-MDT0001 complete
Oct 21 08:32:11 oak-md1-s1 kernel: LustreError: 137-5: oak-MDT0001_UUID: not available for connect from 10.210.12.6@tcp1 (no target). If you are running an HA pair check that the target is mounted on the other server.
Oct 21 08:32:11 oak-md1-s1 kernel: LustreError: Skipped 28 previous similar messages
Oct 21 08:32:12 oak-md1-s1 kernel: LustreError: 137-5: oak-MDT0001_UUID: not available for connect from 10.49.18.28@o2ib1 (no target). If you are running an HA pair check that the target is mounted on the other server.
Oct 21 08:32:12 oak-md1-s1 kernel: LustreError: Skipped 57 previous similar messages
Oct 21 08:32:12 oak-md1-s1 kernel: LustreError: 18064:0:(ldlm_lockd.c:2324:ldlm_cancel_handler()) ldlm_cancel from 10.50.10.71@o2ib2 arrived at 1603294332 with bad export cookie 14346833159486330259
Oct 21 08:32:14 oak-md1-s1 kernel: LustreError: 137-5: oak-MDT0001_UUID: not available for connect from 10.50.4.55@o2ib2 (no target). If you are running an HA pair check that the target is mounted on the other server.
Oct 21 08:32:14 oak-md1-s1 kernel: LustreError: Skipped 126 previous similar messages
Oct 21 08:32:17 oak-md1-s1 kernel: LustreError: 24771:0:(ldlm_lockd.c:2324:ldlm_cancel_handler()) ldlm_cancel from 10.210.12.119@tcp1 arrived at 1603294337 with bad export cookie 14346833159486337952
Oct 21 08:32:18 oak-md1-s1 kernel: LustreError: 137-5: oak-MDT0001_UUID: not available for connect from 10.0.3.1@o2ib5 (no target). If you are running an HA pair check that the target is mounted on the other server.
Oct 21 08:32:18 oak-md1-s1 kernel: LustreError: Skipped 243 previous similar messages
Oct 21 08:32:22 oak-md1-s1 kernel: LDISKFS-fs (dm-3): mounted filesystem with ordered data mode. Opts: user_xattr,errors=remount-ro,acl,no_mbcache,nodelalloc
Oct 21 08:32:22 oak-md1-s1 kernel: Lustre: oak-MDT0001: Not available for connect from 10.49.30.7@o2ib1 (not set up)
Oct 21 08:32:22 oak-md1-s1 kernel: Lustre: Skipped 42 previous similar messages
Oct 21 08:32:23 oak-md1-s1 kernel: Lustre: oak-MDT0002: Connection restored to 2baf6034-0457-482e-32d4-2a55d4c43944 (at 0@lo)
Oct 21 08:32:23 oak-md1-s1 kernel: Lustre: oak-MDT0001: Imperative Recovery enabled, recovery window shrunk from 300-900 down to 150-900
Oct 21 08:32:23 oak-md1-s1 kernel: Lustre: oak-MDD0001: changelog on
Oct 21 08:32:23 oak-md1-s1 kernel: LustreError: 15560:0:(mgs_handler.c:282:mgs_revoke_lock()) MGS: can't take cfg lock for 0x6b616f/0x2 : rc = -11
Oct 21 08:32:23 oak-md1-s1 kernel: Lustre: 42675:0:(llog.c:615:llog_process_thread()) oak-MDT0000-osp-MDT0001: invalid length 0 in llog [0x1:0x181e9:0x2]record for index 0/6
Oct 21 08:32:23 oak-md1-s1 kernel: LustreError: 42675:0:(lod_dev.c:434:lod_sub_recovery_thread()) oak-MDT0000-osp-MDT0001 get update log failed: rc = -22
Oct 21 08:32:23 oak-md1-s1 kernel: Lustre: 41861:0:(mdd_device.c:545:mdd_changelog_llog_init()) oak-MDD0001 : orphan changelog records found, starting from index 0 to index 9422682426, being cleared now
Oct 21 08:32:23 oak-md1-s1 kernel: Lustre: oak-MDT0001: in recovery but waiting for the first client to connect
Oct 21 08:32:23 oak-md1-s1 kernel: Lustre: oak-MDT0001: Connection restored to 6b4880ff-499b-4 (at 10.50.3.30@o2ib2)
Oct 21 08:32:23 oak-md1-s1 kernel: Lustre: 42678:0:(ldlm_lib.c:2073:target_recovery_overseer()) recovery is aborted, evict exports in recovery
Oct 21 08:32:23 oak-md1-s1 kernel: Lustre: 42678:0:(ldlm_lib.c:2073:target_recovery_overseer()) Skipped 1 previous similar message
Oct 21 08:32:23 oak-md1-s1 kernel: Lustre: oak-MDT0001: disconnecting 1636 stale clients
Oct 21 08:32:23 oak-md1-s1 kernel: Lustre: oak-MDT0001: Denying connection for new client 4cc5cd3c-4881-4 (at 10.50.14.2@o2ib2), waiting for 1639 known clients (3 recovered, 0 in progress, and 819 evicted) already passed deadline 2476:54
Oct 21 08:32:23 oak-md1-s1 kernel: Lustre: Skipped 212 previous similar messages
Oct 21 08:32:24 oak-md1-s1 kernel: Lustre: oak-MDT0001: nosquash_nids set to 10.49.0.[11-12]@o2ib1 10.50.0.1@o2ib2 10.50.0.[11-12]@o2ib2 10.50.1.[59-60]@o2ib2 10.51.0.[1-2]@o2ib3 10.51.0.[11-18]@o2ib3 10.0.2.[1-3]@o2ib5 10.0.2.[51-58]@o2i
Oct 21 08:32:24 oak-md1-s1 kernel: Lustre: oak-MDT0001: root_squash is set to 99:99
Oct 21 08:32:24 oak-md1-s1 kernel: Lustre: Skipped 2 previous similar messages
Oct 21 08:32:24 oak-md1-s1 kernel: Lustre: oak-MDT0001: Connection restored to b7359770-1c3d-4 (at 10.50.1.48@o2ib2)
Oct 21 08:32:24 oak-md1-s1 kernel: Lustre: Skipped 60 previous similar messages
Oct 21 08:32:26 oak-md1-s1 kernel: Lustre: oak-MDT0001: Connection restored to 10.0.2.104@o2ib5 (at 10.0.2.104@o2ib5)
Oct 21 08:32:26 oak-md1-s1 kernel: Lustre: Skipped 124 previous similar messages
Oct 21 08:32:29 oak-md1-s1 kernel: LustreError: 15695:0:(ldlm_lockd.c:2324:ldlm_cancel_handler()) ldlm_cancel from 10.210.12.115@tcp1 arrived at 1603294349 with bad export cookie 14346833159486332149
Oct 21 08:32:29 oak-md1-s1 kernel: LustreError: 15695:0:(ldlm_lockd.c:2324:ldlm_cancel_handler()) Skipped 1 previous similar message
Oct 21 08:32:30 oak-md1-s1 kernel: Lustre: oak-MDT0001: Connection restored to 8f081fef-be09-4 (at 10.49.30.36@o2ib1)
Oct 21 08:32:30 oak-md1-s1 kernel: Lustre: Skipped 840 previous similar messages
Oct 21 08:32:58 oak-md1-s1 kernel: Lustre: oak-MDT0001: Connection restored to 10.0.2.117@o2ib5 (at 10.0.2.117@o2ib5)
Oct 21 08:32:58 oak-md1-s1 kernel: Lustre: Skipped 683 previous similar messages
Oct 21 08:33:13 oak-md1-s1 kernel: LustreError: 167-0: oak-MDT0001-osp-MDT0002: This client was evicted by oak-MDT0001; in progress operations using this service will fail.
The MDT came back up, but changelogs are still not working.
Note in particular this error:
Oct 21 08:32:23 oak-md1-s1 kernel: Lustre: 42675:0:(llog.c:615:llog_process_thread()) oak-MDT0000-osp-MDT0001: invalid length 0 in llog [0x1:0x181e9:0x2]record for index 0/6
Any idea?