  1. Lustre
  2. LU-14059

Changelogs not working after 2.10 to 2.12 upgrade

Details

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Major
    • Affects Version: Lustre 2.12.5
    • Environment: Lustre server 2.12.5_7.srcc (https://github.com/stanford-rc/lustre/commits/b2_12_5); changelog reader client: 2.12.5; server OS: CentOS 7.6; client OS: CentOS 7.8
    • Severity: 2

    Description

      Following our upgrade of Oak from 2.10 to 2.12, changelogs no longer work on any of Oak's 4 MDTs. I tried deregistering the existing reader and registering a new one (cl2) on each MDT, but the problem persists.

       

      Worth noting: when starting MDTs, we can see the following warning messages:

      Oct 19 15:26:37 oak-md1-s1 kernel: Lustre: 15665:0:(mdd_device.c:545:mdd_changelog_llog_init()) oak-MDD0002 : orphan changelog records found, starting from index 0 to index 22264575363, being cleared now
      
      Oct 21 08:32:23 oak-md1-s1 kernel: Lustre: 41861:0:(mdd_device.c:545:mdd_changelog_llog_init()) oak-MDD0001 : orphan changelog records found, starting from index 0 to index 9422682426, being cleared now 
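
To compare these warnings across MDTs, the device name and index range can be pulled out of the kernel log. The following is a minimal Python sketch, not part of Lustre: the function name `find_orphan_ranges` and the regular expression are assumptions built only from the two messages quoted above, and the message format may differ in other Lustre versions.

```python
import re

# Pattern derived from the two warnings quoted in this report (assumption:
# other Lustre versions may word the message differently).
ORPHAN_RE = re.compile(
    r"\(mdd_device\.c:\d+:mdd_changelog_llog_init\(\)\)\s+"
    r"(?P<device>\S+)\s*: orphan changelog records found, "
    r"starting from index (?P<start>\d+) to index (?P<end>\d+)"
)

# The two log lines from this report, used as sample input.
SAMPLE_LINES = [
    "Oct 19 15:26:37 oak-md1-s1 kernel: Lustre: 15665:0:(mdd_device.c:545:mdd_changelog_llog_init()) "
    "oak-MDD0002 : orphan changelog records found, starting from index 0 to index 22264575363, being cleared now",
    "Oct 21 08:32:23 oak-md1-s1 kernel: Lustre: 41861:0:(mdd_device.c:545:mdd_changelog_llog_init()) "
    "oak-MDD0001 : orphan changelog records found, starting from index 0 to index 9422682426, being cleared now",
]


def find_orphan_ranges(lines):
    """Map each MDD device to the (start, end) index range reported as orphaned.

    If a device appears more than once, the last message wins.
    """
    found = {}
    for line in lines:
        match = ORPHAN_RE.search(line)
        if match:
            found[match.group("device")] = (
                int(match.group("start")),
                int(match.group("end")),
            )
    return found


if __name__ == "__main__":
    for device, (start, end) in find_orphan_ranges(SAMPLE_LINES).items():
        print(device, start, end)
```

The same function can be fed the lines of a saved kernel log to collect the range for every MDT in one pass.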

       

      Changelog configuration for oak-MDT0001 (shown as an example; the other 3 MDTs look the same):

      [root@oak-md1-s1 ~]# lctl get_param mdd.oak-MDT0001.changelog_*
      mdd.oak-MDT0001.changelog_deniednext=60
      mdd.oak-MDT0001.changelog_gc=0
      mdd.oak-MDT0001.changelog_max_idle_indexes=2097446912
      mdd.oak-MDT0001.changelog_max_idle_time=2592000
      mdd.oak-MDT0001.changelog_min_free_cat_entries=2
      mdd.oak-MDT0001.changelog_min_gc_interval=3600
      mdd.oak-MDT0001.changelog_size=4169832
      mdd.oak-MDT0001.changelog_mask=
      CREAT MKDIR HLINK SLINK MKNOD UNLNK RMDIR RENME RNMTO CLOSE LYOUT TRUNC SATTR HSM MTIME CTIME MIGRT FLRW RESYNC 
      mdd.oak-MDT0001.changelog_users=
      current index: 9422682426
      ID    index (idle seconds)
      cl2   9422682426 (38908)
      

      The "current index" is not increasing, while the reader's idle time keeps growing. Note that the filesystem is in production, so there is activity on it.
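
A stuck index like this can be checked mechanically by sampling `changelog_users` twice and comparing the "current index" values. The sketch below is a hedged illustration in Python: `parse_changelog_users` and `looks_stalled` are hypothetical helper names, the parser assumes the three-line layout shown above, and it assumes reader IDs of the form `cl<N>` as in this output.

```python
import re

# Text copied from the changelog_users output in this report.
SAMPLE = """\
current index: 9422682426
ID    index (idle seconds)
cl2   9422682426 (38908)
"""


def parse_changelog_users(text):
    """Return (current_index, {reader_id: (index, idle_seconds)}).

    current_index is None if no "current index:" line is present.
    """
    current = None
    users = {}
    for raw in text.splitlines():
        line = raw.strip()
        match = re.match(r"current index:\s*(\d+)$", line)
        if match:
            current = int(match.group(1))
            continue
        # Assumption: reader IDs look like "cl2", as in the output above.
        match = re.match(r"(cl\d+)\s+(\d+)\s+\((\d+)\)$", line)
        if match:
            users[match.group(1)] = (int(match.group(2)), int(match.group(3)))
    return current, users


def looks_stalled(before, after):
    """True if the current index did not advance between two parsed samples."""
    return after[0] is not None and after[0] == before[0]


if __name__ == "__main__":
    print(parse_changelog_users(SAMPLE))
```

In practice the two samples would come from running `lctl get_param mdd.oak-MDT0001.changelog_users` on the MDS some time apart, while the filesystem is known to be active.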

      Reading changelogs from the client returns nothing, with exit status 0:

      [root@oak-rbh01 robinhood]# lfs changelog oak-MDT0001
      [root@oak-rbh01 robinhood]# echo $?
      0
      

      When I restarted oak-MDT0001 this morning, it logged the following:

      Oct 21 08:32:11 oak-md1-s1 kernel: Lustre: server umount oak-MDT0001 complete
      Oct 21 08:32:11 oak-md1-s1 kernel: LustreError: 137-5: oak-MDT0001_UUID: not available for connect from 10.210.12.6@tcp1 (no target). If you are running an HA pair check that the target is mounted on the other server.
      Oct 21 08:32:11 oak-md1-s1 kernel: LustreError: Skipped 28 previous similar messages
      Oct 21 08:32:12 oak-md1-s1 kernel: LustreError: 137-5: oak-MDT0001_UUID: not available for connect from 10.49.18.28@o2ib1 (no target). If you are running an HA pair check that the target is mounted on the other server.
      Oct 21 08:32:12 oak-md1-s1 kernel: LustreError: Skipped 57 previous similar messages
      Oct 21 08:32:12 oak-md1-s1 kernel: LustreError: 18064:0:(ldlm_lockd.c:2324:ldlm_cancel_handler()) ldlm_cancel from 10.50.10.71@o2ib2 arrived at 1603294332 with bad export cookie 14346833159486330259
      Oct 21 08:32:14 oak-md1-s1 kernel: LustreError: 137-5: oak-MDT0001_UUID: not available for connect from 10.50.4.55@o2ib2 (no target). If you are running an HA pair check that the target is mounted on the other server.
      Oct 21 08:32:14 oak-md1-s1 kernel: LustreError: Skipped 126 previous similar messages
      Oct 21 08:32:17 oak-md1-s1 kernel: LustreError: 24771:0:(ldlm_lockd.c:2324:ldlm_cancel_handler()) ldlm_cancel from 10.210.12.119@tcp1 arrived at 1603294337 with bad export cookie 14346833159486337952
      Oct 21 08:32:18 oak-md1-s1 kernel: LustreError: 137-5: oak-MDT0001_UUID: not available for connect from 10.0.3.1@o2ib5 (no target). If you are running an HA pair check that the target is mounted on the other server.
      Oct 21 08:32:18 oak-md1-s1 kernel: LustreError: Skipped 243 previous similar messages
      Oct 21 08:32:22 oak-md1-s1 kernel: LDISKFS-fs (dm-3): mounted filesystem with ordered data mode. Opts: user_xattr,errors=remount-ro,acl,no_mbcache,nodelalloc
      Oct 21 08:32:22 oak-md1-s1 kernel: Lustre: oak-MDT0001: Not available for connect from 10.49.30.7@o2ib1 (not set up)
      Oct 21 08:32:22 oak-md1-s1 kernel: Lustre: Skipped 42 previous similar messages
      Oct 21 08:32:23 oak-md1-s1 kernel: Lustre: oak-MDT0002: Connection restored to 2baf6034-0457-482e-32d4-2a55d4c43944 (at 0@lo)
      Oct 21 08:32:23 oak-md1-s1 kernel: Lustre: oak-MDT0001: Imperative Recovery enabled, recovery window shrunk from 300-900 down to 150-900
      Oct 21 08:32:23 oak-md1-s1 kernel: Lustre: oak-MDD0001: changelog on
      Oct 21 08:32:23 oak-md1-s1 kernel: LustreError: 15560:0:(mgs_handler.c:282:mgs_revoke_lock()) MGS: can't take cfg lock for 0x6b616f/0x2 : rc = -11
      Oct 21 08:32:23 oak-md1-s1 kernel: Lustre: 42675:0:(llog.c:615:llog_process_thread()) oak-MDT0000-osp-MDT0001: invalid length 0 in llog [0x1:0x181e9:0x2]record for index 0/6
      Oct 21 08:32:23 oak-md1-s1 kernel: LustreError: 42675:0:(lod_dev.c:434:lod_sub_recovery_thread()) oak-MDT0000-osp-MDT0001 get update log failed: rc = -22
      Oct 21 08:32:23 oak-md1-s1 kernel: Lustre: 41861:0:(mdd_device.c:545:mdd_changelog_llog_init()) oak-MDD0001 : orphan changelog records found, starting from index 0 to index 9422682426, being cleared now
      Oct 21 08:32:23 oak-md1-s1 kernel: Lustre: oak-MDT0001: in recovery but waiting for the first client to connect
      Oct 21 08:32:23 oak-md1-s1 kernel: Lustre: oak-MDT0001: Connection restored to 6b4880ff-499b-4 (at 10.50.3.30@o2ib2)
      Oct 21 08:32:23 oak-md1-s1 kernel: Lustre: 42678:0:(ldlm_lib.c:2073:target_recovery_overseer()) recovery is aborted, evict exports in recovery
      Oct 21 08:32:23 oak-md1-s1 kernel: Lustre: 42678:0:(ldlm_lib.c:2073:target_recovery_overseer()) Skipped 1 previous similar message
      Oct 21 08:32:23 oak-md1-s1 kernel: Lustre: oak-MDT0001: disconnecting 1636 stale clients
      Oct 21 08:32:23 oak-md1-s1 kernel: Lustre: oak-MDT0001: Denying connection for new client 4cc5cd3c-4881-4 (at 10.50.14.2@o2ib2), waiting for 1639 known clients (3 recovered, 0 in progress, and 819 evicted) already passed deadline 2476:54
      Oct 21 08:32:23 oak-md1-s1 kernel: Lustre: Skipped 212 previous similar messages
      Oct 21 08:32:24 oak-md1-s1 kernel: Lustre: oak-MDT0001: nosquash_nids set to 10.49.0.[11-12]@o2ib1 10.50.0.1@o2ib2 10.50.0.[11-12]@o2ib2 10.50.1.[59-60]@o2ib2 10.51.0.[1-2]@o2ib3 10.51.0.[11-18]@o2ib3 10.0.2.[1-3]@o2ib5 10.0.2.[51-58]@o2i
      Oct 21 08:32:24 oak-md1-s1 kernel: Lustre: oak-MDT0001: root_squash is set to 99:99
      Oct 21 08:32:24 oak-md1-s1 kernel: Lustre: Skipped 2 previous similar messages
      Oct 21 08:32:24 oak-md1-s1 kernel: Lustre: oak-MDT0001: Connection restored to b7359770-1c3d-4 (at 10.50.1.48@o2ib2)
      Oct 21 08:32:24 oak-md1-s1 kernel: Lustre: Skipped 60 previous similar messages
      Oct 21 08:32:26 oak-md1-s1 kernel: Lustre: oak-MDT0001: Connection restored to 10.0.2.104@o2ib5 (at 10.0.2.104@o2ib5)
      Oct 21 08:32:26 oak-md1-s1 kernel: Lustre: Skipped 124 previous similar messages
      Oct 21 08:32:29 oak-md1-s1 kernel: LustreError: 15695:0:(ldlm_lockd.c:2324:ldlm_cancel_handler()) ldlm_cancel from 10.210.12.115@tcp1 arrived at 1603294349 with bad export cookie 14346833159486332149
      Oct 21 08:32:29 oak-md1-s1 kernel: LustreError: 15695:0:(ldlm_lockd.c:2324:ldlm_cancel_handler()) Skipped 1 previous similar message
      Oct 21 08:32:30 oak-md1-s1 kernel: Lustre: oak-MDT0001: Connection restored to 8f081fef-be09-4 (at 10.49.30.36@o2ib1)
      Oct 21 08:32:30 oak-md1-s1 kernel: Lustre: Skipped 840 previous similar messages
      Oct 21 08:32:58 oak-md1-s1 kernel: Lustre: oak-MDT0001: Connection restored to 10.0.2.117@o2ib5 (at 10.0.2.117@o2ib5)
      Oct 21 08:32:58 oak-md1-s1 kernel: Lustre: Skipped 683 previous similar messages
      Oct 21 08:33:13 oak-md1-s1 kernel: LustreError: 167-0: oak-MDT0001-osp-MDT0002: This client was evicted by oak-MDT0001; in progress operations using this service will fail.
      

      The MDT started, but changelogs are still not working.

       

      Note this message in particular:

      Oct 21 08:32:23 oak-md1-s1 kernel: Lustre: 42675:0:(llog.c:615:llog_process_thread()) oak-MDT0000-osp-MDT0001: invalid length 0 in llog [0x1:0x181e9:0x2]record for index 0/6 

      Any ideas?

          People

            hongchao.zhang Hongchao Zhang
            sthiell Stephane Thiell