Uploaded image for project: 'Lustre'
  1. Lustre
  2. LU-11980

When attempting to enable changelogs system is locking up with "no more free slots in catalog"

    XMLWordPrintable

Details

    • Question/Request
    • Resolution: Done
    • Minor
    • None
    • Lustre 2.8.0
    • None
    • Dell R720 servers running TOSS 3.1-4.1 (RHEL 7.3)/Lustre 2.8.0.9 IB attached to DDN 7700.
    • 9223372036854775807

    Description

      We have been working with changelogs recently and on this particular file system when changelogs were enabled, getting changelog information was very slow when specifying a short range greater than the index. Because of the response, changelogs have been registered and deregister a number of time. 

      Today we attempted to re-register changelogs and the current index never changed and after just a couple of minutes the system froze. 

      One thing just occurred to me: Should we clear the changelogs before re-registering? We haven't done that.

       

      dmesg after a reboot (without deregistering changelogs) produced the following output prior to the hang:

      [Wed Feb 20 10:42:41 2019] LDISKFS-fs (dm-0): recovery complete
      [Wed Feb 20 10:42:41 2019] LDISKFS-fs (dm-0): mounted filesystem with ordered data mode. Opts: user_xattr,errors=remount-ro,no_mbcache
      [Wed Feb 20 10:42:43 2019] Lustre: MGS: Connection restored to f81759a0-aecf-98f8-d1e0-a6cf70343cdd (at 0@lo)
      [Wed Feb 20 10:42:43 2019] LustreError: 137-5: gscratch-MDT0000_UUID: not available for connect from 192.168.163.154@o2ib8 (no target). If you are running an HA pair check that the target is mounted on the oth.
      [Wed Feb 20 10:42:43 2019] Lustre: gscratch-MDT0000: Not available for connect from 192.168.146.62@o2ib7 (not set up)
      [Wed Feb 20 10:42:43 2019] Lustre: MGS: Connection restored to (at 192.168.129.20@o2ib4)
      [Wed Feb 20 10:42:43 2019] Lustre: Skipped 21 previous similar messages
      [Wed Feb 20 10:42:43 2019] sd 7:0:0:0: Warning! Received an indication that the LUN assignments on this target have changed. The Linux SCSI layer does not automatical
      [Wed Feb 20 10:42:44 2019] Lustre: gscratch-MDT0000: Not available for connect from 192.168.133.66@o2ib4 (not set up)
      [Wed Feb 20 10:42:44 2019] Lustre: Skipped 24 previous similar messages
      [Wed Feb 20 10:42:44 2019] Lustre: MGS: Connection restored to (at 192.168.163.242@o2ib8)
      [Wed Feb 20 10:42:44 2019] Lustre: Skipped 71 previous similar messages
      [Wed Feb 20 10:42:46 2019] Lustre: MGS: Connection restored to (at 192.168.128.111@o2ib4)
      [Wed Feb 20 10:42:46 2019] Lustre: Skipped 144 previous similar messages
      [Wed Feb 20 10:42:50 2019] Lustre: MGS: Connection restored to 670522c1-3eaa-288f-b21f-556fb85bf7bb (at 192.168.81.28@tcp2)
      [Wed Feb 20 10:42:50 2019] Lustre: Skipped 336 previous similar messages
      [Wed Feb 20 10:42:51 2019] Lustre: 5685:0:(llog_cat.c:956:llog_cat_reverse_process()) catalog 0x7:10 crosses index zero
      [Wed Feb 20 10:42:51 2019] Lustre: gscratch-MDD0000: changelog on
      [Wed Feb 20 10:42:52 2019] Lustre: gscratch-MDT0000: Will be in recovery for at least 5:00, or until 6653 clients reconnect
      [Wed Feb 20 10:42:58 2019] Lustre: gscratch-MDT0000: Connection restored to (at 192.168.146.61@o2ib7)
      [Wed Feb 20 10:42:58 2019] Lustre: Skipped 1892 previous similar messages
      [Wed Feb 20 10:43:14 2019] Lustre: gscratch-MDT0000: Connection restored to 41d8e069-ae86-1be2-dc41-7835d4a0c892 (at 192.168.133.43@o2ib4)
      [Wed Feb 20 10:43:14 2019] Lustre: Skipped 1630 previous similar messages
      [Wed Feb 20 10:43:46 2019] Lustre: MGS: Connection restored to c858508f-9415-a50e-e377-4c75ecc97e6a (at 192.168.160.40@o2ib8)
      [Wed Feb 20 10:43:46 2019] Lustre: Skipped 7165 previous similar messages
      [Wed Feb 20 10:43:53 2019] Lustre: gscratch-MDT0000: Recovery already passed deadline 3:59. If you do not want to wait more, please abort the recovery by force.
      [Wed Feb 20 10:43:53 2019] Lustre: gscratch-MDT0000: Recovery already passed deadline 3:58. If you do not want to wait more, please abort the recovery by force.
      [Wed Feb 20 10:43:53 2019] Lustre: Skipped 68 previous similar messages
      [Wed Feb 20 10:43:54 2019] Lustre: gscratch-MDT0000: Recovery already passed deadline 3:57. If you do not want to wait more, please abort the recovery by force.
      [Wed Feb 20 10:43:54 2019] Lustre: Skipped 127 previous similar messages
      [Wed Feb 20 10:43:56 2019] Lustre: gscratch-MDT0000: Recovery already passed deadline 3:55. If you do not want to wait more, please abort the recovery by force.
      [Wed Feb 20 10:43:56 2019] Lustre: Skipped 261 previous similar messages
      [Wed Feb 20 10:43:58 2019] LustreError: 5851:0:(ldlm_lib.c:2778:target_queue_recovery_request()) @@@ dropping resent queued req req@ffff880f9ad07500 x1623562148962904/t0(737400040385) o35->c1db237f-2447-dcf4-1
      [Wed Feb 20 10:44:00 2019] Lustre: gscratch-MDT0000: Recovery already passed deadline 3:51. If you do not want to wait more, please abort the recovery by force.
      [Wed Feb 20 10:44:00 2019] Lustre: Skipped 656 previous similar messages
      [Wed Feb 20 10:44:00 2019] LustreError: 5851:0:(ldlm_lib.c:2778:target_queue_recovery_request()) @@@ dropping resent queued req req@ffff880fe010ac50 x1623924373420932/t0(737400041101) o35->28ad855d-72b3-7388-1
      [Wed Feb 20 10:44:02 2019] LustreError: 6378:0:(ldlm_lib.c:2778:target_queue_recovery_request()) @@@ dropping resent queued req req@ffff880f9a40ce00 x1623286549665444/t0(737400045945) o101->ed31116f-3909-bacf1
      [Wed Feb 20 10:44:08 2019] Lustre: gscratch-MDT0000: Recovery already passed deadline 3:43. If you do not want to wait more, please abort the recovery by force.
      [Wed Feb 20 10:44:08 2019] Lustre: Skipped 591 previous similar messages
      [Wed Feb 20 10:44:24 2019] Lustre: gscratch-MDT0000: Recovery already passed deadline 3:27. If you do not want to wait more, please abort the recovery by force.
      [Wed Feb 20 10:44:24 2019] Lustre: Skipped 1055 previous similar messages
      [Wed Feb 20 10:44:32 2019] LustreError: 6261:0:(ldlm_lib.c:2778:target_queue_recovery_request()) @@@ dropping resent queued req req@ffff8807c7e6a700 x1624190622706812/t0(737400044160) o101->48ebd6c1-d5c4-15b01
      [Wed Feb 20 10:44:32 2019] LustreError: 6261:0:(ldlm_lib.c:2778:target_queue_recovery_request()) Skipped 1 previous similar message
      [Wed Feb 20 10:44:37 2019] LustreError: 5851:0:(ldlm_lib.c:2778:target_queue_recovery_request()) @@@ dropping resent queued req req@ffff880f99ce9800 x1623562475389600/t0(737400021311) o35->f9a9f0dd-fe45-980e-1
      [Wed Feb 20 10:44:50 2019] Lustre: gscratch-MDT0000: Connection restored to 4f538c64-8ac4-c14d-4e39-050d775a27fb (at 192.168.145.204@o2ib7)
      [Wed Feb 20 10:44:50 2019] Lustre: Skipped 7950 previous similar messages
      [Wed Feb 20 10:44:56 2019] Lustre: gscratch-MDT0000: Recovery already passed deadline 2:55. If you do not want to wait more, please abort the recovery by force.
      [Wed Feb 20 10:44:56 2019] Lustre: Skipped 3929 previous similar messages
      [Wed Feb 20 10:44:59 2019] LustreError: 5851:0:(ldlm_lib.c:2778:target_queue_recovery_request()) @@@ dropping resent queued req req@ffff8800af8b2700 x1623562148962904/t0(737400040385) o35->c1db237f-2447-dcf4-1
      [Wed Feb 20 10:45:33 2019] LustreError: 6313:0:(ldlm_lib.c:2778:target_queue_recovery_request()) @@@ dropping resent queued req req@ffff880f99236900 x1624190622706812/t0(737400044160) o101->48ebd6c1-d5c4-15b01
      [Wed Feb 20 10:45:33 2019] LustreError: 6313:0:(ldlm_lib.c:2778:target_queue_recovery_request()) Skipped 3 previous similar messages

       

      Snippet from syslog:

      Feb 20 10:18:09 gmds1 dbus[3789]: [system] Successfully activated service 'org.freedesktop.problems'
      Feb 20 10:18:49 gmds1 kernel: [2070231.187993] Lustre: gscratch-MDD0000: changelog on
      Feb 20 10:18:49 gmds1 kernel: [2070231.196897] Lustre: 7104:0:(llog_cat.c:93:llog_cat_new_log()) gscratch-MDD0000: there are no more free slots in catalog
      Feb 20 10:18:49 gmds1 kernel: [2070231.210447] LustreError: 6714:0:(mdd_dir.c:887:mdd_changelog_ns_store()) gscratch-MDD0000: cannot store changelog record: type = 8, name = 'rest.colvars.state.old', t = [0
      x200
      1755d7:0xdc8:0x0], p = [0x200171e8a:0x12d2d:0x0]: rc = -28
      Feb 20 10:18:49 gmds1 kernel: [2070231.221771] LustreError: 7227:0:(llog_cat.c:385:llog_cat_current_log()) gscratch-MDD0000: next log does not exist!
      Feb 20 10:18:49 gmds1 kernel: [2070231.249353] LustreError: 6714:0:(mdd_dir.c:887:mdd_changelog_ns_store()) Skipped 3 previous similar messages
      Feb 20 10:18:49 gmds1 kernel: [2070231.697900] Lustre: 7237:0:(llog_cat.c:93:llog_cat_new_log()) gscratch-MDD0000: there are no more free slots in catalog
      Feb 20 10:18:49 gmds1 kernel: [2070231.711550] Lustre: 7237:0:(llog_cat.c:93:llog_cat_new_log()) Skipped 278 previous similar messages
      Feb 20 10:18:49 gmds1 kernel: [2070231.723222] LustreError: 6711:0:(mdd_dir.c:887:mdd_changelog_ns_store()) gscratch-MDD0000: cannot store changelog record: type = 8, name = 'rest.colvars.state.old', t = [0x200
      17748a:0x3684:0x0], p = [0x200171e8a:0x12c65:0x0]: rc = -28
      Feb 20 10:18:49 gmds1 kernel: [2070231.749406] LustreError: 6711:0:(mdd_dir.c:887:mdd_changelog_ns_store()) Skipped 62 previous similar messages
      Feb 20 10:18:49 gmds1 kernel: [2070231.814879] LustreError: 6909:0:(llog_cat.c:385:llog_cat_current_log()) gscratch-MDD0000: next log does not exist!
      Feb 20 10:18:49 gmds1 kernel: [2070231.828077] LustreError: 6909:0:(llog_cat.c:385:llog_cat_current_log()) Skipped 43 previous similar messages
      Feb 20 10:18:50 gmds1 kernel: [2070232.707459] Lustre: 6628:0:(llog_cat.c:93:llog_cat_new_log()) gscratch-MDD0000: there are no more free slots in catalog
      Feb 20 10:18:50 gmds1 kernel: [2070232.721229] Lustre: 6628:0:(llog_cat.c:93:llog_cat_new_log()) Skipped 273 previous similar messages
      Feb 20 10:18:51 gmds1 kernel: [2070232.998096] LustreError: 6711:0:(mdd_dir.c:887:mdd_changelog_ns_store()) gscratch-MDD0000: cannot store changelog recor

      Attachments

        Activity

          People

            tappro Mikhail Pershin
            jamervi Joe Mervini
            Votes:
            0 Vote for this issue
            Watchers:
            4 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved: