
[LU-14958] configurable hash table size for jbd2

Details

    • Type: Improvement
    • Resolution: Fixed
    • Priority: Minor
    • Fix Version/s: Lustre 2.16.0, Lustre 2.15.3

    Description

      The revoke hash table can become enormous with a multi-GB journal: this may result in millions of revoke records which are loaded and inserted into a dedicated hashtable during journal replay.
      Currently the revoke hashtable's size is hard-coded as 256, so every slot may get too many records.

      A simple benchmark of that code (number of revoke records - time to insert/look them up):
      1048576 - 95 seconds
      2097152 - 580 seconds
      In the field it can be up to 30M records to find/insert.

      With 8192 buckets in the hash table:
      4194304 - 59 seconds
      8388608 - 247 seconds
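
      A standalone userspace sketch of what those timings measure - this is not the jbd2 code, and the file name, hash function, and default sizes below are placeholders - inserting N distinct block numbers into a chained hash with a fixed bucket count, doing a lookup before every insert the way revoke record handling does, so total work grows roughly as N^2/buckets:

      /*
       * Standalone model, not the jbd2 code: time the insertion of N
       * distinct 64-bit "block numbers" into a chained hash with a fixed
       * number of buckets, doing a lookup before each insert the way
       * revoke record handling does during replay.
       *
       *   cc -O2 revoke_model.c -o revoke_model
       *   ./revoke_model 4194304 256
       *   ./revoke_model 4194304 8192
       */
      #include <stdio.h>
      #include <stdlib.h>
      #include <time.h>

      struct rec {
              unsigned long long blocknr;
              struct rec *next;
      };

      int main(int argc, char **argv)
      {
              unsigned long long n = argc > 1 ? strtoull(argv[1], NULL, 0) : 1048576ULL;
              unsigned long buckets = argc > 2 ? strtoul(argv[2], NULL, 0) : 256;
              struct rec **table = calloc(buckets, sizeof(*table));
              clock_t start = clock();

              if (!table)
                      return 1;

              for (unsigned long long i = 0; i < n; i++) {
                      /* multiply by an odd constant to spread keys over all buckets */
                      unsigned long long blocknr = i * 2654435761ULL;
                      unsigned long h = blocknr % buckets;
                      struct rec *r;

                      for (r = table[h]; r; r = r->next)      /* lookup before insert */
                              if (r->blocknr == blocknr)
                                      break;
                      if (!r) {
                              r = malloc(sizeof(*r));
                              if (!r)
                                      return 1;
                              r->blocknr = blocknr;
                              r->next = table[h];
                              table[h] = r;
                      }
              }
              printf("%llu records, %lu buckets: %.1f seconds\n", n, buckets,
                     (double)(clock() - start) / CLOCKS_PER_SEC);
              return 0;
      }

      With 256 buckets and millions of records the chain walk dominates, which matches the timings above; raising the bucket count (or switching to a resizable table, as the later patches do) keeps the chains short.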

      Attachments

        1. dumpe2fs.txt
          2 kB
        2. jbd2_debugfs.gz
          106.24 MB


          Activity


            adilger Andreas Dilger added a comment -

            An equivalent patch to resolve the revoke block scalability issue during journal replay was landed to the upstream ext4 code:
            https://patchwork.ozlabs.org/project/linux-ext4/patch/20250121140925.17231-2-jack@suse.cz/

            adilger Andreas Dilger added a comment -

            It looks like this patch solved the problem with the kernel's journal revoke record handling, but there is still a similar problem in the e2fsprogs handling of journal replay. We hit an issue with slow journal recovery while deleting a large number of changelogs, and "tune2fs" and "e2fsck" were hung in journal recovery for hours before being interrupted. The problem was eventually fixed by updating the server to include this fix in ldiskfs and then mounting the filesystem to do the journal recovery in the kernel.
            pjones Peter Jones added a comment -

            Seems to be merged for 2.15.3 and 2.16


            "Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/50730/
            Subject: LU-14958 kernel: use rhashtable for revoke records in jbd2
            Project: fs/lustre-release
            Branch: b2_15
            Current Patch Set:
            Commit: 52bc546e91e415315e6cf9a46608264122e64ef3

            gerrit Gerrit Updater added a comment - "Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/50730/ Subject: LU-14958 kernel: use rhashtable for revoke records in jbd2 Project: fs/lustre-release Branch: b2_15 Current Patch Set: Commit: 52bc546e91e415315e6cf9a46608264122e64ef3

            "Jian Yu <yujian@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/50730
            Subject: LU-14958 kernel: use rhashtable for revoke records in jbd2
            Project: fs/lustre-release
            Branch: b2_15
            Current Patch Set: 1
            Commit: 53e110a884423b137b44ea44b0f1327b1535bfa1

            gerrit Gerrit Updater added a comment - "Jian Yu <yujian@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/50730 Subject: LU-14958 kernel: use rhashtable for revoke records in jbd2 Project: fs/lustre-release Branch: b2_15 Current Patch Set: 1 Commit: 53e110a884423b137b44ea44b0f1327b1535bfa1

            "Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/45122/
            Subject: LU-14958 kernel: use rhashtable for revoke records in jbd2
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: c3bb2b778d6b40a5cecb01993b55fcc107305b4a

            gerrit Gerrit Updater added a comment - "Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/45122/ Subject: LU-14958 kernel: use rhashtable for revoke records in jbd2 Project: fs/lustre-release Branch: master Current Patch Set: Commit: c3bb2b778d6b40a5cecb01993b55fcc107305b4a

            Here the "/proc/fs/jbd2/info" with the LU-14688:

            131 transactions (127 requested), each up to 262144 blocks
            average: 
              0ms waiting for transaction
              0ms request delay
              4770ms running transaction
              13ms transaction was being locked
              0ms flushing data (in ordered mode)
              60ms logging transaction
              91593us average transaction commit time
              826 handles per transaction
              372 blocks per transaction
              373 logged blocks per transaction
            
            bzzz Alex Zhuravlev added a comment -

            1858 transactions (1850 requested), each up to 262144 blocks
            average: 
              0ms waiting for transaction
              0ms request delay
              4988ms running transaction
              8ms transaction was being locked
              0ms flushing data (in ordered mode)
              32ms logging transaction
              43059us average transaction commit time
              392638 handles per transaction
              49 blocks per transaction
              50 logged blocks per transaction
            

            Thanks a lot! And, if possible, the same with LU-14688 applied, if you have a few spare cycles. Thanks in advance!

            I think this confirms the theory that basically "deregistering" is CPU bound and produces a lot of tiny transactions which aren't checkpointed (I guess there is no need to do so - no memory pressure, as the number of modified blocks is tiny). Given they aren't checkpointed by the time of the crash, JBD has to replay them all and needs to skip revoked blocks, so it fills and looks up a big revoke table.
            The situation changes with LU-14688, as the process is less CPU-bound and generates bigger transactions, and (somehow) transactions get checkpointed more frequently, leaving less to replay.

            The important question here is whether we still need to fix JBD. I tend to think so, as there are other use cases where external processes (like llsom_sync) may consume and clear the changelog at a high rate, and this would result in tiny transactions, as before the LU-14688 patch.


            Here the "dumpe2fs -h" after recovery: dumpe2fs.txt


            eaujames Etienne Aujames added a comment -

            Concerning the difference between the tests with and without the LU-14688 patch, at the end of the changelog_deregister:

            • with the LU-14688 patch: I have 50k on-disk revoke records (~3 revoke records per plain llog)
            • without the LU-14688 patch: I have 30M on-disk revoke records (~1700 revoke records per plain llog)

            People

              Assignee: bzzz Alex Zhuravlev
              Reporter: bzzz Alex Zhuravlev
              Votes: 0
              Watchers: 9
