[LU-18354] sanity test_136: ZFS crash due to OOM/NULL pointer deref

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Minor
    • Fix Version/s: Lustre 2.17.0
    • Severity: 3

    Description

      This issue was created by maloo for Andreas Dilger <adilger@whamcloud.com>

      This issue relates to the following test suite run: https://testing.whamcloud.com/test_sets/37653b0d-00b2-48d6-b8ce-0c09bb5f3f0a

      test_136 failed with the following error:

      trevis-99vm4 crashed during sanity test_136
      

      Test session details:
      clients: https://build.whamcloud.com/job/lustre-master-patchless/772 - 4.18.0-240.22.1.el8_3.x86_64
      servers: https://build.whamcloud.com/job/lustre-master-patchless/772 - 4.18.0-240.22.1.el8_3.x86_64

      It looks like this has been crashing for a long time, but only in full test sessions, because the test runs only with "SLOW=y" and is skipped otherwise. The test itself allocates and immediately deletes about 150k files in a loop.

      The first failure is on 2023-04-22 with commit v2_15_55-90-g73a7b1c2a3, and it looks like the early test failures are all with ZFS hitting OOM:
      https://testing.whamcloud.com/test_sets/37653b0d-00b2-48d6-b8ce-0c09bb5f3f0a

      but by 2023-07-01 they are hitting a NULL pointer dereference (likely also an allocation failure before OOM) in the ZFS inode handling:
      https://testing.whamcloud.com/test_sets/855d00e9-7906-465b-96cf-f7cfbc08c16a

      [19856.663841] BUG: unable to handle kernel NULL pointer dereference at 0000000000000028
      [19856.673442] CPU: 1 PID: 851991 Comm: dp_sync_taskq 4.18.0-477.10.1.el8_lustre.x86_64 #1
      [19856.675740] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
      [19856.676816] RIP: 0010:arc_write+0xf5/0x460 [zfs]
      [19856.692605] Call Trace:
      [19856.696922]  dbuf_write+0x2ff/0x550 [zfs]
      [19856.700667]  dbuf_sync_leaf+0x137/0x660 [zfs]
      [19856.703284]  dbuf_sync_list+0xcf/0x120 [zfs]
      [19856.704161]  dbuf_sync_indirect+0xe2/0x170 [zfs]
      [19856.705108]  dbuf_sync_list+0xae/0x120 [zfs]
      [19856.705992]  dbuf_sync_indirect+0xe2/0x170 [zfs]
      [19856.706935]  dbuf_sync_list+0xae/0x120 [zfs]
      [19856.707814]  dnode_sync+0x365/0xa20 [zfs]
      [19856.709421]  sync_dnodes_task+0x71/0xa0 [zfs]
      [19856.710341]  taskq_thread+0x2e1/0x510 [spl]
      [19856.712759]  kthread+0x134/0x150
      


          Activity


            "Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/58199/
            Subject: LU-18354 tests: speed up sanity/136 on non-ZFS
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 5f3cb9ccc4dc222520836a0e652d4c5cbbf11ee0


            "Andreas Dilger <adilger@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/58199
            Subject: LU-18354 tests: speed up sanity/136 on non-ZFS
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 37caf6b66a6d712904e4b6fe0f502014ab31fefd

            pjones Peter Jones added a comment -

            Merged for 2.17


            "Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/57036/
            Subject: LU-18354 tests: avoid sanity/136 OOM on ZFS servers
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 627cc62369fcfda5c8893c2aba4ad721c4ef1996


            "Andreas Dilger <adilger@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/57036
            Subject: LU-18354 tests: avoid sanity/136 OOM on ZFS servers
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 3a81beb3de0ccec3283dd19fdcbab55612526c9f

            adilger Andreas Dilger added a comment - - edited

            This looks like it is crashing mainly in interop testing with b2_12 clients and master servers, but occasionally with master client+server and ldiskfs storage.


            stancheff Shaun Tancheff added a comment -

            A KM_SLEEP allocation will never return a NULL pointer (it will block forever instead).

            Also, sizeof(arc_write_callback_t) changed with this commit:

            commit ccec7fbe1c66c5b63a3af9d152403ce43344f4ab
            Author: Alexander Motin <mav@FreeBSD.org>
            Date:   Thu Jun 15 13:49:03 2023 -0400

                Remove ARC/ZIO physdone callbacks.

            That shrinks the structure to 0x28 bytes in 2.2.x.
            This suggests that awcb_buf, which starts at offset 0x28 in 2.1.x, would lie beyond the end of the allocated space in 2.2.x.
            Is it possible that there is a header vs. binary confusion with the loaded ZFS module(s)?
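The layout argument in the comment above can be sketched in plain userspace C. This is only an illustration under stated assumptions: the `demo_*` names are hypothetical, the member order is a guess modeled on the description in the comment (six pointer-sized members in 2.1.x, with the physdone callback dropped in 2.2.x), and nothing here is copied from the ZFS headers. It assumes an LP64 target such as x86_64.

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical callback type, standing in for the real ZFS typedef. */
typedef void demo_done_func_t(void *);

/* ASSUMED 2.1.x-style layout: six pointer-sized members. */
struct demo_awcb_v21 {
	void *awcb_private;
	demo_done_func_t *awcb_ready;
	demo_done_func_t *awcb_children_ready;
	demo_done_func_t *awcb_physdone;
	demo_done_func_t *awcb_done;
	void *awcb_buf;
};

/* ASSUMED 2.2.x-style layout: the physdone member is gone. */
struct demo_awcb_v22 {
	void *awcb_private;
	demo_done_func_t *awcb_ready;
	demo_done_func_t *awcb_children_ready;
	demo_done_func_t *awcb_done;
	void *awcb_buf;
};

/*
 * Return 1 if a member of 'len' bytes at offset 'off' does not fit
 * inside an allocation of 'alloc_size' bytes, 0 otherwise.
 */
int demo_member_overruns(size_t off, size_t len, size_t alloc_size)
{
	return off + len > alloc_size;
}

/*
 * Print what happens if code built against the old layout touches
 * awcb_buf in an object that was sized using the new layout.
 */
void demo_print_layout_report(void)
{
	size_t old_off = offsetof(struct demo_awcb_v21, awcb_buf);
	size_t new_size = sizeof(struct demo_awcb_v22);

	printf("old awcb_buf offset: 0x%zx\n", old_off);
	printf("new struct size:     0x%zx\n", new_size);
	printf("old-layout access overruns new allocation: %d\n",
	    demo_member_overruns(old_off, sizeof(void *), new_size));
}
```

Calling `demo_print_layout_report()` from a small `main()` shows whether the old `awcb_buf` offset still falls inside an object sized by the new definition. If it does not, a module built against one set of headers and loaded next to a binary built from the other would write past the end of the allocation.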

            adilger Andreas Dilger added a comment -

            The most recent failures are when running zfs-2.1.15 on the servers, and the first failures were using zfs-2.1.5.

            Looking at arc_write(), it appears to allocate memory and dereference it without any checking:

                    callback = kmem_zalloc(sizeof (arc_write_callback_t), KM_SLEEP);
                    callback->awcb_ready = ready;
                    callback->awcb_children_ready = children_ready;
                    callback->awcb_done = done;
                    callback->awcb_private = private;

            People

              Assignee: adilger Andreas Dilger
              Reporter: maloo Maloo