Uploaded image for project: 'Lustre'
  1. Lustre
  2. LU-15357

hash.c:1111:cfs_hash_destroy() ASSERTION( !cfs_hash_with_assert_empty(hs) ) failed: hash lu_site_osd-zfs bucket 1(8) is not empty: 1 items left

Details

    • Bug
    • Resolution: Fixed
    • Major
    • Lustre 2.16.0, Lustre 2.15.3
    • Lustre 2.12.7
    • lustre-2.12.7_2.llnl-2
      zfs-0.7.11-9.8llnl.ch6.x86_64
      3.10.0-1160.45.1.1chaos.ch6.x86_64
      rhel7-based
    • 3
    • 9223372036854775807

    Description

      Looks like the same issue as LU-10433 to me. Occurred during shutdown of an MDT. vmcore-dmesg.txt file from crash dump shows:

      [7949565.740966] LustreError: 4224:0:(osd_handler.c:1351:osd_device_free()) header@ffffa03aa81b09c0[0x4, 1, [0x1:0x0:0x0] hash exist]{
      
      [7949565.756016] LustreError: 4224:0:(osd_handler.c:1351:osd_device_free()) ....local_storage@ffffa03aa81b0a10
      
      [7949565.768738] LustreError: 4224:0:(osd_handler.c:1351:osd_device_free()) ....osd-zfs@ffffa03aa6c80140osd-zfs-object@ffffa03aa6c80140
      
      [7949565.783886] LustreError: 4224:0:(osd_handler.c:1351:osd_device_free()) } header@ffffa03aa81b09c0
      
      [7949565.795741] LustreError: 4224:0:(osd_handler.c:1351:osd_device_free()) header@ffffa03aa81b00c0[0x4, 1, [0xa:0x0:0x0] hash exist]{
      
      [7949565.810788] LustreError: 4224:0:(osd_handler.c:1351:osd_device_free()) ....local_storage@ffffa03aa81b0110
      
      [7949565.823509] LustreError: 4224:0:(osd_handler.c:1351:osd_device_free()) ....osd-zfs@ffffa03aa6c803c0osd-zfs-object@ffffa03aa6c803c0
      
      [7949565.838649] LustreError: 4224:0:(osd_handler.c:1351:osd_device_free()) } header@ffffa03aa81b00c0
      
      [7949565.850558] LustreError: 4224:0:(hash.c:1111:cfs_hash_destroy()) ASSERTION( !cfs_hash_with_assert_empty(hs) ) failed: hash lu_site_osd-zfs bucket 1(3) is not  empty: 1 items left
      [7949565.868513] LustreError: 4224:0:(hash.c:1111:cfs_hash_destroy()) LBUG
      [7949565.875900] Pid: 4224, comm: umount 3.10.0-1160.36.2.1chaos.ch6.x86_64 #1 SMP Wed Jul 21 15:34:23 PDT 2021
      [7949565.886871] Call Trace:
      [7949565.889807]  [<ffffffffc12407ec>] libcfs_call_trace+0x8c/0xd0 [libcfs]
      [7949565.897316]  [<ffffffffc12408ac>] lbug_with_loc+0x4c/0xa0 [libcfs]
      [7949565.904423]  [<ffffffffc124f85c>] cfs_hash_putref+0x3cc/0x520 [libcfs]
      [7949565.911928]  [<ffffffffc1529e54>] lu_site_fini+0x54/0xa0 [obdclass]
      [7949565.919162]  [<ffffffffc133d0cb>] osd_device_free+0x9b/0x2e0 [osd_zfs]
      [7949565.926665]  [<ffffffffc14fcf82>] class_free_dev+0x4c2/0x720 [obdclass]
      [7949565.934267]  [<ffffffffc14fd3e0>] class_export_put+0x200/0x2d0 [obdclass]
      [7949565.942059]  [<ffffffffc14fef05>] class_unlink_export+0x145/0x180 [obdclass]
      [7949565.950159]  [<ffffffffc1514990>] class_decref+0x80/0x160 [obdclass]
      [7949565.950169]  [<ffffffffc1514e13>] class_detach+0x1d3/0x300 [obdclass]
      [7949565.950179]  [<ffffffffc151bae8>] class_process_config+0x1a38/0x2830 [obdclass]
      [7949565.950189]  [<ffffffffc151cac0>] class_manual_cleanup+0x1e0/0x710 [obdclass]
      [7949565.950197]  [<ffffffffc133cd15>] osd_obd_disconnect+0x165/0x1a0 [osd_zfs]
      [7949565.950208]  [<ffffffffc1526cc6>] lustre_put_lsi+0x106/0x4d0 [obdclass]
      [7949565.950217]  [<ffffffffc1527200>] lustre_common_put_super+0x170/0x270 [obdclass]
      [7949565.950230]  [<ffffffffc154ea00>] server_put_super+0x120/0xd00 [obdclass]
      [7949565.950235]  [<ffffffffbca61e5d>] generic_shutdown_super+0x6d/0x110
      [7949565.950236]  [<ffffffffbca61f12>] kill_anon_super+0x12/0x20
      [7949565.950246]  [<ffffffffc151f6b2>] lustre_kill_super+0x32/0x50 [obdclass]
      [7949565.950247]  [<ffffffffbca6189e>] deactivate_locked_super+0x4e/0x70
      [7949565.950248]  [<ffffffffbca61906>] deactivate_super+0x46/0x60
      [7949565.950251]  [<ffffffffbca8372f>] cleanup_mnt+0x3f/0x80
      [7949565.950253]  [<ffffffffbca837c2>] __cleanup_mnt+0x12/0x20
      [7949565.950255]  [<ffffffffbc8c7cab>] task_work_run+0xbb/0xf0
      [7949565.950258]  [<ffffffffbc82dd95>] do_notify_resume+0xa5/0xc0
      [7949565.950262]  [<ffffffffbcfc44ef>] int_signal+0x12/0x17
      [7949565.950293]  [<ffffffffffffffff>] 0xffffffffffffffff
      

      We had never seen this before, but started seeing it frequently 2021-11-10. Before that there were two changes that may be related:
      1. We enabled changelogs on the filesystem ("brass") and started consuming them
      2. We updated the system to the above-mentioned kernel and lustre versions. I'll find the prior versions and post them.

      We've seen this crash 5 times so far (in about 1 month), when failing MDTs over for maintenance on the nodes.

      For Lustre patch stack, see https://github.com/LLNL/lustre/releases/tag/2.12.7_2.llnl
      For ZFS patch stack, see https://github.com/LLNL/zfs/releases/tag/zfs-0.7.11-8llnl
      For SPL patch stack, see https://github.com/LLNL/spl/releases/tag/spl-0.7.11-8llnl
      (Note that the spl/zfs rpm version is misleading, it really is the spl- and zfs-0.7.11-8llnl tag that was used to build those rpms)

      Attachments

        Issue Links

          Activity

            [LU-15357] hash.c:1111:cfs_hash_destroy() ASSERTION( !cfs_hash_with_assert_empty(hs) ) failed: hash lu_site_osd-zfs bucket 1(8) is not empty: 1 items left

            "Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/49566/
            Subject: LU-15357 iokit: fix the obsolete usage of cfg_device
            Project: fs/lustre-release
            Branch: b2_15
            Current Patch Set:
            Commit: 1e7624b711e173baf893cec0bfc687ddee000fc7

            gerrit Gerrit Updater added a comment - "Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/49566/ Subject: LU-15357 iokit: fix the obsolete usage of cfg_device Project: fs/lustre-release Branch: b2_15 Current Patch Set: Commit: 1e7624b711e173baf893cec0bfc687ddee000fc7

            "Jian Yu <yujian@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/49566
            Subject: LU-15357 iokit: fix the obsolete usage of cfg_device
            Project: fs/lustre-release
            Branch: b2_15
            Current Patch Set: 1
            Commit: b02fcbd3e8763173fc997e2bb49bc06d8855fb67

            gerrit Gerrit Updater added a comment - "Jian Yu <yujian@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/49566 Subject: LU-15357 iokit: fix the obsolete usage of cfg_device Project: fs/lustre-release Branch: b2_15 Current Patch Set: 1 Commit: b02fcbd3e8763173fc997e2bb49bc06d8855fb67

            "Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/45872/
            Subject: LU-15357 iokit: fix the obsolete usage of cfg_device
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: a20b78a81d091cebd6b9e6c87537b2c955084cd5

            gerrit Gerrit Updater added a comment - "Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/45872/ Subject: LU-15357 iokit: fix the obsolete usage of cfg_device Project: fs/lustre-release Branch: master Current Patch Set: Commit: a20b78a81d091cebd6b9e6c87537b2c955084cd5
            ofaaland Olaf Faaland added a comment -

            Since the backport was landed to b2_12, and the patch was already in 2.15.0, resolving this ticket. I think the 2.14 backport can be abandoned.

            ofaaland Olaf Faaland added a comment - Since the backport was landed to b2_12, and the patch was already in 2.15.0, resolving this ticket. I think the 2.14 backport can be abandoned.

            Fixing the changelog context leak appears to have resolved the issue at our site.

            ofaaland Olaf Faaland added a comment - Fixing the changelog context leak appears to have resolved the issue at our site.

            "Gian-Carlo DeFazio <defazio1@llnl.gov>" uploaded a new patch: https://review.whamcloud.com/46529
            Subject: LU-15357 mdd: fix changelog context leak
            Project: fs/lustre-release
            Branch: b2_14
            Current Patch Set: 1
            Commit: 8fa010f2202b97e0e4b76e295386ec438dd341b6

            gerrit Gerrit Updater added a comment - "Gian-Carlo DeFazio <defazio1@llnl.gov>" uploaded a new patch: https://review.whamcloud.com/46529 Subject: LU-15357 mdd: fix changelog context leak Project: fs/lustre-release Branch: b2_14 Current Patch Set: 1 Commit: 8fa010f2202b97e0e4b76e295386ec438dd341b6

            "Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/45832/
            Subject: LU-15357 mdd: fix changelog context leak
            Project: fs/lustre-release
            Branch: b2_12
            Current Patch Set:
            Commit: 55fa745234ef0e1a01e0fb0622f1c00ecce047ea

            gerrit Gerrit Updater added a comment - "Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/45832/ Subject: LU-15357 mdd: fix changelog context leak Project: fs/lustre-release Branch: b2_12 Current Patch Set: Commit: 55fa745234ef0e1a01e0fb0622f1c00ecce047ea

            "Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/45831/
            Subject: LU-15357 mdd: fix changelog context leak
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: d083c93c6fd9251d6637d33029049b1d27d2a20a

            gerrit Gerrit Updater added a comment - "Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/45831/ Subject: LU-15357 mdd: fix changelog context leak Project: fs/lustre-release Branch: master Current Patch Set: Commit: d083c93c6fd9251d6637d33029049b1d27d2a20a

            "Mike Pershin <mpershin@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/45872
            Subject: LU-15357 iokit: fix the obsolete usage of cfg_device
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: be87f89f6a6c72b101f7fd6792d4ed679f90de0e

            gerrit Gerrit Updater added a comment - "Mike Pershin <mpershin@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/45872 Subject: LU-15357 iokit: fix the obsolete usage of cfg_device Project: fs/lustre-release Branch: master Current Patch Set: 1 Commit: be87f89f6a6c72b101f7fd6792d4ed679f90de0e

            This is the most obvious source of problem which is worth to check first

            tappro Mikhail Pershin added a comment - This is the most obvious source of problem which is worth to check first

            People

              tappro Mikhail Pershin
              ofaaland Olaf Faaland
              Votes:
              0 Vote for this issue
              Watchers:
              6 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: