Lustre / LU-15357

hash.c:1111:cfs_hash_destroy() ASSERTION( !cfs_hash_with_assert_empty(hs) ) failed: hash lu_site_osd-zfs bucket 1(8) is not empty: 1 items left


Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Major
    • Fix Version/s: Lustre 2.15.3
    • Affects Version/s: Lustre 2.12.7
    • Environment: lustre-2.12.7_2.llnl-2
      zfs-0.7.11-9.8llnl.ch6.x86_64
      3.10.0-1160.45.1.1chaos.ch6.x86_64
      rhel7-based
    • Severity: 3

    Description

      Looks like the same issue as LU-10433 to me. Occurred during shutdown of an MDT. The vmcore-dmesg.txt file from the crash dump shows:

      [7949565.740966] LustreError: 4224:0:(osd_handler.c:1351:osd_device_free()) header@ffffa03aa81b09c0[0x4, 1, [0x1:0x0:0x0] hash exist]{
      [7949565.756016] LustreError: 4224:0:(osd_handler.c:1351:osd_device_free()) ....local_storage@ffffa03aa81b0a10
      [7949565.768738] LustreError: 4224:0:(osd_handler.c:1351:osd_device_free()) ....osd-zfs@ffffa03aa6c80140osd-zfs-object@ffffa03aa6c80140
      [7949565.783886] LustreError: 4224:0:(osd_handler.c:1351:osd_device_free()) } header@ffffa03aa81b09c0
      [7949565.795741] LustreError: 4224:0:(osd_handler.c:1351:osd_device_free()) header@ffffa03aa81b00c0[0x4, 1, [0xa:0x0:0x0] hash exist]{
      [7949565.810788] LustreError: 4224:0:(osd_handler.c:1351:osd_device_free()) ....local_storage@ffffa03aa81b0110
      [7949565.823509] LustreError: 4224:0:(osd_handler.c:1351:osd_device_free()) ....osd-zfs@ffffa03aa6c803c0osd-zfs-object@ffffa03aa6c803c0
      [7949565.838649] LustreError: 4224:0:(osd_handler.c:1351:osd_device_free()) } header@ffffa03aa81b00c0
      [7949565.850558] LustreError: 4224:0:(hash.c:1111:cfs_hash_destroy()) ASSERTION( !cfs_hash_with_assert_empty(hs) ) failed: hash lu_site_osd-zfs bucket 1(3) is not  empty: 1 items left
      [7949565.868513] LustreError: 4224:0:(hash.c:1111:cfs_hash_destroy()) LBUG
      [7949565.875900] Pid: 4224, comm: umount 3.10.0-1160.36.2.1chaos.ch6.x86_64 #1 SMP Wed Jul 21 15:34:23 PDT 2021
      [7949565.886871] Call Trace:
      [7949565.889807]  [<ffffffffc12407ec>] libcfs_call_trace+0x8c/0xd0 [libcfs]
      [7949565.897316]  [<ffffffffc12408ac>] lbug_with_loc+0x4c/0xa0 [libcfs]
      [7949565.904423]  [<ffffffffc124f85c>] cfs_hash_putref+0x3cc/0x520 [libcfs]
      [7949565.911928]  [<ffffffffc1529e54>] lu_site_fini+0x54/0xa0 [obdclass]
      [7949565.919162]  [<ffffffffc133d0cb>] osd_device_free+0x9b/0x2e0 [osd_zfs]
      [7949565.926665]  [<ffffffffc14fcf82>] class_free_dev+0x4c2/0x720 [obdclass]
      [7949565.934267]  [<ffffffffc14fd3e0>] class_export_put+0x200/0x2d0 [obdclass]
      [7949565.942059]  [<ffffffffc14fef05>] class_unlink_export+0x145/0x180 [obdclass]
      [7949565.950159]  [<ffffffffc1514990>] class_decref+0x80/0x160 [obdclass]
      [7949565.950169]  [<ffffffffc1514e13>] class_detach+0x1d3/0x300 [obdclass]
      [7949565.950179]  [<ffffffffc151bae8>] class_process_config+0x1a38/0x2830 [obdclass]
      [7949565.950189]  [<ffffffffc151cac0>] class_manual_cleanup+0x1e0/0x710 [obdclass]
      [7949565.950197]  [<ffffffffc133cd15>] osd_obd_disconnect+0x165/0x1a0 [osd_zfs]
      [7949565.950208]  [<ffffffffc1526cc6>] lustre_put_lsi+0x106/0x4d0 [obdclass]
      [7949565.950217]  [<ffffffffc1527200>] lustre_common_put_super+0x170/0x270 [obdclass]
      [7949565.950230]  [<ffffffffc154ea00>] server_put_super+0x120/0xd00 [obdclass]
      [7949565.950235]  [<ffffffffbca61e5d>] generic_shutdown_super+0x6d/0x110
      [7949565.950236]  [<ffffffffbca61f12>] kill_anon_super+0x12/0x20
      [7949565.950246]  [<ffffffffc151f6b2>] lustre_kill_super+0x32/0x50 [obdclass]
      [7949565.950247]  [<ffffffffbca6189e>] deactivate_locked_super+0x4e/0x70
      [7949565.950248]  [<ffffffffbca61906>] deactivate_super+0x46/0x60
      [7949565.950251]  [<ffffffffbca8372f>] cleanup_mnt+0x3f/0x80
      [7949565.950253]  [<ffffffffbca837c2>] __cleanup_mnt+0x12/0x20
      [7949565.950255]  [<ffffffffbc8c7cab>] task_work_run+0xbb/0xf0
      [7949565.950258]  [<ffffffffbc82dd95>] do_notify_resume+0xa5/0xc0
      [7949565.950262]  [<ffffffffbcfc44ef>] int_signal+0x12/0x17
      [7949565.950293]  [<ffffffffffffffff>] 0xffffffffffffffff
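
      For background on what the assertion is checking: per the stack trace above, lu_site_fini() drops the final reference on the site's object hash via cfs_hash_putref(), and cfs_hash_destroy() asserts that every bucket is empty; the leftover local_storage objects printed by osd_device_free() are what trip it. Below is a minimal userspace sketch of that assert-empty-on-destroy pattern (an illustrative model only, not Lustre source; struct site, site_add(), site_put() and site_fini() are hypothetical names), showing how a single unreleased object reference at unmount time becomes an "is not empty: 1 items left" abort:

      /*
       * Illustrative userspace model (not Lustre source) of the
       * "assert the hash is empty on destroy" pattern.
       */
      #include <assert.h>
      #include <stdio.h>
      #include <stdlib.h>

      struct obj {
              int refcount;
              struct obj *next;       /* hash-bucket chain */
      };

      struct site {
              struct obj *bucket;     /* single bucket for brevity */
              int nr_items;
      };

      /* Insert an object into the site's hash, as a lookup/insert path would. */
      static struct obj *site_add(struct site *s)
      {
              struct obj *o = calloc(1, sizeof(*o));

              o->refcount = 1;
              o->next = s->bucket;
              s->bucket = o;
              s->nr_items++;
              return o;
      }

      /* Drop a reference; the object is unhashed and freed on the last put. */
      static void site_put(struct site *s, struct obj *o)
      {
              struct obj **pp = &s->bucket;

              if (--o->refcount > 0)
                      return;
              while (*pp != o)
                      pp = &(*pp)->next;
              *pp = o->next;
              s->nr_items--;
              free(o);
      }

      /*
       * Teardown path, analogous to lu_site_fini() -> cfs_hash_putref() ->
       * cfs_hash_destroy(): every object must already have been released,
       * otherwise the destroy path asserts, which the real code reports as
       * "bucket ... is not empty: N items left" followed by an LBUG.
       */
      static void site_fini(struct site *s)
      {
              if (s->nr_items != 0)
                      fprintf(stderr, "bucket is not empty: %d items left\n",
                              s->nr_items);
              assert(s->nr_items == 0);   /* models the assertion in cfs_hash_destroy() */
      }

      int main(void)
      {
              struct site s = { 0 };
              struct obj *a = site_add(&s);
              struct obj *leaked = site_add(&s);

              site_put(&s, a);
              (void)leaked;           /* reference never dropped: simulates the leak */
              site_fini(&s);          /* aborts here, like the LBUG during umount */
              return 0;
      }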
      

      We had never seen this before, but started seeing it frequently on 2021-11-10. Before that, there were two changes that may be related:
      1. We enabled changelogs on the filesystem ("brass") and started consuming them.
      2. We updated the system to the above-mentioned kernel and Lustre versions. I'll find the prior versions and post them.

      We've seen this crash 5 times so far (in about 1 month), when failing MDTs over so we could perform maintenance on the nodes.

      For Lustre patch stack, see https://github.com/LLNL/lustre/releases/tag/2.12.7_2.llnl
      For ZFS patch stack, see https://github.com/LLNL/zfs/releases/tag/zfs-0.7.11-8llnl
      For SPL patch stack, see https://github.com/LLNL/spl/releases/tag/spl-0.7.11-8llnl
      (Note that the spl/zfs RPM versions are misleading; it really is the spl- and zfs-0.7.11-8llnl tags that were used to build those RPMs.)

    People

      Assignee: Mikhail Pershin (tappro)
      Reporter: Olaf Faaland (ofaaland)
      Votes: 0
      Watchers: 6
