Uploaded image for project: 'Lustre'
  1. Lustre
  2. LU-15357

hash.c:1111:cfs_hash_destroy() ASSERTION( !cfs_hash_with_assert_empty(hs) ) failed: hash lu_site_osd-zfs bucket 1(8) is not empty: 1 items left

Details

    • Bug
    • Resolution: Fixed
    • Major
    • Lustre 2.16.0, Lustre 2.15.3
    • Lustre 2.12.7
    • lustre-2.12.7_2.llnl-2
      zfs-0.7.11-9.8llnl.ch6.x86_64
      3.10.0-1160.45.1.1chaos.ch6.x86_64
      rhel7-based
    • 3
    • 9223372036854775807

    Description

      Looks like the same issue as LU-10433 to me. Occurred during shutdown of an MDT. vmcore-dmesg.txt file from crash dump shows:

      [7949565.740966] LustreError: 4224:0:(osd_handler.c:1351:osd_device_free()) header@ffffa03aa81b09c0[0x4, 1, [0x1:0x0:0x0] hash exist]{
      
      [7949565.756016] LustreError: 4224:0:(osd_handler.c:1351:osd_device_free()) ....local_storage@ffffa03aa81b0a10
      
      [7949565.768738] LustreError: 4224:0:(osd_handler.c:1351:osd_device_free()) ....osd-zfs@ffffa03aa6c80140osd-zfs-object@ffffa03aa6c80140
      
      [7949565.783886] LustreError: 4224:0:(osd_handler.c:1351:osd_device_free()) } header@ffffa03aa81b09c0
      
      [7949565.795741] LustreError: 4224:0:(osd_handler.c:1351:osd_device_free()) header@ffffa03aa81b00c0[0x4, 1, [0xa:0x0:0x0] hash exist]{
      
      [7949565.810788] LustreError: 4224:0:(osd_handler.c:1351:osd_device_free()) ....local_storage@ffffa03aa81b0110
      
      [7949565.823509] LustreError: 4224:0:(osd_handler.c:1351:osd_device_free()) ....osd-zfs@ffffa03aa6c803c0osd-zfs-object@ffffa03aa6c803c0
      
      [7949565.838649] LustreError: 4224:0:(osd_handler.c:1351:osd_device_free()) } header@ffffa03aa81b00c0
      
      [7949565.850558] LustreError: 4224:0:(hash.c:1111:cfs_hash_destroy()) ASSERTION( !cfs_hash_with_assert_empty(hs) ) failed: hash lu_site_osd-zfs bucket 1(3) is not  empty: 1 items left
      [7949565.868513] LustreError: 4224:0:(hash.c:1111:cfs_hash_destroy()) LBUG
      [7949565.875900] Pid: 4224, comm: umount 3.10.0-1160.36.2.1chaos.ch6.x86_64 #1 SMP Wed Jul 21 15:34:23 PDT 2021
      [7949565.886871] Call Trace:
      [7949565.889807]  [<ffffffffc12407ec>] libcfs_call_trace+0x8c/0xd0 [libcfs]
      [7949565.897316]  [<ffffffffc12408ac>] lbug_with_loc+0x4c/0xa0 [libcfs]
      [7949565.904423]  [<ffffffffc124f85c>] cfs_hash_putref+0x3cc/0x520 [libcfs]
      [7949565.911928]  [<ffffffffc1529e54>] lu_site_fini+0x54/0xa0 [obdclass]
      [7949565.919162]  [<ffffffffc133d0cb>] osd_device_free+0x9b/0x2e0 [osd_zfs]
      [7949565.926665]  [<ffffffffc14fcf82>] class_free_dev+0x4c2/0x720 [obdclass]
      [7949565.934267]  [<ffffffffc14fd3e0>] class_export_put+0x200/0x2d0 [obdclass]
      [7949565.942059]  [<ffffffffc14fef05>] class_unlink_export+0x145/0x180 [obdclass]
      [7949565.950159]  [<ffffffffc1514990>] class_decref+0x80/0x160 [obdclass]
      [7949565.950169]  [<ffffffffc1514e13>] class_detach+0x1d3/0x300 [obdclass]
      [7949565.950179]  [<ffffffffc151bae8>] class_process_config+0x1a38/0x2830 [obdclass]
      [7949565.950189]  [<ffffffffc151cac0>] class_manual_cleanup+0x1e0/0x710 [obdclass]
      [7949565.950197]  [<ffffffffc133cd15>] osd_obd_disconnect+0x165/0x1a0 [osd_zfs]
      [7949565.950208]  [<ffffffffc1526cc6>] lustre_put_lsi+0x106/0x4d0 [obdclass]
      [7949565.950217]  [<ffffffffc1527200>] lustre_common_put_super+0x170/0x270 [obdclass]
      [7949565.950230]  [<ffffffffc154ea00>] server_put_super+0x120/0xd00 [obdclass]
      [7949565.950235]  [<ffffffffbca61e5d>] generic_shutdown_super+0x6d/0x110
      [7949565.950236]  [<ffffffffbca61f12>] kill_anon_super+0x12/0x20
      [7949565.950246]  [<ffffffffc151f6b2>] lustre_kill_super+0x32/0x50 [obdclass]
      [7949565.950247]  [<ffffffffbca6189e>] deactivate_locked_super+0x4e/0x70
      [7949565.950248]  [<ffffffffbca61906>] deactivate_super+0x46/0x60
      [7949565.950251]  [<ffffffffbca8372f>] cleanup_mnt+0x3f/0x80
      [7949565.950253]  [<ffffffffbca837c2>] __cleanup_mnt+0x12/0x20
      [7949565.950255]  [<ffffffffbc8c7cab>] task_work_run+0xbb/0xf0
      [7949565.950258]  [<ffffffffbc82dd95>] do_notify_resume+0xa5/0xc0
      [7949565.950262]  [<ffffffffbcfc44ef>] int_signal+0x12/0x17
      [7949565.950293]  [<ffffffffffffffff>] 0xffffffffffffffff
      

      We had never seen this before, but started seeing it frequently 2021-11-10. Before that there were two changes that may be related:
      1. We enabled changelogs on the filesystem ("brass") and started consuming them
      2. We updated the system to the above-mentioned kernel and lustre versions. I'll find the prior versions and post them.

      We've seen this crash 5 times so far (in about 1 month), when failing MDTs over for maintenance on the nodes.

      For Lustre patch stack, see https://github.com/LLNL/lustre/releases/tag/2.12.7_2.llnl
      For ZFS patch stack, see https://github.com/LLNL/zfs/releases/tag/zfs-0.7.11-8llnl
      For SPL patch stack, see https://github.com/LLNL/spl/releases/tag/spl-0.7.11-8llnl
      (Note that the spl/zfs rpm version is misleading, it really is the spl- and zfs-0.7.11-8llnl tag that was used to build those rpms)

      Attachments

        Issue Links

          Activity

            [LU-15357] hash.c:1111:cfs_hash_destroy() ASSERTION( !cfs_hash_with_assert_empty(hs) ) failed: hash lu_site_osd-zfs bucket 1(8) is not empty: 1 items left

            "Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/45872/
            Subject: LU-15357 iokit: fix the obsolete usage of cfg_device
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: a20b78a81d091cebd6b9e6c87537b2c955084cd5

            gerrit Gerrit Updater added a comment - "Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/45872/ Subject: LU-15357 iokit: fix the obsolete usage of cfg_device Project: fs/lustre-release Branch: master Current Patch Set: Commit: a20b78a81d091cebd6b9e6c87537b2c955084cd5
            ofaaland Olaf Faaland added a comment -

            Since the backport was landed to b2_12, and the patch was already in 2.15.0, resolving this ticket. I think the 2.14 backport can be abandoned.

            ofaaland Olaf Faaland added a comment - Since the backport was landed to b2_12, and the patch was already in 2.15.0, resolving this ticket. I think the 2.14 backport can be abandoned.

            Fixing the changelog context leak appears to have resolved the issue at our site.

            ofaaland Olaf Faaland added a comment - Fixing the changelog context leak appears to have resolved the issue at our site.

            "Gian-Carlo DeFazio <defazio1@llnl.gov>" uploaded a new patch: https://review.whamcloud.com/46529
            Subject: LU-15357 mdd: fix changelog context leak
            Project: fs/lustre-release
            Branch: b2_14
            Current Patch Set: 1
            Commit: 8fa010f2202b97e0e4b76e295386ec438dd341b6

            gerrit Gerrit Updater added a comment - "Gian-Carlo DeFazio <defazio1@llnl.gov>" uploaded a new patch: https://review.whamcloud.com/46529 Subject: LU-15357 mdd: fix changelog context leak Project: fs/lustre-release Branch: b2_14 Current Patch Set: 1 Commit: 8fa010f2202b97e0e4b76e295386ec438dd341b6

            "Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/45832/
            Subject: LU-15357 mdd: fix changelog context leak
            Project: fs/lustre-release
            Branch: b2_12
            Current Patch Set:
            Commit: 55fa745234ef0e1a01e0fb0622f1c00ecce047ea

            gerrit Gerrit Updater added a comment - "Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/45832/ Subject: LU-15357 mdd: fix changelog context leak Project: fs/lustre-release Branch: b2_12 Current Patch Set: Commit: 55fa745234ef0e1a01e0fb0622f1c00ecce047ea

            "Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/45831/
            Subject: LU-15357 mdd: fix changelog context leak
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: d083c93c6fd9251d6637d33029049b1d27d2a20a

            gerrit Gerrit Updater added a comment - "Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/45831/ Subject: LU-15357 mdd: fix changelog context leak Project: fs/lustre-release Branch: master Current Patch Set: Commit: d083c93c6fd9251d6637d33029049b1d27d2a20a

            "Mike Pershin <mpershin@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/45872
            Subject: LU-15357 iokit: fix the obsolete usage of cfg_device
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: be87f89f6a6c72b101f7fd6792d4ed679f90de0e

            gerrit Gerrit Updater added a comment - "Mike Pershin <mpershin@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/45872 Subject: LU-15357 iokit: fix the obsolete usage of cfg_device Project: fs/lustre-release Branch: master Current Patch Set: 1 Commit: be87f89f6a6c72b101f7fd6792d4ed679f90de0e

            This is the most obvious source of problem which is worth to check first

            tappro Mikhail Pershin added a comment - This is the most obvious source of problem which is worth to check first

            "Mike Pershin <mpershin@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/45832
            Subject: LU-15357 mdd: fix changelog context leak
            Project: fs/lustre-release
            Branch: b2_12
            Current Patch Set: 1
            Commit: eccb58ff0700dfea617880dc22cec8ab8ce883ea

            gerrit Gerrit Updater added a comment - "Mike Pershin <mpershin@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/45832 Subject: LU-15357 mdd: fix changelog context leak Project: fs/lustre-release Branch: b2_12 Current Patch Set: 1 Commit: eccb58ff0700dfea617880dc22cec8ab8ce883ea

            "Mike Pershin <mpershin@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/45831
            Subject: LU-15357 mdd: fix changelog context leak
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 4e3f19eb13650098c1f3b459b2273b45523869fc

            gerrit Gerrit Updater added a comment - "Mike Pershin <mpershin@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/45831 Subject: LU-15357 mdd: fix changelog context leak Project: fs/lustre-release Branch: master Current Patch Set: 1 Commit: 4e3f19eb13650098c1f3b459b2273b45523869fc
            tappro Mikhail Pershin added a comment - - edited

            Just few notes about the issue. First of all, the same issue may occur with different stack trace in newer code:

            [23370.525027] LustreError: 29448:0:(lu_object.c:1027:lu_site_print()) header@ffff90577d346600[0x4, 1, [0x200000003:0x0:0x0] hash exist]{
            
            [23370.527626] LustreError: 29448:0:(lu_object.c:1027:lu_site_print()) ....local_storage@ffff90577d346658
            
            [23370.529614] LustreError: 29448:0:(lu_object.c:1027:lu_site_print()) ....osd-ldiskfs@ffff9057177f8000osd-ldiskfs-object@ffff9057177f8000(i:ffff90575898f9e8:117/1506646403)[plain]
            
            [23370.532701] LustreError: 29448:0:(lu_object.c:1027:lu_site_print()) } header@ffff90577d346600
            
            [23370.534628] LustreError: 29448:0:(lu_object.c:1297:lu_device_fini()) ASSERTION( atomic_read(&d->ld_ref) == 0 ) failed: Refcount is 1
            [23370.536881] LustreError: 29448:0:(lu_object.c:1297:lu_device_fini()) LBUG
            [23370.538240] Pid: 29448, comm: umount 3.10.0-1160.25.1.el7_lustre.ddn13.x86_64 #1 SMP Wed May 19 03:51:33 UTC 2021
            [23370.540143] Call Trace:
            [23370.540789] [<0>] libcfs_call_trace+0x90/0xf0 [libcfs]
            [23370.541758] [<0>] lbug_with_loc+0x4c/0xa0 [libcfs]
            [23370.542946] [<0>] lu_device_fini+0xbb/0xc0 [obdclass]
            [23370.543961] [<0>] dt_device_fini+0xe/0x10 [obdclass]
            [23370.545011] [<0>] osd_device_free+0xae/0x2c0 [osd_ldiskfs]
            [23370.546099] [<0>] class_free_dev+0x4c2/0x720 [obdclass]

             the reason is the using rhashtable instead of cfs_hash, so ASSERTION is in lu_device_fini() but the same problem.

            As for our case, this is due to changelog enabled most likely, because local storage FIDs are: [0x1:0x0:0x0] and [0xa:0x0:0x0] which are FID_SEQ_LLOG and FID_SEQ_LLOG_NAME respectively, so it looks like changelog llog object wasn't properly put or changelog subsystem is still in use. I am going to check related issues in that area first

             

            tappro Mikhail Pershin added a comment - - edited Just few notes about the issue. First of all, the same issue may occur with different stack trace in newer code: [23370.525027] LustreError: 29448:0:(lu_object.c:1027:lu_site_print()) header@ffff90577d346600[0x4, 1, [0x200000003:0x0:0x0] hash exist]{ [23370.527626] LustreError: 29448:0:(lu_object.c:1027:lu_site_print()) ....local_storage@ffff90577d346658 [23370.529614] LustreError: 29448:0:(lu_object.c:1027:lu_site_print()) ....osd-ldiskfs@ffff9057177f8000osd-ldiskfs-object@ffff9057177f8000(i:ffff90575898f9e8:117/1506646403)[plain] [23370.532701] LustreError: 29448:0:(lu_object.c:1027:lu_site_print()) } header@ffff90577d346600 [23370.534628] LustreError: 29448:0:(lu_object.c:1297:lu_device_fini()) ASSERTION( atomic_read(&d->ld_ref) == 0 ) failed: Refcount is 1 [23370.536881] LustreError: 29448:0:(lu_object.c:1297:lu_device_fini()) LBUG [23370.538240] Pid: 29448, comm: umount 3.10.0-1160.25.1.el7_lustre.ddn13.x86_64 #1 SMP Wed May 19 03:51:33 UTC 2021 [23370.540143] Call Trace: [23370.540789] [<0>] libcfs_call_trace+0x90/0xf0 [libcfs] [23370.541758] [<0>] lbug_with_loc+0x4c/0xa0 [libcfs] [23370.542946] [<0>] lu_device_fini+0xbb/0xc0 [obdclass] [23370.543961] [<0>] dt_device_fini+0xe/0x10 [obdclass] [23370.545011] [<0>] osd_device_free+0xae/0x2c0 [osd_ldiskfs] [23370.546099] [<0>] class_free_dev+0x4c2/0x720 [obdclass]  the reason is the using rhashtable instead of cfs_hash, so ASSERTION is in lu_device_fini() but the same problem. As for our case, this is due to changelog enabled most likely, because local storage FIDs are: [0x1:0x0:0x0] and [0xa:0x0:0x0] which are FID_SEQ_LLOG and FID_SEQ_LLOG_NAME respectively, so it looks like changelog llog object wasn't properly put or changelog subsystem is still in use. I am going to check related issues in that area first  

            People

              tappro Mikhail Pershin
              ofaaland Olaf Faaland
              Votes:
              0 Vote for this issue
              Watchers:
              6 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: