Lustre — LU-15357

hash.c:1111:cfs_hash_destroy() ASSERTION( !cfs_hash_with_assert_empty(hs) ) failed: hash lu_site_osd-zfs bucket 1(8) is not empty: 1 items left

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Major
    • Fix Version/s: Lustre 2.16.0, Lustre 2.15.3
    • Affects Version/s: Lustre 2.12.7
    • Environment: lustre-2.12.7_2.llnl-2
      zfs-0.7.11-9.8llnl.ch6.x86_64
      3.10.0-1160.45.1.1chaos.ch6.x86_64
      RHEL7-based
    • Severity: 3

    Description

      Looks like the same issue as LU-10433 to me. Occurred during shutdown of an MDT. vmcore-dmesg.txt file from crash dump shows:

      [7949565.740966] LustreError: 4224:0:(osd_handler.c:1351:osd_device_free()) header@ffffa03aa81b09c0[0x4, 1, [0x1:0x0:0x0] hash exist]{
      [7949565.756016] LustreError: 4224:0:(osd_handler.c:1351:osd_device_free()) ....local_storage@ffffa03aa81b0a10
      [7949565.768738] LustreError: 4224:0:(osd_handler.c:1351:osd_device_free()) ....osd-zfs@ffffa03aa6c80140osd-zfs-object@ffffa03aa6c80140
      [7949565.783886] LustreError: 4224:0:(osd_handler.c:1351:osd_device_free()) } header@ffffa03aa81b09c0
      [7949565.795741] LustreError: 4224:0:(osd_handler.c:1351:osd_device_free()) header@ffffa03aa81b00c0[0x4, 1, [0xa:0x0:0x0] hash exist]{
      [7949565.810788] LustreError: 4224:0:(osd_handler.c:1351:osd_device_free()) ....local_storage@ffffa03aa81b0110
      [7949565.823509] LustreError: 4224:0:(osd_handler.c:1351:osd_device_free()) ....osd-zfs@ffffa03aa6c803c0osd-zfs-object@ffffa03aa6c803c0
      [7949565.838649] LustreError: 4224:0:(osd_handler.c:1351:osd_device_free()) } header@ffffa03aa81b00c0
      [7949565.850558] LustreError: 4224:0:(hash.c:1111:cfs_hash_destroy()) ASSERTION( !cfs_hash_with_assert_empty(hs) ) failed: hash lu_site_osd-zfs bucket 1(3) is not  empty: 1 items left
      [7949565.868513] LustreError: 4224:0:(hash.c:1111:cfs_hash_destroy()) LBUG
      [7949565.875900] Pid: 4224, comm: umount 3.10.0-1160.36.2.1chaos.ch6.x86_64 #1 SMP Wed Jul 21 15:34:23 PDT 2021
      [7949565.886871] Call Trace:
      [7949565.889807]  [<ffffffffc12407ec>] libcfs_call_trace+0x8c/0xd0 [libcfs]
      [7949565.897316]  [<ffffffffc12408ac>] lbug_with_loc+0x4c/0xa0 [libcfs]
      [7949565.904423]  [<ffffffffc124f85c>] cfs_hash_putref+0x3cc/0x520 [libcfs]
      [7949565.911928]  [<ffffffffc1529e54>] lu_site_fini+0x54/0xa0 [obdclass]
      [7949565.919162]  [<ffffffffc133d0cb>] osd_device_free+0x9b/0x2e0 [osd_zfs]
      [7949565.926665]  [<ffffffffc14fcf82>] class_free_dev+0x4c2/0x720 [obdclass]
      [7949565.934267]  [<ffffffffc14fd3e0>] class_export_put+0x200/0x2d0 [obdclass]
      [7949565.942059]  [<ffffffffc14fef05>] class_unlink_export+0x145/0x180 [obdclass]
      [7949565.950159]  [<ffffffffc1514990>] class_decref+0x80/0x160 [obdclass]
      [7949565.950169]  [<ffffffffc1514e13>] class_detach+0x1d3/0x300 [obdclass]
      [7949565.950179]  [<ffffffffc151bae8>] class_process_config+0x1a38/0x2830 [obdclass]
      [7949565.950189]  [<ffffffffc151cac0>] class_manual_cleanup+0x1e0/0x710 [obdclass]
      [7949565.950197]  [<ffffffffc133cd15>] osd_obd_disconnect+0x165/0x1a0 [osd_zfs]
      [7949565.950208]  [<ffffffffc1526cc6>] lustre_put_lsi+0x106/0x4d0 [obdclass]
      [7949565.950217]  [<ffffffffc1527200>] lustre_common_put_super+0x170/0x270 [obdclass]
      [7949565.950230]  [<ffffffffc154ea00>] server_put_super+0x120/0xd00 [obdclass]
      [7949565.950235]  [<ffffffffbca61e5d>] generic_shutdown_super+0x6d/0x110
      [7949565.950236]  [<ffffffffbca61f12>] kill_anon_super+0x12/0x20
      [7949565.950246]  [<ffffffffc151f6b2>] lustre_kill_super+0x32/0x50 [obdclass]
      [7949565.950247]  [<ffffffffbca6189e>] deactivate_locked_super+0x4e/0x70
      [7949565.950248]  [<ffffffffbca61906>] deactivate_super+0x46/0x60
      [7949565.950251]  [<ffffffffbca8372f>] cleanup_mnt+0x3f/0x80
      [7949565.950253]  [<ffffffffbca837c2>] __cleanup_mnt+0x12/0x20
      [7949565.950255]  [<ffffffffbc8c7cab>] task_work_run+0xbb/0xf0
      [7949565.950258]  [<ffffffffbc82dd95>] do_notify_resume+0xa5/0xc0
      [7949565.950262]  [<ffffffffbcfc44ef>] int_signal+0x12/0x17
      [7949565.950293]  [<ffffffffffffffff>] 0xffffffffffffffff
      

      We had never seen this before, but started seeing it frequently beginning 2021-11-10. Before that, there were two changes that may be related:
      1. We enabled changelogs on the file system ("brass") and started consuming them.
      2. We updated the system to the kernel and Lustre versions noted above. I'll find the prior versions and post them.

      We've seen this crash five times so far (in about one month), when failing MDTs over for maintenance on the nodes.

      For Lustre patch stack, see https://github.com/LLNL/lustre/releases/tag/2.12.7_2.llnl
      For ZFS patch stack, see https://github.com/LLNL/zfs/releases/tag/zfs-0.7.11-8llnl
      For SPL patch stack, see https://github.com/LLNL/spl/releases/tag/spl-0.7.11-8llnl
      (Note that the spl/zfs rpm version is misleading, it really is the spl- and zfs-0.7.11-8llnl tag that was used to build those rpms)

      Activity

            gerrit Gerrit Updater added a comment -
            "Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/45832/
            Subject: LU-15357 mdd: fix changelog context leak
            Project: fs/lustre-release
            Branch: b2_12
            Current Patch Set:
            Commit: 55fa745234ef0e1a01e0fb0622f1c00ecce047ea

            gerrit Gerrit Updater added a comment -
            "Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/45831/
            Subject: LU-15357 mdd: fix changelog context leak
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: d083c93c6fd9251d6637d33029049b1d27d2a20a

            gerrit Gerrit Updater added a comment -
            "Mike Pershin <mpershin@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/45872
            Subject: LU-15357 iokit: fix the obsolete usage of cfg_device
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: be87f89f6a6c72b101f7fd6792d4ed679f90de0e

            tappro Mikhail Pershin added a comment -
            This is the most obvious source of the problem, and worth checking first.

            gerrit Gerrit Updater added a comment -
            "Mike Pershin <mpershin@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/45832
            Subject: LU-15357 mdd: fix changelog context leak
            Project: fs/lustre-release
            Branch: b2_12
            Current Patch Set: 1
            Commit: eccb58ff0700dfea617880dc22cec8ab8ce883ea

            gerrit Gerrit Updater added a comment -
            "Mike Pershin <mpershin@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/45831
            Subject: LU-15357 mdd: fix changelog context leak
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 4e3f19eb13650098c1f3b459b2273b45523869fc
            tappro Mikhail Pershin added a comment - edited

            A few notes about the issue. First of all, the same problem can occur with a different stack trace in newer code:

            [23370.525027] LustreError: 29448:0:(lu_object.c:1027:lu_site_print()) header@ffff90577d346600[0x4, 1, [0x200000003:0x0:0x0] hash exist]{
            [23370.527626] LustreError: 29448:0:(lu_object.c:1027:lu_site_print()) ....local_storage@ffff90577d346658
            [23370.529614] LustreError: 29448:0:(lu_object.c:1027:lu_site_print()) ....osd-ldiskfs@ffff9057177f8000osd-ldiskfs-object@ffff9057177f8000(i:ffff90575898f9e8:117/1506646403)[plain]
            [23370.532701] LustreError: 29448:0:(lu_object.c:1027:lu_site_print()) } header@ffff90577d346600
            [23370.534628] LustreError: 29448:0:(lu_object.c:1297:lu_device_fini()) ASSERTION( atomic_read(&d->ld_ref) == 0 ) failed: Refcount is 1
            [23370.536881] LustreError: 29448:0:(lu_object.c:1297:lu_device_fini()) LBUG
            [23370.538240] Pid: 29448, comm: umount 3.10.0-1160.25.1.el7_lustre.ddn13.x86_64 #1 SMP Wed May 19 03:51:33 UTC 2021
            [23370.540143] Call Trace:
            [23370.540789] [<0>] libcfs_call_trace+0x90/0xf0 [libcfs]
            [23370.541758] [<0>] lbug_with_loc+0x4c/0xa0 [libcfs]
            [23370.542946] [<0>] lu_device_fini+0xbb/0xc0 [obdclass]
            [23370.543961] [<0>] dt_device_fini+0xe/0x10 [obdclass]
            [23370.545011] [<0>] osd_device_free+0xae/0x2c0 [osd_ldiskfs]
            [23370.546099] [<0>] class_free_dev+0x4c2/0x720 [obdclass]

            The reason is that newer code uses rhashtable instead of cfs_hash, so the ASSERTION fires in lu_device_fini(), but it is the same underlying problem.

            As for this case, it is most likely due to changelogs being enabled: the local storage FIDs are [0x1:0x0:0x0] and [0xa:0x0:0x0], which are FID_SEQ_LLOG and FID_SEQ_LLOG_NAME respectively. So it looks like a changelog llog object was not properly put, or the changelog subsystem is still in use. I am going to check related issues in that area first.

            tappro Mikhail Pershin added a comment -
            Olaf, thanks for the info. Andreas, I will port the debug patch first, then update EX-1873 for the major branches.
            ofaaland Olaf Faaland added a comment -

            Is this being hit during normal testing?

            This is a file system that has been in production for about 4 years, and at Lustre 2.12.x for around a year. Leading up to the first occurrence, and since then, it has been in use by ordinary users and not used in any other way.

            are you running mds-survey (lcfg cfg_device) sometime before the filesystem is unmounted?

            No, we have not run mds-survey, nor any other test tools to my knowledge, against this system or on the clients that mount it. No one else uses tools like that besides Cameron and myself, and he would have coordinated it with me.

            adilger Andreas Dilger added a comment -
            Is this being hit during normal testing, or are you running mds-survey (lcfg cfg_device) sometime before the filesystem is unmounted? We hit a problem similar to this in testing some time ago that was related to how mds-survey was being run.
            pjones Peter Jones added a comment -

            Mike,

            Could you please advise?

            Thanks

            Peter

            People

              Assignee: tappro Mikhail Pershin
              Reporter: ofaaland Olaf Faaland
              Votes: 0
              Watchers: 6