[LU-15357] hash.c:1111:cfs_hash_destroy() ASSERTION( !cfs_hash_with_assert_empty(hs) ) failed: hash lu_site_osd-zfs bucket 1(8) is not empty: 1 items left Created: 10/Dec/21  Updated: 19/Apr/23  Resolved: 22/Aug/22

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.12.7
Fix Version/s: Lustre 2.15.3

Type: Bug Priority: Major
Reporter: Olaf Faaland Assignee: Mikhail Pershin
Resolution: Fixed Votes: 0
Labels: llnl
Environment:

lustre-2.12.7_2.llnl-2
zfs-0.7.11-9.8llnl.ch6.x86_64
3.10.0-1160.45.1.1chaos.ch6.x86_64
rhel7-based


Issue Links:
Related
is related to LU-14553 Inappropriate console log messages du... Resolved
is related to LU-15382 ASSERTION(!cfs_hash_with_assert_empty... Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

Looks like the same issue as LU-10433 to me. Occurred during shutdown of an MDT. vmcore-dmesg.txt file from crash dump shows:

[7949565.740966] LustreError: 4224:0:(osd_handler.c:1351:osd_device_free()) header@ffffa03aa81b09c0[0x4, 1, [0x1:0x0:0x0] hash exist]{

[7949565.756016] LustreError: 4224:0:(osd_handler.c:1351:osd_device_free()) ....local_storage@ffffa03aa81b0a10

[7949565.768738] LustreError: 4224:0:(osd_handler.c:1351:osd_device_free()) ....osd-zfs@ffffa03aa6c80140osd-zfs-object@ffffa03aa6c80140

[7949565.783886] LustreError: 4224:0:(osd_handler.c:1351:osd_device_free()) } header@ffffa03aa81b09c0

[7949565.795741] LustreError: 4224:0:(osd_handler.c:1351:osd_device_free()) header@ffffa03aa81b00c0[0x4, 1, [0xa:0x0:0x0] hash exist]{

[7949565.810788] LustreError: 4224:0:(osd_handler.c:1351:osd_device_free()) ....local_storage@ffffa03aa81b0110

[7949565.823509] LustreError: 4224:0:(osd_handler.c:1351:osd_device_free()) ....osd-zfs@ffffa03aa6c803c0osd-zfs-object@ffffa03aa6c803c0

[7949565.838649] LustreError: 4224:0:(osd_handler.c:1351:osd_device_free()) } header@ffffa03aa81b00c0

[7949565.850558] LustreError: 4224:0:(hash.c:1111:cfs_hash_destroy()) ASSERTION( !cfs_hash_with_assert_empty(hs) ) failed: hash lu_site_osd-zfs bucket 1(3) is not  empty: 1 items left
[7949565.868513] LustreError: 4224:0:(hash.c:1111:cfs_hash_destroy()) LBUG
[7949565.875900] Pid: 4224, comm: umount 3.10.0-1160.36.2.1chaos.ch6.x86_64 #1 SMP Wed Jul 21 15:34:23 PDT 2021
[7949565.886871] Call Trace:
[7949565.889807]  [<ffffffffc12407ec>] libcfs_call_trace+0x8c/0xd0 [libcfs]
[7949565.897316]  [<ffffffffc12408ac>] lbug_with_loc+0x4c/0xa0 [libcfs]
[7949565.904423]  [<ffffffffc124f85c>] cfs_hash_putref+0x3cc/0x520 [libcfs]
[7949565.911928]  [<ffffffffc1529e54>] lu_site_fini+0x54/0xa0 [obdclass]
[7949565.919162]  [<ffffffffc133d0cb>] osd_device_free+0x9b/0x2e0 [osd_zfs]
[7949565.926665]  [<ffffffffc14fcf82>] class_free_dev+0x4c2/0x720 [obdclass]
[7949565.934267]  [<ffffffffc14fd3e0>] class_export_put+0x200/0x2d0 [obdclass]
[7949565.942059]  [<ffffffffc14fef05>] class_unlink_export+0x145/0x180 [obdclass]
[7949565.950159]  [<ffffffffc1514990>] class_decref+0x80/0x160 [obdclass]
[7949565.950169]  [<ffffffffc1514e13>] class_detach+0x1d3/0x300 [obdclass]
[7949565.950179]  [<ffffffffc151bae8>] class_process_config+0x1a38/0x2830 [obdclass]
[7949565.950189]  [<ffffffffc151cac0>] class_manual_cleanup+0x1e0/0x710 [obdclass]
[7949565.950197]  [<ffffffffc133cd15>] osd_obd_disconnect+0x165/0x1a0 [osd_zfs]
[7949565.950208]  [<ffffffffc1526cc6>] lustre_put_lsi+0x106/0x4d0 [obdclass]
[7949565.950217]  [<ffffffffc1527200>] lustre_common_put_super+0x170/0x270 [obdclass]
[7949565.950230]  [<ffffffffc154ea00>] server_put_super+0x120/0xd00 [obdclass]
[7949565.950235]  [<ffffffffbca61e5d>] generic_shutdown_super+0x6d/0x110
[7949565.950236]  [<ffffffffbca61f12>] kill_anon_super+0x12/0x20
[7949565.950246]  [<ffffffffc151f6b2>] lustre_kill_super+0x32/0x50 [obdclass]
[7949565.950247]  [<ffffffffbca6189e>] deactivate_locked_super+0x4e/0x70
[7949565.950248]  [<ffffffffbca61906>] deactivate_super+0x46/0x60
[7949565.950251]  [<ffffffffbca8372f>] cleanup_mnt+0x3f/0x80
[7949565.950253]  [<ffffffffbca837c2>] __cleanup_mnt+0x12/0x20
[7949565.950255]  [<ffffffffbc8c7cab>] task_work_run+0xbb/0xf0
[7949565.950258]  [<ffffffffbc82dd95>] do_notify_resume+0xa5/0xc0
[7949565.950262]  [<ffffffffbcfc44ef>] int_signal+0x12/0x17
[7949565.950293]  [<ffffffffffffffff>] 0xffffffffffffffff

We had never seen this before, but started seeing it frequently 2021-11-10. Before that there were two changes that may be related:
1. We enabled changelogs on the filesystem ("brass") and started consuming them
2. We updated the system to the above-mentioned kernel and lustre versions. I'll find the prior versions and post them.

We've seen this crash 5 times so far (in about 1 month), when failing MDTs over for maintenance on the nodes.

For Lustre patch stack, see https://github.com/LLNL/lustre/releases/tag/2.12.7_2.llnl
For ZFS patch stack, see https://github.com/LLNL/zfs/releases/tag/zfs-0.7.11-8llnl
For SPL patch stack, see https://github.com/LLNL/spl/releases/tag/spl-0.7.11-8llnl
(Note that the spl/zfs rpm version is misleading, it really is the spl- and zfs-0.7.11-8llnl tag that was used to build those rpms)



 Comments   
Comment by Peter Jones [ 10/Dec/21 ]

Mike

Could you please advise?

Thanks

Peter

Comment by Andreas Dilger [ 10/Dec/21 ]

Is this being hit during normal testing, or are you running mds-survey (lcfg cfg_device) sometime before the filesystem is unmounted? We hit a problem similar to this in testing some time ago that was related to how mds-survey was being run.

Comment by Olaf Faaland [ 10/Dec/21 ]

Is this being hit during normal testing

This is a file system that has been in production for about 4 years, and at Lustre 2.12.X for around a year.  Leading up to the first occurrence, and since then, it's been under use by ordinary users, and not used in any other way.

are you running mds-survey (lcfg cfg_device) sometime before the filesystem is unmounted?

No, we have not run mds-survey, nor any other test tools to my knowledge, against this system nor on the clients that mount it. No one else uses tools like that, besides Cameron and myself, and he would have coordinated it with me.

Comment by Mikhail Pershin [ 10/Dec/21 ]

Olaf, thanks for info. Andreas, I will port debug patch first, then will update EX-1873 for major branches

Comment by Mikhail Pershin [ 11/Dec/21 ]

Just few notes about the issue. First of all, the same issue may occur with different stack trace in newer code:

[23370.525027] LustreError: 29448:0:(lu_object.c:1027:lu_site_print()) header@ffff90577d346600[0x4, 1, [0x200000003:0x0:0x0] hash exist]{

[23370.527626] LustreError: 29448:0:(lu_object.c:1027:lu_site_print()) ....local_storage@ffff90577d346658

[23370.529614] LustreError: 29448:0:(lu_object.c:1027:lu_site_print()) ....osd-ldiskfs@ffff9057177f8000osd-ldiskfs-object@ffff9057177f8000(i:ffff90575898f9e8:117/1506646403)[plain]

[23370.532701] LustreError: 29448:0:(lu_object.c:1027:lu_site_print()) } header@ffff90577d346600

[23370.534628] LustreError: 29448:0:(lu_object.c:1297:lu_device_fini()) ASSERTION( atomic_read(&d->ld_ref) == 0 ) failed: Refcount is 1
[23370.536881] LustreError: 29448:0:(lu_object.c:1297:lu_device_fini()) LBUG
[23370.538240] Pid: 29448, comm: umount 3.10.0-1160.25.1.el7_lustre.ddn13.x86_64 #1 SMP Wed May 19 03:51:33 UTC 2021
[23370.540143] Call Trace:
[23370.540789] [<0>] libcfs_call_trace+0x90/0xf0 [libcfs]
[23370.541758] [<0>] lbug_with_loc+0x4c/0xa0 [libcfs]
[23370.542946] [<0>] lu_device_fini+0xbb/0xc0 [obdclass]
[23370.543961] [<0>] dt_device_fini+0xe/0x10 [obdclass]
[23370.545011] [<0>] osd_device_free+0xae/0x2c0 [osd_ldiskfs]
[23370.546099] [<0>] class_free_dev+0x4c2/0x720 [obdclass]

 the reason is the using rhashtable instead of cfs_hash, so ASSERTION is in lu_device_fini() but the same problem.

As for our case, this is due to changelog enabled most likely, because local storage FIDs are: [0x1:0x0:0x0] and [0xa:0x0:0x0] which are FID_SEQ_LLOG and FID_SEQ_LLOG_NAME respectively, so it looks like changelog llog object wasn't properly put or changelog subsystem is still in use. I am going to check related issues in that area first

 

Comment by Gerrit Updater [ 11/Dec/21 ]

"Mike Pershin <mpershin@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/45831
Subject: LU-15357 mdd: fix changelog context leak
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 4e3f19eb13650098c1f3b459b2273b45523869fc

Comment by Gerrit Updater [ 11/Dec/21 ]

"Mike Pershin <mpershin@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/45832
Subject: LU-15357 mdd: fix changelog context leak
Project: fs/lustre-release
Branch: b2_12
Current Patch Set: 1
Commit: eccb58ff0700dfea617880dc22cec8ab8ce883ea

Comment by Mikhail Pershin [ 11/Dec/21 ]

This is the most obvious source of problem which is worth to check first

Comment by Gerrit Updater [ 16/Dec/21 ]

"Mike Pershin <mpershin@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/45872
Subject: LU-15357 iokit: fix the obsolete usage of cfg_device
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: be87f89f6a6c72b101f7fd6792d4ed679f90de0e

Comment by Gerrit Updater [ 23/Dec/21 ]

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/45831/
Subject: LU-15357 mdd: fix changelog context leak
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: d083c93c6fd9251d6637d33029049b1d27d2a20a

Comment by Gerrit Updater [ 30/Jan/22 ]

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/45832/
Subject: LU-15357 mdd: fix changelog context leak
Project: fs/lustre-release
Branch: b2_12
Current Patch Set:
Commit: 55fa745234ef0e1a01e0fb0622f1c00ecce047ea

Comment by Gerrit Updater [ 15/Feb/22 ]

"Gian-Carlo DeFazio <defazio1@llnl.gov>" uploaded a new patch: https://review.whamcloud.com/46529
Subject: LU-15357 mdd: fix changelog context leak
Project: fs/lustre-release
Branch: b2_14
Current Patch Set: 1
Commit: 8fa010f2202b97e0e4b76e295386ec438dd341b6

Comment by Olaf Faaland [ 21/Mar/22 ]

Fixing the changelog context leak appears to have resolved the issue at our site.

Comment by Olaf Faaland [ 22/Aug/22 ]

Since the backport was landed to b2_12, and the patch was already in 2.15.0, resolving this ticket. I think the 2.14 backport can be abandoned.

Comment by Gerrit Updater [ 01/Sep/22 ]

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/45872/
Subject: LU-15357 iokit: fix the obsolete usage of cfg_device
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: a20b78a81d091cebd6b9e6c87537b2c955084cd5

Comment by Gerrit Updater [ 06/Jan/23 ]

"Jian Yu <yujian@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/49566
Subject: LU-15357 iokit: fix the obsolete usage of cfg_device
Project: fs/lustre-release
Branch: b2_15
Current Patch Set: 1
Commit: b02fcbd3e8763173fc997e2bb49bc06d8855fb67

Comment by Gerrit Updater [ 19/Apr/23 ]

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/49566/
Subject: LU-15357 iokit: fix the obsolete usage of cfg_device
Project: fs/lustre-release
Branch: b2_15
Current Patch Set:
Commit: 1e7624b711e173baf893cec0bfc687ddee000fc7

Generated at Sat Feb 10 03:17:37 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.