[LU-15357] hash.c:1111:cfs_hash_destroy() ASSERTION( !cfs_hash_with_assert_empty(hs) ) failed: hash lu_site_osd-zfs bucket 1(8) is not empty: 1 items left Created: 10/Dec/21 Updated: 19/Apr/23 Resolved: 22/Aug/22 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.12.7 |
| Fix Version/s: | Lustre 2.15.3 |
| Type: | Bug | Priority: | Major |
| Reporter: | Olaf Faaland | Assignee: | Mikhail Pershin |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | llnl | ||
| Environment: |
lustre-2.12.7_2.llnl-2 |
| Issue Links: | |
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
Looks like the same issue as [7949565.740966] LustreError: 4224:0:(osd_handler.c:1351:osd_device_free()) header@ffffa03aa81b09c0[0x4, 1, [0x1:0x0:0x0] hash exist]{
[7949565.756016] LustreError: 4224:0:(osd_handler.c:1351:osd_device_free()) ....local_storage@ffffa03aa81b0a10
[7949565.768738] LustreError: 4224:0:(osd_handler.c:1351:osd_device_free()) ....osd-zfs@ffffa03aa6c80140osd-zfs-object@ffffa03aa6c80140
[7949565.783886] LustreError: 4224:0:(osd_handler.c:1351:osd_device_free()) } header@ffffa03aa81b09c0
[7949565.795741] LustreError: 4224:0:(osd_handler.c:1351:osd_device_free()) header@ffffa03aa81b00c0[0x4, 1, [0xa:0x0:0x0] hash exist]{
[7949565.810788] LustreError: 4224:0:(osd_handler.c:1351:osd_device_free()) ....local_storage@ffffa03aa81b0110
[7949565.823509] LustreError: 4224:0:(osd_handler.c:1351:osd_device_free()) ....osd-zfs@ffffa03aa6c803c0osd-zfs-object@ffffa03aa6c803c0
[7949565.838649] LustreError: 4224:0:(osd_handler.c:1351:osd_device_free()) } header@ffffa03aa81b00c0
[7949565.850558] LustreError: 4224:0:(hash.c:1111:cfs_hash_destroy()) ASSERTION( !cfs_hash_with_assert_empty(hs) ) failed: hash lu_site_osd-zfs bucket 1(3) is not empty: 1 items left
[7949565.868513] LustreError: 4224:0:(hash.c:1111:cfs_hash_destroy()) LBUG
[7949565.875900] Pid: 4224, comm: umount 3.10.0-1160.36.2.1chaos.ch6.x86_64 #1 SMP Wed Jul 21 15:34:23 PDT 2021
[7949565.886871] Call Trace:
[7949565.889807] [<ffffffffc12407ec>] libcfs_call_trace+0x8c/0xd0 [libcfs]
[7949565.897316] [<ffffffffc12408ac>] lbug_with_loc+0x4c/0xa0 [libcfs]
[7949565.904423] [<ffffffffc124f85c>] cfs_hash_putref+0x3cc/0x520 [libcfs]
[7949565.911928] [<ffffffffc1529e54>] lu_site_fini+0x54/0xa0 [obdclass]
[7949565.919162] [<ffffffffc133d0cb>] osd_device_free+0x9b/0x2e0 [osd_zfs]
[7949565.926665] [<ffffffffc14fcf82>] class_free_dev+0x4c2/0x720 [obdclass]
[7949565.934267] [<ffffffffc14fd3e0>] class_export_put+0x200/0x2d0 [obdclass]
[7949565.942059] [<ffffffffc14fef05>] class_unlink_export+0x145/0x180 [obdclass]
[7949565.950159] [<ffffffffc1514990>] class_decref+0x80/0x160 [obdclass]
[7949565.950169] [<ffffffffc1514e13>] class_detach+0x1d3/0x300 [obdclass]
[7949565.950179] [<ffffffffc151bae8>] class_process_config+0x1a38/0x2830 [obdclass]
[7949565.950189] [<ffffffffc151cac0>] class_manual_cleanup+0x1e0/0x710 [obdclass]
[7949565.950197] [<ffffffffc133cd15>] osd_obd_disconnect+0x165/0x1a0 [osd_zfs]
[7949565.950208] [<ffffffffc1526cc6>] lustre_put_lsi+0x106/0x4d0 [obdclass]
[7949565.950217] [<ffffffffc1527200>] lustre_common_put_super+0x170/0x270 [obdclass]
[7949565.950230] [<ffffffffc154ea00>] server_put_super+0x120/0xd00 [obdclass]
[7949565.950235] [<ffffffffbca61e5d>] generic_shutdown_super+0x6d/0x110
[7949565.950236] [<ffffffffbca61f12>] kill_anon_super+0x12/0x20
[7949565.950246] [<ffffffffc151f6b2>] lustre_kill_super+0x32/0x50 [obdclass]
[7949565.950247] [<ffffffffbca6189e>] deactivate_locked_super+0x4e/0x70
[7949565.950248] [<ffffffffbca61906>] deactivate_super+0x46/0x60
[7949565.950251] [<ffffffffbca8372f>] cleanup_mnt+0x3f/0x80
[7949565.950253] [<ffffffffbca837c2>] __cleanup_mnt+0x12/0x20
[7949565.950255] [<ffffffffbc8c7cab>] task_work_run+0xbb/0xf0
[7949565.950258] [<ffffffffbc82dd95>] do_notify_resume+0xa5/0xc0
[7949565.950262] [<ffffffffbcfc44ef>] int_signal+0x12/0x17
[7949565.950293] [<ffffffffffffffff>] 0xffffffffffffffff
We had never seen this before, but started seeing it frequently on 2021-11-10; before that, there were two changes that may be related. We've seen this crash 5 times so far (in about 1 month), when failing MDTs over for maintenance on the nodes. For our Lustre patch stack, see https://github.com/LLNL/lustre/releases/tag/2.12.7_2.llnl |
| Comments |
| Comment by Peter Jones [ 10/Dec/21 ] |
|
Mike, could you please advise? Thanks, Peter |
| Comment by Andreas Dilger [ 10/Dec/21 ] |
|
Is this being hit during normal testing, or are you running mds-survey (lcfg cfg_device) sometime before the filesystem is unmounted? We hit a problem similar to this in testing some time ago that was related to how mds-survey was being run. |
| Comment by Olaf Faaland [ 10/Dec/21 ] |
This is a file system that has been in production for about 4 years, and at Lustre 2.12.x for around a year. Leading up to the first occurrence, and since then, it has been in use by ordinary users and not used in any other way.
No, we have not run mds-survey, nor any other test tools to my knowledge, against this system nor on the clients that mount it. No one else uses tools like that, besides Cameron and myself, and he would have coordinated it with me. |
| Comment by Mikhail Pershin [ 10/Dec/21 ] |
|
Olaf, thanks for the info. Andreas, I will port the debug patch first, then will update EX-1873 for the major branches |
| Comment by Mikhail Pershin [ 11/Dec/21 ] |
|
Just a few notes about the issue. First of all, the same issue may occur with a different stack trace in newer code:
[23370.525027] LustreError: 29448:0:(lu_object.c:1027:lu_site_print()) header@ffff90577d346600[0x4, 1, [0x200000003:0x0:0x0] hash exist]{
[23370.527626] LustreError: 29448:0:(lu_object.c:1027:lu_site_print()) ....local_storage@ffff90577d346658
[23370.529614] LustreError: 29448:0:(lu_object.c:1027:lu_site_print()) ....osd-ldiskfs@ffff9057177f8000osd-ldiskfs-object@ffff9057177f8000(i:ffff90575898f9e8:117/1506646403)[plain]
[23370.532701] LustreError: 29448:0:(lu_object.c:1027:lu_site_print()) } header@ffff90577d346600
[23370.534628] LustreError: 29448:0:(lu_object.c:1297:lu_device_fini()) ASSERTION( atomic_read(&d->ld_ref) == 0 ) failed: Refcount is 1
[23370.536881] LustreError: 29448:0:(lu_object.c:1297:lu_device_fini()) LBUG
[23370.538240] Pid: 29448, comm: umount 3.10.0-1160.25.1.el7_lustre.ddn13.x86_64 #1 SMP Wed May 19 03:51:33 UTC 2021
[23370.540143] Call Trace:
[23370.540789] [<0>] libcfs_call_trace+0x90/0xf0 [libcfs]
[23370.541758] [<0>] lbug_with_loc+0x4c/0xa0 [libcfs]
[23370.542946] [<0>] lu_device_fini+0xbb/0xc0 [obdclass]
[23370.543961] [<0>] dt_device_fini+0xe/0x10 [obdclass]
[23370.545011] [<0>] osd_device_free+0xae/0x2c0 [osd_ldiskfs]
[23370.546099] [<0>] class_free_dev+0x4c2/0x720 [obdclass]
The reason is the use of rhashtable instead of cfs_hash, so the ASSERTION is in lu_device_fini(), but it is the same problem. As for our case, it is most likely due to changelogs being enabled, because the local storage FIDs are [0x1:0x0:0x0] and [0xa:0x0:0x0], which are FID_SEQ_LLOG and FID_SEQ_LLOG_NAME respectively. So it looks like a changelog llog object wasn't properly put, or the changelog subsystem is still in use. I am going to check related issues in that area first
|
| Comment by Gerrit Updater [ 11/Dec/21 ] |
|
"Mike Pershin <mpershin@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/45831 |
| Comment by Gerrit Updater [ 11/Dec/21 ] |
|
"Mike Pershin <mpershin@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/45832 |
| Comment by Mikhail Pershin [ 11/Dec/21 ] |
|
This is the most obvious source of the problem and is worth checking first |
| Comment by Gerrit Updater [ 16/Dec/21 ] |
|
"Mike Pershin <mpershin@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/45872 |
| Comment by Gerrit Updater [ 23/Dec/21 ] |
|
"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/45831/ |
| Comment by Gerrit Updater [ 30/Jan/22 ] |
|
"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/45832/ |
| Comment by Gerrit Updater [ 15/Feb/22 ] |
|
"Gian-Carlo DeFazio <defazio1@llnl.gov>" uploaded a new patch: https://review.whamcloud.com/46529 |
| Comment by Olaf Faaland [ 21/Mar/22 ] |
|
Fixing the changelog context leak appears to have resolved the issue at our site. |
| Comment by Olaf Faaland [ 22/Aug/22 ] |
|
Since the backport was landed to b2_12, and the patch was already in 2.15.0, resolving this ticket. I think the 2.14 backport can be abandoned. |
| Comment by Gerrit Updater [ 01/Sep/22 ] |
|
"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/45872/ |
| Comment by Gerrit Updater [ 06/Jan/23 ] |
|
"Jian Yu <yujian@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/49566 |
| Comment by Gerrit Updater [ 19/Apr/23 ] |
|
"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/49566/ |