[LU-15139] sanity test_160h: dt_record_write() ASSERTION( dt->do_body_ops->dbo_write ) failed Created: 26/Feb/21 Updated: 07/Nov/23 Resolved: 02/Nov/22 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.14.0, Lustre 2.15.0 |
| Fix Version/s: | Lustre 2.16.0 |
| Type: | Bug | Priority: | Minor |
| Reporter: | Maloo | Assignee: | Alex Zhuravlev |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Issue Links: |
|
||||||||||||||||||||||||||||||||||||||||||||
| Severity: | 3 | ||||||||||||||||||||||||||||||||||||||||||||
| Rank (Obsolete): | 9223372036854775807 | ||||||||||||||||||||||||||||||||||||||||||||
| Description |
|
This issue was created by maloo for Andreas Dilger <adilger@whamcloud.com> This issue relates to the following test suite run: https://testing.whamcloud.com/test_sets/a9a49ddc-ce2e-4d36-a040-a2522bc7bfaa test_160h failed with the following error: [ 7154.952072] LustreError: 27898:0:(dt_object.c:503:dt_record_write()) ASSERTION( dt->do_body_ops->dbo_write ) failed: [ 7154.954429] LustreError: 27898:0:(dt_object.c:503:dt_record_write()) LBUG [ 7154.955686] Pid: 27898, comm: mdt_out00_000 3.10.0-1160.15.2.el7_lustre.ddn12.x86_64 #1 SMP Tue Feb 23 18:25:44 UTC 2021 [ 7154.957907] Call Trace: [ 7154.958488] [<ffffffffc08197cc>] libcfs_call_trace+0x8c/0xc0 [libcfs] [ 7154.960155] [<ffffffffc081987c>] lbug_with_loc+0x4c/0xa0 [libcfs] [ 7154.961662] [<ffffffffc0a50272>] dt_record_write+0xb2/0x120 [obdclass] [ 7154.963451] [<ffffffffc0da7b69>] out_tx_write_exec+0x169/0x2f0 [ptlrpc] [ 7154.965451] [<ffffffffc0d9ff51>] out_tx_end+0xe1/0x5c0 [ptlrpc] [ 7154.967053] [<ffffffffc0da4172>] out_handle+0x1452/0x1bc0 [ptlrpc] [ 7154.968687] [<ffffffffc0d98a3e>] tgt_request_handle+0xaee/0x15f0 [ptlrpc] [ 7154.970322] [<ffffffffc0d3f40b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc] [ 7154.972268] [<ffffffffc0d42d74>] ptlrpc_main+0xb34/0x1470 [ptlrpc] Another test had a slightly different crash that is also new that may be related: [86376.168111] LustreError: 31898:0:(mdt_handler.c:5316:mdt_fini()) ASSERTION( atomic_read(&d->ld_ref) == 0 ) failed: [86376.169205] LustreError: 31898:0:(mdt_handler.c:5316:mdt_fini()) LBUG [86376.169811] Pid: 31898, comm: umount 3.10.0-1160.15.2.el7_lustre.ddn12.x86_64 #1 SMP Tue Feb 23 03:03:21 UTC 2021 [86376.170763] Call Trace: [86376.171066] [<ffffffffc0bed7cc>] libcfs_call_trace+0x8c/0xc0 [libcfs] [86376.171857] [<ffffffffc0bed87c>] lbug_with_loc+0x4c/0xa0 [libcfs] [86376.172537] [<ffffffffc148b5ea>] mdt_device_fini+0x8fa/0x930 [mdt] [86376.173227] [<ffffffffc0e0ea68>] class_cleanup+0x9a8/0xc40 [obdclass] [86376.174052] [<ffffffffc0e0f98c>] class_process_config+0x65c/0x2830 [obdclass] [86376.174793] [<ffffffffc0e11d26>] class_manual_cleanup+0x1c6/0x710 [obdclass] [86376.175574] [<ffffffffc0e446a5>] server_put_super+0xa35/0x1150 [obdclass] [86376.176295] [<ffffffff8c4506fd>] generic_shutdown_super+0x6d/0x100 [86376.177103] [<ffffffff8c450b02>] kill_anon_super+0x12/0x20 [86376.177684] [<ffffffffc0e14932>] lustre_kill_super+0x32/0x50 [obdclass] [86376.178360] [<ffffffff8c450ede>] deactivate_locked_super+0x4e/0x70 [86376.179043] [<ffffffff8c451666>] deactivate_super+0x46/0x60 [86376.179630] [<ffffffff8c470cff>] cleanup_mnt+0x3f/0x80 [86376.180163] [<ffffffff8c470d92>] __cleanup_mnt+0x12/0x20 [86376.180745] [<ffffffff8c2c294b>] task_work_run+0xbb/0xe0 [86376.181296] [<ffffffff8c22cc65>] do_notify_resume+0xa5/0xc0 VVVVVVV DO NOT REMOVE LINES BELOW, Added by Maloo for auto-association VVVVVVV |
| Comments |
| Comment by Andreas Dilger [ 26/Feb/21 ] |
|
It is possible that these failures relate to |
| Comment by Andreas Dilger [ 01/Nov/21 ] |
|
+1 on master: |
| Comment by Andreas Dilger [ 09/Nov/21 ] |
|
+1 on master: https://testing.whamcloud.com/test_sets/9c12f7a3-252e-48ae-aa7f-804bbd117895 |
| Comment by Alex Zhuravlev [ 09/Nov/21 ] |
|
I can reproduce the same assertion with recovery-small/149, very rarely, collecting data, but IMO this has a different cause. anyway, this is on my radar. |
| Comment by Andreas Dilger [ 23/Mar/22 ] |
|
+12 crashes in recovery-small test_149 in the past 4 weeks with the out_tx_write_exec stack, but only 2 in the past few days, the rest are over 2 weeks ago. |
| Comment by Alex Zhuravlev [ 23/Mar/22 ] |
|
yes, I also hit this locally few times a week, adding more and more debug ... |
| Comment by Alex Zhuravlev [ 04/Apr/22 ] |
|
I was able to reproduce this and have log. and the log is confusing: 00000004:00000010:0.0:1649084555.961084:0:6155:0:(mdt_handler.c:6301:mdt_object_alloc()) slab-alloced 'mo': 912 at 00000000092425e9. 00000004:00000040:0.0:1649084555.961088:0:6155:0:(mdt_handler.c:6338:mdt_object_init()) object init, fid = [0x240001b70:0x1:0x0] 00000004:00000010:0.0:1649084555.961095:0:6155:0:(lod_dev.c:214:lod_object_alloc()) slab-alloced 'lod_obj': 280 at 0000000012d03b04. 00080000:00000010:0.0:1649084555.961107:0:6155:0:(osd_handler.c:420:osd_object_alloc()) kmalloced '(mo)': 544 at 000000000ab89345. 00000004:00000040:0.0:1649084555.961102:0:6155:0:(lod_dev.c:169:lod_fld_lookup()) lustre-MDT0001-mdtlov: got tgt 1 for sequence: 0x240001b70 00000020:00000002:0.0:1649084555.962456:0:6155:0:(lu_object.c:224:lu_object_put()) Add 000000000ab89345/00000000092425e9 to site lru. bkt: 00000000cf2c67a7 00000004:00000001:0.0:1649084555.973169:0:6155:0:(mdt_handler.c:6299:mdt_object_alloc()) Process entered 00000004:00000010:0.0:1649084555.973174:0:6155:0:(mdt_handler.c:6301:mdt_object_alloc()) slab-alloced 'mo': 912 at 000000000b6883b5. 00000004:00000040:0.0:1649084555.973177:0:6155:0:(mdt_handler.c:6338:mdt_object_init()) object init, fid = [0x240001b70:0x1:0x0] 00000004:00000010:0.0:1649084555.973179:0:6155:0:(mdd_object.c:277:mdd_object_alloc()) slab-alloced 'mdd_obj': 96 at 00000000a2579935. 00000004:00000040:0.0:1649084555.973194:0:6155:0:(lod_dev.c:169:lod_fld_lookup()) lustre-MDT0001-mdtlov: got tgt 1 for sequence: 0x240001b70 00000004:00000001:0.0:1649084555.973195:0:6155:0:(lod_dev.c:171:lod_fld_lookup()) Process leaving (rc=0 : 0 : 0) 00080000:00000010:0.0:1649084555.973200:0:6155:0:(osd_handler.c:420:osd_object_alloc()) kmalloced '(mo)': 544 at 00000000b65f9684. |
| Comment by Gerrit Updater [ 06/Apr/22 ] |
|
"Alex Zhuravlev <bzzz@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/47003 |
| Comment by Alex Zhuravlev [ 07/Apr/22 ] |
|
the patch above survived ~30 hours. w/o the patch it usually takes few hours to hit the assertion. |
| Comment by Nikitas Angelinas [ 18/May/22 ] |
|
+1 on master in recovery-small test_149: https://testing.whamcloud.com/test_sets/68f3e91b-d748-4a92-89d1-babf4dcd7fb3 |
| Comment by Yang Sheng [ 31/May/22 ] |
|
+1 on master: https://testing.whamcloud.com/test_sets/88d891a9-d784-432b-a414-ef82751ac528 [ 7977.738850] LustreError: 14410:0:(dt_object.c:502:dt_record_write()) ASSERTION( dt->do_body_ops->dbo_write ) failed: [ 7977.740884] LustreError: 14410:0:(dt_object.c:502:dt_record_write()) LBUG [ 7977.742136] Pid: 14410, comm: mdt_out00_003 4.18.0-348.2.1.el8_lustre.x86_64 #1 SMP Fri May 6 20:34:00 UTC 2022 [ 7977.743950] Call Trace TBD: [ 7977.744707] [<0>] libcfs_call_trace+0x6f/0x90 [libcfs] [ 7977.745679] [<0>] lbug_with_loc+0x43/0x80 [libcfs] [ 7977.746842] [<0>] dt_record_write+0x10a/0x110 [obdclass] [ 7977.748285] [<0>] out_tx_write_exec+0x92/0x380 [ptlrpc] [ 7977.749324] [<0>] out_tx_end+0x166/0x5f0 [ptlrpc] [ 7977.750300] [<0>] out_handle+0x17ef/0x2180 [ptlrpc] [ 7977.751308] [<0>] tgt_request_handle+0xc93/0x1a40 [ptlrpc] [ 7977.752407] [<0>] ptlrpc_server_handle_request+0x323/0xbd0 [ptlrpc] [ 7977.753643] [<0>] ptlrpc_main+0xc06/0x1560 [ptlrpc] [ 7977.754607] [<0>] kthread+0x116/0x130 [ 7977.755345] [<0>] ret_from_fork+0x35/0x40 [ 7977.756132] Kernel panic - not syncing: LBUG The osd_object: struct osd_object {
oo_dt = {
do_lu = {
lo_header = 0xffff96c0fb0a2930,
lo_dev = 0xffff96c0f1f40000,
lo_ops = 0xffffffffc116a440 <__func__.70851>,
lo_linkage = {
next = 0xffff96c0fb0a2968,
prev = 0xffff96c0e6c87a30
},
lo_dev_ref = {<No data fields>}
},
do_ops = 0xffffffffc116a320 <__func__.71148+16>,
do_body_ops = 0xffffffffc116b760 <__flags.79514>,
do_index_ops = 0x0
},
oo_inode = 0x0,
oo_hl_head = 0x0,
oo_ext_idx_sem = {
count = {
counter = 256
},
crash> sym 0xffffffffc116b760
ffffffffc116b760 (r) osd_body_ops_new [osd_ldiskfs]
crash>
The oo_inode was NULL, since osd_fid_lookup return from:
00000004:00000001:0.0:1653480139.386555:0:14410:0:(mdt_handler.c:6335:mdt_object_init()) Process entered
00000004:00000040:0.0:1653480139.386555:0:14410:0:(mdt_handler.c:6338:mdt_object_init()) object init, fid = [0x240001b70:0x1:0x0]
00000004:00000010:0.0:1653480139.386557:0:14410:0:(mdd_object.c:277:mdd_object_alloc()) slab-alloced 'mdd_obj': 96 at 00000000b01970bb.
00000004:00000001:0.0:1653480139.386558:0:14410:0:(mdt_handler.c:6347:mdt_object_init()) Process leaving (rc=0 : 0 : 0)
00000004:00000001:0.0:1653480139.386558:0:14410:0:(mdd_object.c:299:mdd_object_init()) Process entered
00000004:00000001:0.0:1653480139.386559:0:14410:0:(lod_dev.c:212:lod_object_alloc()) Process entered
00000004:00000010:0.0:1653480139.386560:0:14410:0:(lod_dev.c:214:lod_object_alloc()) slab-alloced 'lod_obj': 152 at 0000000059b054cd.
00000004:00000001:0.0:1653480139.386560:0:14410:0:(lod_dev.c:224:lod_object_alloc()) Process leaving (rc=18446628353982626328 : -115719726925288 : ffff96c0e6c87a18)
00000004:00000001:0.0:1653480139.386561:0:14410:0:(mdd_object.c:309:mdd_object_init()) Process leaving (rc=0 : 0 : 0)
00000004:00000001:0.0:1653480139.386562:0:14410:0:(lod_object.c:9014:lod_object_init()) Process entered
00000004:00000001:0.0:1653480139.386563:0:14410:0:(lod_dev.c:128:lod_fld_lookup()) Process entered
80000000:00000001:0.0:1653480139.386563:0:14410:0:(fld_handler.c:247:fld_server_lookup()) Process entered
80000000:00000001:0.0:1653480139.386564:0:14410:0:(fld_handler.c:211:fld_local_lookup()) Process entered
80000000:00000001:0.0:1653480139.386565:0:14410:0:(fld_cache.c:468:fld_cache_lookup()) Process entered
80000000:00000001:0.0:1653480139.386565:0:14410:0:(fld_cache.c:487:fld_cache_lookup()) Process leaving (rc=0 : 0 : 0)
80000000:00000001:0.0:1653480139.386566:0:14410:0:(fld_handler.c:228:fld_local_lookup()) Process leaving (rc=0 : 0 : 0)
80000000:00000001:0.0:1653480139.386566:0:14410:0:(fld_handler.c:251:fld_server_lookup()) Process leaving (rc=0 : 0 : 0)
00000004:00000040:0.0:1653480139.386567:0:14410:0:(lod_dev.c:169:lod_fld_lookup()) lustre-MDT0001-mdtlov: got tgt 1 for sequence: 0x240001b70
00000004:00000001:0.0:1653480139.386568:0:14410:0:(lod_dev.c:171:lod_fld_lookup()) Process leaving (rc=0 : 0 : 0)
00080000:00000010:0.0:1653480139.386569:0:14410:0:(osd_handler.c:420:osd_object_alloc()) kmalloced '(mo)': 224 at 00000000b474d102.
00000004:00000001:0.0:1653480139.386570:0:14410:0:(lod_object.c:9057:lod_object_init()) Process leaving (rc=0 : 0 : 0)
00080000:00000001:0.0:1653480139.386571:0:14410:0:(osd_handler.c:1126:osd_fid_lookup()) Process entered
00080000:00000001:0.0:1653480139.386571:0:14410:0:(osd_oi.c:576:fid_is_on_ost()) Process entered
80000000:00000001:0.0:1653480139.386572:0:14410:0:(fld_handler.c:211:fld_local_lookup()) Process entered
80000000:00000001:0.0:1653480139.386572:0:14410:0:(fld_cache.c:468:fld_cache_lookup()) Process entered
80000000:00000001:0.0:1653480139.386573:0:14410:0:(fld_cache.c:487:fld_cache_lookup()) Process leaving (rc=0 : 0 : 0)
80000000:00000001:0.0:1653480139.386573:0:14410:0:(fld_handler.c:228:fld_local_lookup()) Process leaving (rc=0 : 0 : 0)
00080000:00000001:0.0:1653480139.386574:0:14410:0:(osd_oi.c:615:fid_is_on_ost()) Process leaving (rc=0 : 0 : 0)
80000000:00000001:0.0:1653480139.386572:0:14410:0:(fld_handler.c:211:fld_local_lookup()) Process entered
80000000:00000001:0.0:1653480139.386572:0:14410:0:(fld_cache.c:468:fld_cache_lookup()) Process entered
80000000:00000001:0.0:1653480139.386573:0:14410:0:(fld_cache.c:487:fld_cache_lookup()) Process leaving (rc=0 : 0 : 0)
80000000:00000001:0.0:1653480139.386573:0:14410:0:(fld_handler.c:228:fld_local_lookup()) Process leaving (rc=0 : 0 : 0)
00080000:00000001:0.0:1653480139.386574:0:14410:0:(osd_oi.c:615:fid_is_on_ost()) Process leaving (rc=0 : 0 : 0)
00080000:00000001:0.0:1653480139.386575:0:14410:0:(osd_oi.c:546:osd_oi_iam_lookup()) Process entered
00000001:00000010:0.0:1653480139.386576:0:14410:0:(osd_dynlocks.c:109:dynlock_lock()) slab-alloced 'nhl': 72 at 0000000006b44178.
00000001:00000010:0.0:1653480139.386580:0:14410:0:(osd_dynlocks.c:193:dynlock_unlock()) slab-freed 'hl': 72 at 0000000006b44178.
00080000:00000001:0.0:1653480139.386582:0:14410:0:(osd_oi.c:568:osd_oi_iam_lookup()) Process leaving (rc=0 : 0 : 0)
00080000:00000001:0.0:1653480139.386582:0:14410:0:(osd_oi.c:576:fid_is_on_ost()) Process entered
80000000:00000001:0.0:1653480139.386583:0:14410:0:(fld_handler.c:211:fld_local_lookup()) Process entered
80000000:00000001:0.0:1653480139.386583:0:14410:0:(fld_cache.c:468:fld_cache_lookup()) Process entered
80000000:00000001:0.0:1653480139.386584:0:14410:0:(fld_cache.c:487:fld_cache_lookup()) Process leaving (rc=0 : 0 : 0)
80000000:00000001:0.0:1653480139.386584:0:14410:0:(fld_handler.c:228:fld_local_lookup()) Process leaving (rc=0 : 0 : 0)
00080000:00000001:0.0:1653480139.386585:0:14410:0:(osd_oi.c:615:fid_is_on_ost()) Process leaving (rc=0 : 0 : 0)
00080000:00000001:0.0:1653480139.386586:0:14410:0:(osd_handler.c:1186:osd_fid_lookup()) Process leaving via out (rc=0 : 0 : 0x0). <<<<<<<-----
1179 /* Search order: 3. OI files. */
1180 result = osd_oi_lookup(info, dev, fid, id, OI_CHECK_FLD);
1181 if (result == -ENOENT) {
1182 if (!(fid_is_norm(fid) || fid_is_igif(fid)) ||
1183 fid_is_on_ost(info, dev, fid, OI_CHECK_FLD) ||
1184 !ldiskfs_test_bit(osd_oi_fid2idx(dev, fid),
1185 sf->sf_oi_bitmap))
1186 GOTO(out, result = 0);
1187
1188 goto trigger;
Look at the sf_oi_bitmap: os_file = {
sf_uuid = {
b = "\202%\024\\_EL\322\240\220\060\322\250\244\357k"
},
sf_flags = 0x0,
sf_magic = 0x4c5fd252,
sf_status = 0x0,
sf_param = 0x0,
sf_time_last_complete = 0x0,
sf_time_latest_start = 0x0,
sf_time_last_checkpoint = 0x0,
sf_pos_latest_start = 0x0,
sf_pos_last_checkpoint = 0x0,
sf_pos_first_inconsistent = 0x0,
sf_items_checked = 0x0,
sf_items_updated = 0x0,
sf_items_failed = 0x0,
sf_items_updated_prior = 0x0,
sf_items_noscrub = 0x0,
sf_items_igif = 0x0,
sf_run_time = 0x0,
sf_success_count = 0x0,
sf_oi_count = 0x40,
sf_internal_flags = 0x1,
sf_reserved_1 = 0x0,
sf_reserved_2 = {0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0},
sf_oi_bitmap = "\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000"
},
|
| Comment by Gerrit Updater [ 02/Nov/22 ] |
|
"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/47003/ |
| Comment by Peter Jones [ 02/Nov/22 ] |
|
Landed for 2.16 |
| Comment by Gerrit Updater [ 04/Jul/23 ] |
|
"Etienne AUJAMES <eaujames@ddn.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/51558 |