[LU-15139] sanity test_160h: dt_record_write() ASSERTION( dt->do_body_ops->dbo_write ) failed Created: 26/Feb/21  Updated: 07/Nov/23  Resolved: 02/Nov/22

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.14.0, Lustre 2.15.0
Fix Version/s: Lustre 2.16.0

Type: Bug Priority: Minor
Reporter: Maloo Assignee: Alex Zhuravlev
Resolution: Fixed Votes: 0
Labels: None

Issue Links:
Duplicate
is duplicated by LU-15849 replay-dual: test_26 hit dt_declare_r... Open
Related
is related to LU-13195 replay-single test_118: dt_declare_re... Resolved
is related to LU-16398 ost-pools: FAIL: remove sub-test dirs... Resolved
is related to LU-15809 replay-dual test_29: timeout llog_ver... Open
is related to LU-17255 dt_record_write()) ASSERTION( dt->do_... Open
is related to LU-15769 llog_osd_next_block()): invalid llog ... In Progress
is related to LU-14932 runtests: test_1 llog_cat_cleanup()) ... Resolved
is related to LU-15695 dt_ladvise(): ASSERTION( dt->do_body_... Resolved
is related to LU-15938 MDT recovery did not finish due to co... Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

This issue was created by maloo for Andreas Dilger <adilger@whamcloud.com>

This issue relates to the following test suite run: https://testing.whamcloud.com/test_sets/a9a49ddc-ce2e-4d36-a040-a2522bc7bfaa

test_160h failed with the following error:

[ 7154.952072] LustreError: 27898:0:(dt_object.c:503:dt_record_write()) ASSERTION( dt->do_body_ops->dbo_write ) failed: 
[ 7154.954429] LustreError: 27898:0:(dt_object.c:503:dt_record_write()) LBUG
[ 7154.955686] Pid: 27898, comm: mdt_out00_000 3.10.0-1160.15.2.el7_lustre.ddn12.x86_64 #1 SMP Tue Feb 23 18:25:44 UTC 2021
[ 7154.957907] Call Trace:
[ 7154.958488]  [<ffffffffc08197cc>] libcfs_call_trace+0x8c/0xc0 [libcfs]
[ 7154.960155]  [<ffffffffc081987c>] lbug_with_loc+0x4c/0xa0 [libcfs]
[ 7154.961662]  [<ffffffffc0a50272>] dt_record_write+0xb2/0x120 [obdclass]
[ 7154.963451]  [<ffffffffc0da7b69>] out_tx_write_exec+0x169/0x2f0 [ptlrpc]
[ 7154.965451]  [<ffffffffc0d9ff51>] out_tx_end+0xe1/0x5c0 [ptlrpc]
[ 7154.967053]  [<ffffffffc0da4172>] out_handle+0x1452/0x1bc0 [ptlrpc]
[ 7154.968687]  [<ffffffffc0d98a3e>] tgt_request_handle+0xaee/0x15f0 [ptlrpc]
[ 7154.970322]  [<ffffffffc0d3f40b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
[ 7154.972268]  [<ffffffffc0d42d74>] ptlrpc_main+0xb34/0x1470 [ptlrpc]

Another test hit a slightly different crash, also new, that may be related:
https://testing.whamcloud.com/test_sets/d3ede9b0-ecaa-484e-8840-891e62bdb314

[86376.168111] LustreError: 31898:0:(mdt_handler.c:5316:mdt_fini()) ASSERTION( atomic_read(&d->ld_ref) == 0 ) failed: 
[86376.169205] LustreError: 31898:0:(mdt_handler.c:5316:mdt_fini()) LBUG
[86376.169811] Pid: 31898, comm: umount 3.10.0-1160.15.2.el7_lustre.ddn12.x86_64 #1 SMP Tue Feb 23 03:03:21 UTC 2021
[86376.170763] Call Trace:
[86376.171066]  [<ffffffffc0bed7cc>] libcfs_call_trace+0x8c/0xc0 [libcfs]
[86376.171857]  [<ffffffffc0bed87c>] lbug_with_loc+0x4c/0xa0 [libcfs]
[86376.172537]  [<ffffffffc148b5ea>] mdt_device_fini+0x8fa/0x930 [mdt]
[86376.173227]  [<ffffffffc0e0ea68>] class_cleanup+0x9a8/0xc40 [obdclass]
[86376.174052]  [<ffffffffc0e0f98c>] class_process_config+0x65c/0x2830 [obdclass]
[86376.174793]  [<ffffffffc0e11d26>] class_manual_cleanup+0x1c6/0x710 [obdclass]
[86376.175574]  [<ffffffffc0e446a5>] server_put_super+0xa35/0x1150 [obdclass]
[86376.176295]  [<ffffffff8c4506fd>] generic_shutdown_super+0x6d/0x100
[86376.177103]  [<ffffffff8c450b02>] kill_anon_super+0x12/0x20
[86376.177684]  [<ffffffffc0e14932>] lustre_kill_super+0x32/0x50 [obdclass]
[86376.178360]  [<ffffffff8c450ede>] deactivate_locked_super+0x4e/0x70
[86376.179043]  [<ffffffff8c451666>] deactivate_super+0x46/0x60
[86376.179630]  [<ffffffff8c470cff>] cleanup_mnt+0x3f/0x80
[86376.180163]  [<ffffffff8c470d92>] __cleanup_mnt+0x12/0x20
[86376.180745]  [<ffffffff8c2c294b>] task_work_run+0xbb/0xe0
[86376.181296]  [<ffffffff8c22cc65>] do_notify_resume+0xa5/0xc0

VVVVVVV DO NOT REMOVE LINES BELOW, Added by Maloo for auto-association VVVVVVV
sanity test_160h - onyx-44vm4 crashed during sanity test_160h



 Comments   
Comment by Andreas Dilger [ 26/Feb/21 ]

It is possible that these failures relate to LU-14341 since we are now testing on 3.10.0-1160.15.2.el7 kernels, and that problem started with 3.10.0-1160.8.1.el7 kernels.

Comment by Andreas Dilger [ 01/Nov/21 ]

+1 on master:
https://testing.whamcloud.com/test_sets/0804055c-4eea-45f5-a43c-66079e3f3436

Comment by Andreas Dilger [ 09/Nov/21 ]

+1 on master: https://testing.whamcloud.com/test_sets/9c12f7a3-252e-48ae-aa7f-804bbd117895

Comment by Alex Zhuravlev [ 09/Nov/21 ]

I can reproduce the same assertion with recovery-small/149, though very rarely. I am collecting data, but IMO this has a different cause. Anyway, this is on my radar.

Comment by Andreas Dilger [ 23/Mar/22 ]

+12 crashes in recovery-small test_149 in the past 4 weeks with the out_tx_write_exec stack, but only 2 in the past few days; the rest are over 2 weeks old.

Comment by Alex Zhuravlev [ 23/Mar/22 ]

Yes, I also hit this locally a few times a week; adding more and more debugging ...

Comment by Alex Zhuravlev [ 04/Apr/22 ]

I was able to reproduce this and have a log, and the log is confusing:

00000004:00000010:0.0:1649084555.961084:0:6155:0:(mdt_handler.c:6301:mdt_object_alloc()) slab-alloced 'mo': 912 at 00000000092425e9.
00000004:00000040:0.0:1649084555.961088:0:6155:0:(mdt_handler.c:6338:mdt_object_init()) object init, fid = [0x240001b70:0x1:0x0]
00000004:00000010:0.0:1649084555.961095:0:6155:0:(lod_dev.c:214:lod_object_alloc()) slab-alloced 'lod_obj': 280 at 0000000012d03b04.
00080000:00000010:0.0:1649084555.961107:0:6155:0:(osd_handler.c:420:osd_object_alloc()) kmalloced '(mo)': 544 at 000000000ab89345.
00000004:00000040:0.0:1649084555.961102:0:6155:0:(lod_dev.c:169:lod_fld_lookup()) lustre-MDT0001-mdtlov: got tgt 1 for sequence: 0x240001b70

00000020:00000002:0.0:1649084555.962456:0:6155:0:(lu_object.c:224:lu_object_put()) Add 000000000ab89345/00000000092425e9 to site lru. bkt: 00000000cf2c67a7

00000004:00000001:0.0:1649084555.973169:0:6155:0:(mdt_handler.c:6299:mdt_object_alloc()) Process entered
00000004:00000010:0.0:1649084555.973174:0:6155:0:(mdt_handler.c:6301:mdt_object_alloc()) slab-alloced 'mo': 912 at 000000000b6883b5.
00000004:00000040:0.0:1649084555.973177:0:6155:0:(mdt_handler.c:6338:mdt_object_init()) object init, fid = [0x240001b70:0x1:0x0]
00000004:00000010:0.0:1649084555.973179:0:6155:0:(mdd_object.c:277:mdd_object_alloc()) slab-alloced 'mdd_obj': 96 at 00000000a2579935.
00000004:00000040:0.0:1649084555.973194:0:6155:0:(lod_dev.c:169:lod_fld_lookup()) lustre-MDT0001-mdtlov: got tgt 1 for sequence: 0x240001b70
00000004:00000001:0.0:1649084555.973195:0:6155:0:(lod_dev.c:171:lod_fld_lookup()) Process leaving (rc=0 : 0 : 0)
00080000:00000010:0.0:1649084555.973200:0:6155:0:(osd_handler.c:420:osd_object_alloc()) kmalloced '(mo)': 544 at 00000000b65f9684.
Comment by Gerrit Updater [ 06/Apr/22 ]

"Alex Zhuravlev <bzzz@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/47003
Subject: LU-15139 osp: block reads until writes are done
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: a09bc918bf4c18a4ff1c948237da5d5b3d391078

Comment by Alex Zhuravlev [ 07/Apr/22 ]

The patch above survived ~30 hours; without the patch it usually takes a few hours to hit the assertion.
One of the early versions confirmed that it is possible to hit reads during a write-in-flight in OSP: basically, at recovery time the MDS thread cleaning orphans from PENDING/ races with the recovery thread fetching updates from the remote llogs.

Comment by Nikitas Angelinas [ 18/May/22 ]

+1 on master in recovery-small test_149: https://testing.whamcloud.com/test_sets/68f3e91b-d748-4a92-89d1-babf4dcd7fb3

Comment by Yang Sheng [ 31/May/22 ]

+1 on master: https://testing.whamcloud.com/test_sets/88d891a9-d784-432b-a414-ef82751ac528
I have decoded the vmcore:

[ 7977.738850] LustreError: 14410:0:(dt_object.c:502:dt_record_write()) ASSERTION( dt->do_body_ops->dbo_write ) failed:
[ 7977.740884] LustreError: 14410:0:(dt_object.c:502:dt_record_write()) LBUG
[ 7977.742136] Pid: 14410, comm: mdt_out00_003 4.18.0-348.2.1.el8_lustre.x86_64 #1 SMP Fri May 6 20:34:00 UTC 2022
[ 7977.743950] Call Trace TBD:
[ 7977.744707] [<0>] libcfs_call_trace+0x6f/0x90 [libcfs]
[ 7977.745679] [<0>] lbug_with_loc+0x43/0x80 [libcfs]
[ 7977.746842] [<0>] dt_record_write+0x10a/0x110 [obdclass]
[ 7977.748285] [<0>] out_tx_write_exec+0x92/0x380 [ptlrpc]
[ 7977.749324] [<0>] out_tx_end+0x166/0x5f0 [ptlrpc]
[ 7977.750300] [<0>] out_handle+0x17ef/0x2180 [ptlrpc]
[ 7977.751308] [<0>] tgt_request_handle+0xc93/0x1a40 [ptlrpc]
[ 7977.752407] [<0>] ptlrpc_server_handle_request+0x323/0xbd0 [ptlrpc]
[ 7977.753643] [<0>] ptlrpc_main+0xc06/0x1560 [ptlrpc]
[ 7977.754607] [<0>] kthread+0x116/0x130
[ 7977.755345] [<0>] ret_from_fork+0x35/0x40
[ 7977.756132] Kernel panic - not syncing: LBUG

The osd_object:

struct osd_object {
  oo_dt = {
    do_lu = {
      lo_header = 0xffff96c0fb0a2930, 
      lo_dev = 0xffff96c0f1f40000, 
      lo_ops = 0xffffffffc116a440 <__func__.70851>, 
      lo_linkage = {
        next = 0xffff96c0fb0a2968, 
        prev = 0xffff96c0e6c87a30
      }, 
      lo_dev_ref = {<No data fields>}
    }, 
    do_ops = 0xffffffffc116a320 <__func__.71148+16>, 
    do_body_ops = 0xffffffffc116b760 <__flags.79514>, 
    do_index_ops = 0x0
  }, 
  oo_inode = 0x0, 
  oo_hl_head = 0x0, 
  oo_ext_idx_sem = {
    count = {
      counter = 256
    }, 
crash> sym 0xffffffffc116b760
ffffffffc116b760 (r) osd_body_ops_new [osd_ldiskfs] 
crash> 


The oo_inode was NULL, since osd_fid_lookup() returned from:

00000004:00000001:0.0:1653480139.386555:0:14410:0:(mdt_handler.c:6335:mdt_object_init()) Process entered
00000004:00000040:0.0:1653480139.386555:0:14410:0:(mdt_handler.c:6338:mdt_object_init()) object init, fid = [0x240001b70:0x1:0x0]
00000004:00000010:0.0:1653480139.386557:0:14410:0:(mdd_object.c:277:mdd_object_alloc()) slab-alloced 'mdd_obj': 96 at 00000000b01970bb.
00000004:00000001:0.0:1653480139.386558:0:14410:0:(mdt_handler.c:6347:mdt_object_init()) Process leaving (rc=0 : 0 : 0)
00000004:00000001:0.0:1653480139.386558:0:14410:0:(mdd_object.c:299:mdd_object_init()) Process entered
00000004:00000001:0.0:1653480139.386559:0:14410:0:(lod_dev.c:212:lod_object_alloc()) Process entered
00000004:00000010:0.0:1653480139.386560:0:14410:0:(lod_dev.c:214:lod_object_alloc()) slab-alloced 'lod_obj': 152 at 0000000059b054cd.
00000004:00000001:0.0:1653480139.386560:0:14410:0:(lod_dev.c:224:lod_object_alloc()) Process leaving (rc=18446628353982626328 : -115719726925288 : ffff96c0e6c87a18)
00000004:00000001:0.0:1653480139.386561:0:14410:0:(mdd_object.c:309:mdd_object_init()) Process leaving (rc=0 : 0 : 0)
00000004:00000001:0.0:1653480139.386562:0:14410:0:(lod_object.c:9014:lod_object_init()) Process entered
00000004:00000001:0.0:1653480139.386563:0:14410:0:(lod_dev.c:128:lod_fld_lookup()) Process entered
80000000:00000001:0.0:1653480139.386563:0:14410:0:(fld_handler.c:247:fld_server_lookup()) Process entered
80000000:00000001:0.0:1653480139.386564:0:14410:0:(fld_handler.c:211:fld_local_lookup()) Process entered
80000000:00000001:0.0:1653480139.386565:0:14410:0:(fld_cache.c:468:fld_cache_lookup()) Process entered
80000000:00000001:0.0:1653480139.386565:0:14410:0:(fld_cache.c:487:fld_cache_lookup()) Process leaving (rc=0 : 0 : 0)
80000000:00000001:0.0:1653480139.386566:0:14410:0:(fld_handler.c:228:fld_local_lookup()) Process leaving (rc=0 : 0 : 0)
80000000:00000001:0.0:1653480139.386566:0:14410:0:(fld_handler.c:251:fld_server_lookup()) Process leaving (rc=0 : 0 : 0)
00000004:00000040:0.0:1653480139.386567:0:14410:0:(lod_dev.c:169:lod_fld_lookup()) lustre-MDT0001-mdtlov: got tgt 1 for sequence: 0x240001b70
00000004:00000001:0.0:1653480139.386568:0:14410:0:(lod_dev.c:171:lod_fld_lookup()) Process leaving (rc=0 : 0 : 0)
00080000:00000010:0.0:1653480139.386569:0:14410:0:(osd_handler.c:420:osd_object_alloc()) kmalloced '(mo)': 224 at 00000000b474d102.
00000004:00000001:0.0:1653480139.386570:0:14410:0:(lod_object.c:9057:lod_object_init()) Process leaving (rc=0 : 0 : 0)
00080000:00000001:0.0:1653480139.386571:0:14410:0:(osd_handler.c:1126:osd_fid_lookup()) Process entered
00080000:00000001:0.0:1653480139.386571:0:14410:0:(osd_oi.c:576:fid_is_on_ost()) Process entered
80000000:00000001:0.0:1653480139.386572:0:14410:0:(fld_handler.c:211:fld_local_lookup()) Process entered
80000000:00000001:0.0:1653480139.386572:0:14410:0:(fld_cache.c:468:fld_cache_lookup()) Process entered
80000000:00000001:0.0:1653480139.386573:0:14410:0:(fld_cache.c:487:fld_cache_lookup()) Process leaving (rc=0 : 0 : 0)
80000000:00000001:0.0:1653480139.386573:0:14410:0:(fld_handler.c:228:fld_local_lookup()) Process leaving (rc=0 : 0 : 0)
00080000:00000001:0.0:1653480139.386574:0:14410:0:(osd_oi.c:615:fid_is_on_ost()) Process leaving (rc=0 : 0 : 0)
80000000:00000001:0.0:1653480139.386572:0:14410:0:(fld_handler.c:211:fld_local_lookup()) Process entered
80000000:00000001:0.0:1653480139.386572:0:14410:0:(fld_cache.c:468:fld_cache_lookup()) Process entered
80000000:00000001:0.0:1653480139.386573:0:14410:0:(fld_cache.c:487:fld_cache_lookup()) Process leaving (rc=0 : 0 : 0)
80000000:00000001:0.0:1653480139.386573:0:14410:0:(fld_handler.c:228:fld_local_lookup()) Process leaving (rc=0 : 0 : 0)
00080000:00000001:0.0:1653480139.386574:0:14410:0:(osd_oi.c:615:fid_is_on_ost()) Process leaving (rc=0 : 0 : 0)
00080000:00000001:0.0:1653480139.386575:0:14410:0:(osd_oi.c:546:osd_oi_iam_lookup()) Process entered
00000001:00000010:0.0:1653480139.386576:0:14410:0:(osd_dynlocks.c:109:dynlock_lock()) slab-alloced 'nhl': 72 at 0000000006b44178.
00000001:00000010:0.0:1653480139.386580:0:14410:0:(osd_dynlocks.c:193:dynlock_unlock()) slab-freed 'hl': 72 at 0000000006b44178.
00080000:00000001:0.0:1653480139.386582:0:14410:0:(osd_oi.c:568:osd_oi_iam_lookup()) Process leaving (rc=0 : 0 : 0)
00080000:00000001:0.0:1653480139.386582:0:14410:0:(osd_oi.c:576:fid_is_on_ost()) Process entered
80000000:00000001:0.0:1653480139.386583:0:14410:0:(fld_handler.c:211:fld_local_lookup()) Process entered
80000000:00000001:0.0:1653480139.386583:0:14410:0:(fld_cache.c:468:fld_cache_lookup()) Process entered
80000000:00000001:0.0:1653480139.386584:0:14410:0:(fld_cache.c:487:fld_cache_lookup()) Process leaving (rc=0 : 0 : 0)
80000000:00000001:0.0:1653480139.386584:0:14410:0:(fld_handler.c:228:fld_local_lookup()) Process leaving (rc=0 : 0 : 0)
00080000:00000001:0.0:1653480139.386585:0:14410:0:(osd_oi.c:615:fid_is_on_ost()) Process leaving (rc=0 : 0 : 0)
00080000:00000001:0.0:1653480139.386586:0:14410:0:(osd_handler.c:1186:osd_fid_lookup()) Process leaving via out (rc=0 : 0 : 0x0).  <<<<<<<-----

1179         /* Search order: 3. OI files. */
1180         result = osd_oi_lookup(info, dev, fid, id, OI_CHECK_FLD);
1181         if (result == -ENOENT) {
1182                 if (!(fid_is_norm(fid) || fid_is_igif(fid)) ||
1183                     fid_is_on_ost(info, dev, fid, OI_CHECK_FLD) ||
1184                     !ldiskfs_test_bit(osd_oi_fid2idx(dev, fid),
1185                                       sf->sf_oi_bitmap))
1186                         GOTO(out, result = 0);
1187 
1188                 goto trigger;

Look at the sf_oi_bitmap:

      os_file = {
        sf_uuid = {
          b = "\202%\024\\_EL\322\240\220\060\322\250\244\357k"
        },
        sf_flags = 0x0,
        sf_magic = 0x4c5fd252,
        sf_status = 0x0,
        sf_param = 0x0,
        sf_time_last_complete = 0x0,
        sf_time_latest_start = 0x0,
        sf_time_last_checkpoint = 0x0,
        sf_pos_latest_start = 0x0,
        sf_pos_last_checkpoint = 0x0,
        sf_pos_first_inconsistent = 0x0,
        sf_items_checked = 0x0,
        sf_items_updated = 0x0,
        sf_items_failed = 0x0,
        sf_items_updated_prior = 0x0,
        sf_items_noscrub = 0x0,
        sf_items_igif = 0x0,
        sf_run_time = 0x0,
        sf_success_count = 0x0,
        sf_oi_count = 0x40,
        sf_internal_flags = 0x1,
        sf_reserved_1 = 0x0,
        sf_reserved_2 = {0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0},
        sf_oi_bitmap = "\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000"
      },

Comment by Gerrit Updater [ 02/Nov/22 ]

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/47003/
Subject: LU-15139 osp: block reads until the object is created
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 4f2914537cc32fe89c4781bcfc87c38e3fe4419c

Comment by Peter Jones [ 02/Nov/22 ]

Landed for 2.16

Comment by Gerrit Updater [ 04/Jul/23 ]

"Etienne AUJAMES <eaujames@ddn.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/51558
Subject: LU-15139 osp: block reads until the object is created
Project: fs/lustre-release
Branch: b2_15
Current Patch Set: 1
Commit: 4afaf1249439ea7a27910b0f77a84232c268df61

Generated at Sat Feb 10 03:15:47 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.