[LU-9924] LBUG dt_object.c:513:dt_record_write Created: 28/Aug/17  Updated: 13/Apr/20  Resolved: 13/Apr/20

Status: Closed
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.10.1, Lustre 2.12.0
Fix Version/s: None

Type: Bug Priority: Major
Reporter: Cliff White (Inactive) Assignee: Lai Siyao
Resolution: Duplicate Votes: 0
Labels: soak
Environment:

Soak performance cluster -version=2.10.0_61_g6aabd4a


Issue Links:
Related
is related to LU-10143 LBUG dt_object.h:2166:dt_declare_reco... Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

Appears to be a recurrence of LU-6846.
The server crashed after multiple failovers.

Aug 26 01:45:48 soak-10 kernel: LustreError: 3333:0:(dt_object.c:513:dt_record_write()) ASSERTION( dt->do_body_ops->dbo_write ) failed:
Aug 26 01:45:48 soak-10 kernel: LustreError: 3333:0:(dt_object.c:513:dt_record_write()) LBUG
Aug 26 01:45:48 soak-10 kernel: Pid: 3333, comm: mdt_out01_000
Aug 26 01:45:48 soak-10 kernel: Call Trace:
Aug 26 01:45:48 soak-10 kernel: [<ffffffffa0d457ee>] libcfs_call_trace+0x4e/0x60 [libcfs]
Aug 26 01:45:48 soak-10 kernel: [<ffffffffa0d4587c>] lbug_with_loc+0x4c/0xb0 [libcfs]
Aug 26 01:45:48 soak-10 kernel: [<ffffffffa0ea6d25>] dt_record_write+0xb5/0x120 [obdclass]
Aug 26 01:45:48 soak-10 kernel: [<ffffffffa113e266>] out_tx_write_exec+0x166/0x320 [ptlrpc]
Aug 26 01:45:48 soak-10 kernel: [<ffffffffa113712e>] out_tx_end+0xde/0x5c0 [ptlrpc]
Aug 26 01:45:48 soak-10 kernel: [<ffffffffa11397e7>] out_handle+0x11e7/0x1920 [ptlrpc]
Aug 26 01:45:48 soak-10 kernel: [<ffffffff810ce4c4>] ? update_curr+0x104/0x190
Aug 26 01:45:49 soak-10 kernel: [<ffffffffa1086ac0>] ? target_bulk_timeout+0x0/0xb0 [ptlrpc]
Aug 26 01:45:49 soak-10 kernel: [<ffffffffa11308f5>] tgt_request_handle+0x925/0x1370 [ptlrpc]
Aug 26 01:45:49 soak-10 kernel: [<ffffffffa10d92c6>] ptlrpc_server_handle_request+0x236/0xa90 [ptlrpc]
Aug 26 01:45:49 soak-10 kernel: [<ffffffffa10d6ab8>] ? ptlrpc_wait_event+0x98/0x340 [ptlrpc]
Aug 26 01:45:49 soak-10 kernel: [<ffffffff810c54f2>] ? default_wake_function+0x12/0x20
Aug 26 01:45:49 soak-10 kernel: [<ffffffff810ba628>] ? __wake_up_common+0x58/0x90
Aug 26 01:45:49 soak-10 kernel: [<ffffffffa10dd2a0>] ptlrpc_main+0xaa0/0x1de0 [ptlrpc]
Aug 26 01:45:49 soak-10 kernel: [<ffffffffa10dc800>] ? ptlrpc_main+0x0/0x1de0 [ptlrpc]


 Comments   
Comment by Joseph Gmitter (Inactive) [ 28/Aug/17 ]

Hi Lai,

Can you please investigate?

Thanks.
Joe

Comment by Lai Siyao [ 29/Aug/17 ]

Cliff, what disk fs is used? ldiskfs or zfs?

Comment by Cliff White (Inactive) [ 29/Aug/17 ]

ldiskfs on the MDS; the OSSes are ZFS. soak-10 is an MDS.

Comment by Oleg Drokin [ 22/Aug/18 ]

Saw this one in my pot:

[71169.050844] Lustre: DEBUG MARKER: == replay-single test 118: invalidate osp update will not cause update log corruption ================ 18:09:49 (1534889389)
[71169.459671] Lustre: *** cfs_fail_loc=1705, val=0***
[71169.470523] Lustre: *** cfs_fail_loc=1705, val=0***
[71169.473869] LustreError: 15314:0:(dt_object.c:515:dt_record_write()) ASSERTION( dt->do_body_ops->dbo_write ) failed: 
[71169.476829] LustreError: 15314:0:(dt_object.c:515:dt_record_write()) LBUG
[71169.477891] Pid: 15314, comm: ll_ost_out06_00 3.10.0-7.5-debug #1 SMP Sun Jun 3 13:35:38 EDT 2018
[71169.480212] Call Trace:
[71169.481130]  [<ffffffffa01b77dc>] libcfs_call_trace+0x8c/0xc0 [libcfs]
[71169.482216]  [<ffffffffa01b788c>] lbug_with_loc+0x4c/0xa0 [libcfs]
[71169.483539]  [<ffffffffa0365ab5>] dt_record_write+0xb5/0x120 [obdclass]
[71169.485568]  [<ffffffffa0658226>] out_tx_write_exec+0x166/0x2f0 [ptlrpc]
[71169.487207]  [<ffffffffa065071e>] out_tx_end+0xde/0x5c0 [ptlrpc]
[71169.488202]  [<ffffffffa0654210>] out_handle+0x13f0/0x1b50 [ptlrpc]
[71169.489126]  [<ffffffffa064da55>] tgt_request_handle+0xaf5/0x1590 [ptlrpc]
[71169.490080]  [<ffffffffa05f1eb6>] ptlrpc_server_handle_request+0x256/0xad0 [ptlrpc]
[71169.491917]  [<ffffffffa05f5cae>] ptlrpc_main+0xabe/0x1f80 [ptlrpc]
[71169.492800]  [<ffffffff810ae864>] kthread+0xe4/0xf0
[71169.493653]  [<ffffffff81783777>] ret_from_fork_nospec_end+0x0/0x39
[71169.494525]  [<ffffffffffffffff>] 0xffffffffffffffff
[71169.495544] Kernel panic - not syncing: LBUG

Comment by Andreas Dilger [ 27/Nov/19 ]

+3 on master in the past week. This is very likely a duplicate of LU-10143; which assertion fires is probably just a fluke of which field happens to be zero or not in a corrupt structure.
https://testing.whamcloud.com/test_sets/3d836bfe-0caf-11ea-bbc3-52540065bddc
https://testing.whamcloud.com/test_sets/6177c52e-0cae-11ea-9487-52540065bddc
https://testing.whamcloud.com/test_sets/72415a9c-109a-11ea-9487-52540065bddc

Comment by Alex Zhuravlev [ 13/Apr/20 ]

a dup of LU-13195
