[LU-10143] LBUG dt_object.h:2166:dt_declare_record_write Created: 19/Oct/17  Updated: 07/Feb/20  Resolved: 07/Feb/20

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.11.0, Lustre 2.12.0, Lustre 2.10.5, Lustre 2.10.6
Fix Version/s: Lustre 2.13.0, Lustre 2.10.7, Lustre 2.12.1

Type: Bug Priority: Minor
Reporter: James Casper Assignee: Alex Zhuravlev
Resolution: Duplicate Votes: 0
Labels: None
Environment:

trevis, full DNE
servers: CentOS7.4, zfs, branch master, v2.10.54, b3652
clients: CentOS7.4, branch master, v2.10.54, b3652


Issue Links:
Duplicate
is duplicated by LU-9157 replay-single test_80c: rmdir failed Resolved
is duplicated by LU-10740 replay-single test_2d: FAIL: checksta... Resolved
is duplicated by LU-11538 replay-single test 80g fails with '/... Resolved
is duplicated by LU-13195 replay-single test_118: dt_declare_re... Resolved
Related
is related to LU-11366 replay-single timeout test 80f: rm: c... Resolved
is related to LU-7298 replay-single test_70b: ASSERTION( dt... Closed
is related to LU-9924 LBUG dt_object.c:513:dt_record_write Closed
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

https://testing.hpdd.intel.com/test_sessions/2f6027c4-8e9a-4949-af3d-2f8d04940c9c

replay-single, test_80d: Timeout occurred after 155 mins, last suite running was replay-single, restarting cluster to continue tests LBUG

From mds console:

[ 5748.826345] LustreError: 28616:0:(dt_object.h:2166:dt_declare_record_write()) ASSERTION( dt->do_body_ops ) failed: 
[ 5748.828884] LustreError: 28616:0:(dt_object.h:2166:dt_declare_record_write()) LBUG
[ 5748.831194] Pid: 28616, comm: mdt_out00_003
[ 5748.833233] 
[ 5748.833233] Call Trace:
[ 5748.836889]  [<ffffffffc06917ae>] libcfs_call_trace+0x4e/0x60 [libcfs]
[ 5748.839015]  [<ffffffffc069183c>] lbug_with_loc+0x4c/0xb0 [libcfs]
[ 5748.841118]  [<ffffffffc0ef1743>] out_write_add_exec+0x133/0x1b0 [ptlrpc]
[ 5748.843217]  [<ffffffffc0ee84a3>] out_write+0x333/0x370 [ptlrpc]
[ 5748.845224]  [<ffffffffc0eeb1c4>] out_handle+0x1304/0x1920 [ptlrpc]
[ 5748.847209]  [<ffffffffc0e7d4a2>] ? lustre_msg_get_opc+0x22/0xf0 [ptlrpc]
[ 5748.849234]  [<ffffffffc0ee0d49>] ? tgt_request_preprocess.isra.26+0x299/0x7a0 [ptlrpc]
[ 5748.851305]  [<ffffffffc0ee2475>] tgt_request_handle+0x925/0x1370 [ptlrpc]
[ 5748.853263]  [<ffffffffc0e8b37e>] ptlrpc_server_handle_request+0x24e/0xab0 [ptlrpc]
[ 5748.855236]  [<ffffffff810ba588>] ? __wake_up_common+0x58/0x90
[ 5748.857056]  [<ffffffffc0e8eb22>] ptlrpc_main+0xa92/0x1e40 [ptlrpc]
[ 5748.858924]  [<ffffffffc0e8e090>] ? ptlrpc_main+0x0/0x1e40 [ptlrpc]
[ 5748.860737]  [<ffffffff810b098f>] kthread+0xcf/0xe0
[ 5748.862471]  [<ffffffff810b08c0>] ? kthread+0x0/0xe0
[ 5748.864195]  [<ffffffff816b4f18>] ret_from_fork+0x58/0x90
[ 5748.865926]  [<ffffffff810b08c0>] ? kthread+0x0/0xe0
[ 5748.867590] 
[ 5748.868971] Kernel panic - not syncing: LBUG
[ 5748.869962] CPU: 1 PID: 28616 Comm: mdt_out00_003 Tainted: P           OE  ------------   3.10.0-693.2.2.el7_lustre.x86_64 #1
[ 5748.869962] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2007
[ 5748.869962]  ffff88007c7daf00 000000006e5a06dd ffff88005feefa88 ffffffff816a3d8d
[ 5748.869962]  ffff88005feefb08 ffffffff8169dc74 ffffffff00000008 ffff88005feefb18
[ 5748.869962]  ffff88005feefab8 000000006e5a06dd 000000006e5a06dd ffff88007fd0f8b8
[ 5748.869962] Call Trace:
[ 5748.869962]  [<ffffffff816a3d8d>] dump_stack+0x19/0x1b
[ 5748.869962]  [<ffffffff8169dc74>] panic+0xe8/0x20d
[ 5748.869962]  [<ffffffffc0691854>] lbug_with_loc+0x64/0xb0 [libcfs]
[ 5748.869962]  [<ffffffffc0ef1743>] out_write_add_exec+0x133/0x1b0 [ptlrpc]
[ 5748.869962]  [<ffffffffc0ee84a3>] out_write+0x333/0x370 [ptlrpc]
[ 5748.869962]  [<ffffffffc0eeb1c4>] out_handle+0x1304/0x1920 [ptlrpc]
[ 5748.869962]  [<ffffffffc0e7d4a2>] ? lustre_msg_get_opc+0x22/0xf0 [ptlrpc]
[ 5748.869962]  [<ffffffffc0ee0d49>] ? tgt_request_preprocess.isra.26+0x299/0x7a0 [ptlrpc]
[ 5748.893323] Lustre: DEBUG MARKER: umount -d /mnt/lustre-mds1
[ 5748.893798]  [<ffffffffc0ee2475>] tgt_request_handle+0x925/0x1370 [ptlrpc]
[ 5748.893798]  [<ffffffffc0e8b37e>] ptlrpc_server_handle_request+0x24e/0xab0 [ptlrpc]
[ 5748.893798]  [<ffffffff810ba588>] ? __wake_up_common+0x58/0x90
[ 5748.893798]  [<ffffffffc0e8eb22>] ptlrpc_main+0xa92/0x1e40 [ptlrpc]
[ 5748.893798]  [<ffffffffc0e8e090>] ? ptlrpc_register_service+0xe80/0xe80 [ptlrpc]
[ 5748.893798]  [<ffffffff810b098f>] kthread+0xcf/0xe0
[ 5748.893798]  [<ffffffff810b08c0>] ? insert_kthread_work+0x40/0x40
[ 5748.893798]  [<ffffffff816b4f18>] ret_from_fork+0x58/0x90
[ 5748.893798]  [<ffffffff810b08c0>] ? insert_kthread_work+0x40/0x40


 Comments   
Comment by Joseph Gmitter (Inactive) [ 20/Oct/17 ]

Hi Alex,

Can you please look into this?

Thanks.
Joe

Comment by Sarah Liu [ 30/Jul/18 ]

another instance on master tag-2.11.53 ZFS DNE

https://testing.whamcloud.com/test_sets/454a32d6-9097-11e8-a9f7-52540065bddc

Comment by Alex Zhuravlev [ 30/Jul/18 ]

trying to reproduce locally...

Comment by Andreas Dilger [ 11/Aug/18 ]

+1 on b2_10:
https://testing.whamcloud.com/test_sets/509ffa68-9d11-11e8-a9f7-52540065bddc

Comment by James Nunez (Inactive) [ 12/Dec/18 ]

We have a replay-single test_80g crash with the same stack trace. Logs are at https://testing.whamcloud.com/test_sets/9c64d894-fdc2-11e8-b837-52540065bddc

Comment by Gerrit Updater [ 13/Dec/18 ]

Alex Zhuravlev (bzzz@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/33849
Subject: LU-10143 obdclass: additional debug
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 2adb23e506c7fd52a1d1255b214e6c3626af1d7c

Comment by Alex Zhuravlev [ 13/Dec/18 ]

the interesting difference in the code is that osd-ldiskfs sets .do_body_ops unconditionally, right at object initialisation (even if the object doesn't exist yet), while osd-zfs sets .do_body_ops only if the object exists or is declared to be created. but there is one case, when the object is being destroyed by ZFS, in which .do_body_ops is not set. with the patch above I'm going to catch this case if the theory is correct.

though the next step is not obvious, as the object (likely an llog) is expected to exist at this point.
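
To make the theory above concrete, here is a minimal userspace sketch in C. It is not Lustre code: the struct layout, the `exists` / `declared_create` / `being_destroyed` flags, and every function name are hypothetical stand-ins, and the condition in `init_conditional()` is only an assumption based on the description in this comment. The real `dt_declare_record_write()` asserts on `dt->do_body_ops` (see the console log); the sketch returns an error instead, purely so the difference between the two initialisation styles can be observed without aborting.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for a table of body operations. */
struct body_ops {
	int (*declare_write)(size_t len);
};

/* Hypothetical, simplified stand-in for a dt object. */
struct dt_obj {
	bool exists;            /* object is present on disk */
	bool declared_create;   /* creation was declared in this transaction */
	bool being_destroyed;   /* backend is tearing the object down */
	const struct body_ops *do_body_ops;
};

static int noop_declare_write(size_t len)
{
	(void)len;
	return 0;
}

static const struct body_ops sample_body_ops = {
	.declare_write = noop_declare_write,
};

/* ldiskfs-like behaviour as described: body ops are always set at init. */
static void init_unconditional(struct dt_obj *o)
{
	assert(o != NULL);
	o->do_body_ops = &sample_body_ops;
}

/*
 * zfs-like behaviour as described: body ops are set only if the object
 * exists or is declared to be created, and are left unset while the
 * object is being destroyed. The exact condition is an assumption.
 */
static void init_conditional(struct dt_obj *o)
{
	assert(o != NULL);
	if ((o->exists || o->declared_create) && !o->being_destroyed)
		o->do_body_ops = &sample_body_ops;
	else
		o->do_body_ops = NULL;
}

/*
 * Illustration only: returns -ENOENT where the real code would hit the
 * assertion. This is not the fix that was landed for this ticket.
 */
static int declare_record_write(const struct dt_obj *o, size_t len)
{
	if (o->do_body_ops == NULL)
		return -ENOENT;
	return o->do_body_ops->declare_write(len);
}
```

With these definitions, an object that is marked `being_destroyed` still gets body ops through `init_unconditional()`, while `init_conditional()` leaves `do_body_ops` as NULL, which is the state the assertion in the trace complains about.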

 

Comment by Mikhail Pershin [ 16/Jan/19 ]

Another one in master:
https://testing.whamcloud.com/test_sets/9838396a-18e7-11e9-8388-52540065bddc

Comment by Alex Zhuravlev [ 16/Jan/19 ]

learnt how to reproduce locally..

Comment by Alex Zhuravlev [ 17/Jan/19 ]

so far, tracked this down to FID (SEQ) duplication only with ZFS..

Comment by Gerrit Updater [ 20/Jan/19 ]

Alex Zhuravlev (bzzz@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/34069
Subject: LU-10143 osd-zfs: allocate sequence in advance
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 902fbc22f0b9f3dc17ab48912cb361cd93a40db1

Comment by Gerrit Updater [ 06/Feb/19 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/34069/
Subject: LU-10143 osd-zfs: allocate sequence in advance
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 51c449b73994f2bba98ee27ac77f90c9aa846e88

Comment by Gerrit Updater [ 15/Feb/19 ]

Minh Diep (mdiep@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/34267
Subject: LU-10143 osd-zfs: allocate sequence in advance
Project: fs/lustre-release
Branch: b2_10
Current Patch Set: 1
Commit: 4dd7b58d7cab32364c75df4cafd6d6995846b215

Comment by Gerrit Updater [ 20/Feb/19 ]

Patrick Farrell (pfarrell@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/34285
Subject: LU-10143 tests: Add version check for interop
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: a5fac453acc8ddb0a59c4f76341ad7825ecbfd34

Comment by Gerrit Updater [ 23/Feb/19 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/34267/
Subject: LU-10143 osd-zfs: allocate sequence in advance
Project: fs/lustre-release
Branch: b2_10
Current Patch Set:
Commit: 98ec90dd316054904f054e3cf88ae0ba8e54a2dd

Comment by Gerrit Updater [ 25/Feb/19 ]

Minh Diep (mdiep@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/34295
Subject: LU-10143 osd-zfs: allocate sequence in advance
Project: fs/lustre-release
Branch: b2_12
Current Patch Set: 1
Commit: a1126a4baebf2aa8fcf43c270398a380d3c0b503

Comment by Gerrit Updater [ 03/Mar/19 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/34285/
Subject: LU-10143 tests: Add version check for interop
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 1e63c2f85da17947d78d4b2fb79cab2bd04b2ca5

Comment by Gerrit Updater [ 19/Mar/19 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/34295/
Subject: LU-10143 osd-zfs: allocate sequence in advance
Project: fs/lustre-release
Branch: b2_12
Current Patch Set:
Commit: 6c69952f0654028d129bd76a7a24007ead26610f

Comment by Gerrit Updater [ 01/Apr/19 ]

Minh Diep (mdiep@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/34562
Subject: LU-10143 tests: Add version check for interop
Project: fs/lustre-release
Branch: b2_12
Current Patch Set: 1
Commit: fdada0bd114f4355ee79dfe250cc448d34b98ed6

Comment by Gerrit Updater [ 08/Apr/19 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/34562/
Subject: LU-10143 tests: Add version check for interop
Project: fs/lustre-release
Branch: b2_12
Current Patch Set:
Commit: c44b6a5f36c66b3c02846c4ccaa6eca5bf061434

Comment by Bruno Faccini (Inactive) [ 04/Nov/19 ]

+1 with recent master at https://testing.whamcloud.com/test_sessions/97ef006b-0a64-4710-b57c-83f6318cb9ec .

Comment by Sebastien Buisson [ 15/Nov/19 ]

+1 on master:
https://testing.whamcloud.com/test_sets/85fcc268-0728-11ea-8e77-52540065bddc

Comment by James Nunez (Inactive) [ 15/Nov/19 ]

Reopening this ticket because it looks like we are seeing this issue again on master (2.14). In this case, https://testing.whamcloud.com/test_sets/278f669c-05aa-11ea-bbc3-52540065bddc, we are seeing replay-single test_118 crash with this LBUG.

Comment by Jian Yu [ 20/Nov/19 ]

+1 on master: https://testing.whamcloud.com/test_sets/1d5271ac-0b5d-11ea-8e77-52540065bddc

Comment by Andreas Dilger [ 27/Nov/19 ]

+5 on master in the past week. This seems very likely related to LU-9924.
https://testing.whamcloud.com/test_sets/7992c6c2-0cae-11ea-98f1-52540065bddc
https://testing.whamcloud.com/test_sets/ab5220e6-0cad-11ea-bbc3-52540065bddc
https://testing.whamcloud.com/test_sets/576bd82e-108f-11ea-8e77-52540065bddc
https://testing.whamcloud.com/test_sets/7425bbb6-10a2-11ea-9487-52540065bddc
https://testing.whamcloud.com/test_sets/78f6a1f8-10ae-11ea-98f1-52540065bddc

Comment by Andreas Dilger [ 16/Jan/20 ]

+1 on master replay-single test_118 https://testing.whamcloud.com/test_sets/ffc091c4-3892-11ea-b1e8-52540065bddc

Comment by Emoly Liu [ 21/Jan/20 ]

+1 on master: https://testing.whamcloud.com/test_sets/22e755f0-3b97-11ea-80b4-52540065bddc

Comment by Jian Yu [ 28/Jan/20 ]

+1 on master: https://testing.whamcloud.com/test_sets/8ee4dcc8-415a-11ea-9847-52540065bddc

Comment by Andreas Dilger [ 28/Jan/20 ]

+1 on master https://testing.whamcloud.com/test_sets/9588453c-41a6-11ea-af6a-52540065bddc

Comment by Andreas Dilger [ 07/Feb/20 ]

I'm going to close this ticket, since its patches have already landed and been backported to other branches. LU-13195 can be used for investigating/fixing this (potentially the same) issue on master for 2.14.
