[LU-10143] LBUG dt_object.h:2166:dt_declare_record_write Created: 19/Oct/17 Updated: 07/Feb/20 Resolved: 07/Feb/20 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.11.0, Lustre 2.12.0, Lustre 2.10.5, Lustre 2.10.6 |
| Fix Version/s: | Lustre 2.13.0, Lustre 2.10.7, Lustre 2.12.1 |
| Type: | Bug | Priority: | Minor |
| Reporter: | James Casper | Assignee: | Alex Zhuravlev |
| Resolution: | Duplicate | Votes: | 0 |
| Labels: | None | ||
| Environment: |
trevis, full DNE |
||
| Issue Links: |
|
||
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
https://testing.hpdd.intel.com/test_sessions/2f6027c4-8e9a-4949-af3d-2f8d04940c9c
replay-single, test_80d: Timeout occurred after 155 mins, last suite running was replay-single, restarting cluster to continue tests
LBUG
From mds console:
[ 5748.826345] LustreError: 28616:0:(dt_object.h:2166:dt_declare_record_write()) ASSERTION( dt->do_body_ops ) failed:
[ 5748.828884] LustreError: 28616:0:(dt_object.h:2166:dt_declare_record_write()) LBUG
[ 5748.831194] Pid: 28616, comm: mdt_out00_003
[ 5748.833233]
[ 5748.833233] Call Trace:
[ 5748.836889] [<ffffffffc06917ae>] libcfs_call_trace+0x4e/0x60 [libcfs]
[ 5748.839015] [<ffffffffc069183c>] lbug_with_loc+0x4c/0xb0 [libcfs]
[ 5748.841118] [<ffffffffc0ef1743>] out_write_add_exec+0x133/0x1b0 [ptlrpc]
[ 5748.843217] [<ffffffffc0ee84a3>] out_write+0x333/0x370 [ptlrpc]
[ 5748.845224] [<ffffffffc0eeb1c4>] out_handle+0x1304/0x1920 [ptlrpc]
[ 5748.847209] [<ffffffffc0e7d4a2>] ? lustre_msg_get_opc+0x22/0xf0 [ptlrpc]
[ 5748.849234] [<ffffffffc0ee0d49>] ? tgt_request_preprocess.isra.26+0x299/0x7a0 [ptlrpc]
[ 5748.851305] [<ffffffffc0ee2475>] tgt_request_handle+0x925/0x1370 [ptlrpc]
[ 5748.853263] [<ffffffffc0e8b37e>] ptlrpc_server_handle_request+0x24e/0xab0 [ptlrpc]
[ 5748.855236] [<ffffffff810ba588>] ? __wake_up_common+0x58/0x90
[ 5748.857056] [<ffffffffc0e8eb22>] ptlrpc_main+0xa92/0x1e40 [ptlrpc]
[ 5748.858924] [<ffffffffc0e8e090>] ? ptlrpc_main+0x0/0x1e40 [ptlrpc]
[ 5748.860737] [<ffffffff810b098f>] kthread+0xcf/0xe0
[ 5748.862471] [<ffffffff810b08c0>] ? kthread+0x0/0xe0
[ 5748.864195] [<ffffffff816b4f18>] ret_from_fork+0x58/0x90
[ 5748.865926] [<ffffffff810b08c0>] ? kthread+0x0/0xe0
[ 5748.867590]
[ 5748.868971] Kernel panic - not syncing: LBUG
[ 5748.869962] CPU: 1 PID: 28616 Comm: mdt_out00_003 Tainted: P OE ------------ 3.10.0-693.2.2.el7_lustre.x86_64 #1
[ 5748.869962] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2007
[ 5748.869962] ffff88007c7daf00 000000006e5a06dd ffff88005feefa88 ffffffff816a3d8d
[ 5748.869962] ffff88005feefb08 ffffffff8169dc74 ffffffff00000008 ffff88005feefb18
[ 5748.869962] ffff88005feefab8 000000006e5a06dd 000000006e5a06dd ffff88007fd0f8b8
[ 5748.869962] Call Trace:
[ 5748.869962] [<ffffffff816a3d8d>] dump_stack+0x19/0x1b
[ 5748.869962] [<ffffffff8169dc74>] panic+0xe8/0x20d
[ 5748.869962] [<ffffffffc0691854>] lbug_with_loc+0x64/0xb0 [libcfs]
[ 5748.869962] [<ffffffffc0ef1743>] out_write_add_exec+0x133/0x1b0 [ptlrpc]
[ 5748.869962] [<ffffffffc0ee84a3>] out_write+0x333/0x370 [ptlrpc]
[ 5748.869962] [<ffffffffc0eeb1c4>] out_handle+0x1304/0x1920 [ptlrpc]
[ 5748.869962] [<ffffffffc0e7d4a2>] ? lustre_msg_get_opc+0x22/0xf0 [ptlrpc]
[ 5748.869962] [<ffffffffc0ee0d49>] ? tgt_request_preprocess.isra.26+0x299/0x7a0 [ptlrpc]
[ 5748.893323] Lustre: DEBUG MARKER: umount -d /mnt/lustre-mds1
[ 5748.893798] [<ffffffffc0ee2475>] tgt_request_handle+0x925/0x1370 [ptlrpc]
[ 5748.893798] [<ffffffffc0e8b37e>] ptlrpc_server_handle_request+0x24e/0xab0 [ptlrpc]
[ 5748.893798] [<ffffffff810ba588>] ? __wake_up_common+0x58/0x90
[ 5748.893798] [<ffffffffc0e8eb22>] ptlrpc_main+0xa92/0x1e40 [ptlrpc]
[ 5748.893798] [<ffffffffc0e8e090>] ? ptlrpc_register_service+0xe80/0xe80 [ptlrpc]
[ 5748.893798] [<ffffffff810b098f>] kthread+0xcf/0xe0
[ 5748.893798] [<ffffffff810b08c0>] ? insert_kthread_work+0x40/0x40
[ 5748.893798] [<ffffffff816b4f18>] ret_from_fork+0x58/0x90
[ 5748.893798] [<ffffffff810b08c0>] ? insert_kthread_work+0x40/0x40 |
| Comments |
| Comment by Joseph Gmitter (Inactive) [ 20/Oct/17 ] |
|
Hi Alex, Can you please look into this? Thanks. |
| Comment by Sarah Liu [ 30/Jul/18 ] |
|
another instance on master tag-2.11.53 ZFS DNE https://testing.whamcloud.com/test_sets/454a32d6-9097-11e8-a9f7-52540065bddc |
| Comment by Alex Zhuravlev [ 30/Jul/18 ] |
|
trying to reproduce locally...
|
| Comment by Andreas Dilger [ 11/Aug/18 ] |
|
+1 on b2_10: |
| Comment by James Nunez (Inactive) [ 12/Dec/18 ] |
|
We have a replay-single test_80g crash with the same stack trace. Logs are at https://testing.whamcloud.com/test_sets/9c64d894-fdc2-11e8-b837-52540065bddc |
| Comment by Gerrit Updater [ 13/Dec/18 ] |
|
Alex Zhuravlev (bzzz@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/33849 |
| Comment by Alex Zhuravlev [ 13/Dec/18 ] |
|
The interesting difference in the code is that osd-ldiskfs sets .do_body_ops unconditionally, right at object initialisation (even if the object doesn't exist yet), while osd-zfs sets .do_body_ops only if the object exists or is declared to be created. But there is one case, when the object is being destroyed by ZFS, in which .do_body_ops is not set. With the patch above I'm going to catch this case, if the theory is correct. The next step is not obvious, though, as the object (likely an llog) is expected to exist at this point.
|
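The difference between the two backends described in the comment above can be illustrated with a small C model. This is only a sketch of the theory, not Lustre code: `struct obj`, `init_unconditional()`, `init_conditional()`, and `declare_record_write()` are hypothetical stand-ins, the `being_destroyed` flag is an assumed way to represent the "object is being destroyed" case, and the function returns -1 at the point where the real `dt_declare_record_write()` would hit `ASSERTION( dt->do_body_ops )`.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a body-operations table (not the Lustre type). */
struct body_ops {
	int (*declare_write)(void);
};

/* Hypothetical stand-in for a dt_object; only the fields the theory needs. */
struct obj {
	bool exists;           /* object is present on disk */
	bool declared_create;  /* creation was declared in this transaction */
	bool being_destroyed;  /* assumed flag for the destroy-in-progress case */
	const struct body_ops *do_body_ops;
};

static int declare_write_stub(void)
{
	return 0;
}

static const struct body_ops generic_body_ops = {
	.declare_write = declare_write_stub,
};

/* ldiskfs-like policy: body ops are installed unconditionally at init. */
void init_unconditional(struct obj *o)
{
	o->do_body_ops = &generic_body_ops;
}

/*
 * zfs-like policy: body ops are installed only if the object exists or is
 * declared to be created; an object being destroyed gets none.
 */
void init_conditional(struct obj *o)
{
	o->do_body_ops = NULL;
	if ((o->exists || o->declared_create) && !o->being_destroyed)
		o->do_body_ops = &generic_body_ops;
}

/*
 * Returns 0 on success and -1 where the real code would trip the
 * assertion on a missing do_body_ops pointer.
 */
int declare_record_write(const struct obj *o)
{
	if (o->do_body_ops == NULL)
		return -1;
	return o->do_body_ops->declare_write();
}
```

In this model, an object with `exists` and `being_destroyed` both set passes `declare_record_write()` after `init_unconditional()` but fails it after `init_conditional()`, which matches the suspicion that the crash is specific to the ZFS backend.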
| Comment by Mikhail Pershin [ 16/Jan/19 ] |
|
Another one in master: |
| Comment by Alex Zhuravlev [ 16/Jan/19 ] |
|
learnt how to reproduce locally.. |
| Comment by Alex Zhuravlev [ 17/Jan/19 ] |
|
so far, tracked this down to FID (SEQ) duplication only with ZFS.. |
| Comment by Gerrit Updater [ 20/Jan/19 ] |
|
Alex Zhuravlev (bzzz@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/34069 |
| Comment by Gerrit Updater [ 06/Feb/19 ] |
|
Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/34069/ |
| Comment by Gerrit Updater [ 15/Feb/19 ] |
|
Minh Diep (mdiep@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/34267 |
| Comment by Gerrit Updater [ 20/Feb/19 ] |
|
Patrick Farrell (pfarrell@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/34285 |
| Comment by Gerrit Updater [ 23/Feb/19 ] |
|
Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/34267/ |
| Comment by Gerrit Updater [ 25/Feb/19 ] |
|
Minh Diep (mdiep@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/34295 |
| Comment by Gerrit Updater [ 03/Mar/19 ] |
|
Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/34285/ |
| Comment by Gerrit Updater [ 19/Mar/19 ] |
|
Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/34295/ |
| Comment by Gerrit Updater [ 01/Apr/19 ] |
|
Minh Diep (mdiep@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/34562 |
| Comment by Gerrit Updater [ 08/Apr/19 ] |
|
Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/34562/ |
| Comment by Bruno Faccini (Inactive) [ 04/Nov/19 ] |
|
+1 with recent master at https://testing.whamcloud.com/test_sessions/97ef006b-0a64-4710-b57c-83f6318cb9ec . |
| Comment by Sebastien Buisson [ 15/Nov/19 ] |
|
+1 on master: |
| Comment by James Nunez (Inactive) [ 15/Nov/19 ] |
|
Reopening this ticket because it looks like we are seeing this issue again on master (2.14). In this case, https://testing.whamcloud.com/test_sets/278f669c-05aa-11ea-bbc3-52540065bddc, we are seeing replay-single test_118 crash with this LBUG. |
| Comment by Jian Yu [ 20/Nov/19 ] |
|
+1 on master: https://testing.whamcloud.com/test_sets/1d5271ac-0b5d-11ea-8e77-52540065bddc |
| Comment by Andreas Dilger [ 27/Nov/19 ] |
|
+5 on master in the past week. This seems very likely related to |
| Comment by Andreas Dilger [ 16/Jan/20 ] |
|
+1 on master replay-single test_118 https://testing.whamcloud.com/test_sets/ffc091c4-3892-11ea-b1e8-52540065bddc |
| Comment by Emoly Liu [ 21/Jan/20 ] |
|
+1 on master: https://testing.whamcloud.com/test_sets/22e755f0-3b97-11ea-80b4-52540065bddc |
| Comment by Jian Yu [ 28/Jan/20 ] |
|
+1 on master: https://testing.whamcloud.com/test_sets/8ee4dcc8-415a-11ea-9847-52540065bddc |
| Comment by Andreas Dilger [ 28/Jan/20 ] |
|
+1 on master https://testing.whamcloud.com/test_sets/9588453c-41a6-11ea-af6a-52540065bddc |
| Comment by Andreas Dilger [ 07/Feb/20 ] |
|
I'm going to close this ticket, since it had patches landed and backported to other branches already. |