[LU-17332] sanity test_820: kernel BUG at fs/jbd2/transaction.c:378
Created: 04/Dec/23  Updated: 03/Jan/24  Resolved: 03/Jan/24
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.16.0 |
| Fix Version/s: | Lustre 2.16.0 |
| Type: | Bug | Priority: | Critical |
| Reporter: | Maloo | Assignee: | Dongyang Li |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None |
| Issue Links: | |
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
This issue was created by maloo for Andreas Dilger <adilger@whamcloud.com>

This issue relates to the following test suite run: https://testing.whamcloud.com/test_sets/58a6b07c-fb1f-4a2d-ac3c-d7578d6b134f

test_820 failed with the following error: trevis-28vm3 crashed during sanity test_820

```
[26282.338565] Lustre: server umount lustre-OST0004 complete
[26282.411017] ------------[ cut here ]------------
[26282.412061] kernel BUG at fs/jbd2/transaction.c:378!
[26282.413171] invalid opcode: 0000 [#1] SMP PTI
[26282.414068] CPU: 1 PID: 784404 Comm: kworker/1:5 4.18.0-477.15.1.el8_lustre.x86_64 #1
[26282.416473] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
[26282.435639] Call Trace:
[26282.438083]  jbd2__journal_start+0xee/0x1f0 [jbd2]
[26282.439047]  jbd2_journal_start+0x19/0x20 [jbd2]
[26282.439979]  flush_stashed_stats_work+0x36/0x90 [ldiskfs]
[26282.441086]  process_one_work+0x1a7/0x360
[26282.442753]  worker_thread+0x30/0x390
[26282.444311]  kthread+0x134/0x150
```

Test session details: this started around 2023-07-21 +/- 7 days.

It looks like the workqueue is somehow still running after the journal is cleaned up, since the BUG is triggered by this path:

```c
int jbd2_journal_destroy(journal_t *journal)
{
	/* Wait for the commit thread to wake up and die. */
	journal_kill_thread(journal);
	:
}

static void journal_kill_thread(journal_t *journal)
{
	journal->j_flags |= JBD2_UNMOUNT;
	:
}

static int start_this_handle(journal_t *journal, handle_t *handle,
			     gfp_t gfp_mask)
{
	:
	BUG_ON(journal->j_flags & JBD2_UNMOUNT);
	:
}
```
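For reference, a minimal sketch of the unmount ordering that avoids this BUG_ON. It assumes the mainline ext4 naming (`s_error_work` in `struct ext4_sb_info`; the ldiskfs field and work-function names may differ), and `example_put_super()` is a hypothetical illustration rather than the actual ldiskfs code:

```c
#include <linux/fs.h>
#include <linux/workqueue.h>
#include <linux/jbd2.h>
#include "ext4.h"	/* struct ext4_sb_info, EXT4_SB() -- assumed layout */

/*
 * Hypothetical put_super path showing the required ordering: any work
 * item that starts a journal handle (flush_stashed_error_work, or the
 * ldiskfs flush_stashed_stats_work seen in the trace above) must be
 * flushed while jbd2_journal_start() is still legal.
 */
static void example_put_super(struct super_block *sb)
{
	struct ext4_sb_info *sbi = EXT4_SB(sb);

	/* Run out any pending deferred work first ... */
	flush_work(&sbi->s_error_work);

	/* ... and only then destroy the journal: journal_kill_thread()
	 * sets JBD2_UNMOUNT, after which start_this_handle() trips
	 * BUG_ON(journal->j_flags & JBD2_UNMOUNT). */
	if (sbi->s_journal)
		jbd2_journal_destroy(sbi->s_journal);
}
```

Presumably the same ordering is needed on any teardown path that destroys the journal, not just the normal umount path.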
| Comments |
| Comment by Andreas Dilger [ 08/Dec/23 ] |
I pushed patch https://review.whamcloud.com/53242.
| Comment by Peter Jones [ 15/Dec/23 ] |
Dongyang, could you please investigate? Thanks, Peter
| Comment by Dongyang Li [ 18/Dec/23 ] |
The test failure referred to by the ticket was on 2023-07-21: https://testing.whamcloud.com/test_sets/58a6b07c-fb1f-4a2d-ac3c-d7578d6b134f

However, I did find a similar crash happening recently on el9 servers; the trace looks like this:

```
[ 9096.356946] Lustre: DEBUG MARKER: umount -d -f /mnt/lustre-ost2
[ 9097.522537] LDISKFS-fs (dm-9): unmounting filesystem.
[ 9098.085442] ------------[ cut here ]------------
[ 9098.086099] kernel BUG at fs/jbd2/transaction.c:384!
[ 9098.086804] invalid opcode: 0000 [#1] PREEMPT SMP PTI
[ 9098.087460] CPU: 0 PID: 316553 Comm: kworker/0:2 Kdump: loaded Tainted: G OE -------- ---  5.14.0-284.30.1_lustre.el9.x86_64 #1
[ 9098.088869] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
[ 9098.089558] Workqueue: events flush_stashed_error_work [ldiskfs]
[ 9098.090478] RIP: 0010:start_this_handle+0x234/0x580 [jbd2]
[ 9098.091187] Code: 40 48 89 44 24 48 41 83 79 0c 02 74 1f 0f 0b 48 89 ef e8 5f 1c 67 e7 48 89 ef e8 67 1e 67 e7 49 8b 06 a8 01 0f 84 77 fe ff ff <0f> 0b 4d 8d be 90 00 00 00 ba 02 00 00 00 48 8d 74 24 28 4c 89 ff
[ 9098.093212] RSP: 0018:ffffb92607a37d90 EFLAGS: 00010202
[ 9098.093837] RAX: 0000000000000031 RBX: ffff92b1f5e6ba48 RCX: 00000000000000e0
[ 9098.094663] RDX: 0000000000008d40 RSI: ffffffffc06c3bf8 RDI: ffff92b1e6e3884c
[ 9098.095493] RBP: ffff92b1e6e3884c R08: 0000000000000100 R09: 0000000000000000
[ 9098.096334] R10: ffff92b220ddf900 R11: 0000000000000106 R12: ffff92b220ddf900
[ 9098.097167] R13: 0000000000000000 R14: ffff92b1e6e38800 R15: 0000000000000001
[ 9098.097995] FS:  0000000000000000(0000) GS:ffff92b27cc00000(0000) knlGS:0000000000000000
[ 9098.098933] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 9098.099609] CR2: 000055c0889ae840 CR3: 000000008ed7a005 CR4: 00000000000606f0
[ 9098.100438] Call Trace:
[ 9098.100782]  <TASK>
[ 9098.101098]  ? kmem_cache_alloc+0x17d/0x300
[ 9098.101661]  jbd2__journal_start+0xfb/0x200 [jbd2]
[ 9098.102254]  jbd2_journal_start+0x19/0x20 [jbd2]
[ 9098.102866]  flush_stashed_error_work+0x4a/0xf0 [ldiskfs]
[ 9098.103543]  process_one_work+0x1e5/0x3b0
[ 9098.104078]  ? rescuer_thread+0x390/0x390
[ 9098.104578]  worker_thread+0x50/0x3a0
[ 9098.105052]  ? rescuer_thread+0x390/0x390
[ 9098.105549]  kthread+0xd9/0x100
[ 9098.105978]  ? kthread_complete_and_exit+0x20/0x20
[ 9098.106571]  ret_from_fork+0x22/0x30
[ 9098.107085]  </TASK>
```

The reason is the same race described above: flush_stashed_error_work is still queued when umount destroys the journal, so once journal_kill_thread() has set JBD2_UNMOUNT, the work's jbd2_journal_start() hits the BUG_ON in start_this_handle().
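To make the race window concrete, a hedged sketch of how such a work item ends up queued in the first place (mainline ext4 names again; `example_report_error()` is a hypothetical wrapper, not the actual ldiskfs error path):

```c
#include <linux/fs.h>
#include <linux/workqueue.h>
#include "ext4.h"	/* struct ext4_sb_info, EXT4_SB() -- assumed layout */

/*
 * Hypothetical error-reporting path: errors can be raised from atomic
 * context, where the superblock cannot be updated through a journal
 * handle, so the on-disk update is deferred to process context via a
 * work item.  Nothing on this path synchronizes with umount, which is
 * exactly the window above: the worker (flush_stashed_error_work)
 * later calls jbd2_journal_start() on a journal that may already have
 * JBD2_UNMOUNT set.
 */
static void example_report_error(struct super_block *sb)
{
	struct ext4_sb_info *sbi = EXT4_SB(sb);

	/* Record the error state in memory, then defer the superblock
	 * write to the system workqueue. */
	schedule_work(&sbi->s_error_work);
}
```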
| Comment by Gerrit Updater [ 19/Dec/23 ] |
"Li Dongyang <dongyangli@ddn.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/53499
| Comment by Gerrit Updater [ 03/Jan/24 ] |
"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/53499/
| Comment by Peter Jones [ 03/Jan/24 ] |
Landed for 2.16