[LU-15143] journal_dirty_metadata failed: handle type 0 started at line 1926, credits 9/0, errcode -28 Created: 21/Oct/21  Updated: 20/Nov/21  Resolved: 20/Nov/21

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: Lustre 2.15.0

Type: Bug Priority: Major
Reporter: Andrew Perepechko Assignee: Andrew Perepechko
Resolution: Fixed Votes: 0
Labels: patch

Issue Links:
Related
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   
[17750.028365] LDISKFS-fs: ldiskfs_getblk:896: aborting transaction: error 28 in __ldiskfs_handle_dirty_metadata
[17750.028367] LDISKFS-fs error (device md66): ldiskfs_getblk:896: inode #85: block 3107684352: comm mdt06_007: journal_dirty_metadata failed: handle type 0 started at line 1926, credits 9/0, errcode -28
[17750.640480] Aborting journal on device md66-8.
[17750.640508] Kernel panic - not syncing: LDISKFS-fs (device md66): panic forced after error 

Lustre crashes with the following stack trace:

[17750.640510] Call Trace:

[17750.640512]  [<ffffffffb4164e41>] dump_stack+0x19/0x1b
[17750.640514]  [<ffffffffb415e550>] panic+0xe8/0x21f
[17750.640524]  [<ffffffffc14b3c86>] ldiskfs_handle_error.part.191+0xa6/0xb0 [ldiskfs]
[17750.640530]  [<ffffffffc14b3ffa>] __ldiskfs_error_inode+0xaa/0x180 [ldiskfs]
[17750.640535]  [<ffffffffc1475f9a>] ? ldiskfs_journal_abort_handle.isra.5+0xaa/0xc0 [ldiskfs]
[17750.640539]  [<ffffffffc1476627>] __ldiskfs_handle_dirty_metadata+0x137/0x220 [ldiskfs]
[17750.640545]  [<ffffffffc149a891>] ldiskfs_getblk+0x131/0x200 [ldiskfs]
[17750.640550]  [<ffffffffc149a987>] ldiskfs_bread+0x27/0xc0 [ldiskfs]
[17750.640558]  [<ffffffffc153c370>] osd_write+0x4c0/0xc30 [osd_ldiskfs]
[17750.640564]  [<ffffffffc1493dcf>] ? ldiskfs_xattr_find_entry+0x3f/0x130 [ldiskfs]
[17750.640582]  [<ffffffffc0c1f3b6>] dt_record_write+0x36/0x120 [obdclass]
[17750.640614]  [<ffffffffc0f04cbe>] tgt_mk_reply_data+0x49e/0xaa0 [ptlrpc]
[17750.640630]  [<ffffffffc0bfc919>] ? lprocfs_counter_add+0xf9/0x160 [obdclass]
[17750.640658]  [<ffffffffc0f05fa2>] tgt_last_rcvd_update+0x7d2/0xa00 [ptlrpc]
[17750.640659]  [<ffffffffb3c1c412>] ? kmem_cache_alloc+0x1c2/0x1f0
[17750.640662]  [<ffffffffc07037ec>] ? start_this_handle+0x3ac/0x430 [jbd2]
[17750.640689]  [<ffffffffc0f0a9c0>] tgt_txn_stop_cb+0x1a0/0x490 [ptlrpc]
[17750.640706]  [<ffffffffc0c20621>] dt_txn_hook_stop+0x81/0xd0 [obdclass]
[17750.640712]  [<ffffffffc152429d>] osd_trans_stop+0x18d/0x920 [osd_ldiskfs]
[17750.640718]  [<ffffffffc151db00>] ? osd_trans_start+0x130/0x4e0 [osd_ldiskfs]
[17750.640729]  [<ffffffffc12c3e8e>] mdt_empty_transno+0x11e/0x850 [mdt]
[17750.640738]  [<ffffffffc12c90be>] mdt_mfd_open+0x8de/0xe70 [mdt]
[17750.640747]  [<ffffffffc12c9d33>] mdt_finish_open+0x6e3/0x880 [mdt]
[17750.640756]  [<ffffffffc12ca1c3>] mdt_open_by_fid+0x2f3/0x4b0 [mdt]
[17750.640765]  [<ffffffffc12cb738>] mdt_reint_open+0x928/0x2630 [mdt]
[17750.640782]  [<ffffffffc0c334b8>] ? upcall_cache_get_entry+0x218/0x8b0 [obdclass]
[17750.640797]  [<ffffffffc0c384fe>] ? lu_ucred+0x1e/0x30 [obdclass]
[17750.640806]  [<ffffffffc12a9db5>] ? mdt_ucred+0x15/0x20 [mdt]
[17750.640814]  [<ffffffffc12aa651>] ? mdt_root_squash+0x21/0x430 [mdt]
[17750.640823]  [<ffffffffc12bef43>] mdt_reint_rec+0x83/0x210 [mdt]
[17750.640831]  [<ffffffffc12984b3>] mdt_reint_internal+0x703/0xae0 [mdt]
[17750.640840]  [<ffffffffc12a4f26>] ? mdt_intent_fixup_resent+0x36/0x220 [mdt]
[17750.640848]  [<ffffffffc12a51a3>] mdt_intent_open+0x93/0x430 [mdt]
[17750.640876]  [<ffffffffc0ea9437>] ? lustre_msg_buf+0x17/0x60 [ptlrpc]
[17750.640884]  [<ffffffffc129b6ba>] mdt_intent_opc+0x1ba/0xb50 [mdt]
[17750.640911]  [<ffffffffc0ead900>] ? lustre_swab_ldlm_policy_data+0x30/0x30 [ptlrpc]
[17750.640919]  [<ffffffffc12a5110>] ? mdt_intent_fixup_resent+0x220/0x220 [mdt]
[17750.640928]  [<ffffffffc12a37f4>] mdt_intent_policy+0x1a4/0x360 [mdt]
[17750.640949]  [<ffffffffc0e5b52a>] ldlm_lock_enqueue+0x3da/0xae0 [ptlrpc]
[17750.640955]  [<ffffffffc095d6c3>] ? cfs_hash_bd_add_locked+0x63/0x80 [libcfs]
[17750.640961]  [<ffffffffc0960e4e>] ? cfs_hash_add+0xbe/0x1a0 [libcfs]
[17750.640985]  [<ffffffffc0e83d57>] ldlm_handle_enqueue0+0xa87/0x1660 [ptlrpc]
[17750.641010]  [<ffffffffc0ead980>] ? lustre_swab_ldlm_lock_desc+0x30/0x30 [ptlrpc]
[17750.641039]  [<ffffffffc0f0de52>] tgt_enqueue+0x62/0x210 [ptlrpc]
[17750.641067]  [<ffffffffc0f12c6a>] tgt_request_handle+0x96a/0x1680 [ptlrpc]
[17750.641095]  [<ffffffffc0eede71>] ? ptlrpc_nrs_req_get_nolock0+0xd1/0x170 [ptlrpc]
[17750.641100]  [<ffffffffc0951c0e>] ? ktime_get_real_seconds+0xe/0x10 [libcfs]
[17750.641126]  [<ffffffffc0eb87b6>] ptlrpc_server_handle_request+0x256/0xb10 [ptlrpc]
[17750.641150]  [<ffffffffc0eb91fe>] ? ptlrpc_wait_event+0xae/0x380 [ptlrpc]
[17750.641151]  [<ffffffffb3ad6b22>] ? default_wake_function+0x12/0x20
[17750.641152]  [<ffffffffb3acbdeb>] ? __wake_up_common+0x5b/0x90
[17750.641176]  [<ffffffffc0ebc895>] ptlrpc_main+0xc75/0x1c50 [ptlrpc]
[17750.641200]  [<ffffffffc0ebbc20>] ? ptlrpc_register_service+0x1010/0x1010 [ptlrpc]
[17750.641201]  [<ffffffffb3ac1f81>] kthread+0xd1/0xe0
[17750.641202]  [<ffffffffb3ac1eb0>] ? insert_kthread_work+0x40/0x40
[17750.641204]  [<ffffffffb4177c1d>] ret_from_fork_nospec_begin+0x7/0x21
[17750.641204]  [<ffffffffb3ac1eb0>] ? insert_kthread_work+0x40/0x40 

It seems that osd_declare_write() incorrectly calculates the necessary journal credits for /reply_data (32-byte append write) when extents are moved from the inode body to an external block:

1. user quota (properly accounted)
2. group quota (properly accounted)
3. project quota (properly accounted)
4. inode update (accounted via inode != NULL check)
5. bitmap update for the new block (properly accounted)
6. gd update for the new block (properly accounted)
7. new block update (properly accounted)
8. bitmap update for the new extent map block
9. gd update for the new extent map block
10. extent map update

The last three blocks are not properly accounted, only two blocks are in fact calculated via:

                depth = inode != NULL ? ext_depth(inode) : 0;
                depth = max(depth, 1) + 1; // we only get 2 here
                credits = depth; 

Eventually, only 9 credits are reserved but 10 credits are used.

A patch will be uploaded shortly.



 Comments   
Comment by Gerrit Updater [ 21/Oct/21 ]

"Andrew Perepechko <andrew.perepechko@hpe.com>" uploaded a new patch: https://review.whamcloud.com/45328
Subject: LU-15143 osd-ldiskfs: osd_declare_write() underestimates credits
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 384581eaba1cfd536a4dd7761a294523a6612d46

Comment by Gerrit Updater [ 20/Nov/21 ]

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/45328/
Subject: LU-15143 osd-ldiskfs: osd_declare_write() underestimates credits
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 0844727c55d52ec24f6cfb7fa043755a6635949c

Comment by Peter Jones [ 20/Nov/21 ]

Landed for 2.15

Generated at Sat Feb 10 03:15:49 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.