Uploaded image for project: 'Lustre'
  1. Lustre
  2. LU-15404

kernel panic and filesystem corruption in setxattr due to journal transaction restart

Details

    • 3
    • 9223372036854775807

    Description

      During recent testing, we have found a repeatable kernel panic and ldiskfs corruption.

      The kernel panics with the following stack trace:

      [41021.608188] LNet: 18792:0:(o2iblnd_cb.c:3372:kiblnd_check_conns()) Timed out tx for 10.12.0.1@o2ib4000: 0 seconds
      [41023.501466] ------------[ cut here ]------------
      [41023.506369] kernel BUG at fs/jbd2/transaction.c:1476!
      [41023.511714] invalid opcode: 0000 [#1] SMP NOPTI
      [41023.516549] CPU: 18 PID: 304955 Comm: mdt03_009 Kdump: loaded Tainted: G OE --------- - - 4.18.0-305.10.2.x6.0.24.x86_64 #1
      [41023.537576] RIP: 0010:jbd2_journal_dirty_metadata+0x1fb/0x250 [jbd2]
      [41023.544198] Code: f3 90 48 8b 03 a9 00 00 80 00 75 f4 e9 b3 fe ff ff 44 8b 45 0c 41 83 f8 01 74 8c e9 ec 94 00 00 4c 39 65 30 0f 84 68 fe ff ff <0f> 0b 4d 8b 4a 70 4c 8d 73 02 4d 39 cc 0f 84 33 ff ff ff e9 53 95
      [41023.563722] RSP: 0018:ffffb106eae8f5d8 EFLAGS: 00010207
      [41023.569412] RAX: 000000000062c029 RBX: ffff997e385c99c0 RCX: 000000000000000
      [41023.576940] RDX: ffff998c322432a0 RSI: ffff997e385c99c0 RDI: ffff998c322432a0
      [41023.584556] RBP: ffff997f643bc960 R08: ffff997e385c99c0 R09: 0000000000000000
      [41023.592073] R10: ffff998c2d496800 R11: 0000000000000100 R12: ffff998b65522300
      [41023.599709] R13: 0000000000000000 R14: ffffffffc1a64350 R15: 0000000000001514
      [41023.607327] FS: 0000000000000000(0000) GS:ffff998e7f080000(0000) knlGS:0000000000000000
      [41023.615913] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [41023.622237] CR2: 00007ff9d42640dc CR3: 0000000f00210004 CR4: 00000000003706e0
      [41023.629848] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      [41023.637566] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      
      [41023.645130] Call Trace:
      [41023.648010] __ldiskfs_handle_dirty_metadata+0x51/0x190 [ldiskfs]
      [41023.654687] ldiskfs_do_update_inode+0x49a/0x7f0 [ldiskfs]
      [41023.660734] ldiskfs_mark_iloc_dirty+0x32/0x80 [ldiskfs]
      [41023.666635] ldiskfs_xattr_set_handle+0x381/0x580 [ldiskfs]
      [41023.672840] ldiskfs_xattr_set+0xd0/0x160 [ldiskfs]
      [41023.678168] __vfs_setxattr+0x66/0x80
      [41023.682392] osd_xattr_set+0x709/0x10a0 [osd_ldiskfs]
      [41023.688059] ? lod_gen_component_ea+0x2c2/0x9e0 [lod]
      [41023.693704] lod_sub_xattr_set+0x248/0x4d0 [lod]
      [41023.698900] lod_generate_and_set_lovea+0x262/0x310 [lod]
      [41023.704879] lod_striped_create+0x433/0x590 [lod]
      [41023.710083] lod_layout_change+0x192/0x270 [lod]
      [41023.715333] mdd_layout_change+0x13f7/0x1980 [mdd]
      [41023.720809] mdt_layout_change+0x31c/0x4b0 [mdt]
      [41023.726082] mdt_intent_layout+0x6c8/0x990 [mdt]
      [41023.731241] ? mdt_intent_getxattr+0x320/0x320 [mdt]
      [41023.736903] mdt_intent_opc+0x12c/0xbf0 [mdt]
      [41023.742067] mdt_intent_policy+0x207/0x3a0 [mdt]
      [41023.747281] ldlm_lock_enqueue+0x4e4/0xa80 [ptlrpc]
      [41023.752854] ldlm_handle_enqueue0+0x634/0x1760 [ptlrpc]
      [41023.758771] tgt_enqueue+0xa4/0x210 [ptlrpc]
      [41023.763752] tgt_request_handle+0xc93/0x1a00 [ptlrpc]
      [41023.769516] ? ptlrpc_nrs_req_get_nolock0+0xfb/0x1f0 [ptlrpc]
      [41023.775901] ptlrpc_server_handle_request+0x323/0xbd0 [ptlrpc]
      [41023.782512] ptlrpc_main+0xc06/0x1550 [ptlrpc]
      [41023.787683] ? ptlrpc_wait_event+0x500/0x500 [ptlrpc]
      [41023.793276] kthread+0x116/0x130
      [41023.797041] ? kthread_flush_work_fn+0x10/0x10
      [41023.802049] ret_from_fork+0x1f/0x40
      

      The corresponding line of code would map to:

                      J_ASSERT_JH(jh, jh->b_transaction == transaction ||
                                      jh->b_next_transaction == transaction);
      

      More precisely, jh is associated with an actively committing transaction in its disk writing phase (i.e. t_updates already dropped to zero).

      After a bit of tracing, we've found that the transaction is restarting when changing a large EA to another large EA, which in RHEL8-based ldiskfs code causes a new EA inode to be allocated and the old inode to be freed. The truncate part of the old inode release sometimes fails to extend current transaction and has to restart it:

      mdt03_024-198115 [012] 45670.650452: kernel_stack:         <stack trace>
      => trace_event_raw_event_jbd2_handle_start_class (ffffffffc0c7e60c)
      => jbd2__journal_restart (ffffffffc0c75b5c)
      => ldiskfs_datasem_ensure_credits (ffffffffc1ac3431)
      => ldiskfs_ext_rm_leaf (ffffffffc1ac44e8)
      => ldiskfs_ext_remove_space (ffffffffc1ac8240)
      => ldiskfs_ext_truncate (ffffffffc1ac953a)
      => ldiskfs_truncate (ffffffffc1adbdcb)
      => ldiskfs_evict_inode (ffffffffc1adcc71)
      => evict (ffffffff84f37202)
      => ldiskfs_xattr_set_entry (ffffffffc1abcf1e)
      => ldiskfs_xattr_ibody_set (ffffffffc1abd5be)
      => ldiskfs_xattr_set_handle (ffffffffc1abf9e4)
      => ldiskfs_xattr_set (ffffffffc1abfd70)
      => __vfs_setxattr (ffffffff84f431b6)
      => osd_xattr_set (ffffffffc1b7891d)
      => lod_sub_xattr_set (ffffffffc17da152)
      => lod_generate_and_set_lovea (ffffffffc17c7d8c)
      => lod_striped_create (ffffffffc17c81d0)
      => lod_layout_change (ffffffffc17c839b)
      => mdd_layout_change (ffffffffc1850f7d)
      => mdt_layout_change (ffffffffc18aeaf1)
      => mdt_intent_layout (ffffffffc18b5e30)
      => mdt_intent_opc (ffffffffc18ac778)
      => mdt_intent_policy (ffffffffc18b3ba6)
      => ldlm_lock_enqueue (ffffffffc138ffff)
      => ldlm_handle_enqueue0 (ffffffffc13b811f)
      => tgt_enqueue (ffffffffc1441b14)
      => tgt_request_handle (ffffffffc14465cd)
      => ptlrpc_server_handle_request (ffffffffc13ecaea)
      => ptlrpc_main (ffffffffc13f132a)
      => kthread (ffffffff84d043a6)
      => ret_from_fork (ffffffff8560023f)
      

      One problematic part here is that transaction restart enforces current transaction commit so the incomplete transaction will likely commit before the kernel panics. It will cause ldiskfs corruption after remount. The reason why the kernel panic is that we restart this transaction somewhere in between of ldiskfs_get_write_access() and ldiskfs_mark_dirty_metadata() so the inode bh sticks in the old transaction:

      ldiskfs_xattr_set_handle(handle_t *handle, struct inode *inode, int name_index,
                            const char *name, const void *value, size_t value_len,
                            int flags)
      ...
              error = ldiskfs_reserve_inode_write(handle, inode, &is.iloc);
      ...
                      error = ldiskfs_xattr_ibody_set(handle, inode, &i, &is);
      ...
                      error = ldiskfs_mark_iloc_dirty(handle, inode, &is.iloc);
      ...
      }
      

      We don't have a fix yet and haven't yet decided how to fix this. E.g. moving final iput for the old EA inode out of transaction may be problematic with osd-ldiskfs/ldiskfs layering.

      The bug seems to be almost completely coming from upstream. However, credits calculation may be different in ext4 and osd-ldiskfs and the bug may not necessarily reproduce with ext4 alone.

      Attachments

        Issue Links

          Activity

            [LU-15404] kernel panic and filesystem corruption in setxattr due to journal transaction restart

            "Etienne AUJAMES <eaujames@ddn.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/57769
            Subject: LU-15404 ldiskfs: fix truncate during setxattr for el7.9
            Project: fs/lustre-release
            Branch: b2_15
            Current Patch Set: 1
            Commit: af369e016d720877cbcd7ea5a086ed7df1f6a5ea

            gerrit Gerrit Updater added a comment - "Etienne AUJAMES <eaujames@ddn.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/57769 Subject: LU-15404 ldiskfs: fix truncate during setxattr for el7.9 Project: fs/lustre-release Branch: b2_15 Current Patch Set: 1 Commit: af369e016d720877cbcd7ea5a086ed7df1f6a5ea
            adilger Andreas Dilger made changes -
            Link New: This issue is related to EX-8679 [ EX-8679 ]
            pjones Peter Jones made changes -
            Link New: This issue is related to EX-8505 [ EX-8505 ]
            adilger Andreas Dilger made changes -
            Link New: This issue is related to LU-16973 [ LU-16973 ]
            pjones Peter Jones made changes -
            Link New: This issue is duplicated by EX-7038 [ EX-7038 ]

            "Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/51335/
            Subject: LU-15404 ldiskfs: fix truncate during setxattr for el7.9
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 471ce3d95651ca06209a76973cae3bbdb5b6aa2f

            gerrit Gerrit Updater added a comment - "Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/51335/ Subject: LU-15404 ldiskfs: fix truncate during setxattr for el7.9 Project: fs/lustre-release Branch: master Current Patch Set: Commit: 471ce3d95651ca06209a76973cae3bbdb5b6aa2f

            "Andreas Dilger <adilger@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/51335
            Subject: LU-15404 ldiskfs: fix truncate during setxattr for el7.9
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 9595ef28e16eeb110844c952e6a58079ded40500

            gerrit Gerrit Updater added a comment - "Andreas Dilger <adilger@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/51335 Subject: LU-15404 ldiskfs: fix truncate during setxattr for el7.9 Project: fs/lustre-release Branch: master Current Patch Set: 1 Commit: 9595ef28e16eeb110844c952e6a58079ded40500
            adilger Andreas Dilger made changes -
            Link New: This issue is related to LU-16032 [ LU-16032 ]
            pjones Peter Jones made changes -
            Fix Version/s New: Lustre 2.16.0 [ 15190 ]
            Fix Version/s New: Lustre 2.15.3 [ 15998 ]

            "Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/50586/
            Subject: LU-15404 ldiskfs: use per-filesystem workqueues to avoid deadlocks
            Project: fs/lustre-release
            Branch: b2_15
            Current Patch Set:
            Commit: 9ab613631b5833ba7e7a578fdb9819ebc593ab3c

            gerrit Gerrit Updater added a comment - "Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/50586/ Subject: LU-15404 ldiskfs: use per-filesystem workqueues to avoid deadlocks Project: fs/lustre-release Branch: b2_15 Current Patch Set: Commit: 9ab613631b5833ba7e7a578fdb9819ebc593ab3c

            People

              panda Andrew Perepechko
              panda Andrew Perepechko
              Votes:
              0 Vote for this issue
              Watchers:
              13 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: