  Lustre / LU-8685

Fix JBD2 issue in EL7 Kernels


Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Minor
    • Lustre 2.9.0
    • None
    • None
    • 3

    Description

      A bug in the JBD2 code shipped with EL7 kernels has been uncovered at some sites. It can lead to kernel Oopses such as:

      [ 3440.794264] ------------[ cut here ]------------
      [ 3440.794294] kernel BUG at fs/jbd2/transaction.c:2239!
      [ 3440.794319] invalid opcode: 0000 [#1] SMP 
      [ 3440.794971] CPU: 10 PID: 7903 Comm: mdt03_010 Tainted: G           OE  ------------   3.10.0-327.36.1.el7_lustre.2.7.18.2.x86_64 #1
      [ 3440.795057] task: ffff880fd3453980 ti: ffff880f7dea0000 task.ti: ffff880f7dea0000
      [ 3440.795091] RIP: 0010:[<ffffffffa100a8b6>]  [<ffffffffa100a8b6>] __jbd2_journal_file_buffer+0x206/0x220 [jbd2]
      [ 3440.795134] RSP: 0018:ffff880f7dea3810  EFLAGS: 00010246
      [ 3440.795154] RAX: 000000009a0e9a0e RBX: ffff880fd2a15e00 RCX: 0000000000009a0e
      [ 3440.795185] RDX: 0000000000009a0e RSI: ffff880fdcc9a100 RDI: ffff880fd2a15e00
      [ 3440.795218] RBP: ffff880f7dea3860 R08: 4010000000000000 R09: 0fdccf2820080000
      [ 3440.795249] R10: f00534cf13c20802 R11: 0000000000000002 R12: ffff880fdccf2820
      [ 3440.795282] R13: ffff880fdcc9a100 R14: 0000000000000004 R15: ffff880fdccf2820
      [ 3440.795311] FS:  0000000000000000(0000) GS:ffff88103fa80000(0000) knlGS:0000000000000000
      [ 3440.795347] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [ 3440.795374] CR2: 0000000000d88e10 CR3: 000000000194a000 CR4: 00000000001407e0
      [ 3440.795408] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      [ 3440.795441] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
      [ 3440.795465] Stack:
      [ 3440.795473]  0000000000009a0c 0000000000009a0e 0000000000009a0e ffff882025561ba0
      [ 3440.795513]  00000000b540e5dd ffff880fd2a15e00 ffff880fdcc9a100 ffff880fce58e030
      [ 3440.795555]  ffff882025561ba0 ffff880fdccf2820 ffff880f7dea38e8 ffffffffa100ac0c
      [ 3440.795597] Call Trace:
      [ 3440.795616]  [<ffffffffa100ac0c>] do_get_write_access+0x33c/0x4e0 [jbd2]
      [ 3440.795650]  [<ffffffffa100add7>] jbd2_journal_get_write_access+0x27/0x40 [jbd2]
      [ 3440.795682]  [<ffffffffa1068c0b>] __ldiskfs_journal_get_write_access+0x3b/0x80 [ldiskfs]
      [ 3440.795725]  [<ffffffffa107c069>] ldiskfs_delete_entry+0xa9/0x1a0 [ldiskfs]
      [ 3440.795770]  [<ffffffffa1197ab8>] ? osd_fld_lookup+0x48/0xd0 [osd_ldiskfs]
      [ 3440.795809]  [<ffffffffa1197c03>] ? osd_remote_fid+0xc3/0x440 [osd_ldiskfs]
      [ 3440.795848]  [<ffffffffa1198539>] osd_index_ea_delete+0x5b9/0xbf0 [osd_ldiskfs]
      [ 3440.797985]  [<ffffffff811c186e>] ? kmem_cache_alloc_trace+0x1ce/0x1f0
      [ 3440.800123]  [<ffffffffa142d017>] lod_index_delete+0x67/0x140 [lod]
      [ 3440.802246]  [<ffffffffa0a48a3f>] ? lu_context_init+0xff/0x260 [obdclass]
      [ 3440.804408]  [<ffffffffa147ff2c>] __mdd_index_delete_only+0x19c/0x260 [mdd]
      [ 3440.806558]  [<ffffffffa1480d09>] __mdd_index_delete+0x49/0x2a0 [mdd]
      [ 3440.808568]  [<ffffffffa0a4d64e>] ? lu_capainfo_get+0x1e/0x30 [obdclass]
      [ 3440.810532]  [<ffffffffa1491500>] mdd_unlink+0x600/0xa90 [mdd]
      [ 3440.812537]  [<ffffffffa1353ac6>] mdt_reint_unlink+0xa96/0x11f0 [mdt]
      [ 3440.814493]  [<ffffffffa0a66afe>] ? lu_ucred+0x1e/0x30 [obdclass]
      [ 3440.816392]  [<ffffffffa13575b0>] mdt_reint_rec+0x80/0x210 [mdt]
      [ 3440.818283]  [<ffffffffa13382a9>] mdt_reint_internal+0x5d9/0xb30 [mdt]
      [ 3440.820095]  [<ffffffffa1343237>] mdt_reint+0x67/0x140 [mdt]
      [ 3440.821948]  [<ffffffffa0d149db>] tgt_request_handle+0x8fb/0x11f0 [ptlrpc]
      [ 3440.823730]  [<ffffffffa0cb8aab>] ptlrpc_server_handle_request+0x21b/0xa90 [ptlrpc]
      [ 3440.825391]  [<ffffffffa0905e08>] ? lc_watchdog_touch+0x68/0x180 [libcfs]
      [ 3440.827156]  [<ffffffffa0cb5b78>] ? ptlrpc_wait_event+0x98/0x330 [ptlrpc]
      [ 3440.828815]  [<ffffffffa0cbc3d0>] ptlrpc_main+0xc00/0x1f60 [ptlrpc]
      [ 3440.830489]  [<ffffffff81013588>] ? __switch_to+0xf8/0x4b0
      [ 3440.832008]  [<ffffffffa0cbb7d0>] ? ptlrpc_register_service+0x1070/0x1070 [ptlrpc]
      [ 3440.833556]  [<ffffffff810a5b8f>] kthread+0xcf/0xe0
      [ 3440.835122]  [<ffffffff810a5ac0>] ? kthread_create_on_node+0x140/0x140
      [ 3440.836661]  [<ffffffff81646b98>] ret_from_fork+0x58/0x90
      [ 3440.838211]  [<ffffffff810a5ac0>] ? kthread_create_on_node+0x140/0x140
      [ 3440.839636] Code: 00 e9 3c ff ff ff 0f 1f 80 00 00 00 00 49 83 c5 48 e9 0b ff ff ff 0f 1f 80 00 00 00 00 41 83 45 18 01 49 83 c5 28 e9 f6 fe ff ff <0f> 0b 0f 0b e8 91 07 07 e0 48 85 c0 0f 84 66 fe ff ff 0f 0b 0f 
      [ 3440.842717] RIP  [<ffffffffa100a8b6>] __jbd2_journal_file_buffer+0x206/0x220 [jbd2]
      

      or to soft lockups, with threads stuck spinning while waiting for the j_list_lock spinlock.

      The BUG is hit on the assert_spin_locked() check in __jbd2_journal_file_buffer() (fs/jbd2/transaction.c:2239):

         2228 /*
         2229  * File a buffer on the given transaction list.
         2230  */
         2231 void __jbd2_journal_file_buffer(struct journal_head *jh,
         2232                         transaction_t *transaction, int jlist)
         2233 {
         2234         struct journal_head **list = NULL;
         2235         int was_dirty = 0;
         2236         struct buffer_head *bh = jh2bh(jh);
         2237 
         2238         J_ASSERT_JH(jh, jbd_is_locked_bh_state(bh));
         2239         assert_spin_locked(&transaction->t_journal->j_list_lock); <<<<<
      

      This problem was introduced by an earlier upstream kernel commit, v3.14-rc2-30-g6e4862a "jbd2: minimize region locked by j_list_lock in journal_get_create_access()", after which j_list_lock could be unlocked even when it was not held.


            People

              Assignee: Bruno Faccini (bfaccini)
              Reporter: Bruno Faccini (bfaccini)