LU-7419: llog corruption after hitting ASSERTION( handle->lgh_hdr == ((void *)0) ) in llog_init_handle


Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Blocker
    • Fix Version: Lustre 2.8.0
    • Affects Version: Lustre 2.8.0
    • Environment: 2.6.32-504.16.2.1chaos.ch5.3.x86_64
    • Severity: 3

    Description

      A kernel panic occurred while a new llog catalog record was being initialized (details below): the MDS hit an assertion that checks the llog_handle does not already have an llog header attached, i.e. that handle->lgh_hdr is still NULL. The panic occurred while I was running an mdtest job writing to a striped directory from 32 client nodes with 4 threads each, and it left an llog file corrupted. I manually repaired the llog file, restarted my MDSs, and recovery now completes.
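
      For context, the assertion sits at the top of llog_init_handle(), which is meant to run exactly once per freshly opened llog handle; hitting it means the handle already had a header attached, i.e. the same handle was being initialized a second time. A simplified, paraphrased sketch of that entry point (not the verbatim llog.c code) looks like this:

      /* Simplified paraphrase of llog_init_handle() from lustre/obdclass/llog.c
       * (not verbatim).  The handle must not yet have a header attached;
       * the LASSERT below is the one reported in this ticket. */
      int llog_init_handle(const struct lu_env *env, struct llog_handle *handle,
                           int flags, struct obd_uuid *uuid)
      {
              struct llog_log_hdr *llh;

              LASSERT(handle->lgh_hdr == NULL);

              OBD_ALLOC_PTR(llh);
              if (llh == NULL)
                      return -ENOMEM;
              handle->lgh_hdr = llh;
              /* ... read the header from disk (or create a new one), verify the
               *     magic, and apply 'flags' to llh->llh_flags ... */
              return 0;
      }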

      To summarize my setup, I am running a test cluster with 4 MDSs, 2 OSTs, and 32 client nodes with the filesystem mounted. There is no failover configured. The Lustre version is 2.7.62. The error messages and call stack from MDS1 are below:

      2015-11-04 13:36:26 LustreError: 38007:0:(llog.c:342:llog_init_handle()) ASSERTION( handle->lgh_hdr == ((void *)0) ) failed:
      2015-11-04 13:36:26 LustreError: 38007:0:(llog.c:342:llog_init_handle()) LBUG
      2015-11-04 13:36:26 Pid: 38007, comm: mdt01_005
      2015-11-04 13:36:26 Nov  4 13:36:26
      ...
      2015-11-04 13:36:26 Kernel panic - not syncing: LBUG
      2015-11-04 13:36:26 Pid: 38007, comm: mdt01_005 Tainted: P           ---------------    2.6.32-504.16.2.1chaos.ch5.3.x86_64 #1
      2015-11-04 13:36:26 Call Trace:
      2015-11-04 13:36:26  [<ffffffff8152d471>] ? panic+0xa7/0x16f
      2015-11-04 13:36:26  [<ffffffffa0847f2b>] ? lbug_with_loc+0x9b/0xb0 [libcfs]
      2015-11-04 13:36:26  [<ffffffffa09a62cf>] ? llog_init_handle+0x86f/0xb10 [obdclass]
      2015-11-04 13:36:26  [<ffffffffa09ac809>] ? llog_cat_new_log+0x3d9/0xdc0 [obdclass]
      2015-11-04 13:36:26  [<ffffffffa09a4663>] ? llog_declare_write_rec+0x93/0x210 [obdclass]
      2015-11-04 13:36:26  [<ffffffffa09ad616>] ? llog_cat_declare_add_rec+0x426/0x430 [obdclass]
      2015-11-04 13:36:26  [<ffffffffa09a406f>] ? llog_declare_add+0x7f/0x1b0 [obdclass]
      2015-11-04 13:36:26  [<ffffffffa0c9c19c>] ? top_trans_start+0x17c/0x960 [ptlrpc]
      2015-11-04 13:36:26  [<ffffffffa127cc11>] ? lod_trans_start+0x61/0x70 [lod]
      2015-11-04 13:36:26  [<ffffffffa13248b4>] ? mdd_trans_start+0x14/0x20 [mdd]
      2015-11-04 13:36:26  [<ffffffffa1313333>] ? mdd_create+0xe53/0x1aa0 [mdd]
      2015-11-04 13:36:26  [<ffffffffa11c6784>] ? mdt_version_save+0x84/0x1a0 [mdt]
      2015-11-04 13:36:26  [<ffffffffa11c8f46>] ? mdt_reint_create+0xbb6/0xcc0 [mdt]
      2015-11-04 13:36:26  [<ffffffffa0a13230>] ? lu_ucred+0x20/0x30 [obdclass]
      2015-11-04 13:36:26  [<ffffffffa11a8675>] ? mdt_ucred+0x15/0x20 [mdt]
      2015-11-04 13:36:26  [<ffffffffa11c183c>] ? mdt_root_squash+0x2c/0x3f0 [mdt]
      2015-11-04 13:36:26  [<ffffffffa0c43d32>] ? __req_capsule_get+0x162/0x6e0 [ptlrpc]
      2015-11-04 13:36:26  [<ffffffffa11c597d>] ? mdt_reint_rec+0x5d/0x200 [mdt]
      2015-11-04 13:36:26  [<ffffffffa11b177b>] ? mdt_reint_internal+0x62b/0xb80 [mdt]
      2015-11-04 13:36:26  [<ffffffffa11b216b>] ? mdt_reint+0x6b/0x120 [mdt]
      2015-11-04 13:36:26  [<ffffffffa0c8621c>] ? tgt_request_handle+0x8bc/0x12e0 [ptlrpc]
      2015-11-04 13:36:26  [<ffffffffa0c2da21>] ? ptlrpc_main+0xe41/0x1910 [ptlrpc]
      2015-11-04 13:36:26  [<ffffffff8106d740>] ? pick_next_task_fair+0xd0/0x130
      2015-11-04 13:36:26  [<ffffffff8152d8f6>] ? schedule+0x176/0x3a0
      2015-11-04 13:36:26  [<ffffffffa0c2cbe0>] ? ptlrpc_main+0x0/0x1910 [ptlrpc]
      2015-11-04 13:36:26  [<ffffffff8109fffe>] ? kthread+0x9e/0xc0
      2015-11-04 13:36:27  [<ffffffff8100c24a>] ? child_rip+0xa/0x20
      2015-11-04 13:36:27  [<ffffffff8109ff60>] ? kthread+0x0/0xc0
      2015-11-04 13:36:27  [<ffffffff8100c240>] ? child_rip+0x0/0x20
      

      After rebooting MDS1, I started to see the llog corruption messages shown below for an llog file on MDS4 (recall that the panic was on MDS1):

      2015-11-04 14:15:59 LustreError: 11466:0:(llog_osd.c:833:llog_osd_next_block()) ldne-MDT0003-osp-MDT0000: can't read llog block from log [0x300000401:0x1:0x0] offset 32768: rc = -5
      2015-11-04 14:15:59 LustreError: 11466:0:(llog.c:578:llog_process_thread()) Local llog found corrupted
      

      Eventually, the recovery timer goes negative and recovery never ends (see LU-6994). I manually fixed the llog file on MDS4 and recovery now completes. Attached are the original corrupted llog (0x300000401.0x1.0x0) and the version after the fix (0x300000401.0x1.0x0.replace).
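
      For anyone looking at the attached files, a small standalone reader along these lines can print the first record header so the corrupted and repaired copies can be compared. This is a hypothetical helper, not the tool used for the repair; the four-field record header layout and the 0x10645539 log-header magic are assumptions based on the llog on-disk format, and the file is assumed to be in the host byte order of the x86_64 MDS that wrote it:

      /* Hypothetical inspection helper (not a Lustre utility): dump the first
       * record header of an llog file such as 0x300000401.0x1.0x0.  The field
       * layout mirrors struct llog_rec_hdr; 0x10645539 is the assumed value
       * of the log-header magic. */
      #include <stdint.h>
      #include <stdio.h>

      struct rec_hdr {
              uint32_t lrh_len;       /* record length in bytes */
              uint32_t lrh_index;     /* record index within the log */
              uint32_t lrh_type;      /* record type/magic */
              uint32_t lrh_id;
      };

      int main(int argc, char **argv)
      {
              struct rec_hdr hdr;
              FILE *f;

              if (argc != 2) {
                      fprintf(stderr, "usage: %s <llog file>\n", argv[0]);
                      return 1;
              }
              f = fopen(argv[1], "rb");
              if (f == NULL || fread(&hdr, sizeof(hdr), 1, f) != 1) {
                      fprintf(stderr, "cannot read %s\n", argv[1]);
                      return 1;
              }
              printf("len=%u index=%u type=0x%08x id=%u\n",
                     hdr.lrh_len, hdr.lrh_index, hdr.lrh_type, hdr.lrh_id);
              if (hdr.lrh_type != 0x10645539)     /* assumed LLOG_HDR_MAGIC */
                      printf("warning: first record does not look like a log header\n");
              fclose(f);
              return 0;
      }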

      Attachments

        1. 0x300000401.0x1.0x0
          34 kB
          Giuseppe Di Natale
        2. 0x300000401.0x1.0x0.replace
          34 kB
          Giuseppe Di Natale


            People

              Assignee: Di Wang (di.wang)
              Reporter: Giuseppe Di Natale (dinatale2, Inactive)
              Votes: 0
              Watchers: 9
