Uploaded image for project: 'Lustre'
  1. Lustre
  2. LU-8320

:(llog_osd.c:338:llog_osd_write_rec()) ASSERTION( llh ) failed:

Details

    • Bug
    • Resolution: Fixed
    • Critical
    • Lustre 2.9.0
    • None
    • None
    • 1
    • 9223372036854775807

    Description

      MDS crash with LBUG.

      0>LustreError: 39313:0:(llog_osd.c:338:llog_osd_write_rec()) ASSERTION( llh ) failed: ^M
      <0>LustreError: 39313:0:(llog_osd.c:338:llog_osd_write_rec()) LBUG^M
      <4>Pid: 39313, comm: mdt02_049^M
      <4>^M
      <4>Call Trace:^M
      <4> [<ffffffffa048b895>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]^M
      <4> [<ffffffffa048be97>] lbug_with_loc+0x47/0xb0 [libcfs]^M
      <4> [<ffffffffa05bed55>] llog_osd_write_rec+0xfb5/0x1370 [obdclass]^M
      <4> [<ffffffffa0d46ecb>] ? dynlock_unlock+0x16b/0x1d0 [osd_ldiskfs]^M
      <4> [<ffffffffa0d2e5d2>] ? iam_path_release+0x42/0x70 [osd_ldiskfs]^M
      <4> [<ffffffffa0590438>] llog_write_rec+0xc8/0x290 [obdclass]^M
      <4> [<ffffffffa059910d>] llog_cat_add_rec+0xad/0x480 [obdclass]^M
      <4> [<ffffffffa0590231>] llog_add+0x91/0x1d0 [obdclass]^M
      <4> [<ffffffffa0fd04f7>] osp_sync_add_rec+0x247/0xad0 [osp]^M
      <4> [<ffffffffa0fd0e2b>] osp_sync_add+0x7b/0x80 [osp]^M
      <4> [<ffffffffa0fc27d6>] osp_object_destroy+0x106/0x150 [osp]^M
      <4> [<ffffffffa0f068e7>] lod_object_destroy+0x1a7/0x350 [lod]^M
      <4> [<ffffffffa0f74880>] mdd_finish_unlink+0x210/0x3d0 [mdd]^M
      <4> [<ffffffffa0f65d35>] ? mdd_attr_check_set_internal+0x275/0x2c0 [mdd]^M
      <4> [<ffffffffa0f75306>] mdd_unlink+0x8c6/0xca0 [mdd]^M
      <4> [<ffffffffa0e37788>] mdo_unlink+0x18/0x50 [mdt]^M
      <4> [<ffffffffa0e3b005>] mdt_reint_unlink+0x835/0x1030 [mdt]^M
      <4> [<ffffffffa0e37571>] mdt_reint_rec+0x41/0xe0 [mdt]^M
      <4> [<ffffffffa0e1ced3>] mdt_reint_internal+0x4c3/0x780 [mdt]^M
      <4> [<ffffffffa0e1d1d4>] mdt_reint+0x44/0xe0 [mdt]^M
      <4> [<ffffffffa0e1fada>] mdt_handle_common+0x52a/0x1470 [mdt]^M
      <4> [<ffffffffa0e5c5f5>] mds_regular_handle+0x15/0x20 [mdt]^M
      <4> [<ffffffffa07750c5>] ptlrpc_server_handle_request+0x385/0xc00 [ptlrpc]^M
      <4> [<ffffffffa048c5ae>] ? cfs_timer_arm+0xe/0x10 [libcfs]^M
      <4> [<ffffffffa049d8d5>] ? lc_watchdog_touch+0x65/0x170 [libcfs]^M
      <4> [<ffffffffa076da69>] ? ptlrpc_wait_event+0xa9/0x2d0 [ptlrpc]^M
      <4> [<ffffffff81057779>] ? __wake_up_common+0x59/0x90^M
      <4> [<ffffffffa077789d>] ptlrpc_main+0xafd/0x1780 [ptlrpc]^M
      <4> [<ffffffff8100c28a>] child_rip+0xa/0x20^M
      <4> [<ffffffffa0776da0>] ? ptlrpc_main+0x0/0x1780 [ptlrpc]^M
      <4> [<ffffffff8100c280>] ? child_rip+0x0/0x20^M
      <4>^M
      <0>Kernel panic - not syncing: LBUG^M
      <4>Pid: 39313, comm: mdt02_049 Tainted: G           ---------------  T 2.6.32-504.30.3.el6.20151008.x86_64.lustre253 #1^M
      <4>Call Trace:^M
      <4> [<ffffffff81564fb9>] ? panic+0xa7/0x190^M
      <4> [<ffffffffa048beeb>] ? lbug_with_loc+0x9b/0xb0 [libcfs]^M
      <4> [<ffffffffa05bed55>] ? llog_osd_write_rec+0xfb5/0x1370 [obdclass]^M
      <4> [<ffffffffa0d46ecb>] ? dynlock_unlock+0x16b/0x1d0 [osd_ldiskfs]^M
      <4> [<ffffffffa0d2e5d2>] ? iam_path_release+0x42/0x70 [osd_ldiskfs]^M
      <4> [<ffffffffa0590438>] ? llog_write_rec+0xc8/0x290 [obdclass]^M
      <4> [<ffffffffa059910d>] ? llog_cat_add_rec+0xad/0x480 [obdclass]^M
      <4> [<ffffffffa0590231>] ? llog_add+0x91/0x1d0 [obdclass]^M
      <4> [<ffffffffa0fd04f7>] ? osp_sync_add_rec+0x247/0xad0 [osp]^M
      <4> [<ffffffffa0fd0e2b>] ? osp_sync_add+0x7b/0x80 [osp]^M
      <4> [<ffffffffa0fc27d6>] ? osp_object_destroy+0x106/0x150 [osp]^M
      <4> [<ffffffffa0f068e7>] ? lod_object_destroy+0x1a7/0x350 [lod]^M
      <4> [<ffffffffa0f74880>] ? mdd_finish_unlink+0x210/0x3d0 [mdd]^M
      <4> [<ffffffffa0f65d35>] ? mdd_attr_check_set_internal+0x275/0x2c0 [mdd]^M
      <4> [<ffffffffa0f75306>] ? mdd_unlink+0x8c6/0xca0 [mdd]^M
      <4> [<ffffffffa0e37788>] ? mdo_unlink+0x18/0x50 [mdt]^M
      <4> [<ffffffffa0e3b005>] ? mdt_reint_unlink+0x835/0x1030 [mdt]^M
      <4> [<ffffffffa0e37571>] ? mdt_reint_rec+0x41/0xe0 [mdt]^M
      <4> [<ffffffffa0e1ced3>] ? mdt_reint_internal+0x4c3/0x780 [mdt]^M
      <4> [<ffffffffa0e1d1d4>] ? mdt_reint+0x44/0xe0 [mdt]^M
      <4> [<ffffffffa0e1fada>] ? mdt_handle_common+0x52a/0x1470 [mdt]^M
      <4> [<ffffffffa0e5c5f5>] ? mds_regular_handle+0x15/0x20 [mdt]^M
      <4> [<ffffffffa07750c5>] ? ptlrpc_server_handle_request+0x385/0xc00 [ptlrpc]^M
      <4> [<ffffffffa048c5ae>] ? cfs_timer_arm+0xe/0x10 [libcfs]^M
      <4> [<ffffffffa049d8d5>] ? lc_watchdog_touch+0x65/0x170 [libcfs]^M
      <4> [<ffffffffa076da69>] ? ptlrpc_wait_event+0xa9/0x2d0 [ptlrpc]^M
      <4> [<ffffffff81057779>] ? __wake_up_common+0x59/0x90^M
      <4> [<ffffffffa077789d>] ? ptlrpc_main+0xafd/0x1780 [ptlrpc]^M
      <4> [<ffffffff8100c28a>] ? child_rip+0xa/0x20^M
      <4> [<ffffffffa0776da0>] ? ptlrpc_main+0x0/0x1780 [ptlrpc]^M
      <4> [<ffffffff8100c280>] ? child_rip+0x0/0x20^M
      

      Attachments

        Activity

          [LU-8320] :(llog_osd.c:338:llog_osd_write_rec()) ASSERTION( llh ) failed:

          Mike Pershin (mike.pershin@intel.com) uploaded a new patch: http://review.whamcloud.com/21128
          Subject: LU-8320 llog: prevent llog ID re-use.
          Project: fs/lustre-release
          Branch: b2_5
          Current Patch Set: 1
          Commit: a051f8c40337d9a577bee88dab1b50477d84061d

          gerrit Gerrit Updater added a comment - Mike Pershin (mike.pershin@intel.com) uploaded a new patch: http://review.whamcloud.com/21128 Subject: LU-8320 llog: prevent llog ID re-use. Project: fs/lustre-release Branch: b2_5 Current Patch Set: 1 Commit: a051f8c40337d9a577bee88dab1b50477d84061d

          OK, got it. Another question then - there should be CATALOG file with just list of catalog llog ids in it, could you look in it and provide those catalog llogs for inspection? No need to remove them, I want just check how big they are and what is inside.

          tappro Mikhail Pershin added a comment - OK, got it. Another question then - there should be CATALOG file with just list of catalog llog ids in it, could you look in it and provide those catalog llogs for inspection? No need to remove them, I want just check how big they are and what is inside.
          mhanafi Mahmoud Hanafi added a comment - - edited

          The 9252 was for a previous crash where I didn't print out the full data structure. The second time (9553) I printing the correct structures.
          We did have a crash the next day where I used the systemtap script to find the file and delete it. But i didn't save a copy.

          mhanafi Mahmoud Hanafi added a comment - - edited The 9252 was for a previous crash where I didn't print out the full data structure. The second time (9553) I printing the correct structures. We did have a crash the next day where I used the systemtap script to find the file and delete it. But i didn't save a copy.

          Mahmoud, in your very first comment about systemtap info, the llog_process_th is 9252 but in the next one it is 9553. Can that be this is another thread and related llog file is different from that one causing assertion? Do you still have any data related to that or similar lbug?

          tappro Mikhail Pershin added a comment - Mahmoud, in your very first comment about systemtap info, the llog_process_th is 9252 but in the next one it is 9553. Can that be this is another thread and related llog file is different from that one causing assertion? Do you still have any data related to that or similar lbug?

          FYI, we have a mix of 2.5.3, 2.7.1, and 2.7.2. We are moving all of them to 2.7.2 but that will take some time so we would need patch for 2.5.3 and 2.7.1. Jay will give you access to the repo.

          mhanafi Mahmoud Hanafi added a comment - FYI, we have a mix of 2.5.3, 2.7.1, and 2.7.2. We are moving all of them to 2.7.2 but that will take some time so we would need patch for 2.5.3 and 2.7.1. Jay will give you access to the repo.
          pjones Peter Jones added a comment -

          Mike is setting up a github account. I will provide the details to Jay so he can grant the appropriate access to see NASA's tree

          pjones Peter Jones added a comment - Mike is setting up a github account. I will provide the details to Jay so he can grant the appropriate access to see NASA's tree

          Hello, am I right that you are using 2.5.3 Lustre with additional patches? Are there patches related to llog subsystem? Could you make a list of them?

          I have found one suspicious thing in corrupted llog, checking the code now to understand how that may happen.

          tappro Mikhail Pershin added a comment - Hello, am I right that you are using 2.5.3 Lustre with additional patches? Are there patches related to llog subsystem? Could you make a list of them? I have found one suspicious thing in corrupted llog, checking the code now to understand how that may happen.

          Any updates?

          mhanafi Mahmoud Hanafi added a comment - Any updates?
          tappro Mikhail Pershin added a comment - - edited

          Mahmoud, the message about 'Llog is not catalog' is not an error or corruption, it is just saying that llog file is not catalog as expected by llog_catfix tool. I am checking 12744.save now. It looks like plain llog with unlink records in it, I am still looking through it.

          yes, I am working on patch at least to handle llog_read_header() errors gracefully.

          tappro Mikhail Pershin added a comment - - edited Mahmoud, the message about 'Llog is not catalog' is not an error or corruption, it is just saying that llog file is not catalog as expected by llog_catfix tool. I am checking 12744.save now. It looks like plain llog with unlink records in it, I am still looking through it. yes, I am working on patch at least to handle llog_read_header() errors gracefully.

          Can we expect a patch? Any idea what is causing the corruption?

          mhanafi Mahmoud Hanafi added a comment - Can we expect a patch? Any idea what is causing the corruption?

          First of all the llog_init_handle() should handle llog_read_header() errors correctly, not allowing the further use of such llog_handle. This will prevent similar lbugs(). It is the origin of that issue.

          tappro Mikhail Pershin added a comment - First of all the llog_init_handle() should handle llog_read_header() errors correctly, not allowing the further use of such llog_handle. This will prevent similar lbugs(). It is the origin of that issue.

          People

            tappro Mikhail Pershin
            mhanafi Mahmoud Hanafi
            Votes:
            0 Vote for this issue
            Watchers:
            9 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved: