Uploaded image for project: 'Lustre'
  1. Lustre
  2. LU-8320

:(llog_osd.c:338:llog_osd_write_rec()) ASSERTION( llh ) failed:

Details

    • Bug
    • Resolution: Fixed
    • Critical
    • Lustre 2.9.0
    • None
    • None
    • 1
    • 9223372036854775807

    Description

      MDS crash with LBUG.

      0>LustreError: 39313:0:(llog_osd.c:338:llog_osd_write_rec()) ASSERTION( llh ) failed: ^M
      <0>LustreError: 39313:0:(llog_osd.c:338:llog_osd_write_rec()) LBUG^M
      <4>Pid: 39313, comm: mdt02_049^M
      <4>^M
      <4>Call Trace:^M
      <4> [<ffffffffa048b895>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]^M
      <4> [<ffffffffa048be97>] lbug_with_loc+0x47/0xb0 [libcfs]^M
      <4> [<ffffffffa05bed55>] llog_osd_write_rec+0xfb5/0x1370 [obdclass]^M
      <4> [<ffffffffa0d46ecb>] ? dynlock_unlock+0x16b/0x1d0 [osd_ldiskfs]^M
      <4> [<ffffffffa0d2e5d2>] ? iam_path_release+0x42/0x70 [osd_ldiskfs]^M
      <4> [<ffffffffa0590438>] llog_write_rec+0xc8/0x290 [obdclass]^M
      <4> [<ffffffffa059910d>] llog_cat_add_rec+0xad/0x480 [obdclass]^M
      <4> [<ffffffffa0590231>] llog_add+0x91/0x1d0 [obdclass]^M
      <4> [<ffffffffa0fd04f7>] osp_sync_add_rec+0x247/0xad0 [osp]^M
      <4> [<ffffffffa0fd0e2b>] osp_sync_add+0x7b/0x80 [osp]^M
      <4> [<ffffffffa0fc27d6>] osp_object_destroy+0x106/0x150 [osp]^M
      <4> [<ffffffffa0f068e7>] lod_object_destroy+0x1a7/0x350 [lod]^M
      <4> [<ffffffffa0f74880>] mdd_finish_unlink+0x210/0x3d0 [mdd]^M
      <4> [<ffffffffa0f65d35>] ? mdd_attr_check_set_internal+0x275/0x2c0 [mdd]^M
      <4> [<ffffffffa0f75306>] mdd_unlink+0x8c6/0xca0 [mdd]^M
      <4> [<ffffffffa0e37788>] mdo_unlink+0x18/0x50 [mdt]^M
      <4> [<ffffffffa0e3b005>] mdt_reint_unlink+0x835/0x1030 [mdt]^M
      <4> [<ffffffffa0e37571>] mdt_reint_rec+0x41/0xe0 [mdt]^M
      <4> [<ffffffffa0e1ced3>] mdt_reint_internal+0x4c3/0x780 [mdt]^M
      <4> [<ffffffffa0e1d1d4>] mdt_reint+0x44/0xe0 [mdt]^M
      <4> [<ffffffffa0e1fada>] mdt_handle_common+0x52a/0x1470 [mdt]^M
      <4> [<ffffffffa0e5c5f5>] mds_regular_handle+0x15/0x20 [mdt]^M
      <4> [<ffffffffa07750c5>] ptlrpc_server_handle_request+0x385/0xc00 [ptlrpc]^M
      <4> [<ffffffffa048c5ae>] ? cfs_timer_arm+0xe/0x10 [libcfs]^M
      <4> [<ffffffffa049d8d5>] ? lc_watchdog_touch+0x65/0x170 [libcfs]^M
      <4> [<ffffffffa076da69>] ? ptlrpc_wait_event+0xa9/0x2d0 [ptlrpc]^M
      <4> [<ffffffff81057779>] ? __wake_up_common+0x59/0x90^M
      <4> [<ffffffffa077789d>] ptlrpc_main+0xafd/0x1780 [ptlrpc]^M
      <4> [<ffffffff8100c28a>] child_rip+0xa/0x20^M
      <4> [<ffffffffa0776da0>] ? ptlrpc_main+0x0/0x1780 [ptlrpc]^M
      <4> [<ffffffff8100c280>] ? child_rip+0x0/0x20^M
      <4>^M
      <0>Kernel panic - not syncing: LBUG^M
      <4>Pid: 39313, comm: mdt02_049 Tainted: G           ---------------  T 2.6.32-504.30.3.el6.20151008.x86_64.lustre253 #1^M
      <4>Call Trace:^M
      <4> [<ffffffff81564fb9>] ? panic+0xa7/0x190^M
      <4> [<ffffffffa048beeb>] ? lbug_with_loc+0x9b/0xb0 [libcfs]^M
      <4> [<ffffffffa05bed55>] ? llog_osd_write_rec+0xfb5/0x1370 [obdclass]^M
      <4> [<ffffffffa0d46ecb>] ? dynlock_unlock+0x16b/0x1d0 [osd_ldiskfs]^M
      <4> [<ffffffffa0d2e5d2>] ? iam_path_release+0x42/0x70 [osd_ldiskfs]^M
      <4> [<ffffffffa0590438>] ? llog_write_rec+0xc8/0x290 [obdclass]^M
      <4> [<ffffffffa059910d>] ? llog_cat_add_rec+0xad/0x480 [obdclass]^M
      <4> [<ffffffffa0590231>] ? llog_add+0x91/0x1d0 [obdclass]^M
      <4> [<ffffffffa0fd04f7>] ? osp_sync_add_rec+0x247/0xad0 [osp]^M
      <4> [<ffffffffa0fd0e2b>] ? osp_sync_add+0x7b/0x80 [osp]^M
      <4> [<ffffffffa0fc27d6>] ? osp_object_destroy+0x106/0x150 [osp]^M
      <4> [<ffffffffa0f068e7>] ? lod_object_destroy+0x1a7/0x350 [lod]^M
      <4> [<ffffffffa0f74880>] ? mdd_finish_unlink+0x210/0x3d0 [mdd]^M
      <4> [<ffffffffa0f65d35>] ? mdd_attr_check_set_internal+0x275/0x2c0 [mdd]^M
      <4> [<ffffffffa0f75306>] ? mdd_unlink+0x8c6/0xca0 [mdd]^M
      <4> [<ffffffffa0e37788>] ? mdo_unlink+0x18/0x50 [mdt]^M
      <4> [<ffffffffa0e3b005>] ? mdt_reint_unlink+0x835/0x1030 [mdt]^M
      <4> [<ffffffffa0e37571>] ? mdt_reint_rec+0x41/0xe0 [mdt]^M
      <4> [<ffffffffa0e1ced3>] ? mdt_reint_internal+0x4c3/0x780 [mdt]^M
      <4> [<ffffffffa0e1d1d4>] ? mdt_reint+0x44/0xe0 [mdt]^M
      <4> [<ffffffffa0e1fada>] ? mdt_handle_common+0x52a/0x1470 [mdt]^M
      <4> [<ffffffffa0e5c5f5>] ? mds_regular_handle+0x15/0x20 [mdt]^M
      <4> [<ffffffffa07750c5>] ? ptlrpc_server_handle_request+0x385/0xc00 [ptlrpc]^M
      <4> [<ffffffffa048c5ae>] ? cfs_timer_arm+0xe/0x10 [libcfs]^M
      <4> [<ffffffffa049d8d5>] ? lc_watchdog_touch+0x65/0x170 [libcfs]^M
      <4> [<ffffffffa076da69>] ? ptlrpc_wait_event+0xa9/0x2d0 [ptlrpc]^M
      <4> [<ffffffff81057779>] ? __wake_up_common+0x59/0x90^M
      <4> [<ffffffffa077789d>] ? ptlrpc_main+0xafd/0x1780 [ptlrpc]^M
      <4> [<ffffffff8100c28a>] ? child_rip+0xa/0x20^M
      <4> [<ffffffffa0776da0>] ? ptlrpc_main+0x0/0x1780 [ptlrpc]^M
      <4> [<ffffffff8100c280>] ? child_rip+0x0/0x20^M
      

      Attachments

        Activity

          [LU-8320] :(llog_osd.c:338:llog_osd_write_rec()) ASSERTION( llh ) failed:
          mhanafi Mahmoud Hanafi added a comment - - edited

          Great!. do you still need to the CATALOGS file? Is the an issue with 2.7? if so we will need a patch for that as well.

          Why would we have old llog files still around?

          mhanafi Mahmoud Hanafi added a comment - - edited Great!. do you still need to the CATALOGS file? Is the an issue with 2.7? if so we will need a patch for that as well. Why would we have old llog files still around?
          tappro Mikhail Pershin added a comment - - edited

          I think I've found the reason of this issue, it looks like the newly generated llog ID may cycle through zero and match some very old llog files. In that case it is considered as new llog (not initialized yet, but since it exists on disk several check are passed while they shouldn't and we have that assertion. That scenario fits well with what I see in llog file provided and assertion with llh == NULL, as well as with your comment that problem files are very old.

          tappro Mikhail Pershin added a comment - - edited I think I've found the reason of this issue, it looks like the newly generated llog ID may cycle through zero and match some very old llog files. In that case it is considered as new llog (not initialized yet, but since it exists on disk several check are passed while they shouldn't and we have that assertion. That scenario fits well with what I see in llog file provided and assertion with llh == NULL, as well as with your comment that problem files are very old.

          Mike Pershin (mike.pershin@intel.com) uploaded a new patch: http://review.whamcloud.com/21128
          Subject: LU-8320 llog: prevent llog ID re-use.
          Project: fs/lustre-release
          Branch: b2_5
          Current Patch Set: 1
          Commit: a051f8c40337d9a577bee88dab1b50477d84061d

          gerrit Gerrit Updater added a comment - Mike Pershin (mike.pershin@intel.com) uploaded a new patch: http://review.whamcloud.com/21128 Subject: LU-8320 llog: prevent llog ID re-use. Project: fs/lustre-release Branch: b2_5 Current Patch Set: 1 Commit: a051f8c40337d9a577bee88dab1b50477d84061d

          OK, got it. Another question then - there should be CATALOG file with just list of catalog llog ids in it, could you look in it and provide those catalog llogs for inspection? No need to remove them, I want just check how big they are and what is inside.

          tappro Mikhail Pershin added a comment - OK, got it. Another question then - there should be CATALOG file with just list of catalog llog ids in it, could you look in it and provide those catalog llogs for inspection? No need to remove them, I want just check how big they are and what is inside.
          mhanafi Mahmoud Hanafi added a comment - - edited

          The 9252 was for a previous crash where I didn't print out the full data structure. The second time (9553) I printing the correct structures.
          We did have a crash the next day where I used the systemtap script to find the file and delete it. But i didn't save a copy.

          mhanafi Mahmoud Hanafi added a comment - - edited The 9252 was for a previous crash where I didn't print out the full data structure. The second time (9553) I printing the correct structures. We did have a crash the next day where I used the systemtap script to find the file and delete it. But i didn't save a copy.

          Mahmoud, in your very first comment about systemtap info, the llog_process_th is 9252 but in the next one it is 9553. Can that be this is another thread and related llog file is different from that one causing assertion? Do you still have any data related to that or similar lbug?

          tappro Mikhail Pershin added a comment - Mahmoud, in your very first comment about systemtap info, the llog_process_th is 9252 but in the next one it is 9553. Can that be this is another thread and related llog file is different from that one causing assertion? Do you still have any data related to that or similar lbug?

          FYI, we have a mix of 2.5.3, 2.7.1, and 2.7.2. We are moving all of them to 2.7.2 but that will take some time so we would need patch for 2.5.3 and 2.7.1. Jay will give you access to the repo.

          mhanafi Mahmoud Hanafi added a comment - FYI, we have a mix of 2.5.3, 2.7.1, and 2.7.2. We are moving all of them to 2.7.2 but that will take some time so we would need patch for 2.5.3 and 2.7.1. Jay will give you access to the repo.
          pjones Peter Jones added a comment -

          Mike is setting up a github account. I will provide the details to Jay so he can grant the appropriate access to see NASA's tree

          pjones Peter Jones added a comment - Mike is setting up a github account. I will provide the details to Jay so he can grant the appropriate access to see NASA's tree

          Hello, am I right that you are using 2.5.3 Lustre with additional patches? Are there patches related to llog subsystem? Could you make a list of them?

          I have found one suspicious thing in corrupted llog, checking the code now to understand how that may happen.

          tappro Mikhail Pershin added a comment - Hello, am I right that you are using 2.5.3 Lustre with additional patches? Are there patches related to llog subsystem? Could you make a list of them? I have found one suspicious thing in corrupted llog, checking the code now to understand how that may happen.

          Any updates?

          mhanafi Mahmoud Hanafi added a comment - Any updates?
          tappro Mikhail Pershin added a comment - - edited

          Mahmoud, the message about 'Llog is not catalog' is not an error or corruption, it is just saying that llog file is not catalog as expected by llog_catfix tool. I am checking 12744.save now. It looks like plain llog with unlink records in it, I am still looking through it.

          yes, I am working on patch at least to handle llog_read_header() errors gracefully.

          tappro Mikhail Pershin added a comment - - edited Mahmoud, the message about 'Llog is not catalog' is not an error or corruption, it is just saying that llog file is not catalog as expected by llog_catfix tool. I am checking 12744.save now. It looks like plain llog with unlink records in it, I am still looking through it. yes, I am working on patch at least to handle llog_read_header() errors gracefully.

          People

            tappro Mikhail Pershin
            mhanafi Mahmoud Hanafi
            Votes:
            0 Vote for this issue
            Watchers:
            9 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved: