[LU-8320] :(llog_osd.c:338:llog_osd_write_rec()) ASSERTION( llh ) failed:

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Critical
    • Fix Version/s: Lustre 2.9.0

    Description

      MDS crash with LBUG.

      <0>LustreError: 39313:0:(llog_osd.c:338:llog_osd_write_rec()) ASSERTION( llh ) failed:
      <0>LustreError: 39313:0:(llog_osd.c:338:llog_osd_write_rec()) LBUG
      <4>Pid: 39313, comm: mdt02_049
      <4>
      <4>Call Trace:
      <4> [<ffffffffa048b895>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
      <4> [<ffffffffa048be97>] lbug_with_loc+0x47/0xb0 [libcfs]
      <4> [<ffffffffa05bed55>] llog_osd_write_rec+0xfb5/0x1370 [obdclass]
      <4> [<ffffffffa0d46ecb>] ? dynlock_unlock+0x16b/0x1d0 [osd_ldiskfs]
      <4> [<ffffffffa0d2e5d2>] ? iam_path_release+0x42/0x70 [osd_ldiskfs]
      <4> [<ffffffffa0590438>] llog_write_rec+0xc8/0x290 [obdclass]
      <4> [<ffffffffa059910d>] llog_cat_add_rec+0xad/0x480 [obdclass]
      <4> [<ffffffffa0590231>] llog_add+0x91/0x1d0 [obdclass]
      <4> [<ffffffffa0fd04f7>] osp_sync_add_rec+0x247/0xad0 [osp]
      <4> [<ffffffffa0fd0e2b>] osp_sync_add+0x7b/0x80 [osp]
      <4> [<ffffffffa0fc27d6>] osp_object_destroy+0x106/0x150 [osp]
      <4> [<ffffffffa0f068e7>] lod_object_destroy+0x1a7/0x350 [lod]
      <4> [<ffffffffa0f74880>] mdd_finish_unlink+0x210/0x3d0 [mdd]
      <4> [<ffffffffa0f65d35>] ? mdd_attr_check_set_internal+0x275/0x2c0 [mdd]
      <4> [<ffffffffa0f75306>] mdd_unlink+0x8c6/0xca0 [mdd]
      <4> [<ffffffffa0e37788>] mdo_unlink+0x18/0x50 [mdt]
      <4> [<ffffffffa0e3b005>] mdt_reint_unlink+0x835/0x1030 [mdt]
      <4> [<ffffffffa0e37571>] mdt_reint_rec+0x41/0xe0 [mdt]
      <4> [<ffffffffa0e1ced3>] mdt_reint_internal+0x4c3/0x780 [mdt]
      <4> [<ffffffffa0e1d1d4>] mdt_reint+0x44/0xe0 [mdt]
      <4> [<ffffffffa0e1fada>] mdt_handle_common+0x52a/0x1470 [mdt]
      <4> [<ffffffffa0e5c5f5>] mds_regular_handle+0x15/0x20 [mdt]
      <4> [<ffffffffa07750c5>] ptlrpc_server_handle_request+0x385/0xc00 [ptlrpc]
      <4> [<ffffffffa048c5ae>] ? cfs_timer_arm+0xe/0x10 [libcfs]
      <4> [<ffffffffa049d8d5>] ? lc_watchdog_touch+0x65/0x170 [libcfs]
      <4> [<ffffffffa076da69>] ? ptlrpc_wait_event+0xa9/0x2d0 [ptlrpc]
      <4> [<ffffffff81057779>] ? __wake_up_common+0x59/0x90
      <4> [<ffffffffa077789d>] ptlrpc_main+0xafd/0x1780 [ptlrpc]
      <4> [<ffffffff8100c28a>] child_rip+0xa/0x20
      <4> [<ffffffffa0776da0>] ? ptlrpc_main+0x0/0x1780 [ptlrpc]
      <4> [<ffffffff8100c280>] ? child_rip+0x0/0x20
      <4>
      <0>Kernel panic - not syncing: LBUG
      <4>Pid: 39313, comm: mdt02_049 Tainted: G           ---------------  T 2.6.32-504.30.3.el6.20151008.x86_64.lustre253 #1
      <4>Call Trace:
      <4> [<ffffffff81564fb9>] ? panic+0xa7/0x190
      <4> [<ffffffffa048beeb>] ? lbug_with_loc+0x9b/0xb0 [libcfs]
      <4> [<ffffffffa05bed55>] ? llog_osd_write_rec+0xfb5/0x1370 [obdclass]
      <4> [<ffffffffa0d46ecb>] ? dynlock_unlock+0x16b/0x1d0 [osd_ldiskfs]
      <4> [<ffffffffa0d2e5d2>] ? iam_path_release+0x42/0x70 [osd_ldiskfs]
      <4> [<ffffffffa0590438>] ? llog_write_rec+0xc8/0x290 [obdclass]
      <4> [<ffffffffa059910d>] ? llog_cat_add_rec+0xad/0x480 [obdclass]
      <4> [<ffffffffa0590231>] ? llog_add+0x91/0x1d0 [obdclass]
      <4> [<ffffffffa0fd04f7>] ? osp_sync_add_rec+0x247/0xad0 [osp]
      <4> [<ffffffffa0fd0e2b>] ? osp_sync_add+0x7b/0x80 [osp]
      <4> [<ffffffffa0fc27d6>] ? osp_object_destroy+0x106/0x150 [osp]
      <4> [<ffffffffa0f068e7>] ? lod_object_destroy+0x1a7/0x350 [lod]
      <4> [<ffffffffa0f74880>] ? mdd_finish_unlink+0x210/0x3d0 [mdd]
      <4> [<ffffffffa0f65d35>] ? mdd_attr_check_set_internal+0x275/0x2c0 [mdd]
      <4> [<ffffffffa0f75306>] ? mdd_unlink+0x8c6/0xca0 [mdd]
      <4> [<ffffffffa0e37788>] ? mdo_unlink+0x18/0x50 [mdt]
      <4> [<ffffffffa0e3b005>] ? mdt_reint_unlink+0x835/0x1030 [mdt]
      <4> [<ffffffffa0e37571>] ? mdt_reint_rec+0x41/0xe0 [mdt]
      <4> [<ffffffffa0e1ced3>] ? mdt_reint_internal+0x4c3/0x780 [mdt]
      <4> [<ffffffffa0e1d1d4>] ? mdt_reint+0x44/0xe0 [mdt]
      <4> [<ffffffffa0e1fada>] ? mdt_handle_common+0x52a/0x1470 [mdt]
      <4> [<ffffffffa0e5c5f5>] ? mds_regular_handle+0x15/0x20 [mdt]
      <4> [<ffffffffa07750c5>] ? ptlrpc_server_handle_request+0x385/0xc00 [ptlrpc]
      <4> [<ffffffffa048c5ae>] ? cfs_timer_arm+0xe/0x10 [libcfs]
      <4> [<ffffffffa049d8d5>] ? lc_watchdog_touch+0x65/0x170 [libcfs]
      <4> [<ffffffffa076da69>] ? ptlrpc_wait_event+0xa9/0x2d0 [ptlrpc]
      <4> [<ffffffff81057779>] ? __wake_up_common+0x59/0x90
      <4> [<ffffffffa077789d>] ? ptlrpc_main+0xafd/0x1780 [ptlrpc]
      <4> [<ffffffff8100c28a>] ? child_rip+0xa/0x20
      <4> [<ffffffffa0776da0>] ? ptlrpc_main+0x0/0x1780 [ptlrpc]
      <4> [<ffffffff8100c280>] ? child_rip+0x0/0x20
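
For readers who want to reduce a trace like the one above to its call chain, the following is a minimal sketch, not part of the original report. It assumes console-log frames of the form `[<addr>] func+0xoff/0xsize [module]`, treats a leading `?` as the kernel's marker for a frame the unwinder could not confirm, and tolerates literal `^M` debris from captured console output. The names `FRAME_RE` and `parse_frames` are hypothetical helpers, not Lustre or kernel tooling.

```python
import re

# One stack frame as printed in the console log, e.g.
#   <4> [<ffffffffa05bed55>] ? llog_osd_write_rec+0xfb5/0x1370 [obdclass]
# The "? " marker and the trailing "[module]" are both optional.
FRAME_RE = re.compile(
    r"\[<(?P<addr>[0-9a-f]+)>\]\s+(?P<unreliable>\?\s+)?"
    r"(?P<func>[\w.]+)\+0x[0-9a-f]+/0x[0-9a-f]+"
    r"(?:\s+\[(?P<module>\w+)\])?"
)


def parse_frames(log_text):
    """Return (function, module, reliable) tuples for every frame line.

    Frames without a "[module]" suffix are attributed to "kernel".
    Lines that are not stack frames (assertion text, Pid lines) are skipped.
    """
    frames = []
    for line in log_text.splitlines():
        # Drop carriage-return debris, whether literal "^M" or a real "\r".
        line = line.replace("^M", "").rstrip("\r")
        match = FRAME_RE.search(line)
        if match:
            frames.append((
                match.group("func"),
                match.group("module") or "kernel",
                match.group("unreliable") is None,
            ))
    return frames


if __name__ == "__main__":
    sample = "\n".join([
        "<4> [<ffffffffa05bed55>] llog_osd_write_rec+0xfb5/0x1370 [obdclass]^M",
        "<4> [<ffffffffa0d46ecb>] ? dynlock_unlock+0x16b/0x1d0 [osd_ldiskfs]^M",
        "<4> [<ffffffff8100c28a>] child_rip+0xa/0x20^M",
    ])
    # Print only the frames the unwinder reported without a "?" marker.
    for func, module, reliable in parse_frames(sample):
        if reliable:
            print(f"{module}: {func}")
```

Filtering on the `reliable` flag is only useful for the first trace in this report. In the second trace, printed after the kernel panic, every frame carries a `?`.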
      


        Activity


          ndauchy Nathan Dauchy (Inactive) added a comment -

          Please re-open until the backport patch lands to 2.7 FE.
          pjones Peter Jones added a comment -

          Landed for 2.9


          gerrit Gerrit Updater added a comment -

          Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/21144/
          Subject: LU-8320 llog: prevent llog ID re-use.
          Project: fs/lustre-release
          Branch: master
          Current Patch Set:
          Commit: a93ede18ababa3fe1ae8f4a5f92e868589a58cb6

          gerrit Gerrit Updater added a comment -

          Mike Pershin (mike.pershin@intel.com) uploaded a new patch: http://review.whamcloud.com/21144
          Subject: LU-8320 llog: prevent llog ID re-use.
          Project: fs/lustre-release
          Branch: master
          Current Patch Set: 1
          Commit: 722f308635f118d00a5c4a44fa72d18986ccdac9

          jaylan Jay Lan (Inactive) added a comment -

          The LU-5297 patch has not landed on b2_7_fe yet. I cherry-picked it to our nas-2.7.1 and nas-2.7.2 anyway (locally, not pushed to GitHub yet). OTOH, the LU-5297 patch caused conflicts in nas-2.5.3. Since there is a workaround (i.e., find the offending file and remove it), I think we do not need LU-5297 on nas-2.5.3.

          So, we have LU-4528, LU-7079, LU-6696, and LU-5297 on nas-2.7.1 and nas-2.7.2. I tried LU-8320 and it applied cleanly on nas-2.7.x, but I will wait until it gets reviews.

          jaylan Jay Lan (Inactive) added a comment -

          The 2.5.3 Lustre server is still running 2.5.3-6nasS, and the LU-7079 patch was included in 2.5.3-6.1nasS.

          gerrit Gerrit Updater added a comment -

          Mike Pershin (mike.pershin@intel.com) uploaded a new patch: http://review.whamcloud.com/21130
          Subject: LU-8320 llog: prevent llog ID re-use.
          Project: fs/lustre-release
          Branch: b2_7
          Current Patch Set: 1
          Commit: d5e1cfbd9bd5bfdcd6d5c6b029e4cf578c9e75b8

          tappro Mikhail Pershin added a comment - edited

          Mahmoud, there was a bug in OSP that caused some llogs to stop being processed at some point, so they stay around forever. This was fixed by LU-7079, which was merged into your tree on May 25; could it be that the failed node did not have that patch applied? Also, the patch from LU-5297 solved a similar problem that can leave llog files stuck.

          I don't need the CATALOGS file right now, but if you have time, I'd like to look at the llog files in it to check how many plain llogs are in use.

          tappro Mikhail Pershin added a comment -

          I'd also recommend applying the patch from LU-5297 in addition to LU-7079 (already in your tree), which might help avoid having very old llogs.
          Also consider the patch from LU-4528 (http://review.whamcloud.com/#/c/11751/), which wasn't merged to 2.5; it helps avoid several types of corruption we had in the past. It is already in 2.7, so this may not be so critical if you are moving to 2.7.

          mhanafi Mahmoud Hanafi added a comment - edited

          Great! Do you still need the CATALOGS file? Is this an issue with 2.7? If so, we will need a patch for that as well.

          Why would we have old llog files still around?

          People

            tappro Mikhail Pershin
            mhanafi Mahmoud Hanafi