Lustre / LU-6471

Unexpected Lustre Client LBUG in llog_write()

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Major
    • Fix Version: Lustre 2.8.0
    • Affects Version: Lustre 2.5.3

    Description

      Some IEEL 2.2.x and Lustre 2.5 installations have triggered this unexpected LBUG on Lustre Clients consuming ChangeLogs:

      <3>LustreError: 31536:0:(llog_cat.c:192:llog_cat_id2handle()) store2-MDT0000-mdc-ffff88206515bc00: error opening log id 0x2def:1:0: rc = -2
      <3>LustreError: 31536:0:(llog_cat.c:565:llog_cat_process_cb()) store2-MDT0000-mdc-ffff88206515bc00: cannot find handle for llog 0x2def:1: -2
      <0>LustreError: 31536:0:(llog.c:861:llog_write()) ASSERTION( loghandle->lgh_obj != ((void *)0) ) failed:
      <0>LustreError: 31536:0:(llog.c:861:llog_write()) LBUG
      <4>Pid: 31536, comm: mdc_clg_send_th
      <4>
      <4>Call Trace:
      <4> [<ffffffffa0437895>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
      <4> [<ffffffffa0437e97>] lbug_with_loc+0x47/0xb0 [libcfs]
      <4> [<ffffffffa0553c52>] llog_write+0x352/0x420 [obdclass]
      <4> [<ffffffffa0553e0c>] llog_cancel_rec+0xbc/0x7c0 [obdclass]
      <4> [<ffffffffa0558bcc>] llog_cat_cleanup+0x9c/0x260 [obdclass]
      <4> [<ffffffffa0559fd8>] llog_cat_process_cb+0x588/0x630 [obdclass]
      <4> [<ffffffffa0554e6b>] llog_process_thread+0x95b/0xd70 [obdclass]
      <4> [<ffffffffa0559a50>] ? llog_cat_process_cb+0x0/0x630 [obdclass]
      <4> [<ffffffffa0556a67>] llog_process_or_fork+0x127/0x550 [obdclass]
      <4> [<ffffffffa0557f09>] llog_cat_process_or_fork+0x89/0x350 [obdclass]
      <4> [<ffffffff8152aede>] ? mutex_lock+0x1e/0x50
      <4> [<ffffffffa094fe10>] ? changelog_kkuc_cb+0x0/0x250 [mdc]
      <4> [<ffffffffa05581e9>] llog_cat_process+0x19/0x20 [obdclass]
      <4> [<ffffffffa09541eb>] mdc_changelog_send_thread+0x4cb/0x810 [mdc]
      <4> [<ffffffffa0953d20>] ? mdc_changelog_send_thread+0x0/0x810 [mdc]
      <4> [<ffffffff8109e66e>] kthread+0x9e/0xc0
      <4> [<ffffffff8100c20a>] child_rip+0xa/0x20
      <4> [<ffffffff8109e5d0>] ? kthread+0x0/0xc0
      <4> [<ffffffff8100c200>] ? child_rip+0x0/0x20
      <4>
      <0>Kernel panic - not syncing: LBUG
      <4>Pid: 31536, comm: mdc_clg_send_th Not tainted 2.6.32-504.12.2.el6.x86_64 #1
      <4>Call Trace:
      <4> [<ffffffff8152933c>] ? panic+0xa7/0x16f
      <4> [<ffffffffa0437eeb>] ? lbug_with_loc+0x9b/0xb0 [libcfs]
      <4> [<ffffffffa0553c52>] ? llog_write+0x352/0x420 [obdclass]
      <4> [<ffffffffa0553e0c>] ? llog_cancel_rec+0xbc/0x7c0 [obdclass]
      <4> [<ffffffffa0558bcc>] ? llog_cat_cleanup+0x9c/0x260 [obdclass]
      <4> [<ffffffffa0559fd8>] ? llog_cat_process_cb+0x588/0x630 [obdclass]
      <4> [<ffffffffa0554e6b>] ? llog_process_thread+0x95b/0xd70 [obdclass]
      <4> [<ffffffffa0559a50>] ? llog_cat_process_cb+0x0/0x630 [obdclass]
      <4> [<ffffffffa0556a67>] ? llog_process_or_fork+0x127/0x550 [obdclass]
      <4> [<ffffffffa0557f09>] ? llog_cat_process_or_fork+0x89/0x350 [obdclass]
      <4> [<ffffffff8152aede>] ? mutex_lock+0x1e/0x50
      <4> [<ffffffffa094fe10>] ? changelog_kkuc_cb+0x0/0x250 [mdc]
      <4> [<ffffffffa05581e9>] ? llog_cat_process+0x19/0x20 [obdclass]
      <4> [<ffffffffa09541eb>] ? mdc_changelog_send_thread+0x4cb/0x810 [mdc]
      <4> [<ffffffffa0953d20>] ? mdc_changelog_send_thread+0x0/0x810 [mdc]
      <4> [<ffffffff8109e66e>] ? kthread+0x9e/0xc0
      <4> [<ffffffff8100c20a>] ? child_rip+0xa/0x20
      <4> [<ffffffff8109e5d0>] ? kthread+0x0/0xc0
      <4> [<ffffffff8100c200>] ? child_rip+0x0/0x20
      

      The problem seems to have been introduced by the patch for LU-5195, which does not handle the case where the code runs on a Lustre Client (i.e., cathandle->lgh_obj == NULL) that is only a ChangeLog/llog consumer/replicator.

      This problem had not been seen before because the change in question is only executed when some catalog inconsistency is encountered. Master looks to be impacted as well; a patch will be pushed soon.

          Activity

            pjones Peter Jones added a comment -

            Landed for 2.8


            gerrit Gerrit Updater added a comment -

            Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/14489/
            Subject: LU-6471 obdclass: fix llog_cat_cleanup() usage on Client
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 485f3ba8743365c7fb5823893a7eb45030cdffb5

            bfaccini Bruno Faccini (Inactive) added a comment -

            Having a look at old ChangeLogs consumer Clients' syslogs (i.e., from a Lustre version not including the patch from LU-5195), I found that they already contained the two messages that now precede the LBUG:

            <3>LustreError: 31536:0:(llog_cat.c:192:llog_cat_id2handle()) store2-MDT0000-mdc-ffff88206515bc00: error opening log id 0x2def:1:0: rc = -2
            <3>LustreError: 31536:0:(llog_cat.c:565:llog_cat_process_cb()) store2-MDT0000-mdc-ffff88206515bc00: cannot find handle for llog 0x2def:1: -2
            

            This means that, since the Servers/MDS also did not run with the LU-5195 patch at that time, this fix is not itself responsible for the LBUG; rather, some kind of race possibly exists that allows a Client to work with out-of-sync Catalog content. This is confirmed by the Lustre debug trace logs, extracted from the crash-dump taken upon LBUG, where it appears that the reportedly missing plain log had been available and used a few times before.
            So finally, this may be a new consequence of the fact that the plain log destroy and the catalog index record deletion are not atomic, which was already addressed (on the Server side only!) by the LU-5195 patch.

            apercher Antoine Percher added a comment -

            We met the same issue on a T100 system with Lustre 2.5.3.90 on the robinhood node.

            gerrit Gerrit Updater added a comment -

            Faccini Bruno (bruno.faccini@intel.com) uploaded a new patch: http://review.whamcloud.com/14489
            Subject: LU-6471 obdclass: fix llog_cat_cleanup() usage on Client
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 9219f1ba884874f5f25df7499a13956c4eb65bea

            People

              Assignee: bfaccini Bruno Faccini (Inactive)
              Reporter: bfaccini Bruno Faccini (Inactive)
              Votes: 0
              Watchers: 13
