
LU-6699: LustreError: 7605:0:(osd_handler.c:2530:osd_object_destroy()) ASSERTION

Details

    • Type: Bug
    • Resolution: Cannot Reproduce
    • Priority: Major
    • Affects Version/s: Lustre 2.7.0, Lustre 2.10.0
    • Environment: RHEL6
    • Severity: 3

    Description

      We have just upgraded our servers to Lustre 2.7. This has caused one of the MDS nodes to assert.

      Message from syslogd@cs04r-sc-mds03-02 at Jun 9 16:56:29 ...
      kernel:LustreError: 7605:0:(osd_handler.c:2530:osd_object_destroy()) ASSERTION( !lu_object_is_dying(dt->do_lu.lo_header) ) failed:
      Jun 9 16:56:29 cs04r-sc-mds03-02 kernel: LustreError: 7605:0:(osd_handler.c:2530:osd_object_destroy()) LBUG
      Jun 9 16:56:29 cs04r-sc-mds03-02 kernel: Pid: 7605, comm: mdt02_006
      Jun 9 16:56:29 cs04r-sc-mds03-02 kernel:
      Jun 9 16:56:29 cs04r-sc-mds03-02 kernel: Call Trace:
      Jun 9 16:56:29 cs04r-sc-mds03-02 kernel: [<ffffffffa0410895>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
      Jun 9 16:56:29 cs04r-sc-mds03-02 kernel: [<ffffffffa0410e97>] lbug_with_loc+0x47/0xb0 [libcfs]

      Could you advise a suitable course of action?
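
      For context, the assertion above fires when the OSD destroy path is handed an object whose header is already flagged as dying. Below is a minimal userspace sketch of that check, using illustrative types and flag values rather than the real lu_object_header/LASSERT definitions:

      /*
       * Illustrative model (not Lustre source) of
       * LASSERT(!lu_object_is_dying(dt->do_lu.lo_header)): an object already
       * flagged as "dying" must never reach osd_object_destroy() again.
       */
      #include <assert.h>
      #include <stdio.h>

      #define LU_OBJECT_HEARD_BANSHEE 0x1            /* illustrative flag value */

      struct lu_object_header_model {
              unsigned long loh_flags;
      };

      static int lu_object_is_dying_model(const struct lu_object_header_model *h)
      {
              return (h->loh_flags & LU_OBJECT_HEARD_BANSHEE) != 0;
      }

      static void osd_object_destroy_model(struct lu_object_header_model *h)
      {
              /* The real code LBUGs here; assert() stands in for LASSERT(). */
              assert(!lu_object_is_dying_model(h));
              printf("destroy ok\n");
      }

      int main(void)
      {
              struct lu_object_header_model live  = { .loh_flags = 0 };
              struct lu_object_header_model dying = { .loh_flags = LU_OBJECT_HEARD_BANSHEE };

              osd_object_destroy_model(&live);   /* passes */
              osd_object_destroy_model(&dying);  /* trips the assertion, like the LBUG above */
              return 0;
      }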

      Attachments

        Issue Links

          Activity

            [LU-6699] LustreError: 7605:0:(osd_handler.c:2530:osd_object_destroy()) ASSERTION

            Outdated issue; several patches have since landed to fix llog races and related problems, so this may already be fixed. Please reopen if it appears again.

            tappro Mikhail Pershin added a comment

            Created the bug https://jira.hpdd.intel.com/browse/LU-8496 (race in the changelog clear path). The assertion is different, but the path seems to be the same, so please review.

            520557 Rahul Deshmukh (Inactive) added a comment

            Mike, we had the following fix for Lustre 2.1:

            MRP-1443 llog: avoid llog cancel race
                
                Two or more concurrently running lfs changelog_clear operations
                need to be protected against races. llog_process_thread() used
                to read llogs without taking into account that the llog being
                read may be destroyed by another process. This patch serializes
                changelog cancellations using llog_ctxt's mutex.
            
            diff --git a/lustre/mdd/mdd_device.c b/lustre/mdd/mdd_device.c
            index 1642f0a..8140208 100644
            --- a/lustre/mdd/mdd_device.c
            +++ b/lustre/mdd/mdd_device.c
            @@ -386,6 +386,7 @@ int mdd_changelog_llog_cancel(const struct lu_env *env,
                     if (ctxt == NULL)
                             return -ENXIO;
             
            +        cfs_mutex_lock(&ctxt->loc_mutex);
                     cfs_spin_lock(&mdd->mdd_cl.mc_lock);
                     cur = (long long)mdd->mdd_cl.mc_index;
                     cfs_spin_unlock(&mdd->mdd_cl.mc_lock);
            @@ -413,6 +414,7 @@ int mdd_changelog_llog_cancel(const struct lu_env *env,
             
                     rc = llog_cancel(ctxt, NULL, 1, (struct llog_cookie *)&endrec, 0);
             out:
            +        cfs_mutex_unlock(&ctxt->loc_mutex);
                     llog_ctxt_put(ctxt);
                     return rc;
             }
            

            Do you think it might be useful for 2.7+?

            zam Alexander Zarochentsev added a comment
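
            To make the race and the serialization easier to see outside the kernel, here is a minimal standalone pthread sketch of the pattern the patch above describes: the whole "read the current changelog index and cancel up to endrec" pass runs under one per-context mutex, so two concurrent lfs changelog_clear callers cannot interleave. All names here (clog_ctx, clog_cancel_to, clear_thread) are illustrative, not Lustre APIs.

            /* Build with: cc -pthread sketch.c */
            #include <pthread.h>
            #include <stdio.h>

            struct clog_ctx {
                    pthread_mutex_t loc_mutex;   /* stands in for llog_ctxt's loc_mutex */
                    long            mc_index;    /* current changelog index, like mc_index */
                    long            cancelled;   /* how far records have been cancelled */
            };

            /* Cancel all records up to endrec; endrec < 0 means "up to the current index". */
            static int clog_cancel_to(struct clog_ctx *ctx, long endrec)
            {
                    int rc = 0;

                    pthread_mutex_lock(&ctx->loc_mutex);     /* the added serialization */
                    if (endrec < 0 || endrec > ctx->mc_index)
                            endrec = ctx->mc_index;
                    if (endrec > ctx->cancelled)
                            ctx->cancelled = endrec;         /* "destroy" the records */
                    else
                            rc = -2;                         /* nothing left; like rc = -ENOENT */
                    pthread_mutex_unlock(&ctx->loc_mutex);
                    return rc;
            }

            static void *clear_thread(void *arg)
            {
                    struct clog_ctx *ctx = arg;
                    printf("changelog_clear: rc = %d\n", clog_cancel_to(ctx, -1));
                    return NULL;
            }

            int main(void)
            {
                    struct clog_ctx ctx = {
                            .loc_mutex = PTHREAD_MUTEX_INITIALIZER,
                            .mc_index  = 52272,
                            .cancelled = 0,
                    };
                    pthread_t t1, t2;

                    /* Two concurrent "changelog_clear" callers: with the mutex held, the
                     * second one simply finds the records gone (rc = -2) instead of racing. */
                    pthread_create(&t1, NULL, clear_thread, &ctx);
                    pthread_create(&t2, NULL, clear_thread, &ctx);
                    pthread_join(t1, NULL);
                    pthread_join(t2, NULL);
                    return 0;
            }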
            bzzz Alex Zhuravlev added a comment - - edited

            [delete unrelated test failure]

            yong.fan nasf (Inactive) added a comment - - edited

            [delete unrelated test failure]


            This system was upgraded from 2.5, and I have only seen this once. We did, however, have an issue with an MDS not responding after a minor network outage, but I do not have a compelling set of logs to suggest the two are related.

            I will put the console output below:

            Jun  8 13:27:57 cs04r-sc-mds03-01 kernel: LNet: There was an unexpected network error while writing to 172.23.148.22: -110.
            Jun  8 13:30:32 cs04r-sc-mds03-01 kernel: Lustre: MGS: haven't heard from client de0451fe-3e87-4bcb-2ca6-d2af988671be (at 172.23.148.35@tcp) in 227 seconds. I think it's dead, and I am evicting it. exp ffff881fcf0bcc00, cur 1433766632 expire 1433766482 last 1433766405
            Jun  8 13:30:32 cs04r-sc-mds03-01 kernel: Lustre: Skipped 1 previous similar message
            Jun  8 13:30:48 cs04r-sc-mds03-01 kernel: Lustre: lustre03-MDT0000: Client bb255a22-f3c1-835b-8049-eab34c95ba65 (at 172.23.148.64@tcp) reconnecting
            Jun  8 13:30:59 cs04r-sc-mds03-01 kernel: Lustre: lustre03-MDT0000: Client db5a1353-f37b-fe0a-ccf8-9bc50f7a62ad (at 172.23.148.65@tcp) reconnecting
            Jun  8 13:31:03 cs04r-sc-mds03-01 kernel: Lustre: MGS: Client b85575c0-8d63-0c39-a18e-c25179bf68dd (at 172.23.148.26@tcp) reconnecting
            Jun  8 13:31:08 cs04r-sc-mds03-01 kernel: Lustre: MGS: Client 0e2a3416-2996-0da3-aab5-16ab1d68433f (at 172.23.148.24@tcp) reconnecting
            Jun  8 13:31:08 cs04r-sc-mds03-01 kernel: Lustre: Skipped 2 previous similar messages
            Jun  8 13:31:38 cs04r-sc-mds03-01 kernel: Lustre: MGS: Client 8945eb8e-242f-a306-9ce7-98c47b58cd6c (at 172.23.148.38@tcp) reconnecting
            Jun  8 13:31:38 cs04r-sc-mds03-01 kernel: Lustre: Skipped 1 previous similar message
            Jun  8 13:38:59 cs04r-sc-mds03-01 kernel: LustreError: 20218:0:(llog_cat.c:508:llog_cat_cancel_records()) lustre03-MDD0000: fail to cancel 0 of 1 llog-records: rc = -2
            Jun  8 13:38:59 cs04r-sc-mds03-01 kernel: LustreError: 20218:0:(llog_cat.c:508:llog_cat_cancel_records()) Skipped 18 previous similar messages
            Jun  8 13:38:59 cs04r-sc-mds03-01 kernel: LustreError: 20218:0:(mdd_device.c:260:llog_changelog_cancel()) lustre03-MDD0000: cancel idx 52222 of catalog 0x8:10 rc=-2
            Jun  8 13:38:59 cs04r-sc-mds03-01 kernel: LustreError: 20218:0:(mdd_device.c:260:llog_changelog_cancel()) Skipped 18 previous similar messages
            Jun  8 13:49:04 cs04r-sc-mds03-01 kernel: LustreError: 18959:0:(llog_cat.c:508:llog_cat_cancel_records()) lustre03-MDD0000: fail to cancel 0 of 1 llog-records: rc = -2
            Jun  8 13:49:04 cs04r-sc-mds03-01 kernel: LustreError: 18959:0:(llog_cat.c:508:llog_cat_cancel_records()) Skipped 17 previous similar messages
            Jun  8 13:49:04 cs04r-sc-mds03-01 kernel: LustreError: 18959:0:(mdd_device.c:260:llog_changelog_cancel()) lustre03-MDD0000: cancel idx 52247 of catalog 0x8:10 rc=-2
            Jun  8 13:49:04 cs04r-sc-mds03-01 kernel: LustreError: 18959:0:(mdd_device.c:260:llog_changelog_cancel()) Skipped 17 previous similar messages
            Jun  8 14:02:30 cs04r-sc-mds03-01 kernel: LustreError: 18965:0:(llog_cat.c:508:llog_cat_cancel_records()) lustre03-MDD0000: fail to cancel 0 of 1 llog-records: rc = -2
            Jun  8 14:02:30 cs04r-sc-mds03-01 kernel: LustreError: 18965:0:(llog_cat.c:508:llog_cat_cancel_records()) Skipped 25 previous similar messages
            Jun  8 14:02:30 cs04r-sc-mds03-01 kernel: LustreError: 18965:0:(mdd_device.c:260:llog_changelog_cancel()) lustre03-MDD0000: cancel idx 52272 of catalog 0x8:10 rc=-2
            Jun  8 14:02:30 cs04r-sc-mds03-01 kernel: LustreError: 18965:0:(mdd_device.c:260:llog_changelog_cancel()) Skipped 25 previous similar messages
            Jun  8 14:04:33 cs04r-sc-mds03-01 kernel: LustreError: 19000:0:(llog_cat.c:163:llog_cat_id2handle()) lustre03-MDD0000: error opening log id 0x10f8e:1:0: rc = -2
            Jun  8 14:04:33 cs04r-sc-mds03-01 kernel: LustreError: 19000:0:(llog_cat.c:537:llog_cat_process_cb()) lustre03-MDD0000: cannot find handle for llog 0x10f8e:1: -2
            
            davebond-diamond Dave Bond (Inactive) added a comment
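
            For reference, the rc = -2 in the console output above is -ENOENT, i.e. the plain llog that the catalog entry points at could no longer be found when the cancel was attempted (the last two log lines say as much: "error opening log id ... rc = -2" and "cannot find handle for llog"). A trivial, self-contained way to confirm the errno mapping:

            #include <errno.h>
            #include <stdio.h>
            #include <string.h>

            int main(void)
            {
                    int rc = -2;   /* as printed by llog_cat_cancel_records() above */

                    /* Prints: rc = -2 -> No such file or directory (ENOENT = 2) */
                    printf("rc = %d -> %s (ENOENT = %d)\n", rc, strerror(-rc), ENOENT);
                    return 0;
            }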

            People

              Assignee: tappro Mikhail Pershin
              Reporter: davebond-diamond Dave Bond (Inactive)
              Votes: 0
              Watchers: 10
