Uploaded image for project: 'Lustre'
  1. Lustre
  2. LU-11205

Failure to clear the changelog for user 1 on MDT

Details

    • Bug
    • Resolution: Duplicate
    • Major
    • None
    • Lustre 2.12.0, Lustre 2.10.4, Lustre 2.10.6
    • CentOS 7.4 (3.10.0-693.2.2.el7_lustre.pl1.x86_64)
    • 3
    • 9223372036854775807

    Description

      Hello,

      We're seeing the following messages on Oak's MDT in 2.10.4:

      Aug 03 09:21:39 oak-md1-s2 kernel: Lustre: 11137:0:(mdd_device.c:1577:mdd_changelog_clear()) oak-MDD0000: Failure to clear the changelog for user 1: -22
      Aug 03 09:31:38 oak-md1-s2 kernel: Lustre: 11271:0:(mdd_device.c:1577:mdd_changelog_clear()) oak-MDD0000: Failure to clear the changelog for user 1: -22
      

      Robinhood (also running 2.10.4) shows this:

      2018/08/03 10:00:47 [13766/22] ChangeLog | ERROR: llapi_changelog_clear("oak-MDT0000", "cl1", 13975842301) returned -22
      2018/08/03 10:00:47 [13766/22] EntryProc | Error -22 performing callback at stage STAGE_CHGLOG_CLR.
      2018/08/03 10:00:47 [13766/16] llapi | cannot purge records for 'cl1'
      2018/08/03 10:00:47 [13766/16] ChangeLog | ERROR: llapi_changelog_clear("oak-MDT0000", "cl1", 13975842303) returned -22
      2018/08/03 10:00:47 [13766/16] EntryProc | Error -22 performing callback at stage STAGE_CHGLOG_CLR.
      2018/08/03 10:00:47 [13766/4] llapi | cannot purge records for 'cl1'
      2018/08/03 10:00:47 [13766/4] ChangeLog | ERROR: llapi_changelog_clear("oak-MDT0000", "cl1", 13975842304) returned -22
      2018/08/03 10:00:47 [13766/4] EntryProc | Error -22 performing callback at stage STAGE_CHGLOG_CLR.
      

      Oak's MDT usage is as follow:

      [root@oak-md1-s2 ~]# df -h -t lustre
      Filesystem                  Size  Used Avail Use% Mounted on
      /dev/mapper/md1-rbod1-mdt0  1.3T  131G 1022G  12% /mnt/oak/mdt/0
      [root@oak-md1-s2 ~]# df -i -t lustre
      Filesystem                    Inodes     IUsed     IFree IUse% Mounted on
      /dev/mapper/md1-rbod1-mdt0 873332736 266515673 606817063   31% /mnt/oak/mdt/0
      

      I'm concerned that the MDT might fill up with changelogs. Could you please assist in troubleshooting this issue?
      Thanks!
      Stephane

      Attachments

        1. changelog-reader.tgz
          0.8 kB
        2. dk_ornl_20190328_1216.gz
          4.61 MB
        3. dk_ornl_20190328.gz
          4.59 MB
        4. dk.1547747365.gz
          8.56 MB
        5. dk.1547747668.gz
          8.58 MB
        6. dk.1547828521.gz
          7.96 MB
        7. f2_llog_reader_20190328.gz
          1.03 MB
        8. lu-11205-ssec.log.gz
          647 kB
        9. ornl_0x1_0xd8_0x0.gz
          1.19 MB

        Issue Links

          Activity

            [LU-11205] Failure to clear the changelog for user 1 on MDT

            Was it better? In our testing it did make the problem go away. I wonder if the problem is less common.

            simmonsja James A Simmons added a comment - Was it better? In our testing it did make the problem go away. I wonder if the problem is less common.

            When I tested the code posted in the review earlier today, I still saw out-of-order and lost changelog records. From this and other tests it does appear that lost changelog records are out-of-order records that are not reliably (and often reliably not) returned to userspace.

            olaf Olaf Weber (Inactive) added a comment - When I tested the code posted in the review earlier today, I still saw out-of-order and lost changelog records. From this and other tests it does appear that lost changelog records are out-of-order records that are not reliably (and often reliably not) returned to userspace.

            running lfs changelog clear 0 restores the changelog_size back to 4711216. There is some strange issue there but that is independent of this patch. The testing has gone very well!! Thanks Boyko.

            simmonsja James A Simmons added a comment - running lfs changelog clear 0 restores the changelog_size back to 4711216. There is some strange issue there but that is independent of this patch. The testing has gone very well!! Thanks Boyko.

            We did some testing and saw these results. Before we started robinhood to purge things we had:

            lctl get_param ..changelog_size mdd.f2-tds-MDT0000.changelog_size=4711216

            Afterwards we saw

            lctl get_param ..changelog_size mdd.f2-tds-MDT0000.changelog_size=928368

            Does this patch help resolve some of the orphan records as well that we couldn't cleanup before?

            simmonsja James A Simmons added a comment - We did some testing and saw these results. Before we started robinhood to purge things we had: lctl get_param . .changelog_size mdd.f2-tds-MDT0000.changelog_size=4711216 Afterwards we saw lctl get_param . .changelog_size mdd.f2-tds-MDT0000.changelog_size=928368 Does this patch help resolve some of the orphan records as well that we couldn't cleanup before?
            simmonsja James A Simmons added a comment - - edited

            Yep, setting it up right now. Thank you.

            simmonsja James A Simmons added a comment - - edited Yep, setting it up right now. Thank you.
            aboyko Alexander Boyko added a comment - - edited

            @James A Simmons, could you try the patch above? It also includes the reproducer for this issue.

            == sanity test 160j: Verify that reordered records could be cancelled in changelog =================== 08:27:16 (1563352036)
            mdd.lustre-MDT0000.changelog_mask=+hsm
            mdd.lustre-MDT0001.changelog_mask=+hsm
            Registered 2 changelog users: 'cl1 cl1'
            fail_loc=0x8000015d
            lustre-MDT0000.2 01CREAT 08:27:18.551297912 2019.07.17 0x0 t=[0x200000402:0x13:0x0] j=touch.0 ef=0xf u=0:0 nid=192.168.101.35@tcp p=[0x200000402:0x3:0x0] f2
            lustre-MDT0000.3 06UNLNK 08:27:18.553672555 2019.07.17 0x1 t=[0x200000402:0xa:0x0] j=rm.0 ef=0xf u=0:0 nid=192.168.101.35@tcp p=[0x200000402:0x5:0x0] f3
            lustre-MDT0000.4 14SATTR 08:27:18.554017844 2019.07.17 0x14 t=[0x200000402:0x8:0x0] j=chmod.0 ef=0xf u=0:0 nid=192.168.101.35@tcp
            lustre-MDT0000.5 11CLOSE 08:27:18.555499959 2019.07.17 0x42 t=[0x200000402:0x13:0x0] j=touch.0 ef=0xf u=0:0 nid=192.168.101.35@tcp
            lustre-MDT0000.1 01CREAT 08:27:18.550045424 2019.07.17 0x0 t=[0x200000402:0x12:0x0] j=touch.0 ef=0xf u=0:0 nid=192.168.101.35@tcp p=[0x200000402:0x7:0x0] 1
            lustre-MDT0000.6 11CLOSE 08:27:19.562156592 2019.07.17 0x42 t=[0x200000402:0x12:0x0] j=touch.0 ef=0xf u=0:0 nid=192.168.101.35@tcp
            2
            lustre-MDT0000: clear the changelog for cl1 to record #2
            3
            lustre-MDT0000: clear the changelog for cl1 to record #3
            4
            lustre-MDT0000: clear the changelog for cl1 to record #4
            5
            lustre-MDT0000: clear the changelog for cl1 to record #5
            1
            lustre-MDT0000: clear the changelog for cl1 to record #1
            lfs changelog_clear: cannot purge records for 'cl1': Invalid argument (22)
            changelog_clear: record out of range: 1
             sanity test_160j: @@@@@@ FAIL: Fail to clear changelog record 1 
              Trace dump:
              = /usr/lib64/lustre/tests/test-framework.sh:5919:error()
              = /usr/lib64/lustre/tests/sanity.sh:13140:test_160j()
              = /usr/lib64/lustre/tests/test-framework.sh:6203:run_one()
              = /usr/lib64/lustre/tests/test-framework.sh:6242:run_one_logged()
              = /usr/lib64/lustre/tests/test-framework.sh:6038:run_test()
              = /usr/lib64/lustre/tests/sanity.sh:13145:main()
            Dumping lctl log to /tmp/test_logs/1563352024/sanity.test_160j.*.1563352042.log
            
            aboyko Alexander Boyko added a comment - - edited @James A Simmons, could you try the patch above? It also includes the reproducer for this issue. == sanity test 160j: Verify that reordered records could be cancelled in changelog =================== 08:27:16 (1563352036) mdd.lustre-MDT0000.changelog_mask=+hsm mdd.lustre-MDT0001.changelog_mask=+hsm Registered 2 changelog users: 'cl1 cl1' fail_loc=0x8000015d lustre-MDT0000.2 01CREAT 08:27:18.551297912 2019.07.17 0x0 t=[0x200000402:0x13:0x0] j=touch.0 ef=0xf u=0:0 nid=192.168.101.35@tcp p=[0x200000402:0x3:0x0] f2 lustre-MDT0000.3 06UNLNK 08:27:18.553672555 2019.07.17 0x1 t=[0x200000402:0xa:0x0] j=rm.0 ef=0xf u=0:0 nid=192.168.101.35@tcp p=[0x200000402:0x5:0x0] f3 lustre-MDT0000.4 14SATTR 08:27:18.554017844 2019.07.17 0x14 t=[0x200000402:0x8:0x0] j=chmod.0 ef=0xf u=0:0 nid=192.168.101.35@tcp lustre-MDT0000.5 11CLOSE 08:27:18.555499959 2019.07.17 0x42 t=[0x200000402:0x13:0x0] j=touch.0 ef=0xf u=0:0 nid=192.168.101.35@tcp lustre-MDT0000.1 01CREAT 08:27:18.550045424 2019.07.17 0x0 t=[0x200000402:0x12:0x0] j=touch.0 ef=0xf u=0:0 nid=192.168.101.35@tcp p=[0x200000402:0x7:0x0] 1 lustre-MDT0000.6 11CLOSE 08:27:19.562156592 2019.07.17 0x42 t=[0x200000402:0x12:0x0] j=touch.0 ef=0xf u=0:0 nid=192.168.101.35@tcp 2 lustre-MDT0000: clear the changelog for cl1 to record #2 3 lustre-MDT0000: clear the changelog for cl1 to record #3 4 lustre-MDT0000: clear the changelog for cl1 to record #4 5 lustre-MDT0000: clear the changelog for cl1 to record #5 1 lustre-MDT0000: clear the changelog for cl1 to record #1 lfs changelog_clear: cannot purge records for 'cl1': Invalid argument (22) changelog_clear: record out of range: 1 sanity test_160j: @@@@@@ FAIL: Fail to clear changelog record 1 Trace dump: = /usr/lib64/lustre/tests/test-framework.sh:5919:error() = /usr/lib64/lustre/tests/sanity.sh:13140:test_160j() = /usr/lib64/lustre/tests/test-framework.sh:6203:run_one() = /usr/lib64/lustre/tests/test-framework.sh:6242:run_one_logged() = /usr/lib64/lustre/tests/test-framework.sh:6038:run_test() = /usr/lib64/lustre/tests/sanity.sh:13145:main() Dumping lctl log to /tmp/test_logs/1563352024/sanity.test_160j.*.1563352042.log

            Alexandr Boyko (c17825@cray.com) uploaded a new patch: https://review.whamcloud.com/35650
            Subject: LU-11205 mdd: process mixing changelog records
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: f9f82470637f7769884a863c8ef90d79fcdfc12f

            gerrit Gerrit Updater added a comment - Alexandr Boyko (c17825@cray.com) uploaded a new patch: https://review.whamcloud.com/35650 Subject: LU-11205 mdd: process mixing changelog records Project: fs/lustre-release Branch: master Current Patch Set: 1 Commit: f9f82470637f7769884a863c8ef90d79fcdfc12f

            I've created a fix, submitting soon.

            aboyko Alexander Boyko added a comment - I've created a fix, submitting soon.

            I tried the fix from Boyko and it didn't resolve the problem

            simmonsja James A Simmons added a comment - I tried the fix from Boyko and it didn't resolve the problem

            Hi @aboyko

            This seems very promising!! I guess all Robinhood users are hitting this issue with recent versions of Lustre. It's unclear to me if it's wasting space on the MDTs, but it's definitively very confusing for Lustre sysadmins.

             Are you going to submit your patch to gerrit (http://review.whamcloud.com/) so that it can be properly reviewed by others?

            Thanks again!

            sthiell Stephane Thiell added a comment - Hi @aboyko This seems very promising!! I guess all Robinhood users are hitting this issue with recent versions of Lustre. It's unclear to me if it's wasting space on the MDTs, but it's definitively very confusing for Lustre sysadmins.  Are you going to submit your patch to gerrit ( http://review.whamcloud.com/ ) so that it can be properly reviewed by others? Thanks again!
            aboyko Alexander Boyko added a comment - - edited

            I want to add my thoughts

            mdd_changelog_store() fill the record index in sync way for mdd,

             rec->cr.cr_index = ++mdd->mdd_cl.mc_index;

            after that

            llog_add()>llog_cat_add_rec>takes write  semaphore and do llog_write_rec( ..LOG_NEXT_IDX..). Next index simply adds to the end of llog.

            So assume  we have two threads in same time, cr_index = 1 and cr_index = 2 before llog_add.

            There is no any guarantee that first cr_index would be the first for llog_write_rec.

            To guaranty the same transaction for reordered cr_index, we need to set cr_index after 

            llog_th = thandle_get_sub(env, th, ctxt->loc_handle->lgh_obj);

            diff --git a/lustre/mdd/mdd_dir.c b/lustre/mdd/mdd_dir.c
            index 3a24e21..21ad94b 100644
            --- a/lustre/mdd/mdd_dir.c
            +++ b/lustre/mdd/mdd_dir.c
            @@ -806,13 +806,6 @@ int mdd_changelog_store(const struct lu_env *env, struct mdd_device *mdd,
                    rec->cr_hdr.lrh_type = CHANGELOG_REC;
                    rec->cr.cr_time = cl_time();
            
            -       spin_lock(&mdd->mdd_cl.mc_lock);
            -       /* NB: I suppose it's possible llog_add adds out of order wrt cr_index,
            -        * but as long as the MDD transactions are ordered correctly for e.g.
            -        * rename conflicts, I don't think this should matter. */
            -       rec->cr.cr_index = ++mdd->mdd_cl.mc_index;
            -       spin_unlock(&mdd->mdd_cl.mc_lock);
            -
                    ctxt = llog_get_context(obd, LLOG_CHANGELOG_ORIG_CTXT);
                    if (ctxt == NULL)
                            return -ENXIO;
            @@ -821,6 +814,14 @@ int mdd_changelog_store(const struct lu_env *env, struct mdd_device *mdd,
                    if (IS_ERR(llog_th))
                            GOTO(out_put, rc = PTR_ERR(llog_th));
            
            +       spin_lock(&mdd->mdd_cl.mc_lock);
            +       /* NB: I suppose it's possible llog_add adds out of order wrt cr_index,
            +        * but as long as the MDD transactions are ordered correctly for e.g.
            +        * rename conflicts, I don't think this should matter. */
            +       rec->cr.cr_index = ++mdd->mdd_cl.mc_index;
            +       spin_unlock(&mdd->mdd_cl.mc_lock);
            +
            +
                    /* nested journal transaction */
                    rc = llog_add(env, ctxt->loc_handle, &rec->cr_hdr, NULL, llog_th);
            
             

             

            aboyko Alexander Boyko added a comment - - edited I want to add my thoughts mdd_changelog_store() fill the record index in sync way for mdd,  rec->cr.cr_index = ++mdd->mdd_cl.mc_index; after that llog_add() >llog_cat_add_rec >takes write  semaphore and do llog_write_rec( ..LOG_NEXT_IDX..). Next index simply adds to the end of llog. So assume  we have two threads in same time, cr_index = 1 and cr_index = 2 before llog_add. There is no any guarantee that first cr_index would be the first for llog_write_rec. To guaranty the same transaction for reordered cr_index, we need to set cr_index after  llog_th = thandle_get_sub(env, th, ctxt->loc_handle->lgh_obj); diff --git a/lustre/mdd/mdd_dir.c b/lustre/mdd/mdd_dir.c index 3a24e21..21ad94b 100644 --- a/lustre/mdd/mdd_dir.c +++ b/lustre/mdd/mdd_dir.c @@ -806,13 +806,6 @@ int mdd_changelog_store(const struct lu_env *env, struct mdd_device *mdd, rec->cr_hdr.lrh_type = CHANGELOG_REC; rec->cr.cr_time = cl_time(); - spin_lock(&mdd->mdd_cl.mc_lock); - /* NB: I suppose it's possible llog_add adds out of order wrt cr_index, - * but as long as the MDD transactions are ordered correctly for e.g. - * rename conflicts, I don't think this should matter. */ - rec->cr.cr_index = ++mdd->mdd_cl.mc_index; - spin_unlock(&mdd->mdd_cl.mc_lock); - ctxt = llog_get_context(obd, LLOG_CHANGELOG_ORIG_CTXT); if (ctxt == NULL) return -ENXIO; @@ -821,6 +814,14 @@ int mdd_changelog_store(const struct lu_env *env, struct mdd_device *mdd, if (IS_ERR(llog_th)) GOTO(out_put, rc = PTR_ERR(llog_th)); + spin_lock(&mdd->mdd_cl.mc_lock); + /* NB: I suppose it's possible llog_add adds out of order wrt cr_index, + * but as long as the MDD transactions are ordered correctly for e.g. + * rename conflicts, I don't think this should matter. */ + rec->cr.cr_index = ++mdd->mdd_cl.mc_index; + spin_unlock(&mdd->mdd_cl.mc_lock); + + /* nested journal transaction */ rc = llog_add(env, ctxt->loc_handle, &rec->cr_hdr, NULL, llog_th);  

            People

              tappro Mikhail Pershin
              sthiell Stephane Thiell
              Votes:
              3 Vote for this issue
              Watchers:
              26 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: