LU-2548: After upgrade from 1.8.8 to 2.4 hit qmt_entry.c:281:qmt_glb_write()) $$$ failed to update global index, rc:-5

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Blocker
    • Affects Version/s: Lustre 2.4.0
    • Fix Version/s: Lustre 2.4.0
    • Environment:
      before upgrade: client and server are running 1.8.8
      after upgrade: client and server are running lustre-master build #1141
    • Severity: 3
    • 5972

    Description

      After a clean upgrade of the server and client from 1.8.8 to 2.4, I enabled quota with the following steps (see the sketch below):
      1. before setting up Lustre: tunefs.lustre --quota mdsdev/ostdev
      2. after setting up Lustre: lctl conf_param lustre.quota.mdt=ug
      lctl conf_param lustre.quota.ost=ug
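
      For reference, a minimal sketch of that enablement sequence with explicit commands; the device paths are placeholders and the "lustre" fsname is taken from this setup:

      # enable the quota feature on each target before mounting (run once per MDT/OST device)
      tunefs.lustre --quota /dev/mdsdev     # MDT device path is a placeholder
      tunefs.lustre --quota /dev/ostdev     # OST device path is a placeholder

      # after the filesystem is set up, turn on user/group quota enforcement (run on the MGS node)
      lctl conf_param lustre.quota.mdt=ug
      lctl conf_param lustre.quota.ost=ug

      # (assumption) check quota slave state on each target; exact parameter path may vary by version
      lctl get_param osd-*.*.quota_slave.info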

      Then running iozone got this error:

      upgrade-downgrade : @@@@@@ FAIL: iozone did not fail with EDQUOT
      
      Found these errors in the MDS dmesg:
      

      Lustre: DEBUG MARKER: ===== Pass ==================================================================
      Lustre: DEBUG MARKER: ===== Check Lustre quotas usage/limits ======================================
      Lustre: DEBUG MARKER: ===== Verify the data =======================================================
      Lustre: ctl-lustre-MDT0000: super-sequence allocation rc = 0 [0x0000000200000400-0x0000000240000400):0:mdt
      LDISKFS-fs warning (device sdb1): ldiskfs_block_to_path: block 1852143205 > max in inode 24537
      LustreError: 7867:0:(qmt_entry.c:281:qmt_glb_write()) $$$ failed to update global index, rc:-5 qmt:lustre-QMT0000 pool:0-md id:60001 enforced:1 hard:5120 soft:0 granted:1024 time:0 qunit:1024 edquot:0 may_rel:0 revoke:4297684387
      LustreError: 10848:0:(qsd_handler.c:344:qsd_req_completion()) $$$ DQACQ failed with -5, flags:0x1 qsd:lustre-MDT0000 qtype:usr id:60001 enforced:1 granted:3 pending:0 waiting:2 req:1 usage:3 qunit:0 qtune:0 edquot:0
      Lustre: DEBUG MARKER: upgrade-downgrade : @@@@@@ FAIL: iozone did not fail with EDQUOT
      LDISKFS-fs warning (device sdb1): ldiskfs_block_to_path:
      LDISKFS-fs warning (device sdb1): ldiskfs_block_to_path: block 1852143205 > max in inode 24537
      LustreError: 10877:0:(qmt_entry.c:281:qmt_glb_write()) $$$ failed to update global index, rc:-5 qmt:lustre-QMT0000 pool:0-md id:60001 enforced:1 hard:5120 soft:0 granted:1026 time:0 qunit:1024 edquot:0 may_rel:0 revoke:4297684387
      LustreError: 7577:0:(qsd_handler.c:344:qsd_req_completion()) $$$ DQACQ failed with -5, flags:0x2 qsd:lustre-MDT0000 qtype:usr id:60001 enforced:1 granted:3 pending:0 waiting:0 req:1 usage:2 qunit:1024 qtune:512 edquot:0
      LDISKFS-fs warning (device sdb1): ldiskfs_block_to_path: block 1852143205 > max in inode 24537
      LDISKFS-fs warning (device sdb1): ldiskfs_block_to_path: block 1852143205 > max in inode 24537
      block 1768711539 > max in inode 24538
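
      For context, the failing check expects writes beyond the hard block limit to hit EDQUOT ("Disk quota exceeded"). A minimal manual check along those lines might look like the sketch below; uid 60001 and the 5120 KB hard limit are taken from the log above, while the mount point and file names are placeholders:

      # set a 5120 KB hard block limit for the test uid seen in the logs
      lfs setquota -u 60001 -B 5120 /mnt/lustre

      # give that uid somewhere to write, then write well past the limit; dd should fail with EDQUOT
      mkdir -p /mnt/lustre/qtest && chown 60001 /mnt/lustre/qtest
      sudo -u "#60001" dd if=/dev/zero of=/mnt/lustre/qtest/file bs=1M count=10

      # show the current usage and limits for that uid
      lfs quota -u 60001 /mnt/lustre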

      
      

          Activity


            niu Niu Yawei (Inactive) added a comment -

            Well, I realize that the original IAM index truncation is not quite right; the IAM container wasn't reinitialized after truncation. I've updated patch 5292, and the new patch works for me. Sarah, could you verify whether it fixes your problem? Thanks.
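
            A minimal sketch of pulling that review for testing, assuming the standard Gerrit workflow for fs/lustre-release; the patchset suffix "/1" is a placeholder, so use the latest patchset shown on http://review.whamcloud.com/5292:

            git clone git://git.whamcloud.com/fs/lustre-release.git
            cd lustre-release
            # fetch change 5292 and apply it on top of the branch under test
            git fetch http://review.whamcloud.com/fs/lustre-release refs/changes/92/5292/1
            git cherry-pick FETCH_HEAD
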
            sarah Sarah Liu added a comment -

            Sure, will get back to you when I have the result


            niu Niu Yawei (Inactive) added a comment -

            My test shows that truncating the global index before the migration leads to the IAM error. To avoid blocking the other 1.8 upgrade tests, I've posted a temporary fix (skip the index truncation during migration): http://review.whamcloud.com/5292

            Sarah, could you check whether the above patch works for you too? Thanks.

            niu Niu Yawei (Inactive) added a comment -

            I can reproduce the original problem in my local environment now; it seems something is wrong in IAM when upgrading from 1.8 to 2.4 (2.1 -> 2.4 is fine). I will look into it more closely.

            niu Niu Yawei (Inactive) added a comment -

            Don't apply the migration to the global index copy: http://review.whamcloud.com/5259

            Actually, I'm still not quite sure why qmt_glb_write() failed, but at the very least we shouldn't run the migration on the global index copy.

            niu Niu Yawei (Inactive) added a comment -

            I see; those messages should come from the global index copies of the quota slave on the MDT, and the migration should not be applied to those global index copies. The "qmt_glb_write()) $$$ failed to update global index, rc:-5" failure is probably caused by the migration racing with a normal global index copy update. I'll post a patch to fix this.

            niu Niu Yawei (Inactive) added a comment -

            I found something really weird in the dmesg (1.8 upgrade to 2.4):

            Lustre: lustre-MDT0000: Migrate inode quota from old admin quota file(admin_quotafile_v2.usr) to new IAM quota index([0x200000006:0x10000:0x0]).
            Lustre: lustre-MDT0000: Migrate inode quota from old admin quota file(admin_quotafile_v2.grp) to new IAM quota index([0x200000006:0x1010000:0x0]).
            Lustre: 31664:0:(mdt_handler.c:5261:mdt_process_config()) For interoperability, skip this mdt.group_upcall. It is obsolete.
            Lustre: 31664:0:(mdt_handler.c:5261:mdt_process_config()) For interoperability, skip this mdt.quota_type. It is obsolete.
            Lustre: lustre-MDT0000: Temporarily refusing client connection from 0@lo
            LustreError: 11-0: an error occurred while communicating with 0@lo. The mds_connect operation failed with -11
            Lustre: lustre-MDT0000: Migrate inode quota from old admin quota file(admin_quotafile_v2.usr) to new IAM quota index([0x200000003:0x8:0x0]).
            Lustre: Skipped 2 previous similar messages
            

            It says the MDT is trying to migrate inode user quota into FID [0x200000003:0x8:0x0], which isn't a quota global index FID. I can't see from the code how this could happen, and I can't reproduce it locally either.

            Sarah, could you show me how you reproduced it? If it's reproducible, could you capture the log with D_QUOTA & D_TRACE enabled for the MDT startup procedure only (i.e., start the MDT on the old 1.8 device)? The startup log was truncated in your attached logs. Thanks in advance.
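
            A minimal sketch of capturing such a startup log on the MDS, assuming the usual lctl debug workflow; the device path, mount point, and output file are placeholders:

            # enable the quota and trace debug masks before starting the MDT
            lctl set_param debug=+quota        # D_QUOTA
            lctl set_param debug=+trace        # D_TRACE
            lctl clear                         # drop any stale entries from the debug buffer

            # start the MDT on the old 1.8-formatted device (placeholder path/mount point)
            mount -t lustre /dev/mdsdev /mnt/mds

            # once the MDT is up, dump the kernel debug buffer to a file and attach it to the ticket
            lctl dk /tmp/mdt-startup-debug.log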

            sarah Sarah Liu added a comment -

            Upgrade from 2.1.4 to 2.4 hit LU-2587.

            sarah Sarah Liu added a comment -

            MDS dmesg and debug logs of 1.8->2.4

            sarah Sarah Liu added a comment -

            Niu, I tried upgrading 1.8->2.4 again and it can be reproduced.

            sarah Sarah Liu added a comment -

            Niu, this time I upgraded to the latest tag, 2.3.58; that's a different build from the first time.

            I will keep you updated once I finish the 2.1 to 2.4 upgrade and try 1.8 to 2.4 again to see if it happens every time.


            People

              Assignee: niu Niu Yawei (Inactive)
              Reporter: sarah Sarah Liu
              Votes: 0
              Watchers: 6
