
LU-2548: After upgrade from 1.8.8 to 2.4 hit qmt_entry.c:281:qmt_glb_write() $$$ failed to update global index, rc:-5

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Blocker
    • Fix Version: Lustre 2.4.0
    • Affects Version: Lustre 2.4.0
    • Environment:
      before upgrade: client and server are running 1.8.8
      after upgrade: client and server are running lustre-master build #1141
    • 3
    • 5972

    Description

      After a clean upgrade of server and client from 1.8.8 to 2.4, I enabled quota with the following steps (a consolidated sketch of the commands follows the list):
      1. before setting up Lustre: tunefs.lustre --quota mdsdev/ostdev
      2. after setting up Lustre: lctl conf_param lustre.quota.mdt=ug
      lctl conf_param lustre.quota.ost=ug
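
      For reference, a minimal consolidated sketch of the sequence above; the device paths are placeholders, and the filesystem name "lustre" is taken from the conf_param lines:

      # on the servers, with the targets unmounted (device paths are examples)
      tunefs.lustre --quota /dev/mdt_dev
      tunefs.lustre --quota /dev/ost_dev
      # on the MGS node, once the filesystem is set up
      lctl conf_param lustre.quota.mdt=ug
      lctl conf_param lustre.quota.ost=ug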

      then I ran iozone and got this error:

      upgrade-downgrade : @@@@@@ FAIL: iozone did not fail with EDQUOT
      
      I found these errors in the MDS dmesg:
      

      Lustre: DEBUG MARKER: ===== Pass ==================================================================
      Lustre: DEBUG MARKER: ===== Check Lustre quotas usage/limits ======================================
      Lustre: DEBUG MARKER: ===== Verify the data =======================================================
      Lustre: ctl-lustre-MDT0000: super-sequence allocation rc = 0 [0x0000000200000400-0x0000000240000400):0:mdt
      LDISKFS-fs warning (device sdb1): ldiskfs_block_to_path: block 1852143205 > max in inode 24537
      LustreError: 7867:0:(qmt_entry.c:281:qmt_glb_write()) $$$ failed to update global index, rc:-5 qmt:lustre-QMT0000 pool:0-md id:60001 enforced:1 hard:5120 soft:0 granted:1024 time:0 qunit:1024 edquot:0 may_rel:0 revoke:4297684387
      LustreError: 10848:0:(qsd_handler.c:344:qsd_req_completion()) $$$ DQACQ failed with -5, flags:0x1 qsd:lustre-MDT0000 qtype:usr id:60001 enforced:1 granted:3 pending:0 waiting:2 req:1 usage:3 qunit:0 qtune:0 edquot:0
      Lustre: DEBUG MARKER: upgrade-downgrade : @@@@@@ FAIL: iozone did not fail with EDQUOT
      LDISKFS-fs warning (device sdb1): ldiskfs_block_to_path:
      LDISKFS-fs warning (device sdb1): ldiskfs_block_to_path: block 1852143205 > max in inode 24537
      LustreError: 10877:0:(qmt_entry.c:281:qmt_glb_write()) $$$ failed to update global index, rc:-5 qmt:lustre-QMT0000 pool:0-md id:60001 enforced:1 hard:5120 soft:0 granted:1026 time:0 qunit:1024 edquot:0 may_rel:0 revoke:4297684387
      LustreError: 7577:0:(qsd_handler.c:344:qsd_req_completion()) $$$ DQACQ failed with -5, flags:0x2 qsd:lustre-MDT0000 qtype:usr id:60001 enforced:1 granted:3 pending:0 waiting:0 req:1 usage:2 qunit:1024 qtune:512 edquot:0
      LDISKFS-fs warning (device sdb1): ldiskfs_block_to_path: block 1852143205 > max in inode 24537
      LDISKFS-fs warning (device sdb1): ldiskfs_block_to_path: block 1852143205 > max in inode 24537
      block 1768711539 > max in inode 24538
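
      For context, this is roughly how the quota-enforcement check the test performs can be exercised by hand. A minimal sketch: the limits, paths, and file sizes below are illustrative assumptions, not the exact values the test used; runas is the helper from lustre/tests that appears elsewhere in this ticket:

      # set a small block hard limit (in KB) for quota_usr, e.g. 10 MB
      lfs setquota -u quota_usr -b 0 -B 10240 -i 0 -I 0 /mnt/lustre
      # writing well past the hard limit should fail with EDQUOT ("Disk quota exceeded");
      # the target directory must be writable by quota_usr
      runas -u quota_usr dd if=/dev/zero of=/mnt/lustre/quota_usr_dir/file bs=1M count=20
      # inspect usage and limits afterwards
      lfs quota -u quota_usr /mnt/lustre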

      
      


          Activity

            pjones Peter Jones added a comment -

            Landed for 2.4

            sarah Sarah Liu added a comment -

            I think the error I hit, commented on 09/Feb/13, may be LU-1488, which should be fixed in lustre-1.8.9. I will run the test again, upgrading from 1.8.9 to master, to see if this still happens.


            niu Niu Yawei (Inactive) added a comment -

            Sarah, this looks like another issue (not related to quota). Could you check whether your interop tests can pass even without quota enforced? If not, I think we should open another ticket with a more detailed description of the issue.


            niu Niu Yawei (Inactive) added a comment -

            Andreas, yes, I think 5259 and 5292 should be landed regardless of interop testing; otherwise 5293 (the fix for LU-2688) will not pass. The fix for LU-2688 can then be landed afterwards.

            sarah Sarah Liu added a comment -

            Niu,

            When I tried to upgrade from 1.8.8 to lustre-reviews build #13046, which contains your fix, I hit the following errors while extracting a kernel tarball. That means the test failed even before it reached the iozone step that caused the original failure.

            client console:

            ===== Verify the data =======================================================
            Lustre: DEBUG MARKER: ===== Verify the data =======================================================
            Verifying the extracted kernel tarball...
            LustreError: 20678:0:(namei.c:256:ll_mdc_blocking_ast()) ### data mismatch with ino 144115205255725057/0 (ffff810312a3dce0) ns: lustre-MDT0000-mdc-ffff810324d9ac00 lock: ffff810325f31a00/0x77e8acf10368301e lrc: 2/0,0 mode: PR/PR res: 8589935616/1 bits 0x13 rrc: 2 type: IBT flags: 0x2090 remote: 0x1a32fabd9561d955 expref: -99 pid: 20678 timeout: 0
            + runas -u quota_usr tar xjf /mnt/lustre/d0.upgrade-downgrade/quota_usr/linux-2.6.18-238.12.1.el5.tar.bz2 -C /mnt/lustre/d0.upgrade-downgrade/quota_usr.new
            LustreError: 20682:0:(namei.c:256:ll_mdc_blocking_ast()) ### data mismatch with ino 144115205255725057/0 (ffff81030c6b05a0) ns: lustre-MDT0000-mdc-ffff810324d9ac00 lock: ffff810306ac8600/0x77e8acf10368305d lrc: 2/0,0 mode: PR/PR res: 8589935616/1 bits 0x13 rrc: 2 type: IBT flags: 0x2090 remote: 0x1a32fabd9561d9f6 expref: -99 pid: 20682 timeout: 0
            LustreError: 20682:0:(namei.c:256:ll_mdc_blocking_ast()) ### data mismatch with ino 144115205255731243/0 (ffff8103034ffd60) ns: lustre-MDT0000-mdc-ffff810324d9ac00 lock: ffff81030e8ee800/0x77e8acf10368e2eb lrc: 2/0,0 mode: PR/PR res: 8589935616/6187 bits 0x13 rrc: 2 type: IBT flags: 0x2090 remote: 0x1a32fabd9564217a expref: -99 pid: 20682 timeout: 0
            
             upgrade-downgrade : @@@@@@ FAIL: runas -u quota_usr tar xjf /mnt/lustre/d0.upgrade-downgrade/quota_usr/linux-2.6.18-238.12.1.el5.tar.bz2 -C /mnt/lustre/d0.upgrade-downgrade/quota_usr.new failed 
            Lustre: DEBUG MARKER: upgrade-downgrade : @@@@@@ FAIL: runas -u quota_usr tar xjf /mnt/lustre/d0.upgrade-downgrade/quota_usr/linux-2.6.18-238.12.1.el5.tar.bz2 -C /mnt/lustre/d0.upgrade-downgrade/quota_usr.new failed
            Dumping lctl log to /tmp/test_logs/1360393563/upgrade-downgrade..*.1360398341.log
            

            OST console:

            [root@fat-amd-3 ~]# Lustre: DEBUG MARKER: upgrade-downgrade : @@@@@@ FAIL: runas -u quota_usr tar xjf /mnt/lustre/d0.upgrade-downgrade/quota_usr/linux-2.6.18-238.12.1.el5.tar.bz2 -C /mnt/lustre/d0.upgrade-downgrade/quota_usr.new failed
            LNet: Service thread pid 7926 was inactive for 40.00s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
            Pid: 7926, comm: ll_ost01_001
            
            Call Trace:
             [<ffffffff814ead12>] schedule_timeout+0x192/0x2e0
             [<ffffffff8107cb50>] ? process_timeout+0x0/0x10
             [<ffffffffa037c6d1>] cfs_waitq_timedwait+0x11/0x20 [libcfs]
             [<ffffffffa064924d>] ldlm_completion_ast+0x4ed/0x960 [ptlrpc]
             [<ffffffffa0644970>] ? ldlm_expired_completion_wait+0x0/0x390 [ptlrpc]
             [<ffffffff8105fa40>] ? default_wake_function+0x0/0x20
             [<ffffffffa0648988>] ldlm_cli_enqueue_local+0x1f8/0x5d0 [ptlrpc]
             [<ffffffffa0648d60>] ? ldlm_completion_ast+0x0/0x960 [ptlrpc]
             [<ffffffffa0647700>] ? ldlm_blocking_ast+0x0/0x180 [ptlrpc]
             [<ffffffffa0d38740>] ofd_destroy_by_fid+0x160/0x380 [ofd]
             [<ffffffffa0647700>] ? ldlm_blocking_ast+0x0/0x180 [ptlrpc]
             [<ffffffffa0648d60>] ? ldlm_completion_ast+0x0/0x960 [ptlrpc]
             [<ffffffffa0670c15>] ? lustre_msg_buf+0x55/0x60 [ptlrpc]
             [<ffffffffa0d39c67>] ofd_destroy+0x187/0x670 [ofd]
             [<ffffffffa038c2e1>] ? libcfs_debug_msg+0x41/0x50 [libcfs]
             [<ffffffffa0d11731>] ost_handle+0x38f1/0x46f0 [ost]
             [<ffffffffa0388154>] ? libcfs_id2str+0x74/0xb0 [libcfs]
             [<ffffffffa0681c7c>] ptlrpc_server_handle_request+0x41c/0xdf0 [ptlrpc]
             [<ffffffffa037c5de>] ? cfs_timer_arm+0xe/0x10 [libcfs]
             [<ffffffffa06793a9>] ? ptlrpc_wait_event+0xa9/0x290 [ptlrpc]
             [<ffffffffa038c2e1>] ? libcfs_debug_msg+0x41/0x50 [libcfs]
             [<ffffffff81052223>] ? __wake_up+0x53/0x70
             [<ffffffffa06831c6>] ptlrpc_main+0xb76/0x1870 [ptlrpc]
             [<ffffffffa0682650>] ? ptlrpc_main+0x0/0x1870 [ptlrpc]
             [<ffffffff8100c0ca>] child_rip+0xa/0x20
             [<ffffffffa0682650>] ? ptlrpc_main+0x0/0x1870 [ptlrpc]
             [<ffffffffa0682650>] ? ptlrpc_main+0x0/0x1870 [ptlrpc]
             [<ffffffff8100c0c0>] ? child_rip+0x0/0x20
            
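
            For reference, the "Dumping lctl log to ..." line in the client console above refers to the Lustre kernel debug log. A minimal sketch of capturing it by hand when investigating a hang like the one in the OST stack trace; the debug mask value is only an illustration:

            # optionally widen the debug mask before reproducing the problem
            lctl set_param debug=+dlmtrace
            # dump the kernel debug buffer to a file once the hang is observed
            lctl dk /tmp/lustre-debug.log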

            adilger Andreas Dilger added a comment -

            Niu, can the http://review.whamcloud.com/5259 and http://review.whamcloud.com/5292 patches be landed regardless of interop testing, or should they wait for Sarah to do manual testing?

            Getting http://bugs.whamcloud.com/browse/LU-2688 (quota in conf-sanity.sh test_32) landed would help avoid this problem in the future.
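
            For reference, a hedged sketch of how a conf-sanity subtest such as test_32 is typically run from a Lustre source tree; the ONLY= filter is the usual test-framework convention, but the exact subtest name (e.g. 32a) and the cluster configuration file are assumptions:

            # from the lustre/tests directory, with a suitable cfg/local.sh in place
            cd lustre/tests
            ONLY=32a bash conf-sanity.sh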


            niu Niu Yawei (Inactive) added a comment -

            Well, I realize that the original IAM index truncation is not quite right: the IAM container wasn't reinitialized after truncation. I've updated patch 5292, and the new patch works for me. Sarah, could you verify whether it fixes your problem? Thanks.

            sarah Sarah Liu added a comment -

            Sure, will get back to you when I have the result


            niu Niu Yawei (Inactive) added a comment -

            My test shows that truncating the global index before the migration leads to the IAM error. To avoid blocking the other 1.8 upgrade tests, I've posted a temporary fix (skip the index truncation during migration): http://review.whamcloud.com/5292

            Sarah, could you check whether the above patch works for you too? Thanks.

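
            In case it is useful, a hedged sketch of pulling a Gerrit change such as 5292 for local testing; the remote URL form and the patchset number "1" are assumptions, not taken from this ticket:

            # fetch patchset 1 of change 5292 and check it out
            # (the last two digits of the change number form the middle path element)
            git fetch http://review.whamcloud.com/fs/lustre-release refs/changes/92/5292/1
            git checkout FETCH_HEAD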

            niu Niu Yawei (Inactive) added a comment -

            I can reproduce the original problem in my local environment now. It looks like something is wrong in IAM when upgrading from 1.8 to 2.4 (2.1 -> 2.4 is fine); I will look into it more closely.


            People

              Assignee: niu Niu Yawei (Inactive)
              Reporter: sarah Sarah Liu
              Votes: 0
              Watchers: 6
