[LU-2548] After upgrade from 1.8.8 to 2.4 hit qmt_entry.c:281:qmt_glb_write()) $$$ failed to update global index, rc:-5 Created: 28/Dec/12  Updated: 27/Feb/13  Resolved: 27/Feb/13

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.4.0
Fix Version/s: Lustre 2.4.0

Type: Bug Priority: Blocker
Reporter: Sarah Liu Assignee: Niu Yawei (Inactive)
Resolution: Fixed Votes: 0
Labels: HB
Environment:

before upgrade: client and server are running 1.8.8
after upgrade: client and server are running lustre-master build#1141


Attachments: File debug_amd-1.tar.gz     File dmesg     Text File upgrade-downgrade..debug_log.fat-amd-1.1357602330.log     Text File upgrade-downgrade..dmesg.fat-amd-1.1357602330.log    
Issue Links:
Related
is related to LU-2688 add quota upgrade checks to conf_sani... Resolved
Severity: 3
Rank (Obsolete): 5972

 Description   

After a clean upgrade of the server and client from 1.8.8 to 2.4, I enabled quota with the following steps (a consolidated sketch follows the list):
1. before setting up Lustre: tunefs.lustre --quota mdsdev/ostdev
2. after setting up Lustre: lctl conf_param lustre.quota.mdt=ug
lctl conf_param lustre.quota.ost=ug
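
For reference, a minimal consolidated sketch of the same sequence, assuming the MDT device is /dev/sdb1 and the OST device is /dev/sdc1 (device names and mount order are placeholders, not taken from this report):

# with the targets unmounted, enable the quota feature on each device
tunefs.lustre --quota /dev/sdb1     # on the MDS
tunefs.lustre --quota /dev/sdc1     # on each OSS
# after the targets are mounted, enable user/group enforcement (run on the MGS)
lctl conf_param lustre.quota.mdt=ug
lctl conf_param lustre.quota.ost=ug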

Then I ran iozone and got this error:

upgrade-downgrade : @@@@@@ FAIL: iozone did not fail with EDQUOT

I found these errors in the MDS dmesg:

Lustre: DEBUG MARKER: ===== Pass ==================================================================
Lustre: DEBUG MARKER: ===== Check Lustre quotas usage/limits ======================================
Lustre: DEBUG MARKER: ===== Verify the data =======================================================
Lustre: ctl-lustre-MDT0000: super-sequence allocation rc = 0 [0x0000000200000400-0x0000000240000400):0:mdt
LDISKFS-fs warning (device sdb1): ldiskfs_block_to_path: block 1852143205 > max in inode 24537
LustreError: 7867:0:(qmt_entry.c:281:qmt_glb_write()) $$$ failed to update global index, rc:-5 qmt:lustre-QMT0000 pool:0-md id:60001 enforced:1 hard:5120 soft:0 granted:1024 time:0 qunit:1024 edquot:0 may_rel:0 revoke:4297684387
LustreError: 10848:0:(qsd_handler.c:344:qsd_req_completion()) $$$ DQACQ failed with -5, flags:0x1 qsd:lustre-MDT0000 qtype:usr id:60001 enforced:1 granted:3 pending:0 waiting:2 req:1 usage:3 qunit:0 qtune:0 edquot:0
Lustre: DEBUG MARKER: upgrade-downgrade : @@@@@@ FAIL: iozone did not fail with EDQUOT
LDISKFS-fs warning (device sdb1): ldiskfs_block_to_path:
LDISKFS-fs warning (device sdb1): ldiskfs_block_to_path: block 1852143205 > max in inode 24537
LustreError: 10877:0:(qmt_entry.c:281:qmt_glb_write()) $$$ failed to update global index, rc:-5 qmt:lustre-QMT0000 pool:0-md id:60001 enforced:1 hard:5120 soft:0 granted:1026 time:0 qunit:1024 edquot:0 may_rel:0 revoke:4297684387
LustreError: 7577:0:(qsd_handler.c:344:qsd_req_completion()) $$$ DQACQ failed with -5, flags:0x2 qsd:lustre-MDT0000 qtype:usr id:60001 enforced:1 granted:3 pending:0 waiting:0 req:1 usage:2 qunit:1024 qtune:512 edquot:0
LDISKFS-fs warning (device sdb1): ldiskfs_block_to_path: block 1852143205 > max in inode 24537
LDISKFS-fs warning (device sdb1): ldiskfs_block_to_path: block 1852143205 > max in inode 24537
block 1768711539 > max in inode 24538
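
For context, enforcement can be sanity-checked by hand before iozone runs. A minimal sketch, assuming a client mount at /mnt/lustre and the 5120 KB hard block limit visible in the log above (the test file path is a placeholder; runas is the Lustre test-suite helper that also appears later in this ticket):

lfs setquota -u quota_usr -B 5120 /mnt/lustre    # 5120 KB hard block limit
runas -u quota_usr dd if=/dev/zero of=/mnt/lustre/qtest bs=1M count=10
# dd should stop short with "Disk quota exceeded" (EDQUOT)
lfs quota -u quota_usr /mnt/lustre               # compare usage against the limits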




 Comments   
Comment by Niu Yawei (Inactive) [ 31/Dec/12 ]

Hi, Sarah. Could you enable TRACE to collect a debug log for this failure? Thanks.

Comment by Sarah Liu [ 02/Jan/13 ]

will keep you updated.

Comment by Sarah Liu [ 03/Jan/13 ]

debug and dmesg logs from MDS

Comment by Niu Yawei (Inactive) [ 04/Jan/13 ]

This time the error messages are different from the first time. Sarah, is it reproducible when upgrading from 2.1 to 2.4? How often did it happen when upgrading from 1.8 to 2.4? Thanks.

Comment by Sarah Liu [ 04/Jan/13 ]

Niu, this time I upgraded to the latest tag, 2.3.58, which is a different build from the first time.

I will keep you updated when I finish upgrading from 2.1 to 2.4, and I will try 1.8 to 2.4 again to see if it happens every time.

Comment by Sarah Liu [ 07/Jan/13 ]

Niu, I tried upgrading 1.8->2.4 again and it can be reproduced.

Comment by Sarah Liu [ 07/Jan/13 ]

MDS dmesg and debug logs of 1.8->2.4

Comment by Sarah Liu [ 08/Jan/13 ]

upgrade from 2.1.4 to 2.4 hit LU-2587

Comment by Niu Yawei (Inactive) [ 01/Feb/13 ]

I found something really weird in the dmesg (1.8 upgrade to 2.4):

Lustre: lustre-MDT0000: Migrate inode quota from old admin quota file(admin_quotafile_v2.usr) to new IAM quota index([0x200000006:0x10000:0x0]).
Lustre: lustre-MDT0000: Migrate inode quota from old admin quota file(admin_quotafile_v2.grp) to new IAM quota index([0x200000006:0x1010000:0x0]).
Lustre: 31664:0:(mdt_handler.c:5261:mdt_process_config()) For interoperability, skip this mdt.group_upcall. It is obsolete.
Lustre: 31664:0:(mdt_handler.c:5261:mdt_process_config()) For interoperability, skip this mdt.quota_type. It is obsolete.
Lustre: lustre-MDT0000: Temporarily refusing client connection from 0@lo
LustreError: 11-0: an error occurred while communicating with 0@lo. The mds_connect operation failed with -11
Lustre: lustre-MDT0000: Migrate inode quota from old admin quota file(admin_quotafile_v2.usr) to new IAM quota index([0x200000003:0x8:0x0]).
Lustre: Skipped 2 previous similar messages

It says the MDT is trying to migrate inode user quota into fid [0x200000003:0x8:0x0], which isn't a quota global index fid. I can't see from the code why this could happen, and I can't reproduce it locally either.

Sarah, could you show me how you reproduced it? If it's reproducible, could you capture the log with D_QUOTA & D_TRACE enabled for the MDT startup procedure only (start the MDT on the old 1.8 device)? The startup log was truncated in your attached logs. Thanks in advance.
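
A minimal sketch of one way to capture such a log, assuming the MDT device is /dev/sdb1 and the mount point /mnt/mds (both placeholders):

lctl set_param debug=+quota                # enable D_QUOTA
lctl set_param debug=+trace                # enable D_TRACE
lctl set_param debug_mb=512                # enlarge the buffer so startup isn't overwritten
lctl clear                                 # drop stale debug entries
mount -t lustre /dev/sdb1 /mnt/mds         # start the MDT on the old 1.8 device
lctl dk /tmp/mdt_startup.debug.log         # dump the debug log right after startup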

Comment by Niu Yawei (Inactive) [ 04/Feb/13 ]

I see: those messages come from the global index copies kept by the quota slave on the MDT, and migration should not be applied to those global index copies. The "qmt_glb_write()) $$$ failed to update global index, rc:-5" failure was probably caused by the migration racing with normal global index copy updates. I'll post a patch to fix this.

Comment by Niu Yawei (Inactive) [ 04/Feb/13 ]

Don't apply migration to the global index copy: http://review.whamcloud.com/5259

Actually, I'm still not quite sure why qmt_glb_write() failed, but at the very least we shouldn't apply migration to the global index copy.

Comment by Niu Yawei (Inactive) [ 06/Feb/13 ]

I can reproduce the original problem in my local environment now; it seems something goes wrong in IAM when upgrading from 1.8 to 2.4 (2.1 -> 2.4 is fine). I will look into it more closely.

Comment by Niu Yawei (Inactive) [ 07/Feb/13 ]

My test shows that truncating the global index before the migration leads to the IAM error. So as not to block the other 1.8 upgrade tests, I've posted a temporary fix (skip the index truncation during migration): http://review.whamcloud.com/5292

Sarah, could you check whether the above patch works for you too? Thanks.

Comment by Sarah Liu [ 07/Feb/13 ]

Sure, will get back to you when I have the result

Comment by Niu Yawei (Inactive) [ 08/Feb/13 ]

Well, I realized that the original IAM index truncation was not quite right: the IAM container wasn't reinitialized after truncation. I've updated patch 5292, and the new patch works for me. Sarah, could you verify whether it fixes your problem? Thanks.

Comment by Andreas Dilger [ 08/Feb/13 ]

Niu, can the http://review.whamcloud.com/5259 and http://review.whamcloud.com/5292 patches be landed regardless of interop testing, or should they wait for Sarah to do manual testing?

Getting http://bugs.whamcloud.com/browse/LU-2688 (quota in conf-sanity.sh test_32) landed would help avoid this problem in the future.

Comment by Sarah Liu [ 09/Feb/13 ]

Niu,

When I tried to upgrade from 1.8.8 to lustre-reviews build #13046, which contains your fix, I hit the following errors while extracting a kernel tarball. That means the test failed even before it ran iozone, which is what caused the original failure.

client console:

===== Verify the data =======================================================
Lustre: DEBUG MARKER: ===== Verify the data =======================================================
Verifying the extracted kernel tarball...
LustreError: 20678:0:(namei.c:256:ll_mdc_blocking_ast()) ### data mismatch with ino 144115205255725057/0 (ffff810312a3dce0) ns: lustre-MDT0000-mdc-ffff810324d9ac00 lock: ffff810325f31a00/0x77e8acf10368301e lrc: 2/0,0 mode: PR/PR res: 8589935616/1 bits 0x13 rrc: 2 type: IBT flags: 0x2090 remote: 0x1a32fabd9561d955 expref: -99 pid: 20678 timeout: 0
+ runas -u quota_usr tar xjf /mnt/lustre/d0.upgrade-downgrade/quota_usr/linux-2.6.18-238.12.1.el5.tar.bz2 -C /mnt/lustre/d0.upgrade-downgrade/quota_usr.new
LustreError: 20682:0:(namei.c:256:ll_mdc_blocking_ast()) ### data mismatch with ino 144115205255725057/0 (ffff81030c6b05a0) ns: lustre-MDT0000-mdc-ffff810324d9ac00 lock: ffff810306ac8600/0x77e8acf10368305d lrc: 2/0,0 mode: PR/PR res: 8589935616/1 bits 0x13 rrc: 2 type: IBT flags: 0x2090 remote: 0x1a32fabd9561d9f6 expref: -99 pid: 20682 timeout: 0
LustreError: 20682:0:(namei.c:256:ll_mdc_blocking_ast()) ### data mismatch with ino 144115205255731243/0 (ffff8103034ffd60) ns: lustre-MDT0000-mdc-ffff810324d9ac00 lock: ffff81030e8ee800/0x77e8acf10368e2eb lrc: 2/0,0 mode: PR/PR res: 8589935616/6187 bits 0x13 rrc: 2 type: IBT flags: 0x2090 remote: 0x1a32fabd9564217a expref: -99 pid: 20682 timeout: 0

 upgrade-downgrade : @@@@@@ FAIL: runas -u quota_usr tar xjf /mnt/lustre/d0.upgrade-downgrade/quota_usr/linux-2.6.18-238.12.1.el5.tar.bz2 -C /mnt/lustre/d0.upgrade-downgrade/quota_usr.new failed 
Lustre: DEBUG MARKER: upgrade-downgrade : @@@@@@ FAIL: runas -u quota_usr tar xjf /mnt/lustre/d0.upgrade-downgrade/quota_usr/linux-2.6.18-238.12.1.el5.tar.bz2 -C /mnt/lustre/d0.upgrade-downgrade/quota_usr.new failed
Dumping lctl log to /tmp/test_logs/1360393563/upgrade-downgrade..*.1360398341.log

OST console:

[root@fat-amd-3 ~]# Lustre: DEBUG MARKER: upgrade-downgrade : @@@@@@ FAIL: runas -u quota_usr tar xjf /mnt/lustre/d0.upgrade-downgrade/quota_usr/linux-2.6.18-238.12.1.el5.tar.bz2 -C /mnt/lustre/d0.upgrade-downgrade/quota_usr.new failed
LNet: Service thread pid 7926 was inactive for 40.00s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
Pid: 7926, comm: ll_ost01_001

Call Trace:
 [<ffffffff814ead12>] schedule_timeout+0x192/0x2e0
 [<ffffffff8107cb50>] ? process_timeout+0x0/0x10
 [<ffffffffa037c6d1>] cfs_waitq_timedwait+0x11/0x20 [libcfs]
 [<ffffffffa064924d>] ldlm_completion_ast+0x4ed/0x960 [ptlrpc]
 [<ffffffffa0644970>] ? ldlm_expired_completion_wait+0x0/0x390 [ptlrpc]
 [<ffffffff8105fa40>] ? default_wake_function+0x0/0x20
 [<ffffffffa0648988>] ldlm_cli_enqueue_local+0x1f8/0x5d0 [ptlrpc]
 [<ffffffffa0648d60>] ? ldlm_completion_ast+0x0/0x960 [ptlrpc]
 [<ffffffffa0647700>] ? ldlm_blocking_ast+0x0/0x180 [ptlrpc]
 [<ffffffffa0d38740>] ofd_destroy_by_fid+0x160/0x380 [ofd]
 [<ffffffffa0647700>] ? ldlm_blocking_ast+0x0/0x180 [ptlrpc]
 [<ffffffffa0648d60>] ? ldlm_completion_ast+0x0/0x960 [ptlrpc]
 [<ffffffffa0670c15>] ? lustre_msg_buf+0x55/0x60 [ptlrpc]
 [<ffffffffa0d39c67>] ofd_destroy+0x187/0x670 [ofd]
 [<ffffffffa038c2e1>] ? libcfs_debug_msg+0x41/0x50 [libcfs]
 [<ffffffffa0d11731>] ost_handle+0x38f1/0x46f0 [ost]
 [<ffffffffa0388154>] ? libcfs_id2str+0x74/0xb0 [libcfs]
 [<ffffffffa0681c7c>] ptlrpc_server_handle_request+0x41c/0xdf0 [ptlrpc]
 [<ffffffffa037c5de>] ? cfs_timer_arm+0xe/0x10 [libcfs]
 [<ffffffffa06793a9>] ? ptlrpc_wait_event+0xa9/0x290 [ptlrpc]
 [<ffffffffa038c2e1>] ? libcfs_debug_msg+0x41/0x50 [libcfs]
 [<ffffffff81052223>] ? __wake_up+0x53/0x70
 [<ffffffffa06831c6>] ptlrpc_main+0xb76/0x1870 [ptlrpc]
 [<ffffffffa0682650>] ? ptlrpc_main+0x0/0x1870 [ptlrpc]
 [<ffffffff8100c0ca>] child_rip+0xa/0x20
 [<ffffffffa0682650>] ? ptlrpc_main+0x0/0x1870 [ptlrpc]
 [<ffffffffa0682650>] ? ptlrpc_main+0x0/0x1870 [ptlrpc]
 [<ffffffff8100c0c0>] ? child_rip+0x0/0x20
Comment by Niu Yawei (Inactive) [ 15/Feb/13 ]

Andreas, yes, I think 5259 and 5292 should be landed regardless of interop testing; otherwise 5293 (the fix for LU-2688) will not pass. After that, the fix for LU-2688 can be landed.

Comment by Niu Yawei (Inactive) [ 15/Feb/13 ]

Sarah, this looks like another issue (not related to quota). Could you check whether your interop tests pass even without quota enforced? If not, I think we should open another ticket with a more detailed issue description.
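
A minimal sketch of one way to run with enforcement switched off, reusing the conf_param names from the description (run on the MGS):

lctl conf_param lustre.quota.mdt=none
lctl conf_param lustre.quota.ost=none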

Comment by Sarah Liu [ 27/Feb/13 ]

I think the error I hit in my comment of 09/Feb/13 may be LU-1488, which should be fixed in lustre-1.8.9. I will run the upgrade test again from 1.8.9 to master to see whether this still happens.

Comment by Peter Jones [ 27/Feb/13 ]

Landed for 2.4
