[LU-2548] After upgrade from 1.8.8 to 2.4 hit qmt_entry.c:281:qmt_glb_write()) $$$ failed to update global index, rc:-5 Created: 28/Dec/12 Updated: 27/Feb/13 Resolved: 27/Feb/13 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.4.0 |
| Fix Version/s: | Lustre 2.4.0 |
| Type: | Bug | Priority: | Blocker |
| Reporter: | Sarah Liu | Assignee: | Niu Yawei (Inactive) |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | HB |
| Environment: |
before upgrade: client and server are running 1.8.8 |
||
| Attachments: |
|
| Issue Links: |
|
| Severity: | 3 |
| Rank (Obsolete): | 5972 |
| Description |
|
After a clean upgrade of server and client from 1.8.8 to 2.4, I enabled quota, then ran iozone and got this error:
{noformat}
upgrade-downgrade : @@@@@@ FAIL: iozone did not fail with EDQUOT
{noformat}
Found errors in the MDS dmesg:
{noformat}
Lustre: DEBUG MARKER: ===== Pass ==================================================================
{noformat} |
| Comments |
| Comment by Niu Yawei (Inactive) [ 31/Dec/12 ] |
|
Hi, Sarah. Could you enable TRACE to collect a debug log for this failure? Thanks. |
| Comment by Sarah Liu [ 02/Jan/13 ] |
|
will keep you updated. |
| Comment by Sarah Liu [ 03/Jan/13 ] |
|
debug and dmesg logs from MDS |
| Comment by Niu Yawei (Inactive) [ 04/Jan/13 ] |
|
This time the error messages are different from the first time. Sarah, is it reproducible when upgrading from 2.1 to 2.4? How often does it happen when upgrading from 1.8 to 2.4? Thanks. |
| Comment by Sarah Liu [ 04/Jan/13 ] |
|
Niu, this time I upgraded to the latest tag-2.3.58, which is a different build from the first time. I will keep you updated when I finish upgrading from 2.1 to 2.4, and I will try 1.8 to 2.4 again to see whether it happens every time. |
| Comment by Sarah Liu [ 07/Jan/13 ] |
|
Niu, I tried upgrading 1.8->2.4 again and the problem is reproducible. |
| Comment by Sarah Liu [ 07/Jan/13 ] |
|
MDS dmesg and debug logs of 1.8->2.4 |
| Comment by Sarah Liu [ 08/Jan/13 ] |
|
upgrade from 2.1.4 to 2.4 hit |
| Comment by Niu Yawei (Inactive) [ 01/Feb/13 ] |
|
I found something really weird in the dmesg (1.8 upgrade to 2.4):
{noformat}
Lustre: lustre-MDT0000: Migrate inode quota from old admin quota file(admin_quotafile_v2.usr) to new IAM quota index([0x200000006:0x10000:0x0]).
Lustre: lustre-MDT0000: Migrate inode quota from old admin quota file(admin_quotafile_v2.grp) to new IAM quota index([0x200000006:0x1010000:0x0]).
Lustre: 31664:0:(mdt_handler.c:5261:mdt_process_config()) For interoperability, skip this mdt.group_upcall. It is obsolete.
Lustre: 31664:0:(mdt_handler.c:5261:mdt_process_config()) For interoperability, skip this mdt.quota_type. It is obsolete.
Lustre: lustre-MDT0000: Temporarily refusing client connection from 0@lo
LustreError: 11-0: an error occurred while communicating with 0@lo. The mds_connect operation failed with -11
Lustre: lustre-MDT0000: Migrate inode quota from old admin quota file(admin_quotafile_v2.usr) to new IAM quota index([0x200000003:0x8:0x0]).
Lustre: Skipped 2 previous similar messages
{noformat}
It says the MDT is trying to migrate inode user quota into fid [0x200000003:0x8:0x0], which isn't a quota global index fid. I can't see from the code how this could happen, and I can't reproduce it locally either. Sarah, could you show me how you reproduced it? If it's reproducible, could you capture a log with D_QUOTA and D_TRACE enabled, for the MDT startup procedure only (start the MDT on the old 1.8 device)? The startup log was truncated in your attached logs. Thanks in advance. |
| Comment by Niu Yawei (Inactive) [ 04/Feb/13 ] |
|
I see — those messages must come from the quota slave's copies of the global index on the MDT; migration should not be applied to those global index copies. The failure "qmt_glb_write()) $$$ failed to update global index, rc:-5" was probably caused by the migration racing with a normal global index copy update. I'll post a patch to fix this. |
| Comment by Niu Yawei (Inactive) [ 04/Feb/13 ] |
|
Don't apply migration on a global index copy: http://review.whamcloud.com/5259 Actually, I'm still not quite sure why qmt_glb_write() failed, but at the very least we shouldn't apply migration to the global index copy. |
| Comment by Niu Yawei (Inactive) [ 06/Feb/13 ] |
|
I can reproduce the original problem in my local environment now. It looks like something goes wrong in IAM when upgrading from 1.8 to 2.4 (2.1 -> 2.4 is fine); I will look into it more closely. |
| Comment by Niu Yawei (Inactive) [ 07/Feb/13 ] |
|
My testing shows that truncating the global index before the migration leads to the IAM error. To avoid blocking other 1.8 upgrade tests, I've posted a temporary fix (skip the index truncation during migration): http://review.whamcloud.com/5292 Sarah, could you check whether the above patch works for you too? Thanks. |
| Comment by Sarah Liu [ 07/Feb/13 ] |
|
Sure, will get back to you when I have the result |
| Comment by Niu Yawei (Inactive) [ 08/Feb/13 ] |
|
Well, I realized that the original IAM index truncation is not quite right: the IAM container wasn't reinitialized after truncation. I've updated patch 5292; the new version works for me. Sarah, could you verify whether it fixes your problem? Thanks. |
| Comment by Andreas Dilger [ 08/Feb/13 ] |
|
Niu, can the http://review.whamcloud.com/5259 and http://review.whamcloud.com/5292 patches be landed regardless of interop testing, or should they wait for Sarah to do manual testing? Getting http://bugs.whamcloud.com/browse/LU-2688 (quota in conf-sanity.sh test_32) landed would help avoid this problem in the future. |
| Comment by Sarah Liu [ 09/Feb/13 ] |
|
Niu, when I tried to upgrade from 1.8.8 to lustre-reviews build #13046, which contains your fix, it hit the following errors while extracting a kernel tarball. That means the test failed even before it ran iozone, which caused the original failure.

client console:
{noformat}
===== Verify the data =======================================================
Lustre: DEBUG MARKER: ===== Verify the data =======================================================
Verifying the extracted kernel tarball...
LustreError: 20678:0:(namei.c:256:ll_mdc_blocking_ast()) ### data mismatch with ino 144115205255725057/0 (ffff810312a3dce0) ns: lustre-MDT0000-mdc-ffff810324d9ac00 lock: ffff810325f31a00/0x77e8acf10368301e lrc: 2/0,0 mode: PR/PR res: 8589935616/1 bits 0x13 rrc: 2 type: IBT flags: 0x2090 remote: 0x1a32fabd9561d955 expref: -99 pid: 20678 timeout: 0
+ runas -u quota_usr tar xjf /mnt/lustre/d0.upgrade-downgrade/quota_usr/linux-2.6.18-238.12.1.el5.tar.bz2 -C /mnt/lustre/d0.upgrade-downgrade/quota_usr.new
LustreError: 20682:0:(namei.c:256:ll_mdc_blocking_ast()) ### data mismatch with ino 144115205255725057/0 (ffff81030c6b05a0) ns: lustre-MDT0000-mdc-ffff810324d9ac00 lock: ffff810306ac8600/0x77e8acf10368305d lrc: 2/0,0 mode: PR/PR res: 8589935616/1 bits 0x13 rrc: 2 type: IBT flags: 0x2090 remote: 0x1a32fabd9561d9f6 expref: -99 pid: 20682 timeout: 0
LustreError: 20682:0:(namei.c:256:ll_mdc_blocking_ast()) ### data mismatch with ino 144115205255731243/0 (ffff8103034ffd60) ns: lustre-MDT0000-mdc-ffff810324d9ac00 lock: ffff81030e8ee800/0x77e8acf10368e2eb lrc: 2/0,0 mode: PR/PR res: 8589935616/6187 bits 0x13 rrc: 2 type: IBT flags: 0x2090 remote: 0x1a32fabd9564217a expref: -99 pid: 20682 timeout: 0
upgrade-downgrade : @@@@@@ FAIL: runas -u quota_usr tar xjf /mnt/lustre/d0.upgrade-downgrade/quota_usr/linux-2.6.18-238.12.1.el5.tar.bz2 -C /mnt/lustre/d0.upgrade-downgrade/quota_usr.new failed
Lustre: DEBUG MARKER: upgrade-downgrade : @@@@@@ FAIL: runas -u quota_usr tar xjf /mnt/lustre/d0.upgrade-downgrade/quota_usr/linux-2.6.18-238.12.1.el5.tar.bz2 -C /mnt/lustre/d0.upgrade-downgrade/quota_usr.new failed
Dumping lctl log to /tmp/test_logs/1360393563/upgrade-downgrade..*.1360398341.log
{noformat}

OST console:
{noformat}
[root@fat-amd-3 ~]# Lustre: DEBUG MARKER: upgrade-downgrade : @@@@@@ FAIL: runas -u quota_usr tar xjf /mnt/lustre/d0.upgrade-downgrade/quota_usr/linux-2.6.18-238.12.1.el5.tar.bz2 -C /mnt/lustre/d0.upgrade-downgrade/quota_usr.new failed
LNet: Service thread pid 7926 was inactive for 40.00s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
Pid: 7926, comm: ll_ost01_001
Call Trace:
 [<ffffffff814ead12>] schedule_timeout+0x192/0x2e0
 [<ffffffff8107cb50>] ? process_timeout+0x0/0x10
 [<ffffffffa037c6d1>] cfs_waitq_timedwait+0x11/0x20 [libcfs]
 [<ffffffffa064924d>] ldlm_completion_ast+0x4ed/0x960 [ptlrpc]
 [<ffffffffa0644970>] ? ldlm_expired_completion_wait+0x0/0x390 [ptlrpc]
 [<ffffffff8105fa40>] ? default_wake_function+0x0/0x20
 [<ffffffffa0648988>] ldlm_cli_enqueue_local+0x1f8/0x5d0 [ptlrpc]
 [<ffffffffa0648d60>] ? ldlm_completion_ast+0x0/0x960 [ptlrpc]
 [<ffffffffa0647700>] ? ldlm_blocking_ast+0x0/0x180 [ptlrpc]
 [<ffffffffa0d38740>] ofd_destroy_by_fid+0x160/0x380 [ofd]
 [<ffffffffa0647700>] ? ldlm_blocking_ast+0x0/0x180 [ptlrpc]
 [<ffffffffa0648d60>] ? ldlm_completion_ast+0x0/0x960 [ptlrpc]
 [<ffffffffa0670c15>] ? lustre_msg_buf+0x55/0x60 [ptlrpc]
 [<ffffffffa0d39c67>] ofd_destroy+0x187/0x670 [ofd]
 [<ffffffffa038c2e1>] ? libcfs_debug_msg+0x41/0x50 [libcfs]
 [<ffffffffa0d11731>] ost_handle+0x38f1/0x46f0 [ost]
 [<ffffffffa0388154>] ? libcfs_id2str+0x74/0xb0 [libcfs]
 [<ffffffffa0681c7c>] ptlrpc_server_handle_request+0x41c/0xdf0 [ptlrpc]
 [<ffffffffa037c5de>] ? cfs_timer_arm+0xe/0x10 [libcfs]
 [<ffffffffa06793a9>] ? ptlrpc_wait_event+0xa9/0x290 [ptlrpc]
 [<ffffffffa038c2e1>] ? libcfs_debug_msg+0x41/0x50 [libcfs]
 [<ffffffff81052223>] ? __wake_up+0x53/0x70
 [<ffffffffa06831c6>] ptlrpc_main+0xb76/0x1870 [ptlrpc]
 [<ffffffffa0682650>] ? ptlrpc_main+0x0/0x1870 [ptlrpc]
 [<ffffffff8100c0ca>] child_rip+0xa/0x20
 [<ffffffffa0682650>] ? ptlrpc_main+0x0/0x1870 [ptlrpc]
 [<ffffffffa0682650>] ? ptlrpc_main+0x0/0x1870 [ptlrpc]
 [<ffffffff8100c0c0>] ? child_rip+0x0/0x20
{noformat} |
| Comment by Niu Yawei (Inactive) [ 15/Feb/13 ] |
|
Andreas, yes, I think 5259 and 5292 should be landed regardless of interop testing; otherwise, 5293 (fix of |
| Comment by Niu Yawei (Inactive) [ 15/Feb/13 ] |
|
Sarah, this looks like another issue (not related to quota). Could you check whether your interop tests pass even without quota enforcement? If not, I think we should open another ticket with a more detailed issue description. |
| Comment by Sarah Liu [ 27/Feb/13 ] |
|
I think the error I hit in my comment of 09/Feb/13 may be |
| Comment by Peter Jones [ 27/Feb/13 ] |
|
Landed for 2.4 |