[LU-2701] recovery-small test 27 umount hang Created: 29/Jan/13 Updated: 15/Mar/13 Resolved: 15/Mar/13 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.4.0 |
| Fix Version/s: | Lustre 2.4.0 |
| Type: | Bug | Priority: | Blocker |
| Reporter: | Oleg Drokin | Assignee: | Alex Zhuravlev |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | MB | ||
| Severity: | 3 |
| Rank (Obsolete): | 6294 |
| Description |
|
Some recent landing introduced a problem in osp cleanup.

PID: 28642  TASK: ffff880097b0a140  CPU: 6  COMMAND: "umount"
 #0 [ffff88007b6c1898] schedule at ffffffff814f7c98
 #1 [ffff88007b6c1960] osp_sync_fini at ffffffffa069d09d [osp]
 #2 [ffff88007b6c19c0] osp_process_config at ffffffffa06972c0 [osp]
 #3 [ffff88007b6c1a20] lod_cleanup_desc_tgts at ffffffffa05ed564 [lod]
 #4 [ffff88007b6c1a70] lod_process_config at ffffffffa05f0266 [lod]
 #5 [ffff88007b6c1af0] mdd_process_config at ffffffffa0427c4b [mdd]
 #6 [ffff88007b6c1b50] mdt_stack_fini at ffffffffa0726b21 [mdt]
 #7 [ffff88007b6c1bb0] mdt_device_fini at ffffffffa072799a [mdt]
 #8 [ffff88007b6c1bf0] class_cleanup at ffffffffa0fb5247 [obdclass]
 #9 [ffff88007b6c1c70] class_process_config at ffffffffa0fb6b2c [obdclass]
#10 [ffff88007b6c1d00] class_manual_cleanup at ffffffffa0fb7869 [obdclass]
#11 [ffff88007b6c1dc0] server_put_super at ffffffffa0fc83bc [obdclass]
#12 [ffff88007b6c1e30] generic_shutdown_super at ffffffff8117d6ab
#13 [ffff88007b6c1e50] kill_anon_super at ffffffff8117d796
#14 [ffff88007b6c1e70] lustre_kill_super at ffffffffa0fb9666 [obdclass]
#15 [ffff88007b6c1e90] deactivate_super at ffffffff8117e825
#16 [ffff88007b6c1eb0] mntput_no_expire at ffffffff8119a89f
#17 [ffff88007b6c1ee0] sys_umount at ffffffff8119b34b

After this nothing could progress:

[145371.090429] LustreError: 22398:0:(fail.c:133:__cfs_fail_timeout_set()) cfs_fail_timeout id 407 sleeping for 10000ms
[145380.552626] Lustre: lustre-MDT0000: Will be in recovery for at least 1:00, or until 1 client reconnects
[145380.573743] Lustre: lustre-MDT0000: Recovery over after 0:01, of 1 clients 1 recovered and 0 were evicted.
[145380.588321] Lustre: lustre-OST0001: deleting orphan objects from 0x0:176 to 192
[145380.588747] Lustre: Skipped 1 previous similar message
[145381.093065] LustreError: 22398:0:(fail.c:137:__cfs_fail_timeout_set()) cfs_fail_timeout id 407 awake
[145459.804339] Lustre: Failing over lustre-MDT0000
[145459.809747] LustreError: 11-0: lustre-MDT0000-mdc-ffff88008c61dbf0: Communicating with 0@lo, operation mds_reint failed with -19.
[145459.810324] LustreError: Skipped 5 previous similar messages
[145460.115626] LustreError: 20940:0:(client.c:1039:ptlrpc_import_delay_req()) @@@ IMP_CLOSED req@ffff88008fd2ebf0 x1425344870597376/t0(0) o6->lustre-OST0000-osc-MDT0000@0@lo:28/4 lens 664/432 e 0 to 0 dl 0 ref 1 fl Rpc:/0/ffffffff rc 0/-1
[145460.157395] LustreError: 20938:0:(client.c:1039:ptlrpc_import_delay_req()) @@@ IMP_CLOSED req@ffff880073692bf0 x1425344870597409/t0(0) o6->lustre-OST0001-osc-MDT0000@0@lo:28/4 lens 664/432 e 0 to 0 dl 0 ref 1 fl Rpc:/0/ffffffff rc 0/-1
[145460.158377] LustreError: 20938:0:(client.c:1039:ptlrpc_import_delay_req()) Skipped 7 previous similar messages
[145460.608530] LustreError: 137-5: lustre-MDT0000: Not available for connect from 0@lo (stopping)
[145465.605136] LustreError: 137-5: lustre-MDT0000: Not available for connect from 0@lo (stopping)
[145465.606017] LustreError: Skipped 3 previous similar messages
...

I have a crashdump. |
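For context, the hang pattern suggested by the stack is a fini path that blocks until its sync thread confirms it has exited; if that thread is itself stuck (for example, waiting on RPCs to an import that is already closed), the wait never completes and umount never returns. Below is a minimal user-space model of that handshake; the names (`sync_thread_main`, `sync_fini`, and so on) are invented for the sketch and are not the actual osp code.

```c
/*
 * Minimal pthread model (hypothetical names, not the osp code) of a
 * shutdown handshake that can hang: fini() waits for the sync thread
 * to report that it has exited, but the thread never gets there.
 */
#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>
#include <unistd.h>

struct sync_thread {
    pthread_mutex_t lock;
    pthread_cond_t  cond;
    bool            stop_requested;
    bool            exited;
};

static void *sync_thread_main(void *arg)
{
    /*
     * Bug model: the thread blocks on external work (an endless sleep
     * standing in for RPCs to a closed import) and never checks
     * stop_requested, so it never signals "exited".
     */
    (void)arg;
    for (;;)
        sleep(1);
}

static void sync_fini(struct sync_thread *st)
{
    pthread_mutex_lock(&st->lock);
    st->stop_requested = true;
    /* umount's stack parks here, like the osp_sync_fini frame above */
    while (!st->exited)
        pthread_cond_wait(&st->cond, &st->lock);
    pthread_mutex_unlock(&st->lock);
}

int main(void)
{
    static struct sync_thread st = {
        .lock = PTHREAD_MUTEX_INITIALIZER,
        .cond = PTHREAD_COND_INITIALIZER,
    };
    pthread_t tid;

    pthread_create(&tid, NULL, sync_thread_main, &st);
    printf("calling sync_fini(); this hangs, mirroring the umount above\n");
    sync_fini(&st);    /* never returns in this buggy model */
    return 0;
}
```

Built with `cc -pthread`, running this parks the caller in `sync_fini()` indefinitely, which is the same shape as the umount stack above.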
| Comments |
| Comment by Oleg Drokin [ 29/Jan/13 ] |
|
This problem seems to have been introduced by commit 74ec68346e14851ad8a1912185e1dccd3e6d12cd, also from Wangdi: |
| Comment by Jodi Levi (Inactive) [ 29/Jan/13 ] |
|
Di is looking into this one as well. |
| Comment by Oleg Drokin [ 29/Jan/13 ] |
|
Hm, actually, looking through earlier logs, I have seen a very similar stack before for umount, but not triggering in recovery-small test 27 so reliably. |
| Comment by Oleg Drokin [ 29/Jan/13 ] |
|
The first appearance I see was on Jan 19, so it is still a recent thing. |
| Comment by Di Wang [ 29/Jan/13 ] |
|
Hmm, if the first appearance was on Jan 19th, most changes in this area had not been landed yet; I think most of the patches here landed on Jan 22nd. It also seems the cleanup process is waiting for the unlink log threads to stop, so it is unlikely to be a FID-on-OST problem, IMHO. I will look into it more deeply; sigh, I cannot reproduce it locally. |
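As a rough illustration of the handshake described here (cleanup waiting for a log-processing thread to stop), the sketch below shows a stop request that cannot be missed: the worker's wait condition includes the stop flag, and the fini path wakes the worker before blocking on its exit. All names (`worker_main`, `worker_fini`, etc.) are invented for the sketch and do not correspond to the actual Lustre llog/osp code.

```c
/*
 * Sketch of a stop handshake that cannot be missed (hypothetical
 * names; not the actual Lustre llog/osp code): the worker re-checks
 * the stop flag inside its wait loop, and fini wakes it explicitly.
 */
#include <pthread.h>
#include <stdbool.h>

struct worker {
    pthread_mutex_t lock;
    pthread_cond_t  wakeup;    /* worker waits here for work or stop */
    pthread_cond_t  done;      /* fini waits here for the worker to exit */
    bool            stop;
    bool            exited;
    int             pending;   /* outstanding log records to process */
};

static void *worker_main(void *arg)
{
    struct worker *w = arg;

    pthread_mutex_lock(&w->lock);
    /* the wait condition includes w->stop, so a stop request is seen
     * even when there is no work left to do */
    while (!w->stop) {
        if (w->pending > 0) {
            w->pending--;      /* "process" one record */
            continue;
        }
        pthread_cond_wait(&w->wakeup, &w->lock);
    }
    w->exited = true;
    pthread_cond_signal(&w->done);
    pthread_mutex_unlock(&w->lock);
    return NULL;
}

static void worker_fini(struct worker *w)
{
    pthread_mutex_lock(&w->lock);
    w->stop = true;
    pthread_cond_signal(&w->wakeup);    /* wake the thread first... */
    while (!w->exited)                  /* ...then wait for it to exit */
        pthread_cond_wait(&w->done, &w->lock);
    pthread_mutex_unlock(&w->lock);
}

int main(void)
{
    static struct worker w = {
        .lock    = PTHREAD_MUTEX_INITIALIZER,
        .wakeup  = PTHREAD_COND_INITIALIZER,
        .done    = PTHREAD_COND_INITIALIZER,
        .pending = 3,
    };
    pthread_t tid;

    pthread_create(&tid, NULL, worker_main, &w);
    worker_fini(&w);    /* returns promptly: the stop cannot be missed */
    pthread_join(tid, NULL);
    return 0;
}
```

The key design point in the sketch is that the stop flag is set and checked under the same lock the worker sleeps on, so the fini-side wakeup cannot race with the worker going back to sleep.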
| Comment by Andreas Dilger [ 07/Feb/13 ] |
|
Problem appears to be in LOD/OSP code, not DNE. |
| Comment by Alex Zhuravlev [ 19/Feb/13 ] |
| Comment by Alex Zhuravlev [ 15/Mar/13 ] |
|
The patch has landed; the issue is hopefully resolved. |