[LU-8502] replay-vbr: umount hangs waiting for mgs_ir_fini_fs
Created: 16/Aug/16 | Updated: 04/Jan/18 | Resolved: 18/Dec/17
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.10.0, Lustre 2.11.0 |
| Fix Version/s: | Lustre 2.11.0 |
| Type: | Bug | Priority: | Major |
| Reporter: | Maloo | Assignee: | Jinshan Xiong (Inactive) |
| Resolution: | Cannot Reproduce | Votes: | 0 |
| Labels: | None |
| Attachments: | |
| Issue Links: | |
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
This issue was created by maloo for Kit Westneat <kit.westneat@gmail.com>

Test timed out with a lot of stack traces related to mgs_ir_fini_fs:

[17640.500244] [<ffffffff8163ba29>] schedule+0x29/0x70
[17640.500244] [<ffffffff81639719>] schedule_timeout+0x209/0x2d0
[17640.500244] [<ffffffff811bfad9>] ? discard_slab+0x39/0x50
[17640.500244] [<ffffffff81632d4d>] ? __slab_free+0x253/0x277
[17640.500244] [<ffffffff8163bdf6>] wait_for_completion+0x116/0x170
[17640.500244] [<ffffffff810b88c0>] ? wake_up_state+0x20/0x20
[17640.500244] [<ffffffffa0cb3a0e>] mgs_ir_fini_fs+0x27e/0x2ec [mgs]
[17640.500244] [<ffffffffa0ca0361>] mgs_free_fsdb+0x41/0x8e0 [mgs]
[17640.500244] [<ffffffffa0ca97d2>] mgs_cleanup_fsdb_list+0x52/0x70 [mgs]
[17640.500244] [<ffffffffa0c8fa87>] mgs_device_fini+0x97/0x5b0 [mgs]
[17640.500244] [<ffffffffa07d088c>] class_cleanup+0x94c/0xd80 [obdclass]
[17640.500244] [<ffffffffa07d3606>] class_process_config+0x2226/0x2f60 [obdclass]
[17640.500244] [<ffffffff811c2483>] ? __kmalloc+0x1f3/0x230
[17640.500244] [<ffffffffa07cd6cb>] ? lustre_cfg_new+0x8b/0x400 [obdclass]
[17640.500244] [<ffffffffa07d442f>] class_manual_cleanup+0xef/0x810 [obdclass]
[17640.500244] [<ffffffffa0802560>] server_put_super+0xb20/0xcd0 [obdclass]
[17640.500244] [<ffffffff811e1096>] generic_shutdown_super+0x56/0xe0
[17640.500244] [<ffffffff811e1472>] kill_anon_super+0x12/0x20
[17640.500244] [<ffffffffa07d7c92>] lustre_kill_super+0x32/0x50 [obdclass]
[17640.500244] [<ffffffff811e1829>] deactivate_locked_super+0x49/0x60
[17640.500244] [<ffffffff811e1e26>] deactivate_super+0x46/0x60
[17640.500244] [<ffffffff811fed95>] mntput_no_expire+0xc5/0x120
[17640.500244] [<ffffffff811ffecf>] SyS_umount+0x9f/0x3c0
[17640.500244] mgs_lustre_noti S ffff88004c88dc00 0 21703 2 0x00000080
[17640.500244] ffff88004c83fbb0 0000000000000046 ffff88004c88dc00 ffff88004c83ffd8
[17640.500244] ffff88004c83ffd8 ffff88004c83ffd8 ffff88004c88dc00 ffff88004f27a800
[17640.500244] ffff88004c88dc00 0000000000000000 ffffffffa09d4b90 ffff88004c88dc00
[17640.500244] Call Trace:
[17640.500244] [<ffffffffa09d4b90>] ? ldlm_completion_ast_async+0x300/0x300 [ptlrpc]
[17640.500244] [<ffffffff8163ba29>] schedule+0x29/0x70
[17640.500244] [<ffffffffa09d540d>] ldlm_completion_ast+0x62d/0x910 [ptlrpc]
[17640.500244] [<ffffffff810b88c0>] ? wake_up_state+0x20/0x20
[17640.500244] [<ffffffffa0c8e8f1>] mgs_completion_ast_generic+0xb1/0x1d0 [mgs]
[17640.500244] [<ffffffffa0c8ea23>] mgs_completion_ast_ir+0x13/0x20 [mgs]
[17640.500244] [<ffffffffa09d7ab0>] ldlm_cli_enqueue_local+0x230/0x860 [ptlrpc]
[17640.500244] [<ffffffffa0c8ea10>] ? mgs_completion_ast_generic+0x1d0/0x1d0 [mgs]
[17640.500244] [<ffffffffa09da820>] ? ldlm_blocking_ast_nocheck+0x310/0x310 [ptlrpc]
[17640.500244] [<ffffffffa0c93ddc>] mgs_revoke_lock+0x1ec/0x370 [mgs]
[17640.500244] [<ffffffffa09da820>] ? ldlm_blocking_ast_nocheck+0x310/0x310 [ptlrpc]
[17640.500244] [<ffffffffa0c8ea10>] ? mgs_completion_ast_generic+0x1d0/0x1d0 [mgs]
[17640.500244] [<ffffffffa0cb0462>] mgs_ir_notify+0x142/0x2a0 [mgs]
[17640.500244] [<ffffffff810b88c0>] ? wake_up_state+0x20/0x20
[17640.500244] [<ffffffffa0cb0320>] ? lprocfs_ir_set_state+0x170/0x170 [mgs]
[17640.500244] [<ffffffff810a5aef>] kthread+0xcf/0xe0
[17640.500244] [<ffffffff810a5a20>] ? kthread_create_on_node+0x140/0x140
[17640.500244] [<ffffffff816469d8>] ret_from_fork+0x58/0x90
[17640.500244] [<ffffffff810a5a20>] ? kthread_create_on_node+0x140/0x140

This issue relates to the following test suite run: https://testing.hpdd.intel.com/test_sets/89fac3d8-634d-11e6-906c-5254006e85c2. |
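Reading the two traces together suggests a completion-wait deadlock: the umount path (mgs_ir_fini_fs) blocks in wait_for_completion() waiting for the imperative-recovery notify thread to exit, while the mgs_lustre_noti kthread is itself stuck in ldlm_completion_ast() under mgs_revoke_lock(), so it can never signal that completion. Below is a minimal sketch of the pattern, under simplified assumptions; wait_for_config_lock, notify_thread_fn, and fini_fs are hypothetical stand-ins, not the actual Lustre functions.

#include <linux/completion.h>
#include <linux/delay.h>
#include <linux/kthread.h>
#include <linux/module.h>

static DECLARE_COMPLETION(notify_done);

/* Stand-in for the blocked ldlm_completion_ast() wait in the second
 * trace: the lock is never granted, so this never returns. */
static void wait_for_config_lock(void)
{
	for (;;)
		msleep(1000);
}

/* Stand-in for the mgs_lustre_noti kthread
 * (mgs_ir_notify -> mgs_revoke_lock -> ldlm_completion_ast). */
static int notify_thread_fn(void *arg)
{
	wait_for_config_lock();
	complete(&notify_done);	/* never reached, so never signalled */
	return 0;
}

/* Stand-in for the umount path (mgs_ir_fini_fs in the first trace):
 * waits on a completion only the blocked notify thread can signal. */
static void fini_fs(void)
{
	wait_for_completion(&notify_done);	/* hangs forever */
}

static int __init deadlock_demo_init(void)
{
	kthread_run(notify_thread_fn, NULL, "mgs_lustre_noti");
	fini_fs();
	return 0;
}
module_init(deadlock_demo_init);
MODULE_LICENSE("GPL");

Breaking this kind of deadlock generally means either bounding the notify thread's lock wait or having the shutdown path unblock that thread before waiting on its completion.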
| Comments |
| Comment by James Casper [ 07/Apr/17 ] |
|
Seeing a lot of test set PASSes followed by test set test_0a TIMEOUTs:

replay-dual PASS, replay-dual TIMEOUT
replay-ost-single PASS, replay-dual TIMEOUT
replay-dual PASS, replay-single TIMEOUT |
| Comment by James Casper [ 07/Apr/17 ] |
|
In this case, replay-dual was the last test set run: https://testing.hpdd.intel.com/test_sessions/2862ae67-6628-41c8-9561-e586502f1a13
It passed all subtests and then hung on the umount; the test set was then marked TIMEOUT. Log: trevis-41vm7.log |
| Comment by Peter Jones [ 20/Apr/17 ] |
|
Jinshan, could you please advise on this one? Thanks. Peter |
| Comment by Jinshan Xiong (Inactive) [ 10/May/17 ] |
|
I suspect this is the same issue as |
| Comment by Gerrit Updater [ 23/May/17 ] |
|
James Nunez (james.a.nunez@intel.com) uploaded a new patch: https://review.whamcloud.com/27255 |
| Comment by Gerrit Updater [ 23/May/17 ] |
|
James Nunez (james.a.nunez@intel.com) uploaded a new patch: https://review.whamcloud.com/27256 |
| Comment by James Nunez (Inactive) [ 18/Dec/17 ] |
|
I've reviewed the last month of replay-vbr test 1b hangs and they all occurred during interop testing with master (future 2.11) clients and servers running earlier Lustre versions, including the b2_8 and b2_9 branches. Thus, this issue looks like it has been fixed. |