[LU-8502] replay-vbr: umount hangs waiting for mgs_ir_fini_fs Created: 16/Aug/16  Updated: 04/Jan/18  Resolved: 18/Dec/17

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.10.0, Lustre 2.11.0
Fix Version/s: Lustre 2.11.0

Type: Bug Priority: Major
Reporter: Maloo Assignee: Jinshan Xiong (Inactive)
Resolution: Cannot Reproduce Votes: 0
Labels: None

Attachments: Text File trevis-41vm7.log    
Issue Links:
Related
is related to LU-7372 replay-dual test_26: test failed to r... Resolved
is related to LU-9113 insanity test_0 umount fails for /mnt... Closed
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

This issue was created by maloo for Kit Westneat <kit.westneat@gmail.com>

The test timed out, and the console log contains many stack traces involving mgs_ir_fini_fs:

[17640.500244]  [<ffffffff8163ba29>] schedule+0x29/0x70
[17640.500244]  [<ffffffff81639719>] schedule_timeout+0x209/0x2d0
[17640.500244]  [<ffffffff811bfad9>] ? discard_slab+0x39/0x50
[17640.500244]  [<ffffffff81632d4d>] ? __slab_free+0x253/0x277
[17640.500244]  [<ffffffff8163bdf6>] wait_for_completion+0x116/0x170
[17640.500244]  [<ffffffff810b88c0>] ? wake_up_state+0x20/0x20
[17640.500244]  [<ffffffffa0cb3a0e>] mgs_ir_fini_fs+0x27e/0x2ec [mgs]
[17640.500244]  [<ffffffffa0ca0361>] mgs_free_fsdb+0x41/0x8e0 [mgs]
[17640.500244]  [<ffffffffa0ca97d2>] mgs_cleanup_fsdb_list+0x52/0x70 [mgs]
[17640.500244]  [<ffffffffa0c8fa87>] mgs_device_fini+0x97/0x5b0 [mgs]
[17640.500244]  [<ffffffffa07d088c>] class_cleanup+0x94c/0xd80 [obdclass]
[17640.500244]  [<ffffffffa07d3606>] class_process_config+0x2226/0x2f60 [obdclass]
[17640.500244]  [<ffffffff811c2483>] ? __kmalloc+0x1f3/0x230
[17640.500244]  [<ffffffffa07cd6cb>] ? lustre_cfg_new+0x8b/0x400 [obdclass]
[17640.500244]  [<ffffffffa07d442f>] class_manual_cleanup+0xef/0x810 [obdclass]
[17640.500244]  [<ffffffffa0802560>] server_put_super+0xb20/0xcd0 [obdclass]
[17640.500244]  [<ffffffff811e1096>] generic_shutdown_super+0x56/0xe0
[17640.500244]  [<ffffffff811e1472>] kill_anon_super+0x12/0x20
[17640.500244]  [<ffffffffa07d7c92>] lustre_kill_super+0x32/0x50 [obdclass]
[17640.500244]  [<ffffffff811e1829>] deactivate_locked_super+0x49/0x60
[17640.500244]  [<ffffffff811e1e26>] deactivate_super+0x46/0x60
[17640.500244]  [<ffffffff811fed95>] mntput_no_expire+0xc5/0x120
[17640.500244]  [<ffffffff811ffecf>] SyS_umount+0x9f/0x3c0
[17640.500244] mgs_lustre_noti S ffff88004c88dc00     0 21703      2 0x00000080
[17640.500244]  ffff88004c83fbb0 0000000000000046 ffff88004c88dc00 ffff88004c83ffd8
[17640.500244]  ffff88004c83ffd8 ffff88004c83ffd8 ffff88004c88dc00 ffff88004f27a800
[17640.500244]  ffff88004c88dc00 0000000000000000 ffffffffa09d4b90 ffff88004c88dc00
[17640.500244] Call Trace:
[17640.500244]  [<ffffffffa09d4b90>] ? ldlm_completion_ast_async+0x300/0x300 [ptlrpc]
[17640.500244]  [<ffffffff8163ba29>] schedule+0x29/0x70
[17640.500244]  [<ffffffffa09d540d>] ldlm_completion_ast+0x62d/0x910 [ptlrpc]
[17640.500244]  [<ffffffff810b88c0>] ? wake_up_state+0x20/0x20
[17640.500244]  [<ffffffffa0c8e8f1>] mgs_completion_ast_generic+0xb1/0x1d0 [mgs]
[17640.500244]  [<ffffffffa0c8ea23>] mgs_completion_ast_ir+0x13/0x20 [mgs]
[17640.500244]  [<ffffffffa09d7ab0>] ldlm_cli_enqueue_local+0x230/0x860 [ptlrpc]
[17640.500244]  [<ffffffffa0c8ea10>] ? mgs_completion_ast_generic+0x1d0/0x1d0 [mgs]
[17640.500244]  [<ffffffffa09da820>] ? ldlm_blocking_ast_nocheck+0x310/0x310 [ptlrpc]
[17640.500244]  [<ffffffffa0c93ddc>] mgs_revoke_lock+0x1ec/0x370 [mgs]
[17640.500244]  [<ffffffffa09da820>] ? ldlm_blocking_ast_nocheck+0x310/0x310 [ptlrpc]
[17640.500244]  [<ffffffffa0c8ea10>] ? mgs_completion_ast_generic+0x1d0/0x1d0 [mgs]
[17640.500244]  [<ffffffffa0cb0462>] mgs_ir_notify+0x142/0x2a0 [mgs]
[17640.500244]  [<ffffffff810b88c0>] ? wake_up_state+0x20/0x20
[17640.500244]  [<ffffffffa0cb0320>] ? lprocfs_ir_set_state+0x170/0x170 [mgs]
[17640.500244]  [<ffffffff810a5aef>] kthread+0xcf/0xe0
[17640.500244]  [<ffffffff810a5a20>] ? kthread_create_on_node+0x140/0x140
[17640.500244]  [<ffffffff816469d8>] ret_from_fork+0x58/0x90
[17640.500244]  [<ffffffff810a5a20>] ? kthread_create_on_node+0x140/0x140

This issue relates to the following test suite run: https://testing.hpdd.intel.com/test_sets/89fac3d8-634d-11e6-906c-5254006e85c2.



 Comments   
Comment by James Casper [ 07/Apr/17 ]

Seeing a lot of runs where one test set PASSes and the following test set then hits a test_0a TIMEOUT:

replay-dual PASS, replay-dual TIMEOUT
https://testing.hpdd.intel.com/test_sessions/279f69ac-eda3-4fd2-a1e9-f9135f7c0d66

replay-ost-single PASS, replay-dual TIMEOUT
https://testing.hpdd.intel.com/test_sessions/4a7fd4bc-e055-44c6-afac-97569c944b02

replay-dual PASS, replay-single TIMEOUT
https://testing.hpdd.intel.com/test_sessions/bce01895-216d-4460-b513-24c7b02ef25e

Comment by James Casper [ 07/Apr/17 ]

In this case, replay-dual was the last test set run:

https://testing.hpdd.intel.com/test_sessions/2862ae67-6628-41c8-9561-e586502f1a13

It passed all subtests and then hung on the umount, so the test set was marked TIMEOUT. See the attached trevis-41vm7.log.

Comment by Peter Jones [ 20/Apr/17 ]

Jinshan

Could you please advise on this one?

Thanks

Peter

Comment by Jinshan Xiong (Inactive) [ 10/May/17 ]

I suspect this is the same issue as LU-7372; the patch for that is at https://review.whamcloud.com/17853

Comment by Gerrit Updater [ 23/May/17 ]

James Nunez (james.a.nunez@intel.com) uploaded a new patch: https://review.whamcloud.com/27255
Subject: LU-8502 test: Baseline test failure rates
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: bc450568faa8fb98d9650d9b834a83f8f5e2efb8

Comment by Gerrit Updater [ 23/May/17 ]

James Nunez (james.a.nunez@intel.com) uploaded a new patch: https://review.whamcloud.com/27256
Subject: LU-8502 test: Run LU-7372 patch against failing tests
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 0f498ea376a184aadfb502fde9fee0b79319f6ea

Comment by James Nunez (Inactive) [ 18/Dec/17 ]

I've reviewed the last month of replay-vbr test 1b hangs, and they all occurred during interop testing with master (future 2.11) clients against older Lustre servers, including the b2_8 and b2_9 branches. Thus, this issue looks like it has been fixed on master.

Generated at Sat Feb 10 02:18:07 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.