[LU-4772] MGS is waiting for obd_unlinked_exports Created: 14/Mar/14  Updated: 19/Mar/19  Resolved: 13/Jul/15

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.5.1, Lustre 2.7.0, Lustre 2.8.0
Fix Version/s: Lustre 2.8.0

Type: Bug Priority: Blocker
Reporter: James Nunez (Inactive) Assignee: Mikhail Pershin
Resolution: Fixed Votes: 0
Labels: None
Environment:

OpenSFS cluster with one MGS/MDS (the MDT is a partition of /dev/sda), one OSS with two OSTs, a node running robinhood, one combined client/agent node, and one client


Issue Links:
Duplicate
duplicates LU-5539 MGS is waiting for obd_unlinked_expor... Resolved
duplicates LU-4062 sanity test_132: MGS is waiting for o... Closed
is duplicated by LU-5161 replay-single test_42: OBD refcount is 5 Resolved
is duplicated by LU-6103 replay-single test_53f: test failed t... Resolved
Related
is related to LU-6616 recovery-small test_27: (mgs_handler.... Closed
Severity: 3
Rank (Obsolete): 13125

 Description   

Running acceptance-small for HSM testing on 2.5.1-RC4, replay-single hangs in test 53g. The test results are at https://maloo.whamcloud.com/test_sets/23724c76-abad-11e3-a696-52540035b04c , but there are no logs for test 53g, except at the end of the suite_log:

== replay-single test 53g: |X| drop open reply and close request while close and open are both in flight == 18:37:54 (1394761074)
fail_loc=0x119
fail_loc=0x80000115
fail_loc=0
Replay barrier on lscratch-MDT0000
Failing mds1 on c16
Stopping /lustre/lscratch/mdt0 (opts:) on c16

Looking at dmesg on the MDT:

Lustre: DEBUG MARKER: == replay-single test 53g: |X| drop open reply and close request while close and open are both in flight == 18:37:54 (1394761074)
Lustre: *** cfs_fail_loc=119, val=2147483648***
Lustre: Skipped 1 previous similar message
LustreError: 18010:0:(ldlm_lib.c:2415:target_send_reply_msg()) @@@ dropping reply  req@ffff88081e682000 x1462504360037684/t274877906958(0) o36->97aaf730-d78d-08d9-43ce-9e768c9c685f@192.168.2.120@o2ib:0/0 lens 488/448 e 0 to 0 dl 1394761114 ref 1 fl Interpret:/0/0 rc 0/0
LustreError: 18010:0:(ldlm_lib.c:2415:target_send_reply_msg()) Skipped 1 previous similar message
Lustre: *** cfs_fail_loc=115, val=2147483648***
Turning device sda (0x800003) read-only
Lustre: DEBUG MARKER: mds1 REPLAY BARRIER on lscratch-MDT0000
Lustre: MGS is waiting for obd_unlinked_exports more than 8 seconds. The obd refcount = 5. Is it stuck?
Lustre: MGS is waiting for obd_unlinked_exports more than 16 seconds. The obd refcount = 5. Is it stuck?
LustreError: 166-1: MGC192.168.2.116@o2ib: Connection to MGS (at 0@lo) was lost; in progress operations using this service will fail
Lustre: MGS is waiting for obd_unlinked_exports more than 32 seconds. The obd refcount = 5. Is it stuck?
Lustre: MGS is waiting for obd_unlinked_exports more than 64 seconds. The obd refcount = 5. Is it stuck?
Lustre: MGS is waiting for obd_unlinked_exports more than 128 seconds. The obd refcount = 5. Is it stuck?
Lustre: 1776:0:(client.c:1901:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1394761310/real 1394761310]  req@ffff8808227c9400 x1462503945016504/t0(0) o250->MGC192.168.2.116@o2ib@0@lo:26/25 lens 400/544 e 0 to 1 dl 1394761346 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
Lustre: 1776:0:(client.c:1901:ptlrpc_expire_one_request()) Skipped 14 previous similar messages
LustreError: 137-5: lscratch-MDT0000_UUID: not available for connect from 192.168.2.119@o2ib (no target). If you are running an HA pair check that the target is mounted on the other server.
LustreError: Skipped 66 previous similar messages
INFO: task umount:18492 blocked for more than 120 seconds.
      Not tainted 2.6.32-431.5.1.el6_lustre.x86_64 #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
umount        D 0000000000000000     0 18492  18491 0x00000080
 ffff880803de3aa8 0000000000000082 ffff880803de3a08 ffff88082104b000
 ffffffffa05a6985 0000000000000000 ffff8808102d6084 ffffffffa05a6985
 ffff88082a385af8 ffff880803de3fd8 000000000000fbc8 ffff88082a385af8
Call Trace:
 [<ffffffff81528eb2>] schedule_timeout+0x192/0x2e0
 [<ffffffff81084220>] ? process_timeout+0x0/0x10
 [<ffffffffa0528eeb>] obd_exports_barrier+0xab/0x180 [obdclass]
 [<ffffffffa0d4052e>] mgs_device_fini+0xfe/0x580 [mgs]
 [<ffffffffa0554523>] class_cleanup+0x573/0xd30 [obdclass]
 [<ffffffffa052b086>] ? class_name2dev+0x56/0xe0 [obdclass]
 [<ffffffffa055624a>] class_process_config+0x156a/0x1ad0 [obdclass]
 [<ffffffffa054f3a3>] ? lustre_cfg_new+0x2d3/0x6e0 [obdclass]
 [<ffffffffa0556929>] class_manual_cleanup+0x179/0x6f0 [obdclass]
 [<ffffffffa052b086>] ? class_name2dev+0x56/0xe0 [obdclass]
 [<ffffffffa0591dfd>] server_put_super+0x45d/0xf60 [obdclass]
 [<ffffffff8118b23b>] generic_shutdown_super+0x5b/0xe0
 [<ffffffff8118b326>] kill_anon_super+0x16/0x60
 [<ffffffffa05587d6>] lustre_kill_super+0x36/0x60 [obdclass]
 [<ffffffff8118bac7>] deactivate_super+0x57/0x80
 [<ffffffff811aaaff>] mntput_no_expire+0xbf/0x110
 [<ffffffff811ab64b>] sys_umount+0x7b/0x3a0
 [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
Lustre: MGS is waiting for obd_unlinked_exports more than 256 seconds. The obd refcount = 5. Is it stuck?
Lustre: 1776:0:(client.c:1901:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1394761910/real 1394761910]  req@ffff8808049d6000 x1462503945016540/t0(0) o250->MGC192.168.2.116@o2ib@0@lo:26/25 lens 400/544 e 0 to 1 dl 1394761965 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
Lustre: 1776:0:(client.c:1901:ptlrpc_expire_one_request()) Skipped 8 previous similar messages
LustreError: 137-5: lscratch-MDT0000_UUID: not available for connect from 192.168.2.118@o2ib (no target). If you are running an HA pair check that the target is mounted on the other server.
LustreError: Skipped 120 previous similar messages
Lustre: MGS is waiting for obd_unlinked_exports more than 512 seconds. The obd refcount = 5. Is it stuck?
LustreError: 137-5: lscratch-MDT0000_UUID: not available for connect from 192.168.2.120@o2ib (no target). If you are running an HA pair check that the target is mounted on the other server.
LustreError: Skipped 113 previous similar messages
Lustre: 1776:0:(client.c:1901:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1394762585/real 1394762585]  req@ffff880819252800 x1462503945016576/t0(0) o250->MGC192.168.2.116@o2ib@0@lo:26/25 lens 400/544 e 0 to 1 dl 1394762640 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
Lustre: 1776:0:(client.c:1901:ptlrpc_expire_one_request()) Skipped 8 previous similar messages
Lustre: MGS is waiting for obd_unlinked_exports more than 1024 seconds. The obd refcount = 5. Is it stuck?
LustreError: 137-5: lscratch-MDT0000_UUID: not available for connect from 192.168.2.117@o2ib (no target). If you are running an HA pair check that the target is mounted on the other server.
LustreError: Skipped 120 previous similar messages
Lustre: 1776:0:(client.c:1901:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1394763260/real 1394763260]  req@ffff8808202e3800 x1462503945016612/t0(0) o250->MGC192.168.2.116@o2ib@0@lo:26/25 lens 400/544 e 0 to 1 dl 1394763315 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
Lustre: 1776:0:(client.c:1901:ptlrpc_expire_one_request()) Skipped 8 previous similar messages
LustreError: 137-5: lscratch-MDT0000_UUID: not available for connect from 192.168.2.119@o2ib (no target). If you are running an HA pair check that the target is mounted on the other server.
LustreError: Skipped 116 previous similar messages
Lustre: 1776:0:(client.c:1901:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1394763935/real 1394763935]  req@ffff8808040e9800 x1462503945016648/t0(0) o250->MGC192.168.2.116@o2ib@0@lo:26/25 lens 400/544 e 0 to 1 dl 1394763990 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
Lustre: 1776:0:(client.c:1901:ptlrpc_expire_one_request()) Skipped 8 previous similar messages
LustreError: 137-5: lscratch-MDT0000_UUID: not available for connect from 192.168.2.119@o2ib (no target). If you are running an HA pair check that the target is mounted on the other server.
LustreError: Skipped 113 previous similar messages

On the client, we see:

Lustre: DEBUG MARKER: == replay-single test 53g: |X| drop open reply and close request while close and open are both in flight == 18:37:54 (1394761074)
Lustre: DEBUG MARKER: cancel_lru_locks mdc start
Lustre: DEBUG MARKER: cancel_lru_locks mdc stop
Lustre: DEBUG MARKER: cancel_lru_locks mdc start
Lustre: DEBUG MARKER: cancel_lru_locks mdc stop
Lustre: DEBUG MARKER: local REPLAY BARRIER on lscratch-MDT0000
Lustre: 7328:0:(client.c:1901:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1394761618/real 1394761618]  req@ffff88080f538800 x1462504360038020/t0(0) o250->MGC192.168.2.116@o2ib@192.168.2.116@o2ib:26/25 lens 400/544 e 0 to 1 dl 1394761673 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
Lustre: 7328:0:(client.c:1901:ptlrpc_expire_one_request()) Skipped 15 previous similar messages
Lustre: 7328:0:(client.c:1901:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1394762293/real 1394762293]  req@ffff880820549400 x1462504360038380/t0(0) o250->MGC192.168.2.116@o2ib@192.168.2.116@o2ib:26/25 lens 400/544 e 0 to 1 dl 1394762348 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1


 Comments   
Comment by Andreas Dilger [ 30/May/14 ]

Also hit this on master at unmount time in conf-sanity.sh test_38:
https://maloo.whamcloud.com/test_sets/aadb2e7a-e60a-11e3-87f3-52540035b04c

Comment by Mikhail Pershin [ 01/Mar/15 ]

Another hit in replay-single.sh test_28 https://testing.hpdd.intel.com/test_sessions/08c1bd60-bfb3-11e4-88dc-5254006e85c2

I checked other logs for this issue and found more details:
00000020:00080000:0.0:1408673672.851541:0:15843:0:(genops.c:1541:print_export_data()) MGS: UNLINKED ffff88007b1f4800 ac3b9fd6-6587-c04b-9265-166e701c0e80 10.1.4.113@tcp 1 (0 0 0) 1 0 0 0: (null) 0

This means the export has one reference, but it is not a reference from a lock, request, or transaction callback; those counts are all zero.

Comment by Gerrit Updater [ 01/Mar/15 ]

Mike Pershin (mike.pershin@intel.com) uploaded a new patch: http://review.whamcloud.com/13920
Subject: LU-4772 mgs: free MGS fsdb before export barrier
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 05b9891447670042bf552a59422156efd628af97

Comment by Mikhail Pershin [ 01/Mar/15 ]

I think the reason for this issue is the mgs_fsc structure, which holds an export reference while it is alive. But mgs_device_fini() waits for the exports to drop before cleaning up those structures, and they are only cleaned up through the corresponding FSDB cleanup path: mgs_fsdb_free() -> mgs_ir_fini_fs() -> mgs_fsc_cleanup_by_fsdb()

Comment by Gerrit Updater [ 08/Apr/15 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/13920/
Subject: LU-4772 mgs: check MGS refcounting before export barrier
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 61cc0fd9636ccf4d302a9f776fe98910e4b0333d

Comment by Li Xi (Inactive) [ 09/Apr/15 ]

Hi Mikhail Pershin,

We are currently hitting this issue pretty frequently. I saw your comments; I guess you are really close to finishing a fix patch? Do you have an initial patch that we could test to confirm whether it works? Please let me know if there is anything I can do.

Comment by Mikhail Pershin [ 11/Apr/15 ]

Li Xi, the initial patch has already landed on master; check it here: http://review.whamcloud.com/13920/
It is a debug patch, but if you use it, note that it contains an assertion. If that turns out to be the cause of this issue, the assertion may fire on your side, so in your case it may be better to replace it with a CERROR that just outputs an error message.

Comment by Mikhail Pershin [ 11/Apr/15 ]

It happened again: https://testing.hpdd.intel.com/test_sets/f4f6624e-df02-11e4-9454-5254006e85c2
now with debug:

18:50:33:LustreError: 30857:0:(mgs_handler.c:1379:mgs_fsc_debug()) ASSERTION( list_empty(&fsdb->fsdb_clients) ) failed: Find FSC after cleanup, FSDB lustre
18:50:33:LustreError: 30857:0:(mgs_handler.c:1379:mgs_fsc_debug()) LBUG
18:50:33:Pid: 30857, comm: umount
18:50:33:
18:50:33:Call Trace:
18:50:33: [<ffffffffa0885875>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
18:50:33: [<ffffffffa0885e77>] lbug_with_loc+0x47/0xb0 [libcfs]
18:50:33: [<ffffffffa10d0523>] mgs_fsc_debug+0x93/0xa0 [mgs]
18:50:33: [<ffffffffa10d36b5>] mgs_device_fini+0x115/0x5b0 [mgs]
18:50:33: [<ffffffffa098fc42>] class_cleanup+0x552/0xd10 [obdclass]
18:50:33: [<ffffffffa0970286>] ? class_name2dev+0x56/0xe0 [obdclass]
18:50:33: [<ffffffffa09923ea>] class_process_config+0x1fea/0x27c0 [obdclass]
18:50:33: [<ffffffffa0892161>] ? libcfs_debug_msg+0x41/0x50 [libcfs]
18:50:33: [<ffffffffa098b3a5>] ? lustre_cfg_new+0x435/0x630 [obdclass]
18:50:33: [<ffffffffa0992ce1>] class_manual_cleanup+0x121/0x870 [obdclass]
18:50:33: [<ffffffffa0970286>] ? class_name2dev+0x56/0xe0 [obdclass]
18:50:33: [<ffffffffa09cd0bf>] server_put_super+0x81f/0xe50 [obdclass]
18:50:33: [<ffffffff8119083b>] generic_shutdown_super+0x5b/0xe0
18:50:33: [<ffffffff81190926>] kill_anon_super+0x16/0x60
18:50:33: [<ffffffffa0994f36>] lustre_kill_super+0x36/0x60 [obdclass]
18:50:33: [<ffffffff811910c7>] deactivate_super+0x57/0x80
18:50:33: [<ffffffff811b0cff>] mntput_no_expire+0xbf/0x110
18:50:33: [<ffffffff811b184b>] sys_umount+0x7b/0x3a0
18:50:33: [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
18:50:33:
18:50:33:Kernel panic - not syncing: LBUG

Jinshan, it seems that FSCs are not fully cleaned up during class_disconnect_exports().

Comment by Gerrit Updater [ 11/Apr/15 ]

Mike Pershin (mike.pershin@intel.com) uploaded a new patch: http://review.whamcloud.com/14443
Subject: LU-4772 mgs: free MGS fsdb before export barrier
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: ac5a43d780bb4d28edd4a1492ec202e482c92705

Comment by Andreas Dilger [ 11/May/15 ]

Saw this today in replay-single test_35 https://testing.hpdd.intel.com/test_sets/a504e786-f557-11e4-8a1d-5254006e85c2

Comment by Jinshan Xiong (Inactive) [ 18/May/15 ]

The assertion in the debug patch is hit in LU-6616.

08:12:16:Lustre: DEBUG MARKER: == recovery-small test 27: fail LOV while using OSC's == 21:44:09 (1431578649)
08:12:16:Lustre: DEBUG MARKER: grep -c /mnt/mds1' ' /proc/mounts
08:12:16:Lustre: DEBUG MARKER: umount -d /mnt/mds1
08:12:16:Lustre: Failing over lustre-MDT0000
08:12:16:LustreError: 12100:0:(mgs_handler.c:1379:mgs_fsc_debug()) ASSERTION( list_empty(&fsdb->fsdb_clients) ) failed: Find FSC after cleanup, FSDB lustre
08:12:16:LustreError: 12100:0:(mgs_handler.c:1379:mgs_fsc_debug()) LBUG
08:12:17:Pid: 12100, comm: umount
08:12:18:
08:12:18:Call Trace:
08:12:18: [<ffffffffa0704875>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
08:12:18: [<ffffffffa0704e77>] lbug_with_loc+0x47/0xb0 [libcfs]
08:12:19: [<ffffffffa0e7c523>] mgs_fsc_debug+0x93/0xa0 [mgs]
08:12:19: [<ffffffffa0e7f6b5>] mgs_device_fini+0x115/0x5b0 [mgs]
08:12:19: [<ffffffffa083d582>] class_cleanup+0x552/0xd10 [obdclass]
08:12:19: [<ffffffffa081dbc6>] ? class_name2dev+0x56/0xe0 [obdclass]
08:12:19: [<ffffffffa083fd2a>] class_process_config+0x1fea/0x27c0 [obdclass]
08:12:19: [<ffffffffa0710c31>] ? libcfs_debug_msg+0x41/0x50 [libcfs]
08:12:19: [<ffffffffa0838ce5>] ? lustre_cfg_new+0x435/0x630 [obdclass]
08:12:20: [<ffffffffa0840621>] class_manual_cleanup+0x121/0x870 [obdclass]
08:12:20: [<ffffffffa081dbc6>] ? class_name2dev+0x56/0xe0 [obdclass]
08:12:20: [<ffffffffa087b774>] server_put_super+0xc44/0xea0 [obdclass]
08:12:20: [<ffffffff81190adb>] generic_shutdown_super+0x5b/0xe0
08:12:20: [<ffffffff81190bc6>] kill_anon_super+0x16/0x60
08:12:20: [<ffffffffa0842876>] lustre_kill_super+0x36/0x60 [obdclass]
08:12:21: [<ffffffff81191367>] deactivate_super+0x57/0x80
08:12:22: [<ffffffff811b0fbf>] mntput_no_expire+0xbf/0x110
08:12:23: [<ffffffff811b1b0b>] sys_umount+0x7b/0x3a0
08:12:23: [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
08:12:23:
08:12:23:Kernel panic - not syncing: LBUG
08:12:23:Pid: 12100, comm: umount Tainted: P           ---------------    2.6.32-504.16.2.el6_lustre.gd805a88.x86_64 #1
08:12:24:Call Trace:
08:12:24: [<ffffffff81529fbc>] ? panic+0xa7/0x16f
08:12:25: [<ffffffffa0704ecb>] ? lbug_with_loc+0x9b/0xb0 [libcfs]
08:12:25: [<ffffffffa0e7c523>] ? mgs_fsc_debug+0x93/0xa0 [mgs]
08:12:25: [<ffffffffa0e7f6b5>] ? mgs_device_fini+0x115/0x5b0 [mgs]
08:12:25: [<ffffffffa083d582>] ? class_cleanup+0x552/0xd10 [obdclass]
08:12:25: [<ffffffffa081dbc6>] ? class_name2dev+0x56/0xe0 [obdclass]
08:12:25: [<ffffffffa083fd2a>] ? class_process_config+0x1fea/0x27c0 [obdclass]
08:12:25: [<ffffffffa0710c31>] ? libcfs_debug_msg+0x41/0x50 [libcfs]
08:12:26: [<ffffffffa0838ce5>] ? lustre_cfg_new+0x435/0x630 [obdclass]
08:12:26: [<ffffffffa0840621>] ? class_manual_cleanup+0x121/0x870 [obdclass]
08:12:27: [<ffffffffa081dbc6>] ? class_name2dev+0x56/0xe0 [obdclass]
08:12:27: [<ffffffffa087b774>] ? server_put_super+0xc44/0xea0 [obdclass]
08:12:27: [<ffffffff81190adb>] ? generic_shutdown_super+0x5b/0xe0
08:12:28: [<ffffffff81190bc6>] ? kill_anon_super+0x16/0x60
08:12:28: [<ffffffffa0842876>] ? lustre_kill_super+0x36/0x60 [obdclass]
08:12:28: [<ffffffff81191367>] ? deactivate_super+0x57/0x80
08:12:28: [<ffffffff811b0fbf>] ? mntput_no_expire+0xbf/0x110
08:12:29: [<ffffffff811b1b0b>] ? sys_umount+0x7b/0x3a0
08:12:29: [<ffffffff8100b072>] ? system_call_fastpath+0x16/0x1b
08:12:29:Initializing cgroup subsys cpuset
08:12:30:Initializing cgroup subsys cpu

Maloo test result is at: https://testing.hpdd.intel.com/test_sets/32d97c9c-fa45-11e4-8c8b-5254006e85c2

Comment by Jinshan Xiong (Inactive) [ 19/May/15 ]

I was trying to find the root cause, but I couldn't fetch the debug log. However, patch 14443 seems harmless anyway, so I'd like to give it a try.

Comment by Gerrit Updater [ 20/May/15 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/14443/
Subject: LU-4772 mgs: free MGS fsdb before export barrier
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 319acc84feeb66d08c7db408ad732889b46765e0

Comment by Peter Jones [ 13/Jul/15 ]

Landed for 2.8

Generated at Sat Feb 10 01:45:43 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.