[LU-4772] MGS is waiting for obd_unlinked_exports Created: 14/Mar/14 Updated: 19/Mar/19 Resolved: 13/Jul/15 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.5.1, Lustre 2.7.0, Lustre 2.8.0 |
| Fix Version/s: | Lustre 2.8.0 |
| Type: | Bug | Priority: | Blocker |
| Reporter: | James Nunez (Inactive) | Assignee: | Mikhail Pershin |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None |
| Environment: |
OpenSFS cluster with one MGS/MDS (the MDT is a partition of /dev/sda), one OSS with two OSTs, a node running Robinhood, one combined client/agent node, and one client |
||
| Issue Links: |
|
| Severity: | 3 |
| Rank (Obsolete): | 13125 |
| Description |
|
While running acceptance-small for HSM testing on 2.5.1-RC4, replay-single hangs in test 53g. The test results are at https://maloo.whamcloud.com/test_sets/23724c76-abad-11e3-a696-52540035b04c , but there are no logs for test 53g except at the end of the suite_log:

== replay-single test 53g: |X| drop open reply and close request while close and open are both in flight == 18:37:54 (1394761074)
fail_loc=0x119
fail_loc=0x80000115
fail_loc=0
Replay barrier on lscratch-MDT0000
Failing mds1 on c16
Stopping /lustre/lscratch/mdt0 (opts:) on c16

Looking at dmesg on the MDT:

Lustre: DEBUG MARKER: == replay-single test 53g: |X| drop open reply and close request while close and open are both in flight == 18:37:54 (1394761074)
Lustre: *** cfs_fail_loc=119, val=2147483648***
Lustre: Skipped 1 previous similar message
LustreError: 18010:0:(ldlm_lib.c:2415:target_send_reply_msg()) @@@ dropping reply req@ffff88081e682000 x1462504360037684/t274877906958(0) o36->97aaf730-d78d-08d9-43ce-9e768c9c685f@192.168.2.120@o2ib:0/0 lens 488/448 e 0 to 0 dl 1394761114 ref 1 fl Interpret:/0/0 rc 0/0
LustreError: 18010:0:(ldlm_lib.c:2415:target_send_reply_msg()) Skipped 1 previous similar message
Lustre: *** cfs_fail_loc=115, val=2147483648***
Turning device sda (0x800003) read-only
Lustre: DEBUG MARKER: mds1 REPLAY BARRIER on lscratch-MDT0000
Lustre: MGS is waiting for obd_unlinked_exports more than 8 seconds. The obd refcount = 5. Is it stuck?
Lustre: MGS is waiting for obd_unlinked_exports more than 16 seconds. The obd refcount = 5. Is it stuck?
LustreError: 166-1: MGC192.168.2.116@o2ib: Connection to MGS (at 0@lo) was lost; in progress operations using this service will fail
Lustre: MGS is waiting for obd_unlinked_exports more than 32 seconds. The obd refcount = 5. Is it stuck?
Lustre: MGS is waiting for obd_unlinked_exports more than 64 seconds. The obd refcount = 5. Is it stuck?
Lustre: MGS is waiting for obd_unlinked_exports more than 128 seconds. The obd refcount = 5. Is it stuck?
Lustre: 1776:0:(client.c:1901:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1394761310/real 1394761310] req@ffff8808227c9400 x1462503945016504/t0(0) o250->MGC192.168.2.116@o2ib@0@lo:26/25 lens 400/544 e 0 to 1 dl 1394761346 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
Lustre: 1776:0:(client.c:1901:ptlrpc_expire_one_request()) Skipped 14 previous similar messages
LustreError: 137-5: lscratch-MDT0000_UUID: not available for connect from 192.168.2.119@o2ib (no target). If you are running an HA pair check that the target is mounted on the other server.
LustreError: Skipped 66 previous similar messages
INFO: task umount:18492 blocked for more than 120 seconds.
Not tainted 2.6.32-431.5.1.el6_lustre.x86_64 #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
umount D 0000000000000000 0 18492 18491 0x00000080
ffff880803de3aa8 0000000000000082 ffff880803de3a08 ffff88082104b000
ffffffffa05a6985 0000000000000000 ffff8808102d6084 ffffffffa05a6985
ffff88082a385af8 ffff880803de3fd8 000000000000fbc8 ffff88082a385af8
Call Trace:
[<ffffffff81528eb2>] schedule_timeout+0x192/0x2e0
[<ffffffff81084220>] ? process_timeout+0x0/0x10
[<ffffffffa0528eeb>] obd_exports_barrier+0xab/0x180 [obdclass]
[<ffffffffa0d4052e>] mgs_device_fini+0xfe/0x580 [mgs]
[<ffffffffa0554523>] class_cleanup+0x573/0xd30 [obdclass]
[<ffffffffa052b086>] ? class_name2dev+0x56/0xe0 [obdclass]
[<ffffffffa055624a>] class_process_config+0x156a/0x1ad0 [obdclass]
[<ffffffffa054f3a3>] ? lustre_cfg_new+0x2d3/0x6e0 [obdclass]
[<ffffffffa0556929>] class_manual_cleanup+0x179/0x6f0 [obdclass]
[<ffffffffa052b086>] ? class_name2dev+0x56/0xe0 [obdclass]
[<ffffffffa0591dfd>] server_put_super+0x45d/0xf60 [obdclass]
[<ffffffff8118b23b>] generic_shutdown_super+0x5b/0xe0
[<ffffffff8118b326>] kill_anon_super+0x16/0x60
[<ffffffffa05587d6>] lustre_kill_super+0x36/0x60 [obdclass]
[<ffffffff8118bac7>] deactivate_super+0x57/0x80
[<ffffffff811aaaff>] mntput_no_expire+0xbf/0x110
[<ffffffff811ab64b>] sys_umount+0x7b/0x3a0
[<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
Lustre: MGS is waiting for obd_unlinked_exports more than 256 seconds. The obd refcount = 5. Is it stuck?
INFO: task umount:18492 blocked for more than 120 seconds.
(same umount call trace as above, logged twice)
Lustre: 1776:0:(client.c:1901:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1394761910/real 1394761910] req@ffff8808049d6000 x1462503945016540/t0(0) o250->MGC192.168.2.116@o2ib@0@lo:26/25 lens 400/544 e 0 to 1 dl 1394761965 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
Lustre: 1776:0:(client.c:1901:ptlrpc_expire_one_request()) Skipped 8 previous similar messages
LustreError: 137-5: lscratch-MDT0000_UUID: not available for connect from 192.168.2.118@o2ib (no target). If you are running an HA pair check that the target is mounted on the other server.
LustreError: Skipped 120 previous similar messages
INFO: task umount:18492 blocked for more than 120 seconds.
(same umount call trace as above)
Lustre: MGS is waiting for obd_unlinked_exports more than 512 seconds. The obd refcount = 5. Is it stuck?
INFO: task umount:18492 blocked for more than 120 seconds.
(same umount call trace as above, logged twice)
LustreError: 137-5: lscratch-MDT0000_UUID: not available for connect from 192.168.2.120@o2ib (no target). If you are running an HA pair check that the target is mounted on the other server.
LustreError: Skipped 113 previous similar messages
Lustre: 1776:0:(client.c:1901:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1394762585/real 1394762585] req@ffff880819252800 x1462503945016576/t0(0) o250->MGC192.168.2.116@o2ib@0@lo:26/25 lens 400/544 e 0 to 1 dl 1394762640 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
Lustre: 1776:0:(client.c:1901:ptlrpc_expire_one_request()) Skipped 8 previous similar messages
Lustre: MGS is waiting for obd_unlinked_exports more than 1024 seconds. The obd refcount = 5. Is it stuck?
LustreError: 137-5: lscratch-MDT0000_UUID: not available for connect from 192.168.2.117@o2ib (no target). If you are running an HA pair check that the target is mounted on the other server.
LustreError: Skipped 120 previous similar messages
Lustre: 1776:0:(client.c:1901:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1394763260/real 1394763260] req@ffff8808202e3800 x1462503945016612/t0(0) o250->MGC192.168.2.116@o2ib@0@lo:26/25 lens 400/544 e 0 to 1 dl 1394763315 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
Lustre: 1776:0:(client.c:1901:ptlrpc_expire_one_request()) Skipped 8 previous similar messages
LustreError: 137-5: lscratch-MDT0000_UUID: not available for connect from 192.168.2.119@o2ib (no target). If you are running an HA pair check that the target is mounted on the other server.
LustreError: Skipped 116 previous similar messages
Lustre: 1776:0:(client.c:1901:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1394763935/real 1394763935] req@ffff8808040e9800 x1462503945016648/t0(0) o250->MGC192.168.2.116@o2ib@0@lo:26/25 lens 400/544 e 0 to 1 dl 1394763990 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
Lustre: 1776:0:(client.c:1901:ptlrpc_expire_one_request()) Skipped 8 previous similar messages
LustreError: 137-5: lscratch-MDT0000_UUID: not available for connect from 192.168.2.119@o2ib (no target). If you are running an HA pair check that the target is mounted on the other server.
LustreError: Skipped 113 previous similar messages
On the client, we see:

Lustre: DEBUG MARKER: == replay-single test 53g: |X| drop open reply and close request while close and open are both in flight == 18:37:54 (1394761074)
Lustre: DEBUG MARKER: cancel_lru_locks mdc start
Lustre: DEBUG MARKER: cancel_lru_locks mdc stop
Lustre: DEBUG MARKER: cancel_lru_locks mdc start
Lustre: DEBUG MARKER: cancel_lru_locks mdc stop
Lustre: DEBUG MARKER: local REPLAY BARRIER on lscratch-MDT0000
Lustre: 7328:0:(client.c:1901:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1394761618/real 1394761618] req@ffff88080f538800 x1462504360038020/t0(0) o250->MGC192.168.2.116@o2ib@192.168.2.116@o2ib:26/25 lens 400/544 e 0 to 1 dl 1394761673 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
Lustre: 7328:0:(client.c:1901:ptlrpc_expire_one_request()) Skipped 15 previous similar messages
Lustre: 7328:0:(client.c:1901:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1394762293/real 1394762293] req@ffff880820549400 x1462504360038380/t0(0) o250->MGC192.168.2.116@o2ib@192.168.2.116@o2ib:26/25 lens 400/544 e 0 to 1 dl 1394762348 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1 |
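For context: the repeating "MGS is waiting for obd_unlinked_exports" console messages and the blocked umount both come from obd_exports_barrier(), which appears in the stack traces above. A simplified sketch of that wait loop, paraphrased from lustre/obdclass/genops.c of roughly this vintage (illustrative, not the verbatim source), shows why the warning interval doubles (8, 16, 32, ... seconds) and why umount blocks for as long as anything pins an export reference:

void obd_exports_barrier(struct obd_device *obd)
{
        int waited = 2;

        spin_lock(&obd->obd_dev_lock);
        /* Block until every unlinked export has dropped its last
         * reference; there is no timeout, so a leaked reference hangs
         * umount forever, as seen in this ticket. */
        while (!list_empty(&obd->obd_unlinked_exports)) {
                spin_unlock(&obd->obd_dev_lock);
                set_current_state(TASK_UNINTERRUPTIBLE);
                schedule_timeout(cfs_time_seconds(waited));
                if (waited > 5 && IS_PO2(waited))
                        /* The message seen repeatedly in the dmesg above. */
                        LCONSOLE_WARN("%s is waiting for obd_unlinked_exports more than %d seconds. The obd refcount = %d. Is it stuck?\n",
                                      obd->obd_name, waited,
                                      atomic_read(&obd->obd_refcount));
                waited *= 2;
                spin_lock(&obd->obd_dev_lock);
        }
        spin_unlock(&obd->obd_dev_lock);
}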
| Comments |
| Comment by Andreas Dilger [ 30/May/14 ] |
|
Also hit this on master at unmount time in conf-sanity.sh test_38: |
| Comment by Mikhail Pershin [ 01/Mar/15 ] |
|
Another hit in replay-single.sh test_28: https://testing.hpdd.intel.com/test_sessions/08c1bd60-bfb3-11e4-88dc-5254006e85c2 I checked other logs for this issue and found more details: the export has one remaining reference, but it is not a reference from a lock, a request, or a transaction callback; those counters are all zero. |
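The counters being described are presumably the per-export debug counters on struct obd_export (exp_locks_count, exp_rpc_count, exp_cb_count; these field names are best-effort recollections from lustre_export.h, not verified). A tiny user-space model of the accounting, purely illustrative of the reasoning above:

#include <stdio.h>

/* Hypothetical model of the export reference accounting; this is not
 * Lustre code, just a restatement of the observation above. */
struct export_model {
        int refcount;     /* total references on the export        */
        int locks_count;  /* references held by LDLM locks         */
        int rpc_count;    /* references held by in-flight requests */
        int cb_count;     /* references held by commit callbacks   */
};

int main(void)
{
        /* The state reported above: one reference left, but every
         * tracked source is zero, so some other holder (the FSC, as
         * later comments show) is pinning the export. */
        struct export_model exp = { .refcount = 1, .locks_count = 0,
                                    .rpc_count = 0, .cb_count = 0 };
        int tracked = exp.locks_count + exp.rpc_count + exp.cb_count;

        printf("untracked references: %d\n", exp.refcount - tracked);
        return 0;
}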
| Comment by Gerrit Updater [ 01/Mar/15 ] |
|
Mike Pershin (mike.pershin@intel.com) uploaded a new patch: http://review.whamcloud.com/13920 |
| Comment by Mikhail Pershin [ 01/Mar/15 ] |
|
I think the reason for this issue is the mgs_fsc structure, which holds an export reference for as long as it is alive. But mgs_device_fini() waits for the exports to be released before cleaning those structures up, and they are only cleaned up through the corresponding FSDB cleanup: mgs_fsdb_free() -> mgs_ir_fini_fs() -> mgs_fsc_cleanup_by_fsdb(). So the wait is circular: the export cannot drop its last reference until its FSC is freed, and the FSCs are only freed after the export wait completes. |
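A hedged sketch of the ordering problem described above; the call chain comes from the comment and the stack traces in this ticket, while the fragment bodies are illustrative rather than the verbatim MGS source:

/* Each MGS filesystem client record (mgs_fsc) pins the export it
 * describes for as long as the record exists; something like: */
fsc->mfc_export = class_export_get(exp);

/* Simplified teardown order in mgs_device_fini(): */
class_disconnect_exports(obd);   /* exports move to obd_unlinked_exports */
obd_exports_barrier(obd);        /* waits for export refs to drop ...    */
mgs_cleanup_fsdb_list(mgs);      /* ... but the FSC references are only
                                  * released here, via mgs_fsdb_free() ->
                                  * mgs_ir_fini_fs() ->
                                  * mgs_fsc_cleanup_by_fsdb(), so the
                                  * barrier above can never complete.     */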
| Comment by Gerrit Updater [ 08/Apr/15 ] |
|
Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/13920/ |
| Comment by Li Xi (Inactive) [ 09/Apr/15 ] |
|
Hi Mikhail Pershin, we are currently hitting this issue pretty frequently. I saw your comments; I guess you are really close to finishing a fix patch? Do you have an initial patch that we can test to confirm whether it works? Please let me know if there is anything I can do. |
| Comment by Mikhail Pershin [ 11/Apr/15 ] |
|
Li Xi, the initial patch has already landed on master; you can also check it here: http://review.whamcloud.com/13920/ |
| Comment by Mikhail Pershin [ 11/Apr/15 ] |
|
It happened again: https://testing.hpdd.intel.com/test_sets/f4f6624e-df02-11e4-9454-5254006e85c2

18:50:33:LustreError: 30857:0:(mgs_handler.c:1379:mgs_fsc_debug()) ASSERTION( list_empty(&fsdb->fsdb_clients) ) failed: Find FSC after cleanup, FSDB lustre
18:50:33:LustreError: 30857:0:(mgs_handler.c:1379:mgs_fsc_debug()) LBUG
18:50:33:Pid: 30857, comm: umount
18:50:33:
18:50:33:Call Trace:
18:50:33: [<ffffffffa0885875>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
18:50:33: [<ffffffffa0885e77>] lbug_with_loc+0x47/0xb0 [libcfs]
18:50:33: [<ffffffffa10d0523>] mgs_fsc_debug+0x93/0xa0 [mgs]
18:50:33: [<ffffffffa10d36b5>] mgs_device_fini+0x115/0x5b0 [mgs]
18:50:33: [<ffffffffa098fc42>] class_cleanup+0x552/0xd10 [obdclass]
18:50:33: [<ffffffffa0970286>] ? class_name2dev+0x56/0xe0 [obdclass]
18:50:33: [<ffffffffa09923ea>] class_process_config+0x1fea/0x27c0 [obdclass]
18:50:33: [<ffffffffa0892161>] ? libcfs_debug_msg+0x41/0x50 [libcfs]
18:50:33: [<ffffffffa098b3a5>] ? lustre_cfg_new+0x435/0x630 [obdclass]
18:50:33: [<ffffffffa0992ce1>] class_manual_cleanup+0x121/0x870 [obdclass]
18:50:33: [<ffffffffa0970286>] ? class_name2dev+0x56/0xe0 [obdclass]
18:50:33: [<ffffffffa09cd0bf>] server_put_super+0x81f/0xe50 [obdclass]
18:50:33: [<ffffffff8119083b>] generic_shutdown_super+0x5b/0xe0
18:50:33: [<ffffffff81190926>] kill_anon_super+0x16/0x60
18:50:33: [<ffffffffa0994f36>] lustre_kill_super+0x36/0x60 [obdclass]
18:50:33: [<ffffffff811910c7>] deactivate_super+0x57/0x80
18:50:33: [<ffffffff811b0cff>] mntput_no_expire+0xbf/0x110
18:50:33: [<ffffffff811b184b>] sys_umount+0x7b/0x3a0
18:50:33: [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
18:50:33:
18:50:33:Kernel panic - not syncing: LBUG

Jinshan, it seems that the FSCs are not fully cleaned up during class_disconnect_exports(). |
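The assertion comes from the debug check added by patch 13920; a hedged reconstruction of what it verifies, based on the assertion text in the trace above (the actual patch may differ in detail):

/* Reconstruction of the mgs_fsc_debug() check; not the verbatim patch. */
static void mgs_fsc_debug(struct mgs_device *mgs)
{
        struct fs_db *fsdb;

        /* After export cleanup, no filesystem should still have FSC
         * records hanging off its FSDB; if one does, its export
         * reference was leaked. */
        list_for_each_entry(fsdb, &mgs->mgs_fs_db_list, fsdb_list)
                LASSERTF(list_empty(&fsdb->fsdb_clients),
                         "Find FSC after cleanup, FSDB %s\n",
                         fsdb->fsdb_name);
}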
| Comment by Gerrit Updater [ 11/Apr/15 ] |
|
Mike Pershin (mike.pershin@intel.com) uploaded a new patch: http://review.whamcloud.com/14443 |
| Comment by Andreas Dilger [ 11/May/15 ] |
|
Saw this today in replay-single test_35 https://testing.hpdd.intel.com/test_sets/a504e786-f557-11e4-8a1d-5254006e85c2 |
| Comment by Jinshan Xiong (Inactive) [ 18/May/15 ] |
|
The assertion in the debug patch was hit in recovery-small test 27:

08:12:16:Lustre: DEBUG MARKER: == recovery-small test 27: fail LOV while using OSC's == 21:44:09 (1431578649)
08:12:16:Lustre: DEBUG MARKER: grep -c /mnt/mds1' ' /proc/mounts
08:12:16:Lustre: DEBUG MARKER: umount -d /mnt/mds1
08:12:16:Lustre: Failing over lustre-MDT0000
08:12:16:LustreError: 12100:0:(mgs_handler.c:1379:mgs_fsc_debug()) ASSERTION( list_empty(&fsdb->fsdb_clients) ) failed: Find FSC after cleanup, FSDB lustre
08:12:16:LustreError: 12100:0:(mgs_handler.c:1379:mgs_fsc_debug()) LBUG
08:12:17:Pid: 12100, comm: umount
08:12:18:
08:12:18:Call Trace:
08:12:18: [<ffffffffa0704875>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
08:12:18: [<ffffffffa0704e77>] lbug_with_loc+0x47/0xb0 [libcfs]
08:12:19: [<ffffffffa0e7c523>] mgs_fsc_debug+0x93/0xa0 [mgs]
08:12:19: [<ffffffffa0e7f6b5>] mgs_device_fini+0x115/0x5b0 [mgs]
08:12:19: [<ffffffffa083d582>] class_cleanup+0x552/0xd10 [obdclass]
08:12:19: [<ffffffffa081dbc6>] ? class_name2dev+0x56/0xe0 [obdclass]
08:12:19: [<ffffffffa083fd2a>] class_process_config+0x1fea/0x27c0 [obdclass]
08:12:19: [<ffffffffa0710c31>] ? libcfs_debug_msg+0x41/0x50 [libcfs]
08:12:19: [<ffffffffa0838ce5>] ? lustre_cfg_new+0x435/0x630 [obdclass]
08:12:20: [<ffffffffa0840621>] class_manual_cleanup+0x121/0x870 [obdclass]
08:12:20: [<ffffffffa081dbc6>] ? class_name2dev+0x56/0xe0 [obdclass]
08:12:20: [<ffffffffa087b774>] server_put_super+0xc44/0xea0 [obdclass]
08:12:20: [<ffffffff81190adb>] generic_shutdown_super+0x5b/0xe0
08:12:20: [<ffffffff81190bc6>] kill_anon_super+0x16/0x60
08:12:20: [<ffffffffa0842876>] lustre_kill_super+0x36/0x60 [obdclass]
08:12:21: [<ffffffff81191367>] deactivate_super+0x57/0x80
08:12:22: [<ffffffff811b0fbf>] mntput_no_expire+0xbf/0x110
08:12:23: [<ffffffff811b1b0b>] sys_umount+0x7b/0x3a0
08:12:23: [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
08:12:23:
08:12:23:Kernel panic - not syncing: LBUG
08:12:23:Pid: 12100, comm: umount Tainted: P --------------- 2.6.32-504.16.2.el6_lustre.gd805a88.x86_64 #1
08:12:24:Call Trace:
08:12:24: [<ffffffff81529fbc>] ? panic+0xa7/0x16f
08:12:25: [<ffffffffa0704ecb>] ? lbug_with_loc+0x9b/0xb0 [libcfs]
08:12:25: [<ffffffffa0e7c523>] ? mgs_fsc_debug+0x93/0xa0 [mgs]
08:12:25: [<ffffffffa0e7f6b5>] ? mgs_device_fini+0x115/0x5b0 [mgs]
08:12:25: [<ffffffffa083d582>] ? class_cleanup+0x552/0xd10 [obdclass]
08:12:25: [<ffffffffa081dbc6>] ? class_name2dev+0x56/0xe0 [obdclass]
08:12:25: [<ffffffffa083fd2a>] ? class_process_config+0x1fea/0x27c0 [obdclass]
08:12:25: [<ffffffffa0710c31>] ? libcfs_debug_msg+0x41/0x50 [libcfs]
08:12:26: [<ffffffffa0838ce5>] ? lustre_cfg_new+0x435/0x630 [obdclass]
08:12:26: [<ffffffffa0840621>] ? class_manual_cleanup+0x121/0x870 [obdclass]
08:12:27: [<ffffffffa081dbc6>] ? class_name2dev+0x56/0xe0 [obdclass]
08:12:27: [<ffffffffa087b774>] ? server_put_super+0xc44/0xea0 [obdclass]
08:12:27: [<ffffffff81190adb>] ? generic_shutdown_super+0x5b/0xe0
08:12:28: [<ffffffff81190bc6>] ? kill_anon_super+0x16/0x60
08:12:28: [<ffffffffa0842876>] ? lustre_kill_super+0x36/0x60 [obdclass]
08:12:28: [<ffffffff81191367>] ? deactivate_super+0x57/0x80
08:12:28: [<ffffffff811b0fbf>] ? mntput_no_expire+0xbf/0x110
08:12:29: [<ffffffff811b1b0b>] ? sys_umount+0x7b/0x3a0
08:12:29: [<ffffffff8100b072>] ? system_call_fastpath+0x16/0x1b
08:12:29:Initializing cgroup subsys cpuset
08:12:30:Initializing cgroup subsys cpu

Maloo test result is at: https://testing.hpdd.intel.com/test_sets/32d97c9c-fa45-11e4-8c8b-5254006e85c2 |
| Comment by Jinshan Xiong (Inactive) [ 19/May/15 ] |
|
I was trying to find the root cause, but I couldn't fetch the debug log. However, patch 14443 seems harmless in any case, so I'd like to give it a try. |
| Comment by Gerrit Updater [ 20/May/15 ] |
|
Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/14443/ |
| Comment by Peter Jones [ 13/Jul/15 ] |
|
Landed for 2.8 |