[LU-5539] MGS is waiting for obd_unlinked_exports more than 1024 seconds
Created: 22/Aug/14 | Updated: 23/Nov/17 | Resolved: 09/Jan/15 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.5.3, Lustre 2.5.4 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Blocker |
| Reporter: | Jian Yu | Assignee: | WC Triage |
| Resolution: | Duplicate | Votes: | 0 |
| Labels: | 22pl, mq115 |
| Environment: |
Lustre Build: https://build.hpdd.intel.com/job/lustre-b2_5/84/ |
| Issue Links: |
| Severity: | 3 |
| Rank (Obsolete): | 15416 |
| Description |
|
While testing patch http://review.whamcloud.com/11539 based on Lustre b2_5 build #84, unmounting mgs in sanity-lfsck test 0 hung:
20:00:58:Lustre: DEBUG MARKER: umount -d -f /mnt/mds1
20:00:58:LustreError: 137-5: lustre-MDT0000_UUID: not available for connect from 10.1.4.57@tcp (no target). If you are running an HA pair check that the target is mounted on the other server.
20:00:58:LustreError: Skipped 7 previous similar messages
20:00:58:LustreError: 166-1: MGC10.1.4.66@tcp: Connection to MGS (at 0@lo) was lost; in progress operations using this service will fail
20:00:58:Lustre: MGS is waiting for obd_unlinked_exports more than 8 seconds. The obd refcount = 5. Is it stuck?
20:00:58:Lustre: MGS is waiting for obd_unlinked_exports more than 16 seconds. The obd refcount = 5. Is it stuck?
20:00:58:Lustre: 20326:0:(client.c:1918:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1408672403/real 1408672403] req@ffff88007b36dc00 x1477087144042844/t0(0) o250->MGC10.1.4.66@tcp@0@lo:26/25 lens 400/544 e 0 to 1 dl 1408672419 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
20:00:58:Lustre: 20326:0:(client.c:1918:ptlrpc_expire_one_request()) Skipped 11 previous similar messages
20:00:58:Lustre: MGS is waiting for obd_unlinked_exports more than 32 seconds. The obd refcount = 5. Is it stuck?
20:00:58:Lustre: MGS is waiting for obd_unlinked_exports more than 64 seconds. The obd refcount = 5. Is it stuck?
20:00:58:LustreError: 137-5: lustre-MDT0000_UUID: not available for connect from 10.1.4.57@tcp (no target). If you are running an HA pair check that the target is mounted on the other server.
20:00:58:LustreError: Skipped 213 previous similar messages
20:00:58:INFO: task umount:16206 blocked for more than 120 seconds.
20:00:58: Not tainted 2.6.32-431.23.3.el6_lustre.g6035153.x86_64 #1
20:00:58:"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
20:00:58:umount D 0000000000000001 0 16206 16205 0x00000080
20:00:58: ffff880059079aa8 0000000000000082 0000000000000000 ffff88007b874400
20:00:58: ffffffffa0c34294 0000000000000000 ffff88006c2120c4 ffffffffa0c34294
20:00:58: ffff880060233af8 ffff880059079fd8 000000000000fbc8 ffff880060233af8
20:00:58:Call Trace:
20:00:58: [<ffffffff81529e92>] schedule_timeout+0x192/0x2e0
20:00:58: [<ffffffff81083f30>] ? process_timeout+0x0/0x10
20:00:58: [<ffffffffa0bb5e9b>] obd_exports_barrier+0xab/0x180 [obdclass]
20:00:58: [<ffffffffa16e152e>] mgs_device_fini+0xfe/0x580 [mgs]
20:00:58: [<ffffffffa0be19f3>] class_cleanup+0x573/0xd30 [obdclass]
20:00:58: [<ffffffffa0bb8036>] ? class_name2dev+0x56/0xe0 [obdclass]
20:00:58: [<ffffffffa0be371a>] class_process_config+0x156a/0x1ad0 [obdclass]
20:00:58: [<ffffffffa0bdc873>] ? lustre_cfg_new+0x2d3/0x6e0 [obdclass]
20:00:58: [<ffffffffa0be3df9>] class_manual_cleanup+0x179/0x6f0 [obdclass]
20:00:58: [<ffffffffa0bb8036>] ? class_name2dev+0x56/0xe0 [obdclass]
20:00:58: [<ffffffffa0c1f2dd>] server_put_super+0x45d/0xf60 [obdclass]
20:00:58: [<ffffffff8118b23b>] generic_shutdown_super+0x5b/0xe0
20:00:58: [<ffffffff8118b326>] kill_anon_super+0x16/0x60
20:00:58: [<ffffffffa0be5ca6>] lustre_kill_super+0x36/0x60 [obdclass]
20:00:58: [<ffffffff8118bac7>] deactivate_super+0x57/0x80
20:00:58: [<ffffffff811ab4cf>] mntput_no_expire+0xbf/0x110
20:00:58: [<ffffffff811ac01b>] sys_umount+0x7b/0x3a0
20:00:58: [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
20:00:58:Lustre: MGS is waiting for obd_unlinked_exports more than 128 seconds. The obd refcount = 5. Is it stuck?
Maloo report: https://testing.hpdd.intel.com/test_sets/37948628-29b7-11e4-8657-5254006e85c2 |
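For triaging similar console logs, the repeated "waiting for obd_unlinked_exports" warning above can be extracted mechanically. The following is a minimal sketch, not part of Lustre or its test framework: the regular expression is derived only from the message text as quoted in this ticket, and the helper names `barrier_warnings` and `refcount_stuck` are hypothetical. It collects each warning's wait time and obd refcount and flags the case seen here, where the refcount stays at the same value across successive warnings.

```python
import re

# Pattern taken from the warning text quoted in this ticket (an assumption:
# other Lustre versions may word the message differently).
BARRIER_RE = re.compile(
    r"(?P<obd>\S+) is waiting for obd_unlinked_exports more than "
    r"(?P<secs>\d+) seconds\. The obd refcount = (?P<ref>\d+)\."
)


def barrier_warnings(log_text):
    """Return (obd name, seconds waited, refcount) for each warning found."""
    return [
        (m.group("obd"), int(m.group("secs")), int(m.group("ref")))
        for m in BARRIER_RE.finditer(log_text)
    ]


def refcount_stuck(warnings):
    """True if there are at least two warnings and the refcount never changed."""
    refs = {ref for _, _, ref in warnings}
    return len(warnings) >= 2 and len(refs) == 1


# Sample lines copied from the console log in the description.
sample = """\
20:00:58:Lustre: MGS is waiting for obd_unlinked_exports more than 8 seconds. The obd refcount = 5. Is it stuck?
20:00:58:Lustre: MGS is waiting for obd_unlinked_exports more than 16 seconds. The obd refcount = 5. Is it stuck?
20:00:58:Lustre: MGS is waiting for obd_unlinked_exports more than 32 seconds. The obd refcount = 5. Is it stuck?
"""

warnings = barrier_warnings(sample)
print(warnings)
print(refcount_stuck(warnings))
```

A refcount that does not drop between warnings, as in both logs in this ticket (refcount = 5 throughout), suggests that the remaining references are never being released rather than being released slowly.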
| Comments |
| Comment by Jian Yu [ 24/Aug/14 ] |
|
Lustre client build: https://build.hpdd.intel.com/job/lustre-b2_5/84/
sanity-scrub test 1c also hit the same failure: |
| Comment by Jian Yu [ 14/Dec/14 ] |
|
While verifying patch http://review.whamcloud.com/13046 on Lustre b2_5 branch, recovery-small test 107 hit the same failure:
Lustre: DEBUG MARKER: umount -d /mnt/mds1
LustreError: 166-1: MGC10.1.4.101@tcp: Connection to MGS (at 0@lo) was lost; in progress operations using this service will fail
Lustre: MGS is waiting for obd_unlinked_exports more than 8 seconds. The obd refcount = 5. Is it stuck?
Lustre: MGS is waiting for obd_unlinked_exports more than 16 seconds. The obd refcount = 5. Is it stuck?
Lustre: MGS is waiting for obd_unlinked_exports more than 32 seconds. The obd refcount = 5. Is it stuck?
Lustre: MGS is waiting for obd_unlinked_exports more than 64 seconds. The obd refcount = 5. Is it stuck?
Lustre: MGS is waiting for obd_unlinked_exports more than 128 seconds. The obd refcount = 5. Is it stuck?
Lustre: 24854:0:(client.c:1940:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1418409787/real 1418409787] req@ffff880057dc1c00 x1487306634660884/t0(0) o250->MGC10.1.4.101@tcp@0@lo:26/25 lens 400/544 e 0 to 1 dl 1418409812 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
Lustre: 24854:0:(client.c:1940:ptlrpc_expire_one_request()) Skipped 17 previous similar messages
INFO: task umount:12953 blocked for more than 120 seconds.
Not tainted 2.6.32-431.29.2.el6_lustre.g6b22a20.x86_64 #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
umount D 0000000000000001 0 12953 12952 0x00000080
ffff880079a09aa8 0000000000000082 0000000000000000 ffff88006c40b800
ffffffffa0b39471 0000000000000000 ffff88006c349184 ffffffffa0b39471
ffff88006cd4b098 ffff880079a09fd8 000000000000fbc8 ffff88006cd4b098
Call Trace:
[<ffffffff8152a532>] schedule_timeout+0x192/0x2e0
[<ffffffff81083f30>] ? process_timeout+0x0/0x10
[<ffffffffa0ab9efb>] obd_exports_barrier+0xab/0x180 [obdclass]
[<ffffffffa170755e>] mgs_device_fini+0xfe/0x580 [mgs]
[<ffffffffa0ae6833>] class_cleanup+0x573/0xd30 [obdclass]
[<ffffffffa0abc096>] ? class_name2dev+0x56/0xe0 [obdclass]
[<ffffffffa0ae855a>] class_process_config+0x156a/0x1ad0 [obdclass]
[<ffffffffa09942f8>] ? libcfs_log_return+0x28/0x40 [libcfs]
[<ffffffffa0ae16f2>] ? lustre_cfg_new+0x312/0x6e0 [obdclass]
[<ffffffffa0ae8c39>] class_manual_cleanup+0x179/0x6f0 [obdclass]
[<ffffffffa0abc096>] ? class_name2dev+0x56/0xe0 [obdclass]
[<ffffffffa0b241ed>] server_put_super+0x45d/0xf60 [obdclass]
[<ffffffff8118b63b>] generic_shutdown_super+0x5b/0xe0
[<ffffffff8118b726>] kill_anon_super+0x16/0x60
[<ffffffffa0aeaae6>] lustre_kill_super+0x36/0x60 [obdclass]
[<ffffffff8118bec7>] deactivate_super+0x57/0x80
[<ffffffff811ab8cf>] mntput_no_expire+0xbf/0x110
[<ffffffff811ac41b>] sys_umount+0x7b/0x3a0
[<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
Maloo report: https://testing.hpdd.intel.com/test_sets/a1d5447c-82e1-11e4-9195-5254006e85c2 |
| Comment by Jian Yu [ 14/Dec/14 ] |
|
Lustre Build: https://build.hpdd.intel.com/job/lustre-b2_5/106/
replay-single test 35 hit the same failure: |
| Comment by Andreas Dilger [ 09/Jan/15 ] |
|
Closing as a duplicate of |