[LU-8500] lustre-2.7.2 hits MGS is waiting for obd_unlinked_exports more than 256 seconds. The obd refcount = 5. Is it stuck? Created: 15/Aug/16 Updated: 27/May/19 Resolved: 08/Sep/16 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.7.0 |
| Fix Version/s: | Lustre 2.9.0 |
| Type: | Bug | Priority: | Minor |
| Reporter: | Nathan Dauchy (Inactive) | Assignee: | Hongchao Zhang |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None |
| Issue Links: |
|
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
Hit what looks like a stuck umount:

Aug 15 09:19:12 nbp1-mds kernel: INFO: task umount:37486 blocked for more than 120 seconds.
Aug 15 09:19:12 nbp1-mds kernel: Tainted: G -- ------------ T 2.6.32-573.26.1.el6.20160517.x86_64.lustre272 #1
Aug 15 09:19:13 nbp1-mds kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Aug 15 09:19:13 nbp1-mds kernel: umount D 0000000000000003 0 37486 1 0x00000080
Aug 15 09:19:13 nbp1-mds kernel: ffff881f07cabab8 0000000000000086 0000000000000000 ffff881f07caba58
Aug 15 09:19:13 nbp1-mds kernel: ffff881f07caba18 ffff883e20b05400 001305aa0b19e806 0000000000000000
Aug 15 09:19:13 nbp1-mds kernel: ffff881f0b3ece4c 000000023ecf1493 ffff881e28d7d068 ffff881f07cabfd8
Aug 15 09:19:13 nbp1-mds kernel: Call Trace:
Aug 15 09:19:13 nbp1-mds kernel: [<ffffffff81574ce2>] schedule_timeout+0x192/0x2e0
Aug 15 09:19:13 nbp1-mds kernel: [<ffffffff810892c0>] ? process_timeout+0x0/0x10
Aug 15 09:19:13 nbp1-mds kernel: [<ffffffffa05b2296>] obd_exports_barrier+0xb6/0x190 [obdclass]
Aug 15 09:19:13 nbp1-mds kernel: [<ffffffffa0e7ea14>] mgs_device_fini+0x134/0x5b0 [mgs]
Aug 15 09:19:14 nbp1-mds kernel: [<ffffffffa05d7122>] class_cleanup+0x562/0xd20 [obdclass]
Aug 15 09:19:14 nbp1-mds kernel: [<ffffffffa05b4216>] ? class_name2dev+0x56/0xe0 [obdclass]
Aug 15 09:19:14 nbp1-mds kernel: [<ffffffffa05d8e4a>] class_process_config+0x156a/0x1ad0 [obdclass]
Aug 15 09:19:14 nbp1-mds kernel: [<ffffffffa05d1205>] ? lustre_cfg_new+0x435/0x630 [obdclass]
Aug 15 09:19:14 nbp1-mds kernel: [<ffffffffa05d9525>] class_manual_cleanup+0x175/0x4c0 [obdclass]
Aug 15 09:19:14 nbp1-mds kernel: [<ffffffffa05b4216>] ? class_name2dev+0x56/0xe0 [obdclass]
Aug 15 09:19:14 nbp1-mds kernel: [<ffffffffa061827f>] server_put_super+0x9df/0x1060 [obdclass]
Aug 15 09:19:14 nbp1-mds kernel: [<ffffffff811ad166>] ? invalidate_inodes+0xf6/0x190
Aug 15 09:19:14 nbp1-mds kernel: [<ffffffff8119127b>] generic_shutdown_super+0x5b/0xe0
Aug 15 09:19:14 nbp1-mds kernel: [<ffffffff81191366>] kill_anon_super+0x16/0x60
Aug 15 09:19:14 nbp1-mds kernel: [<ffffffffa05db0b6>] lustre_kill_super+0x36/0x60 [obdclass]
Aug 15 09:19:14 nbp1-mds kernel: [<ffffffff81191b07>] deactivate_super+0x57/0x80
Aug 15 09:19:14 nbp1-mds kernel: [<ffffffff811b1acf>] mntput_no_expire+0xbf/0x110
Aug 15 09:19:14 nbp1-mds kernel: [<ffffffff811b261b>] sys_umount+0x7b/0x3a0
Aug 15 09:19:14 nbp1-mds kernel: [<ffffffff8100b0d2>] system_call_fastpath+0x16/0x1b
Aug 15 09:20:03 nbp1-mds kernel: Lustre: MGS is waiting for obd_unlinked_exports more than 512 seconds. The obd refcount = 5. Is it stuck? |
| Comments |
| Comment by Nathan Dauchy (Inactive) [ 15/Aug/16 ] |
|
Note that I saw similar messages (one with a stack trace, one without) on other servers. The difference is that these are for the MDT, not the MGS:

Aug 15 09:22:09 nbp6-mds kernel: Lustre: nbp6-MDT0000 is waiting for obd_unlinked_exports more than 256 seconds. The obd refcount = 9. Is it stuck?
Aug 15 10:20:15 nbp8-mds1 kernel: INFO: task umount:55640 blocked for more than 120 seconds.
Aug 15 10:20:15 nbp8-mds1 kernel: Not tainted 2.6.32-504.30.3.el6.20151008.x86_64.lustre271 #1
Aug 15 10:20:16 nbp8-mds1 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Aug 15 10:20:16 nbp8-mds1 kernel: umount D 0000000000000002 0 55640 55625 0x00000080
Aug 15 10:20:16 nbp8-mds1 kernel: ffff880935d2da78 0000000000000082 0000000000000000 ffff880935d2da18
Aug 15 10:20:16 nbp8-mds1 kernel: ffff880935d2d9d8 ffff883ddf90bc00 001a700794410389 0000000000000000
Aug 15 10:20:16 nbp8-mds1 kernel: ffff883f224ace8c 00000002bb51a25e ffff880e76565ad0 ffff880935d2dfd8
Aug 15 10:20:16 nbp8-mds1 kernel: Call Trace:
Aug 15 10:20:16 nbp8-mds1 kernel: [<ffffffff81562c32>] schedule_timeout+0x192/0x2e0
Aug 15 10:20:16 nbp8-mds1 kernel: [<ffffffff81086900>] ? process_timeout+0x0/0x10
Aug 15 10:20:16 nbp8-mds1 kernel: [<ffffffffa05a8296>] obd_exports_barrier+0xb6/0x190 [obdclass]
Aug 15 10:20:16 nbp8-mds1 kernel: [<ffffffffa0f9f2c2>] mdt_device_fini+0x642/0x1010 [mdt]
Aug 15 10:20:16 nbp8-mds1 kernel: [<ffffffffa05ad406>] ? class_disconnect_exports+0x116/0x2f0 [obdclass]
Aug 15 10:20:16 nbp8-mds1 kernel: [<ffffffffa05cd122>] class_cleanup+0x562/0xd20 [obdclass]
Aug 15 10:20:16 nbp8-mds1 kernel: [<ffffffffa05aa216>] ? class_name2dev+0x56/0xe0 [obdclass]
Aug 15 10:20:16 nbp8-mds1 kernel: [<ffffffffa05cee4a>] class_process_config+0x156a/0x1ad0 [obdclass]
Aug 15 10:20:16 nbp8-mds1 kernel: [<ffffffffa05c7205>] ? lustre_cfg_new+0x435/0x630 [obdclass]
Aug 15 10:20:16 nbp8-mds1 kernel: [<ffffffffa05cf525>] class_manual_cleanup+0x175/0x4c0 [obdclass]
Aug 15 10:20:16 nbp8-mds1 kernel: [<ffffffffa05aa216>] ? class_name2dev+0x56/0xe0 [obdclass]
Aug 15 10:20:16 nbp8-mds1 kernel: [<ffffffffa060e597>] server_put_super+0xcf7/0x1060 [obdclass]
Aug 15 10:20:16 nbp8-mds1 kernel: [<ffffffff811a67a6>] ? invalidate_inodes+0xf6/0x190
Aug 15 10:20:16 nbp8-mds1 kernel: [<ffffffff8118ab8b>] generic_shutdown_super+0x5b/0xe0
Aug 15 10:20:17 nbp8-mds1 kernel: [<ffffffff8118ac76>] kill_anon_super+0x16/0x60
Aug 15 10:20:17 nbp8-mds1 kernel: [<ffffffffa05d10b6>] lustre_kill_super+0x36/0x60 [obdclass]
Aug 15 10:20:17 nbp8-mds1 kernel: [<ffffffff8118b417>] deactivate_super+0x57/0x80
Aug 15 10:20:17 nbp8-mds1 kernel: [<ffffffff811ab11f>] mntput_no_expire+0xbf/0x110
Aug 15 10:20:17 nbp8-mds1 kernel: [<ffffffff811abc6b>] sys_umount+0x7b/0x3a0
Aug 15 10:20:17 nbp8-mds1 kernel: [<ffffffff8100b0d2>] system_call_fastpath+0x16/0x1b
Aug 15 10:21:46 nbp8-mds1 kernel: Lustre: nbp8-MDT0000: Not available for connect from 10.151.55.23@o2ib (stopping)
Aug 15 10:21:46 nbp8-mds1 kernel: Lustre: Skipped 47513 previous similar messages
Aug 15 10:22:06 nbp8-mds1 kernel: Lustre: nbp8-MDT0000 is waiting for obd_unlinked_exports more than 256 seconds. The obd refcount = 7. Is it stuck?

Please confirm whether the fix in |
| Comment by Peter Jones [ 15/Aug/16 ] |
|
Nathan, Hongchao, what do you suggest here? Peter |
| Comment by Hongchao Zhang [ 16/Aug/16 ] |
|
Hi Nathan, is a Lustre debug log available from when the MDT was being unmounted? In this case it will contain a dump of the exports, which will help to find the cause of the issue. |
| Comment by Nathan Dauchy (Inactive) [ 16/Aug/16 ] |
|
I'm afraid I don't have additional debugging info... I was in a rush to get things shut down so we could complete server and Lustre version upgrades during a maintenance window, so I just power-cycled the servers that hit this issue. |
| Comment by Nathan Dauchy (Inactive) [ 16/Aug/16 ] |
|
Peter, according to Jay, the lustre-2.7.2-1.1nasS_mofed32v1 build (what was running on "nbp1-mds" above) does indeed have the fix. However, the lustre-2.7.1-5.1nasS_mofed31v5 build (what was on "nbp8-mds1" above) did NOT have it. I presume neither did the "lustre-2.5.3-6nasS_mofed31v5" version running on "nbp6-mds" above. Therefore, the issues for the latter two (regarding the MDT) can be ignored for now, as we assume the fix addresses them. Only the first stack trace on nbp1-mds is of interest then. Thanks! |
| Comment by Peter Jones [ 17/Aug/16 ] |
|
Hongchao What is your assessment of the first stack trace? Peter |
| Comment by Hongchao Zhang [ 18/Aug/16 ] |
|
I'm still investigating the first stack trace and have not found the cause of the problem yet. |
| Comment by Hongchao Zhang [ 19/Aug/16 ] |
|
While reviewing the relevant code paths, I found some cases that leak the export reference, which could be related to this issue. |
| Comment by Peter Jones [ 19/Aug/16 ] |
|
Hongchao, I recommend fixing this on master first, then backporting to 2.7 FE for NASA to use. Regards, Peter |
| Comment by Gerrit Updater [ 19/Aug/16 ] |
|
Hongchao Zhang (hongchao.zhang@intel.com) uploaded a new patch: http://review.whamcloud.com/22031 |
| Comment by James A Simmons [ 23/Aug/16 ] |
|
We just hit this on 2.8. Can't unmount our OSTs. |
| Comment by Gerrit Updater [ 02/Sep/16 ] |
|
Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/22031/ |
| Comment by Nathan Dauchy (Inactive) [ 02/Sep/16 ] |
|
Oleg, thanks for landing the patch. Hongchao, Peter, now we just need the patch for the 2.7 FE branch finalized, reviewed, and landed. |
| Comment by James A Simmons [ 02/Sep/16 ] |
|
We need one for b2_8_fe as well. |
| Comment by Jian Yu [ 03/Sep/16 ] |
|
Here are the back-ported patches for: |
| Comment by Peter Jones [ 08/Sep/16 ] |
|
Master fix landed for 2.9; landings to maintenance branches are tracked separately. |
| Comment by Gerrit Updater [ 02/Feb/18 ] |
|
Hongchao Zhang (hongchao.zhang@intel.com) uploaded a new patch: https://review.whamcloud.com/31139 |