[LU-8500] lustre-2.7.2 hits MGS is waiting for obd_unlinked_exports more than 256 seconds. The obd refcount = 5. Is it stuck? Created: 15/Aug/16  Updated: 27/May/19  Resolved: 08/Sep/16

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.7.0
Fix Version/s: Lustre 2.9.0

Type: Bug Priority: Minor
Reporter: Nathan Dauchy (Inactive) Assignee: Hongchao Zhang
Resolution: Fixed Votes: 0
Labels: None

Issue Links:
Duplicate
Related
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

We hit what looks like LU-4772 ("MGS is waiting for obd_unlinked_exports") while running 2.7.2. This is likely a duplicate, but I'm opening it for tracking so the patch gets backported to the FE branch for NASA.

Aug 15 09:19:12 nbp1-mds kernel: INFO: task umount:37486 blocked for more than 120 seconds.
Aug 15 09:19:12 nbp1-mds kernel: Tainted: G           -- ------------  T 2.6.32-573.26.1.el6.20160517.x86_64.lustre272 #1
Aug 15 09:19:13 nbp1-mds kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Aug 15 09:19:13 nbp1-mds kernel: umount        D 0000000000000003     0 37486      1 0x00000080
Aug 15 09:19:13 nbp1-mds kernel: ffff881f07cabab8 0000000000000086 0000000000000000 ffff881f07caba58
Aug 15 09:19:13 nbp1-mds kernel: ffff881f07caba18 ffff883e20b05400 001305aa0b19e806 0000000000000000
Aug 15 09:19:13 nbp1-mds kernel: ffff881f0b3ece4c 000000023ecf1493 ffff881e28d7d068 ffff881f07cabfd8
Aug 15 09:19:13 nbp1-mds kernel: Call Trace:
Aug 15 09:19:13 nbp1-mds kernel: [<ffffffff81574ce2>] schedule_timeout+0x192/0x2e0
Aug 15 09:19:13 nbp1-mds kernel: [<ffffffff810892c0>] ? process_timeout+0x0/0x10
Aug 15 09:19:13 nbp1-mds kernel: [<ffffffffa05b2296>] obd_exports_barrier+0xb6/0x190 [obdclass]
Aug 15 09:19:13 nbp1-mds kernel: [<ffffffffa0e7ea14>] mgs_device_fini+0x134/0x5b0 [mgs]
Aug 15 09:19:14 nbp1-mds kernel: [<ffffffffa05d7122>] class_cleanup+0x562/0xd20 [obdclass]
Aug 15 09:19:14 nbp1-mds kernel: [<ffffffffa05b4216>] ? class_name2dev+0x56/0xe0 [obdclass]
Aug 15 09:19:14 nbp1-mds kernel: [<ffffffffa05d8e4a>] class_process_config+0x156a/0x1ad0 [obdclass]
Aug 15 09:19:14 nbp1-mds kernel: [<ffffffffa05d1205>] ? lustre_cfg_new+0x435/0x630 [obdclass]
Aug 15 09:19:14 nbp1-mds kernel: [<ffffffffa05d9525>] class_manual_cleanup+0x175/0x4c0 [obdclass]
Aug 15 09:19:14 nbp1-mds kernel: [<ffffffffa05b4216>] ? class_name2dev+0x56/0xe0 [obdclass]
Aug 15 09:19:14 nbp1-mds kernel: [<ffffffffa061827f>] server_put_super+0x9df/0x1060 [obdclass]
Aug 15 09:19:14 nbp1-mds kernel: [<ffffffff811ad166>] ? invalidate_inodes+0xf6/0x190
Aug 15 09:19:14 nbp1-mds kernel: [<ffffffff8119127b>] generic_shutdown_super+0x5b/0xe0
Aug 15 09:19:14 nbp1-mds kernel: [<ffffffff81191366>] kill_anon_super+0x16/0x60
Aug 15 09:19:14 nbp1-mds kernel: [<ffffffffa05db0b6>] lustre_kill_super+0x36/0x60 [obdclass]
Aug 15 09:19:14 nbp1-mds kernel: [<ffffffff81191b07>] deactivate_super+0x57/0x80
Aug 15 09:19:14 nbp1-mds kernel: [<ffffffff811b1acf>] mntput_no_expire+0xbf/0x110
Aug 15 09:19:14 nbp1-mds kernel: [<ffffffff811b261b>] sys_umount+0x7b/0x3a0
Aug 15 09:19:14 nbp1-mds kernel: [<ffffffff8100b0d2>] system_call_fastpath+0x16/0x1b
Aug 15 09:20:03 nbp1-mds kernel: Lustre: MGS is waiting for obd_unlinked_exports more than 512 seconds. The obd refcount = 5. Is it stuck?
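
For context, the "waiting for obd_unlinked_exports" console message comes from obd_exports_barrier(), which the umount path calls to wait for all disconnected ("unlinked") exports to drop their last reference before the device can be torn down. Below is a rough, simplified userspace sketch of that waiting pattern; the variable names, counters, and the 256-second warning threshold are illustrative assumptions, not the actual Lustre kernel source. If a stale export reference is never released, the loop never exits and umount hangs, which matches the stack trace above.

/*
 * Simplified, userspace-only sketch of the exponential-backoff wait in
 * obd_exports_barrier(); NOT the actual Lustre source.  The counters and
 * the warning threshold are assumptions for illustration only.
 */
#include <stdio.h>
#include <unistd.h>

/* Stand-ins for obd->obd_unlinked_exports (disconnected-but-unreleased
 * exports) and obd->obd_refcount.  Nothing ever decrements them here,
 * which is exactly the "stuck" scenario reported in this ticket. */
static int nr_unlinked_exports = 5;
static int obd_refcount = 5;

static void exports_barrier(const char *obd_name)
{
	int waited = 2;		/* seconds; doubled on every pass */

	while (nr_unlinked_exports > 0) {
		sleep(waited);
		if (waited >= 256)	/* warnings show up at 256s, 512s, ... */
			printf("%s is waiting for obd_unlinked_exports more than "
			       "%d seconds. The obd refcount = %d. Is it stuck?\n",
			       obd_name, waited, obd_refcount);
		waited *= 2;
	}
}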


 Comments   
Comment by Nathan Dauchy (Inactive) [ 15/Aug/16 ]

Note that I saw similar messages (one with a stack trace, one without) on other servers. The difference is that these are for the MDT, not the MGS:

Aug 15 09:22:09 nbp6-mds kernel: Lustre: nbp6-MDT0000 is waiting for obd_unlinked_exports more than 256 seconds. The obd refcount = 9. Is it stuck?
Aug 15 10:20:15 nbp8-mds1 kernel: INFO: task umount:55640 blocked for more than 120 seconds.
Aug 15 10:20:15 nbp8-mds1 kernel: Not tainted 2.6.32-504.30.3.el6.20151008.x86_64.lustre271 #1
Aug 15 10:20:16 nbp8-mds1 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Aug 15 10:20:16 nbp8-mds1 kernel: umount        D 0000000000000002     0 55640  55625 0x00000080
Aug 15 10:20:16 nbp8-mds1 kernel: ffff880935d2da78 0000000000000082 0000000000000000 ffff880935d2da18
Aug 15 10:20:16 nbp8-mds1 kernel: ffff880935d2d9d8 ffff883ddf90bc00 001a700794410389 0000000000000000
Aug 15 10:20:16 nbp8-mds1 kernel: ffff883f224ace8c 00000002bb51a25e ffff880e76565ad0 ffff880935d2dfd8
Aug 15 10:20:16 nbp8-mds1 kernel: Call Trace:
Aug 15 10:20:16 nbp8-mds1 kernel: [<ffffffff81562c32>] schedule_timeout+0x192/0x2e0
Aug 15 10:20:16 nbp8-mds1 kernel: [<ffffffff81086900>] ? process_timeout+0x0/0x10
Aug 15 10:20:16 nbp8-mds1 kernel: [<ffffffffa05a8296>] obd_exports_barrier+0xb6/0x190 [obdclass]
Aug 15 10:20:16 nbp8-mds1 kernel: [<ffffffffa0f9f2c2>] mdt_device_fini+0x642/0x1010 [mdt]
Aug 15 10:20:16 nbp8-mds1 kernel: [<ffffffffa05ad406>] ? class_disconnect_exports+0x116/0x2f0 [obdclass]
Aug 15 10:20:16 nbp8-mds1 kernel: [<ffffffffa05cd122>] class_cleanup+0x562/0xd20 [obdclass]
Aug 15 10:20:16 nbp8-mds1 kernel: [<ffffffffa05aa216>] ? class_name2dev+0x56/0xe0 [obdclass]
Aug 15 10:20:16 nbp8-mds1 kernel: [<ffffffffa05cee4a>] class_process_config+0x156a/0x1ad0 [obdclass]
Aug 15 10:20:16 nbp8-mds1 kernel: [<ffffffffa05c7205>] ? lustre_cfg_new+0x435/0x630 [obdclass]
Aug 15 10:20:16 nbp8-mds1 kernel: [<ffffffffa05cf525>] class_manual_cleanup+0x175/0x4c0 [obdclass]
Aug 15 10:20:16 nbp8-mds1 kernel: [<ffffffffa05aa216>] ? class_name2dev+0x56/0xe0 [obdclass]
Aug 15 10:20:16 nbp8-mds1 kernel: [<ffffffffa060e597>] server_put_super+0xcf7/0x1060 [obdclass]
Aug 15 10:20:16 nbp8-mds1 kernel: [<ffffffff811a67a6>] ? invalidate_inodes+0xf6/0x190
Aug 15 10:20:16 nbp8-mds1 kernel: [<ffffffff8118ab8b>] generic_shutdown_super+0x5b/0xe0
Aug 15 10:20:17 nbp8-mds1 kernel: [<ffffffff8118ac76>] kill_anon_super+0x16/0x60
Aug 15 10:20:17 nbp8-mds1 kernel: [<ffffffffa05d10b6>] lustre_kill_super+0x36/0x60 [obdclass]
Aug 15 10:20:17 nbp8-mds1 kernel: [<ffffffff8118b417>] deactivate_super+0x57/0x80
Aug 15 10:20:17 nbp8-mds1 kernel: [<ffffffff811ab11f>] mntput_no_expire+0xbf/0x110
Aug 15 10:20:17 nbp8-mds1 kernel: [<ffffffff811abc6b>] sys_umount+0x7b/0x3a0
Aug 15 10:20:17 nbp8-mds1 kernel: [<ffffffff8100b0d2>] system_call_fastpath+0x16/0x1b
Aug 15 10:21:46 nbp8-mds1 kernel: Lustre: nbp8-MDT0000: Not available for connect from 10.151.55.23@o2ib (stopping)
Aug 15 10:21:46 nbp8-mds1 kernel: Lustre: Skipped 47513 previous similar messages
Aug 15 10:22:06 nbp8-mds1 kernel: Lustre: nbp8-MDT0000 is waiting for obd_unlinked_exports more than 256 seconds. The obd refcount = 7. Is it stuck?

Please confirm whether the fix in LU-4772 addresses both the MGS and MDT.

Comment by Peter Jones [ 15/Aug/16 ]

Nathan

The LU-4772 fix is already in the branch you are running, so it seems like this must be something new.

Hongchao

What do you suggest here?

Peter

Comment by Hongchao Zhang [ 16/Aug/16 ]

Hi Nathan,

Is the debug log available from when the MDT was being unmounted? The exports are dumped in that case, which will help find the cause of the issue.
Thanks!

Comment by Nathan Dauchy (Inactive) [ 16/Aug/16 ]

I'm afraid I don't have additional debugging info... I was in a rush to get things shut down so we could complete server and Lustre version upgrades during a maintenance window, so I just power-cycled the servers that hit this issue.

Comment by Nathan Dauchy (Inactive) [ 16/Aug/16 ]

Peter,

According to Jay, the lustre-2.7.2-1.1nasS_mofed32v1 build (what was running on "nbp1-mds" above) does indeed have the LU-4772 fix.

However, the lustre-2.7.1-5.1nasS_mofed31v5 build (what was on "nbp8-mds1" above) did NOT have the fix. I presume neither did the "lustre-2.5.3-6nasS_mofed31v5" version running on "nbp6-mds" above. Therefore, the issues for the latter two (regarding MDT) can be ignored for now as we assume the LU-4772 fix will prevent those in the future.

Only the first stack trace on nbp1-mds is of interest then.

Thanks!

Comment by Peter Jones [ 17/Aug/16 ]

Hongchao

What is your assessment of the first stack trace?

Peter

Comment by Hongchao Zhang [ 18/Aug/16 ]

I'm still investigating the first stack trace and haven't found the cause of the problem yet.
I'm trying to create a debug patch to help locate the problem; it's a little complicated and will need more time.

Comment by Hongchao Zhang [ 19/Aug/16 ]

By inspecting the code, I found some cases that leak the export reference, which could be related to this issue:
http://review.whamcloud.com/#/c/22021/
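
For reference, here is a minimal sketch of the kind of export reference leak such a fix targets; the handle_request() helper and its error path are hypothetical, not the code changed by the patch above. A class_export_get() that is not balanced by a class_export_put() on every exit path leaves the export refcount elevated, so the export never leaves obd_unlinked_exports and obd_exports_barrier() waits forever during umount.

/*
 * Hypothetical illustration of an export reference leak (not the actual
 * patched code).  class_export_get()/class_export_put() are the real
 * obdclass reference helpers; everything else is made up.
 */
#include <errno.h>

struct obd_export;
struct obd_export *class_export_get(struct obd_export *exp);
void class_export_put(struct obd_export *exp);

static int handle_request(struct obd_export *exp, int bad_input)
{
	int rc = 0;

	class_export_get(exp);		/* take a reference for this request */

	if (bad_input) {
		rc = -EINVAL;
		goto out;		/* BUG: error path skips the put below */
	}

	/* ... process the request ... */

	class_export_put(exp);		/* released only on the success path */
out:
	return rc;
}

The fix for a leak of this shape is simply to drop the reference on every exit path, for example by moving the class_export_put() above the out: label.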

Comment by Peter Jones [ 19/Aug/16 ]

Hongchao

I recommend fixing this on master first, then backporting to 2.7 FE for NASA to use

Regards

Peter

Comment by Gerrit Updater [ 19/Aug/16 ]

Hongchao Zhang (hongchao.zhang@intel.com) uploaded a new patch: http://review.whamcloud.com/22031
Subject: LU-8500 ldlm: fix export reference problem
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: a8fc617b0edbd195d22da21f30370fa4b33e74c1

Comment by James A Simmons [ 23/Aug/16 ]

We just hit this on 2.8. Can't unmount our OSTs.

Comment by Gerrit Updater [ 02/Sep/16 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/22031/
Subject: LU-8500 ldlm: fix export reference problem
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 108339f1543fb006f4ddd16830e7266df0b46723

Comment by Nathan Dauchy (Inactive) [ 02/Sep/16 ]

Oleg, thanks for landing.

Hongchao, Peter, now we just need the patch finalized, reviewed, and landed for the 2.7 FE branch.

Comment by James A Simmons [ 02/Sep/16 ]

We need one for b2_8_fe as well.

Comment by Jian Yu [ 03/Sep/16 ]

Here are the back-ported patches for:
Lustre b2_7_fe branch: http://review.whamcloud.com/22289
Lustre b2_8_fe branch: http://review.whamcloud.com/22302

Comment by Peter Jones [ 08/Sep/16 ]

Master fix landed for 2.9; landings to maintenance branches are tracked separately.

Comment by Gerrit Updater [ 02/Feb/18 ]

Hongchao Zhang (hongchao.zhang@intel.com) uploaded a new patch: https://review.whamcloud.com/31139
Subject: LU-8500 ldlm: fix export reference
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 38a66a3cf18cb15fe98823019d925f9717ec38fa
