[LU-15453] MDT shutdown hangs on mutex_lock, possibly cld_lock Created: 14/Jan/22  Updated: 22/Jul/22

Status: Open
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Major
Reporter: Olaf Faaland Assignee: Mikhail Pershin
Resolution: Unresolved Votes: 0
Labels: llnl
Environment:

lustre-2.12.7_2.llnl-2.ch6.x86_64
zfs-0.7.11-9.8llnl.ch6.x86_64
3.10.0-1160.45.1.1chaos.ch6.x86_64


Attachments: Text File bt.a.txt     Text File foreach.bt.txt    
Issue Links:
Related
is related to LU-14026 symptoms of message loss or corruptio... Open
is related to LU-13356 lctl conf_param hung on the MGS node Resolved
is related to LU-15234 LNet high peer reference counts incon... Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

LNet issues (see LU-15234 and LU-14026) cause clients and Lustre servers to report, via console logs, that they have lost their connection to the MGS.

We are working on solving the LNet issues, but this may also be revealing error-path issues that should be fixed.

MDT0, which usually runs on the same server as the MGS, is one of the targets that reports a lost connection (the MDT and MGS are separate devices, stored in distinct datasets and started/stopped separately):

MGC172.19.3.98@o2ib600: Connection to MGS (at 0@lo) was lost 

Attempting to shut down the MDT hangs, with this stack reported by the watchdog:

 schedule_preempt_disabled+0x39/0x90
 __mutex_lock_slowpath+0x10f/0x250
 mutex_lock+0x32/0x42
 mgc_process_config+0x21a/0x1420 [mgc]
 obd_process_config.constprop.14+0x75/0x210 [obdclass]
 ? lprocfs_counter_add+0xf9/0x160 [obdclass]
 lustre_end_log+0x1ff/0x550 [obdclass]
 server_put_super+0x82e/0xd00 [obdclass]
 generic_shutdown_super+0x6d/0x110
 kill_anon_super+0x12/0x20
 lustre_kill_super+0x32/0x50 [obdclass]
 deactivate_locked_super+0x4e/0x70
 deactivate_super+0x46/0x60
 cleanup_mnt+0x3f/0x80
 __cleanup_mnt+0x12/0x20
 task_work_run+0xbb/0xf0
 do_notify_resume+0xa5/0xc0
 int_signal+0x12/0x17

The server was crashed and a dump was collected.  The stacks of the umount process and the ll_cfg_requeue process both contain pointers to the "ls1-mdtir" config_llog_data structure; I believe cld->cld_lock is held by ll_cfg_requeue and umount is waiting on it.

PID: 4504   TASK: ffff8e8c9edc8000  CPU: 24  COMMAND: "ll_cfg_requeue"
 #0 [ffff8e8ac474f970] __schedule at ffffffff9d3b6788
 #1 [ffff8e8ac474f9d8] schedule at ffffffff9d3b6ce9
 #2 [ffff8e8ac474f9e8] schedule_timeout at ffffffff9d3b4528
 #3 [ffff8e8ac474fa98] ldlm_completion_ast at ffffffffc14ac650 [ptlrpc]
 #4 [ffff8e8ac474fb40] ldlm_cli_enqueue_fini at ffffffffc14ae83f [ptlrpc]
 #5 [ffff8e8ac474fbf0] ldlm_cli_enqueue at ffffffffc14b10d1 [ptlrpc]
 #6 [ffff8e8ac474fca8] mgc_enqueue at ffffffffc0fb94cf [mgc]
 #7 [ffff8e8ac474fd70] mgc_process_log at ffffffffc0fbf393 [mgc]
 #8 [ffff8e8ac474fe30] mgc_requeue_thread at ffffffffc0fc1b10 [mgc]
 #9 [ffff8e8ac474fec8] kthread at ffffffff9cccb221

I can provide console logs and the crash dump.  I do not have Lustre debug logs.



 Comments   
Comment by Olaf Faaland [ 14/Jan/22 ]

For my records, my internal ticket is TOSS5512

Comment by Peter Jones [ 17/Jan/22 ]

Serguei

Could you please advise

Thanks

Peter

Comment by Peter Jones [ 17/Jan/22 ]

Actually, perhaps Mike is a more appropriate candidate...

Comment by Andreas Dilger [ 17/Jan/22 ]

Hi Olaf, could you please attach the stack traces of running processes at the time of the hang ("bt" from the crashdump).

Comment by Olaf Faaland [ 19/Jan/22 ]

Hi, sorry for the delay.  I've attached:

  • "bt -a" output in bt.a.txt (stack traces of the active task on each CPU)
  • "foreach bt" output in foreach.bt.txt (stack traces of all processes)

Comment by Mikhail Pershin [ 20/Jan/22 ]

The symptoms remind me of ticket LU-13356; the related patch has not yet landed in b2_12: https://review.whamcloud.com/41309

Another thought is LU-15020, which is about waiting for OST_DISCONNECT, but the first one looks closer to what we have here.

Comment by Olaf Faaland [ 24/Jan/22 ]

Hi Mikhail,

Yes, it does look a lot like LU-13356.  I see Etienne's comment about change #41309 removing interop support with v2.2 clients and servers, and that the patch therefore cannot be landed to b2_12. 

  1. At our site, we have only Lustre 2.10.8 routers and Lustre 2.12.8/2.14 clients, servers, and routers.  We do not have v2.2 running anywhere.  Can we safely add that patch to our stack?  It would be useful to hear back about this today, if possible.
  2. If change #41309 cannot be landed to b2_12, what are some other options?  This question is not as urgent.
  3. If we see this symptom again before we have any patches landed to address it, is there other information I can gather that would help confirm this theory?

thanks

Comment by Mikhail Pershin [ 25/Jan/22 ]

Olaf, the patch can be added to your stack if there is no need for 2.2 interop. As for question #2 - do you mean whether there will be an alternative solution in b2_12?

As for other information to collect, it seems we can only rely on symptoms here, since the related code has no debug messages directly connected with this situation.

Comment by Olaf Faaland [ 25/Jan/22 ]

Mikhail,

> As for question #2 - do you mean will there be an alternative solution in b2_12?

Yes, that was my question.

thanks!

Comment by Stephane Thiell [ 25/Jan/22 ]

Honestly, it is a bit ridiculous not to land change 41309 to b2_12 at this time because of a compat issue with old Lustre 2.2. Without this patch, the MGS on 2.12.x is not stable, even in a full 2.12 environment. We have patched all our clients and servers with it (we're running 2.12.x everywhere now, mostly 2.12.7, and are now deploying 2.12.8, which also requires patching). Just saying.

Comment by Andreas Dilger [ 25/Jan/22 ]

sthiell, I don't think anyone is against landing 41309 on b2_12 because of 2.2 interop, just that it hasn't landed yet.

Comment by Stephane Thiell [ 26/Jan/22 ]

OK! Thanks Andreas!

Comment by Olaf Faaland [ 03/Feb/22 ]

Stephane,

Do you have any other patches in your stack related to recovery?

thanks

Comment by Stephane Thiell [ 04/Feb/22 ]

Hi Olaf,

I don't think so. Our servers are running 2.12.7 with:

  • LU-13356 client: don't use OBD_CONNECT_MNE_SWAB (41309)
  • LU-14688 mdt: changelog purge deletes plain llog (43990)

Our clients are now slowly moving to 2.12.8 + LU-13356
