Details
Type: Bug
Resolution: Fixed
Priority: Minor
Description
If multiple targets are being mounted on the same server and one of them gets stuck accessing the MGS (or is otherwise delayed during setup), a stack trace like the following may be dumped:
INFO: task mount.lustre:93138 blocked for more than 90 seconds.
      "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
task:mount.lustre state:D stack:0 pid:93138 ppid:93135 flags:0x00004082
Call Trace:
 __schedule+0x2d1/0x870
 schedule+0x55/0xf0
 schedule_preempt_disabled+0xa/0x10
 __mutex_lock.isra.11+0x349/0x420
 mgc_fs_setup.isra.12+0x65/0x7a0 [mgc]
 mgc_set_info_async+0x99f/0xb30 [mgc]
 server_start_targets+0x452/0x2c30 [obdclass]
 server_fill_super+0x94e/0x10a0 [obdclass]
 lustre_fill_super+0x388/0x3d0 [lustre]
 mount_nodev+0x49/0xa0
 legacy_get_tree+0x27/0x50
 vfs_get_tree+0x25/0xc0
 do_mount+0x2e9/0x950
 ksys_mount+0xbe/0xe0
This is a fairly common occurrence in different situations and should be improved in a few ways:
- the mutex_lock(&cli->cl_mgc_mutex) in mgc_fs_setup() should be interruptible, so that a stuck mount can be killed without rebooting the server (a sketch follows this list)
- the cl_mgc_mutex is held for the full duration of the target's llog transfer from the MGS, blocking all other target mounts. This is ostensibly because the MGC is "attached" to a local target's CONFIG directory and cannot process writes for different targets at the same time. It would be much better if this code were restructured so that targets could fetch and store their llogs from the MGS concurrently. Fault Tolerant MGS (LU-17819) would not help here, since the serialization is on the local node and there would still be only a single MGC service for each "client".
- we could consider fetching the llog with bulk transfers instead of llog operations, but that might make the code messy
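For the first item, a minimal sketch of what the interruptible lock could look like, assuming the lock is taken near the top of mgc_fs_setup() as the stack trace suggests. The function below is a simplified, hypothetical stand-in (not copied from the Lustre tree); struct client_obd and cl_mgc_mutex refer to the existing Lustre definitions:

#include <linux/mutex.h>
#include <linux/errno.h>
/* struct client_obd with its cl_mgc_mutex member comes from the Lustre obd headers */

/* hypothetical, simplified stand-in for the locked part of mgc_fs_setup() */
static int mgc_fs_setup_locked(struct client_obd *cli)
{
	int rc;

	/* mutex_lock() leaves the task in uninterruptible D state (the
	 * hung-task stack above); taking the lock interruptibly lets a
	 * stuck mount.lustre be killed with a signal instead of a reboot
	 */
	rc = mutex_lock_interruptible(&cli->cl_mgc_mutex);
	if (rc)		/* -EINTR: a signal arrived while waiting */
		return rc;

	/* ... existing local CONFIGS setup would run here ... */

	mutex_unlock(&cli->cl_mgc_mutex);
	return 0;
}

mutex_lock_killable() is the variant to use if only fatal signals should break the wait, which matches the "kill a stuck mount without rebooting" case even more closely than mutex_lock_interruptible().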