[LU-1279] failure trying to mount two targets at the same time after boot Created: 02/Apr/12  Updated: 27/Apr/15  Resolved: 21/Oct/14

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.4.0, Lustre 2.1.2
Fix Version/s: Lustre 2.7.0, Lustre 2.5.4

Type: Bug Priority: Major
Reporter: Brian Murrell (Inactive) Assignee: Hongchao Zhang
Resolution: Fixed Votes: 0
Labels: llnl
Environment:

lustre-modules-2.1.1-2.6.32_220.4.2.el6_lustre.gcbb4fad.x86_64_gae03fc8.x86_64


Attachments: Text File modprobe_log.txt     File module-init-tools-3.9.tar.gz    
Issue Links:
Blocker
Duplicate
duplicates LU-4311 Mount sometimes fails with EIO on OSS... Closed
is duplicated by LU-3975 Race loading ldiskfs with parallel mo... Resolved
is duplicated by LU-5961 Concurrent mount of ZFS targets fails... Resolved
Related
is related to LU-3975 Race loading ldiskfs with parallel mo... Resolved
is related to LU-5159 Lustre MGS/MDT fails to start using i... Resolved
is related to LU-2456 Dynamic LNet Config Main Development ... Resolved
Severity: 3
Rank (Obsolete): 4025

 Description   

After booting an OSS, two OSTs are mounted simultaneously. The mounts fail because the Lustre modules fail to load:

Lustre: OBD class driver, http://wiki.whamcloud.com/
Lustre:         Lustre Version: 2.1.1
Lustre:         Build Version: jenkins-gae03fc8-PRISTINE-2.6.32-220.4.2.el6_lustre.gcbb4fad.x86_64
Lustre: Lustre LU module (ffffffffa0578c60).
fld: gave up waiting for init of module ptlrpc.
fld: Unknown symbol RQF_FLD_QUERY
fld: gave up waiting for init of module ptlrpc.
fld: Unknown symbol req_capsule_server_pack
fld: gave up waiting for init of module ptlrpc.
fld: Unknown symbol req_capsule_client_get
fld: gave up waiting for init of module ptlrpc.
fld: Unknown symbol ptlrpc_queue_wait
fld: gave up waiting for init of module ptlrpc.
fld: Unknown symbol req_capsule_fini
fld: gave up waiting for init of module ptlrpc.
fld: Unknown symbol req_capsule_init
fld: gave up waiting for init of module ptlrpc.
fld: Unknown symbol req_capsule_set
INFO: task hydra-agent:1590 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
hydra-agent   D 0000000000000000     0  1590      1 0x00000080
 ffff88003db09d68 0000000000000082 ffff88003d740a88 ffff88003bad0250
 ffff88003db09d68 ffffffff8113fb78 800000002c760065 0000000000000086
 ffff880037c1c678 ffff88003db09fd8 000000000000f4e8 ffff880037c1c678
Call Trace:
 [<ffffffff8113fb78>] ? vma_adjust+0x128/0x590
 [<ffffffff814ee35e>] __mutex_lock_slowpath+0x13e/0x180
 [<ffffffff814ee1fb>] mutex_lock+0x2b/0x50
 [<ffffffff810aaafd>] m_start+0x1d/0x40
 [<ffffffff81198cc0>] seq_read+0x90/0x3f0
 [<ffffffff811dae0e>] proc_reg_read+0x7e/0xc0
 [<ffffffff81176cb5>] vfs_read+0xb5/0x1a0
 [<ffffffff810d4582>] ? audit_syscall_entry+0x272/0x2a0
 [<ffffffff81176df1>] sys_read+0x51/0x90
 [<ffffffff8100b0f2>] system_call_fastpath+0x16/0x1b
INFO: task modprobe:1679 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
modprobe      D 0000000000000000     0  1679   1651 0x00000080
 ffff88002ed35aa8 0000000000000082 ffff88002ed35a58 ffffffff810097cc
 ffff88003ef260f8 0000000000000000 0000000000d35a68 ffff880002213b00
 ffff880037415a78 ffff88002ed35fd8 000000000000f4e8 ffff880037415a78
Call Trace:
 [<ffffffff810097cc>] ? __switch_to+0x1ac/0x320
 [<ffffffff814ecd0e>] ? thread_return+0x4e/0x760
 [<ffffffff814edb75>] schedule_timeout+0x215/0x2e0
 [<ffffffff8104c9e9>] ? __wake_up_common+0x59/0x90
 [<ffffffff814ed7f3>] wait_for_common+0x123/0x180
 [<ffffffff8105e7f0>] ? default_wake_function+0x0/0x20
 [<ffffffff8108b741>] ? __queue_work+0x41/0x50
 [<ffffffff814ed90d>] wait_for_completion+0x1d/0x20
 [<ffffffff81089c90>] call_usermodehelper_exec+0xe0/0xf0
 [<ffffffffa04966d2>] ? lnet_startup_lndnis+0x262/0x6f0 [lnet]
 [<ffffffff81089feb>] __request_module+0x18b/0x210
 [<ffffffffa0498e00>] ? lnet_parse_networks+0x90/0x7e0 [lnet]
 [<ffffffffa041aa13>] ? cfs_alloc+0x63/0x90 [libcfs]
 [<ffffffffa04966d2>] lnet_startup_lndnis+0x262/0x6f0 [lnet]
 [<ffffffffa041aa13>] ? cfs_alloc+0x63/0x90 [libcfs]
 [<ffffffffa0496c85>] LNetNIInit+0x125/0x1f0 [lnet]
 [<ffffffffa06aa13a>] ? init_module+0x0/0x597 [ptlrpc]
 [<ffffffffa05f1c89>] ptlrpc_ni_init+0x29/0x170 [ptlrpc]
 [<ffffffff8105e7f0>] ? default_wake_function+0x0/0x20
 [<ffffffffa05f2053>] ptlrpc_init_portals+0x13/0xd0 [ptlrpc]
 [<ffffffffa06aa13a>] ? init_module+0x0/0x597 [ptlrpc]
 [<ffffffffa06aa21a>] init_module+0xe0/0x597 [ptlrpc]
 [<ffffffff81096d15>] ? __blocking_notifier_call_chain+0x65/0x80
 [<ffffffff8100204c>] do_one_initcall+0x3c/0x1d0
 [<ffffffff810af4e1>] sys_init_module+0xe1/0x250
 [<ffffffff8100b0f2>] system_call_fastpath+0x16/0x1b
INFO: task modprobe:1688 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
modprobe      D 0000000000000000     0  1688   1687 0x00000080
 ffff88003d6d3eb8 0000000000000086 ffff88003d6d3e18 0000000000000082
 ffff88003d6d1ab8 ffff88003d6d3fd8 000000000000f4e8 ffff88003d6d1ac0
 ffff88003d6d1ab8 ffff88003d6d3fd8 000000000000f4e8 ffff88003d6d1ab8
Call Trace:
 [<ffffffff814f39dd>] ? kprobes_module_callback+0xdd/0x170
 [<ffffffff814ee35e>] __mutex_lock_slowpath+0x13e/0x180
 [<ffffffff81096d15>] ? __blocking_notifier_call_chain+0x65/0x80
 [<ffffffff814ee1fb>] mutex_lock+0x2b/0x50
 [<ffffffff810af533>] sys_init_module+0x133/0x250
 [<ffffffff8100b0f2>] system_call_fastpath+0x16/0x1b
fld: gave up waiting for init of module ptlrpc.
fld: Unknown symbol req_capsule_server_get
fld: gave up waiting for init of module ptlrpc.
fld: Unknown symbol ptlrpc_at_set_req_timeout
type=1305 audit(1333395888.750:31878): auid=4294967295 ses=4294967295 op="remove rule" key=(null) list=4 res=1
type=1305 audit(1333395888.750:31879): audit_enabled=0 old=1 auid=4294967295 ses=4294967295 res=1
readahead-collector: starting delayed service auditd
readahead-collector: sorting
readahead-collector: finished
fld: gave up waiting for init of module ptlrpc.
fld: Unknown symbol ptlrpc_request_alloc_pack
fld: gave up waiting for init of module ptlrpc.
fld: Unknown symbol RMF_FLD_OPC
fld: gave up waiting for init of module ptlrpc.
fld: Unknown symbol ptlrpc_request_set_replen
fld: gave up waiting for init of module ptlrpc.
fld: Unknown symbol RMF_FLD_MDFLD
fld: gave up waiting for init of module ptlrpc.
fld: Unknown symbol ptlrpc_req_finished
LustreError: 1679:0:(socklnd.c:2420:ksocknal_base_startup()) Can't spawn socknal scheduler[0]: -513
LustreError: 105-4: Error -100 starting up LNI tcp
LustreError: 1679:0:(events.c:728:ptlrpc_init_portals()) network initialisation failed


 Comments   
Comment by Ashley Pittman (Inactive) [ 14/May/12 ]

We are also seeing this issue in our 2.1.2 release testing, currently using the 6a9def181397a20c48b8f55ef4f6c29f2fc18ed6 commit on the b2_1 branch.

Comment by Peter Jones [ 22/May/12 ]

Bob

Oleg thinks that this is related to the port of LU-1166. Could you please look at it?

Thanks

Peter

Comment by Bob Glossman (Inactive) [ 22/May/12 ]

Seems unlikely this is related to LU-1166 since it started happening before that change landed in b2_1. Think the underlying cause is still not known.

Comment by Ashley Pittman (Inactive) [ 22/May/12 ]

This can be provoked purely by loading the module. If I just run modprobe lustre I see the following in the message file and the command returns very quickly. Lustre version is a DDN build of b2_1.

May 22 12:41:37 12k-3 kernel: Lustre: Lustre: Build Version: EXAScaler-build-ddn1.1--PRISTINE-2.6.32-220.13.1.el6_lustre20120521052707.x86_64
May 22 12:41:38 12k-3 kernel: Lustre: Added LNI 10.10.11.183@o2ib [8/64/0/180]
May 22 12:41:38 12k-3 kernel: Lustre: Lustre OSC module (ffffffffa0b56040).
May 22 12:41:38 12k-3 kernel: Lustre: Lustre LOV module (ffffffffa0bb6000).
May 22 12:41:38 12k-3 kernel: Lustre: Lustre client module (ffffffffa0c44ca0).

If, however, I use pdsh to spawn several local modprobe processes simultaneously with the following command, I see this instead:
pdsh -R exec -w [0-10] modprobe lustre

May 22 12:43:05 12k-3 kernel: Lustre: Lustre: Build Version: EXAScaler-build-ddn1.1--PRISTINE-2.6.32-220.13.1.el6_lustre20120521052707.x86_64
May 22 12:43:35 12k-3 kernel: fld: gave up waiting for init of module ptlrpc.
May 22 12:43:35 12k-3 kernel: fld: Unknown symbol RQF_FLD_QUERY
May 22 12:44:05 12k-3 kernel: fld: gave up waiting for init of module ptlrpc.
May 22 12:44:05 12k-3 kernel: fld: Unknown symbol req_capsule_server_pack
May 22 12:44:35 12k-3 kernel: fld: gave up waiting for init of module ptlrpc.
May 22 12:44:35 12k-3 kernel: fld: Unknown symbol req_capsule_client_get
May 22 12:45:05 12k-3 kernel: fld: gave up waiting for init of module ptlrpc.
May 22 12:45:05 12k-3 kernel: fld: Unknown symbol ptlrpc_queue_wait
May 22 12:45:35 12k-3 kernel: fld: gave up waiting for init of module ptlrpc.
May 22 12:45:35 12k-3 kernel: fld: Unknown symbol req_capsule_fini
May 22 12:46:05 12k-3 kernel: fld: gave up waiting for init of module ptlrpc.
May 22 12:46:05 12k-3 kernel: fld: Unknown symbol req_capsule_init
May 22 12:46:35 12k-3 kernel: fld: gave up waiting for init of module ptlrpc.
May 22 12:46:35 12k-3 kernel: fld: Unknown symbol req_capsule_set
May 22 12:47:05 12k-3 kernel: fld: gave up waiting for init of module ptlrpc.
May 22 12:47:05 12k-3 kernel: fld: Unknown symbol req_capsule_server_get
May 22 12:47:35 12k-3 kernel: fld: gave up waiting for init of module ptlrpc.
May 22 12:47:35 12k-3 kernel: fld: Unknown symbol ptlrpc_at_set_req_timeout
May 22 12:48:05 12k-3 kernel: fld: gave up waiting for init of module ptlrpc.
May 22 12:48:05 12k-3 kernel: fld: Unknown symbol ptlrpc_request_alloc_pack
May 22 12:48:35 12k-3 kernel: fld: gave up waiting for init of module ptlrpc.
May 22 12:48:35 12k-3 kernel: fld: Unknown symbol RMF_FLD_OPC
May 22 12:49:05 12k-3 kernel: fld: gave up waiting for init of module ptlrpc.
May 22 12:49:05 12k-3 kernel: fld: Unknown symbol ptlrpc_request_set_replen

At this point I would normally reboot the node; however, I left a system in this state overnight and it had resolved itself by this morning. I've just triggered this bug again and will leave it in the bad state to see what happens and when.

Comment by Bob Glossman (Inactive) [ 14/Jun/12 ]

When I attempt to reproduce this, I see many modprobe processes, all in the call sequence

ptlrpc_ni_init -> LNetNIInit -> ... -> cfs_request_module -> call_usermodehelper_exec

It seems that until the_lnet.ln_refcount is set non-zero later on in LNetNIInit(), there is nothing to prevent LNet init from calling into the Linux module load infrastructure multiple times for the same module load. These calls appear to collide down inside Linux.

Here's an example stack trace captured in /var/log/messages:

Jun 14 09:32:47 centos24 kernel: Call Trace:
Jun 14 09:32:47 centos24 kernel: [<ffffffff8104da7c>] ? check_preempt_curr+0x7c/0x90
Jun 14 09:32:47 centos24 kernel: [<ffffffff814ee255>] schedule_timeout+0x215/0x2e0
Jun 14 09:32:47 centos24 kernel: [<ffffffff81090d46>] ? autoremove_wake_function+0x16/0x40
Jun 14 09:32:47 centos24 kernel: [<ffffffff8104cab9>] ? __wake_up_common+0x59/0x90
Jun 14 09:32:47 centos24 kernel: [<ffffffff814eded3>] wait_for_common+0x123/0x180
Jun 14 09:32:47 centos24 kernel: [<ffffffff8105ea30>] ? default_wake_function+0x0/0x20
Jun 14 09:32:47 centos24 kernel: [<ffffffff8108b9e1>] ? __queue_work+0x41/0x50
Jun 14 09:32:47 centos24 kernel: [<ffffffff814edfed>] wait_for_completion+0x1d/0x20
Jun 14 09:32:47 centos24 kernel: [<ffffffff81089f30>] call_usermodehelper_exec+0xe0/0xf0
Jun 14 09:32:47 centos24 kernel: [<ffffffffa04d16d2>] ? lnet_startup_lndnis+0x262/0x6f0 [lnet]
Jun 14 09:32:47 centos24 kernel: [<ffffffff8108a28b>] __request_module+0x18b/0x210
Jun 14 09:32:47 centos24 kernel: [<ffffffffa04d3e00>] ? lnet_parse_networks+0x90/0x7e0 [lnet]
Jun 14 09:32:47 centos24 kernel: [<ffffffffa0992a13>] ? cfs_alloc+0x63/0x90 [libcfs]
Jun 14 09:32:47 centos24 kernel: [<ffffffffa04d16d2>] lnet_startup_lndnis+0x262/0x6f0 [lnet]
Jun 14 09:32:47 centos24 kernel: [<ffffffffa0992a13>] ? cfs_alloc+0x63/0x90 [libcfs]
Jun 14 09:32:47 centos24 kernel: [<ffffffffa04d1c85>] LNetNIInit+0x125/0x1f0 [lnet]
Jun 14 09:32:47 centos24 kernel: [<ffffffffa070713a>] ? init_module+0x0/0x597 [ptlrpc]
Jun 14 09:32:47 centos24 kernel: [<ffffffffa06501c9>] ptlrpc_ni_init+0x29/0x170 [ptlrpc]
Jun 14 09:32:47 centos24 kernel: [<ffffffff8105ea30>] ? default_wake_function+0x0/0x20
Jun 14 09:32:47 centos24 kernel: [<ffffffffa0650593>] ptlrpc_init_portals+0x13/0xd0 [ptlrpc]
Jun 14 09:32:47 centos24 kernel: [<ffffffffa070713a>] ? init_module+0x0/0x597 [ptlrpc]
Jun 14 09:32:47 centos24 kernel: [<ffffffffa070721a>] init_module+0xe0/0x597 [ptlrpc]
Jun 14 09:32:47 centos24 kernel: [<ffffffff81096fb5>] ? __blocking_notifier_call_chain+0x65/0x80
Jun 14 09:32:47 centos24 kernel: [<ffffffff8100204c>] do_one_initcall+0x3c/0x1d0
Jun 14 09:32:47 centos24 kernel: [<ffffffff810af891>] sys_init_module+0xe1/0x250
Jun 14 09:32:47 centos24 kernel: [<ffffffff8100b0f2>] system_call_fastpath+0x16/0x1b

Comment by Bob Glossman (Inactive) [ 13/Jul/12 ]

I'm not sure there's much we can do about this problem. The root cause is the linux kernel mutex used to serialize module loads, module_mutex. The deadlock situation is this:

a) modprobe lustre -> .... -> init_module(ptlrpc) -> LNetNIInit() -> modprobe ksocklnd

The 'modprobe ksocklnd' is blocked trying to get module_mutex in sys_init_module() -> mutex_lock_interruptible(module_mutex)

The ptlrpc init routine is blocked from finishing because the ksocklnd load is blocked from starting by the b) sequence, which is trying to run concurrently.

b) modprobe lustre -> load_module(?) (this thread owns module_mutex, obtained before entering load_module()) -> ... -> use_module(?, ptlrpc) -> find_symbol_in_setup

The b) sequence of modprobe loads holds the mutex and tries to look up a ptlrpc symbol, which it can't find because ptlrpc never finishes its init. Each lookup times out after 30s, and the pattern repeats multiple times, apparently once per symbol lookup. If you wait long enough through all of the timeouts, the a) sequence eventually finishes.

Without tampering inside Linux, and maybe causing more problems than we solve, I don't see any easy way around this.
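
Put as a rough timeline (my sketch of how the two sequences above interlock, not taken from the logs):

    a) modprobe lustre (first instance)
         sys_init_module(ptlrpc): load_module() completes, module_mutex is released
           ptlrpc init -> LNetNIInit() -> request_module() forks "modprobe ksocklnd"
             modprobe ksocklnd: sys_init_module() blocks on module_mutex, held by b)

    b) modprobe lustre (second instance)
         sys_init_module() of a module importing ptlrpc symbols: load_module() holds module_mutex
           symbol resolution waits (30s per symbol) for ptlrpc init to finish, i.e. for a)

    a) can't finish the ptlrpc init until its child modprobe returns; b) won't release
    module_mutex until all of its per-symbol waits time out.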

Perhaps there can be a workaround solution. If lustre modules get preloaded in a guaranteed single stream fashion before trying to do mounts the problem could be avoided. I'm thinking of something like a .rc script executing 'modprobe lustre' that would run before any mounts could happen, deployed only on lustre servers.

Comment by Ashley Pittman (Inactive) [ 13/Jul/12 ]

Right now we have "modprobe lustre" in /etc/sysconfig/modules/exascaler.modules, but the problems with this are two-fold. Firstly, we still hit the issue on install because we have to unload the module to set the lnet parameters; when we do this and then try to start the filesystem, we hit the bug. Secondly, and more importantly, IB networks tend not to be up that early in the boot process, so the module loading fails anyway. We could potentially add this to rc.local, which stands a greater chance of working, but I believe it would still be racy and could potentially fail, in which case we'd hit this bug on filesystem startup.

Comment by Bob Glossman (Inactive) [ 13/Jul/12 ]

/etc/sysconfig/modules files execute out of /etc/rc.sysinit and are too early in the boot sequence, as you've pointed out. I'm not sure I understand your objection to rc.local. That does seem late enough in the boot sequence that all the necessary devices & networks are started up, so everything needed by lustre would be present by then. Is that before you mount OSTs or start up a lustre fs? If so, that sounds like a good place.

On your other objection, you say you need to manually stop & tear down the lustre modules after an install in order to modify the lnet setup. If a manual command is used to stop it (lustre_rmmod?), can't you just do a manual command-line 'modprobe lustre' after changing the lnet params to bring it back up? Then your node should be in a good state to start a filesystem.

Comment by Bob Glossman (Inactive) [ 13/Jul/12 ]

I almost hesitate to ask such a dumb question, but if this problem is triggered by mounting multiple OSTs at once why do that? Just mount them one at a time and avoid the whole issue.

Comment by Ashley Pittman (Inactive) [ 13/Jul/12 ]

rc.local might be OK, but I have seen IB networks take several minutes to start up properly after boot, especially if the switch is booting at the same time as the node. I'll do a test.

The filesystem is started by corosync normally which does it in parallel, we have little control over that.

Comment by Bob Glossman (Inactive) [ 13/Jul/12 ]

If your filesystems are started by corosync, then rc.local may be too late in the boot sequence. S99local comes very late. When does corosync service start?

Comment by Ashley Pittman (Inactive) [ 13/Jul/12 ]

Corosync is configured to not attempt to start the filesystem until all network interfaces are up.

Comment by Robert Read (Inactive) [ 13/Jul/12 ]

This is also an issue for Chroma (HYD-1263). Is this a module dependency issue? Shouldn't the LNDs be loaded before ptlrpc and higher layers?

Comment by Bob Glossman (Inactive) [ 13/Jul/12 ]

To the best of my understanding this isn't strictly a module dependency issue. For one thing, as long as there is only a single stream of module loads going, they all complete and don't block each other; only when they are multiple and concurrent does this problem appear. For another, lnet doesn't load lnd modules based on dependencies. It explicitly loads each needed lnd module with cfs_request_module() inside lnet_startup_lndnis(). That maps to the kernel API request_module(), which forks off another userspace modprobe to load the lnd.

That's exactly what's happening in the deadlock. The module load of ptlrpc has completed the load_module() portion of its operation and has given up module_mutex. In the course of running init_module() it calls LNetNIInit -> lnet_startup_lndnis. lnet_startup_lndnis has forked a child to modprobe the lnd module (ksocklnd in my example; it could be o2iblnd in the customer's config). The lnd is blocked from loading by the concurrent modprobe that holds module_mutex while waiting for the ptlrpc init_module to complete. Without a concurrent modprobe racing in and grabbing the mutex, the lnd module load would complete, the ptlrpc init_module would complete, and later module loads that need globals in any of ptlrpc, lnet, or any lnd would find them all present and available.
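
As a minimal sketch of that loading pattern (paraphrased; lnet_load_lnd() is a made-up name, not the actual LNet function):

        /* Called from LNetNIInit() -> lnet_startup_lndnis() while the ptlrpc
         * module init is still running.  cfs_request_module() maps to the
         * kernel's request_module(), which forks a userspace modprobe of the
         * named lnd and waits for it to exit. */
        static int lnet_load_lnd(const char *lnd_name)
        {
                /* If a concurrent modprobe already holds module_mutex and is
                 * itself waiting on ptlrpc symbols, the forked modprobe blocks
                 * on that mutex and this call doesn't return until the symbol
                 * waits time out - the deadlock described above. */
                return request_module("%s", lnd_name);
        }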

Comment by Ashley Pittman (Inactive) [ 14/Jul/12 ]

One thing I could do is make the filesystem agent modprobe ksocklnd and o2iblnd directly before loading the lustre module; it sounds like that might avoid the deadlock?

Comment by Bob Glossman (Inactive) [ 14/Jul/12 ]

Yes, preloading ksocklnd and ko2iblnd might help. As long as you only do one modprobe at a time, you could even try having the filesystem agent do modprobe lustre and preload everything. Only doing more than one, or starting lustre mounts before modprobe completes, is likely to trigger this problem as far as I can see.

Comment by Peter Jones [ 27/Jul/12 ]

Bobijam

Could you please look into this one?

Thanks

Peter

Comment by Zhenyu Xu [ 30/Jul/12 ]

Would you please try http://review.whamcloud.com/3471 to see if it helps?

Comment by Bob Glossman (Inactive) [ 31/Jul/12 ]

Tried the suggested patch from http://review.whamcloud.com/#change,3471 in current b2_1. Doesn't help at all. Still fails exactly the same.

Comment by Zhenyu Xu [ 02/Aug/12 ]

Bob,

I have a question about the scenario you described:

a) modprobe lustre -> .... -> init_module(ptlrpc) -> LNetNIInit() -> modprobe ksocklnd

won't init_module(ptlrpc), which takes the module_mutex lock and calls load_module(ptlrpc) -> ... -> LNetNIInit() -> modprobe ksocklnd,

itself hinder the loading of the ksocklnd module, since init_module(ptlrpc) already holds module_mutex?

<edit> I found the answer: by the time the ptlrpc module gets to LNetNIInit(), module_mutex has already been unlocked, so it does not hinder the loading of the ksocklnd module.

Comment by Peter Jones [ 03/Aug/12 ]

Andreas

Any advice to offer here?

Thanks

Peter

Comment by Bob Glossman (Inactive) [ 03/Aug/12 ]

Bobijam,
Yes, that's what I saw in the Linux code. sys_init_module() holds module_mutex while doing load_module() but drops it before going to the init code in the loaded module.
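
For reference, the relevant part of the RHEL 6 era sys_init_module() looks roughly like this (an abridged paraphrase of kernel/module.c, not a verbatim copy):

        SYSCALL_DEFINE3(init_module, void __user *, umod,
                        unsigned long, len, const char __user *, uargs)
        {
                struct module *mod;
                int ret = 0;
                ...
                /* Only one module load at a time, please */
                if (mutex_lock_interruptible(&module_mutex) != 0)
                        return -EINTR;

                /* module_mutex is held across linking and symbol resolution */
                mod = load_module(umod, len, uargs);
                ...
                /* Drop lock so they can recurse */
                mutex_unlock(&module_mutex);
                ...
                /* the module's own init (e.g. ptlrpc's, which modprobes the
                 * LND) runs without module_mutex held */
                if (mod->init != NULL)
                        ret = do_one_initcall(mod->init);
                ...
        }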

Comment by Andreas Dilger [ 03/Aug/12 ]

We used to do some modprobes from mount.lustre in the past, and this seems like a better place to do it than the system init scripts (which may not even have any Lustre configs in them if the RPMs were just installed for the first time).

I believe the reason that the LNET code is calling up to userspace to load the LND modules is because kernel-side module loading was removed at some point. I don't like that as a general solution, and would prefer if the LND module loading were also done by mount.lustre, but that would require also parsing /etc/modprobe.d/{something} to find out the LNET routing config, and this file does not necessarily have a single name - we'd have to try multiple names and may still get it incorrect.

For now, until the LNET config project is done, let's ignore the LNET modules, and just do a single "modprobe lustre" from within mount.lustre. It would be useful to have some sort of exclusion in userspace (e.g. checking if the module is loaded yet) but it is still racy, and I don't want to make this more complex than it needs to be, since exclusion in userspace is much harder to do.

Failing that, we could add serialization inside the lustre module init to ensure that there are no conflicts down at the lower level. That might even be less complex than trying to do it in userspace?

Comment by Zhenyu Xu [ 06/Aug/12 ]

Bob,

The original ptlrpc module initializer failed while loading the ksocklnd module, which should eventually get the module mutex, and the original log shows that it failed for another reason: the kernel thread start failed with -513 (-ERESTARTNOINTR). What could the signal be? Is it a timeout signal?

LustreError: 1679:0:(socklnd.c:2420:ksocknal_base_startup()) Can't spawn socknal scheduler[0]: -513
LustreError: 105-4: Error -100 starting up LNI tcp
LustreError: 1679:0:(events.c:728:ptlrpc_init_portals()) network initialisation failed
Comment by Bob Glossman (Inactive) [ 06/Aug/12 ]

Bobijam,

I don't know the exact mechanism of the failure, but I do think it is probably a timeout generated signal. My guess is that it's a timeout in user level code around the sys_init_module("ksocklnd") syscall in the user level modprobe command. If the syscall takes too long the timeout fires and interrupts the syscall. That's just a guess as I don't have a view of the modprobe command code. There may be some kernel level mechanism limiting the max time spent in a given syscall that I'm just not aware of.

In the other leg of the deadlock the thread holding module_mutex is failing lookups of global symbols from ptlrpc with a 30s timeout on each one. This means it won't give up module_mutex for a very long time.

Comment by Bob Glossman (Inactive) [ 06/Aug/12 ]

Sorry, I see I misread your previous comment. In all the deadlocks I took a close look at, ksocklnd never got loaded and so never got as far as ksocknal_startup() -> ksocknal_base_startup(). I still think it's probably a timeout-based signal, but have less idea of where it's coming from.

Comment by Nate Pearlstein (Inactive) [ 17/Oct/12 ]

This also has unpleasant results for HA clusters. I have found that putting modprobe ptlrpc in rc.local
effectively works around the problem.

Comment by Andreas Dilger [ 17/Oct/12 ]

Bobijam, can you please add a call to mount_lustre.c like:

        system("modprobe ptlrpc > /dev/null 2>&1");

or something that will just run silently in the background and not print any error message.

Comment by Brian Murrell (Inactive) [ 18/Oct/12 ]

Will adding this to mount_lustre.c actually resolve the problem though given that the original description of this bug says that it's multiple calls to mount that are causing the problem? IOW, isn't this just moving the race/deadlock to somewhere else in the stack?

Comment by Ashley Pittman (Inactive) [ 18/Oct/12 ]

"modprobe lustre" wouldn't solve the problem but "modprobe ptlrpc" should as it avoids the case where the module loading trys to pull in another module further down the chain.

We added modprobe ksocklnd ; modprobe ko2iblnd to our HA agent and do not see this issue any more.

Comment by Zhenyu Xu [ 07/Nov/12 ]

master patch tracking at http://review.whamcloud.com/4292 + http://review.whamcloud.com/4449

Comment by Zhenyu Xu [ 07/Nov/12 ]

b2_1 port at http://review.whamcloud.com/4488

Comment by Oleg Drokin [ 07/Nov/12 ]

Please note that this patch introduces a problem now with lustre clients.
If you have no lustre modules loaded at all and do mount.lustre before the patch, the lustre modules get loaded as needed from the mount syscall.

Now, with the patch in, mount.lustre loads ptlrpc.ko before the mount syscall, which brings in parts of the lustre stack but not all the way up to the client, and then when mount syscall time comes the mount actually fails. (This is a real complaint from ORNL.)

So we need to address this too, either by improving the way we detect client mounts and loading lustre from obd_mount.c somewhere, or by other means.

Comment by Andreas Dilger [ 07/Nov/12 ]

One simple option might be to add another line:

        rc = system("/sbin/modprobe lustre >/dev/null 2>&1");

after the current ptlrpc line? That should avoid the missing Lustre filesystem code on the client. The root of the problem is that the "shared" mount handling is done in obdclass/mount.c or similar, which provides the "lustre" filesystem type, but this is not enough to mount the client. If no modules are loaded, the missing "lustre" filesystem type would otherwise force the "lustre" module to be loaded and properly set up the stack.

It may be with 2.4 that we no longer need this "shared" mount handling in obdclass, and can move it back to llite, but I'm not sure.

Comment by James A Simmons [ 08/Nov/12 ]

Oleg is describing the problems I was experiencing testing master on our Cray systems. Our compute nodes were depending on the mount syscall loading all the proper modules for us. We attempted to manually do a modprobe lustre, but only some of the modules were loaded. For the busted setup, here are the modules loaded:

mgc 76893 0
lustre 913092 0
lov 648104 1 lustre
osc 466313 1 lov
mdc 202643 1 lustre
fid 70712 1 mdc
fld 89516 1 fid
xpmem 28383 0
kgnilnd 522013 1
ptlrpc 1307091 7 mgc,lustre,lov,osc,mdc,fid,fld
obdclass 1304282 83 mgc,lustre,lov,osc,mdc,fid,fld,ptlrpc
lnet 384132 4 lustre,kgnilnd,ptlrpc,obdclass
lvfs 37066 9 mgc,lustre,lov,osc,mdc,fid,fld,ptlrpc,obdclass
sha1_generic 2367 0
md5 2181 0
libcfs 391846 12 mgc,lustre,lov,osc,mdc,fid,fld,kgnilnd,ptlrpc,obdclass,lnet,lvfs
ib_core 759 0 [permanent]
kdreg 17949 0
gpcd_gem 10275 0
ipogif_gem 13077 0
kgni_gem 274252 2 kgnilnd,gpcd_gem
hwerr 60686 0
rca 362621 5
hss_os 4592 2 hwerr,rca
heartbeat 4230 1 rca
simplex 2636 2 hwerr,rca
ghal_gem 84908 6 gpcd_gem,ipogif_gem,kgni_gem,hwerr,rca,hss_os
cgm 8376 2 kgni_gem,ghal_gem
craytrace 6476 0

Comment by James A Simmons [ 08/Nov/12 ]

With the patch reverted I get this list of loaded modules:

xpmem 28383 0
lmv 314649 1
mgc 76893 1
lustre 913092 29
lov 648104 16 lustre
osc 466313 43 lov
mdc 202643 2 lustre
fid 70712 1 mdc
fld 89516 2 lmv,fid
kgnilnd 522013 1
ptlrpc 1307091 8 lmv,mgc,lustre,lov,osc,mdc,fid,fld
obdclass 1304353 106 lmv,mgc,lustre,lov,osc,mdc,fid,fld,ptlrpc
lnet 384132 4 lustre,kgnilnd,ptlrpc,obdclass
lvfs 37066 10 lmv,mgc,lustre,lov,osc,mdc,fid,fld,ptlrpc,obdclass
sha1_generic 2367 0
md5 2181 0
libcfs 391846 13 lmv,mgc,lustre,lov,osc,mdc,fid,fld,kgnilnd,ptlrpc,obdclass,lnet,lvfs
ib_core 759 0 [permanent]
kdreg 17949 0
gpcd_gem 10275 0
ipogif_gem 13077 0
kgni_gem 274252 2 kgnilnd,gpcd_gem
hwerr 60686 0
rca 362621 5
hss_os 4592 2 hwerr,rca
heartbeat 4230 1 rca
simplex 2636 2 hwerr,rca
ghal_gem 84908 6 gpcd_gem,ipogif_gem,kgni_gem,hwerr,rca,hss_os
cgm 8376 2 kgni_gem,ghal_gem
craytrace 6476 0

Comment by Oleg Drokin [ 08/Nov/12 ]

Andreas, actually modprobe lustre might be the wrong way to do it. After all, this might be a server mount, and then there's no need for the lustre client modules to be present.

Comment by James A Simmons [ 08/Nov/12 ]

Agree. On Cray compute nodes the modules are loaded at boot time only. After the boot is complete the image then removes all the modules and turns off modprobe; you can't modprobe lustre during run time. In that case a compute node could only mount lustre at boot and not during run time.

Comment by Andreas Dilger [ 05/Dec/12 ]

I'm planning to revert the "modprobe ptlrpc" line to avoid the "mount" problem.

It might be possible to fix this by adding explicit modprobe lines for the LNDs based on the MGS NID type as it is parsed by mount_lustre.c, like:

convert_hostnames()
:
:
        char cmd[64];

        snprintf(cmd, sizeof(cmd), "modprobe %s > /dev/null 2>&1",
                 libcfs_lnd2modname(libcfs_str2lnd(s1)));
        (void)system(cmd);

This isn't a perfect solution if there is routing involved since only the MGS LND module would be loaded and that may not match the LND type used by the client, but that is somewhat a corner case. It is likely to fix the problem for many environments, and will at least not break things like the current patch does.

The issue blocking a proper fix is that the lnet modules are loaded based on modprobe options, but the module options are not in a specific file, so mount would have to hunt and peck to fix this. It definitely makes sense to have the simplified LNET config fix this properly by loading the actual modules needed on the client.

Comment by Andreas Dilger [ 05/Dec/12 ]

I've pushed http://review.whamcloud.com/4750 to revert the patch http://review.whamcloud.com/4292 and http://review.whamcloud.com/4449 fixup. Hopefully that will allow http://review.whamcloud.com/4727 to pass testing.

Comment by Andreas Dilger [ 10/Dec/12 ]

The patch to revert ptlrpc loading is underway, but someone else needs to make the patch to load the individual LND modules at mount time.

Comment by Hongchao Zhang [ 15/Jul/13 ]

the patch which checks and loads the specified LNDs in mount.lustre is underway.

Comment by Hongchao Zhang [ 17/Jul/13 ]

the patch to load individual LND modules at mount time is tracked at http://review.whamcloud.com/#/c/7024/

Comment by Hongchao Zhang [ 17/Jul/13 ]

If modules can't be loaded after boot time (see above), this can be fixed by adding "modprobe lustre; lctl net load" to rc.local with the patch applied.

Comment by Brian Murrell (Inactive) [ 17/Jul/13 ]

If loading modules at boot time is the solution, it should be done in its own initscript (or integrated into the distro's native boot-time module loading infrastructure), not rc.local. rc.local should be left alone for the sysadmin to use, remove, etc. as he wishes.

Comment by Hongchao Zhang [ 18/Jul/13 ]

Yes, it can be put into other initscripts; rc.local is just one of the possible ones.

Comment by Bryon Neitzel (Inactive) [ 19/Jul/13 ]

This is needed for https://jira.hpdd.intel.com/browse/HYD-2311. Could we try to get this landed for 2.5 please?

Comment by Hongchao Zhang [ 23/Aug/13 ]

the patch is updated

Comment by Patrick Farrell (Inactive) [ 18/Oct/13 ]

I think you may find that this patch is not sufficient to allow parallel mounting of multiple targets. We've recently observed other issues with that same operation.

Cray uses a script to mount all OSTs at the same time - The standard mount -t lustre commands are issued in parallel.

Recently, we've started occasionally noticing that the fsfilt_ldiskfs module sometimes fails to load.
The error comes from fsfilt_get_ops, which loads the module like this:

                if (!(rc = cfs_request_module("%s", name))) {
                        fs_ops = fsfilt_search_type(type);
                        CDEBUG(D_INFO, "Loaded module '%s'\n", name);
                        if (!fs_ops)
                                rc = -ENOENT;
                }

                if (rc) {
                        CERROR("Can't find %s interface\n", name);
                        RETURN(ERR_PTR(rc < 0 ? rc : -rc));
                }

So after the module load, it checks whether the module really registered itself, using fsfilt_search_type() (which looks for a pointer to something from the module); if the load returned zero but nothing is found, rc is set to -ENOENT.
That -ENOENT is converted to something else and ends up back at userspace as something like "ENOTSUPPORTED", but this is the source of it.

We get the "Can't find fsfilt_ldiskfs interface" message, and the mount fails.

In investigating this, I switched to using an instrumented version of modprobe to see what step was failing.

I've logged a number of mount attempts with a four OST OSS, some of which failed with the error above but most of which succeeded.

I'll attach the log from it, but here's an example of what I found, when logging a (successful, in this case) Lustre mount:
pid 28287 Modprobe started.
pid 28287: modulearg is: osd-ldiskfs
pid 28287: File open of file /lib/modules/2.6.32-358.6.1.el6.x86_64/updates/kernel/net/lustre/libcfs.ko result: 1
pid 28287: File open of file /lib/modules/2.6.32-358.6.1.el6.x86_64/updates/kernel/fs/lustre/lvfs.ko result: 1
pid 28287: File open of file /lib/modules/2.6.32-358.6.1.el6.x86_64/updates/kernel/net/lustre/lnet.ko result: 1
pid 28287: File open of file /lib/modules/2.6.32-358.6.1.el6.x86_64/updates/kernel/fs/lustre/obdclass.ko result: 1
pid 28287: File open of file /lib/modules/2.6.32-358.6.1.el6.x86_64/updates/kernel/fs/lustre/ptlrpc.ko result: 1
pid 28287: File open of file /lib/modules/2.6.32-358.6.1.el6.x86_64/updates/kernel/fs/lustre/fld.ko result: 1
pid 28287: File open of file /lib/modules/2.6.32-358.6.1.el6.x86_64/updates/kernel/fs/lustre/fid.ko result: 1
pid 28287: File open of file /lib/modules/2.6.32-358.6.1.el6.x86_64/updates/kernel/fs/lustre/mdd.ko result: 1
pid 28287: init_module failed, error: -1
pid 28287: File open of file /lib/modules/2.6.32-358.6.1.el6.x86_64/kernel/fs/jbd2/jbd2.ko result: 1
pid 28287: File open of file /lib/modules/2.6.32-358.6.1.el6.x86_64/updates/kernel/fs/lustre/lquota.ko result: 1
pid 28287: init_module failed, error: -1
pid 28287: File open of file /lib/modules/2.6.32-358.6.1.el6.x86_64/kernel/fs/mbcache.ko result: 1
pid 28287: File open of file /lib/modules/2.6.32-358.6.1.el6.x86_64/updates/kernel/fs/lustre-ldiskfs/ldiskfs.ko result: 1
pid 28287: init_module failed, error: -1
pid 28287: File open of file /lib/modules/2.6.32-358.6.1.el6.x86_64/updates/kernel/fs/lustre/osd_ldiskfs.ko result: 1
pid 28287: init_module failed, error: -1

Modulearg is the module name modprobe was called with. It then attempts to load any dependencies, and finally the module given in the argument.
The pid is the pid of the particular instance of modprobe, then the file being opened is the module it is trying to load.
When you see "init_module" failed, that is a failure to init the module listed on the "File open of file" line above.
In the attached log file, mounts of Lustre are separated by "Mount starting" and "Mount complete".

There is a separate modprobe of osd-ldiskfs which succeeded as part of that mount.

If you look through the attached file, you will see many modules failing to init on and off on various mounts, and various module inits being attempted more than once.
Often, fsfilt_ldiskfs will fail to init one or more times, but the mount will still succeed.
It's also worth noting that even when the mount fails, fsfilt_ldiskfs is init'ed successfully at least once.
This suggests that possibly the check that the module is loaded is racing with the module load itself.
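
To make the suspected interleaving concrete (a hypothesis based on the fsfilt_get_ops() code quoted above, not something visible in the logs):

    mount A: cfs_request_module("fsfilt_ldiskfs") - its modprobe loads the module
             and the module's init eventually registers the fsfilt ops
    mount B: cfs_request_module("fsfilt_ldiskfs") concurrently - its modprobe's
             init_module() fails with -1 because the module is already being
             loaded by A, or the call returns before A's init has registered
             the ops, so fsfilt_search_type() finds nothing and rc = -ENOENT
    mount B: fsfilt_get_ops() prints "Can't find fsfilt_ldiskfs interface" and
             the mount fails, even though A's load of the same module succeeds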

modprobe fsfilt_ldiskfs before doing the mount avoids the mount failure.

Comment by Hongchao Zhang [ 18/Oct/13 ]

Is the error -1 (-EPERM) the actual error returned by "init_module"?

The modules could be loaded more than once in class_get_type() (obdclass/genops.c) if the previous loads failed.

By the way, which mount is the failed one in the attached log?

Thanks!

Comment by Patrick Farrell (Inactive) [ 18/Oct/13 ]

I'm not sure it necessarily represents -EPERM, but -1 is the return value from init_module:
This is the code from modprobe (inside the function 'insmod', around line 690 in modprobe.c), with the debug line I added:

        ret = init_module(module->data, module->len, optstring);
        if (ret != 0) {
                fprintf(debug_fp,"pid %d: init_module failed, error: %d\n",(int) pid, ret);

The fourth mount in the file is the one which failed (line 596 is one of the fsfilt_ldiskfs failures from that mount, note there are three failures and one success for fsfilt_ldiskfs in that mount).

Comment by Patrick Farrell (Inactive) [ 18/Oct/13 ]

Full source of module-init-tools-3.9.tar.gz, with lightly instrumented modprobe.c.

This version of modprobe (built just by doing configure, then make, and then found in the /build/ directory) creates and appends debug output to a file called /tmp/modprobe_output

It's made from a version of module-init-tools very close to that used with CentOS 6.4 and can be dropped in in place of the existing modprobe in /sbin/modprobe

Comment by Patrick Farrell (Inactive) [ 18/Oct/13 ]

I've also attached the module-init-tools source with my instrumented modprobe. See the note on the attachment for more details.

One further note - This ticket gives the affected version as 2.1. We're seeing the related issue I described above in 2.4.1.

Comment by Andreas Dilger [ 18/Oct/13 ]

It is important to note that the fsfilt code that is loading the fsfilt_ldiskfs module will be deleted in Lustre 2.6. The code is almost obsolete in Lustre 2.5, but the patch to remove it (http://review.whamcloud.com/5512) didn't get landed before the feature freeze. Any fixes made in this area (e.g. having mount.lustre load fsfilt_ldiskfs for ldiskfs servers, or putting a mutex around the probe/load/check code) do not need to be landed for Lustre 2.5 or later.

I'd prefer to just land 5512 to b2_5 to delete the affected code to fix this problem instead. The patch may be large, but is mostly just moving code around so that lvfs and fsfilt can be deleted.

Comment by James A Simmons [ 18/Oct/13 ]

Patrick can you try http://review.whamcloud.com/5512 to see if the fsfilt issues go away for you?

Comment by Patrick Farrell (Inactive) [ 18/Oct/13 ]

James:

Possibly. I have yet to replicate those on a system where I can install arbitrary Lustre code [it's only happening on some of our systems, seems very timing dependent], but I'll give it a try. (Replication first of all...)

Comment by Cory Spitz [ 18/Oct/13 ]

I'm sure that #5512 will work around this problem, but isn't there a larger issue to resolve? Patrick noted that modules like mdd and lquota also fail in the same manner. Should we investigate fixing racy module loading for all modules?

Comment by Patrick Farrell (Inactive) [ 18/Oct/13 ]

James - I just tried briefly and I'm not able to replicate the issue on a test system with >2 OSTs/OSS (we're seeing it on our 'production' development systems, rather than our dedicated Lustre testing systems).

But as Cory said, it seems obvious that #5512 would fix the problem. If the module does not exist, loading it can hardly be a problem.

Still, I'm worried we might just be dodging this issue and leaving others waiting to be discovered, as this is not the only module load failure in parallel mounts. It's just the only one causing a noticeable problem now. Unless we can point to a clear difference in how the other modules are loaded that shows why they're not failing the mount when one of their inits fails, I think we may see problems there in the future.

Comment by Hongchao Zhang [ 21/Oct/13 ]

The "-1" should be the error returned by the syscall "sys_init_module", and it is -EPERM.
There are two conditions that return -EPERM:

SYSCALL_DEFINE3(init_module, void __user *, umod,
                unsigned long, len, const char __user *, uargs)
{
        struct module *mod;
        int ret = 0;

        /* Must have permission */
        if (!capable(CAP_SYS_MODULE) || modules_disabled)
                return -EPERM;

The "modules_disabled" flag is controlled by "/proc/sys/kernel/modules_disabled". I remember you mentioned that Cray disables module loading after boot completes;
is it implemented this way? And is there any chance "/proc/sys/kernel/modules_disabled" gets set while the parallel mount commands are being issued?

Thanks

Comment by Patrick Farrell (Inactive) [ 21/Oct/13 ]

Hongchao,

A very good guess. In this case, no, module loading is definitely not disabled. We disable module load on the compute nodes of our mainframes, but this is not those nodes (which are heavily stripped down because their OS runs exclusively in memory). This problem is seen on an external CentOS 6.4 system running Lustre. In essence, it's just vanilla CentOS 6.4 with Lustre installed on it - No significant changes to the OS or settings.

I'd encourage you to test target mounts in parallel with the modified modprobe with multiple OSTs on a test system of your own - You may not see the outright mount failure we observe, but I suspect you will see a number of -1s returned.

Comment by Patrick Farrell (Inactive) [ 28/Oct/13 ]

Further thoughts... There are various other possible sources for a -1. I don't think it's EPERM here.

From the init_module syscall in the kernel:

/* Do all the hard work */
mod = load_module(umod, len, uargs);
if (IS_ERR(mod)) {
        mutex_unlock(&module_mutex);
        return PTR_ERR(mod);
}

We also have:
ret = do_one_initcall(mod->init);

And I suspect it's coming from one of those.

Comment by Hongchao Zhang [ 04/Nov/13 ]

There is an issue in the kernel related to loading modules in parallel: loads following the first one don't wait for the module to be initialized,
which causes problems such as the filesystem type not being found because the corresponding module has not finished initializing. (Please see LU-3975 for details.)

This issue could be a duplicate of it; could you please try with it?

Thanks!

Comment by Andreas Dilger [ 27/Nov/13 ]

If there is already a fix in the upstream kernel (http://thread.gmane.org/gmane.linux.kernel/1358707/focus=1358709), we should just apply this patch to the RHEL/SLES kernels that we support, until such a time that they backport the fix themselves. It doesn't make sense to do anything at the Lustre level if this is not a Lustre bug.

Comment by Hongchao Zhang [ 09/Jul/14 ]

The patch http://review.whamcloud.com/#/c/7024/ has been updated. It conflicts with the patch http://review.whamcloud.com/#/c/9832/
in LU-2456, and will be updated again once http://review.whamcloud.com/#/c/9832/ has landed.

Comment by Li Wei (Inactive) [ 28/Jul/14 ]

Nowadays, libcfs suffers from the problem as well, since it has become yet another module that makes additional request_module() calls in its init callback. (Ideally, we should avoid this kind of init callback in Lustre.) RHEL 7 should (to be tested) have enough fixes that these deadlocks will no longer happen. IMHO, it is better to just carry a kernel fix in the RHEL 6 patch series than to go down the hacky road of changing mount.lustre. Here is a patch I'm waiting to test on a larger scale: http://review.whamcloud.com/11229.

Comment by Li Wei (Inactive) [ 08/Aug/14 ]

As discussed with Bob when reviewing the RHEL 6 patch, we might want to check whether a separate patch is needed for SLES kernels as well.

Comment by Peter Jones [ 21/Oct/14 ]

This has never (to my knowledge) been reported on SLES, but it has been reported from multiple sources on RHEL 6.x, so I think it is reasonable to mark this as resolved for 2.5.4 and 2.7 based on Li Wei's fix having landed. If this is ever seen on SLES, then we can track that issue under a new ticket.
