[LU-986] Possible Race Condition Created: 12/Jan/12  Updated: 06/Mar/12  Resolved: 06/Mar/12

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.1.0
Fix Version/s: Lustre 2.1.1

Type: Bug
Priority: Minor
Reporter: Roger Spellman (Inactive)
Assignee: Peter Jones
Resolution: Fixed
Votes: 0
Labels: None
Environment: CentOS 6.0


Severity: 3
Rank (Obsolete): 6489

 Description   

I believe that I have found a possible race condition.

I have an OSS with four OSTs. If I mount them one at a time, then they always mount just fine. That is, the following always works:

mount -t lustre /dev/mapper/map00 /mnt/ost00
mount -t lustre /dev/mapper/map01 /mnt/ost01
mount -t lustre /dev/mapper/map02 /mnt/ost02
mount -t lustre /dev/mapper/map03 /mnt/ost03

If I mount them all at the same time, as follows, the mounts sometimes fail.

mount -t lustre /dev/mapper/map00 /mnt/ost00 &
mount -t lustre /dev/mapper/map01 /mnt/ost01 &
mount -t lustre /dev/mapper/map02 /mnt/ost02 &
mount -t lustre /dev/mapper/map03 /mnt/ost03 &
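To make the intermittent failure easier to count, the parallel case can be wrapped in a small function that waits for each background mount and reports how many failed. This is only a sketch: the device and mount-point names are the ones used above, while MOUNT_CMD and mount_osts_parallel are hypothetical names introduced here so that the function can be dry-run without touching real devices.

```shell
# Sketch only: MOUNT_CMD is a hypothetical override hook, not part of Lustre.
MOUNT_CMD=${MOUNT_CMD:-mount}

mount_osts_parallel() {
    pids=""
    # Start all four mounts in the background, as in the failing case.
    for i in 00 01 02 03; do
        $MOUNT_CMD -t lustre "/dev/mapper/map$i" "/mnt/ost$i" &
        pids="$pids $!"
    done
    # Wait for each job separately so that every exit status is seen.
    failed=0
    for pid in $pids; do
        wait "$pid" || failed=$((failed + 1))
    done
    echo "$failed of 4 mounts failed"
    return "$failed"
}
```

Calling mount_osts_parallel as root on the OSS runs the real mounts; setting MOUNT_CMD=echo beforehand only prints the commands that would be run.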

The failures occur because some modules do not load successfully. I get errors such as:

kernel: lov: gave up waiting for init of module osc.
kernel: lov: Unknown symbol osc_update_enqueue

To track this down, I added printk calls to osc_init() in osc_request.c and to init_lustre_quota() in quota_interface.c (the module init routines for osc and lquota, respectively).

If I mount the targets without the ampersand (and sometimes even with it), lquota is initialized before osc_init() runs. In these cases, everything mounts just fine.

In the cases where there is a problem, osc_init() is called before init_lustre_quota().

osc_init() calls:

cfs_request_module("lquota");

Using the printk calls, I have shown that when osc_init() runs before init_lustre_quota(), that call to cfs_request_module() does not return quickly, meaning that the system is NOT loading the lquota.ko module right away. I believe this is because multiple Lustre modules are trying to load lquota at once.
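If the ordering is indeed the trigger, one way to test that from user space would be to load lquota explicitly before starting the parallel mounts, so that it is already initialized by the time osc_init() asks for it. The following is an untested sketch of that idea, not the fix that went into any Lustre release; MODPROBE_CMD and preload_lquota are hypothetical names introduced here for illustration.

```shell
# Sketch only: MODPROBE_CMD is a hypothetical override hook for dry runs.
MODPROBE_CMD=${MODPROBE_CMD:-modprobe}

preload_lquota() {
    # Load lquota synchronously, before any mount is started, so that a
    # later request for it from another module finds it already loaded.
    if $MODPROBE_CMD lquota; then
        echo "lquota loaded"
    else
        echo "could not load lquota" >&2
        return 1
    fi
}
```

Running preload_lquota as root and then issuing the four backgrounded mount commands would show whether the failure still occurs when lquota is guaranteed to be initialized first.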

Question: Why do several Lustre modules call cfs_request_module("lquota")?

Are they using a service or a variable exported by lquota? I don't think so. If they were, modprobe would force lquota to be loaded first, which is not the case. In particular, the lustre and osc modules DO NOT have a dependency on lquota. So why are these modules calling cfs_request_module("lquota")?
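The dependency claim can be checked from the shell with modinfo, whose -F depends option prints the comma-separated list of modules that a given module depends on. The helper below is a minimal sketch; depends_on_lquota and MODINFO_CMD are hypothetical names introduced here, and the output depends on the Lustre build installed.

```shell
# Sketch only: MODINFO_CMD is a hypothetical override hook for dry runs.
MODINFO_CMD=${MODINFO_CMD:-modinfo}

depends_on_lquota() {
    # "modinfo -F depends MODULE" prints a comma-separated dependency list.
    deps=$($MODINFO_CMD -F depends "$1") || return 2
    # Wrap the list in commas so that only whole module names match.
    case ",$deps," in
        *,lquota,*)
            echo "$1 depends on lquota"
            return 0
            ;;
        *)
            echo "$1 does not list lquota as a dependency"
            return 1
            ;;
    esac
}
```

For example, depends_on_lquota osc and depends_on_lquota lustre report whether each module declares lquota as a dependency.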

Roger Spellman
Staff Engineer
Terascala, Inc.
508-588-1501
www.terascala.com <http://www.terascala.com/>



 Comments   
Comment by Peter Jones [ 02/Mar/12 ]

Roger

Do you still see this issue with the latest master code?

Peter

Comment by Roger Spellman (Inactive) [ 05/Mar/12 ]

Peter, I'm waiting to get onto our system so that I can retest it. The system is in the middle of a multi-day test that should complete tomorrow.

Comment by Roger Spellman (Inactive) [ 06/Mar/12 ]

Peter,

I was not able to reproduce this bug on 2.1.1.RC4.
So, it looks fixed.

Roger

Comment by Peter Jones [ 06/Mar/12 ]

Thanks for confirming, Roger. I knew that there had been some quota-related fixes.
