[LU-6553] Recurrence of LU-5299: obd_mount_server.c:1690:osd_start()) ASSERTION( obd ) failed Created: 01/May/15  Updated: 13/Oct/21  Resolved: 13/Oct/21

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.7.0, Lustre 2.5.4
Fix Version/s: None

Type: Bug Priority: Major
Reporter: Patrick Farrell (Inactive) Assignee: Bruno Faccini (Inactive)
Resolution: Cannot Reproduce Votes: 0
Labels: None
Environment:

Combined MGT/MDT, racing multiple mount commands.


Attachments: File perses_20150430t095047_mds2.log.sort.gz    
Issue Links:
Related
is related to LU-5573 Test timeout conf-sanity test_41c Resolved
is related to LU-5299 osd_start() LBUG when doing parallel ... Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

The patch for LU-5573 (http://review.whamcloud.com/#/c/12353/), which closed LU-5299, does not cover some cases.

Specifically, the code which enables the combined MGT/MDT to start correctly also disables the race protection for a combined MGT/MDT.

So racing multiple mount commands on a combined MGT/MDT can still cause this problem.

I've taken a look, and I don't see any easy way to fix this in the current context. I can provide dumps if needed, and I'll attach a log now.

Note the attempts to start MDT0000. There are five, four of which start after the first one but before it has completed.



 Comments   
Comment by Bruno Faccini (Inactive) [ 02/May/15 ]

Hello Patrick,
As I am the unfortunate author of both complementary patches for LU-5299 and LU-5573, I thought I should assign this ticket to myself.
Thanks for the report and for the attached Lustre debug trace for the LBUG.
Having looked at the trace, I think you are right that the race is still present, but only for concurrent mount/start commands where either the nosvc or the nomgs flag has been specified for a combined MDT/MGS device.
I will try to fix this case too, in a new complementary patch.

Comment by Patrick Farrell (Inactive) [ 04/May/15 ]

Thanks, Bruno - Good luck. I couldn't find an easy way to do it, but I expect you know this code much better than me.

Comment by Wally Wang (Inactive) [ 10/Sep/15 ]

Hi Bruno, any progress on this one? Thanks.

Comment by Bruno Faccini (Inactive) [ 11/Feb/16 ]

Patrick, Wally,
Sorry I am late on this, but I am back to working on a new add-on patch to fully fix it.
Can you help by providing your reproducer, or at least give some details on how these concurrent mount commands are generated?

Comment by Patrick Farrell (Inactive) [ 11/Feb/16 ]

Bruno -

We don't have a specific reproducer. It actually turned out we were doing concurrent mounts because our failover stuff was misconfigured on an internal system.
But I think it would be sufficient to simply use a bash script to spawn off multiple mount commands for a particular target.

Comment by Wally Wang (Inactive) [ 15/Mar/16 ]

Bruno,
Here is a simple reproducer:

1. Create and start a Lustre file system with a combined MGT/MDT.
2. Unmount the MGT and the MDT.
3. Run the following 'test_mount' script 5 times in parallel:

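cat test_mount
#!/bin/bash
# Mount the combined device twice: first with nosvc (MGS only, no MDT service),
# then with nomgs (MDT only, without starting the MGS).
mount -t lustre -o nosvc,abort_recov --verbose /dev/sdd /tmp/lustre/scratch/mgt
mount -t lustre -o nomgs,abort_recov --verbose /dev/sdd /tmp/lustre/scratch/mdt

# Launch five copies of the script in the background so their mounts race.
for ((i=0;i<5;i++)); do ./test_mount & done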
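
The loop above backgrounds the five copies and returns at once, so their exit statuses are lost. A minimal sketch of a wrapper that waits for every copy and counts the failures could look like this. The helper name run_parallel is hypothetical and not part of Lustre or of the original reproducer; it only assumes bash.

```shell
#!/bin/bash
# Hypothetical helper (not from the ticket): run the same command N times
# concurrently, wait for all instances, and report how many exited non-zero.
run_parallel() {
    local count=$1; shift
    local pids=() failed=0 i pid
    for ((i = 0; i < count; i++)); do
        "$@" &                  # start one instance in the background
        pids+=("$!")            # remember its PID so it can be waited on
    done
    for pid in "${pids[@]}"; do
        # wait returns the exit status of that specific instance
        wait "$pid" || failed=$((failed + 1))
    done
    echo "$failed of $count instances failed"
    [ "$failed" -eq 0 ]         # overall status: success only if none failed
}
```

With the script from the steps above, this would be invoked as `run_parallel 5 ./test_mount`. If the node actually hits the assertion, the wrapper may never get as far as printing its summary, so it is mainly useful for telling apart runs where the mounts merely fail from runs where they all succeed.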

Comment by Patrick Farrell (Inactive) [ 15/Mar/16 ]

Thanks, Wally!

Generated at Sat Feb 10 02:01:12 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.