[LU-631] IO errors when using automounter and Lustre Created: 24/Aug/11  Updated: 09/Jul/13  Resolved: 25/Apr/13

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 1.8.6
Fix Version/s: Lustre 2.4.0

Type: Bug Priority: Minor
Reporter: Jeremy Filizetti Assignee: Hongchao Zhang
Resolution: Fixed Votes: 0
Labels: ptr
Environment:

various


Severity: 3
Rank (Obsolete): 7892

 Description   

Ever since we moved from Lustre 1.6.6 to 1.8 I've seen issues with using
the automounter and Lustre. I've finally got around to looking at what
the issue is, but I'm not quite sure what the correct way to resolve it
is. I think the issue will remain in 2.0+ but I didn't look closely at
the code. The issue is that lov_connect which calls lov_connect_obd is
an asynchronous connect that does not wait for all OSCs to be connected
before returning. In the end lustre_fill_super can return before all
OSCs have been set active so any file operations that caused the
automount may return an error. Many lov functions check to make sure
the lov_tgt_desc ltd_active flag is 1 or return -EIO.
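
To illustrate the failure mode described above, here is a minimal sketch in plain C rather than actual Lustre code. The struct tgt_sketch type and the prep_request_set function are hypothetical stand-ins; only the ltd_active flag and the -EIO result come from the description. It shows how an operation arriving before any target has been marked active fails immediately instead of waiting.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for lov_tgt_desc; only the active flag is modeled. */
struct tgt_sketch {
	int ltd_active;	/* 1 once the asynchronous OSC connect has completed */
};

/*
 * Count the targets that can be used right now. If the mount returned
 * before any connect finished, nothing is usable and the caller gets -EIO.
 */
static int prep_request_set(const struct tgt_sketch *tgts, size_t n)
{
	size_t i;
	int count = 0;

	for (i = 0; i < n; i++)
		if (tgts[i].ltd_active)
			count++;

	return count > 0 ? count : -EIO;
}
```

With every flag still 0 (the state right after an automount in this model), the function returns -EIO; once a single target is flagged active it returns 1.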

Original email thread from lustre-devel:
http://groups.google.com/group/lustre-devel-list/browse_thread/thread/4796d88cadf9d0e9/248ebf6e3f9877f3?lnk=gst&q=automount#248ebf6e3f9877f3



 Comments   
Comment by Peter Jones [ 03/Nov/11 ]

Hongchao

Can you please look into this one?

Thanks

Peter

Comment by Hongchao Zhang [ 04/Nov/11 ]

The -EIO returned by "ls -l /lustre/xen1/tmp/testfile" comes from "lov_enqueue": "lov_prep_enqueue_set" finds
no available OSC to send the glimpse request to, so the request set contains no lov_request and it returns -EIO.

In "lov_prep_enqueue_set":
...
        if (!set->set_count)
                GOTO(out_set, rc = -EIO);
...

Here we could wait for these OSCs to be connected and activated, but that would take a long time if the OST is recovering.
Furthermore, there is another problem in the current code:
if a file has more than one stripe and one OSC is activated while the other is not, then only one glimpse request
is sent and only its atime/ctime/mtime and size are taken into account; the second stripe's are not. The same effect occurs
if we do not return -EIO in the above code snippet.
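
The partial-stripe problem can be sketched as follows. This is a simplified model, not the Lustre implementation: struct stripe_sketch and glimpse_mtime are hypothetical names, and only mtime is merged (taking the newest value) to keep the example short.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical per-stripe state for the sketch. */
struct stripe_sketch {
	int  active;	/* OSC for this stripe is connected and activated */
	long mtime;	/* mtime the OST would report in a glimpse reply */
};

/*
 * Merge mtime over the stripes that can be queried.
 * Returns the number of stripes queried, or -EIO if none were available.
 */
static int glimpse_mtime(const struct stripe_sketch *s, size_t n, long *mtime_out)
{
	int queried = 0;
	long newest = 0;

	for (size_t i = 0; i < n; i++) {
		if (!s[i].active)
			continue;	/* skipped silently: its attributes are lost */
		if (queried == 0 || s[i].mtime > newest)
			newest = s[i].mtime;
		queried++;
	}

	if (queried == 0)
		return -EIO;

	*mtime_out = newest;
	return queried;
}
```

With two stripes where only the first is active, the call succeeds but reports the first stripe's mtime even when the inactive stripe holds a newer one, so the caller sees no error and a stale attribute.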

Comment by Jeremy Filizetti [ 06/Nov/11 ]

I think the easiest way to make a satisfactory fix (to me) is to make sure that nothing is queued to the OSC before it has been set active so that we don't return -EIO from lov_prep_enqueue_set on operations that might have triggered the mount from the automounter.
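
One way to picture this suggestion is a bounded wait before anything is queued. The sketch below is an assumption-laden illustration, not the patch that eventually landed: struct osc_sketch, wait_all_active, and connect_one are hypothetical, and connect_one merely simulates one connection completing per poll.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical OSC state for the sketch. */
struct osc_sketch {
	int active;	/* 1 once the OSC has been set active */
};

/* Callback that lets connection progress happen between checks. */
typedef void (*poll_fn)(struct osc_sketch *oscs, size_t n);

/* Simulated progress: mark the first inactive OSC as active. */
static void connect_one(struct osc_sketch *oscs, size_t n)
{
	for (size_t i = 0; i < n; i++) {
		if (!oscs[i].active) {
			oscs[i].active = 1;
			return;
		}
	}
}

/*
 * Return 0 once every OSC is active, polling at most max_polls times.
 * Returns -ETIMEDOUT if some OSC is still inactive after the last poll.
 */
static int wait_all_active(struct osc_sketch *oscs, size_t n,
			   poll_fn poll, int max_polls)
{
	for (int attempt = 0; attempt <= max_polls; attempt++) {
		size_t i;

		for (i = 0; i < n; i++)
			if (!oscs[i].active)
				break;
		if (i == n)
			return 0;	/* safe to queue requests now */
		if (attempt < max_polls)
			poll(oscs, n);
	}
	return -ETIMEDOUT;
}
```

The bound matters because, as noted earlier in this thread, an OST in recovery can take a long time to connect; an unbounded wait would trade the -EIO for a hang.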

As for the bug you mention, where {a,c,m}time and size are not accounted for from all of the OSCs when only some of them are done, that should also be fixed. Maybe it should be tracked under a separate bug.

Comment by Peter Jones [ 23/Nov/11 ]

Bobi

Hongchao is out for a while. Could you please investigate this issue in his absence?

Thanks

Peter

Comment by Hongchao Zhang [ 02/Dec/11 ]

the patch is tracked at http://review.whamcloud.com/#change,2469

Comment by Peter Jones [ 25/Apr/13 ]

Landed for 2.4

Comment by Alexey Lyashkov [ 09/Jul/13 ]

Good patch for making the MDT hang if someone adds an OST that is unreachable at config change time.
A new creation will call statfs to obtain information about the new OST, and any new creation will be blocked until the OST connection has finished.
