[LU-959] kuc channels not reestablished after MDS crash Created: 04/Jan/12  Updated: 11/Apr/12  Resolved: 11/Apr/12

Status: Closed
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.0.0, Lustre 2.1.0
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: Thomas LEIBOVICI - CEA (Inactive) Assignee: Niu Yawei (Inactive)
Resolution: Cannot Reproduce Votes: 0
Labels: None

Severity: 3
Rank (Obsolete): 6499

 Description   

It seems the kuc channels are not reestablished after an MDS crash.
In particular, processes that are listening for changelogs remain stuck listening to the kernel->userspace pipe
while no message is sent from the MDS, because clients do not re-register kuc listeners after reconnecting to the MDS.

It would probably need an action in mdc_import_event() to reregister kuc listeners,
something like:

in mdc_import_event():

case IMP_EVENT_ACTIVE: {
rc = obd_notify_observer(obd, obd, OBD_NOTIFY_ACTIVE, NULL);
+ /* re-establish kuc registration after reconnecting */
+ if (rc == 0)
+ rc = mdc_kuc_reregister(imp);



 Comments   
Comment by Peter Jones [ 04/Jan/12 ]

Niu

Could you please look into this one

Thanks

Peter

Comment by Niu Yawei (Inactive) [ 05/Jan/12 ]

Hi, Thomas

I don't quite follow your description of this ticket, and I didn't find mdc_kuc_reregister() either. Could you elaborate on this ticket a little more, or post the patch on gerrit for review? Thank you.

Comment by Thomas LEIBOVICI - CEA (Inactive) [ 05/Jan/12 ]

OK, I'll try to explain the issue in more detail.

To receive MDT changelogs from a client, llapi_changelog_start() is called by the user space program (like lfs):
1) this creates a "kuc" channel (communication channel between mdc and user space, which is implemented in libcfs/libcfs/kernel_user_comm.c)
2) it calls an ioctl which results in calling mdc_ioc_changelog_send() in mdc, which sends an RPC to the MDS to notify it that it must send CL records to this client.
Then, the user space program calls llapi_changelog_recv() to get changelog records. This listens for incoming data from the kuc channel.

The problem is that there is no recovery mechanism for KUC channels when the MDS restarts:
the client remains blocked in llapi_changelog_recv() and no more data is sent from the MDS (it forgot that a client was listening for changelogs).
I think there should be an internal mechanism in MDC to call mdc_ioc_changelog_send() again after an MDC/MDS reconnection.

This is what I suggested: a mdc_kuc_reregister() should be implemented and called from mdc_import_event(),
so that mdc_ioc_changelog_send() is called for each process registered in the kuc layer.
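
To make the suggestion concrete, here is a minimal user-space sketch of the idea, not actual Lustre code. It assumes the kuc layer keeps a table of registered listeners; struct kuc_listener, kuc_listener_add(), changelog_resend() and kuc_reregister_all() are hypothetical names invented for this illustration, and changelog_resend() only stands in for re-issuing the changelog request to the MDS.

```c
#include <assert.h>
#include <stdio.h>

#define MAX_LISTENERS 8

/* Hypothetical record of one registered changelog listener. */
struct kuc_listener {
    int       in_use;    /* slot holds a live registration */
    int       pipe_fd;   /* write end of the kernel->userspace pipe */
    long long next_rec;  /* first changelog record not yet delivered */
    int       resends;   /* how many times the request was re-sent */
};

static struct kuc_listener listeners[MAX_LISTENERS];

/* Remember a listener so it can be re-registered after a reconnect.
 * Returns the slot index, or -1 if the table is full. */
int kuc_listener_add(int pipe_fd, long long start_rec)
{
    for (int i = 0; i < MAX_LISTENERS; i++) {
        if (!listeners[i].in_use) {
            listeners[i] = (struct kuc_listener){
                .in_use = 1, .pipe_fd = pipe_fd,
                .next_rec = start_rec, .resends = 0,
            };
            return i;
        }
    }
    return -1;
}

/* Stand-in for re-issuing the changelog request to the MDS
 * (an assumption: the real code would send an RPC here). */
int changelog_resend(struct kuc_listener *l)
{
    assert(l->in_use);
    l->resends++;
    printf("resend: fd=%d startrec=%lld\n", l->pipe_fd, l->next_rec);
    return 0;
}

/* Models the proposed mdc_kuc_reregister(): walk the table and re-send
 * the request for every live listener. Returns the number of listeners
 * re-registered, or the first non-zero error code. */
int kuc_reregister_all(void)
{
    int count = 0;

    for (int i = 0; i < MAX_LISTENERS; i++) {
        if (!listeners[i].in_use)
            continue;
        int rc = changelog_resend(&listeners[i]);
        if (rc != 0)
            return rc;
        count++;
    }
    return count;
}
```

In this model, the IMP_EVENT_ACTIVE branch would call kuc_reregister_all() once the import is active again. Each listener resumes from its own next_rec, so the sketch assumes the layer tracks the last record delivered per listener.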

Do you have a better understanding of this issue?
Thanks

Comment by Niu Yawei (Inactive) [ 05/Jan/12 ]

Thanks a lot for the details, Thomas. I think I have a much better understanding now, but I still don't see why the client was blocked in llapi_changelog_recv() after the MDS restarted: mdc_ioc_changelog_send() just uses the llog APIs to read the changelog on the MDS and then puts it in the pipe. So when the MDS restarts, even if the client's llog processing breaks off early because of an RPC error, the CL_EOF will always be written, and llapi_changelog_recv() should receive this EOF record and stop reading.

Do you have the debug log and stack trace from when the process was stuck in llapi_changelog_recv()?
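
The expected behaviour can be modelled with a small self-contained sketch over a POSIX pipe. This is not the mdc code: struct rec, REC_DATA, REC_EOF, changelog_send(), changelog_recv_all() and run_cycle() are hypothetical names used only for illustration, and the RPC failure is simulated with a fail_at index. The point it shows is that the sender writes the EOF marker on every exit path, so the reader always terminates.

```c
#include <assert.h>
#include <unistd.h>

/* Hypothetical message types carried over the pipe. */
enum { REC_DATA = 1, REC_EOF = 2 };

struct rec {
    int type;   /* REC_DATA or REC_EOF */
    int index;  /* record number, unused for REC_EOF */
};

/* Sender side: push up to nrecs records, stopping early when the
 * simulated RPC fails at record fail_at (use -1 for "never fails").
 * The EOF marker is written on every exit path. */
int changelog_send(int wfd, int nrecs, int fail_at)
{
    int rc = 0;

    for (int i = 0; i < nrecs; i++) {
        if (i == fail_at) {          /* simulated RPC error */
            rc = -1;
            break;
        }
        struct rec r = { REC_DATA, i };
        if (write(wfd, &r, sizeof(r)) != (ssize_t)sizeof(r)) {
            rc = -1;
            break;
        }
    }

    /* Always tell the reader that nothing more is coming. */
    struct rec eof = { REC_EOF, 0 };
    if (write(wfd, &eof, sizeof(eof)) != (ssize_t)sizeof(eof))
        rc = -1;
    return rc;
}

/* Reader side: consume records until the EOF marker arrives.
 * Returns the number of data records seen, or -1 on a read error. */
int changelog_recv_all(int rfd)
{
    int count = 0;

    for (;;) {
        struct rec r;
        if (read(rfd, &r, sizeof(r)) != (ssize_t)sizeof(r))
            return -1;
        if (r.type == REC_EOF)
            return count;
        count++;
    }
}

/* Run one send/receive cycle over a fresh pipe. */
int run_cycle(int nrecs, int fail_at)
{
    int fds[2];

    if (pipe(fds) != 0)
        return -1;
    changelog_send(fds[1], nrecs, fail_at);
    int got = changelog_recv_all(fds[0]);
    close(fds[0]);
    close(fds[1]);
    return got;
}
```

If the sender instead returned on the error path without writing the EOF marker, the reader in this model would block in read() for as long as the write end stays open, which matches the symptom described in this ticket.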

Comment by Thomas LEIBOVICI - CEA (Inactive) [ 05/Jan/12 ]

Right, I see what you mean. Maybe my initial understanding of the problem is wrong.
Unfortunately, I have no stack trace for you right now. It is just something we noticed...
So we'll have to wait for the next MDS crash to get a detailed stack trace, which will hopefully not be too soon.

Comment by Niu Yawei (Inactive) [ 10/Apr/12 ]

Thomas, is this still relevant? Can we close it?

Comment by Thomas LEIBOVICI - CEA (Inactive) [ 11/Apr/12 ]

OK, let's close it. I'll reopen it if it occurs again.

Comment by Niu Yawei (Inactive) [ 11/Apr/12 ]

Not reproduced; closing it for now.
