[LU-3456] Remove or refactor "ost_connect failed" message - Whamcloud Community JIRA

Details

Type: Bug
Resolution: Fixed
Priority: Minor
Fix Version/s: Lustre 2.7.0
Affects Version/s: Lustre 2.4.0
Labels:
- shh

Severity:
3
Rank (Obsolete):
8640

Description

I see this message at startup time on the MDS. If it's safe to ignore, it should be removed. If it's important, it should be refactored to be understandable by an admin (I don't even know what it means, and it's a console message).

2013-06-11 12:53:52 LustreError: 11-0: lc2-OST0007-osc-MDT0000: Communicating with 10.1.1.48@o2ib9, operation ost_connect failed with -19.

Activity

Andreas Dilger added a comment - 22/Apr/14 10:25 PM

Prakash, you are correct that this can happen if the MDS is started before the OSS. The message is printed to the console to alert the sysadmin in case the target OST is not starting up properly, but I agree it is a distraction if it is printed due to some transient condition.

That said, when Brian submitted the patch to update this console message he left in the printing of errors during the initial connection attempt. I think it would make sense to avoid printing this error if there are just a small number of failed initial connection attempts, but still print something if the connection is failing for a long time. It seems reasonable to only print out such messages when there are persistent problems on the connection.

I've pushed an RFC patch http://review.whamcloud.com/10057 but I haven't tested it at all. In particular, I'm not sure if the same request is used repeatedly for the initial connection (which means rq_nr_resends is properly incremented) or if a new request is used each time (which means my attempt at squashing the initial connect messages will fail). Bobijam, could you please take a look at this?
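The open question above can be illustrated with a small simulation. This is only a sketch of the idea described in this comment, not the contents of the RFC patch: the `ConnectRequest` class, the `should_print()` and `simulate()` helpers, and the `QUIET_ATTEMPTS` threshold are all hypothetical, and `nr_resends` merely stands in for `rq_nr_resends`.

```python
from dataclasses import dataclass

# Hypothetical threshold: how many failed attempts to stay quiet for.
QUIET_ATTEMPTS = 5


@dataclass
class ConnectRequest:
    # Stand-in for rq_nr_resends on a real request (assumption).
    nr_resends: int = 0


def should_print(req, quiet_attempts=QUIET_ATTEMPTS):
    """Print only once the request has already been resent often enough."""
    return req.nr_resends >= quiet_attempts


def simulate(failures, reuse_request):
    """Return the 1-based attempt numbers at which a message would print."""
    printed = []
    req = ConnectRequest()
    for attempt in range(1, failures + 1):
        if not reuse_request:
            # A fresh request per attempt starts its counter at zero again.
            req = ConnectRequest()
        if should_print(req):
            printed.append(attempt)
        req.nr_resends += 1
    return printed


if __name__ == "__main__":
    print("same request reused:", simulate(8, reuse_request=True))
    print("new request each time:", simulate(8, reuse_request=False))
```

With a reused request the message stays quiet for the first few failures and then appears on every later one. With a new request per attempt the counter never reaches the threshold, so a persistent failure would never be reported, which is the failure mode the comment is worried about.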

Prakash Surya (Inactive) added a comment - 21/Apr/14 5:10 PM

So then, that sounds like "normal" operation to me. I don't think it warrants a console message. It's probably an artifact of how I sometimes power cycle all server nodes in a test filesystem. If the MDS comes up before the OSS nodes, then this message will appear?

Zhenyu Xu added a comment - 18/Apr/14 1:52 PM

This message indicates that the MDS tried to connect to OST0007 while the OSS had not yet set up OST0007, or while OST0007 was temporarily failed (-19 == -ENODEV).
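For admins who want to decode such a line themselves, here is a minimal sketch that splits the console message into its parts and maps the numeric code to an errno name. The regular expression is inferred only from the single line quoted in this ticket, so other Lustre versions may format the message differently, and the `explain()` helper is hypothetical.

```python
import errno
import os
import re

# Pattern inferred from the one console line quoted in this ticket (assumption).
CONSOLE_RE = re.compile(
    r"LustreError: [\w-]+: (?P<target>\S+): Communicating with (?P<peer>\S+), "
    r"operation (?P<op>\w+) failed with (?P<rc>-?\d+)\."
)


def explain(line):
    """Return the parsed fields of a matching console line, or None."""
    m = CONSOLE_RE.search(line)
    if m is None:
        return None
    code = abs(int(m.group("rc")))
    return {
        "target": m.group("target"),
        "peer": m.group("peer"),
        "operation": m.group("op"),
        # errno.errorcode maps numbers to names using the local platform's table.
        "errno_name": errno.errorcode.get(code, "UNKNOWN"),
        "errno_text": os.strerror(code),
    }


if __name__ == "__main__":
    sample = (
        "2013-06-11 12:53:52 LustreError: 11-0: lc2-OST0007-osc-MDT0000: "
        "Communicating with 10.1.1.48@o2ib9, operation ost_connect failed with -19."
    )
    print(explain(sample))
```

The errno lookup uses the table of the machine running the script, which matches the server's numbering on Linux.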

Peter Jones added a comment - 18/Apr/14 12:58 PM

Bobi

So could you propose an alternative wording for the message that would be more intuitive?

Thanks

Peter

Zhenyu Xu added a comment - 13/Jun/13 3:11 AM

It comes from ptlrpc_check_status() and indicates that when the MDS starts up, it tries to connect to the OST while the OST device is not yet available.

The comment in ptlrpc_console_allow() shows that errors on the initial connection are not suppressed, while error messages for reconnect requests are suppressed.

Peter Jones added a comment - 12/Jun/13 12:18 AM

Bobijam

Could you please help with this one?

Thanks

Peter

People

Assignee: Zhenyu Xu

Reporter: Prakash Surya (Inactive)

Votes: 0

Watchers: 7

Dates

Created: 11/Jun/13 8:35 PM

Updated: 26/Feb/15 9:56 PM

Resolved: 26/Feb/15 9:56 PM