[LU-9737] lnetctl net show command hung after add net Created: 05/Jul/17  Updated: 14/Jul/17

Status: Open
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Question/Request
Priority: Minor
Reporter: sebg-crd-pm (Inactive)
Assignee: Amir Shehata (Inactive)
Resolution: Unresolved
Votes: 0
Labels: None
Environment:

lustre: 2.10.0-RC1
lnet: 0.7.0



 Description   

Hi,

I am testing Multi-Rail in 2.10.0-RC1 and added one network with the ib0 and ib1 interfaces.
But lnetctl/lctl always hangs after the net is added.
Do I need to enable anything? Does anyone have a user guide for Multi-Rail configuration? Thanks.

[test steps]
1. modprobe lnet
2. lnetctl lnet configure
3. lnetctl net show
net:

  - net: lo
    nid: 0@lo
    status: up

4. lnetctl net add --net o2ib0 --if ib0,ib1
5. lnetctl net show => hung

[kernel message]
[ 434.309534] LNet: Added LNI 172.20.110.220@o2ib [8/256/0/180]
[ 434.323740] LNet: Added LNI 172.20.110.221@o2ib [8/256/0/180]
[ 726.028183] INFO: task lctl:12235 blocked for more than 120 seconds.
[ 726.028248] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 726.028306] lctl D ffffffffa0cdc7b0 0 12235 10719 0x00000084
[ 726.028313] ffff88086c75fd40 0000000000000086 ffff88086cb69f60 ffff88086c75ffd8
[ 726.028317] ffff88086c75ffd8 ffff88086c75ffd8 ffff88086cb69f60 ffffffffa0cdc7a8
[ 726.028320] ffffffffa0cdc7ac ffff88086cb69f60 00000000ffffffff ffffffffa0cdc7b0
[ 726.028324] Call Trace:
[ 726.028343] [<ffffffff8168c769>] schedule_preempt_disabled+0x29/0x70
[ 726.028348] [<ffffffff8168a3c5>] __mutex_lock_slowpath+0xc5/0x1c0
[ 726.028354] [<ffffffff8168982f>] mutex_lock+0x1f/0x2f
[ 726.028372] [<ffffffffa0c96fd6>] LNetNIInit+0x46/0xa40 [lnet]
[ 726.028388] [<ffffffffa0cb478f>] lnet_ioctl+0x4f/0x250 [lnet]
[ 726.028404] [<ffffffffa0c384ac>] libcfs_ioctl+0x2ac/0x4c0 [libcfs]
[ 726.028415] [<ffffffffa0c34517>] libcfs_psdev_ioctl+0x67/0xf0 [libcfs]
[ 726.028422] [<ffffffff81211ed5>] do_vfs_ioctl+0x2d5/0x4b0
[ 726.028427] [<ffffffff8121cb77>] ? __fd_install+0x47/0x60
[ 726.028431] [<ffffffff81212151>] SyS_ioctl+0xa1/0xc0
[ 726.028437] [<ffffffff816965c9>] system_call_fastpath+0x16/0x1b
[ 846.028188] INFO: task lctl:12235 blocked for more than 120 seconds.
[ 846.028256] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
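
When a longer kernel log has to be shared, it can help to reduce it to the hung-task reports plus the stack frames that belong to the LNet modules. The following is a minimal sketch and not part of the original report: the function name `hung_task_summary` is made up for illustration, and it assumes the log format shown above, where module frames end in `[lnet]` or `[libcfs]`.

```shell
# Hypothetical helper: reads kernel log text on stdin and prints only
# the hung-task report lines and the stack frames from lnet/libcfs.
hung_task_summary() {
    grep -E 'blocked for more than [0-9]+ seconds|\[(lnet|libcfs)\][[:space:]]*$'
}

# Example use on a live node (reading dmesg may require root):
#   dmesg | hung_task_summary
```

On the trace above this would keep the "blocked for more than 120 seconds" lines together with the LNetNIInit, lnet_ioctl, libcfs_ioctl and libcfs_psdev_ioctl frames, and drop the generic kernel frames.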



 Comments   
Comment by Peter Jones [ 05/Jul/17 ]

The Multi-Rail instructions are in the manual - http://doc.lustre.org/lustre_manual.xhtml#lnetmr

Comment by Peter Jones [ 05/Jul/17 ]

Amir

Could you please assist with any follow on questions relating to the instructions in the manual?

Thanks

Peter

Comment by Amir Shehata (Inactive) [ 05/Jul/17 ]

I'm unable to reproduce your problem locally.

Is this reproducible 100% of the time? From the stack trace it appears that the ln_api_mutex is not being unlocked, causing a deadlock. But I don't see a problem in the code.

How did you get 2.10-RC1? Did you build it yourself, or did you download the RPMs from somewhere?

Are there other users trying to run "lctl" commands at the same time when you encounter this problem?

Do you have Lustre up, or are you loading LNet by itself?

Can you also paste the output of the following command:

lnetctl -h

Comment by sebg-crd-pm (Inactive) [ 06/Jul/17 ]

Hi,

#Is this reproducible 100% of the time?
=>Yes.
#How did you get 2.10-RC1? Did you build it yourself, or did you download the RPMs from somewhere?
=>I got it from https://git.hpdd.intel.com/?p=fs/lustre-release.git and built it myself.
#Are there other users trying to run "lctl" commands at the same time when you encounter this problem?
=>No, it is only used by me.
#Do you have Lustre up, or are you loading LNet by itself?
=>I have also tried "lctl net up" and "lnet start". They all hit the same issue.
I got the following message when executing "lnet status". Is that all right?
error: get_param: param_path 'health_check': No such file or directory
running
#Can you also paste the output of the following command:
=>[root@mdsb1 ~]# lnetctl -h
Try interactive use without arguments or use one of:
"lnet"
"route"
"net"
"routing"
"set"
"import"
"export"
"stats"
"peer_credits"
"help"
"exit"
"quit"
as argument.

Because I had installed Lustre 2.9 on these nodes before installing Lustre 2.10-RC1,
I will try installing Lustre 2.10-RC1 on a node with a clean OS install to narrow down the problem.
Alternatively, you can give me a diagnostic script to collect whatever you want to know.

Another question: if "lnetctl net add --net o2ib0 --if ib0,ib1" works on the MGS node,
how should I specify the MGS NID when formatting the MDS/OSS or mounting Lustre? Thanks.
(Do I need to include both the ib0 and ib1 NIDs of the MGS, like
"mkfs.lustre --reformat --mdt --index=0 --mgsnode=172.20.110.220@o2ib:172.20.110.221@o2ib" ?
mount.lustre 172.20.110.220@o2ib:172.20.110.221@o2ib:/hpcfs /mnt/client ? )

Comment by sebg-crd-pm (Inactive) [ 06/Jul/17 ]

Hi Amir Shehata,
I found that there is no lnetctl in the Lustre I built. The lnetctl file was probably installed earlier by IEEL Lustre.
lnetctl works now after I installed the yaml-devel package and rebuilt Lustre. Thanks.

Another question: if "lnetctl net add --net o2ib0 --if ib0,ib1" works on the MGS node,
how should I specify the MGS NID when formatting the MDS/OSS or mounting Lustre? Thanks.
(Do I need to include both the ib0 and ib1 NIDs of the MGS, like
"mkfs.lustre --reformat --mdt --index=0 --mgsnode=172.20.110.220@o2ib:172.20.110.221@o2ib ..." ?
mount.lustre 172.20.110.220@o2ib:172.20.110.221@o2ib:/hpcfs /mnt/client ? )

Comment by Olaf Weber [ 10/Jul/17 ]

Hi Amir, this looks like a duplicate of LU-9729.

Comment by Olaf Weber [ 10/Jul/17 ]

Based on my analysis of the source code and the procedure that created the hang, it is almost certain that this is a duplicate of LU-9729.

Please note that even if the fix for LU-9729 is included, the real problem is using an older lnetctl.

To ensure that lnetctl will be built, install the rpms for libyaml and libyaml-devel on the build machine. This is not a hard requirement at the moment, but it will be in the future, because you need an up-to-date lnetctl to enable and configure new functionality like LNet Multi-Rail.

Remember that to use lnetctl the libyaml rpm also has to be installed on all nodes.
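
One way to spot a stale lnetctl, sketched here under assumptions rather than taken from the ticket: the help listing earlier in this ticket shows "peer_credits" but no "peer" entry, while the Multi-Rail section of the manual configures peers through `lnetctl peer`. Assuming a Multi-Rail capable lnetctl lists "peer" in its help output in the same quoted, one-per-line format, a check could look like the following. The function name `lnetctl_lists_peer` is hypothetical.

```shell
# Hypothetical helper: reads `lnetctl -h` output on stdin and succeeds
# only if some line consists of the quoted command name "peer".
# Assumption: a Multi-Rail capable lnetctl lists "peer"; the older
# binary in this ticket lists "peer_credits" instead.
lnetctl_lists_peer() {
    grep -Eq '^[[:space:]]*"peer"[[:space:]]*$'
}

# Example use on a node (left commented so the sketch has no side effects):
#   lnetctl -h 2>&1 | lnetctl_lists_peer || echo "lnetctl may be stale"
#   rpm -qf "$(command -v lnetctl)"   # which package owns the binary
#   rpm -q libyaml                    # runtime dependency on every node
```

The `rpm -qf` line is there because the hang in this ticket came from an lnetctl left behind by an earlier install, so knowing which package owns the binary on the PATH is the quickest cross-check.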

Comment by sebg-crd-pm (Inactive) [ 14/Jul/17 ]

Thanks for your kind reminder. => Remember that to use lnetctl the libyaml rpm also has to be installed on all nodes.
