[LU-638] conf-sanity test_55: @@@@@@ FAIL: client start failed Created: 25/Aug/11  Updated: 15/Dec/11  Resolved: 15/Dec/11

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: James A Simmons Assignee: Minh Diep
Resolution: Duplicate Votes: 0
Labels: None

Severity: 3
Rank (Obsolete): 4241

 Description   

On the client we get:

Writing CONFIGS/mountdata
start mds service on barry-mds1
Starting mds1: -o user_xattr,acl /dev/md5 /tmp/mds1
barry-mds1: mount.lustre: mount /dev/md5 at /tmp/mds1 failed: Invalid argument
barry-mds1: This may have multiple causes.
barry-mds1: Are the mount options correct?
barry-mds1: Check the syslog for more info.
mount -t lustre /dev/md5 /tmp/mds1
Start of /dev/md5 on mds1 failed 22
start ost1 service on barry-oss1
Starting ost1: /dev/mpath/barry1a-l0 /tmp/ost1

Client dmesg

Lustre: DEBUG MARKER: == conf-sanity test 56: check big indexes ============================================================ 09:59:58 (1314280798)
Lustre: 30459:0:(sec.c:1474:sptlrpc_import_sec_adapt()) import MGC10.37.248.61@o2ib1->MGC10.37.248.61@o2ib1_0 netid 50001: select flavor null
LustreError: 152-6: Ignoring deprecated mount option 'acl'.
Lustre: MGC10.37.248.61@o2ib1: Reactivating import
Lustre: 30459:0:(sec.c:1474:sptlrpc_import_sec_adapt()) import lustre-MDT0000-mdc-ffff81017a592c00->10.37.248.61@o2ib1 netid 50001: select flavor null
Lustre: 25819:0:(client.c:1778:ptlrpc_expire_one_request()) @@@ Request x1378123303092234 sent from lustre-MDT0000-mdc-ffff81017a592c00 to NID 10.37.248.61@o2ib1 has timed out for slow reply: [sent 1314280862] [real_sent 1314280862] [current 1314280867] [deadline 5s] [delay 0s] req@ffff81017ed84c00 x1378123303092234/t0(0) o-1->lustre-MDT0000_UUID@10.37.248.61@o2ib1:12/10 lens 368/512 e 0 to 1 dl 1314280867 ref 1 fl Rpc:XN/ffffffff/ffffffff rc 0/-1
Lustre: 25820:0:(import.c:526:import_select_connection()) lustre-MDT0000-mdc-ffff81017a592c00: tried all connections, increasing latency to 5s
Lustre: 25819:0:(client.c:1778:ptlrpc_expire_one_request()) @@@ Request x1378123303092239 sent from lustre-MDT0000-mdc-ffff81017a592c00 to NID 10.37.248.61@o2ib1 has timed out for slow reply: [sent 1314280872] [real_sent 1314280872] [current 1314280882] [deadline 10s] [delay 0s] req@ffff8101747ca400 x1378123303092239/t0(0) o-1->lustre-MDT0000_UUID@10.37.248.61@o2ib1:12/10 lens 368/512 e 0 to 1 dl 1314280882 ref 1 fl Rpc:XN/ffffffff/ffffffff rc 0/-1

MDS dmesg

Lustre: DEBUG MARKER: == conf-sanity test 56: check big indexes ============================================================ 09:59:58 (1314280798)
LDISKFS-fs (md5): warning: maximal mount count reached, running e2fsck is recommended
LDISKFS-fs (md5): mounted filesystem with ordered data mode
JBD: barrier-based sync failed on md5-8 - disabling barriers
LDISKFS-fs (md5): mounted filesystem with ordered data mode
JBD: barrier-based sync failed on md5-8 - disabling barriers
LDISKFS-fs (md5): mounted filesystem with ordered data mode
Lustre: MGS: Regenerating lustre-MDTffff log by user request.
Lustre: Skipped 30 previous similar messages
Lustre: Setting parameter lustre-MDT0001-mdtlov.lov.stripesize in log lustre-MDT0001
Lustre: Skipped 4 previous similar messages
JBD: barrier-based sync failed on md5-8 - disabling barriers
Lustre: Enabling ACL
Lustre: Enabling user_xattr
LustreError: 22858:0:(mdt_handler.c:4504:mdt_init0()) CMD Operation not allowed in IOP mode
LustreError: 22858:0:(obd_config.c:522:class_setup()) setup lustre-MDT0001 failed (-22)
LustreError: 22858:0:(obd_config.c:1361:class_config_llog_handler()) Err -22 on cfg command:
Lustre: cmd=cf003 0:lustre-MDT0001 1:lustre-MDT0001_UUID 2:1 3:lustre-MDT0001-mdtlov 4:f
LustreError: 15b-f: MGC10.37.248.61@o2ib1: The configuration from log 'lustre-MDT0001' failed from the MGS (-22). Make sure this client and the MGS are running compatible versions of Lustre.
LustreError: 15c-8: MGC10.37.248.61@o2ib1: The configuration from log 'lustre-MDT0001' failed (-22). This may be the result of communication errors between this node and the MGS, a bad configuration, or other errors. See the syslog for more information.
LustreError: 22820:0:(obd_mount.c:1192:server_start_targets()) failed to start server lustre-MDT0001: -22
LustreError: 22820:0:(obd_mount.c:1719:server_fill_super()) Unable to start targets: -22
LustreError: 22820:0:(obd_config.c:567:class_cleanup()) Device 3 not setup
Lustre: 22820:0:(obd_mount.c:1540:server_put_super()) Cleaning orphaned obd lustre-MDT0001-mdtlov
Lustre: server umount lustre-MDT0001 complete
Lustre: Skipped 2 previous similar messages
LustreError: 22820:0:(obd_mount.c:2160:lustre_fill_super()) Unable to mount (-22)
Lustre: 21484:0:(ldlm_lib.c:877:target_handle_connect()) MGS: connection from 40a74cfa-a6bf-33ca-ed4c-2f183d1e5bde@10.37.248.62@o2ib1 t0 exp 0000000000000000 cur 1314280859 last 0
Lustre: 21484:0:(ldlm_lib.c:877:target_handle_connect()) Skipped 78 previous similar messages



 Comments   
Comment by James A Simmons [ 25/Aug/11 ]

Sorry, I meant to label this as a conf-sanity test_56 failure.

Comment by Andreas Dilger [ 27/Aug/11 ]

This looks like you are trying to run with 2 MDTs in CMD mode? There shouldn't be an MDT0001 otherwise.

Comment by James A Simmons [ 29/Aug/11 ]

Doesn't that require the mkfs.lustre parameter iam_dir? This is what I'm formatting the MDT with:

--mgsnode=10.37.248.61@o2ib1 --mdt --fsname=lustre --param sys.timeout=20 --device-size=200000 --mountfsoptions=errors=remount-ro,user_xattr,acl --param lov.stripesize=1048576 --param lov.stripecount=0 --param mdt.identity_upcall=/usr/sbin/l_getidentity --mkfsoptions=\"-E lazy_itable_init\"
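For reference, those options would correspond to a mkfs.lustre command line roughly like the sketch below; the device path is taken from the logs above, and this is only an illustrative reconstruction, not the exact command the test framework issued:

# illustrative reconstruction of the MDT format command from the options quoted above
mkfs.lustre --mdt --fsname=lustre \
  --mgsnode=10.37.248.61@o2ib1 \
  --param sys.timeout=20 \
  --param lov.stripesize=1048576 \
  --param lov.stripecount=0 \
  --param mdt.identity_upcall=/usr/sbin/l_getidentity \
  --mountfsoptions=errors=remount-ro,user_xattr,acl \
  --mkfsoptions="-E lazy_itable_init" \
  --device-size=200000 \
  /dev/md5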

Comment by James A Simmons [ 29/Aug/11 ]

After some tracking I discovered the problem was the mount option acl. Once I removed it from both the client mount string and the MDS mount string, the test passed. I also tried conf-sanity test 55 and got the same result. I'm looking to see what other tests the mount option acl breaks.
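For reference, the change amounts to dropping acl from the -o option string, roughly as in the sketch below; the device and mount point are the ones from the MDS log above, and the exact mount strings used by the test framework may differ:

# server mount as attempted in the log above (triggers the acl complaint)
mount -t lustre -o user_xattr,acl /dev/md5 /tmp/mds1
# same mount with the acl option removed
mount -t lustre -o user_xattr /dev/md5 /tmp/mds1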

Comment by Peter Jones [ 13/Dec/11 ]

Minh

What would your expectations be re using the mount option acl?

Peter

Comment by Minh Diep [ 13/Dec/11 ]

Hi James,

Could you try the same (with and without acl) with 1 MDT?

Comment by James A Simmons [ 14/Dec/11 ]

Okay, I ran a bunch of tests with different options. First, the acl option doesn't cause the failure any more; it fails either way. The only MDT is being formatted with:

Format mds1: /dev/md5 with --mdt --fsname=lustre --device-size=200000 --param sys.timeout=20 --mountfsoptions=errors=remount-ro,user_xattr,acl --param lov.st....

Now the error I get is...

Lustre: DEBUG MARKER: == conf-sanity test 55: check lov_objid size ========================================================= 09:06:09 (1323871569)
Lustre: import MGC10.37.248.56@o2ib1->MGC10.37.248.56@o2ib1_0 netid 50001: select flavor null
LustreError: 152-6: Ignoring deprecated mount option 'acl'.
Lustre: MGC10.37.248.56@o2ib1: Reactivating import
Lustre: import lustre-MDT0000-mdc-ffff8101680e6400->10.37.248.61@o2ib1 netid 50001: select flavor null
LustreError: 11-0: an error occurred while communicating with 10.37.248.61@o2ib1. The mds_connect operation failed with -11
Lustre: import lustre-OST03ff-osc-ffff8101680e6400->10.37.248.62@o2ib1 netid 50001: select flavor null
Lustre: Client lustre-client has started
LustreError: 19110:0:(ldlm_request.c:1173:ldlm_cli_cancel_req()) Got rc -108 from cancel RPC: canceling anyway
LustreError: 19110:0:(ldlm_request.c:1800:ldlm_cli_cancel_list()) ldlm_cli_cancel_list: -108
LustreError: 19110:0:(ldlm_request.c:1173:ldlm_cli_cancel_req()) Got rc -108 from cancel RPC: canceling anyway
LustreError: 19110:0:(ldlm_request.c:1800:ldlm_cli_cancel_list()) ldlm_cli_cancel_list: -108
Lustre: client ffff8101680e6400 umount complete
Lustre: import MGC10.37.248.56@o2ib1->MGC10.37.248.56@o2ib1_0 netid 50001: select flavor null
LustreError: 152-6: Ignoring deprecated mount option 'acl'.
Lustre: 26837:0:(client.c:1789:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1323871776/real 1323871776] req@ffff81015de111
Lustre: 26846:0:(import.c:525:import_select_connection()) MGC10.37.248.56@o2ib1: tried all connections, increasing latency to 5s
LustreError: 26894:0:(client.c:1065:ptlrpc_import_delay_req()) @@@ send limit expired req@ffff81015de11800 x1388179954335770/t0(0) o101->MGC10.37.248.56@o2ib11
LustreError: 26904:0:(client.c:1065:ptlrpc_import_delay_req()) @@@ send limit expired req@ffff810192ff9800 x1388179954335773/t0(0) o101->MGC10.37.248.56@o2ib11
Lustre: 26837:0:(client.c:1789:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1323871781/real 1323871781] req@ffff81015de111
Lustre: 26846:0:(import.c:525:import_select_connection()) MGC10.37.248.56@o2ib1: tried all connections, increasing latency to 10s
Lustre: 26837:0:(client.c:1789:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1323871796/real 1323871796] req@ffff81015de111
Lustre: 26846:0:(import.c:525:import_select_connection()) MGC10.37.248.56@o2ib1: tried all connections, increasing latency to 15s
LustreError: 26894:0:(client.c:1065:ptlrpc_import_delay_req()) @@@ send limit expired req@ffff81015de11800 x1388179954335772/t0(0) o101->MGC10.37.248.56@o2ib11
LustreError: 15c-8: MGC10.37.248.56@o2ib1: The configuration from log 'lustre-client' failed (-5). This may be the result of communication errors between this n.
LustreError: 26894:0:(llite_lib.c:951:ll_fill_super()) Unable to process log: -5
Lustre: 26837:0:(client.c:1789:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1323871816/real 1323871816] req@ffff81015de111
Lustre: client ffff810163c07400 umount complete
LustreError: 26894:0:(obd_mount.c:2306:lustre_fill_super()) Unable to mount (-5)
Lustre: DEBUG MARKER: conf-sanity test_55: @@@@@@ FAIL: client start failed

Comment by James A Simmons [ 14/Dec/11 ]

Yipes. The MGS is stopped but never restarted...

Comment by James A Simmons [ 15/Dec/11 ]

Tracked down the problem. It's due to having a separate MGS and MDS. This problem was reported in LU-424. I have a patch that fixes the test. Peter, you can close this out as a duplicate of LU-424.
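For anyone else hitting this with a separate MGS: since the test stops the MGS and never brings it back, the manual workaround is simply to remount the MGS target on its node before restarting the MDT and clients, roughly as sketched below (the device and mount point are placeholders, not the actual ones from this setup):

# remount the standalone MGS target on the MGS node; device and mount point are placeholders
mount -t lustre /dev/<mgsdev> /mnt/mgs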

Comment by Peter Jones [ 15/Dec/11 ]

Duplicate of LU-424
