[LU-1246] SANITY_QUOTA test_32 failed in cleanup_and_setup_lustre with LOAD_MODULES_REMOTE=true Created: 21/Mar/12  Updated: 30/May/12  Resolved: 30/May/12

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 1.8.6
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: Jay Lan (Inactive) Assignee: Niu Yawei (Inactive)
Resolution: Cannot Reproduce Votes: 0
Labels: None
Environment:

One MGS/MDS, two OSSes, two clients, lustre-1.8.6.81.
Servers are running RHEL 5.7; clients are running SLES 11 SP1.
OFED 1.5.4.1.


Severity: 3
Rank (Obsolete): 6108

 Description   

SANITY_QUOTA test_32 always failed. The test was started from service331 (which is actually a Lustre client):

...
Formatting mgs, mds, osts
...
Setup mgs, mdt, osts
start mds /dev/sdb1 -o errors=panic,acl
Starting mds: -o errors=panic,acl /dev/sdb1 /mnt/mds
service360: Reading test skip list from /usr/lib64/lustre/tests/cfg/tests-to-skip.sh
service360: #!/bin/bash
service360: #SANITY_BIGFILE_EXCEPT="64b"
service360: #export SANITY_EXCEPT="$SANITY_BIGFILE_EXCEPT"
service360: MDSSIZE=2000000, OSTSIZE=2000000.
service360: ncli_nas.sh: Before init_clients_lists
service360: ncli_nas.sh: Done init_clients_lists
service360: lnet.debug=0x33f1504
service360: lnet.subsystem_debug=0xffb7e3ff
service360: lnet.debug_mb=16
Started lustre-MDT0000
start ost1 /dev/sdb1 -o errors=panic,mballoc,extents
Starting ost1: -o errors=panic,mballoc,extents /dev/sdb1 /mnt/ost1
service361: mount.lustre: mount /dev/sdb1 at /mnt/ost1 failed: Cannot send after transport endpoint shutdown
mount -t lustre /dev/sdb1 /mnt/ost1
Start of /dev/sdb1 on ost1 failed 108
...

This appears to be the only test that sets LOAD_MODULES_REMOTE=true before calling cleanup_and_setup_lustre, and it is the only one that failed. Sometimes only OST1 hit the error-108 problem; sometimes both OST1 and OST2 were affected. I added a "sleep 3" in setupall() after the MDS started but before the OSTs were started, but it did not help.
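
For reference, the change tried looked roughly like this (a paraphrased sketch of the setupall() start sequence in test-framework.sh; the exact surrounding lines differ in the real script):

    # in setupall() of test-framework.sh (paraphrased)
    start mds $MDSDEV $MDS_MOUNT_OPTS
    sleep 3                                   # added: give the MGS/MDS time to settle before the OSTs register
    for num in `seq $OSTCOUNT`; do
        start ost$num `ostdevname $num` $OST_MOUNT_OPTS
    done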

The 'dmesg' from the MDS (service360) showed:
Lustre: DEBUG MARKER: == test 32: check lqs hash(bug 21846) ========================================== == 11:05:01
Lustre: MDT lustre-MDT0000 has stopped.
LustreError: 28890:0:(ldlm_request.c:1039:ldlm_cli_cancel_req()) Got rc -108 from cancel RPC: canceling anyway
LustreError: 28890:0:(ldlm_request.c:1597:ldlm_cli_cancel_list()) ldlm_cli_cancel_list: -108
Lustre: MGS has stopped.
Lustre: server umount lustre-MDT0000 complete
Lustre: Removed LNI 10.151.26.38@o2ib
Lustre: OBD class driver, http://wiki.whamcloud.com/
Lustre: Lustre Version: 1.8.6.81
Lustre: Build Version: lustre/scripts-1.8.6
Lustre: Listener bound to ib1:10.151.26.38:987:mlx4_0
Lustre: Register global MR array, MR size: 0xffffffffffffffff, array size: 1
Lustre: Added LNI 10.151.26.38@o2ib [8/64/0/180]
Lustre: Filtering OBD driver; http://wiki.whamcloud.com/
LDISKFS-fs (sdb1): mounted filesystem with ordered data mode
LDISKFS-fs (sdb1): mounted filesystem with ordered data mode
LDISKFS-fs (sdb1): mounted filesystem with ordered data mode
Lustre: MGS MGS started
Lustre: MGC10.151.26.38@o2ib: Reactivating import
Lustre: MGS: Logs for fs lustre were removed by user request. All servers must be restarted in order to regenerate the logs.
Lustre: Setting parameter lustre-mdtlov.lov.stripesize in log lustre-MDT0000
Lustre: Enabling user_xattr
Lustre: Enabling ACL
Lustre: lustre-MDT0000: new disk, initializing
Lustre: lustre-MDT0000: Now serving lustre-MDT0000 on /dev/sdb1 with recovery enabled
Lustre: 30206:0:(lproc_mds.c:271:lprocfs_wr_group_upcall()) lustre-MDT0000: group upcall set to /usr/sbin/l_getgroups
Lustre: MGS: Regenerating lustre-OSTffff log by user request.
Lustre: lustre-MDT0000: temporarily refusing client connection from 10.151.25.182@o2ib
LustreError: 30130:0:(ldlm_lib.c:1919:target_send_reply_msg()) @@@ processing error (11) req@ffff8103f64cc000 x1397073513545734/t0 o38><?>@<?>:0/0 lens 368/0 e 0 to 0 dl 1332353120 ref 1 fl Interpret:/0/0 rc -11/0
Lustre: 30308:0:(mds_lov.c:1155:mds_notify()) MDS lustre-MDT0000: add target lustre-OST0000_UUID
Lustre: 29699:0:(quota_master.c:1718:mds_quota_recovery()) Only 0/1 OSTs are active, abort quota recovery
Lustre: MDS lustre-MDT0000: lustre-OST0000_UUID now active, resetting orphans
Lustre: DEBUG MARKER: Using TIMEOUT=20
Lustre: DEBUG MARKER: sanity-quota test_32: @@@@@@ FAIL: Rehearsh didn't happen
Lustre: DEBUG MARKER: == test 99: Quota off =============================== == 11:08:33

The 'dmesg' from OST1 (service361) showed:
Lustre: DEBUG MARKER: == test 32: check lqs hash(bug 21846) ========================================== == 11:05:01
Lustre: OST lustre-OST0000 has stopped.
LustreError: 5972:0:(ldlm_request.c:1039:ldlm_cli_cancel_req()) Got rc -108 from cancel RPC: canceling anyway
LustreError: 5972:0:(ldlm_request.c:1597:ldlm_cli_cancel_list()) ldlm_cli_cancel_list: -108
Lustre: 5972:0:(client.c:1487:ptlrpc_expire_one_request()) @@@ Request x1395992728675719 sent from MGC10.151.26.38@o2ib to NID 10.151.26.38@o2ib 6s ago has timed out (6s prior to deadline).
req@ffff810407edd800 x1395992728675719/t0 o251->MGS@MGC10.151.26.38@o2ib_0:26/25 lens 192/384 e 0 to 1 dl 1332352932 ref 1 fl Rpc:N/0/0 rc 0/0
Lustre: server umount lustre-OST0000 complete
LDISKFS-fs (sdb1): mounted filesystem with ordered data mode
LDISKFS-fs (sdb1): mounted filesystem with ordered data mode
LDISKFS-fs (sdb1): mounted filesystem with ordered data mode
Lustre: 20460:0:(client.c:1487:ptlrpc_expire_one_request()) @@@ Request x1395992728675720 sent from MGC10.151.26.38@o2ib to NID 10.151.26.38@o2ib 0s ago has failed due to network error (5s prior to deadline).
req@ffff8103ca5edc00 x1395992728675720/t0 o250->MGS@MGC10.151.26.38@o2ib_0:26/25 lens 368/584 e 0 to 1 dl 1332353093 ref 1 fl Rpc:N/0/0 rc 0/0
LustreError: 7058:0:(client.c:858:ptlrpc_import_delay_req()) @@@ IMP_INVALID req@ffff8103b2ca7800 x1395992728675721/t0 o253->MGS@MGC10.151.26.38@o2ib_0:26/25 lens 4736/4928 e 0 to 1 dl 0 ref 1 fl Rpc:/0/0 rc 0/0
LustreError: 7058:0:(obd_mount.c:1112:server_start_targets()) Required registration failed for lustre-OSTffff: -108
LustreError: 7058:0:(obd_mount.c:1670:server_fill_super()) Unable to start targets: -108
LustreError: 7058:0:(obd_mount.c:1453:server_put_super()) no obd lustre-OSTffff
LustreError: 7058:0:(obd_mount.c:147:server_deregister_mount()) lustre-OSTffff not registered
Lustre: server umount lustre-OSTffff complete
LustreError: 7058:0:(obd_mount.c:2065:lustre_fill_super()) Unable to mount (-108)
Lustre: DEBUG MARKER: Using TIMEOUT=20
Lustre: DEBUG MARKER: sanity-quota test_32: @@@@@@ FAIL: Rehearsh didn't happen
Lustre: DEBUG MARKER: == test 99: Quota off =============================== == 11:08:33



 Comments   
Comment by Jay Lan (Inactive) [ 21/Mar/12 ]

Each time after a failure, an 'lctl ping' between the MDS and the OSTs (in both directions) worked. Manually running the mount command on the OST also worked. The mount only failed during the test.
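
The manual checks referred to above were along these lines (illustrative commands; the NID and device are taken from the logs in the Description):

    lctl ping 10.151.26.38@o2ib            # from the OSS to the MDS/MGS NID (and the reverse from the MDS) - succeeded
    mount -t lustre /dev/sdb1 /mnt/ost1    # manual OST mount on service361 - succeeded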

Comment by Peter Jones [ 21/Mar/12 ]

Niu

Could you please comment?

Thanks

Peter

Comment by Niu Yawei (Inactive) [ 22/Mar/12 ]

Hi, Jay

Could you try commenting out the "LOAD_MODULES_REMOTE=true" in sanity-quota test_32() to see if the problem goes away?
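
Something like the following in sanity-quota.sh (sketch only; the other lines of test_32() are omitted):

    test_32() {
        ...
        #LOAD_MODULES_REMOTE=true       # commented out for this experiment
        cleanup_and_setup_lustre
        ...
    }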

In the load_modules() of test-framework.sh, there is a comment:

    # bug 19124
    # load modules on remote nodes optionally
    # lustre-tests have to be installed on these nodes

Could you make sure that the lustre-tests are installed correctly on the remote nodes (MDS & OSS)? Thanks.
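
A quick way to check this on each server node might be (hypothetical check; the package name and test path are assumptions based on the paths shown in the Description):

    rpm -q lustre-tests                                   # should report the installed lustre-tests package
    ls /usr/lib64/lustre/tests/test-framework.sh          # test scripts present where the suite expects them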

Comment by Jay Lan (Inactive) [ 22/Mar/12 ]

Hi Niu,

1. I turned off LOAD_MODULES_REMOTE=true a few days ago, and the problem went away. After I turned it back on, the problem came back. That is why I put that condition in the Summary line.
2. The lustre-tests are installed correctly on the MDS and OSS. SANITY_QUOTA test_32 was the only test that failed this time; all other tests passed.

Comment by Jay Lan (Inactive) [ 22/Mar/12 ]

BTW, by "all other tests" I meant the test suites that Maloo runs when a new patch
is proposed. SANITY_QUOTA is only one of the test suites.

Comment by Niu Yawei (Inactive) [ 23/Mar/12 ]

Thanks, Jay.

I don't know why the OST can't communicate with the MGS in your case. Is it possible to get a full debug log on the MDS & OSS? (You can set PTLDEBUG to -1 on the MDS & OSS nodes; I think the test will dump the debug log automatically when it fails, or you can dump the debug log to a file with 'lctl dk'.)
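
For example (illustrative commands; the output file name is arbitrary):

    # on the MDS and each OSS, before re-running the test
    lctl set_param debug=-1                      # or set PTLDEBUG=-1 in the test environment
    # ... re-run sanity-quota test_32 ...
    lctl dk > /tmp/lustre-debug.$(hostname).log  # dump the kernel debug buffer to a file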

Comment by Jay Lan (Inactive) [ 26/Mar/12 ]

Hi Niu,

I will have to do that later. The MDS and OSS nodes have been re-imaged to RHEL 6.2 with the lustre-2.1.1 server code. When I am done with 2.1.1, I will re-image them back to 1.8.6 and provide the information you need.

BTW, does the following message (cited from the dmesg of OSS1 in the "Description" of this bug report) imply that the timeout first occurred on the MDS node?

Lustre: 5972:0:(client.c:1487:ptlrpc_expire_one_request()) @@@ Request x1395992728675719 sent from MGC10.151.26.38@o2ib to NID 10.151.26.38@o2ib 6s ago has timed out (6s prior to deadline).

Comment by Niu Yawei (Inactive) [ 26/Mar/12 ]

Thanks, Jay.

BTW, does the following message (cited from the dmesg of OSS1 in the "Description" of this bug report) imply that the timeout first occurred on the MDS node?

I think it indicates that the OST-to-MGS request timed out.

Comment by Jay Lan (Inactive) [ 30/May/12 ]

Hi Niu,

We upgraded our servers to 2.1.1 last week, and I have not seen this problem in testing with the 2.1 servers.

Thus, this problem is not important to us any more. You may close it.

Comment by Peter Jones [ 30/May/12 ]

ok thanks Jay
