[LU-2023] Test failure on test suite parallel-scale-nfsv3 Created: 24/Sep/12  Updated: 02/Dec/16  Resolved: 02/Dec/16

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.3.0
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: Maloo Assignee: Minh Diep
Resolution: Cannot Reproduce Votes: 0
Labels: None
Environment:

Server/Client lustre-b2_3-RC1 RHEL6


Severity: 3
Rank (Obsolete): 4142

 Description   

This issue was created by maloo for sarah <sarah@whamcloud.com>

This issue relates to the following test suite run: https://maloo.whamcloud.com/test_sets/b2d86c02-0658-11e2-9b17-52540035b04c.

From the test report, all sub-tests passed, yet the end of the suite log shows:

client-26vm6: mount.lustre: mount client-26vm3@tcp:/lustre at /mnt/lustre failed: File exists
 parallel-scale-nfsv3 : @@@@@@ FAIL: failed to mount lustre after nfs test 
  Trace dump:
  = /usr/lib64/lustre/tests/test-framework.sh:3645:error_noexit()
  = /usr/lib64/lustre/tests/parallel-scale-nfs.sh:106:main()
== parallel-scale-nfsv3 parallel-scale-nfs.sh test complete, duration 4952 sec == 00:49:32 (1348472972)
/usr/lib64/lustre/tests/parallel-scale-nfs.sh: FAIL:  failed to mount lustre after nfs test
NFSCLIENT mode: setup, cleanup, check config skipped
CMD: client-26vm5,client-26vm6.lab.whamcloud.com echo \$(hostname); grep ' '/mnt/lustre' ' /proc/mounts
client-26vm5.lab.whamcloud.com
10.10.4.150@tcp:/lustre /mnt/lustre lustre rw,flock,user_xattr 0 0
client-26vm6.lab.whamcloud.com
 parallel-scale-nfsv3 : @@@@@@ FAIL: NFSCLIENT=true mode, but no NFS export found! 
  Trace dump:
  = /usr/lib64/lustre/tests/test-framework.sh:3645:error_noexit()
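For reference, the mount check the suite runs (the CMD line above) can be repeated by hand on both client nodes; this is only an illustrative sketch, with the hostnames and mount point taken from the log above:

=====================================
# Re-run the suite's mount check on each client node.
# A healthy client shows a "lustre" entry for /mnt/lustre in /proc/mounts;
# client-26vm6 shows nothing, matching the failed re-mount in the log.
for node in client-26vm5 client-26vm6; do
    ssh "$node" "hostname; grep ' /mnt/lustre ' /proc/mounts || echo 'no lustre mount'"
done
=====================================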


 Comments   
Comment by Peter Jones [ 25/Sep/12 ]

Minh

Could you please advise on this one?

thanks

Peter

Comment by Minh Diep [ 25/Sep/12 ]

Found this on the MDS/NFS server (client-26vm3) console:

00:39:05:Lustre: DEBUG MARKER: == parallel-scale-nfsv3 test iorfpp: iorfpp == 00:38:56 (1348472336)
00:39:05:Lustre: DEBUG MARKER: lfs setstripe /mnt/lustre/d0.ior.fpp -c -1
00:42:57:rpc-srv/tcp: nfsd: got error -32 when sending 140 bytes - shutting down socket <<<<<<<<<<<<<<<<<<<<<<<<<<<
00:49:29:Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0 2>/dev/null || true
00:49:29:Lustre: DEBUG MARKER: rc=$([ -f /proc/sys/lnet/catastrophe ] && echo $(< /proc/sys/lnet/catastrophe) || echo 0);
00:49:29:if [ $rc -ne 0 ]; then echo $(hostname): $rc; fi
00:49:29:exit $rc;
00:49:30:Lustre: DEBUG MARKER: service nfs stop

I think this is an NFS issue under stress testing. As a result, the NFS shutdown did not complete, so the subsequent Lustre mount failed with -17 (file exists).

I would like to get a second opinion on this.
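As a hedged sketch of that theory (the retry loop and timeout below are hypothetical, not part of the test framework), the re-mount could be delayed until the old client mount has actually gone away instead of mounting immediately after "service nfs stop":

=====================================
# Stop the NFS re-export and unmount the Lustre client.
service nfs stop
umount /mnt/lustre

# Poll /proc/mounts until the old client mount is really gone
# (hypothetical retry count and sleep interval).
for i in $(seq 1 30); do
    grep -q ' /mnt/lustre ' /proc/mounts || break
    sleep 2
done

# "lctl dl" lists the remaining obd devices; a stale *-mdc-* entry here
# would explain the later "Device ... already exists" / -17 failure.
lctl dl

# Re-mount only once the old state is cleaned up.
mount -t lustre client-26vm3@tcp:/lustre /mnt/lustre
=====================================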

Comment by Minh Diep [ 25/Sep/12 ]

FanYong,

could you comment?

Comment by nasf (Inactive) [ 25/Sep/12 ]

I do not think so. I found some logs on the Lustre client (client2, client-26vm6):

=====================================
00:49:41:LustreError: 11188:0:(genops.c:309:class_newdev()) Device lustre-MDT0000-mdc-ffff88005073c800 already exists at 13, won't add
00:49:41:LustreError: 11188:0:(obd_config.c:365:class_attach()) Cannot create device lustre-MDT0000-mdc-ffff88005073c800 of type mdc : -17
00:49:41:LustreError: 11188:0:(obd_config.c:1499:class_config_llog_handler()) Err -17 on cfg command:
00:49:41:Lustre: cmd=cf001 0:lustre-MDT0000-mdc 1:mdc 2:lustre-clilmv_UUID
00:49:41:LustreError: 15c-8: MGC10.10.4.150@tcp: The configuration from log 'lustre-client' failed (-17). This may be the result of communication errors between this node and the MGS, a bad configuration, or other errors. See the syslog for more information.
00:49:41:LustreError: 11185:0:(llite_lib.c:998:ll_fill_super()) Unable to process log: -17
00:49:41:Lustre: Unmounted lustre-client
00:49:41:LustreError: 11185:0:(ldlm_request.c:1166:ldlm_cli_cancel_req()) Got rc -108 from cancel RPC: canceling anyway
00:49:41:LustreError: 11185:0:(ldlm_request.c:1166:ldlm_cli_cancel_req()) Skipped 1 previous similar message
00:49:41:LustreError: 11185:0:(ldlm_request.c:1792:ldlm_cli_cancel_list()) ldlm_cli_cancel_list: -108
00:49:41:LustreError: 11185:0:(ldlm_request.c:1792:ldlm_cli_cancel_list()) Skipped 1 previous similar message
00:49:41:LustreError: 11185:0:(obd_mount.c:2569:lustre_fill_super()) Unable to mount (-17)
00:49:41:Lustre: DEBUG MARKER: /usr/sbin/lctl mark parallel-scale-nfsv3 : @@@@@@ FAIL: failed to mount lustre after nfs test
=====================================

I do not know the system configuration exactly, but I guess client2 is the NFS server, right? If so, the NFS re-export failed simply because the Lustre client failed to mount from client-26vm3 (MDS & MGS). As for why client2 failed to mount, it seems related to a previous environment that was not cleaned up:

===================================
struct obd_device *class_newdev(const char *type_name, const char *name)
{
...
        if (obd && obd->obd_name &&
            (strcmp(name, obd->obd_name) == 0)) {
                CERROR("Device %s already exists at %d, won't add\n",
                       name, i);
                if (result) {
                        LASSERTF(result->obd_magic == OBD_DEVICE_MAGIC,
                                 "%p obd_magic %08x != %08x\n",
                                 result, result->obd_magic, OBD_DEVICE_MAGIC);
                        LASSERTF(result->obd_minor == new_obd_minor,
                                 "%p obd_minor %d != %d\n",
                                 result, result->obd_minor, new_obd_minor);
                        obd_devs[result->obd_minor] = NULL;
                        result->obd_name[0] = '\0';
                }
                result = ERR_PTR(-EEXIST);
                break;
        }
...
}
===================================

We need the Lustre kernel log to find out why the environment was not cleaned up. Without such a log, we cannot tell much.
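As a sketch of how that log could be captured on client-26vm6 (standard lctl debug commands; the debug mask and output file name are just examples):

=====================================
# Enable all debug masks and clear the existing debug buffer
# (assumption: the extra overhead is acceptable on a test node).
lctl set_param debug=-1
lctl clear

# Reproduce the failing mount, then dump the kernel debug buffer to a file.
mount -t lustre client-26vm3@tcp:/lustre /mnt/lustre || true
lctl dk /tmp/lustre-debug.$(hostname).log
=====================================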

Comment by Minh Diep [ 25/Sep/12 ]

FanYong, you can find the Lustre debug logs on brent here:
/home/autotest/logdir/test_logs/2012-09-22/lustre-b2_3-el6-x86_64_24_-7f2b7f4b9360/parallel-scale-nfsv3..debug_log*
