Lustre / LU-3142

recovery-mds-scale test_failover_mds: dd: writing `/mnt/lustre/d0.dd-client-32vm5.lab.whamcloud.com/dd-file': Bad file descriptor

Details


    Description

      While running recovery-mds-scale test_failover_mds, the dd operation failed on one of the client nodes as follows:

      2013-04-08 22:25:26: dd run starting
      + mkdir -p /mnt/lustre/d0.dd-client-32vm5.lab.whamcloud.com
      + /usr/bin/lfs setstripe -c -1 /mnt/lustre/d0.dd-client-32vm5.lab.whamcloud.com
      + cd /mnt/lustre/d0.dd-client-32vm5.lab.whamcloud.com
      ++ /usr/bin/lfs df /mnt/lustre/d0.dd-client-32vm5.lab.whamcloud.com
      + FREE_SPACE=12963076
      + BLKS=2916692
      + echo 'Free disk space is 12963076, 4k blocks to dd is 2916692'
      + load_pid=8739
      + wait 8739
      + dd bs=4k count=2916692 status=noxfer if=/dev/zero of=/mnt/lustre/d0.dd-client-32vm5.lab.whamcloud.com/dd-file
      dd: writing `/mnt/lustre/d0.dd-client-32vm5.lab.whamcloud.com/dd-file': Bad file descriptor
      295176+0 records in
      295175+0 records out
      + '[' 1 -eq 0 ']'
      ++ date '+%F %H:%M:%S'
      + echoerr '2013-04-08 22:27:28: dd failed'
      + echo '2013-04-08 22:27:28: dd failed'
      2013-04-08 22:27:28: dd failed
      

      Maloo report: https://maloo.whamcloud.com/test_sets/68bce4aa-a1bb-11e2-bdac-52540035b04c
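
      For manual reproduction outside autotest, the load can be recreated from the trace above. A minimal sketch follows; the directory name is taken from this report, while the free-space arithmetic (90% of available space in 4k blocks) and the awk field used to parse the lfs df summary line are assumptions reconstructed from the numbers in the trace, not from the test script itself:

      # Hedged reconstruction of the dd load shown in the trace above
      TESTDIR=/mnt/lustre/d0.dd-client-32vm5.lab.whamcloud.com
      mkdir -p $TESTDIR
      /usr/bin/lfs setstripe -c -1 $TESTDIR            # stripe the file over all OSTs
      cd $TESTDIR
      # Available KB from the lfs df summary line (field position is an assumption)
      FREE_SPACE=$(/usr/bin/lfs df $TESTDIR | awk '/summary/ {print $4}')
      BLKS=$((FREE_SPACE * 9 / 10 / 4))                # ~90% of free space, in 4k blocks
      dd bs=4k count=$BLKS status=noxfer if=/dev/zero of=$TESTDIR/dd-file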

      Attachments

        Activity

          [LU-3142] recovery-mds-scale test_failover_mds: dd: writing `/mnt/lustre/d0.dd-client-32vm5.lab.whamcloud.com/dd-file': Bad file descriptor
          yujian Jian Yu added a comment -

          it could be related to http://review.whamcloud.com/5820

          Hi Hongchao, build http://build.whamcloud.com/job/lustre-master/1381/ does not contain the above patch.

          yujian Jian Yu added a comment -

          Is this actually -EBADF (which is a different error code)? Are there any messages about that in the console log? Are you sure that this was build 1381 (commit 49b06fba39e7fec26a0250ed37f04a620e349b5f) being tested? If it was a later build it might have been caused by commit http://review.whamcloud.com/5820.

          I did not find -EBADF(-9) or -EBADFD(-77) in the console logs. Due to TT-1107, the console logs were not gathered completely in the Maloo report. Please refer to the attached tarball. I'm sure this was build http://build.whamcloud.com/job/lustre-master/1381/.

          The debug patch in http://review.whamcloud.com/#change,6013 has been waiting for test resources for 3 days. I'll have to start a manual test run to reproduce this issue.


          adilger Andreas Dilger added a comment -

          Is this actually -EBADF (which is a different error code)? Are there any messages about that in the console log? Are you sure that this was build 1381 (commit 49b06fba39e7fec26a0250ed37f04a620e349b5f) being tested? If it was a later build, it might have been caused by commit http://review.whamcloud.com/5820.
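
          For context on the question above: EBADF (9) and EBADFD (77) carry different message strings on Linux, and dd's "Bad file descriptor" text corresponds to EBADF. A quick, hedged way to check the strings locally (assumes a Python interpreter on the node; illustration only, not taken from this report's logs):

          python -c 'import os; print(os.strerror(9))'      # Bad file descriptor           (EBADF)
          python -c 'import os; print(os.strerror(77))'     # File descriptor in bad state  (EBADFD)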

          hongchao.zhang Hongchao Zhang added a comment -

          The logs on the MDS don't contain any valid info about Lustre.

          The error "Bad file descriptor" (-EBADFD) is not a common error: there is only one place in Lustre that returns it (in ll_statahead_interpret), and in the Linux kernel it appears only in the following modules:

          drivers/: isdn, net, macintosh, ieee1394, atm, media, usb
          fs/: jffs2, ncpfs
          net/: iucv, atm, 9p, bluetooth
          sound/: core, drivers, usb

          So this error could come from driver modules, or be triggered from user space.

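          The module list in the comment above can be cross-checked with a plain source search. A hedged sketch (the checkout paths are assumptions, not from this report):

          # Where EBADFD appears; run from the top of the respective source trees
          grep -rl EBADFD lustre/                       # expect only the file containing ll_statahead_interpret
          grep -rl EBADFD drivers/ fs/ net/ sound/      # kernel subsystems, matching the list above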
          yujian Jian Yu added a comment -

          The recovery-*-scale tests on the master branch had been blocked by LU-2008. After that issue was fixed 2 days ago, autotest started running the hard failover tests. I submitted http://review.whamcloud.com/6013 to reproduce the issue.

          pjones Peter Jones added a comment -

          Hongchao

          Could you please look into this one?

          Thanks

          Peter


          adilger Andreas Dilger added a comment -

          Are we able to pass any MDS failovers, or do they fail 100% of the time? It appears that this test failed immediately on the first MDS failover, but we don't have any useful logs from the MDS, so it is difficult to know why the OSTs were evicted.

          People

            hongchao.zhang Hongchao Zhang
            yujian Jian Yu
            Votes: 0
            Watchers: 7
