Lustre / LU-3142

recovery-mds-scale test_failover_mds: dd: writing `/mnt/lustre/d0.dd-client-32vm5.lab.whamcloud.com/dd-file': Bad file descriptor

Details


    Description

      While running recovery-mds-scale test_failover_mds, the dd operation failed on one of the client nodes as follows:

      2013-04-08 22:25:26: dd run starting
      + mkdir -p /mnt/lustre/d0.dd-client-32vm5.lab.whamcloud.com
      + /usr/bin/lfs setstripe -c -1 /mnt/lustre/d0.dd-client-32vm5.lab.whamcloud.com
      + cd /mnt/lustre/d0.dd-client-32vm5.lab.whamcloud.com
      ++ /usr/bin/lfs df /mnt/lustre/d0.dd-client-32vm5.lab.whamcloud.com
      + FREE_SPACE=12963076
      + BLKS=2916692
      + echo 'Free disk space is 12963076, 4k blocks to dd is 2916692'
      + load_pid=8739
      + wait 8739
      + dd bs=4k count=2916692 status=noxfer if=/dev/zero of=/mnt/lustre/d0.dd-client-32vm5.lab.whamcloud.com/dd-file
      dd: writing `/mnt/lustre/d0.dd-client-32vm5.lab.whamcloud.com/dd-file': Bad file descriptor
      295176+0 records in
      295175+0 records out
      + '[' 1 -eq 0 ']'
      ++ date '+%F %H:%M:%S'
      + echoerr '2013-04-08 22:27:28: dd failed'
      + echo '2013-04-08 22:27:28: dd failed'
      2013-04-08 22:27:28: dd failed
      

      Maloo report: https://maloo.whamcloud.com/test_sets/68bce4aa-a1bb-11e2-bdac-52540035b04c

      Attachments

        Activity

          pjones Peter Jones added a comment -

          James, could you please open a new ticket to track this bug? It sounds like what you are describing is a regression introduced into this work (which was included in 2.4.0) by the work from LU-2193 (which landed after 2.4.0), so it would be easier to create a new ticket and link it to the two related tickets. It can get really confusing trying to work out the state of a given bug when its commits span release boundaries.


          simmonsja James A Simmons added a comment -

          Created a patch to fix this issue:

          http://review.whamcloud.com/#change,6566
          simmonsja James A Simmons added a comment (edited) -

          I found it. Patch http://review.whamcloud.com/#change,4501 removed the fid_build_from_res_name function. Creating patch and testing...


          simmonsja James A Simmons added a comment -

          I don't know how it got past your build system, but patch http://review.whamcloud.com/#change,6102 is missing the
          function fid_build_from_res_name:

          lustre-2.4.0/lustre/mdt/mdt_handler.c: In function ‘mdt_intent_layout’:
          lustre-2.4.0/lustre/mdt/mdt_handler.c:3754: error: implicit declaration of function ‘fid_build_from_res_name’

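          For reference, based on its name and its call site in mdt_intent_layout(), the missing helper rebuilds a file FID from the LDLM lock resource name it was packed into (likely the inverse of the fid_build_reg_res_name() packing helper). Below is a minimal standalone sketch of that round trip; the struct layouts, word offsets, and function names here are simplified assumptions for illustration, not the actual Lustre definitions.

          /* Sketch only: simplified stand-ins for the Lustre FID and LDLM
           * resource-name types, plus pack/unpack helpers showing what a
           * fid_build_from_res_name()-style function does. */
          #include <stdint.h>
          #include <stdio.h>
          #include <string.h>

          struct lu_fid {              /* simplified FID: sequence, object id, version */
                  uint64_t f_seq;
                  uint32_t f_oid;
                  uint32_t f_ver;
          };

          struct ldlm_res_id {         /* simplified resource name: four 64-bit words */
                  uint64_t name[4];
          };

          #define RES_ID_SEQ_OFF     0 /* assumed word holding the FID sequence */
          #define RES_ID_VER_OID_OFF 1 /* assumed word holding (version << 32) | oid */

          /* Pack a FID into a resource name (the forward direction). */
          static void fid_to_res_name(const struct lu_fid *fid, struct ldlm_res_id *res)
          {
                  memset(res, 0, sizeof(*res));
                  res->name[RES_ID_SEQ_OFF]     = fid->f_seq;
                  res->name[RES_ID_VER_OID_OFF] = ((uint64_t)fid->f_ver << 32) | fid->f_oid;
          }

          /* The inverse: rebuild the FID from the resource name, which is what
           * an intent handler needs when it only has the lock resource. */
          static void fid_from_res_name(struct lu_fid *fid, const struct ldlm_res_id *res)
          {
                  fid->f_seq = res->name[RES_ID_SEQ_OFF];
                  fid->f_oid = (uint32_t)res->name[RES_ID_VER_OID_OFF];
                  fid->f_ver = (uint32_t)(res->name[RES_ID_VER_OID_OFF] >> 32);
          }

          int main(void)
          {
                  struct lu_fid fid = { .f_seq = 0x200000401ULL, .f_oid = 7, .f_ver = 0 };
                  struct ldlm_res_id res;
                  struct lu_fid out;

                  fid_to_res_name(&fid, &res);
                  fid_from_res_name(&out, &res);
                  printf("seq=%#llx oid=%u ver=%u\n",
                         (unsigned long long)out.f_seq, out.f_oid, out.f_ver);
                  return 0;
          }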
          pjones Peter Jones made changes -
          Fix Version/s New: Lustre 2.4.0
          Resolution New: Fixed
          Status Original: Open New: Resolved
          pjones Peter Jones added a comment -

          Efficiency of the solution will be improved under LU-2818. The current solution is sufficient for 2.4.


          adilger Andreas Dilger added a comment -

          Not closing this bug until the redundant getxattr call has been cleaned up per inspection comments.
          adilger Andreas Dilger made changes -
          Labels Original: MB New: LB
          yujian Jian Yu added a comment -

          A patch for the master branch to gather logs from the passive server nodes in a failover configuration: http://review.whamcloud.com/6112


          hongchao.zhang Hongchao Zhang added a comment -

          This bug should be caused by the wrong "MD" size on the MDT:

          00000004:00000001:21.0:1366291257.871077:0:4881:0:(mdd_object.c:268:mdd_xattr_get()) Process entered
          00000004:00000001:21.0:1366291257.871080:0:4881:0:(lod_object.c:373:lod_xattr_get()) Process entered
          00000004:00000001:21.0:1366291257.871085:0:4881:0:(lod_object.c:377:lod_xattr_get()) Process leaving (rc=176 : 176 : b0)
          00000004:00000001:21.0:1366291257.871087:0:4881:0:(mdd_object.c:281:mdd_xattr_get()) Process leaving (rc=176 : 176 : b0)
          00000004:00020000:21.0:1366291257.871089:0:4881:0:(mdt_lvb.c:158:mdt_lvbo_fill()) lustre-MDT0000: expected 176 actual 128.
          00000004:00000001:21.0:1366291257.880256:0:4881:0:(mdt_lvb.c:159:mdt_lvbo_fill()) Process leaving via out (rc=18446744073709551582 : -34 : 0xffffffffffffffde)

          1. The default value of mdt_device->mdt_max_mdsize is 128 bytes (see the size check sketched below):
             #define MAX_MD_SIZE (sizeof(struct lov_mds_md) + 4 * sizeof(struct lov_ost_data))

          2. Before failover, the MD size grew to 176 bytes = sizeof(struct lov_mds_md) + 6 * sizeof(struct lov_ost_data),
             and mdt_device->mdt_max_mdsize was updated accordingly (see mdt_attr_get_lov, which updates mdt_max_mdsize while handling "getattr" requests).

          3. After failover, the new MDT doesn't know the actual mdt_max_mdsize and still uses the default value. When the client then calls ll_layout_refresh
             to fetch the MD, the MDT fails with -ERANGE because no "getattr" request has yet updated mdt_device->mdt_max_mdsize.
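          To make the 128-vs-176 arithmetic above concrete, here is a small standalone size check. The struct definitions are simplified assumptions mirroring the LOV EA wire layout (a 32-byte header plus 24 bytes per stripe object), not copied from the Lustre headers.

          /* Sketch only: assumed layouts reproducing the sizes seen in the
           * mdt_lvbo_fill() log above (expected 176, default 128). */
          #include <stdint.h>
          #include <stdio.h>

          struct lov_ost_data_v1 {     /* per-stripe object descriptor: 24 bytes */
                  uint64_t l_object_id;
                  uint64_t l_object_seq;
                  uint32_t l_ost_gen;
                  uint32_t l_ost_idx;
          };

          struct lov_mds_md_v1 {       /* layout (LOV EA) header: 32 bytes */
                  uint32_t lmm_magic;
                  uint32_t lmm_pattern;
                  uint64_t lmm_object_id;
                  uint64_t lmm_object_seq;
                  uint32_t lmm_stripe_size;
                  uint16_t lmm_stripe_count;
                  uint16_t lmm_layout_gen;
                  struct lov_ost_data_v1 lmm_objects[]; /* one entry per stripe */
          };

          /* Assumed equivalent of the MAX_MD_SIZE default quoted in item 1. */
          #define MAX_MD_SIZE_DEFAULT \
                  (sizeof(struct lov_mds_md_v1) + 4 * sizeof(struct lov_ost_data_v1))

          int main(void)
          {
                  size_t six_stripes = sizeof(struct lov_mds_md_v1) +
                                       6 * sizeof(struct lov_ost_data_v1);

                  printf("default mdt_max_mdsize = %zu\n", MAX_MD_SIZE_DEFAULT); /* 128 */
                  printf("6-stripe layout size   = %zu\n", six_stripes);         /* 176 */
                  return 0;
          }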

          the patch is tracked at http://review.whamcloud.com/#change,6102


          People

            hongchao.zhang Hongchao Zhang
            yujian Jian Yu
            Votes: 0
            Watchers: 7
