Details
-
Bug
-
Resolution: Fixed
-
Blocker
-
Lustre 2.4.0
-
Lustre Branch: master
Lustre Build: http://build.whamcloud.com/job/lustre-master/1381/
Distro/Arch: RHEL6.3/x86_64
Test Group: failover
FAILURE_MODE=HARD
-
3
-
7628
Description
While running recovery-mds-scale test_failover_mds, dd operation failed on one of the client nodes as follows:
2013-04-08 22:25:26: dd run starting + mkdir -p /mnt/lustre/d0.dd-client-32vm5.lab.whamcloud.com + /usr/bin/lfs setstripe -c -1 /mnt/lustre/d0.dd-client-32vm5.lab.whamcloud.com + cd /mnt/lustre/d0.dd-client-32vm5.lab.whamcloud.com ++ /usr/bin/lfs df /mnt/lustre/d0.dd-client-32vm5.lab.whamcloud.com + FREE_SPACE=12963076 + BLKS=2916692 + echo 'Free disk space is 12963076, 4k blocks to dd is 2916692' + load_pid=8739 + wait 8739 + dd bs=4k count=2916692 status=noxfer if=/dev/zero of=/mnt/lustre/d0.dd-client-32vm5.lab.whamcloud.com/dd-file dd: writing `/mnt/lustre/d0.dd-client-32vm5.lab.whamcloud.com/dd-file': Bad file descriptor 295176+0 records in 295175+0 records out + '[' 1 -eq 0 ']' ++ date '+%F %H:%M:%S' + echoerr '2013-04-08 22:27:28: dd failed' + echo '2013-04-08 22:27:28: dd failed' 2013-04-08 22:27:28: dd failed
Maloo report: https://maloo.whamcloud.com/test_sets/68bce4aa-a1bb-11e2-bdac-52540035b04c
Attachments
Activity
I found it. Patch http://review.whamcloud.com/#change,4501 removed the fid_build_from_res_name function. Creating patch and testing...
I don't know how it got passed you build system but patch http://review.whamcloud.com/#change,6102 is missing the
function fid_build_from_res_name.
lustre-2.4.0/lustre/mdt/mdt_handler.c: In function ‘mdt_intent_layout’:
lustre-2.4.0/lustre/mdt/mdt_handler.c:3754: error: implicit declaration of function ‘fid_build_from_res_name’
Efficiency of solution will be improved under LU-2818. Current solution sufficient for 2.4
Not closing this bug until the redundant getxattr call has been cleaned up per inspection comments.
A patch for master branch to gather the logs on passive server nodes in failure configuration: http://review.whamcloud.com/6112.
this bug should be caused by the wrong size of "MD" in MDT,
00000004:00000001:21.0:1366291257.871077:0:4881:0:(mdd_object.c:268:mdd_xattr_get()) Process entered
00000004:00000001:21.0:1366291257.871080:0:4881:0:(lod_object.c:373:lod_xattr_get()) Process entered
00000004:00000001:21.0:1366291257.871085:0:4881:0:(lod_object.c:377:lod_xattr_get()) Process leaving (rc=176 : 176 : b0)
00000004:00000001:21.0:1366291257.871087:0:4881:0:(mdd_object.c:281:mdd_xattr_get()) Process leaving (rc=176 : 176 : b0)
00000004:00020000:21.0:1366291257.871089:0:4881:0:(mdt_lvb.c:158:mdt_lvbo_fill()) lustre-MDT0000: expected 176 actual 128.
00000004:00000001:21.0:1366291257.880256:0:4881:0:(mdt_lvb.c:159:mdt_lvbo_fill()) Process leaving via out (rc=18446744073709551582 : -34 : 0xffffffffffffffde)
1, the default value of mdt_device->mdt_max_mdsize is 128bytes,
#define MAX_MD_SIZE (sizeof(struct lov_mds_md) + 4 * sizeof(struct lov_ost_data)
2, before failover, the MD size is changed to 176bytes = (sizeof(struct lov_mds_md) + 6 * sizeof(struct lov_ost_data)
mdt_device->mdt_max_mdsize is updated accordingly (see mdt_attr_get_lov, will update "mdt_max_mdsize" in "getattr" request)
3, after failover, the new MDT doesn't know the actual mdt_max_mdsize, and still use the default value, then client calls ll_layout_refresh
to get the MD and the MDT will failed with ERANGE for there is still no "getattr" request to update mdt_device>mdt_max_mdsize!
the patch is tracked at http://review.whamcloud.com/#change,6102
Lustre Branch: master
Lustre Build: http://build.whamcloud.com/job/lustre-master/1406/
Distro/Arch: RHEL6.3/x86_64
Test Group: failover
FAILURE_MODE=HARD
Please look into this report: https://maloo.whamcloud.com/test_sets/3d4c2120-a849-11e2-ba78-52540035b04c
in the logs of http://review.whamcloud.com/#change,6013
the stripe data of the file used by "dd" is missed after failover, then it set the layout type of the file to Empty,
which causes this error (-9, Bad file descriptor)
...
00020000:00000001:0.0:1366159710.913287:0:8913:0:(lov_object.c:359:lov_fini_raid0()) Process entered
00020000:00000010:0.0:1366159710.913288:0:8913:0:(lov_object.c:362:lov_fini_raid0()) kfreed 'r0->lo_sub': 56 at ffff88007d4e6940.
00020000:00000010:0.0:1366159710.913290:0:8913:0:(lov_ea.c:131:lsm_free_plain()) slab-freed 'lsm->lsm_oinfo[i]': 112 at ffff8800373d1840.
00020000:00000010:0.0:1366159710.913292:0:8913:0:(lov_ea.c:131:lsm_free_plain()) slab-freed 'lsm->lsm_oinfo[i]': 112 at ffff8800373d17c0.
00020000:00000010:0.0:1366159710.913293:0:8913:0:(lov_ea.c:131:lsm_free_plain()) slab-freed 'lsm->lsm_oinfo[i]': 112 at ffff8800373d1740.
00020000:00000010:0.0:1366159710.913293:0:8913:0:(lov_ea.c:131:lsm_free_plain()) slab-freed 'lsm->lsm_oinfo[i]': 112 at ffff8800373d16c0.
00020000:00000010:0.0:1366159710.913294:0:8913:0:(lov_ea.c:131:lsm_free_plain()) slab-freed 'lsm->lsm_oinfo[i]': 112 at ffff8800373d1640.
00020000:00000010:0.0:1366159710.913294:0:8913:0:(lov_ea.c:131:lsm_free_plain()) slab-freed 'lsm->lsm_oinfo[i]': 112 at ffff8800373d15c0.
00020000:00000010:0.0:1366159710.913295:0:8913:0:(lov_ea.c:131:lsm_free_plain()) slab-freed 'lsm->lsm_oinfo[i]': 112 at ffff8800373d1540.
00020000:00000010:0.0:1366159710.913296:0:8913:0:(lov_ea.c:133:lsm_free_plain()) kfreed 'lsm': 128 at ffff88007b7cbc40.
00020000:00000001:0.0:1366159710.913297:0:8913:0:(lov_object.c:369:lov_fini_raid0()) Process leaving
00000020:00001000:0.0:1366159710.913297:0:8913:0:(cl_object.c:896:cl_env_put()) 1@ffff88007a373138
00000020:00000010:0.0:1366159710.913298:0:8913:0:(lu_object.c:421:lu_global_key_fini()) kfreed 'info': 512 at ffff8800175e4200.
00000020:00000010:0.0:1366159710.913299:0:8913:0:(cl_object.c:1090:cl0_key_fini()) kfreed 'info': 600 at ffff8800435e3000.
00000008:00000010:0.0:1366159710.913300:0:8913:0:(osc_dev.c:135:osc_key_fini()) slab-freed '(info)': 984 at ffff88002ad267f0.
00020000:00000010:0.0:1366159710.913301:0:8913:0:(lov_dev.c:159:lov_key_fini()) slab-freed '(info)': 288 at ffff88007db76be8.
00000080:00000010:0.0:1366159710.913302:0:8913:0:(lcommon_cl.c:146:ccc_key_fini()) slab-freed '(info)': 352 at ffff88007db048e0.
00000080:00000010:0.0:1366159710.913303:0:8913:0:(vvp_dev.c:96:vvp_key_fini()) slab-freed '(info)': 488 at ffff88002ad28420.
00000020:00000010:0.0:1366159710.913304:0:8913:0:(lu_object.c:1531:keys_fini()) kfreed 'ctx->lc_value': 320 at ffff8800175e4400.
00000008:00000010:0.0:1366159710.913305:0:8913:0:(osc_dev.c:159:osc_session_fini()) slab-freed '(info)': 424 at ffff88007db05590.
00020000:00000010:0.0:1366159710.913305:0:8913:0:(lov_dev.c:183:lov_session_key_fini()) slab-freed '(info)': 400 at ffff88007db74508.
00000080:00000010:0.0:1366159710.913306:0:8913:0:(lcommon_cl.c:164:ccc_session_key_fini()) slab-freed '(session)': 184 at ffff88007a36b1f8.
00000080:00000010:0.0:1366159710.913307:0:8913:0:(vvp_dev.c:114:vvp_session_key_fini()) slab-freed '(session)': 104 at ffff88007a36a7b0.
00000020:00000010:0.0:1366159710.913308:0:8913:0:(lu_object.c:1531:keys_fini()) kfreed 'ctx->lc_value': 320 at ffff880030d75800.
00000020:00000010:0.0:1366159710.913309:0:8913:0:(cl_object.c:787:cl_env_fini()) slab-freed '(cle)': 168 at ffff88007a373138.
00020000:00000001:0.0:1366159710.913310:0:8913:0:(lov_object.c:653:lov_layout_change()) Process leaving (rc=0 : 0 : 0)
00020000:00000001:0.0:1366159710.913311:0:8913:0:(lov_object.c:731:lov_conf_set()) Process leaving
00020000:00000001:0.0:1366159710.913312:0:8913:0:(lov_object.c:735:lov_conf_set()) Process leaving (rc=0 : 0 : 0)
00000020:00000001:0.0:1366159710.913313:0:8913:0:(cl_object.c:329:cl_conf_set()) Process leaving (rc=0 : 0 : 0)
00000020:00001000:0.0:1366159710.913314:0:8913:0:(cl_object.c:896:cl_env_put()) 1@ffff88007a373e58
00000020:00000010:0.0:1366159710.913315:0:8913:0:(lu_object.c:421:lu_global_key_fini()) kfreed 'info': 512 at ffff88001fc15c00.
00000020:00000010:0.0:1366159710.913316:0:8913:0:(cl_object.c:1090:cl0_key_fini()) kfreed 'info': 600 at ffff880049dc7c00.
00000008:00000010:0.0:1366159710.913317:0:8913:0:(osc_dev.c:135:osc_key_fini()) slab-freed '(info)': 984 at ffff88007b751418.
00020000:00000010:0.0:1366159710.913318:0:8913:0:(lov_dev.c:159:lov_key_fini()) slab-freed '(info)': 288 at ffff88007a370728.
00000080:00000010:0.0:1366159710.913319:0:8913:0:(lcommon_cl.c:146:ccc_key_fini()) slab-freed '(info)': 352 at ffff88007a36f5e0.
00000080:00000010:0.0:1366159710.913320:0:8913:0:(vvp_dev.c:96:vvp_key_fini()) slab-freed '(info)': 488 at ffff88007a36e238.
00000020:00000010:0.0:1366159710.913320:0:8913:0:(lu_object.c:1531:keys_fini()) kfreed 'ctx->lc_value': 320 at ffff88001765f400.
00000008:00000010:0.0:1366159710.913321:0:8913:0:(osc_dev.c:159:osc_session_fini()) slab-freed '(info)': 424 at ffff88002ac25e18.
00020000:00000010:0.0:1366159710.913322:0:8913:0:(lov_dev.c:183:lov_session_key_fini()) slab-freed '(info)': 400 at ffff88007db74828.
00000080:00000010:0.0:1366159710.913323:0:8913:0:(lcommon_cl.c:164:ccc_session_key_fini()) slab-freed '(session)': 184 at ffff88007a36b870.
00000080:00000010:0.0:1366159710.913323:0:8913:0:(vvp_dev.c:114:vvp_session_key_fini()) slab-freed '(session)': 104 at ffff88007a36a0c8.
00000020:00000010:0.0:1366159710.913324:0:8913:0:(lu_object.c:1531:keys_fini()) kfreed 'ctx->lc_value': 320 at ffff880017664000.
00000020:00000010:0.0:1366159710.913325:0:8913:0:(cl_object.c:787:cl_env_fini()) slab-freed '(cle)': 168 at ffff88007a373e58.
00000080:00000001:0.0:1366159710.913326:0:8913:0:(file.c:3236:ll_layout_conf()) Process leaving (rc=0 : 0 : 0)
00010000:00000001:0.0:1366159710.913327:0:8913:0:(ldlm_lock.c:211:ldlm_lock_put()) Process entered
00010000:00000001:0.0:1366159710.913328:0:8913:0:(ldlm_lock.c:244:ldlm_lock_put()) Process leaving
00010000:00000001:0.0:1366159710.913329:0:8913:0:(ldlm_lock.c:609:__ldlm_handle2lock()) Process entered
00000020:00000001:0.0:1366159710.913329:0:8913:0:(lustre_handles.c:172:class_handle2object()) Process entered
00000020:00000001:0.0:1366159710.913330:0:8913:0:(lustre_handles.c:195:class_handle2object()) Process leaving (rc=18446612134411306496 : -131939298245120 : ffff88007cff0600)
00010000:00000001:0.0:1366159710.913331:0:8913:0:(ldlm_lock.c:621:__ldlm_handle2lock()) Process leaving (rc=18446612134411306496 : -131939298245120 : ffff88007cff0600)
00010000:00000001:0.0:1366159710.913332:0:8913:0:(ldlm_lock.c:876:ldlm_lock_decref_internal()) Process entered
00010000:00010000:0.0:1366159710.913333:0:8913:0:(ldlm_lock.c:849:ldlm_lock_decref_internal_nolock()) ### ldlm_lock_decref(CR) ns: lustre-MDT0000-mdc-ffff880037d22800 lock: ffff88007cff0600/0x6e4fd575de0e4a90 lrc: 3/1,0 mode: CR/CR res: 8589935616/3 bits 0x8 rrc: 1 type: IBT flags: 0x20000000000 nid: local remote: 0xb524751f9485880b expref: -99 pid: 8913 timeout: 0 lvb_type: 3
00010000:00000001:0.0:1366159710.913335:0:8913:0:(ldlm_lock.c:211:ldlm_lock_put()) Process entered
00010000:00000001:0.0:1366159710.913335:0:8913:0:(ldlm_lock.c:244:ldlm_lock_put()) Process leaving
00010000:00010000:0.0:1366159710.913337:0:8913:0:(ldlm_lock.c:931:ldlm_lock_decref_internal()) ### add lock into lru list ns: lustre-MDT0000-mdc-ffff880037d22800 lock: ffff88007cff0600/0x6e4fd575de0e4a90 lrc: 2/0,0 mode: CR/CR res: 8589935616/3 bits 0x8 rrc: 1 type: IBT flags: 0x20000000000 nid: local remote: 0xb524751f9485880b expref: -99 pid: 8913 timeout: 0 lvb_type: 3
00010000:00000001:0.0:1366159710.913341:0:8913:0:(ldlm_lock.c:312:ldlm_lock_add_to_lru()) Process entered
00010000:00000001:0.0:1366159710.913342:0:8913:0:(ldlm_lock.c:316:ldlm_lock_add_to_lru()) Process leaving
00010000:00000001:0.0:1366159710.913342:0:8913:0:(ldlm_lock.c:952:ldlm_lock_decref_internal()) Process leaving
00010000:00000001:0.0:1366159710.913343:0:8913:0:(ldlm_lock.c:211:ldlm_lock_put()) Process entered
00010000:00000001:0.0:1366159710.913343:0:8913:0:(ldlm_lock.c:244:ldlm_lock_put()) Process leaving
00000080:00000001:0.0:1366159710.913344:0:8913:0:(file.c:3339:ll_layout_lock_set()) Process leaving (rc=0 : 0 : 0)
00000080:00000001:0.0:1366159710.913345:0:8913:0:(file.c:3445:ll_layout_refresh()) Process leaving (rc=0 : 0 : 0)
00000080:00000001:0.0:1366159710.913346:0:8913:0:(vvp_io.c:1220:vvp_io_init()) Process leaving (rc=0 : 0 : 0)
00020000:00000001:0.0:1366159710.913347:0:8913:0:(lov_io.c:939:lov_io_init_empty()) Process entered
00020000:00000001:0.0:1366159710.913347:0:8913:0:(lov_io.c:968:lov_io_init_empty()) Process leaving (rc=1 : 1 : 1)
00000020:00000001:0.0:1366159710.913348:0:8913:0:(cl_io.c:178:cl_io_init0()) Process leaving (rc=1 : 1 : 1)
00000020:00000001:0.0:1366159710.913349:0:8913:0:(cl_io.c:239:cl_io_rw_init()) Process leaving (rc=1 : 1 : 1)
00000080:00000001:0.0:1366159710.913350:0:8913:0:(file.c:920:ll_file_io_generic()) Process leaving via out (rc=18446744073709551607 : -9 : 0xfffffffffffffff7)
...
but the log of MDS is unavailable (it's at another node "wtm-15vm7"), and Yujian will help to collect it manually and will check it once the log is ready.
James please could you open a new ticket to track this bug? It sounds like what you are describing is a regression introduced into this work (which was included in 2.4.0) by the work from
LU-2193(which happened since 2.4.0) so it would be easier to create a new ticket and link it to the other two related tickets. It can get really confusing trying to work out the situation for a given bug if there are commits to it spanning release boundaries.