Uploaded image for project: 'Lustre'
  1. Lustre
  2. LU-3142

recovery-mds-scale test_failover_mds: dd: writing `/mnt/lustre/d0.dd-client-32vm5.lab.whamcloud.com/dd-file': Bad file descriptor

Details

    • 3
    • 7628

    Description

      While running recovery-mds-scale test_failover_mds, dd operation failed on one of the client nodes as follows:

      2013-04-08 22:25:26: dd run starting
      + mkdir -p /mnt/lustre/d0.dd-client-32vm5.lab.whamcloud.com
      + /usr/bin/lfs setstripe -c -1 /mnt/lustre/d0.dd-client-32vm5.lab.whamcloud.com
      + cd /mnt/lustre/d0.dd-client-32vm5.lab.whamcloud.com
      ++ /usr/bin/lfs df /mnt/lustre/d0.dd-client-32vm5.lab.whamcloud.com
      + FREE_SPACE=12963076
      + BLKS=2916692
      + echo 'Free disk space is 12963076, 4k blocks to dd is 2916692'
      + load_pid=8739
      + wait 8739
      + dd bs=4k count=2916692 status=noxfer if=/dev/zero of=/mnt/lustre/d0.dd-client-32vm5.lab.whamcloud.com/dd-file
      dd: writing `/mnt/lustre/d0.dd-client-32vm5.lab.whamcloud.com/dd-file': Bad file descriptor
      295176+0 records in
      295175+0 records out
      + '[' 1 -eq 0 ']'
      ++ date '+%F %H:%M:%S'
      + echoerr '2013-04-08 22:27:28: dd failed'
      + echo '2013-04-08 22:27:28: dd failed'
      2013-04-08 22:27:28: dd failed
      

      Maloo report: https://maloo.whamcloud.com/test_sets/68bce4aa-a1bb-11e2-bdac-52540035b04c

      Attachments

        Activity

          [LU-3142] recovery-mds-scale test_failover_mds: dd: writing `/mnt/lustre/d0.dd-client-32vm5.lab.whamcloud.com/dd-file': Bad file descriptor

          in the logs of http://review.whamcloud.com/#change,6013

          the stripe data of the file used by "dd" is missed after failover, then it set the layout type of the file to Empty,
          which causes this error (-9, Bad file descriptor)

          ...
          00020000:00000001:0.0:1366159710.913287:0:8913:0:(lov_object.c:359:lov_fini_raid0()) Process entered
          00020000:00000010:0.0:1366159710.913288:0:8913:0:(lov_object.c:362:lov_fini_raid0()) kfreed 'r0->lo_sub': 56 at ffff88007d4e6940.
          00020000:00000010:0.0:1366159710.913290:0:8913:0:(lov_ea.c:131:lsm_free_plain()) slab-freed 'lsm->lsm_oinfo[i]': 112 at ffff8800373d1840.
          00020000:00000010:0.0:1366159710.913292:0:8913:0:(lov_ea.c:131:lsm_free_plain()) slab-freed 'lsm->lsm_oinfo[i]': 112 at ffff8800373d17c0.
          00020000:00000010:0.0:1366159710.913293:0:8913:0:(lov_ea.c:131:lsm_free_plain()) slab-freed 'lsm->lsm_oinfo[i]': 112 at ffff8800373d1740.
          00020000:00000010:0.0:1366159710.913293:0:8913:0:(lov_ea.c:131:lsm_free_plain()) slab-freed 'lsm->lsm_oinfo[i]': 112 at ffff8800373d16c0.
          00020000:00000010:0.0:1366159710.913294:0:8913:0:(lov_ea.c:131:lsm_free_plain()) slab-freed 'lsm->lsm_oinfo[i]': 112 at ffff8800373d1640.
          00020000:00000010:0.0:1366159710.913294:0:8913:0:(lov_ea.c:131:lsm_free_plain()) slab-freed 'lsm->lsm_oinfo[i]': 112 at ffff8800373d15c0.
          00020000:00000010:0.0:1366159710.913295:0:8913:0:(lov_ea.c:131:lsm_free_plain()) slab-freed 'lsm->lsm_oinfo[i]': 112 at ffff8800373d1540.
          00020000:00000010:0.0:1366159710.913296:0:8913:0:(lov_ea.c:133:lsm_free_plain()) kfreed 'lsm': 128 at ffff88007b7cbc40.
          00020000:00000001:0.0:1366159710.913297:0:8913:0:(lov_object.c:369:lov_fini_raid0()) Process leaving
          00000020:00001000:0.0:1366159710.913297:0:8913:0:(cl_object.c:896:cl_env_put()) 1@ffff88007a373138
          00000020:00000010:0.0:1366159710.913298:0:8913:0:(lu_object.c:421:lu_global_key_fini()) kfreed 'info': 512 at ffff8800175e4200.
          00000020:00000010:0.0:1366159710.913299:0:8913:0:(cl_object.c:1090:cl0_key_fini()) kfreed 'info': 600 at ffff8800435e3000.
          00000008:00000010:0.0:1366159710.913300:0:8913:0:(osc_dev.c:135:osc_key_fini()) slab-freed '(info)': 984 at ffff88002ad267f0.
          00020000:00000010:0.0:1366159710.913301:0:8913:0:(lov_dev.c:159:lov_key_fini()) slab-freed '(info)': 288 at ffff88007db76be8.
          00000080:00000010:0.0:1366159710.913302:0:8913:0:(lcommon_cl.c:146:ccc_key_fini()) slab-freed '(info)': 352 at ffff88007db048e0.
          00000080:00000010:0.0:1366159710.913303:0:8913:0:(vvp_dev.c:96:vvp_key_fini()) slab-freed '(info)': 488 at ffff88002ad28420.
          00000020:00000010:0.0:1366159710.913304:0:8913:0:(lu_object.c:1531:keys_fini()) kfreed 'ctx->lc_value': 320 at ffff8800175e4400.
          00000008:00000010:0.0:1366159710.913305:0:8913:0:(osc_dev.c:159:osc_session_fini()) slab-freed '(info)': 424 at ffff88007db05590.
          00020000:00000010:0.0:1366159710.913305:0:8913:0:(lov_dev.c:183:lov_session_key_fini()) slab-freed '(info)': 400 at ffff88007db74508.
          00000080:00000010:0.0:1366159710.913306:0:8913:0:(lcommon_cl.c:164:ccc_session_key_fini()) slab-freed '(session)': 184 at ffff88007a36b1f8.
          00000080:00000010:0.0:1366159710.913307:0:8913:0:(vvp_dev.c:114:vvp_session_key_fini()) slab-freed '(session)': 104 at ffff88007a36a7b0.
          00000020:00000010:0.0:1366159710.913308:0:8913:0:(lu_object.c:1531:keys_fini()) kfreed 'ctx->lc_value': 320 at ffff880030d75800.
          00000020:00000010:0.0:1366159710.913309:0:8913:0:(cl_object.c:787:cl_env_fini()) slab-freed '(cle)': 168 at ffff88007a373138.
          00020000:00000001:0.0:1366159710.913310:0:8913:0:(lov_object.c:653:lov_layout_change()) Process leaving (rc=0 : 0 : 0)
          00020000:00000001:0.0:1366159710.913311:0:8913:0:(lov_object.c:731:lov_conf_set()) Process leaving
          00020000:00000001:0.0:1366159710.913312:0:8913:0:(lov_object.c:735:lov_conf_set()) Process leaving (rc=0 : 0 : 0)
          00000020:00000001:0.0:1366159710.913313:0:8913:0:(cl_object.c:329:cl_conf_set()) Process leaving (rc=0 : 0 : 0)
          00000020:00001000:0.0:1366159710.913314:0:8913:0:(cl_object.c:896:cl_env_put()) 1@ffff88007a373e58
          00000020:00000010:0.0:1366159710.913315:0:8913:0:(lu_object.c:421:lu_global_key_fini()) kfreed 'info': 512 at ffff88001fc15c00.
          00000020:00000010:0.0:1366159710.913316:0:8913:0:(cl_object.c:1090:cl0_key_fini()) kfreed 'info': 600 at ffff880049dc7c00.
          00000008:00000010:0.0:1366159710.913317:0:8913:0:(osc_dev.c:135:osc_key_fini()) slab-freed '(info)': 984 at ffff88007b751418.
          00020000:00000010:0.0:1366159710.913318:0:8913:0:(lov_dev.c:159:lov_key_fini()) slab-freed '(info)': 288 at ffff88007a370728.
          00000080:00000010:0.0:1366159710.913319:0:8913:0:(lcommon_cl.c:146:ccc_key_fini()) slab-freed '(info)': 352 at ffff88007a36f5e0.
          00000080:00000010:0.0:1366159710.913320:0:8913:0:(vvp_dev.c:96:vvp_key_fini()) slab-freed '(info)': 488 at ffff88007a36e238.
          00000020:00000010:0.0:1366159710.913320:0:8913:0:(lu_object.c:1531:keys_fini()) kfreed 'ctx->lc_value': 320 at ffff88001765f400.
          00000008:00000010:0.0:1366159710.913321:0:8913:0:(osc_dev.c:159:osc_session_fini()) slab-freed '(info)': 424 at ffff88002ac25e18.
          00020000:00000010:0.0:1366159710.913322:0:8913:0:(lov_dev.c:183:lov_session_key_fini()) slab-freed '(info)': 400 at ffff88007db74828.
          00000080:00000010:0.0:1366159710.913323:0:8913:0:(lcommon_cl.c:164:ccc_session_key_fini()) slab-freed '(session)': 184 at ffff88007a36b870.
          00000080:00000010:0.0:1366159710.913323:0:8913:0:(vvp_dev.c:114:vvp_session_key_fini()) slab-freed '(session)': 104 at ffff88007a36a0c8.
          00000020:00000010:0.0:1366159710.913324:0:8913:0:(lu_object.c:1531:keys_fini()) kfreed 'ctx->lc_value': 320 at ffff880017664000.
          00000020:00000010:0.0:1366159710.913325:0:8913:0:(cl_object.c:787:cl_env_fini()) slab-freed '(cle)': 168 at ffff88007a373e58.
          00000080:00000001:0.0:1366159710.913326:0:8913:0:(file.c:3236:ll_layout_conf()) Process leaving (rc=0 : 0 : 0)
          00010000:00000001:0.0:1366159710.913327:0:8913:0:(ldlm_lock.c:211:ldlm_lock_put()) Process entered
          00010000:00000001:0.0:1366159710.913328:0:8913:0:(ldlm_lock.c:244:ldlm_lock_put()) Process leaving
          00010000:00000001:0.0:1366159710.913329:0:8913:0:(ldlm_lock.c:609:__ldlm_handle2lock()) Process entered
          00000020:00000001:0.0:1366159710.913329:0:8913:0:(lustre_handles.c:172:class_handle2object()) Process entered
          00000020:00000001:0.0:1366159710.913330:0:8913:0:(lustre_handles.c:195:class_handle2object()) Process leaving (rc=18446612134411306496 : -131939298245120 : ffff88007cff0600)
          00010000:00000001:0.0:1366159710.913331:0:8913:0:(ldlm_lock.c:621:__ldlm_handle2lock()) Process leaving (rc=18446612134411306496 : -131939298245120 : ffff88007cff0600)
          00010000:00000001:0.0:1366159710.913332:0:8913:0:(ldlm_lock.c:876:ldlm_lock_decref_internal()) Process entered
          00010000:00010000:0.0:1366159710.913333:0:8913:0:(ldlm_lock.c:849:ldlm_lock_decref_internal_nolock()) ### ldlm_lock_decref(CR) ns: lustre-MDT0000-mdc-ffff880037d22800 lock: ffff88007cff0600/0x6e4fd575de0e4a90 lrc: 3/1,0 mode: CR/CR res: 8589935616/3 bits 0x8 rrc: 1 type: IBT flags: 0x20000000000 nid: local remote: 0xb524751f9485880b expref: -99 pid: 8913 timeout: 0 lvb_type: 3
          00010000:00000001:0.0:1366159710.913335:0:8913:0:(ldlm_lock.c:211:ldlm_lock_put()) Process entered
          00010000:00000001:0.0:1366159710.913335:0:8913:0:(ldlm_lock.c:244:ldlm_lock_put()) Process leaving
          00010000:00010000:0.0:1366159710.913337:0:8913:0:(ldlm_lock.c:931:ldlm_lock_decref_internal()) ### add lock into lru list ns: lustre-MDT0000-mdc-ffff880037d22800 lock: ffff88007cff0600/0x6e4fd575de0e4a90 lrc: 2/0,0 mode: CR/CR res: 8589935616/3 bits 0x8 rrc: 1 type: IBT flags: 0x20000000000 nid: local remote: 0xb524751f9485880b expref: -99 pid: 8913 timeout: 0 lvb_type: 3
          00010000:00000001:0.0:1366159710.913341:0:8913:0:(ldlm_lock.c:312:ldlm_lock_add_to_lru()) Process entered
          00010000:00000001:0.0:1366159710.913342:0:8913:0:(ldlm_lock.c:316:ldlm_lock_add_to_lru()) Process leaving
          00010000:00000001:0.0:1366159710.913342:0:8913:0:(ldlm_lock.c:952:ldlm_lock_decref_internal()) Process leaving
          00010000:00000001:0.0:1366159710.913343:0:8913:0:(ldlm_lock.c:211:ldlm_lock_put()) Process entered
          00010000:00000001:0.0:1366159710.913343:0:8913:0:(ldlm_lock.c:244:ldlm_lock_put()) Process leaving
          00000080:00000001:0.0:1366159710.913344:0:8913:0:(file.c:3339:ll_layout_lock_set()) Process leaving (rc=0 : 0 : 0)
          00000080:00000001:0.0:1366159710.913345:0:8913:0:(file.c:3445:ll_layout_refresh()) Process leaving (rc=0 : 0 : 0)
          00000080:00000001:0.0:1366159710.913346:0:8913:0:(vvp_io.c:1220:vvp_io_init()) Process leaving (rc=0 : 0 : 0)
          00020000:00000001:0.0:1366159710.913347:0:8913:0:(lov_io.c:939:lov_io_init_empty()) Process entered
          00020000:00000001:0.0:1366159710.913347:0:8913:0:(lov_io.c:968:lov_io_init_empty()) Process leaving (rc=1 : 1 : 1)
          00000020:00000001:0.0:1366159710.913348:0:8913:0:(cl_io.c:178:cl_io_init0()) Process leaving (rc=1 : 1 : 1)
          00000020:00000001:0.0:1366159710.913349:0:8913:0:(cl_io.c:239:cl_io_rw_init()) Process leaving (rc=1 : 1 : 1)
          00000080:00000001:0.0:1366159710.913350:0:8913:0:(file.c:920:ll_file_io_generic()) Process leaving via out (rc=18446744073709551607 : -9 : 0xfffffffffffffff7)
          ...

          but the log of MDS is unavailable (it's at another node "wtm-15vm7"), and Yujian will help to collect it manually and will check it once the log is ready.

          hongchao.zhang Hongchao Zhang added a comment - in the logs of http://review.whamcloud.com/#change,6013 the stripe data of the file used by "dd" is missed after failover, then it set the layout type of the file to Empty, which causes this error (-9, Bad file descriptor) ... 00020000:00000001:0.0:1366159710.913287:0:8913:0:(lov_object.c:359:lov_fini_raid0()) Process entered 00020000:00000010:0.0:1366159710.913288:0:8913:0:(lov_object.c:362:lov_fini_raid0()) kfreed 'r0->lo_sub': 56 at ffff88007d4e6940. 00020000:00000010:0.0:1366159710.913290:0:8913:0:(lov_ea.c:131:lsm_free_plain()) slab-freed 'lsm->lsm_oinfo [i] ': 112 at ffff8800373d1840. 00020000:00000010:0.0:1366159710.913292:0:8913:0:(lov_ea.c:131:lsm_free_plain()) slab-freed 'lsm->lsm_oinfo [i] ': 112 at ffff8800373d17c0. 00020000:00000010:0.0:1366159710.913293:0:8913:0:(lov_ea.c:131:lsm_free_plain()) slab-freed 'lsm->lsm_oinfo [i] ': 112 at ffff8800373d1740. 00020000:00000010:0.0:1366159710.913293:0:8913:0:(lov_ea.c:131:lsm_free_plain()) slab-freed 'lsm->lsm_oinfo [i] ': 112 at ffff8800373d16c0. 00020000:00000010:0.0:1366159710.913294:0:8913:0:(lov_ea.c:131:lsm_free_plain()) slab-freed 'lsm->lsm_oinfo [i] ': 112 at ffff8800373d1640. 00020000:00000010:0.0:1366159710.913294:0:8913:0:(lov_ea.c:131:lsm_free_plain()) slab-freed 'lsm->lsm_oinfo [i] ': 112 at ffff8800373d15c0. 00020000:00000010:0.0:1366159710.913295:0:8913:0:(lov_ea.c:131:lsm_free_plain()) slab-freed 'lsm->lsm_oinfo [i] ': 112 at ffff8800373d1540. 00020000:00000010:0.0:1366159710.913296:0:8913:0:(lov_ea.c:133:lsm_free_plain()) kfreed 'lsm': 128 at ffff88007b7cbc40. 00020000:00000001:0.0:1366159710.913297:0:8913:0:(lov_object.c:369:lov_fini_raid0()) Process leaving 00000020:00001000:0.0:1366159710.913297:0:8913:0:(cl_object.c:896:cl_env_put()) 1@ffff88007a373138 00000020:00000010:0.0:1366159710.913298:0:8913:0:(lu_object.c:421:lu_global_key_fini()) kfreed 'info': 512 at ffff8800175e4200. 00000020:00000010:0.0:1366159710.913299:0:8913:0:(cl_object.c:1090:cl0_key_fini()) kfreed 'info': 600 at ffff8800435e3000. 00000008:00000010:0.0:1366159710.913300:0:8913:0:(osc_dev.c:135:osc_key_fini()) slab-freed '(info)': 984 at ffff88002ad267f0. 00020000:00000010:0.0:1366159710.913301:0:8913:0:(lov_dev.c:159:lov_key_fini()) slab-freed '(info)': 288 at ffff88007db76be8. 00000080:00000010:0.0:1366159710.913302:0:8913:0:(lcommon_cl.c:146:ccc_key_fini()) slab-freed '(info)': 352 at ffff88007db048e0. 00000080:00000010:0.0:1366159710.913303:0:8913:0:(vvp_dev.c:96:vvp_key_fini()) slab-freed '(info)': 488 at ffff88002ad28420. 00000020:00000010:0.0:1366159710.913304:0:8913:0:(lu_object.c:1531:keys_fini()) kfreed 'ctx->lc_value': 320 at ffff8800175e4400. 00000008:00000010:0.0:1366159710.913305:0:8913:0:(osc_dev.c:159:osc_session_fini()) slab-freed '(info)': 424 at ffff88007db05590. 00020000:00000010:0.0:1366159710.913305:0:8913:0:(lov_dev.c:183:lov_session_key_fini()) slab-freed '(info)': 400 at ffff88007db74508. 00000080:00000010:0.0:1366159710.913306:0:8913:0:(lcommon_cl.c:164:ccc_session_key_fini()) slab-freed '(session)': 184 at ffff88007a36b1f8. 00000080:00000010:0.0:1366159710.913307:0:8913:0:(vvp_dev.c:114:vvp_session_key_fini()) slab-freed '(session)': 104 at ffff88007a36a7b0. 00000020:00000010:0.0:1366159710.913308:0:8913:0:(lu_object.c:1531:keys_fini()) kfreed 'ctx->lc_value': 320 at ffff880030d75800. 00000020:00000010:0.0:1366159710.913309:0:8913:0:(cl_object.c:787:cl_env_fini()) slab-freed '(cle)': 168 at ffff88007a373138. 00020000:00000001:0.0:1366159710.913310:0:8913:0:(lov_object.c:653:lov_layout_change()) Process leaving (rc=0 : 0 : 0) 00020000:00000001:0.0:1366159710.913311:0:8913:0:(lov_object.c:731:lov_conf_set()) Process leaving 00020000:00000001:0.0:1366159710.913312:0:8913:0:(lov_object.c:735:lov_conf_set()) Process leaving (rc=0 : 0 : 0) 00000020:00000001:0.0:1366159710.913313:0:8913:0:(cl_object.c:329:cl_conf_set()) Process leaving (rc=0 : 0 : 0) 00000020:00001000:0.0:1366159710.913314:0:8913:0:(cl_object.c:896:cl_env_put()) 1@ffff88007a373e58 00000020:00000010:0.0:1366159710.913315:0:8913:0:(lu_object.c:421:lu_global_key_fini()) kfreed 'info': 512 at ffff88001fc15c00. 00000020:00000010:0.0:1366159710.913316:0:8913:0:(cl_object.c:1090:cl0_key_fini()) kfreed 'info': 600 at ffff880049dc7c00. 00000008:00000010:0.0:1366159710.913317:0:8913:0:(osc_dev.c:135:osc_key_fini()) slab-freed '(info)': 984 at ffff88007b751418. 00020000:00000010:0.0:1366159710.913318:0:8913:0:(lov_dev.c:159:lov_key_fini()) slab-freed '(info)': 288 at ffff88007a370728. 00000080:00000010:0.0:1366159710.913319:0:8913:0:(lcommon_cl.c:146:ccc_key_fini()) slab-freed '(info)': 352 at ffff88007a36f5e0. 00000080:00000010:0.0:1366159710.913320:0:8913:0:(vvp_dev.c:96:vvp_key_fini()) slab-freed '(info)': 488 at ffff88007a36e238. 00000020:00000010:0.0:1366159710.913320:0:8913:0:(lu_object.c:1531:keys_fini()) kfreed 'ctx->lc_value': 320 at ffff88001765f400. 00000008:00000010:0.0:1366159710.913321:0:8913:0:(osc_dev.c:159:osc_session_fini()) slab-freed '(info)': 424 at ffff88002ac25e18. 00020000:00000010:0.0:1366159710.913322:0:8913:0:(lov_dev.c:183:lov_session_key_fini()) slab-freed '(info)': 400 at ffff88007db74828. 00000080:00000010:0.0:1366159710.913323:0:8913:0:(lcommon_cl.c:164:ccc_session_key_fini()) slab-freed '(session)': 184 at ffff88007a36b870. 00000080:00000010:0.0:1366159710.913323:0:8913:0:(vvp_dev.c:114:vvp_session_key_fini()) slab-freed '(session)': 104 at ffff88007a36a0c8. 00000020:00000010:0.0:1366159710.913324:0:8913:0:(lu_object.c:1531:keys_fini()) kfreed 'ctx->lc_value': 320 at ffff880017664000. 00000020:00000010:0.0:1366159710.913325:0:8913:0:(cl_object.c:787:cl_env_fini()) slab-freed '(cle)': 168 at ffff88007a373e58. 00000080:00000001:0.0:1366159710.913326:0:8913:0:(file.c:3236:ll_layout_conf()) Process leaving (rc=0 : 0 : 0) 00010000:00000001:0.0:1366159710.913327:0:8913:0:(ldlm_lock.c:211:ldlm_lock_put()) Process entered 00010000:00000001:0.0:1366159710.913328:0:8913:0:(ldlm_lock.c:244:ldlm_lock_put()) Process leaving 00010000:00000001:0.0:1366159710.913329:0:8913:0:(ldlm_lock.c:609:__ldlm_handle2lock()) Process entered 00000020:00000001:0.0:1366159710.913329:0:8913:0:(lustre_handles.c:172:class_handle2object()) Process entered 00000020:00000001:0.0:1366159710.913330:0:8913:0:(lustre_handles.c:195:class_handle2object()) Process leaving (rc=18446612134411306496 : -131939298245120 : ffff88007cff0600) 00010000:00000001:0.0:1366159710.913331:0:8913:0:(ldlm_lock.c:621:__ldlm_handle2lock()) Process leaving (rc=18446612134411306496 : -131939298245120 : ffff88007cff0600) 00010000:00000001:0.0:1366159710.913332:0:8913:0:(ldlm_lock.c:876:ldlm_lock_decref_internal()) Process entered 00010000:00010000:0.0:1366159710.913333:0:8913:0:(ldlm_lock.c:849:ldlm_lock_decref_internal_nolock()) ### ldlm_lock_decref(CR) ns: lustre-MDT0000-mdc-ffff880037d22800 lock: ffff88007cff0600/0x6e4fd575de0e4a90 lrc: 3/1,0 mode: CR/CR res: 8589935616/3 bits 0x8 rrc: 1 type: IBT flags: 0x20000000000 nid: local remote: 0xb524751f9485880b expref: -99 pid: 8913 timeout: 0 lvb_type: 3 00010000:00000001:0.0:1366159710.913335:0:8913:0:(ldlm_lock.c:211:ldlm_lock_put()) Process entered 00010000:00000001:0.0:1366159710.913335:0:8913:0:(ldlm_lock.c:244:ldlm_lock_put()) Process leaving 00010000:00010000:0.0:1366159710.913337:0:8913:0:(ldlm_lock.c:931:ldlm_lock_decref_internal()) ### add lock into lru list ns: lustre-MDT0000-mdc-ffff880037d22800 lock: ffff88007cff0600/0x6e4fd575de0e4a90 lrc: 2/0,0 mode: CR/CR res: 8589935616/3 bits 0x8 rrc: 1 type: IBT flags: 0x20000000000 nid: local remote: 0xb524751f9485880b expref: -99 pid: 8913 timeout: 0 lvb_type: 3 00010000:00000001:0.0:1366159710.913341:0:8913:0:(ldlm_lock.c:312:ldlm_lock_add_to_lru()) Process entered 00010000:00000001:0.0:1366159710.913342:0:8913:0:(ldlm_lock.c:316:ldlm_lock_add_to_lru()) Process leaving 00010000:00000001:0.0:1366159710.913342:0:8913:0:(ldlm_lock.c:952:ldlm_lock_decref_internal()) Process leaving 00010000:00000001:0.0:1366159710.913343:0:8913:0:(ldlm_lock.c:211:ldlm_lock_put()) Process entered 00010000:00000001:0.0:1366159710.913343:0:8913:0:(ldlm_lock.c:244:ldlm_lock_put()) Process leaving 00000080:00000001:0.0:1366159710.913344:0:8913:0:(file.c:3339:ll_layout_lock_set()) Process leaving (rc=0 : 0 : 0) 00000080:00000001:0.0:1366159710.913345:0:8913:0:(file.c:3445:ll_layout_refresh()) Process leaving (rc=0 : 0 : 0) 00000080:00000001:0.0:1366159710.913346:0:8913:0:(vvp_io.c:1220:vvp_io_init()) Process leaving (rc=0 : 0 : 0) 00020000:00000001:0.0:1366159710.913347:0:8913:0:(lov_io.c:939:lov_io_init_empty()) Process entered 00020000:00000001:0.0:1366159710.913347:0:8913:0:(lov_io.c:968:lov_io_init_empty()) Process leaving (rc=1 : 1 : 1) 00000020:00000001:0.0:1366159710.913348:0:8913:0:(cl_io.c:178:cl_io_init0()) Process leaving (rc=1 : 1 : 1) 00000020:00000001:0.0:1366159710.913349:0:8913:0:(cl_io.c:239:cl_io_rw_init()) Process leaving (rc=1 : 1 : 1) 00000080:00000001:0.0:1366159710.913350:0:8913:0:(file.c:920:ll_file_io_generic()) Process leaving via out (rc=18446744073709551607 : -9 : 0xfffffffffffffff7) ... but the log of MDS is unavailable (it's at another node "wtm-15vm7"), and Yujian will help to collect it manually and will check it once the log is ready.
          yujian Jian Yu added a comment -

          While I trying to reproduce the issue in this ticket on Rosso cluster, I hit LU-3175.

          yujian Jian Yu added a comment - While I trying to reproduce the issue in this ticket on Rosso cluster, I hit LU-3175 .
          yujian Jian Yu added a comment -

          it could be related to http://review.whamcloud.com/5820

          Hi Hongchao, build http://build.whamcloud.com/job/lustre-master/1381/ does not contain the above patch.

          yujian Jian Yu added a comment - it could be related to http://review.whamcloud.com/5820 Hi Hongchao, build http://build.whamcloud.com/job/lustre-master/1381/ does not contain the above patch.
          yujian Jian Yu added a comment -

          Is this actually -EBADF (which is a different error code)? Are there any messages about that in the console log? Are you sure that this was build 1381 (commit 49b06fba39e7fec26a0250ed37f04a620e349b5f) being tested? If it was a later build it might have been caused by commit http://review.whamcloud.com/5820.

          I did not find -EBADF(-9) or -EBADFD(-77) in the console logs. Due to TT-1107, the console logs were not gathered completely in the Maloo report. Please refer to the attached tarball. I'm sure this was build http://build.whamcloud.com/job/lustre-master/1381/.

          The debug patch in http://review.whamcloud.com/#change,6013 has been waiting for test resource for 3 days. I've to start manual test run to reproduce this issue.

          yujian Jian Yu added a comment - Is this actually -EBADF (which is a different error code)? Are there any messages about that in the console log? Are you sure that this was build 1381 (commit 49b06fba39e7fec26a0250ed37f04a620e349b5f) being tested? If it was a later build it might have been caused by commit http://review.whamcloud.com/5820 . I did not find -EBADF(-9) or -EBADFD(-77) in the console logs. Due to TT-1107, the console logs were not gathered completely in the Maloo report. Please refer to the attached tarball. I'm sure this was build http://build.whamcloud.com/job/lustre-master/1381/ . The debug patch in http://review.whamcloud.com/#change,6013 has been waiting for test resource for 3 days. I've to start manual test run to reproduce this issue.

          Is this actually -EBADF (which is a different error code)? Are there any messages about that in the console log? Are you sure that this was build 1381 (commit 49b06fba39e7fec26a0250ed37f04a620e349b5f) being tested? If it was a later build it might have been caused by commit http://review.whamcloud.com/5820.

          adilger Andreas Dilger added a comment - Is this actually -EBADF (which is a different error code)? Are there any messages about that in the console log? Are you sure that this was build 1381 (commit 49b06fba39e7fec26a0250ed37f04a620e349b5f) being tested? If it was a later build it might have been caused by commit http://review.whamcloud.com/5820 .

          the logs in MDS doesn't contain any valid info about Lustre.

          the error "Bad file descriptor" (-EBADFD) is not a common error, there is only one place in Lustre (in ll_statahead_interpret),
          and in Linux, it's only in the following modules

          driver/
          isdn, net, macintosh, ieee1394, atm, media, usb
          fs/
          jfss2, ncpfs
          net/
          iucv, atm, 9p, bluetooth
          sound/
          core, drivers, usb

          then this error could come from driver modules, or trigger at user space.

          hongchao.zhang Hongchao Zhang added a comment - the logs in MDS doesn't contain any valid info about Lustre. the error "Bad file descriptor" (-EBADFD) is not a common error, there is only one place in Lustre (in ll_statahead_interpret), and in Linux, it's only in the following modules driver/ isdn, net, macintosh, ieee1394, atm, media, usb fs/ jfss2, ncpfs net/ iucv, atm, 9p, bluetooth sound/ core, drivers, usb then this error could come from driver modules, or trigger at user space.
          yujian Jian Yu added a comment -

          The recovery-*-scale tests on master branch have been blocked by LU-2008. After the issue was fixed 2 days ago, the hard failover tests were started being performed by autotest. I submitted http://review.whamcloud.com/6013 to reproduce the issue.

          yujian Jian Yu added a comment - The recovery-*-scale tests on master branch have been blocked by LU-2008 . After the issue was fixed 2 days ago, the hard failover tests were started being performed by autotest. I submitted http://review.whamcloud.com/6013 to reproduce the issue.
          pjones Peter Jones added a comment -

          Hongchao

          Could you please look into this one?

          Thanks

          Peter

          pjones Peter Jones added a comment - Hongchao Could you please look into this one? Thanks Peter

          Are we able to pass any MDS failovers, or do they fail 100% of the time? It appears that this test failed immediately on the first MDS failover, but we don't have any useful logs from the MDS, so it is difficult to know why the OSTs were evicted.

          adilger Andreas Dilger added a comment - Are we able to pass any MDS failovers, or do they fail 100% of the time? It appears that this test failed immediately on the first MDS failover, but we don't have any useful logs from the MDS, so it is difficult to know why the OSTs were evicted.

          People

            hongchao.zhang Hongchao Zhang
            yujian Jian Yu
            Votes:
            0 Vote for this issue
            Watchers:
            7 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved: