Details
-
Bug
-
Resolution: Won't Fix
-
Major
-
None
-
3
-
4411
Description
We had an MDS (running 2.1.1-17chaos) lock up due to bug LU-1276. The admins rebooted the node to work around the issue, and after the reboot when the MDS started up we hit the following error:
2012-08-14 13:48:55 LustreError: 3697:0:(llog_lvfs.c:616:llog_lvfs_create()) error looking up logfile 0x1a768b0a:0xa2f17dca: rc -116 2012-08-14 13:48:55 LustreError: 3697:0:(llog_cat.c:174:llog_cat_id2handle()) error opening log id 0x1a768b0a:a2f17dca: rc -116 2012-08-14 13:48:55 LustreError: 3697:0:(llog_obd.c:318:cat_cancel_cb()) Cannot find handle for log 0x1a768b0a 2012-08-14 13:48:55 LustreError: 3472:0:(llog_obd.c:391:llog_obd_origin_setup()) llog_process() with cat_cancel_cb failed: -116 2012-08-14 13:48:55 LustreError: 3472:0:(llog_obd.c:218:llog_setup_named()) obd lsa-OST00bd-osc ctxt 2 lop_setup=ffffffffa05b0a60 failed -116 2012-08-14 13:48:55 LustreError: 3472:0:(osc_request.c:4186:__osc_llog_init()) failed LLOG_MDS_OST_ORIG_CTXT 2012-08-14 13:48:55 LustreError: 3472:0:(osc_request.c:4203:__osc_llog_init()) osc 'lsa-OST00bd-osc' tgt 'mdd_obd-lsa-MDT0000' catid ffff8808312bd860 rc=-116 2012-08-14 13:48:55 LustreError: 3472:0:(osc_request.c:4205:__osc_llog_init()) logid 0x1a7680f6:0x615e782e 2012-08-14 13:48:55 LustreError: 3472:0:(osc_request.c:4233:osc_llog_init()) rc: -116 2012-08-14 13:48:55 LustreError: 3472:0:(lov_log.c:248:lov_llog_init()) error osc_llog_init idx 189 osc 'lsa-OST00bd-osc' tgt 'mdd_obd-lsa-MDT0000' (rc=-116) 2012-08-14 13:48:55 LustreError: 3698:0:(llog_lvfs.c:616:llog_lvfs_create()) error looking up logfile 0x1a768b0a:0xa2f17dca: rc -116 2012-08-14 13:48:55 LustreError: 3698:0:(llog_cat.c:174:llog_cat_id2handle()) error opening log id 0x1a768b0a:a2f17dca: rc -116 2012-08-14 13:48:55 LustreError: 3698:0:(llog_obd.c:318:cat_cancel_cb()) Cannot find handle for log 0x1a768b0a 2012-08-14 13:48:55 LustreError: 3407:0:(llog_obd.c:391:llog_obd_origin_setup()) llog_process() with cat_cancel_cb failed: -116 2012-08-14 13:48:55 LustreError: 3407:0:(llog_obd.c:218:llog_setup_named()) obd lsa-OST00bd-osc ctxt 2 lop_setup=ffffffffa05b0a60 failed -116 2012-08-14 13:48:55 LustreError: 3407:0:(osc_request.c:4186:__osc_llog_init()) failed LLOG_MDS_OST_ORIG_CTXT 2012-08-14 13:48:55 LustreError: 3407:0:(osc_request.c:4203:__osc_llog_init()) osc 'lsa-OST00bd-osc' tgt 'mdd_obd-lsa-MDT0000' catid ffff8804333af960 rc=-116 2012-08-14 13:48:55 LustreError: 3407:0:(osc_request.c:4205:__osc_llog_init()) logid 0x1a7680f6:0x615e782e 2012-08-14 13:48:56 LustreError: 3407:0:(osc_request.c:4233:osc_llog_init()) rc: -116 2012-08-14 13:48:56 LustreError: 3407:0:(lov_log.c:248:lov_llog_init()) error osc_llog_init idx 189 osc 'lsa-OST00bd-osc' tgt 'mdd_obd-lsa-MDT0000' (rc=-116)
I don't see a file in the OBJECTS directory that seems to match 0x1a768b0a:0xa2f17dca (if that is where we are looking). Although -116 is -ESTALE, so I'm not sure that we're even getting to the lower-level lookup. It may be that mds_lvfs_fid2dentry() is returning -ESTALE because the id is 0.
This has left the OST connection "inactive" on the MDS, so any users with data on that OST are currently dead in the water.