Details
- Bug
- Resolution: Fixed
- Blocker
- None
- Lustre 2.5.3
- 2
- 16699
Description
After upgrading Lustre to 2.5.3 (specifically lustre-2.5.3-2chaos) we're no longer able to start the MDS due to the following failure.
Lustre: Lustre: Build Version: 2.5.3-2chaos-2chaos--PRISTINE-2.6.32-431.29.2.1chaos.ch5.2.x86_64
LustreError: 13871:0:(obd_mount_server.c:313:server_mgc_set_fs()) can't set_fs -17
Lustre: fsv-MDT0000: Unable to start target: -17
LustreError: 13871:0:(obd_mount_server.c:845:lustre_disconnect_lwp()) fsv-MDT0000-lwp-MDT0000: Can't end config log fsv-client.
LustreError: 13871:0:(obd_mount_server.c:1419:server_put_super()) fsv-MDT0000: failed to disconnect lwp. (rc=-2)
LustreError: 13871:0:(obd_mount_server.c:1449:server_put_super()) no obd fsv-MDT0000
LustreError: 13871:0:(obd_mount_server.c:135:server_deregister_mount()) fsv-MDT0000 not registered
Lustre: server umount fsv-MDT0000 complete
LustreError: 13871:0:(obd_mount.c:1326:lustre_fill_super()) Unable to mount (-17)
I took a look at the Lustre debug log and the failure is due to a problem creating the local copy of the config logs. This is a ZFS-based MDS which is being upgraded from 2.4.x, so there was never a local CONFIGS directory.
I'll attach the full log, but basically it seems to correctly detect that there is no CONFIGS directory. It then attempts to create the directory, which fails with -17 (EEXIST). Given the debug log we have, it's not clear why this fails, since the directory clearly doesn't exist. We've mounted the MDT via the ZPL and verified this.
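For reference, the -17 and -2 values in the console output are negated Linux errno numbers. The short Python sketch below decodes them with the standard `errno` table. The helper name `describe_rc` is only illustrative and is not part of Lustre.

```python
import errno
import os


def describe_rc(rc):
    """Map a negative kernel-style return code to (errno name, message)."""
    code = abs(rc)
    # errno.errorcode maps numeric values to symbolic names such as "EEXIST".
    name = errno.errorcode.get(code, "UNKNOWN")
    return name, os.strerror(code)


# The two return codes that appear in the mount failure above.
for rc in (-17, -2):
    name, text = describe_rc(rc)
    print(f"rc={rc}: {name} ({text})")
```

So the set_fs and mount failures report EEXIST, while the lwp disconnect reports ENOENT.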
Hoping we could work around the issue, we tried manually creating the CONFIGS directory and adding a copy of the llogs from the MGS. We also tried creating just an empty CONFIGS directory through the ZPL. In both cases this caused the MDS to LBUG on start as follows:
2014-12-04 11:10:50 LustreError: 16688:0:(osd_index.c:1313:osd_index_try()) ASSERTION( dt_object_exists(dt) ) failed:
2014-12-04 11:10:50 LustreError: 16688:0:(osd_index.c:1313:osd_index_try()) LBUG
2014-12-04 11:10:50 Pid: 16688, comm: mount.lustre
2014-12-04 11:10:50
2014-12-04 11:10:50 Call Trace:
2014-12-04 11:10:50 [<ffffffffa05d18f5>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
2014-12-04 11:10:50 [<ffffffffa05d1ef7>] lbug_with_loc+0x47/0xb0 [libcfs]
2014-12-04 11:10:50 [<ffffffffa0d623e4>] osd_index_try+0x224/0x470 [osd_zfs]
2014-12-04 11:10:50 [<ffffffffa0740d41>] dt_try_as_dir+0x41/0x60 [obdclass]
2014-12-04 11:10:50 [<ffffffffa0741351>] dt_lookup_dir+0x31/0x130 [obdclass]
2014-12-04 11:10:50 [<ffffffffa071f845>] llog_osd_open+0x475/0xbb0 [obdclass]
2014-12-04 11:10:50 [<ffffffffa06f15ba>] llog_open+0xba/0x2c0 [obdclass]
2014-12-04 11:10:50 [<ffffffffa06f5131>] llog_backup+0x61/0x500 [obdclass]
2014-12-04 11:10:50 [<ffffffff8128f540>] ? sprintf+0x40/0x50
2014-12-04 11:10:50 [<ffffffffa0d99757>] mgc_process_log+0x1177/0x18f0 [mgc]
2014-12-04 11:10:50 [<ffffffffa0d93360>] ? mgc_blocking_ast+0x0/0x810 [mgc]
2014-12-04 11:10:50 [<ffffffffa08991e0>] ? ldlm_completion_ast+0x0/0x920 [ptlrpc]
2014-12-04 11:10:50 [<ffffffffa0d9b4b5>] mgc_process_config+0x645/0x11d0 [mgc]
2014-12-04 11:10:50 [<ffffffffa07351c6>] lustre_process_log+0x256/0xa60 [obdclass]
2014-12-04 11:10:50 [<ffffffffa05e1971>] ? libcfs_debug_msg+0x41/0x50 [libcfs]
2014-12-04 11:10:50 [<ffffffffa05dc378>] ? libcfs_log_return+0x28/0x40 [libcfs]
2014-12-04 11:10:50 [<ffffffffa0766cb7>] server_start_targets+0x9e7/0x1db0 [obdclass]
2014-12-04 11:10:50 [<ffffffffa05dc378>] ? libcfs_log_return+0x28/0x40 [libcfs]
2014-12-04 11:10:50 [<ffffffffa0738876>] ? lustre_start_mgc+0x4b6/0x1e60 [obdclass]
2014-12-04 11:10:50 [<ffffffffa05e1971>] ? libcfs_debug_msg+0x41/0x50 [libcfs]
2014-12-04 11:10:50 [<ffffffffa0730760>] ? class_config_llog_handler+0x0/0x1880 [obdclass]
2014-12-04 11:10:50 [<ffffffffa076ceb8>] server_fill_super+0xb98/0x19e0 [obdclass]
2014-12-04 11:10:50 [<ffffffffa05dc378>] ? libcfs_log_return+0x28/0x40 [libcfs]
2014-12-04 11:10:50 [<ffffffffa073a3f8>] lustre_fill_super+0x1d8/0x550 [obdclass]
2014-12-04 11:10:50 [<ffffffffa073a220>] ? lustre_fill_super+0x0/0x550 [obdclass]
2014-12-04 11:10:50 [<ffffffff8118d1ef>] get_sb_nodev+0x5f/0xa0
2014-12-04 11:10:50 [<ffffffffa07320e5>] lustre_get_sb+0x25/0x30 [obdclass]
2014-12-04 11:10:50 [<ffffffff8118c82b>] vfs_kern_mount+0x7b/0x1b0
2014-12-04 11:10:50 [<ffffffff8118c9d2>] do_kern_mount+0x52/0x130
2014-12-04 11:10:50 [<ffffffff811ae21b>] do_mount+0x2fb/0x930
2014-12-04 11:10:50 [<ffffffff811ae8e0>] sys_mount+0x90/0xe0
2014-12-04 11:10:50 [<ffffffff8100b0b2>] system_call_fastpath+0x16/0x1b
At this point we're rolling back to the previous Lustre release in order to make the system available again.
Issue Links
- is related to: LU-2059 mgc to backup configuration on osd-based llogs (Resolved)
Yes, 0x200000003:0x0:0x0 must contain the lastID, and on upgrade it should have been copied from seq-200000003-lastid, but it wasn't because of an error in the code. So now we need to restore the proper counter in oi.3/0x200000003:0x0:0x0 by changing 0001 to 0008. Note that this object is not a copy of seq-200000003-lastid, which has the magic word 0xdecafbee at the beginning, followed by 0008. After this change the MDT should start new local files at OID 0x8 and mount successfully. Right now it fails to mount because it tries to use OID 0x1, which is already in use. Note also that you first have to remove any manually created CONFIGS directory and its contents.
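To make the failure mode concrete, here is a toy Python model of the counter problem described above. It is not Lustre code. The class `LocalOidAllocator` is hypothetical, the set of in-use OIDs (0x1 through 0x7) is an assumption chosen for illustration, and the stored counter is treated as the next OID to try, following the wording above; the real on-disk semantics may differ.

```python
import errno


class LocalOidAllocator:
    """Toy model: hands out OIDs for new local files from a stored counter."""

    def __init__(self, counter, used_oids):
        self.counter = counter      # value read from the lastID object
        self.used = set(used_oids)  # OIDs that already exist on disk

    def create(self):
        """Return the new OID, or -EEXIST if the candidate is already used."""
        oid = self.counter
        if oid in self.used:
            return -errno.EEXIST
        self.used.add(oid)
        self.counter += 1
        return oid


# Assumption for illustration: OIDs 0x1..0x7 were allocated before the upgrade.
existing = range(0x1, 0x8)

stale_rc = LocalOidAllocator(0x1, existing).create()  # counter never migrated
fixed_rc = LocalOidAllocator(0x8, existing).create()  # counter restored to 0008

print(f"stale counter: rc={stale_rc}")
print(f"fixed counter: rc={fixed_rc}")
```

In this model the stale counter collides with an existing object and returns -17, the same code seen when creating CONFIGS, while the restored counter allocates 0x8 cleanly.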