Details
Bug | Resolution: Fixed | Critical | Lustre 2.4.0 | None | 3 | 5533
Description
Our sysadmins were expanding one of our 2.1 filesystems yesterday and ran into a problem. Because of earlier problems in 1.8 with out-of-order OST registration, the admin was using a script to mount the OSTs one at a time, in sequential ID order.
With each new OST registration the MGS lock is revoked. Lock timeouts were not infrequent.
One OST hit a timeout while communicating with the MGS, and this proved very hard to recover from. I was forced to break out hexedit to get things working reasonably again.
2012-03-22 10:07:58 LustreError: 166-1: MGC172.19.1.100@o2ib100: Connection to MGS (at 172.19.1.100@o2ib100) was lost; in progress operations using this service will fail
2012-03-22 10:07:58 LustreError: 12150:0:(ldlm_request.c:115:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1332435772, 306s ago), entering recovery for MGS@MGC172.19.1.100@o2ib100_0 ns: MGC172.19.1.100@o2ib100 lock: ffff880314ba0480/0x54303899ba77f6ed lrc: 4/1,0 mode: --/CR res: 6517612/0 rrc: 1 type: PLN flags: 0x10000010 remote: 0x1d01b502e0a66a07 expref: -99 pid: 12150 timeout 0
2012-03-22 10:07:59 LustreError: 12718:0:(obd_mount.c:1163:server_start_targets()) Required registration failed for lsc-OST0174: -4
2012-03-22 10:07:59 LustreError: 12718:0:(obd_mount.c:1719:server_fill_super()) Unable to start targets: -4
2012-03-22 10:07:59 LustreError: 12718:0:(obd_mount.c:1508:server_put_super()) no obd lsc-OST0174
2012-03-22 10:07:59 LustreError: 12718:0:(obd_mount.c:141:server_deregister_mount()) lsc-OST0174 not registered
2012-03-22 10:07:59 LustreError: 11-0: MGC172.19.1.100@o2ib100: Communicating with 172.19.1.100@o2ib100, operation mgs_connect failed with -16.
2012-03-22 10:07:59 Lustre: server umount lsc-OST0174 complete
2012-03-22 10:07:59 LustreError: 12718:0:(obd_mount.c:2160:lustre_fill_super()) Unable to mount (-4)
2012-03-22 10:08:24 LustreError: 11-0: MGC172.19.1.100@o2ib100: Communicating with 172.19.1.100@o2ib100, operation mgs_connect failed with -16.
2012-03-22 10:08:49 LustreError: 11-0: MGC172.19.1.100@o2ib100: Communicating with 172.19.1.100@o2ib100, operation mgs_connect failed with -16.
2012-03-22 10:09:14 LustreError: 11-0: MGC172.19.1.100@o2ib100: Communicating with 172.19.1.100@o2ib100, operation mgs_connect failed with -16.
2012-03-22 10:09:38 LustreError: 11-0: MGC172.19.1.100@o2ib100: Communicating with 172.19.1.100@o2ib100, operation mgs_connect failed with -16.
The admin reran the mount command several more times and each time got either -4 (EINTR), which corresponds to a lock timeout like the one above, or -5 (EIO).
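The negative return codes in these logs are Linux errno values. As a reading aid, here is a minimal Python sketch that decodes the codes quoted in this report. The table is hardcoded with the standard Linux numbering (an assumption about the platform; other systems number some of these differently), and `decode_rc` is a hypothetical helper name, not part of Lustre.

```python
# Linux errno values for the return codes quoted in this report.
# Hardcoded on purpose: errno numbering differs between platforms.
LINUX_ERRNO = {
    4: ("EINTR", "Interrupted system call"),
    5: ("EIO", "Input/output error"),
    16: ("EBUSY", "Device or resource busy"),
    22: ("EINVAL", "Invalid argument"),
    98: ("EADDRINUSE", "Address already in use"),
}


def decode_rc(rc: int) -> str:
    """Turn a kernel-style negative return code into a readable string."""
    name, text = LINUX_ERRNO.get(abs(rc), ("UNKNOWN", "not in this table"))
    return f"{rc} ({name}): {text}"


if __name__ == "__main__":
    for rc in (-4, -5, -16, -22, -98):
        print(decode_rc(rc))
```

Read this way, the repeated "mgs_connect failed with -16" lines are EBUSY responses from the MGS.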
When I was called in, I tried the mount myself to start gathering info. At that point it returned -98 (EADDRINUSE), and the MGS logs showed that it HAD finally processed the OST's registration:
2012-03-22 11:09:22 LustreError: 140-5: Server lsc-OST0174 requested index 372, but that index is already in use. Use --writeconf to force
2012-03-22 11:09:22 LustreError: 12649:0:(mgs_llog.c:2710:mgs_write_log_target()) Can't get index (-98)
2012-03-22 11:09:22 LustreError: 12649:0:(mgs_handler.c:520:mgs_handle_target_reg()) Failed to write lsc-OST0174 log (-98)
But the OST does not know that the registration succeeded, so its mountdata still has the flag LDD_F_VIRGIN set. Because of that, the MGS will never let the OST connect.
That left us two courses of action (to the best of my knowledge):
- Unmount the filesystem completely, use --writeconf on the MGS, restart everything
- Use a hex editor on the OST's mountdata file to clear the LDD_F_VIRGIN flag
Since we did not want to cause a downtime for the filesystem, we chose the latter.
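For readers unfamiliar with this kind of edit, clearing a flag in a binary file comes down to reading one integer field, masking out one bit, and writing the field back. The sketch below shows that operation on an in-memory copy of the data. `FLAGS_OFFSET` and `VIRGIN_MASK` are illustrative assumptions, not values taken from this report; the real field offset and bit value must be checked against the `lustre_disk.h` of the Lustre version in use, and the little-endian 32-bit field width is likewise assumed.

```python
import struct

# ASSUMED, illustrative values only. Verify both against lustre_disk.h
# for your Lustre version before touching real mountdata.
FLAGS_OFFSET = 20      # hypothetical byte offset of the flags field
VIRGIN_MASK = 0x0020   # hypothetical bit value for LDD_F_VIRGIN


def clear_flag(data: bytes, offset: int, mask: int) -> bytes:
    """Return a copy of data with the mask bits cleared in the
    little-endian 32-bit field at the given offset."""
    out = bytearray(data)
    (flags,) = struct.unpack_from("<I", out, offset)
    struct.pack_into("<I", out, offset, flags & ~mask & 0xFFFFFFFF)
    return bytes(out)


if __name__ == "__main__":
    # Synthetic 24-byte buffer with flags 0x0022 at the assumed offset.
    sample = bytes(FLAGS_OFFSET) + struct.pack("<I", 0x0022)
    patched = clear_flag(sample, FLAGS_OFFSET, VIRGIN_MASK)
    print(hex(struct.unpack_from("<I", patched, FLAGS_OFFSET)[0]))
```

The function works on a copy rather than editing in place, so the original bytes can be kept as a backup and compared before anything is written back to disk.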
The mount of the OST seemed to mostly go well, and it appears to be functioning fine now, but I did see this error on the MGS/MDS console:
2012-03-22 15:09:48 Lustre: Found index 372 for lsc-OST0174, updating log
2012-03-22 15:12:26 LustreError: 6002:0:(obd_config.c:1019:class_process_config()) no device for: lsc-OST0174-osc
2012-03-22 15:12:26 LustreError: 6002:0:(obd_config.c:1363:class_config_llog_handler()) Err -22 on cfg command:
2012-03-22 15:12:26 Lustre: cmd=cf00b 0:lsc-OST0174-osc 1:172.19.1.127@o2ib100
The OST appears to be working fine, so I am not sure how worried I should be about that llog error.
I will attach some console logs to show what was going on when the OST registration failed.