[LU-1268] Lustre MDS cannot start after ASSERTION Created: 29/Mar/12 Updated: 02/Apr/12 Resolved: 02/Apr/12 |
|
| Status: | Closed |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.1.0 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Blocker |
| Reporter: | Christopher Morrone | Assignee: | Johann Lombardi (Inactive) |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Environment: |
lustre-2.1.0-24chaos (github.com/chaos/lustre) |
||
| Severity: | 1 |
| Rank (Obsolete): | 6425 |
| Description |
|
To work around Unfortunately, that seems to have left the configuration on the MDS in a bad state. The OST was allowed to reconnect, but on the MDS/MGS console we saw the message:

2012-03-22 15:12:26 LustreError: 6002:0:(obd_config.c:1019:class_process_config()) no device for: lsc-OST0174-osc
2012-03-22 15:12:26 LustreError: 6002:0:(obd_config.c:1363:class_config_llog_handler()) Err -22 on cfg command:
2012-03-22 15:12:26 Lustre: cmd=cf00b 0:lsc-OST0174-osc 1:172.19.1.127@o2ib100

With the MDS already running, that error was non-fatal. But after a crash due to

2012-03-29 03:20:45 Lustre: 20272:0:(mdt_handler.c:4705:mdt_process_config()) For 1.8 interoperability, skip this mdt.group_upcall. It is obsolete
2012-03-29 03:20:45 Lustre: 20272:0:(mdt_handler.c:4711:mdt_process_config()) Found old param mdt.quota_type, changed it to mdd.quota_type.
2012-03-29 03:20:47 LustreError: 20272:0:(obd_config.c:1019:class_process_config()) no device for: lsc-OST0174-osc
2012-03-29 03:20:47 LustreError: 20272:0:(obd_config.c:1363:class_config_llog_handler()) Err -22 on cfg command:
2012-03-29 03:20:47 Lustre: cmd=cf00b 0:lsc-OST0174-osc 1:172.19.1.127@o2ib100
2012-03-29 03:20:47 LustreError: 15b-f: MGC172.19.1.100@o2ib100: The configuration from log 'lsc-MDT0000'failed from the MGS (-22). Make sure this client and the MGS are running compatible versions of Lustre.
2012-03-29 03:20:47 LustreError: 15c-8: MGC172.19.1.100@o2ib100: The configuration from log 'lsc-MDT0000' failed (-22). This may be the result of communication errors between this node and the MGS, a bad configuration, or other errors. See the syslog for more information.
2012-03-29 03:20:47 LustreError: 20183:0:(obd_mount.c:1192:server_start_targets()) failed to start server lsc-MDT0000: -22
2012-03-29 03:20:47 LustreError: 20183:0:(obd_mount.c:1719:server_fill_super()) Unable to start targets: -22
2012-03-29 03:20:47 Lustre: Failing over lsc-MDT0000

Can you suggest any quick fixes?
This is a production filesystem that is currently unusable with jobs hung waiting on its return. I fear that we may need to really unmount this filesystem everywhere and resort to completely reinitializing the logs with writeconf. |
| Comments |
| Comment by Christopher Morrone [ 29/Mar/12 ] |
|
llog_reader output from three lsc-OST* (lsc-OST0174 is the problem one)

# sumom-mds1 /mnt/tmp/CONFIGS > llog_reader lsc-OST0173
Header size : 8192
Time : Thu Mar 22 10:02:48 2012
Number of records: 7
Target uuid : config_uuid
-----------------------
#01 (224)marker 2251 (flags=0x01, v2.1.0.0) lsc-OST0173 'add ost' Thu Mar 22 10:02:48 2012-
#02 (128)attach 0:lsc-OST0173 1:obdfilter 2:lsc-OST0173_UUID
#03 (112)setup 0:lsc-OST0173 1:dev 2:type 3:f
#04 (224)marker 2251 (flags=0x02, v2.1.0.0) lsc-OST0173 'add ost' Thu Mar 22 10:02:48 2012-
#05 (224)marker 2254 (flags=0x01, v2.1.0.0) lsc-OST0173 'ost.quota_type' Thu Mar 22 10:02:48 2012-
#06 (104)param 0:lsc-OST0173 1:ost.quota_type=ug
#07 (224)marker 2254 (flags=0x02, v2.1.0.0) lsc-OST0173 'ost.quota_type' Thu Mar 22 10:02:48 2012-

# sumom-mds1 /mnt/tmp/CONFIGS > llog_reader lsc-OST0174
Header size : 8192
Time : Thu Mar 22 10:05:48 2012
Number of records: 10
Target uuid : config_uuid
-----------------------
#01 (224)marker 2255 (flags=0x01, v2.1.0.0) lsc-OST0174 'add ost' Thu Mar 22 10:05:48 2012-
#02 (128)attach 0:lsc-OST0174 1:obdfilter 2:lsc-OST0174_UUID
#03 (112)setup 0:lsc-OST0174 1:dev 2:type 3:f
#04 (224)marker 2255 (flags=0x02, v2.1.0.0) lsc-OST0174 'add ost' Thu Mar 22 10:05:48 2012-
#05 (224)SKIP START marker 2258 (flags=0x05, v2.1.0.0) lsc-OST0174 'ost.quota_type' Thu Mar 22 10:05:48 2012-Thu Mar 22 15:09:48 2012
#06 (104)SKIP param 0:lsc-OST0174 1:ost.quota_type=ug
#07 (224)SKIP END marker 2258 (flags=0x06, v2.1.0.0) lsc-OST0174 'ost.quota_type' Thu Mar 22 10:05:48 2012-Thu Mar 22 15:09:48 2012
#08 (224)marker 2381 (flags=0x01, v2.1.0.0) lsc-OST0174 'ost.quota_type' Thu Mar 22 15:09:48 2012-
#09 (104)param 0:lsc-OST0174 1:ost.quota_type=ug
#10 (224)marker 2381 (flags=0x02, v2.1.0.0) lsc-OST0174 'ost.quota_type' Thu Mar 22 15:09:48 2012-

# sumom-mds1 /mnt/tmp/CONFIGS > llog_reader lsc-OST0175
Header size : 8192
Time : Thu Mar 22 10:08:16 2012
Number of records: 7
Target uuid : config_uuid
-----------------------
#01 (224)marker 2259 (flags=0x01, v2.1.0.0) lsc-OST0175 'add ost' Thu Mar 22 10:08:16 2012-
#02 (128)attach 0:lsc-OST0175 1:obdfilter 2:lsc-OST0175_UUID
#03 (112)setup 0:lsc-OST0175 1:dev 2:type 3:f
#04 (224)marker 2259 (flags=0x02, v2.1.0.0) lsc-OST0175 'add ost' Thu Mar 22 10:08:16 2012-
#05 (224)marker 2262 (flags=0x01, v2.1.0.0) lsc-OST0175 'ost.quota_type' Thu Mar 22 10:08:16 2012-
#06 (104)param 0:lsc-OST0175 1:ost.quota_type=ug
#07 (224)marker 2262 (flags=0x02, v2.1.0.0) lsc-OST0175 'ost.quota_type' Thu Mar 22 10:08:16 2012- |
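The difference between lsc-OST0174 and its neighbours in the output above is the marker block that llog_reader prints as SKIP (flags 0x05 and 0x06 instead of 0x01 and 0x02). A minimal sketch for spotting such blocks in saved llog_reader text follows. The names `SKIP_BIT`, `MARKER_RE`, `SAMPLE` and `skipped_markers` are made up for this illustration, and the meaning of bit 0x04 is inferred from the output shown here rather than taken from the Lustre headers.

```python
import re
from typing import Set

# Assumption: bit 0x04 marks a skipped block. This is inferred from the
# output above, where flags 0x05/0x06 are printed as SKIP and 0x01/0x02 are not.
SKIP_BIT = 0x04

# Matches e.g. "marker 2258 (flags=0x05, v2.1.0.0)".
MARKER_RE = re.compile(r"marker\s+(\d+)\s+\(flags=0x([0-9a-fA-F]+)")

def skipped_markers(llog_text: str) -> Set[int]:
    """Return the marker IDs whose flags have the skip bit set."""
    skipped = set()
    for line in llog_text.splitlines():
        match = MARKER_RE.search(line)
        if match and int(match.group(2), 16) & SKIP_BIT:
            skipped.add(int(match.group(1)))
    return skipped

# Records #04 to #08 of the lsc-OST0174 log, copied from the output above.
SAMPLE = """\
#04 (224)marker 2255 (flags=0x02, v2.1.0.0) lsc-OST0174 'add ost' Thu Mar 22 10:05:48 2012-
#05 (224)SKIP START marker 2258 (flags=0x05, v2.1.0.0) lsc-OST0174 'ost.quota_type' Thu Mar 22 10:05:48 2012-Thu Mar 22 15:09:48 2012
#06 (104)SKIP param 0:lsc-OST0174 1:ost.quota_type=ug
#07 (224)SKIP END marker 2258 (flags=0x06, v2.1.0.0) lsc-OST0174 'ost.quota_type' Thu Mar 22 10:05:48 2012-Thu Mar 22 15:09:48 2012
#08 (224)marker 2381 (flags=0x01, v2.1.0.0) lsc-OST0174 'ost.quota_type' Thu Mar 22 15:09:48 2012-
"""

if __name__ == "__main__":
    print(skipped_markers(SAMPLE))
```

Run against the three logs, a check like this would flag only lsc-OST0174, which matches the observation that it is the one OST whose settings were rewritten at 15:09:48.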
| Comment by Zhenyu Xu [ 29/Mar/12 ] |
|
I suggest you unmount the OST0174 device, and use "tunefs.lustre --mdt --writeconf /dev/mdtdevice" and |
| Comment by Christopher Morrone [ 29/Mar/12 ] |
|
It looks like the lsc-MDT0000 config is the problem. It has the following entry for lsc-OST0174:

#4597 (224)marker 2256 (flags=0x01, v2.1.0.0) lsc-OST0174 'add osc' Thu Mar 22 10:05:48 2012-
#4598 (088)add_uuid nid=172.19.1.125@o2ib100(0x50064ac13017d) 0: 1:172.19.1.125@o2ib100
#4599 (120)attach 0:lsc-OST0174-osc-MDT0000 1:osc 2:lsc-mdtlov_UUID
#4600 (144)setup 0:lsc-OST0174-osc-MDT0000 1:lsc-OST0174_UUID 2:172.19.1.125@o2ib100
#4601 (088)add_uuid nid=172.19.1.127@o2ib100(0x50064ac13017f) 0: 1:172.19.1.127@o2ib100
#4602 (112)add_conn 0:lsc-OST0174-osc-MDT0000 1:172.19.1.127@o2ib100
#4603 (128)lov_modify_tgts add 0:lsc-mdtlov 1:lsc-OST0174_UUID 2:372 3:1
#4604 (224)marker 2256 (flags=0x02, v2.1.0.0) lsc-OST0174 'add osc' Thu Mar 22 10:05:48 2012-

But then later in the log it also has these lines:

#4849 (224)marker 2380 (flags=0x01, v2.1.0.0) lsc-OST0174 'add failnid' Thu Mar 22 15:09:48 2012-
#4850 (088)add_uuid nid=172.19.1.127@o2ib100(0x50064ac13017f) 0: 1:172.19.1.127@o2ib100
#4851 (104)add_conn 0:lsc-OST0174-osc 1:172.19.1.127@o2ib100
#4852 (224)marker 2380 (flags=0x02, v2.1.0.0) lsc-OST0174 'add failnid' Thu Mar 22 15:09:48 2012-

This is the only NEW OST that has an 'add failnid' entry. All of the other 'add failnid' entries are for the old OSTs and have a version of v1.8.3.0 or v1.8.4.0 and dates back in Sept and Nov of 2010 respectively.

So that entry likely needs to be purged from the log to get things running again. |
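The mismatch described above (the device is attached as lsc-OST0174-osc-MDT0000, but the later add_conn names lsc-OST0174-osc) can be checked mechanically on llog_reader text. The following is a rough sketch only: `dangling_add_conn`, `RECORD_RE`, `DEVICE_RE` and `MDT_SAMPLE` are hypothetical names for this illustration, the parsing assumes the record layout shown in this ticket, and it considers only attach and add_conn records in log order. It is not how Lustre itself resolves devices.

```python
import re
from typing import List, Tuple

# "#4599 (120)attach 0:lsc-OST0174-osc-MDT0000 ..." -> index, command, arguments
RECORD_RE = re.compile(r"#(\d+)\s+\(\d+\)(\w+)\s+(.*)")
# The "0:<name>" argument carries the device name in the records shown above.
DEVICE_RE = re.compile(r"\b0:(\S+)")

def dangling_add_conn(llog_text: str) -> List[Tuple[int, str]]:
    """List (record index, device) for add_conn records naming a device
    that no earlier attach record in the same text created."""
    attached = set()
    dangling = []
    for line in llog_text.splitlines():
        record = RECORD_RE.match(line.strip())
        if record is None:
            continue
        index, command, args = int(record.group(1)), record.group(2), record.group(3)
        device = DEVICE_RE.search(args)
        if device is None:
            continue
        if command == "attach":
            attached.add(device.group(1))
        elif command == "add_conn" and device.group(1) not in attached:
            dangling.append((index, device.group(1)))
    return dangling

# The attach and add_conn records quoted in this comment.
MDT_SAMPLE = """\
#4599 (120)attach 0:lsc-OST0174-osc-MDT0000 1:osc 2:lsc-mdtlov_UUID
#4600 (144)setup 0:lsc-OST0174-osc-MDT0000 1:lsc-OST0174_UUID 2:172.19.1.125@o2ib100
#4602 (112)add_conn 0:lsc-OST0174-osc-MDT0000 1:172.19.1.127@o2ib100
#4851 (104)add_conn 0:lsc-OST0174-osc 1:172.19.1.127@o2ib100
"""

if __name__ == "__main__":
    for index, device in dangling_add_conn(MDT_SAMPLE):
        print(f"record #{index}: no attach seen for {device}")
```

On the records quoted here, only #4851 is reported, which is the same device name that appears in the "no device for: lsc-OST0174-osc" console error.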
| Comment by Niu Yawei (Inactive) [ 29/Mar/12 ] |
|
Hi Chris,

Indeed, that's the problem of mixed OSC names for the same OST that I mentioned in The OST0174 is registered twice. The second time (after you manually changed it to LDD_F_VIRGIN), the MGS found that this OST was already registered, so it turned to updating the failover NIDs; that's why only OST0174 has an extra 'add failnid' entry. (On first-time registration, the failover NID is stored along with 'add osc', so there is no extra 'add failnid'.)

I don't know if there is any way to purge only the 'add failnid' entry, but we can try writeconf for the MDT and OST0174 and let them register again, so that there will be no 'add failnid' in the MDT log. However, the long-term solution should be to purge all the old config logs and regenerate all of them in 2.0; otherwise, when you use 'lctl conf_param' on the new OST, the mixed OSC name problem will show up again. |
| Comment by Christopher Morrone [ 29/Mar/12 ] |
|
hexedit skills to the rescue again! I cleared the four bits associated with those four lines in the MDT log and reduced the entry count by 4 in the header. It seems to have worked. We are back up and running again. Let's just hope that I didn't miss anything...

Note that to the best of my knowledge no one has ever used "lctl conf_param" on that OST. I suspect that whatever happened, happened as a result of Lustre's normal registration logic. |
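The manual edit described above amounts to two bookkeeping changes: clear one bit per cancelled record, and lower the record count by the same amount. The toy model below shows only that arithmetic on an in-memory bitmap. Everything in it is an assumption made for illustration: the function name `cancel_records`, the least-significant-bit-first ordering within each byte, and the bitmap size. It does not describe the real on-disk llog header layout and must not be used as a guide for editing a production log.

```python
from typing import Iterable

def cancel_records(bitmap: bytearray, count: int, indices: Iterable[int]) -> int:
    """Clear the bit for each record index and return the updated count.

    Toy model only: bit N is assumed to live in byte N // 8 at position N % 8.
    """
    for index in indices:
        byte_pos, bit_pos = divmod(index, 8)
        mask = 1 << bit_pos
        if bitmap[byte_pos] & mask:        # only count bits that were actually set
            bitmap[byte_pos] &= ~mask & 0xFF
            count -= 1
    return count

def is_set(bitmap: bytearray, index: int) -> bool:
    """Report whether the bit for a record index is set."""
    return bool(bitmap[index // 8] & (1 << (index % 8)))

if __name__ == "__main__":
    # Model a log in which records 1..4852 are live.
    bitmap = bytearray(1024)
    for i in range(1, 4853):
        bitmap[i // 8] |= 1 << (i % 8)
    # Cancel the four 'add failnid' records, #4849 to #4852.
    remaining = cancel_records(bitmap, 4852, range(4849, 4853))
    print(remaining, is_set(bitmap, 4848), is_set(bitmap, 4849))
```

The guard on already-clear bits keeps the count consistent if the same index is passed twice, which is the kind of slip a hand edit can make.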
| Comment by Christopher Morrone [ 30/Mar/12 ] |
|
You can close this ticket now. I worked around the problem, and we can just focus on the original problem in |