[LU-8753] Recovery already passed deadline with DNE Created: 24/Oct/16 Updated: 19/Sep/18 Resolved: 30/Jan/17 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | Lustre 2.10.0 |
| Type: | Bug | Priority: | Critical |
| Reporter: | Giuseppe Di Natale (Inactive) | Assignee: | Lai Siyao |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | llnl | ||
| Environment: |
lustre-2.8.0_3.chaos-1.ch6.x86_64 |
||
| Attachments: |
|
|
| Issue Links: |
|
|
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
MDT[0-1,6-16] (decimal) have timed out of recovery; appx 1473 clients recovered, 1 evicted.

2016-10-15 15:49:40 [ 1088.878945] Lustre: lsh-MDT0002: Recovery already passed deadline 0:32, It is most likely due to DNE recovery is failed or stuck, please wait a few more minutes or abort the recovery.
2016-10-15 15:49:40 [ 1088.899333] Lustre: Skipped 157 previous similar messages
2016-10-15 15:50:12 [ 1121.013380] Lustre: lsh-MDT0002: Recovery already passed deadline 1:04, It is most likely due to DNE recovery is failed or stuck, please wait a few more minutes or abort the recovery.
2016-10-15 15:50:12 [ 1121.033744] Lustre: Skipped 735 previous similar messages
<ConMan> Console [zinc3] departed by <root@localhost> on pts/0 at 10-15 15:50.
2016-10-15 15:50:52 [ 1161.329645] LustreError: 38991:0:(mdt_handler.c:5737:mdt_iocontrol()) lsh-MDT0002: Aborting recovery for device
2016-10-15 15:50:52 [ 1161.341983] LustreError: 38991:0:(ldlm_lib.c:2565:target_stop_recovery_thread()) lsh-MDT0002: Aborting recovery
2016-10-15 15:50:52 [ 1161.343686] LustreError: 18435:0:(lod_dev.c:419:lod_sub_recovery_thread()) lsh-MDT0004-osp-MDT0002 getting update log failed: rc = -108
2016-10-15 15:50:52 [ 1161.377751] Lustre: 18461:0:(ldlm_lib.c:2014:target_recovery_overseer()) recovery is aborted, evict exports in recovery

The earliest such messages are:

2016-10-15 15:50:52 [ 1161.390842] LustreError: 18461:0:(update_records.c:72:update_records_dump()) master transno = 4295056926 batchid = 35538 flags = 0 ops = 42 params = 32
2016-10-15 15:50:52 [ 1161.408040] LustreError: 18461:0:(update_records.c:72:update_records_dump()) master transno = 4295056931 batchid = 35542 flags = 0 ops = 42 params = 32

The last few are:

2016-10-15 15:52:11 [ 1240.343780] LustreError: 18461:0:(update_records.c:72:update_records_dump()) master transno = 4295064355 batchid = 39987 flags = 0 ops = 42 params = 32
2016-10-15 15:52:11 [ 1240.361375] LustreError: 18461:0:(update_records.c:72:update_records_dump()) master transno = 4295064356 batchid = 39999 flags = 0 ops = 42 params = 32
2016-10-15 15:52:11 [ 1240.378995] LustreError: 18461:0:(update_records.c:72:update_records_dump()) master transno = 4295064357 batchid = 40018 flags = 0 ops = 42 params = 32
2016-10-15 15:52:11 [ 1240.396579] LustreError: 18461:0:(update_records.c:72:update_records_dump()) master transno = 4295064358 batchid = 40011 flags = 0 ops = 42 params = 32
2016-10-15 15:52:11 [ 1240.414180] LustreError: 18461:0:(update_records.c:72:update_records_dump()) master transno = 4295064360 batchid = 40005 flags = 0 ops = 42 params = 32

We have seen this type of behavior on multiple DNE filesystems. Also, is there any way to determine if these errors have been corrected, abandoned, etc? |
| Comments |
| Comment by Giuseppe Di Natale (Inactive) [ 24/Oct/16 ] |
|
Went ahead and linked |
| Comment by Peter Jones [ 24/Oct/16 ] |
|
Lai, could you please advise on this one? Thanks. Peter |
| Comment by Olaf Faaland [ 26/Oct/16 ] |
|
To give some sense of scale, zinc1 (which hosts the MGS and MDT0000) has encountered this after I aborted recovery. So far 80,000 "update_records.c:72:update_records_dump" rows have been dumped, and it's not done yet. |
| Comment by Olaf Faaland [ 26/Oct/16 ] |
|
This is appearing every time, or almost every time, we abort recovery on an MDT. |
| Comment by Lai Siyao [ 26/Oct/16 ] |
|
The log doesn't help much; it just tells us that DNE recovery can't finish, but not why. Could you upload all the MDS syslogs? In the meantime I'll see how this can be improved. |
| Comment by Olaf Faaland [ 26/Oct/16 ] |
|
Hello Lai, Are there two separate issues here, or do they likely have the same root cause? That is, (a) recovery is hung, and (b) whatever happens after abort_recovery, with all the update_records_dump() messages in the logs. If you think they are two issues, then I should make a separate ticket for the hung recovery. |
| Comment by Di Wang [ 26/Oct/16 ] |
|
Indeed, it looks like they are separate issues. Please create a new ticket for (b).

Recovery already passed deadline 1:04, It is most likely due to DNE recovery is failed or stuck,

This is usually caused by one of the recovery threads failing to retrieve the update recovery logs, or failing to replay those update records. As Lai said, more console logs or a stack trace on lsh-MDT0002 will definitely help us figure out the issue here. Thanks. |
| Comment by Olaf Faaland [ 26/Oct/16 ] |
|
Di, OK, I'll upload more logs. Can you change the summary line of this ticket to "Recovery already passed deadline with DNE" or similar? I don't have perms to modify that field. thanks, |
| Comment by Olaf Faaland [ 27/Oct/16 ] |
|
I needed to reboot a different lustre 2.8 cluster w/ DNE today and the issue occurs there as well. Providing logs from this instance instead of the earlier one, because I have both console and lctl dk output. |
| Comment by Olaf Faaland [ 27/Oct/16 ] |
|
Lustre-related syslog for the entire jet Lustre cluster, including the time span when recovery became stuck on jet7. Between the stuck-recovery issue and the LU-8763 issue it's messy, but you can see the stuck recovery following a cold boot of all the MDS nodes. |
| Comment by Olaf Faaland [ 27/Oct/16 ] |
|
Please label this topllnl, thanks. |
| Comment by Peter Jones [ 27/Oct/16 ] |
|
Olaf, I have done so, but I am surprised that you could not do this yourself. If you confirm that this is the case then I will dig into your JIRA permissions. Peter |
| Comment by Olaf Faaland [ 27/Oct/16 ] |
|
Hi Peter, That would be great, thanks. I can add labels at the time I create a ticket, but I cannot change them after the fact. I confirmed that on this ticket. |
| Comment by Peter Jones [ 27/Oct/16 ] |
|
Olaf, I made a tweak to your permissions. I'm not 100% certain it will help, but give it another try. Peter |
| Comment by Olaf Faaland [ 27/Oct/16 ] |
|
Peter, |
| Comment by Di Wang [ 27/Oct/16 ] |
|
According to the debug log:

00000100:02000000:10.0:1477593165.292609:0:143210:0:(import.c:1539:ptlrpc_import_recovery_state_machine()) lquake-MDT0006: Connection restored to lquake-MDT000b-mdtlov_UUID (at 172.19.1.122@o2ib100)
00000040:00020000:11.0:1477593165.661058:0:31346:0:(llog.c:591:llog_process_thread()) lquake-MDT000b-osp-MDT0006 retry remote llog process

It looks like MDT0006 keeps trying to retrieve the update log from MDT000b (172.19.1.122@o2ib100), but keeps getting EIO. Do you have the log on MDT000b? Thanks. |
| Comment by Olaf Faaland [ 27/Oct/16 ] |
|
Di, |
| Comment by Di Wang [ 28/Oct/16 ] |
|
Ah, according to the log, recovery on MDT000b (jet12) seems to have failed because of an invalid log record. (Hmm, I have not seen that in a long time.)

Oct 27 10:24:35 jet12 kernel: [ 356.925961] LustreError: 34856:0:(llog_osd.c:940:llog_osd_next_block()) lquake-MDT0009-osp-MDT000b: invalid llog tail at log id 0x6:1073846872/0 offset 131072 last_rec idx 0 tail idx 809132920
Oct 27 10:24:35 jet12 kernel: [ 356.947238] LustreError: 34856:0:(lod_dev.c:419:lod_sub_recovery_thread()) lquake-MDT0009-osp-MDT000b getting update log failed: rc = -22
Oct 27 10:24:35 jet12 kernel: LustreError: 34856:0:(llog_osd.c:940:llog_osd_next_block()) lquake-MDT0009-osp-MDT000b: invalid llog tail at log id 0x6:1073846872/0 offset 131072 last_rec idx 0 tail idx 809132920
Oct 27 10:24:35 jet12 kernel: LustreError: 34856:0:(lod_dev.c:419:lod_sub_recovery_thread()) lquake-MDT0009-osp-MDT000b getting update log failed: rc = -22
Oct 27 10:24:36 jet12 kernel: LustreError: 34858:0:(lod_dev.c:419:lod_sub_recovery_thread()) lquake-MDT000c-osp-MDT000b getting update log failed: rc = -108
Oct 27 10:24:36 jet12 kernel: LustreError: 34858:0:(lod_dev.c:419:lod_sub_recovery_thread()) Skipped 1 previous similar message
Oct 27 10:24:36 jet12 kernel: [ 358.012215] LustreError: 34858:0:(lod_dev.c:419:lod_sub_recovery_thread()) lquake-MDT000c-osp-MDT000b getting update log failed: rc = -108
Oct 27 10:24:36 jet12 kernel: [ 358.028270] LustreError: 34858:0:(lod_dev.c:419:lod_sub_recovery_thread()) Skipped 1 previous similar message
Oct 27 10:24:37 jet12 kernel: [ 359.735327] Lustre: 34889:0:(ldlm_lib.c:2014:target_recovery_overseer()) recovery is aborted, evict exports in recovery
Oct 27 10:24:37 jet12 kernel: [ 359.748515] Lustre: 34889:0:(ldlm_lib.c:2014:target_recovery_overseer()) Skipped 2 previous similar messages
Oct 27 10:24:37 jet12 kernel: [ 359.760557] LustreError: 34889:0:(update_records.c:72:update_records_dump()) master transno = 231932247340 batchid = 210453399132 flags = 0 ops = 154 params = 80

This causes recovery to get stuck on jet12, which in turn causes recovery on the other MDTs to get stuck. |
| Comment by Di Wang [ 28/Oct/16 ] |
|
Olaf: you can mount the MDT as ZFS, then delete all of update_log_dir and update_log:

rm -rf /mnt/zfs/update_log*

Note: this will not destroy the data; it only deletes the uncommitted recovery logs, and since these update logs are corrupted they are useless anyway. I am trying to figure out why these update logs got corrupted, but there is not much information here.
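For illustration, a minimal sketch of the procedure described above, using a hypothetical pool/dataset name and mountpoints; the exact mount invocation depends on how the target was formatted and on the dataset's mountpoint property, so treat this as a sketch rather than an exact recipe.

# Stop Lustre on this MDT first, then expose the backing dataset via the ZPL.
umount /mnt/lustre/mdt0002                        # hypothetical Lustre target mountpoint
mount -t zfs mdtpool/mdt0002 /mnt/zfs             # for legacy mountpoints; otherwise 'zfs mount' the dataset
rm -rf /mnt/zfs/update_log /mnt/zfs/update_log_dir
umount /mnt/zfs
# Remount the target as Lustre afterwards.
mount -t lustre mdtpool/mdt0002 /mnt/lustre/mdt0002
|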
| Comment by Olaf Faaland [ 28/Oct/16 ] |
|
Di, |
| Comment by Olaf Faaland [ 28/Oct/16 ] |
|
I see now you are working on that ticket and already know about it. |
| Comment by Olaf Faaland [ 29/Oct/16 ] |
|
I tried deleting the update_log and update_log_dir on all the MDTs under ZFS. I was able to delete update_log and the contents of update_log_dir, but not able to remove the directory update_log_dir itself. Attempting to mount MDT0000 fails with EEXIST, and the debug log reports the attempt to create the FID_SEQ_UPDATE_LOG fails:

2016-10-28 17:30:14.292693 00000020:00001000:5.0::0:143276:0:(local_storage.c:379:__local_file_create()) create new object [0x200000009:0x0:0x0]
2016-10-28 17:30:14.292694 00000020:00000001:5.0::0:143276:0:(local_storage.c:260:local_object_create()) Process entered
2016-10-28 17:30:14.292694 00080000:00000001:5.0::0:143276:0:(osd_object.c:1530:osd_object_create()) Process entered
2016-10-28 17:30:14.292716 00080000:00000010:5.0::0:143276:0:(osd_object.c:1236:__osd_attr_init()) kmalloced 'bulk': 520 at ffff887ef276f400.
2016-10-28 17:30:14.292719 00080000:00000010:5.0::0:143276:0:(osd_object.c:1268:__osd_attr_init()) kfreed 'bulk': 520 at ffff887ef276f400.
2016-10-28 17:30:14.292720 00080000:00000001:5.0::0:143276:0:(osd_oi.c:241:fid_is_on_ost()) Process entered
2016-10-28 17:30:14.292720 00080000:00000001:5.0::0:143276:0:(osd_oi.c:261:fid_is_on_ost()) Process leaving (rc=0 : 0 : 0)
2016-10-28 17:30:14.292728 00080000:00000001:5.0::0:143276:0:(osd_object.c:1571:osd_object_create()) Process leaving via out (rc=18446744073709551599 : -17 : 0xffffffffffffffef)
2016-10-28 17:30:14.292729 00080000:00000001:5.0::0:143276:0:(osd_object.c:1604:osd_object_create()) Process leaving (rc=18446744073709551599 : -17 : ffffffffffffffef)
2016-10-28 17:30:14.292730 00000020:00000001:5.0::0:143276:0:(local_storage.c:264:local_object_create()) Process leaving (rc=18446744073709551599 : -17 : ffffffffffffffef)

Any suggestions? |
| Comment by Olaf Faaland [ 29/Oct/16 ] |
|
Renaming the directory doesn't change the behavior. |
| Comment by Di Wang [ 29/Oct/16 ] |
|
Hmm, it looks like on ZFS the log file is created in a different way. It seems to me "rm update_log" only deletes the name; the real object is still there. Unfortunately, I am not a ZFS expert. Let me ask around. Alex, any ideas? |
| Comment by Di Wang [ 29/Oct/16 ] |
Might it have to do with LU-8569? It might be, indeed. |
| Comment by Alex Zhuravlev [ 29/Oct/16 ] |
|
please check with the actual code: rm (direct mount and ZPL, right?) doesn't update OI. |
| Comment by Di Wang [ 31/Oct/16 ] |
|
Sigh, I have to deal with this in lustre. Let me make a patch. |
| Comment by Gerrit Updater [ 31/Oct/16 ] |
|
wangdi (di.wang@intel.com) uploaded a new patch: http://review.whamcloud.com/23490 |
| Comment by Di Wang [ 31/Oct/16 ] |
|
I just pushed the patch http://review.whamcloud.com/23490. I do not have a ZFS environment to try it in on my side. Olaf, could you please try it and see if it helps with your mount failure? |
| Comment by Olaf Faaland [ 03/Nov/16 ] |
|
Hi Di, I get the following:

2016-11-02 20:47:31 [ 3932.246998] Lustre: MGS: Logs for fs lquake were removed by user request. All servers must be restarted in order to regenerate the logs.
2016-11-02 20:47:31 [ 3932.449165] LustreError: 79681:0:(osd_object.c:437:osd_object_init()) lquake-MDT0000: lookup [0x200000009:0x0:0x0]/0xa4 failed: rc = -2
2016-11-02 20:47:31 [ 3932.464118] LustreError: 79681:0:(obd_mount_server.c:1798:server_fill_super()) Unable to start targets: -2
2016-11-02 20:47:31 [ 3932.475878] Lustre: Failing over lquake-MDT0000
2016-11-02 20:47:32 [ 3932.568791] Lustre: server umount lquake-MDT0000 complete
2016-11-02 20:47:32 [ 3932.575517] LustreError: 79681:0:(obd_mount.c:1426:lustre_fill_super()) Unable to mount (-2)

Note that I initially started both the mgs and mdt0000 with -o writeconf. |
| Comment by Di Wang [ 03/Nov/16 ] |
|
Hi Olaf, could you please retry and dump a -1 debug log? Please be sure to remove update_log* under ZFS before retrying. Thanks.
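For readers unfamiliar with the request, a rough sketch of how a full ("-1") debug log is typically captured on the server; the buffer size and output path below are illustrative, not prescribed by this ticket:

# Before reproducing, enable all debug flags and enlarge the trace buffer.
lctl set_param debug=-1
lctl set_param subsystem_debug=-1
lctl set_param debug_mb=1024        # size is a guess; pick what memory allows
lctl clear                          # start from an empty buffer
# ... remove update_log* via ZFS and retry the mount ...
lctl dk /tmp/dk.$(hostname).$(date +%s)   # dump (and clear) the kernel debug log
|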
| Comment by Olaf Faaland [ 04/Nov/16 ] |
|
Di, |
| Comment by Di Wang [ 05/Nov/16 ] |
|
Olaf, |
| Comment by Gerrit Updater [ 07/Nov/16 ] |
|
wangdi (di.wang@intel.com) uploaded a new patch: http://review.whamcloud.com/23635 |
| Comment by Di Wang [ 07/Nov/16 ] |
|
Olaf, do you still have that corrupted log file? It should be a regular file under update_log_dir, "[0x40019a58:0x6:0x0]". If not, could you please add this debug patch http://review.whamcloud.com/23635 and retry? If the error happens, could you please post the console messages and upload the corrupted log file? Thanks. |
| Comment by Olaf Faaland [ 08/Nov/16 ] |
|
Di,
Joe mounted the MDT0000 dataset via ZPL and found there is no entry with a name beginning '0x200000009', not under oi.9/ nor anywhere else. update_log is deleted. update_log_dir could not be deleted (it fails with 'not empty' even though it appears to be empty). It appears to be damaged:

[root@jet1:zfs]# ll -d update_*
drw-r--r-- 2 root root 18446744073709551140 Oct 28 13:40 update_log_dir

Joe attempted the import with debug and subsystem_debug set to -1. I've attached the log, dk.jet1.1478565846.gz |
| Comment by Olaf Faaland [ 08/Nov/16 ] |
|
Di,
MDT000b's update_log_dir contains a file named "[0x240019a58:0x6:0x0]". It is attached as mdt0b.0x240019a58_0x6_0x0.tgz. Is that the correct one? Thanks, |
| Comment by Di Wang [ 08/Nov/16 ] |
|
Hmm, it looks like the update_log "0x200000009\:0x1\:0x0" still exists in the system.

00080000:00000001:2.0:1478565782.638227:0:14104:0:(osd_oi.c:261:fid_is_on_ost()) Process leaving (rc=0 : 0 : 0)
00080000:00000001:2.0:1478565782.638236:0:14104:0:(osd_object.c:1571:osd_object_create()) Process leaving via out (rc=18446744073709551599 : -17 : 0xffffffffffffffef)
00080000:00000001:2.0:1478565782.638237:0:14104:0:(osd_object.c:1604:osd_object_create()) Process leaving (rc=18446744073709551599 : -17 : ffffffffffffffef)
00000020:00000001:2.0:1478565782.638238:0:14104:0:(local_storage.c:264:local_object_create()) Process leaving (rc=18446744073709551599 : -17 : ffffffffffffffef)
00000020:00000001:2.0:1478565782.638238:0:14104:0:(local_storage.c:382:__local_file_create()) Process leaving via unlock (rc=18446744073709551599 : -17 : 0xffffffffffffffef)

Could you please tell me which source lines these messages map to? They do not seem to match my b2_8 version. Thanks. |
| Comment by Di Wang [ 08/Nov/16 ] |
|
Olaf, yes, that is the one, thanks. |
| Comment by Di Wang [ 08/Nov/16 ] |
|
Ah, Olaf:

lquake-MDT0009-osp-MDT000b: invalid llog tail at log id 0x6:1073846872/0 offset 131072 last_rec idx 0 tail idx 809132920

The corrupted log is on MDT0009, not MDT000b; could you please check there? Thanks. |
| Comment by Giuseppe Di Natale (Inactive) [ 08/Nov/16 ] |
|
Di, Below are snippets of code to help you identify where the log messages you posted are from.

00080000:00000001:2.0:1478565782.638227:0:14104:0:(osd_oi.c:261:fid_is_on_ost()) Process leaving (rc=0 : 0 : 0)

In lustre/osd-zfs/osd_oi.c:

249
250 rc = osd_fld_lookup(env, osd, fid_seq(fid), range);
251 if (rc != 0) {
252 if (rc != -ENOENT)
253 CERROR("%s: "DFID" lookup failed: rc = %d\n",
254 osd_name(osd), PFID(fid), rc);
255 RETURN(0);
256 }
257
258 if (fld_range_is_ost(range))
259 RETURN(1);
260
261 RETURN(0);
262 }
00080000:00000001:2.0:1478565782.638236:0:14104:0:(osd_object.c:1571:osd_object_create()) Process leaving via out (rc=18446744073709551599 : -17 : 0xffffffffffffffef)

In lustre/osd-zfs/osd_object.c:

1563 zde->zde_pad = 0;
1564 zde->zde_dnode = db->db_object;
1565 zde->zde_type = IFTODT(attr->la_mode & S_IFMT);
1566
1567 zapid = osd_get_name_n_idx(env, osd, fid, buf);
1568
1569 rc = -zap_add(osd->od_os, zapid, buf, 8, 1, zde, oh->ot_tx);
1570 if (rc)
1571 GOTO(out, rc);
1572
1573 /* Add new object to inode accounting.
1574 * Errors are not considered as fatal */
1575 rc = -zap_increment_int(osd->od_os, osd->od_iusr_oid,
1576 (attr->la_valid & LA_UID) ? attr->la_uid : 0, 1,
1577 oh->ot_tx);

00080000:00000001:2.0:1478565782.638237:0:14104:0:(osd_object.c:1604:osd_object_create()) Process leaving (rc=18446744073709551599 : -17 : ffffffffffffffef)

In lustre/osd-zfs/osd_object.c:

1594 rc = osd_init_lma(env, obj, fid, oh);
1595 if (rc) {
1596 CERROR("%s: can not set LMA on "DFID": rc = %d\n",
1597 osd->od_svname, PFID(fid), rc);
1598 /* ignore errors during LMA initialization */
1599 rc = 0;
1600 }
1601
1602 out:
1603 up_write(&obj->oo_guard);
1604 RETURN(rc);
1605 }
00000020:00000001:2.0:1478565782.638238:0:14104:0:(local_storage.c:264:local_object_create()) Process leaving (rc=18446744073709551599 : -17 : ffffffffffffffef)

In lustre/obdclass/local_storage.c:

251 int local_object_create(const struct lu_env *env,
252 struct local_oid_storage *los,
253 struct dt_object *o, struct lu_attr *attr,
254 struct dt_object_format *dof, struct thandle *th)
255 {
256 struct dt_thread_info *dti = dt_info(env);
257 u64 lastid;
258 int rc;
259
260 ENTRY;
261
262 rc = dt_create(env, o, attr, NULL, dof, th);
263 if (rc)
264 RETURN(rc);
265
266 if (los == NULL)
267 RETURN(rc);
00000020:00000001:2.0:1478565782.638238:0:14104:0:(local_storage.c:382:__local_file_create()) Process leaving via unlock (rc=18446744073709551599 : -17 : 0xffffffffffffffef)

In lustre/obdclass/local_storage.c:

378 CDEBUG(D_OTHER, "create new object "DFID"\n",
379 PFID(lu_object_fid(&dto->do_lu)));
380 rc = local_object_create(env, los, dto, attr, dof, th);
381 if (rc)
382 GOTO(unlock, rc);
383 LASSERT(dt_object_exists(dto));
384
385 if (dti->dti_dof.dof_type == DFT_DIR) {
|
| Comment by Giuseppe Di Natale (Inactive) [ 08/Nov/16 ] |
|
I'm also hunting for the corrupted log on MDT0009. |
| Comment by Giuseppe Di Natale (Inactive) [ 08/Nov/16 ] |
|
I went ahead and attached the log file from mdt0009. File name is mdt09.0x240019a58_0x6_0x0.tgz |
| Comment by Di Wang [ 08/Nov/16 ] |
|
Thanks, Giuseppe. I will check. |
| Comment by Di Wang [ 11/Nov/16 ] |
|
Hello Giuseppe, mdt09.0x240019a58_0x6_0x0.tgz and mdt0b.0x240019a58_0x6_0x0.tgz seem to be the same. Could you please check? Thanks. |
| Comment by Olaf Faaland [ 14/Nov/16 ] |
|
Di, Attaching logs.2016-11-14.tgz, which includes:

mnt/lu-8753/oi.88/0x240019a58:0x6:0x0

The CONFIGS directory is there to document which target's dataset these files came from. Note that the oi.88, oi.10, and update_log_dir entries all refer to the same underlying object/file. Thanks, |
| Comment by Di Wang [ 15/Nov/16 ] |
|
Thanks Olaf, though the file still seems to be the same. This multiply linked llog file does not look right to me, especially with links under both oi.88 and oi.10. Hmm, llog_osd_create() should not do llog_osd_regular_fid_add_name_entry() for ZFS. So either the OSD needs to be able to tell which FIDs are llog FIDs (remember these normal llog FIDs inside the OSD), or llog_osd needs to be able to tell whether it is on ldiskfs or ZFS. Alex, any suggestion here? Hmm, it looks like we can pass the llog type to the OSD via dt_object_format, then remember the llog type inside the LMA, and also move the llog_osd_regular_fid_add_name_entry() calls into the ldiskfs OSD. I will cook a patch. |
| Comment by Olaf Faaland [ 15/Nov/16 ] |
|
Thanks Di. I am sure of the provenance of the one I submitted today because of the way I created the tar file. |
| Comment by Di Wang [ 15/Nov/16 ] |
|
I checked the llog file you posted here. Unfortunately this is not the corrupted llog file. I am cooking a patch for this multiple link llog issue. |
| Comment by Olaf Faaland [ 15/Nov/16 ] |
|
Di, |
| Comment by Olaf Faaland [ 15/Nov/16 ] |
|
Di, |
| Comment by Di Wang [ 15/Nov/16 ] |
|
Hello Olaf, yes, if we cannot get the corrupt log file it is hard to find the root cause here. If you can reproduce the issue, please apply the patch http://review.whamcloud.com/#/c/23635/ and also upload the corrupted log file, thanks. By the way: were you able to delete those update_log files in the end? Or will you reformat the system? (Sorry about that.) |
| Comment by Olaf Faaland [ 17/Nov/16 ] |
|
Hi Di, I'll apply your patch and attempt to reproduce. I wasn't able to delete those files yet; I'm going to spend a little longer looking into it. Thanks. |
| Comment by Di Wang [ 28/Nov/16 ] |
|
Olaf: any good news? |
| Comment by Olaf Faaland [ 28/Nov/16 ] |
|
Di, |
| Comment by Olaf Faaland [ 28/Nov/16 ] |
|
Di, |
| Comment by Olaf Faaland [ 28/Nov/16 ] |
|
Di, Confirm this is the right llog and I'll upload it.

[faaland1@zinci syslog]$ grep invalid *
console.zinc1:2016-11-28 15:09:41 [1123885.603795] LustreError: 96133:0:(llog_osd.c:954:llog_osd_next_block()) lsh-MDT000c-osp-MDT0000: invalid llog tail at log id 0x1:7024/0 offset 70320128 bytes 32768

[root@zinc13:update_log_dir]# ls ../CONFIGS/
lsh-MDT000c lsh-client param
[root@zinc13:update_log_dir]# pwd
/mnt/lu-8753/update_log_dir
[root@zinc13:update_log_dir]# ll *1b70*
-rw-r--r-- 1 root root 71419576 Dec 31 1969 [0x500001b70:0x1:0x0]
-rw-r--r-- 1 root root 0 Dec 31 1969 [0x500001b70:0x2:0x0]
[root@zinc13:update_log_dir]# llog_reader '[0x500001b70:0x1:0x0]' | less
[root@zinc13:update_log_dir]# llog_reader '[0x500001b70:0x1:0x0]' | grep -C 10 70320128
rec #8910 type=106a0000 len=8536 offset 70238896
Bit 8911 of 6687 not set
rec #8912 type=106a0000 len=8536 offset 70254592
rec #8913 type=106a0000 len=8536 offset 70263128
rec #8914 type=106a0000 len=8536 offset 70271664
Bit 8915 of 6687 not set
rec #8916 type=106a0000 len=8536 offset 70287360
rec #8917 type=106a0000 len=8536 offset 70295896
rec #8918 type=106a0000 len=8536 offset 70304432
off 70312968 skip 7160 to next chunk.
off 70320128 skip 32768 to next chunk.
rec #8924 type=106a0000 len=9056 offset 70352896
rec #8925 type=106a0000 len=9056 offset 70361952
rec #8926 type=106a0000 len=9048 offset 70371008
Bit 8927 of 6687 not set
rec #8928 type=106a0000 len=9056 offset 70385664
rec #8929 type=106a0000 len=9056 offset 70394720
rec #8930 type=106a0000 len=9056 offset 70403776
Bit 8931 of 6687 not set
rec #8932 type=106a0000 len=9056 offset 70418432
rec #8933 type=106a0000 len=9048 offset 70427488 |
| Comment by Olaf Faaland [ 28/Nov/16 ] |
|
Attached console logs and a text file listing which target is running on which node, for the most recent instance of the issue on Zinc. target_to_node_map.nov28.txt |
| Comment by Olaf Faaland [ 29/Nov/16 ] |
|
Attached Lustre debug logs for the zinc target which is still in recovery (zinc7/lsh-MDT0006), and for the target holding the llog with the invalid llog tail (zinc13/lsh-MDT000c). Also attached what I believe is the relevant llog file from MDT000c. dk.zinc7.1480375634.gz |
| Comment by Olaf Faaland [ 29/Nov/16 ] |
|
Attached lustre debug log for the node reporting the invalid tail. dk.zinc1.1480375634.gz |
| Comment by Olaf Faaland [ 29/Nov/16 ] |
|
I will leave the file system as-is (down for users, due to the MDT stuck in recovery) until you have all the information you need. |
| Comment by Di Wang [ 29/Nov/16 ] |
|
Thanks, Olaf. It looks correct, I will check right away. |
| Comment by Di Wang [ 29/Nov/16 ] |
|
According to the dump of this update log, it looks like there is a hole from 70312968 to 70320128 (all zeroes), which causes the "corruption".

master transno = 17237427556 batchid = 12884938203 flags = 0 ops = 172 params = 84 rec_len 8536
rec #8916 type=106a0000 len=8536 offset 70287360, total 6584
offset 70295896 index 8917 type 106a0000
master transno = 17237427557 batchid = 12884938202 flags = 0 ops = 172 params = 84 rec_len 8536
rec #8917 type=106a0000 len=8536 offset 70295896, total 6585
offset 70304432 index 8918 type 106a0000
master transno = 17237427805 batchid = 12884938204 flags = 0 ops = 172 params = 84 rec_len 8536
rec #8918 type=106a0000 len=8536 offset 70304432, total 6586
offset 70312968 index 0 type 0
master transno = 0 batchid = 0 flags = 0 ops = 0 params = 0 rec_len 0
off 70312968 skip 7160 to next chunk. test_bit yes
offset 70320128 index 0 type 0
master transno = 0 batchid = 0 flags = 0 ops = 0 params = 0 rec_len 0
off 70320128 skip 32768 to next chunk. test_bit yes
offset 70352896 index 8924 type 106a0000
master transno = 17286373092 batchid = 12884981614 flags = 0 ops = 189 params = 86 rec_len 9056
rec #8924 type=106a0000 len=9056 offset 70352896, total 6587
offset 70361952 index 8925 type 106a0000

It looks like lgh_write_offset is not being reset during recovery, but this needs further investigation. Thanks. |
| Comment by Gerrit Updater [ 29/Nov/16 ] |
|
wangdi (di.wang@intel.com) uploaded a new patch: http://review.whamcloud.com/24008 |
| Comment by Di Wang [ 29/Nov/16 ] |
|
Olaf: please enable D_HA and also collect the corrupt llog file, in case this patch does not help. Thanks. |
| Comment by Olaf Faaland [ 29/Nov/16 ] |
|
Hi Di, Disregard my message about not having lgh_write_offset. Not sure what I was looking at, it's there. thanks, |
| Comment by Di Wang [ 29/Nov/16 ] |
|
By the way: you probably need to delete the update_logs or reformat the system, since the update log is already corrupted, which will cause recovery to get stuck anyway. Sorry about that. |
| Comment by Di Wang [ 04/Dec/16 ] |
|
Hi Olaf, how is the test going? |
| Comment by Olaf Faaland [ 05/Dec/16 ] |
|
Hi Di, |
| Comment by Giuseppe Di Natale (Inactive) [ 07/Dec/16 ] |
|
Di, We couldn't remove update_log_dir through the ZPL because of an issue with ZAP size bookkeeping. I went ahead and made |
| Comment by Olaf Faaland [ 07/Dec/16 ] |
|
Updated Lustre on the test filesystem jet/lquake with the patch to remove lgh_write_offset. Will attempt to reproduce today. |
| Comment by Olaf Faaland [ 14/Dec/16 ] |
|
Di, With the lgh_write_offset patch recovery seems to complete more reliably. I'm now seeing new symptoms, and I don't know if they are a consequence of the patch or a separate problem that was hidden by the recovery issue. I'll summarize what I am seeing here; please tell me if you think it's a separate issue that should get a separate ticket.

I'd created several directories striped across all 16 MDTs. Within each such directory, a node was running mdtest, creating directories and files and deleting them. I'm not sure which phase it was in at the time I powered off 4 of the 16 MDS nodes (MDT0008 through MDT000b). I then powered them on again.

During recovery, the server logs show:

2016-12-13 14:48:46 [ 241.954091] Lustre: lquake-MDT0009: Client 987b4351-b459-8f25-89f9-59fa31045f6b (at 192.168.128.32@o2ib18) reconnecting, waiting for 31 clients in recovery for 3:58
2016-12-13 14:48:46 [ 241.972729] Lustre: Skipped 7 previous similar messages
2016-12-13 14:48:46 [ 241.979849] LustreError: 49706:0:(ldlm_lib.c:2751:target_queue_recovery_request()) @@@ dropping resent queued req req@ffff883ee4685400 x1553274421557352/t0(90194378184) o36->987b4351-b459-8f25-89f9-59fa31045f6b@192.168.128.32@o2ib18:7/0 lens 624/0 e 0 to 0 dl 1481669387 ref 2 fl Interpret:/6/ffffffff rc 0/-1
2016-12-13 14:48:46 [ 242.013679] LustreError: 49706:0:(ldlm_lib.c:2751:target_queue_recovery_request()) Skipped 7 previous similar messages
2016-12-13 14:48:50 [ 245.274867] LustreError: 37287:0:(ldlm_lib.c:1903:check_for_next_transno()) lquake-MDT0009: waking for gap in transno, VBR is OFF (skip: 90194377997, ql: 26, comp: 5, conn: 31, next: 90194377998, next_update 90194378030 last_committed: 90194377997)
<<< transno messages redacted >>>
2016-12-13 14:48:55 [ 246.868005] Lustre: lquake-MDT0009: Recovery over after 1:39, of 31 clients 31 recovered and 0 were evicted.
2016-12-13 14:48:55 [ 251.145347] sched: RT throttling activated

Then there seems to be some inter-MDT issue:

2016-12-13 14:53:16 [ 512.145169] Lustre: 36488:0:(client.c:2063:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1481669506/real 1481669506] req@ffff883ef5309b00 x1553642790382996/t0(0) o400->lquake-MDT000a-osp-MDT0009@172.19.1.121@o2ib100:24/4 lens 224/224 e 2 to 1 dl 1481669596 ref 1 fl Rpc:X/c0/ffffffff rc 0/-1
2016-12-13 14:54:46 [ 602.183013] Lustre: 36488:0:(client.c:2063:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1481669596/real 1481669596] req@ffff883ee54c6f00 x1553642790383484/t0(0) o400->lquake-MDT000a-osp-MDT0009@172.19.1.121@o2ib100:24/4 lens 224/224 e 2 to 1 dl 1481669686 ref 1 fl Rpc:X/c0/ffffffff rc 0/-1
2016-12-13 14:56:16 [ 692.220902] Lustre: 36488:0:(client.c:2063:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1481669686/real 1481669686] req@ffff883edf31f800 x1553642790384036/t0(0) o400->lquake-MDT000a-osp-MDT0009@172.19.1.121@o2ib100:24/4 lens 224/224 e 2 to 1 dl 1481669776 ref 1 fl Rpc:X/c0/ffffffff rc 0/-1
2016-12-13 14:56:23 [ 698.815522] Lustre: lquake-MDT000a-osp-MDT0009: Connection restored to 172.19.1.121@o2ib100 (at 172.19.1.121@o2ib100)
2016-12-13 14:56:23 [ 698.828569] Lustre: Skipped 23 previous similar messages

And clients seem to get evicted as a result of that later inter-MDT issue:

2016-12-13 14:53:16 [352250.742120] req@ffff880fe4f8dd00 x1553274427352992/t68719541671(68719541671) o36->lquake-MDT000a-mdc-ffff88103d075000@172.19.1.121@o2ib100:12/10 lens 624/416 e 13 to 1 dl 1481669596 ref 2 fl Interpret:EX/6/ffffffff rc -110/-1
2016-12-13 14:54:46 [352340.778326] LustreError: 8274:0:(client.c:2874:ptlrpc_replay_interpret()) @@@ request replay timed out.
2016-12-13 14:54:46 [352340.778326] req@ffff880fe4f8dd00 x1553274427352992/t68719541671(68719541671) o36->lquake-MDT000a-mdc-ffff88103d075000@172.19.1.121@o2ib100:12/10 lens 624/416 e 16 to 1 dl 1481669686 ref 2 fl Interpret:EX/6/ffffffff rc -110/-1
2016-12-13 14:56:13 [352428.151175] LustreError: 11-0: lquake-MDT000a-mdc-ffff88103d075000: operation mds_reint to node 172.19.1.121@o2ib100 failed: rc = -107
2016-12-13 14:56:42 [352456.923731] LustreError: 167-0: lquake-MDT000a-mdc-ffff88103d075000: This client was evicted by lquake-MDT000a; in progress operations using this service will fail.

After the 14:56 message the LustreError messages stop, and listings, rmdirs, and unlinks seem to work, but... there are damaged directories. Long listings of certain directories produce this:

lstat("dir.mdtest.1.173", 0x61d930) = -1 ENOENT (No such file or directory)
write(2, "ls: ", 4ls: ) = 4
write(2, "cannot access dir.mdtest.1.173", 30cannot access dir.mdtest.1.173) = 30
write(2, ": No such file or directory", 27: No such file or directory) = 27
write(2, "\n", 1
) = 1
lstat("dir.mdtest.4.170", 0x61d9f0) = -1 ENOENT (No such file or directory)
write(2, "ls: ", 4ls: ) = 4
write(2, "cannot access dir.mdtest.4.170", 30cannot access dir.mdtest.4.170) = 30
write(2, ": No such file or directory", 27: No such file or directory) = 27
write(2, "\n", 1
) = 1
lstat("dir.mdtest.1.172", 0x61dab0) = -1 ENOENT (No such file or directory)
write(2, "ls: ", 4ls: ) = 4
write(2, "cannot access dir.mdtest.1.172", 30cannot access dir.mdtest.1.172) = 30
write(2, ": No such file or directory", 27: No such file or directory) = 27
write(2, "\n", 1
) = 1
and client console log messages like this:

2016-12-13 15:42:12 [355186.720441] LustreError: 159150:0:(llite_lib.c:2309:ll_prep_inode()) new_inode -fatal: rc -2
2016-12-13 15:44:33 [355328.281847] LustreError: 159156:0:(llite_lib.c:2309:ll_prep_inode()) new_inode -fatal: rc -2
2016-12-13 15:44:33 [355328.292937] LustreError: 159156:0:(llite_lib.c:2309:ll_prep_inode()) Skipped 23 previous similar messages |
| Comment by Di Wang [ 14/Dec/16 ] |
|
Thanks for the update, Olaf.

2016-12-13 14:48:46 [ 241.954091] Lustre: lquake-MDT0009: Client 987b4351-b459-8f25-89f9-59fa31045f6b (at 192.168.128.32@o2ib18) reconnecting, waiting for 31 clients in recovery for 3:58
2016-12-13 14:48:46 [ 241.972729] Lustre: Skipped 7 previous similar messages
2016-12-13 14:48:46 [ 241.979849] LustreError: 49706:0:(ldlm_lib.c:2751:target_queue_recovery_request()) @@@ dropping resent queued req req@ffff883ee4685400 x1553274421557352/t0(90194378184) o36->987b4351-b459-8f25-89f9-59fa31045f6b@192.168.128.32@o2ib18:7/0 lens 624/0 e 0 to 0 dl 1481669387 ref 2 fl Interpret:/6/ffffffff rc 0/-1
2016-12-13 14:48:46 [ 242.013679] LustreError: 49706:0:(ldlm_lib.c:2751:target_queue_recovery_request()) Skipped 7 previous similar messages
2016-12-13 14:48:50 [ 245.274867] LustreError: 37287:0:(ldlm_lib.c:1903:check_for_next_transno()) lquake-MDT0009: waking for gap in transno, VBR is OFF (skip: 90194377997, ql: 26, comp: 5, conn: 31, next: 90194377998, next_update 90194378030 last_committed: 90194377997)

This is OK, especially since the recovery looks good according to the message that follows it. Actually this "...waking for gap..." message has been removed from the console log in 2.9; see https://review.whamcloud.com/#/c/21418/

Those inter-MDT request timeout messages seem to suggest MDT000a is stuck or dead for some reason. Do you have the console logs or a stack trace from MDT000a when this happened?

The damaged directories seem to be a bigger issue, but it is unlikely to be caused by the lgh_write_offset patch; could you please create a new ticket? And please collect a -1 debug log when you run find on these directories; I want to know if this is caused by damaged stripes or dangling entries. Thanks. |
| Comment by Olaf Faaland [ 14/Dec/16 ] |
|
Attached console log from jet11 which was running MDT000a, called console.jet11.2016-12-13-14-47. This is from the lustre startup on 2016-12-13 at 14:47. |
| Comment by Di Wang [ 14/Dec/16 ] |
|
It looks like recovery is stuck on MDT000a, and recovery was also aborted somehow because of this timeout. It seems 16 clients did not send replay requests to the MDS? Hmm, did you manually abort the recovery? Otherwise this should not happen. Do you have the Lustre debug log for this node?

2016-12-13 14:53:57 [ 550.240713] Lustre: lquake-MDT000a: Client lquake-MDT000b-mdtlov_UUID (at 172.19.1.122@o2ib100) reconnecting, waiting for 31 clients in recovery for 0:45
2016-12-13 14:53:57 [ 550.258291] Lustre: Skipped 46 previous similar messages
2016-12-13 14:53:59 [ 552.241660] LustreError: 33918:0:(ldlm_lib.c:2751:target_queue_recovery_request()) @@@ dropping resent queued req req@ffff883efd628850 x1553641484060400/t0(68719540833) o1000->lquake-MDT000f-mdtlov_UUID@172.19.1.126@o2ib100:320/0 lens 344/0 e 0 to 0 dl 1481669700 ref 2 fl Interpret:/6/ffffffff rc 0/-1
2016-12-13 14:54:00 [ 552.274899] LustreError: 33918:0:(ldlm_lib.c:2751:target_queue_recovery_request()) Skipped 44 previous similar messages
2016-12-13 14:54:44 [ 597.107557] Lustre: lquake-MDT000a: Recovery already passed deadline 0:01, It is most likely due to DNE recovery is failed or stuck, please wait a few more minutes or abort the recovery.
2016-12-13 14:54:45 [ 598.108926] Lustre: lquake-MDT000a: Recovery already passed deadline 0:02, It is most likely due to DNE recovery is failed or stuck, please wait a few more minutes or abort the recovery.
2016-12-13 14:54:45 [ 598.129849] Lustre: Skipped 2 previous similar messages
2016-12-13 14:54:46 [ 599.211411] Lustre: lquake-MDT000a: Recovery already passed deadline 0:03, It is most likely due to DNE recovery is failed or stuck, please wait a few more minutes or abort the recovery.
2016-12-13 14:54:46 [ 599.232239] Lustre: Skipped 4 previous similar messages
2016-12-13 14:54:48 [ 601.220136] Lustre: lquake-MDT000a: Recovery already passed deadline 0:05, It is most likely due to DNE recovery is failed or stuck, please wait a few more minutes or abort the recovery.

And then update replay is aborted, which might cause filesystem inconsistency:

2016-12-13 14:56:13 [ 685.746427] Lustre: 34667:0:(ldlm_lib.c:1589:abort_req_replay_queue()) @@@ aborted: req@ffff887f2c41d850 x1553641484047916/t0(68719540823) o1000->lquake-MDT0006-mdtlov_UUID@172.19.1.117@o2ib100:410/0 lens 344/0 e 17 to 0 dl 1481669790 ref 1 fl Complete:/4/ffffffff rc 0/-1
2016-12-13 14:56:13 [ 685.793338] Lustre: lquake-MDT000a: Denying connection for new client f25a5bc2-93d4-c9d5-aae8-bd315848b7ea(at 192.168.128.24@o2ib18), waiting for 31 known clients (3 recovered, 12 in progress, and 16 evicted) to recover in 21188503:49
2016-12-13 14:56:13 [ 685.818693] Lustre: Skipped 15 previous similar messages
2016-12-13 14:56:13 [ 686.254147] Lustre: 34667:0:(ldlm_lib.c:1589:abort_req_replay_queue()) @@@ aborted: req@ffff887f333f0050 x1553641484047916/t0(68719540823) o1000->lquake-MDT0006-mdtlov_UUID@172.19.1.117@o2ib100:454/0 lens 344/0 e 0 to 0 dl 1481669834 ref 1 fl Complete:/6/ffffffff rc 0/-1
2016-12-13 14:56:13 [ 686.284145] Lustre: 34667:0:(ldlm_lib.c:1589:abort_req_replay_queue()) Skipped 971 previous similar messages
2016-12-13 14:56:22 [ 695.215436] Lustre: 34667:0:(ldlm_lib.c:1589:abort_req_replay_queue()) @@@ aborted: req@ffff887f333f2050 x1553641484047916/t0(68719540823) o1000->lquake-MDT0006-mdtlov_UUID@172.19.1.117@o2ib100:455/0 lens 344/0 e 0 to 0 dl 1481669835 ref 1 fl Complete:/6/ffffffff rc 0/-1

Olaf: could you please tell me more about how you ran the test and failed the MDT? Thanks. |
| Comment by Olaf Faaland [ 14/Dec/16 ] |
|
Di, The first test, where the issue seems to have started, was to power off nodes jet[9-12], which host MDT000[8-b]. They were then powered back on, the pools imported, and Lustre mounted. There were about 18 mdtest jobs running at the time, one per client node. Each was running within a separate directory that was striped across all 16 MDTs, with the default stripe count set to 16 so that subdirectories created by mdtest were also striped across all 16 MDTs. I did not issue the abort_recovery; Lustre itself did that, but I don't know what triggered it. Sorry, I don't have the debug log.
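For context, a hedged sketch of how directories like the ones described above are typically created; the mountpoint and directory name are made up, and the exact options used on lquake are not recorded in this ticket:

# Assumed client mountpoint: /mnt/lquake
lfs mkdir -c 16 /mnt/lquake/mdtest.0            # directory striped across 16 MDTs
lfs setdirstripe -D -c 16 /mnt/lquake/mdtest.0  # default layout: new subdirectories also striped across 16 MDTs
lfs getdirstripe /mnt/lquake/mdtest.0           # verify the layout
|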
| Comment by Olaf Faaland [ 15/Dec/16 ] |
|
I notice that there are at least a few instances of this:

(llog.c:529:llog_process_thread()) invalid length 0 in llog record for index 0/80

Attaching console logs for all the server nodes; see console.since-dec13.tgz. I accidentally included logs for jet[21,22], which are not part of the Lustre file system. Disregard those. |
| Comment by Di Wang [ 15/Dec/16 ] |
|
Olaf, are there a lot of update logs under update_log_dir/ on MDT000a? If it is not too much, could you please upload those update logs? Thanks. I suspect all of this recovery hassle is caused by this update log failure. |
| Comment by Gerrit Updater [ 15/Dec/16 ] |
|
wangdi (di.wang@intel.com) uploaded a new patch: https://review.whamcloud.com/24364 |
| Comment by Di Wang [ 15/Dec/16 ] |
|
Olaf: I made a debug patch, https://review.whamcloud.com/24364, based on http://review.whamcloud.com/24008. Could you please try the test with this patch? Thank you! |
| Comment by Olaf Faaland [ 15/Dec/16 ] |
|
Hi Di, There are indeed lots of update logs on MDT000a. Raw, 218GB. Do you know the ID of the key log(s)? Or any way I can select a useful subset to give you? |
| Comment by Di Wang [ 15/Dec/16 ] |
|
Unfortunately, the console message only tells me which record is wrong. Hmm, is there a large number of files? Maybe you can run llog_reader on each of them and look for the message "to next chunk" (a scan loop along these lines is sketched after the diff below):

llog_reader xxxx/update_log_dir/XXXX | grep "to next chunk"

Maybe you can even tweak llog_reader a bit to print the llog index, since the corruption happened at index 80:

(llog.c:529:llog_process_thread()) invalid length 0 in llog record for index 0/80

diff --git a/lustre/utils/llog_reader.c
index f7dae4c..19805a2 100644
--- a/lustre/utils/llog_reader.c
+++ b/lustre/utils/llog_reader.c
@@ -293,8 +293,8 @@ int llog_pack_buffer(int fd, struct llog_log_hdr **llog,
cur_rec->lrh_len > (*llog)->llh_hdr.lrh_len) {
cur_rec->lrh_len = (*llog)->llh_hdr.lrh_len -
offset % (*llog)->llh_hdr.lrh_len;
- printf("off %lu skip %u to next chunk.\n", offset,
- cur_rec->lrh_len);
+ printf("idx %u off %lu skip %u to next chunk.\n",
+ cur_rec->lrh_index, offset, cur_rec->lrh_len);
i--;
} else if (ext2_test_bit(idx, LLOG_HDR_BITMAP(*llog))) {
printf("rec #%d type=%x len=%u offset %lu\n", idx,
|
| Comment by Olaf Faaland [ 19/Dec/16 ] |
|
Di, Disregard my earlier message about llogs, my mistake. I'll get back to you about what I find in the llogs. |
| Comment by Olaf Faaland [ 19/Dec/16 ] |
|
Di,

Bit 77 of 3 not set
Bit 78 of 3 not set
rec #79 type=106a0000 len=3048 offset 500040
rec #0 off 503088 orig_len 0 skip 21200 to next chunk.
Bit 90 of 3 not set
Bit 91 of 3 not set |
| Comment by Di Wang [ 20/Dec/16 ] |
|
Hi Olaf, thanks for uploading the file. It looks like some update log record was not written, which leaves a hole there:

master transno = 60130033439 batchid = 55834779575 flags = 0 ops = 172 params = 84 rec_len 8520
Bit 78 of 4 not set offset 491520
offset 500040 index 79 type 106a0000
master transno = 60130041273 batchid = 55834779576 flags = 0 ops = 88 params = 21 rec_len 3048
rec #79 type=106a0000 len=3048 offset 500040, total 1
offset 503088 index 0 type 0
master transno = 0 batchid = 0 flags = 0 ops = 0 params = 0 rec_len 0
off 503088 skip 8520 to next chunk. test_bit yes ---->>> skip 8520 bytes, then it is valid again.
offset 511608 index 81 type 106a0000
master transno = 60130043788 batchid = 55834779578 flags = 0 ops = 18 params = 3 rec_len 728
Bit 81 of 4 not set offset 511608
offset 512336 index 82 type 106a0000
master transno = 60130043789 batchid = 55834779579 flags = 0 ops = 18 params = 3 rec_len 728
Bit 82 of 4 not set offset 512336
offset 513064 index 83 type 106a0000
master transno = 60130043798 batchid = 55834779580 flags = 0 ops = 88 params = 21 rec_len 2968
Bit 83 of 4 not set offset 513064
offset 516032 index 84 type 106a0000

So it looks like an update record (rec_len = 8520) was written, but the following write was not cancelled, which leaves a hole in the update log and causes the issue. |
| Comment by Olaf Faaland [ 20/Dec/16 ] |
|
Di, I bounced servers or stopped and restarted Lustre several times today, in varying combinations. MDT000a still seems to take longer to get connected than the others; several times the recovery_status procfile showed MDTs waiting for MDT000a. However, within a few minutes it seems to complete recovery successfully. I have the debug logs from one of these attempts, and can upload them if that's helpful. Let me know.
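For anyone following the recovery_status reference above, recovery state is exposed per target through lctl; a minimal example of polling it on an MDS (the target name is from this cluster, and the exact output fields vary by version):

# Show recovery state for every MDT mounted on this MDS.
lctl get_param mdt.*.recovery_status
# Or watch a single target, e.g. lquake-MDT000a:
watch -n 10 "lctl get_param mdt.lquake-MDT000a.recovery_status"
|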
| Comment by Di Wang [ 20/Dec/16 ] |
|
Hmm, there are already corrupted update logs. Ah, did you reformat the FS? If not, then I would expect recovery to get stuck. Do you have the console log from MDT000a? Thanks. |
| Comment by Di Wang [ 20/Dec/16 ] |
|
Olaf: I just added another fix to this ticket, https://review.whamcloud.com/24364; hopefully this can resolve the update log corruption issue. Please try it. Thanks. |
| Comment by Olaf Faaland [ 21/Dec/16 ] |
This FS (lquake) was last reformatted November 21. I expected recovery to get stuck also, given the presence of the corrupted update log. I don't know why it did not. I've attached the console log, see console.zinc11.2016-12-19 |
| Comment by Di Wang [ 21/Dec/16 ] |
|
According to the console log, it looks like recovery was stuck but then succeeded, as you said. This is good but strange:

2016-12-19 18:09:35 [ 9014.403506] Lustre: lquake-MDT000a: Recovery already passed deadline 12:36. It is due to DNE recovery failed/stuck on the 1 MDT(s): 000a. Please wait until all MDTs recovered or abort the recovery by force.
2016-12-19 18:09:35 [ 9014.426106] Lustre: Skipped 60 previous similar messages
2016-12-19 18:10:40 [ 9079.948790] Lustre: lquake-MDT000a: Recovery already passed deadline 11:30. It is due to DNE recovery failed/stuck on the 1 MDT(s): 000a. Please wait until all MDTs recovered or abort the recovery by force.
2016-12-19 18:10:40 [ 9079.971290] Lustre: Skipped 63 previous similar messages
2016-12-19 18:11:05 [ 9104.433087] Lustre: lquake-MDT000a: Connection restored to 172.19.1.111@o2ib100 (at 172.19.1.111@o2ib100)
2016-12-19 18:11:05 [ 9104.444864] Lustre: Skipped 130 previous similar messages
2016-12-19 18:12:35 [ 9194.452689] Lustre: lquake-MDT000a: Recovery already passed deadline 9:35. If you do not want to wait more, please abort the recovery by force.
2016-12-19 18:12:36 [ 9195.708675] Lustre: lquake-MDT000a: Recovery already passed deadline 9:34. If you do not want to wait more, please abort the recovery by force.
2016-12-19 18:12:36 [ 9195.725009] Lustre: Skipped 7 previous similar messages
2016-12-19 18:12:37 [ 9196.061115] Lustre: lquake-MDT000a: Recovery over after 5:26, of 70 clients 70 recovered and 0 were evicted.

Please try 24364 + 24008; I hope this resolves the corrupt update log issue, and that all of these recovery troubles will go away. Thanks. |
| Comment by Di Wang [ 21/Dec/16 ] |
|
Hi Olaf, so the current corruption happened when MDT0006 retrieved the update log from MDT000a (from console.since-dec13):

2016-12-13 13:49:38 [336387.573280] Lustre: 86734:0:(llog.c:529:llog_process_thread()) invalid length 0 in llog record for index 0/80
2016-12-13 13:49:38 [336387.585565] LustreError: 86734:0:(lod_dev.c:419:lod_sub_recovery_thread()) lquake-MDT000a-osp-MDT0006 getting update log failed: rc = -22

Do you still have the console log from MDT0006 for that last run? I want to check whether this corrupt log was hit during the last recovery. Thanks. |
| Comment by Gerrit Updater [ 09/Jan/17 ] |
|
Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/24008/ |
| Comment by Gerrit Updater [ 24/Jan/17 ] |
|
Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/24364/ |
| Comment by Peter Jones [ 30/Jan/17 ] |
|
As per Di, the remaining patch was a debug-only patch. This issue is fixed for 2.10 and the patches will be backported to maintenance releases. |