[LU-8753] Recovery already passed deadline with DNE Created: 24/Oct/16  Updated: 19/Sep/18  Resolved: 30/Jan/17

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: Lustre 2.10.0

Type: Bug Priority: Critical
Reporter: Giuseppe Di Natale (Inactive) Assignee: Lai Siyao
Resolution: Fixed Votes: 0
Labels: llnl
Environment:

lustre-2.8.0_3.chaos-1.ch6.x86_64
16 MDTs


Attachments: File 0x48000a04b-0x1-0x0.tgz     File console.jet11.2016-12-13-14-47     File console.jet7.gz     File console.since-dec13.tgz     File console.zinc11.2016-12-19     File console_logs.nov28.tgz     File dk.jet1.1478223101.gz     File dk.jet1.1478565846.gz     File dk.recovery_stuck.jet7.1477593159.gz     File dk.recovery_stuck.jet7.1477593344.gz     File dk.zinc1.1480375634.gz     File dk.zinc13.1480375634.gz     File dk.zinc7.1480375634.gz     File logs.2016-11-14.tgz     File lsh-mdt000c-1b70.nov28.tgz     File lustre.log.gz     File mdt09.0x240019a58_0x6_0x0.tgz     File mdt0b.0x240019a58_0x6_0x0.tgz     Text File target_to_node_map.nov28.txt    
Issue Links:
Blocker
Related
is related to LU-7675 replay-single test_101 times out afte... Resolved
is related to LU-7426 DNE3: Current llog format for remote ... Open
is related to LU-8916 ods-zfs doesn't manage ZAP sizes corr... Resolved
is related to LU-8787 zpool containing MDT0000 out of space Closed
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

MDT[0-1,6-16] (decimal) have timed out of recovery; approximately 1473 clients recovered, 1 evicted.
MDT[2-5] reached the timeout and report in the log that recovery has hung and should be aborted. After lctl abort_recovery, the nodes begin emitting large numbers of errors in the console log. The nodes are up, but mrsh into them hangs, as if they are too busy to service the mrsh session.

2016-10-15 15:49:40 [ 1088.878945] Lustre: lsh-MDT0002: Recovery already passed deadline 0:32, It is most likely due to DNE recovery is failed or stuck, please wait a few more minutes or abort the recovery.
2016-10-15 15:49:40 [ 1088.899333] Lustre: Skipped 157 previous similar messages
2016-10-15 15:50:12 [ 1121.013380] Lustre: lsh-MDT0002: Recovery already passed deadline 1:04, It is most likely due to DNE recovery is failed or stuck, please wait a few more minutes or abort the recovery.
2016-10-15 15:50:12 [ 1121.033744] Lustre: Skipped 735 previous similar messages

<ConMan> Console [zinc3] departed by <root@localhost> on pts/0 at 10-15 15:50.
2016-10-15 15:50:52 [ 1161.329645] LustreError: 38991:0:(mdt_handler.c:5737:mdt_iocontrol()) lsh-MDT0002: Aborting recovery for device
2016-10-15 15:50:52 [ 1161.341983] LustreError: 38991:0:(ldlm_lib.c:2565:target_stop_recovery_thread()) lsh-MDT0002: Aborting recovery
2016-10-15 15:50:52 [ 1161.343686] LustreError: 18435:0:(lod_dev.c:419:lod_sub_recovery_thread()) lsh-MDT0004-osp-MDT0002 getting update log failed: rc = -108
2016-10-15 15:50:52 [ 1161.377751] Lustre: 18461:0:(ldlm_lib.c:2014:target_recovery_overseer()) recovery is aborted, evict exports in recovery

The earliest such messages are:

2016-10-15 15:50:52 [ 1161.390842] LustreError: 18461:0:(update_records.c:72:update_records_dump()) master transno = 4295056926 batchid = 35538 flags = 0 ops = 42 params = 32
2016-10-15 15:50:52 [ 1161.408040] LustreError: 18461:0:(update_records.c:72:update_records_dump()) master transno = 4295056931 batchid = 35542 flags = 0 ops = 42 params = 32

The last few are:

2016-10-15 15:52:11 [ 1240.343780] LustreError: 18461:0:(update_records.c:72:update_records_dump()) master transno = 4295064355 batchid = 39987 flags = 0 ops = 42 params = 32
2016-10-15 15:52:11 [ 1240.361375] LustreError: 18461:0:(update_records.c:72:update_records_dump()) master transno = 4295064356 batchid = 39999 flags = 0 ops = 42 params = 32
2016-10-15 15:52:11 [ 1240.378995] LustreError: 18461:0:(update_records.c:72:update_records_dump()) master transno = 4295064357 batchid = 40018 flags = 0 ops = 42 params = 32
2016-10-15 15:52:11 [ 1240.396579] LustreError: 18461:0:(update_records.c:72:update_records_dump()) master transno = 4295064358 batchid = 40011 flags = 0 ops = 42 params = 32
2016-10-15 15:52:11 [ 1240.414180] LustreError: 18461:0:(update_records.c:72:update_records_dump()) master transno = 4295064360 batchid = 40005 flags = 0 ops = 42 params = 32

We have seen this type of behavior on multiple DNE filesystems. Also, is there any way to determine if these errors have been corrected, abandoned, etc?



 Comments   
Comment by Giuseppe Di Natale (Inactive) [ 24/Oct/16 ]

Went ahead and linked LU-7675, it sounds like it may be related.

Comment by Peter Jones [ 24/Oct/16 ]

Lai

Could you please advise on this one?

Thanks

Peter

Comment by Olaf Faaland [ 26/Oct/16 ]

To give some sense of scale, zinc1 (which hosts the MGS and MDT0000) has encountered this after I aborted recovery. So far 80,000 "update_records.c:72:update_records_dump" rows have been dumped, and it's not done yet.

Comment by Olaf Faaland [ 26/Oct/16 ]

This is appearing every time, or almost every time, we abort recovery on an MDT.

Comment by Lai Siyao [ 26/Oct/16 ]

The log doesn't help much; it just tells us that DNE recovery can't finish, but not why. Could you upload all of the MDS syslogs? In the meantime I'll see how this can be improved.

Comment by Olaf Faaland [ 26/Oct/16 ]

Hello Lai,

Are there two separate issues here, or do they likely have the same root cause? That is, (a) recovery is hung, and (b) whatever happens after abort_recovery, with all the update_records_dump() output in the logs.

If you think they are two issues, then I should make a separate ticket for the hung recovery.

Comment by Di Wang [ 26/Oct/16 ]

Indeed, it looks like they are separate issues. Please create a new ticket for (b).

Recovery already passed deadline 1:04, It is most likely due to DNE recovery is failed or stuck,

This is usually caused by one of the recovery threads failing to retrieve the update recovery logs or to replay those update records. As Lai said, more console logs or a stack trace on lsh-MDT0002 will definitely help us figure out the issue here. Thanks.
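
For reference, a rough sketch of the kind of data collection being asked for here (output paths are placeholders; the commands run on the affected MDS):

# show which targets are still in recovery and their status
lctl get_param mdt.*.recovery_status

# widen the debug mask, wait for the problem to show, then dump the debug buffer
lctl set_param debug=-1
lctl dk /tmp/dk.$(hostname).$(date +%s)

# dump stack traces of all tasks to the console/syslog
echo t > /proc/sysrq-trigger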

Comment by Olaf Faaland [ 26/Oct/16 ]

Di,

OK, I'll upload more logs. Can you change the summary line of this ticket to "Recovery already passed deadline with DNE" or similar? I don't have perms to modify that field.

thanks,
Olaf

Comment by Olaf Faaland [ 27/Oct/16 ]

I needed to reboot a different lustre 2.8 cluster w/ DNE today and the issue occurs there as well.

Providing logs from this instance instead of the earlier one, because I have both console and lctl dk output.

Comment by Olaf Faaland [ 27/Oct/16 ]

Lustre-related syslog for the entire jet Lustre cluster, including the time span when recovery became stuck on jet7.

Between the recovery-stuck issue and the LU-8763 issue it's messy, but you can see the recovery getting stuck after a cold boot of all the MDS nodes.

Comment by Olaf Faaland [ 27/Oct/16 ]

Please label this topllnl, thanks.

Comment by Peter Jones [ 27/Oct/16 ]

Olaf

I have done so but I am surprised that you could not do this yourself. If you confirm that this is the case then I will dig into your JIRA permissions

Peter

Comment by Olaf Faaland [ 27/Oct/16 ]

Hi Peter,

That would be great, thanks. I can add labels at the time I create a ticket, but I cannot change them after the fact. I confirmed that on this ticket.

Comment by Peter Jones [ 27/Oct/16 ]

Olaf

I made a tweak to your permissions. I'm not 100% certain it will help but give it another try

Peter

Comment by Olaf Faaland [ 27/Oct/16 ]

Peter,
That worked. Thank you.
-Olaf

Comment by Di Wang [ 27/Oct/16 ]

According to the debug log

00000100:02000000:10.0:1477593165.292609:0:143210:0:(import.c:1539:ptlrpc_import_recovery_state_machine()) lquake-MDT0006: Connection restored to lquake-MDT000b-mdtlov_UUID (at 172.19.1.122@o2ib100)
00000040:00020000:11.0:1477593165.661058:0:31346:0:(llog.c:591:llog_process_thread()) lquake-MDT000b-osp-MDT0006 retry remote llog process

It looks like MDT0006 keeps trying to retrieve the update log from MDT000b (172.19.1.122@o2ib100), but keeps getting EIO. Do you have the log on MDT000b? Thanks.

Comment by Olaf Faaland [ 27/Oct/16 ]

Di,
That service would have been running on either jet7 or jet8. Both those nodes have been rebooted since the instance you're looking at, so I can't get the lctl dk output anymore.
However, you should have the relevant console log output in lustre.log.gz.

Comment by Di Wang [ 28/Oct/16 ]

Ah, according to the log, MDT000b (jet12) recovery seems to have failed because of an invalid log record (hmm, I have not seen that in a long time).

Oct 27 10:24:35 jet12 kernel: [  356.925961] LustreError: 34856:0:(llog_osd.c:940:llog_osd_next_block()) lquake-MDT0009-osp-MDT000b: invalid llog tail at log id 0x6:1073846872/0 offset 131072 last_rec idx 0 tail idx 809132920
Oct 27 10:24:35 jet12 kernel: [  356.947238] LustreError: 34856:0:(lod_dev.c:419:lod_sub_recovery_thread()) lquake-MDT0009-osp-MDT000b getting update log failed: rc = -22
Oct 27 10:24:35 jet12 kernel: LustreError: 34856:0:(llog_osd.c:940:llog_osd_next_block()) lquake-MDT0009-osp-MDT000b: invalid llog tail at log id 0x6:1073846872/0 offset 131072 last_rec idx 0 tail idx 809132920
Oct 27 10:24:35 jet12 kernel: LustreError: 34856:0:(lod_dev.c:419:lod_sub_recovery_thread()) lquake-MDT0009-osp-MDT000b getting update log failed: rc = -22
Oct 27 10:24:36 jet12 kernel: LustreError: 34858:0:(lod_dev.c:419:lod_sub_recovery_thread()) lquake-MDT000c-osp-MDT000b getting update log failed: rc = -108
Oct 27 10:24:36 jet12 kernel: LustreError: 34858:0:(lod_dev.c:419:lod_sub_recovery_thread()) Skipped 1 previous similar message
Oct 27 10:24:36 jet12 kernel: [  358.012215] LustreError: 34858:0:(lod_dev.c:419:lod_sub_recovery_thread()) lquake-MDT000c-osp-MDT000b getting update log failed: rc = -108
Oct 27 10:24:36 jet12 kernel: [  358.028270] LustreError: 34858:0:(lod_dev.c:419:lod_sub_recovery_thread()) Skipped 1 previous similar message
Oct 27 10:24:37 jet12 kernel: [  359.735327] Lustre: 34889:0:(ldlm_lib.c:2014:target_recovery_overseer()) recovery is aborted, evict exports in recovery
Oct 27 10:24:37 jet12 kernel: [  359.748515] Lustre: 34889:0:(ldlm_lib.c:2014:target_recovery_overseer()) Skipped 2 previous similar messages
Oct 27 10:24:37 jet12 kernel: [  359.760557] LustreError: 34889:0:(update_records.c:72:update_records_dump()) master transno = 231932247340 batchid = 210453399132 flags = 0 ops = 154 params = 80

which caused recovery to get stuck on jet12, which in turn caused the other MDTs' recovery to get stuck.

Comment by Di Wang [ 28/Oct/16 ]

Olaf:
I think you need to delete these update logs on all MDTs of lquake to avoid further problems.

You can mount the MDT dataset via ZFS (ZPL), then delete update_log and update_log_dir:

rm -rf /mnt/zfs/update_log*

Note: this will not destroy your data; it only deletes the uncommitted recovery logs, and since these update logs are corrupted they are useless anyway.
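
Loosely, the sequence on one MDT might look like the following (the pool/dataset name and temporary mountpoint are hypothetical, the Lustre target must be stopped first, and the original dataset properties should be noted and restored afterwards):

# expose the MDT dataset through ZPL
zfs set canmount=on mdt0pool/mdt0
zfs set mountpoint=/mnt/zfs mdt0pool/mdt0
zfs mount mdt0pool/mdt0

# remove update_log and update_log_dir, as described above
rm -rf /mnt/zfs/update_log /mnt/zfs/update_log_dir

# put the dataset back the way it was
zfs umount mdt0pool/mdt0
zfs set canmount=off mdt0pool/mdt0
zfs set mountpoint=none mdt0pool/mdt0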

I am trying to figure out why these update logs are corrupted, but there is not much information to go on here.

Comment by Olaf Faaland [ 28/Oct/16 ]

Di,
Might it have to do with LU-8569?

Comment by Olaf Faaland [ 28/Oct/16 ]

I see now you are working on that ticket and already know about it.

Comment by Olaf Faaland [ 29/Oct/16 ]

I tried deleting the update_log and update_log_dir on all the MDTs under ZFS. I was able to delete update_log and the contents of update_log_dir, but not able to remove the directory update_log_dir itself.

Attempting to mount MDT0000 then fails with EEXIST, and the debug log shows the attempt to create the FID_SEQ_UPDATE_LOG object failing:

2016-10-28 17:30:14.292693 00000020:00001000:5.0::0:143276:0:(local_storage.c:379:__local_file_create()) create new object [0x200000009:0x0:0x0]
2016-10-28 17:30:14.292694 00000020:00000001:5.0::0:143276:0:(local_storage.c:260:local_object_create()) Process entered
2016-10-28 17:30:14.292694 00080000:00000001:5.0::0:143276:0:(osd_object.c:1530:osd_object_create()) Process entered
2016-10-28 17:30:14.292716 00080000:00000010:5.0::0:143276:0:(osd_object.c:1236:__osd_attr_init()) kmalloced 'bulk': 520 at ffff887ef276f400.
2016-10-28 17:30:14.292719 00080000:00000010:5.0::0:143276:0:(osd_object.c:1268:__osd_attr_init()) kfreed 'bulk': 520 at ffff887ef276f400.
2016-10-28 17:30:14.292720 00080000:00000001:5.0::0:143276:0:(osd_oi.c:241:fid_is_on_ost()) Process entered
2016-10-28 17:30:14.292720 00080000:00000001:5.0::0:143276:0:(osd_oi.c:261:fid_is_on_ost()) Process leaving (rc=0 : 0 : 0)
2016-10-28 17:30:14.292728 00080000:00000001:5.0::0:143276:0:(osd_object.c:1571:osd_object_create()) Process leaving via out (rc=18446744073709551599 : -17 : 0xffffffffffffffef)
2016-10-28 17:30:14.292729 00080000:00000001:5.0::0:143276:0:(osd_object.c:1604:osd_object_create()) Process leaving (rc=18446744073709551599 : -17 : ffffffffffffffef)
2016-10-28 17:30:14.292730 00000020:00000001:5.0::0:143276:0:(local_storage.c:264:local_object_create()) Process leaving (rc=18446744073709551599 : -17 : ffffffffffffffef)

Any suggestions?

Comment by Olaf Faaland [ 29/Oct/16 ]

Renaming the directory doesn't change the behavior.

Comment by Di Wang [ 29/Oct/16 ]

Hmm, it looks like on ZFS the log file is created in a different way. It seems to me that "rm update_log" only deletes the name; the real object is still there. Unfortunately, I am not a ZFS expert. Let me ask around. Alex, any ideas?

Comment by Di Wang [ 29/Oct/16 ]
Might it have to do with LU-8569?

It might indeed.

Comment by Alex Zhuravlev [ 29/Oct/16 ]

please check with the actual code:
rc = zap_add(osd->od_os, zapid, buf, 8, 1, zde, oh->ot_tx);
if (rc)
        GOTO(out, rc);

rm (direct mount and ZPL, right?) doesn't update OI.

Comment by Di Wang [ 31/Oct/16 ]

Sigh, I have to deal with this in lustre. Let me make a patch.

Comment by Gerrit Updater [ 31/Oct/16 ]

wangdi (di.wang@intel.com) uploaded a new patch: http://review.whamcloud.com/23490
Subject: LU-8753 lod: delete update_log/update_log_dir
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 2af38532d9e11d098f5c765ac8a0263a943c240c

Comment by Di Wang [ 31/Oct/16 ]

I just pushed the patch http://review.whamcloud.com/23490. I do not have a ZFS environment to try it on my side. Olaf, could you please try it and see if it helps with your mount failure?

Comment by Olaf Faaland [ 03/Nov/16 ]

Hi Di,

I get the following:

2016-11-02 20:47:31 [ 3932.246998] Lustre: MGS: Logs for fs lquake were removed by user request.  All servers must be restarted in order to regenerate the logs.
2016-11-02 20:47:31 [ 3932.449165] LustreError: 79681:0:(osd_object.c:437:osd_object_init()) lquake-MDT0000: lookup [0x200000009:0x0:0x0]/0xa4 failed: rc = -2
2016-11-02 20:47:31 [ 3932.464118] LustreError: 79681:0:(obd_mount_server.c:1798:server_fill_super()) Unable to start targets: -2
2016-11-02 20:47:31 [ 3932.475878] Lustre: Failing over lquake-MDT0000
2016-11-02 20:47:32 [ 3932.568791] Lustre: server umount lquake-MDT0000 complete
2016-11-02 20:47:32 [ 3932.575517] LustreError: 79681:0:(obd_mount.c:1426:lustre_fill_super()) Unable to mount  (-2)

Note that I initially started both the mgs and mdt0000 with -o writeconf.

Comment by Di Wang [ 03/Nov/16 ]

Hi, Olaf

Could you please retry and dump a -1 debug log? Please be sure to remove update_log* under ZFS before retrying, thanks.

Comment by Olaf Faaland [ 04/Nov/16 ]

Di,
Attached dk.jet1.1478223101.gz for the mount failure with your patch applied.

Comment by Di Wang [ 05/Nov/16 ]

Olaf,
Could you please also delete the OI entry for update_log under ZFS? It should be something like "/mnt/zfs/oi.9/0x200000009\:0x1\:0x0".
Anyway, please make sure both update_log and oi.9/0x200000009\:0x1\:0x0 are deleted. Also, if update_log_dir cannot be removed, let's restore it. If it still fails, please collect a -1 debug log for me. Thanks.
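
In other words, something like the following, assuming the dataset is again mounted through ZPL at /mnt/zfs as in the earlier comment (the FID string comes straight from the paths above):

# remove both the visible name and its object-index entry
rm -f '/mnt/zfs/update_log'
rm -f '/mnt/zfs/oi.9/0x200000009:0x1:0x0'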

Comment by Gerrit Updater [ 07/Nov/16 ]

wangdi (di.wang@intel.com) uploaded a new patch: http://review.whamcloud.com/23635
Subject: LU-8753 llog: add some debug information
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 615f878ceafb612ccbb868ad752ecd14458068dd

Comment by Di Wang [ 07/Nov/16 ]

Olaf

Do you still have that corrupted log file? It should be a regular file under update_log_dir, "[0x40019a58:0x6:0x0]". If not, could you please add this debug patch http://review.whamcloud.com/23635 and retry? If the error happens, could you please post the console messages and upload the corrupted log file? Thanks.

Comment by Olaf Faaland [ 08/Nov/16 ]

Di,

Could you please also delete the OI entry for update_log under ZFS? It should be something like "/mnt/zfs/oi.9/0x200000009\:0x1\:0x0".
Anyway, please make sure both update_log and oi.9/0x200000009\:0x1\:0x0 are deleted. Also, if update_log_dir cannot be removed, let's restore it. If it still fails, please collect a -1 debug log for me. Thanks.

Joe mounted the MDT0000 dataset via ZPL and found that there is no entry whose name begins with '0x200000009', not under oi.9/ nor anywhere else. update_log is deleted. update_log_dir could not be deleted (the removal fails with 'not empty' even though the directory appears to be empty). It appears to be damaged:

update_log_dir/[root@jet1:zfs]# ll -d update_*
drw-r--r-- 2 root root 18446744073709551140 Oct 28 13:40 update_log_dir

Joe attempted the import with debug and subsystem_debug set to -1. I've attached the log, dk.jet1.1478565846.gz

Comment by Olaf Faaland [ 08/Nov/16 ]

Di,

Do you still have that corrupted log file? It should be a regular file under update_log_dir, "[0x40019a58:0x6:0x0]".

MDT000b update_log_dir contains file named "[0x240019a58:0x6:0x0]". It is attached in mdt0b.0x240019a58_0x6_0x0.tgz.

Is that the correct one?

thanks,
Olaf

Comment by Di Wang [ 08/Nov/16 ]

Hmm, it looks like update_log "0x200000009\:0x1\:0x0" still exists in the system.

00080000:00000001:2.0:1478565782.638227:0:14104:0:(osd_oi.c:261:fid_is_on_ost()) Process leaving (rc=0 : 0 : 0)
00080000:00000001:2.0:1478565782.638236:0:14104:0:(osd_object.c:1571:osd_object_create()) Process leaving via out (rc=18446744073709551599 : -17 : 0xffffffffffffffef)
00080000:00000001:2.0:1478565782.638237:0:14104:0:(osd_object.c:1604:osd_object_create()) Process leaving (rc=18446744073709551599 : -17 : ffffffffffffffef)
00000020:00000001:2.0:1478565782.638238:0:14104:0:(local_storage.c:264:local_object_create()) Process leaving (rc=18446744073709551599 : -17 : ffffffffffffffef)
00000020:00000001:2.0:1478565782.638238:0:14104:0:(local_storage.c:382:__local_file_create()) Process leaving via unlock (rc=18446744073709551599 : -17 : 0xffffffffffffffef)

Could you please tell me which source lines these log messages map to? Thanks; they do not seem to match my b2_8 version.

Comment by Di Wang [ 08/Nov/16 ]

Olaf, yes, that is the one, thanks.

Comment by Di Wang [ 08/Nov/16 ]

Ah, Olaf

lquake-MDT0009-osp-MDT000b: invalid llog tail at log id 0x6:1073846872/0 offset 131072 last_rec idx 0 tail idx 809132920

The corrupted log is on MDT0009 instead of MDT000b; could you please check there? Thanks.

Comment by Giuseppe Di Natale (Inactive) [ 08/Nov/16 ]

Di,

Below are snippets of code to help you identify where the log messages you posted are from.

00080000:00000001:2.0:1478565782.638227:0:14104:0:(osd_oi.c:261:fid_is_on_ost()) Process leaving (rc=0 : 0 : 0)

In lustre/osd-zfs/osd_oi.c:

249 
250         rc = osd_fld_lookup(env, osd, fid_seq(fid), range);
251         if (rc != 0) {
252                 if (rc != -ENOENT)
253                         CERROR("%s: "DFID" lookup failed: rc = %d\n",
254                                osd_name(osd), PFID(fid), rc);
255                 RETURN(0);
256         }
257 
258         if (fld_range_is_ost(range))
259                 RETURN(1);
260 
261         RETURN(0);
262 }
00080000:00000001:2.0:1478565782.638236:0:14104:0:(osd_object.c:1571:osd_object_create()) Process leaving via out (rc=18446744073709551599 : -17 : 0xffffffffffffffef)

In lustre/osd-zfs/osd_object.c:

1563         zde->zde_pad = 0;
1564         zde->zde_dnode = db->db_object;
1565         zde->zde_type = IFTODT(attr->la_mode & S_IFMT);
1566 
1567         zapid = osd_get_name_n_idx(env, osd, fid, buf);
1568 
1569         rc = -zap_add(osd->od_os, zapid, buf, 8, 1, zde, oh->ot_tx);
1570         if (rc)
1571                 GOTO(out, rc);
1572 
1573         /* Add new object to inode accounting.
1574          * Errors are not considered as fatal */
1575         rc = -zap_increment_int(osd->od_os, osd->od_iusr_oid,
1576                                 (attr->la_valid & LA_UID) ? attr->la_uid : 0, 1,
1577                                 oh->ot_tx);
00080000:00000001:2.0:1478565782.638237:0:14104:0:(osd_object.c:1604:osd_object_create()) Process leaving (rc=18446744073709551599 : -17 : ffffffffffffffef)

In lustre/osd-zfs/osd_object.c:

1594         rc = osd_init_lma(env, obj, fid, oh);
1595         if (rc) {
1596                 CERROR("%s: can not set LMA on "DFID": rc = %d\n",
1597                        osd->od_svname, PFID(fid), rc);
1598                 /* ignore errors during LMA initialization */
1599                 rc = 0;
1600         }
1601 
1602 out:
1603         up_write(&obj->oo_guard);
1604         RETURN(rc);
1605 }
00000020:00000001:2.0:1478565782.638238:0:14104:0:(local_storage.c:264:local_object_create()) Process leaving (rc=18446744073709551599 : -17 : ffffffffffffffef)

In lustre/obdclass/local_storage.c:

 251 int local_object_create(const struct lu_env *env,
 252                         struct local_oid_storage *los,
 253                         struct dt_object *o, struct lu_attr *attr,
 254                         struct dt_object_format *dof, struct thandle *th)
 255 {
 256         struct dt_thread_info   *dti = dt_info(env);
 257         u64                      lastid;
 258         int                      rc;
 259 
 260         ENTRY;
 261 
 262         rc = dt_create(env, o, attr, NULL, dof, th);
 263         if (rc)
 264                 RETURN(rc);
 265 
 266         if (los == NULL)
 267                 RETURN(rc);
00000020:00000001:2.0:1478565782.638238:0:14104:0:(local_storage.c:382:__local_file_create()) Process leaving via unlock (rc=18446744073709551599 : -17 : 0xffffffffffffffef)

In lustre/obdclass/local_storage.c:

 378         CDEBUG(D_OTHER, "create new object "DFID"\n",
 379                PFID(lu_object_fid(&dto->do_lu)));
 380         rc = local_object_create(env, los, dto, attr, dof, th);
 381         if (rc)
 382                 GOTO(unlock, rc);
 383         LASSERT(dt_object_exists(dto));
 384 
 385         if (dti->dti_dof.dof_type == DFT_DIR) {

Comment by Giuseppe Di Natale (Inactive) [ 08/Nov/16 ]

I'm also hunting for the corrupted log on MDT0009.

Comment by Giuseppe Di Natale (Inactive) [ 08/Nov/16 ]

I went ahead and attached the log file from mdt0009. File name is mdt09.0x240019a58_0x6_0x0.tgz

Comment by Di Wang [ 08/Nov/16 ]

Thanks, Giuseppe. I will check.

Comment by Di Wang [ 11/Nov/16 ]

Hello Giuseppe, mdt09.0x240019a58_0x6_0x0.tgz and mdt0b.0x240019a58_0x6_0x0.tgz seem to be the same. Could you please check? Thanks.

Comment by Olaf Faaland [ 14/Nov/16 ]

Di,

Attaching logs.2016-11-14.tgz which includes:

mnt/lu-8753/oi.88/0x240019a58:0x6:0x0
mnt/lu-8753/update_log_dir/[0x240019a58:0x6:0x0]
mnt/lu-8753/oi.10/0x20000000a:0x9:0x0/[0x240019a58:0x6:0x0]
mnt/lu-8753/CONFIGS/
mnt/lu-8753/CONFIGS/lquake-MDT0009
mnt/lu-8753/CONFIGS/params
mnt/lu-8753/CONFIGS/lquake-client

The CONFIGS directory is there to document which target's dataset these files came from.

Note that the oi.88, oi.10, and update_log_dir entries all refer to the same underlying object/file.

thanks,
Olaf

Comment by Di Wang [ 15/Nov/16 ]

Thanks Olaf, though the file still seems to be the same. This multiple-link llog file does not look right to me, especially being linked both under oi.88 and oi.10. Hmm, in llog_osd_create() it should not do llog_osd_regular_fid_add_name_entry() for ZFS.

So either the OSD needs to be able to tell which FIDs are for llogs (i.e. remember these normal llog FIDs inside the OSD), or llog_osd needs to be able to tell whether it is on ldiskfs or ZFS. Alex, any suggestion here?

Hmm, it looks like we can pass the llog type to the OSD via dt_object_format, remember the llog type inside the LMA, and then move llog_osd_regular_fid_add_name_entry() into ldiskfs-osd. I will cook a patch.

Comment by Olaf Faaland [ 15/Nov/16 ]

Thanks Di.
I double checked and there's no object with 40019a58 in the name on MDT000b. I think the original file I submitted actually was from MDT0009, and I mislabeled it; the pools for MDT0009 and MDT000b reside on drives that are in the same enclosure and visible from the same set of nodes. Sorry for the confusion.

I am sure of the provenance of the one I submitted today because of the way I created the tar file.

Comment by Di Wang [ 15/Nov/16 ]

I checked the llog file you posted here. Unfortunately this is not the corrupted llog file. I am cooking a patch for this multiple link llog issue.

Comment by Olaf Faaland [ 15/Nov/16 ]

Di,
1) With regards to the problem of the update llog getting corrupted, we have no more places to look to determine the root cause, right?
2) If I'm able to get the filesystem back up again, and reproduce the problem, do you have advice on how to capture the data we need?

Comment by Olaf Faaland [ 15/Nov/16 ]

Di,
I created ticket LU-8833 for the patch to correct the number of links to the update llogs under zfs.

Comment by Di Wang [ 15/Nov/16 ]

Hello, Olaf,

Yes, if we cannot get the corrupted log file, it is hard to find the root cause here. If you can reproduce the issue, please apply the patch http://review.whamcloud.com/#/c/23635/ and also upload the corrupted log file, thanks.

Btw: were you able to delete those update_log files in the end, or will you reformat the system (sorry about that)? Thanks.

Comment by Olaf Faaland [ 17/Nov/16 ]

Hi Di, I'll apply your patch and attempt to reproduce. I wasn't able to delete those files yet; I'm going to spend a little longer looking into it. Thanks.

Comment by Di Wang [ 28/Nov/16 ]

Olaf: any good news?

Comment by Olaf Faaland [ 28/Nov/16 ]

Di,
No new data yet.

Comment by Olaf Faaland [ 28/Nov/16 ]

Di,
I encountered the invalid llog tail issue after lunch today. I'll attach logs momentarily.

Comment by Olaf Faaland [ 28/Nov/16 ]

Di,

Confirm this is the right llog and I'll upload it.

[faaland1@zinci syslog]$grep invalid *
console.zinc1:2016-11-28 15:09:41 [1123885.603795] LustreError: 96133:0:(llog_osd.c:954:llog_osd_next_block()) lsh-MDT000c-osp-MDT0000: invalid llog tail at log id 0x1:7024/0 offset 70320128 bytes 32768

[root@zinc13:update_log_dir]# ls ../CONFIGS/
lsh-MDT000c  lsh-client  param

[root@zinc13:update_log_dir]# pwd
/mnt/lu-8753/update_log_dir

[root@zinc13:update_log_dir]# ll *1b70*
-rw-r--r-- 1 root root 71419576 Dec 31  1969 [0x500001b70:0x1:0x0]
-rw-r--r-- 1 root root        0 Dec 31  1969 [0x500001b70:0x2:0x0]
[root@zinc13:update_log_dir]# llog_reader '[0x500001b70:0x1:0x0]' | less
[root@zinc13:update_log_dir]# llog_reader '[0x500001b70:0x1:0x0]' | grep -C 10 70320128
rec #8910 type=106a0000 len=8536 offset 70238896
Bit 8911 of 6687 not set
rec #8912 type=106a0000 len=8536 offset 70254592
rec #8913 type=106a0000 len=8536 offset 70263128
rec #8914 type=106a0000 len=8536 offset 70271664
Bit 8915 of 6687 not set
rec #8916 type=106a0000 len=8536 offset 70287360
rec #8917 type=106a0000 len=8536 offset 70295896
rec #8918 type=106a0000 len=8536 offset 70304432
off 70312968 skip 7160 to next chunk.
off 70320128 skip 32768 to next chunk.
rec #8924 type=106a0000 len=9056 offset 70352896
rec #8925 type=106a0000 len=9056 offset 70361952
rec #8926 type=106a0000 len=9048 offset 70371008
Bit 8927 of 6687 not set
rec #8928 type=106a0000 len=9056 offset 70385664
rec #8929 type=106a0000 len=9056 offset 70394720
rec #8930 type=106a0000 len=9056 offset 70403776
Bit 8931 of 6687 not set
rec #8932 type=106a0000 len=9056 offset 70418432
rec #8933 type=106a0000 len=9048 offset 70427488

Comment by Olaf Faaland [ 28/Nov/16 ]

Attached console logs and a text file listing which target is running on which node, for the most recent instance of the issue on Zinc.

target_to_node_map.nov28.txt
console_logs.nov28.tgz

Comment by Olaf Faaland [ 29/Nov/16 ]

Attached Lustre debug logs for the zinc target which is still in recovery (zinc7/lsh-MDT0006), and for the target holding the llog with the invalid llog tail (zinc13/lsh-MDT000c). Also attached what I believe is the relevant llog file from MDT000c.

dk.zinc7.1480375634.gz
dk.zinc13.1480375634.gz
lsh-mdt000c-1b70.nov28.tgz

Comment by Olaf Faaland [ 29/Nov/16 ]

Attached lustre debug log for the node reporting the invalid tail.

dk.zinc1.1480375634.gz

Comment by Olaf Faaland [ 29/Nov/16 ]

I will leave the file system as-is (down for users, due to the MDT stuck in recovery) until you have all the information you need.

Comment by Di Wang [ 29/Nov/16 ]

Thanks, Olaf. It looks correct, I will check right away.

Comment by Di Wang [ 29/Nov/16 ]

According to the dump of this update log, it looks like there is a hole from 70312968 to 70320128 (all zeroes), which causes the "corruption".

master transno = 17237427556 batchid = 12884938203 flags = 0 ops = 172 params = 84 rec_len 8536
rec #8916 type=106a0000 len=8536 offset 70287360, total 6584
offset 70295896 index 8917 type 106a0000
master transno = 17237427557 batchid = 12884938202 flags = 0 ops = 172 params = 84 rec_len 8536
rec #8917 type=106a0000 len=8536 offset 70295896, total 6585
offset 70304432 index 8918 type 106a0000
master transno = 17237427805 batchid = 12884938204 flags = 0 ops = 172 params = 84 rec_len 8536
rec #8918 type=106a0000 len=8536 offset 70304432, total 6586


offset 70312968 index 0 type 0
master transno = 0 batchid = 0 flags = 0 ops = 0 params = 0 rec_len 0
off 70312968 skip 7160 to next chunk. test_bit yes
offset 70320128 index 0 type 0
master transno = 0 batchid = 0 flags = 0 ops = 0 params = 0 rec_len 0
off 70320128 skip 32768 to next chunk. test_bit yes

offset 70352896 index 8924 type 106a0000
master transno = 17286373092 batchid = 12884981614 flags = 0 ops = 189 params = 86 rec_len 9056
rec #8924 type=106a0000 len=9056 offset 70352896, total 6587
offset 70361952 index 8925 type 106a0000

It looks like lgh_write_offset is not being reset during recovery, but this needs further investigation. Thanks.

Comment by Gerrit Updater [ 29/Nov/16 ]

wangdi (di.wang@intel.com) uploaded a new patch: http://review.whamcloud.com/24008
Subject: LU-8753 llog: remove lgh_write_offset
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 9f73a3f56778dd72feb802fc80d58c55dd3d4088

Comment by Di Wang [ 29/Nov/16 ]

Olaf: please enable D_HA and also collect the corrupt llog file, in case this patch does not help. Thanks.

Comment by Olaf Faaland [ 29/Nov/16 ]

Hi Di,

Disregard my message about not having lgh_write_offset. Not sure what I was looking at, it's there.

thanks,
Olaf

Comment by Di Wang [ 29/Nov/16 ]

Btw: you probably need to delete the update logs or reformat the system, since the update log is already corrupted, which will cause recovery to get stuck anyway. Sorry about that.

Comment by Di Wang [ 04/Dec/16 ]

Hi, Olaf, how does the test go?

Comment by Olaf Faaland [ 05/Dec/16 ]

Hi Di,
Nothing to report yet. We reformatted the file system but I am still working on getting updated RPMs on there. Thanks for checking. Hopefully we'll get it going by tomorrow and reproduce.

Comment by Giuseppe Di Natale (Inactive) [ 07/Dec/16 ]

Di,

We couldn't remove update_log_dir through the ZPL because of an issue with ZAP size bookkeeping. I went ahead and made LU-8916 for that issue.

Comment by Olaf Faaland [ 07/Dec/16 ]

Updated Lustre on the test filesystem jet/lquake with the patch to remove lgh_write_offset. Will attempt to reproduce today.

Comment by Olaf Faaland [ 14/Dec/16 ]

Di,

With the lgh_write_offset patch recovery seems to complete more reliably. I'm now seeing new symptoms, and I don't know if they are a consequence of the patch or a separate problem that was hidden by the recovery issue. I'll summarize what I am seeing here; please tell me if you think it's a separate issue that should get a separate ticket.

I'd created several directories striped across all 16 MDTs. Within each such directory, a node was running mdtest and creating directories and files and deleting them. I'm not sure which phase it was in at the time I powered off 4 of the 16 MDS nodes (MDT0008 through MDT000b). I then powered them on again. During recovery, the server logs show:

2016-12-13 14:48:46 [  241.954091] Lustre: lquake-MDT0009: Client 987b4351-b459-8f25-89f9-59fa31045f6b (at 192.168.128.32@o2ib18) reconnecting, waiting for 31 clients in recovery for 3:58
2016-12-13 14:48:46 [  241.972729] Lustre: Skipped 7 previous similar messages
2016-12-13 14:48:46 [  241.979849] LustreError: 49706:0:(ldlm_lib.c:2751:target_queue_recovery_request()) @@@ dropping resent queued req  req@ffff883ee4685400 x1553274421557352/t0(90194378184) o36->987b4351-b459-8f25-89f9-59fa31045f6b@192.168.128.32@o2ib18:7/0 lens 624/0 e 0 to 0 dl 1481669387 ref 2 fl Interpret:/6/ffffffff rc 0/-1
2016-12-13 14:48:46 [  242.013679] LustreError: 49706:0:(ldlm_lib.c:2751:target_queue_recovery_request()) Skipped 7 previous similar messages
2016-12-13 14:48:50 [  245.274867] LustreError: 37287:0:(ldlm_lib.c:1903:check_for_next_transno()) lquake-MDT0009: waking for gap in transno, VBR is OFF (skip: 90194377997, ql: 26, comp: 5, conn: 31, next: 90194377998, next_update 90194378030 last_committed: 90194377997)

<<< transno messages redacted >>>

2016-12-13 14:48:55 [  246.868005] Lustre: lquake-MDT0009: Recovery over after 1:39, of 31 clients 31 recovered and 0 were evicted.
2016-12-13 14:48:55 [  251.145347] sched: RT throttling activated

Then there seems to be some inter-MDT issue:

2016-12-13 14:53:16 [  512.145169] Lustre: 36488:0:(client.c:2063:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1481669506/real 1481669506]  req@ffff883ef5309b00 x1553642790382996/t0(0) o400->lquake-MDT000a-osp-MDT0009@172.19.1.121@o2ib100:24/4 lens 224/224 e 2 to 1 dl 1481669596 ref 1 fl Rpc:X/c0/ffffffff rc 0/-1
2016-12-13 14:54:46 [  602.183013] Lustre: 36488:0:(client.c:2063:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1481669596/real 1481669596]  req@ffff883ee54c6f00 x1553642790383484/t0(0) o400->lquake-MDT000a-osp-MDT0009@172.19.1.121@o2ib100:24/4 lens 224/224 e 2 to 1 dl 1481669686 ref 1 fl Rpc:X/c0/ffffffff rc 0/-1
2016-12-13 14:56:16 [  692.220902] Lustre: 36488:0:(client.c:2063:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1481669686/real 1481669686]  req@ffff883edf31f800 x1553642790384036/t0(0) o400->lquake-MDT000a-osp-MDT0009@172.19.1.121@o2ib100:24/4 lens 224/224 e 2 to 1 dl 1481669776 ref 1 fl Rpc:X/c0/ffffffff rc 0/-1
2016-12-13 14:56:23 [  698.815522] Lustre: lquake-MDT000a-osp-MDT0009: Connection restored to 172.19.1.121@o2ib100 (at 172.19.1.121@o2ib100)
2016-12-13 14:56:23 [  698.828569] Lustre: Skipped 23 previous similar messages

And clients seem to get evicted as a result of that later inter-MDT issue:

2016-12-13 14:53:16 [352250.742120]   req@ffff880fe4f8dd00 x1553274427352992/t68719541671(68719541671) o36->lquake-MDT000a-mdc-ffff88103d075000@172.19.1.121@o2ib100:12/10 lens 624/416 e 13 to 1 dl 1481669596 ref 2 fl Interpret:EX/6/ffffffff rc -110/-1
2016-12-13 14:54:46 [352340.778326] LustreError: 8274:0:(client.c:2874:ptlrpc_replay_interpret()) @@@ request replay timed out.
2016-12-13 14:54:46 [352340.778326]   req@ffff880fe4f8dd00 x1553274427352992/t68719541671(68719541671) o36->lquake-MDT000a-mdc-ffff88103d075000@172.19.1.121@o2ib100:12/10 lens 624/416 e 16 to 1 dl 1481669686 ref 2 fl Interpret:EX/6/ffffffff rc -110/-1
2016-12-13 14:56:13 [352428.151175] LustreError: 11-0: lquake-MDT000a-mdc-ffff88103d075000: operation mds_reint to node 172.19.1.121@o2ib100 failed: rc = -107
2016-12-13 14:56:42 [352456.923731] LustreError: 167-0: lquake-MDT000a-mdc-ffff88103d075000: This client was evicted by lquake-MDT000a; in progress operations using this service will fail.

After the 14:56 message, the LustreError messages stop, and listings, rmdirs, and unlinks seem to work, but ...

There are damaged directories. Long listings of certain directories produce this:

lstat("dir.mdtest.1.173", 0x61d930)     = -1 ENOENT (No such file or directory)
write(2, "ls: ", 4ls: )                     = 4
write(2, "cannot access dir.mdtest.1.173", 30cannot access dir.mdtest.1.173) = 30
write(2, ": No such file or directory", 27: No such file or directory) = 27
write(2, "\n", 1
)                       = 1
lstat("dir.mdtest.4.170", 0x61d9f0)     = -1 ENOENT (No such file or directory)
write(2, "ls: ", 4ls: )                     = 4
write(2, "cannot access dir.mdtest.4.170", 30cannot access dir.mdtest.4.170) = 30
write(2, ": No such file or directory", 27: No such file or directory) = 27
write(2, "\n", 1
)                       = 1
lstat("dir.mdtest.1.172", 0x61dab0)     = -1 ENOENT (No such file or directory)
write(2, "ls: ", 4ls: )                     = 4
write(2, "cannot access dir.mdtest.1.172", 30cannot access dir.mdtest.1.172) = 30
write(2, ": No such file or directory", 27: No such file or directory) = 27
write(2, "\n", 1
)                       = 1

and client console log messages like this:

2016-12-13 15:42:12 [355186.720441] LustreError: 159150:0:(llite_lib.c:2309:ll_prep_inode()) new_inode -fatal: rc -2
2016-12-13 15:44:33 [355328.281847] LustreError: 159156:0:(llite_lib.c:2309:ll_prep_inode()) new_inode -fatal: rc -2
2016-12-13 15:44:33 [355328.292937] LustreError: 159156:0:(llite_lib.c:2309:ll_prep_inode()) Skipped 23 previous similar messages

Comment by Di Wang [ 14/Dec/16 ]

Thanks for update, Olaf.

2016-12-13 14:48:46 [  241.954091] Lustre: lquake-MDT0009: Client 987b4351-b459-8f25-89f9-59fa31045f6b (at 192.168.128.32@o2ib18) reconnecting, waiting for 31 clients in recovery for 3:58
2016-12-13 14:48:46 [  241.972729] Lustre: Skipped 7 previous similar messages
2016-12-13 14:48:46 [  241.979849] LustreError: 49706:0:(ldlm_lib.c:2751:target_queue_recovery_request()) @@@ dropping resent queued req  req@ffff883ee4685400 x1553274421557352/t0(90194378184) o36->987b4351-b459-8f25-89f9-59fa31045f6b@192.168.128.32@o2ib18:7/0 lens 624/0 e 0 to 0 dl 1481669387 ref 2 fl Interpret:/6/ffffffff rc 0/-1
2016-12-13 14:48:46 [  242.013679] LustreError: 49706:0:(ldlm_lib.c:2751:target_queue_recovery_request()) Skipped 7 previous similar messages
2016-12-13 14:48:50 [  245.274867] LustreError: 37287:0:(ldlm_lib.c:1903:check_for_next_transno()) lquake-MDT0009: waking for gap in transno, VBR is OFF (skip: 90194377997, ql: 26, comp: 5, conn: 31, next: 90194377998, next_update 90194378030 last_committed: 90194377997)

This is OK, especially since the recovery seems good according to the following message. Actually, this "...waking for gap..." message was removed from the console log in 2.9; see https://review.whamcloud.com/#/c/21418/

Those inter-MDT request timeout messages seem to suggest that MDT000a is stuck or dead for some reason. Do you have the console logs or a stack trace from MDT000a when this happened?

The damaged directories seem like a bigger issue, but they are unlikely to be caused by the lgh_write_offset patch; could you please create a new ticket? Also, please collect a -1 debug log while running find on these directories; I want to know whether this is caused by damaged stripes or dangling entries. Thanks.
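
A rough sketch of that collection on a client (the directory path is a placeholder; the dir.mdtest name comes from the strace output above):

# widen the debug mask and clear the buffer on the client
lctl set_param debug=-1
lctl clear

# reproduce the ENOENT on one of the damaged entries
ls -l /mnt/lquake/testdir/dir.mdtest.1.173

# dump the debug buffer for attachment to the ticket
lctl dk /tmp/dk.client.$(date +%s)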

Comment by Olaf Faaland [ 14/Dec/16 ]

Attached console log from jet11 which was running MDT000a, called console.jet11.2016-12-13-14-47. This is from the lustre startup on 2016-12-13 at 14:47.

Comment by Di Wang [ 14/Dec/16 ]

It looks like recovery got stuck on MDT000a, and recovery was also aborted somehow because of this timeout. It seems 16 clients did not send their replay requests to the MDS? Hmm, did you manually abort the recovery? Otherwise this should not happen. Do you have the Lustre debug log for this node?

2016-12-13 14:53:57 [  550.240713] Lustre: lquake-MDT000a: Client lquake-MDT000b-mdtlov_UUID (at 172.19.1.122@o2ib100) reconnecting, waiting for 31 clients in recovery for 0:45
2016-12-13 14:53:57 [  550.258291] Lustre: Skipped 46 previous similar messages
2016-12-13 14:53:59 [  552.241660] LustreError: 33918:0:(ldlm_lib.c:2751:target_queue_recovery_request()) @@@ dropping resent queued req  req@ffff883efd628850 x1553641484060400/t0(68719540833) o1000->lquake-MDT000f-mdtlov_UUID@172.19.1.126@o2ib100:320/0 lens 344/0 e 0 to 0 dl 1481669700 ref 2 fl Interpret:/6/ffffffff rc 0/-1
2016-12-13 14:54:00 [  552.274899] LustreError: 33918:0:(ldlm_lib.c:2751:target_queue_recovery_request()) Skipped 44 previous similar messages
2016-12-13 14:54:44 [  597.107557] Lustre: lquake-MDT000a: Recovery already passed deadline 0:01, It is most likely due to DNE recovery is failed or stuck, please wait a few more minutes or abort the recovery.
2016-12-13 14:54:45 [  598.108926] Lustre: lquake-MDT000a: Recovery already passed deadline 0:02, It is most likely due to DNE recovery is failed or stuck, please wait a few more minutes or abort the recovery.
2016-12-13 14:54:45 [  598.129849] Lustre: Skipped 2 previous similar messages
2016-12-13 14:54:46 [  599.211411] Lustre: lquake-MDT000a: Recovery already passed deadline 0:03, It is most likely due to DNE recovery is failed or stuck, please wait a few more minutes or abort the recovery.
2016-12-13 14:54:46 [  599.232239] Lustre: Skipped 4 previous similar messages
2016-12-13 14:54:48 [  601.220136] Lustre: lquake-MDT000a: Recovery already passed deadline 0:05, It is most likely due to DNE recovery is failed or stuck, please wait a few more minutes or abort the recovery.

And then update replay is aborted, which might cause filesystem inconsistency. This might explain those damaged directory entries.

2016-12-13 14:56:13 [  685.746427] Lustre: 34667:0:(ldlm_lib.c:1589:abort_req_replay_queue()) @@@ aborted:  req@ffff887f2c41d850 x1553641484047916/t0(68719540823) o1000->lquake-MDT0006-mdtlov_UUID@172.19.1.117@o2ib100:410/0 lens 344/0 e 17 to 0 dl 1481669790 ref 1 fl Complete:/4/ffffffff rc 0/-1
2016-12-13 14:56:13 [  685.793338] Lustre: lquake-MDT000a: Denying connection for new client f25a5bc2-93d4-c9d5-aae8-bd315848b7ea(at 192.168.128.24@o2ib18), waiting for 31 known clients (3 recovered, 12 in progress, and 16 evicted) to recover in 21188503:49
2016-12-13 14:56:13 [  685.818693] Lustre: Skipped 15 previous similar messages
2016-12-13 14:56:13 [  686.254147] Lustre: 34667:0:(ldlm_lib.c:1589:abort_req_replay_queue()) @@@ aborted:  req@ffff887f333f0050 x1553641484047916/t0(68719540823) o1000->lquake-MDT0006-mdtlov_UUID@172.19.1.117@o2ib100:454/0 lens 344/0 e 0 to 0 dl 1481669834 ref 1 fl Complete:/6/ffffffff rc 0/-1
2016-12-13 14:56:13 [  686.284145] Lustre: 34667:0:(ldlm_lib.c:1589:abort_req_replay_queue()) Skipped 971 previous similar messages
2016-12-13 14:56:22 [  695.215436] Lustre: 34667:0:(ldlm_lib.c:1589:abort_req_replay_queue()) @@@ aborted:  req@ffff887f333f2050 x1553641484047916/t0(68719540823) o1000->lquake-MDT0006-mdtlov_UUID@172.19.1.117@o2ib100:455/0 lens 344/0 e 0 to 0 dl 1481669835 ref 1 fl Complete:/6/ffffffff rc 0/-1

Olaf: could you please tell me more about how you run the test and fail the MDTs? Thanks.

Comment by Olaf Faaland [ 14/Dec/16 ]

Di,
The mds nodes were all powered off while clients were connected and the file system was active. This file system has 16 MDTs hosted on 16 MDS nodes, and 4 OSTs hosted on 4 OSS nodes.

The first test, where the issue seems to have started, was to power off nodes jet[9-12], which host MDT000[8-b]. They were then powered back on again, pools imported, and Lustre mounted. There were about 18 mdtest jobs running at the time, one per client node. Each was running within a separate directory that was striped across all 16 MDTs, with the default stripe count set to 16 so that subdirectories created by mdtest were also striped across all 16 MDTs.

I did not issue the abort_recovery; Lustre itself did that, but I don't know what triggered it. Sorry, I don't have the debug log.

Comment by Olaf Faaland [ 15/Dec/16 ]

I notice that there are at least a few instances of this:

(llog.c:529:llog_process_thread()) invalid length 0 in llog record for index 0/80

Attaching console logs for all the server nodes, see console.since-dec13.tgz

I accidentally included logs for jet[21,22] which are not part of the lustre file system. Disregard those.

Comment by Di Wang [ 15/Dec/16 ]

Olaf, are there a lot of update logs under update_log_dir/ on MDT000a? Could you please upload those update logs, if there are not too many? Thanks. I suspect all of this recovery hassle is caused by this update log failure.

Comment by Gerrit Updater [ 15/Dec/16 ]

wangdi (di.wang@intel.com) uploaded a new patch: https://review.whamcloud.com/24364
Subject: LU-8753 llog: add some debug patch
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 91b2fd79275ef0c8b0cc6d72f66f7c53593bebcf

Comment by Di Wang [ 15/Dec/16 ]

Olaf: I made a debug patch https://review.whamcloud.com/24364 based on http://review.whamcloud.com/24008, could you please try the test with this patch? thank you!

Comment by Olaf Faaland [ 15/Dec/16 ]

Hi Di,
I'm working on building updated lustre rpms for our testing, with these patches and those from other tickets as well. It will probably take me until Monday to get the rpms built and get control of the cluster again.

There are indeed lots of update logs on mdt000a: 218 GB raw. Do you know the ID of the key log(s)? Or any way I can select a useful subset to give you?

Comment by Di Wang [ 15/Dec/16 ]

Unfortunately, the console message only tells me which record is wrong. Hmm, that is a lot of files. Maybe you can run llog_reader on each of them and look for the message "to next chunk":

llog_reader  xxxx/update_log_dir/XXXX | grep "to next chunk"
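
For example, a rough loop over all of the update logs (the ZPL mountpoint path is a placeholder) could be:

# flag any update llog whose dump contains a "skip ... to next chunk" hole
for f in /mnt/zfs/update_log_dir/*; do
    if llog_reader "$f" | grep -q "to next chunk"; then
        echo "possible hole in: $f"
    fi
done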

Maybe you can even tweak llog_reader a bit to print the llog index to which corruption happened in index 80.

(llog.c:529:llog_process_thread()) invalid length 0 in llog record for index 0/80
diff --git a/lustre/utils/llog_reader.c b/lustre/utils/llog_reader.c
index f7dae4c..19805a2 100644
--- a/lustre/utils/llog_reader.c
+++ b/lustre/utils/llog_reader.c
@@ -293,8 +293,8 @@ int llog_pack_buffer(int fd, struct llog_log_hdr **llog,
                    cur_rec->lrh_len > (*llog)->llh_hdr.lrh_len) {
                        cur_rec->lrh_len = (*llog)->llh_hdr.lrh_len -
                                offset % (*llog)->llh_hdr.lrh_len;
-                       printf("off %lu skip %u to next chunk.\n", offset,
-                              cur_rec->lrh_len);
+                       printf("idx %u off %lu skip %u to next chunk.\n",
+                              cur_rec->lrh_index, offset, cur_rec->lrh_len);
                        i--;
                } else if (ext2_test_bit(idx, LLOG_HDR_BITMAP(*llog))) {
                        printf("rec #%d type=%x len=%u offset %lu\n", idx,

Comment by Olaf Faaland [ 19/Dec/16 ]

Di,
Friday the new lustre build was installed in the image and the servers rebooted. I'll bounce them again today to see if they recover normally.

Disregard my earlier message about llogs, my mistake. I'll get back to you about what I find in the llogs.

Comment by Olaf Faaland [ 19/Dec/16 ]

Di,
Only one log contained an invalid length (0). I've attached it as 0x48000a04b-0x1-0x0.tgz

Bit 77 of 3 not set
Bit 78 of 3 not set
rec #79 type=106a0000 len=3048 offset 500040
rec #0 off 503088 orig_len 0 skip 21200 to next chunk.
Bit 90 of 3 not set
Bit 91 of 3 not set

Comment by Di Wang [ 20/Dec/16 ]

Hi, Olaf

Thanks for uploading the file. It looks like some update log records were not written, so there is a hole there:

master transno = 60130033439 batchid = 55834779575 flags = 0 ops = 172 params = 84 rec_len 8520
Bit 78 of 4 not set offset 491520
offset 500040 index 79 type 106a0000
master transno = 60130041273 batchid = 55834779576 flags = 0 ops = 88 params = 21 rec_len 3048
rec #79 type=106a0000 len=3048 offset 500040, total 1
offset 503088 index 0 type 0
master transno = 0 batchid = 0 flags = 0 ops = 0 params = 0 rec_len 0
off 503088 skip 8520 to next chunk. test_bit yes      ---->>> skip 8520 bytes, then it is valid again.
offset 511608 index 81 type 106a0000
master transno = 60130043788 batchid = 55834779578 flags = 0 ops = 18 params = 3 rec_len 728
Bit 81 of 4 not set offset 511608
offset 512336 index 82 type 106a0000
master transno = 60130043789 batchid = 55834779579 flags = 0 ops = 18 params = 3 rec_len 728
Bit 82 of 4 not set offset 512336
offset 513064 index 83 type 106a0000
master transno = 60130043798 batchid = 55834779580 flags = 0 ops = 88 params = 21 rec_len 2968
Bit 83 of 4 not set offset 513064
offset 516032 index 84 type 106a0000

So it looks like an update record (rec_len = 8520) was being written, then the following write was not cancelled, so it left a hole in the update log, which causes the issue.

Comment by Olaf Faaland [ 20/Dec/16 ]

Di,

I bounced servers or stopped and restarted lustre several times today, in varying combinations. MDT000a still seems to take longer to get connected than the others; several times the recovery_status procfile showed MDTs were waiting for MDT000a. However within a few minutes it seems to successfully complete recovery.

I have the debug logs from one of these attempts, and can upload if it's helpful. Let me know.

Comment by Di Wang [ 20/Dec/16 ]

Hmm, there is already a corrupted update log. Ah, did you reformat the FS? If not, I would expect recovery to get stuck. Do you have the console log from MDT000a? Thanks.

Comment by Di Wang [ 20/Dec/16 ]

Olaf: I just added another fix to this ticket, https://review.whamcloud.com/24364; hopefully this can resolve the update log corruption issue. Please try it. Thanks.

Comment by Olaf Faaland [ 21/Dec/16 ]

Hmm, there is already a corrupted update log. Ah, did you reformat the FS? If not, I would expect recovery to get stuck. Do you have the console log from MDT000a? Thanks.

This FS (lquake) was last reformatted November 21. I expected recovery to get stuck also, given the presence of the corrupted update log. I don't know why it did not. I've attached the console log, see console.zinc11.2016-12-19

Comment by Di Wang [ 21/Dec/16 ]

According to the console log, it looks like recovery was stuck but then succeeded, as you said. This is good but strange:

2016-12-19 18:09:35 [ 9014.403506] Lustre: lquake-MDT000a: Recovery already passed deadline 12:36. It is due to DNE recovery failed/stuck on the 1 MDT(s): 000a. Please wait until all MDTs recovered or abort the recovery by force.
2016-12-19 18:09:35 [ 9014.426106] Lustre: Skipped 60 previous similar messages
2016-12-19 18:10:40 [ 9079.948790] Lustre: lquake-MDT000a: Recovery already passed deadline 11:30. It is due to DNE recovery failed/stuck on the 1 MDT(s): 000a. Please wait until all MDTs recovered or abort the recovery by force.
2016-12-19 18:10:40 [ 9079.971290] Lustre: Skipped 63 previous similar messages
2016-12-19 18:11:05 [ 9104.433087] Lustre: lquake-MDT000a: Connection restored to 172.19.1.111@o2ib100 (at 172.19.1.111@o2ib100)
2016-12-19 18:11:05 [ 9104.444864] Lustre: Skipped 130 previous similar messages
2016-12-19 18:12:35 [ 9194.452689] Lustre: lquake-MDT000a: Recovery already passed deadline 9:35. If you do not want to wait more, please abort the recovery by force.
2016-12-19 18:12:36 [ 9195.708675] Lustre: lquake-MDT000a: Recovery already passed deadline 9:34. If you do not want to wait more, please abort the recovery by force.
2016-12-19 18:12:36 [ 9195.725009] Lustre: Skipped 7 previous similar messages
2016-12-19 18:12:37 [ 9196.061115] Lustre: lquake-MDT000a: Recovery over after 5:26, of 70 clients 70 recovered and 0 were evicted.

Please try 24364 + 24008; I hope this resolves the corrupt update log issue, and that all of these recovery troubles will go away. Thanks.

Comment by Di Wang [ 21/Dec/16 ]

Hi, Olaf

So the current corruption happened when MDT0006 retrieved the update log from MDT000a (from console.since-dec13):

2016-12-13 13:49:38 [336387.573280] Lustre: 86734:0:(llog.c:529:llog_process_thread()) invalid length 0 in llog record for index 0/80
2016-12-13 13:49:38 [336387.585565] LustreError: 86734:0:(lod_dev.c:419:lod_sub_recovery_thread()) lquake-MDT000a-osp-MDT0006 getting update log failed: rc = -22

Do you still have the console log from MDT0006 for the last run? I want to check whether this corrupt log was hit in the last recovery. Thanks.

Comment by Gerrit Updater [ 09/Jan/17 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/24008/
Subject: LU-8753 llog: remove lgh_write_offset
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: f36daac69fe6e0cd35e2369967f4bae11bd2666f

Comment by Gerrit Updater [ 24/Jan/17 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/24364/
Subject: LU-8753 osp: add rpc generation
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 0844905a308d614c86b56df70c8f03e5d59ee286

Comment by Peter Jones [ 30/Jan/17 ]

As per Di, the remaining patch was a debug-only patch. This issue is fixed for 2.10 and the patches will be backported to maintenance releases.
