[LU-8569] Sharded DNE directory full of files that don't exist Created: 30/Aug/16  Updated: 10/Aug/17  Resolved: 18/Jan/17

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: Lustre 2.10.0

Type: Bug Priority: Critical
Reporter: Christopher Morrone Assignee: nasf (Inactive)
Resolution: Fixed Votes: 0
Labels: llnl

Attachments: Text File getstripelogs.tar.gz     Text File jet-link-logs-part1.tar.gz     Text File jet-link-logs-part2.tar.gz     Text File jet-link-logs-part3.tar.gz     Text File jet-link-logs-part4.tar.gz     Text File lfsck_namespace_state-9-28-2016.log    
Issue Links:
Related
is related to LU-8647 lfsck_namespace_double_scan()) ASSERT... Resolved
is related to LU-5802 LFSCK 5: avoid the (direct) interacti... Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

On our DNE testbed, one of our sharded directories seems to contain files that are all in a broken state. Currently both servers and clients are running 2.8.0_0.0.llnlpreview.40 (see the lustre-release-fe-llnl repo).

We can get a directory listing, but nothing listed is actually accessible. Here is an excerpt from running ls -l:

# pwd
/p/lquake/casses1/opal-jet/simul_2
# ls -l
ls: cannot access simul_link.2243: No such file or directory
ls: cannot access simul_link.3161: No such file or directory
ls: cannot access simul_link.3129: No such file or directory
ls: cannot access simul_link.3893: No such file or directory
ls: cannot access simul_link.691: No such file or directory
ls: cannot access simul_link.3233: No such file or directory
ls: cannot access simul_link.235: No such file or directory
ls: cannot access simul_link.1653: No such file or directory
ls: cannot access simul_link.3167: No such file or directory
ls: cannot access simul_link.681: No such file or directory
ls: cannot access simul_link.835: No such file or directory
ls: cannot access simul_link.3857: No such file or directory
ls: cannot access simul_link.1591: No such file or directory
ls: cannot access simul_link.1175: No such file or directory
[cut]
-????????? ? ? ? ?            ? simul_link.937
-????????? ? ? ? ?            ? simul_link.94
-????????? ? ? ? ?            ? simul_link.940
-????????? ? ? ? ?            ? simul_link.941
-????????? ? ? ? ?            ? simul_link.942
-????????? ? ? ? ?            ? simul_link.943
-????????? ? ? ? ?            ? simul_link.944
-????????? ? ? ? ?            ? simul_link.947
[cut]

Here is the striping information:

# lfs getdirstripe .
.
lmv_stripe_count: 16 lmv_stripe_offset: 12
mdtidx           FID[seq:oid:ver]
    12           [0x50000996c:0x14fed:0x0]
    13           [0x54000919d:0x14fed:0x0]
    14           [0x58000a086:0x14fed:0x0]
    15           [0x5c000996b:0x14fed:0x0]
     0           [0x200006b03:0x14fed:0x0]
     1           [0x3000089cc:0x14fed:0x0]
     2           [0x38000996d:0x14fed:0x0]
     3           [0x4c000b0df:0x14fed:0x0]
     4           [0x2c000a142:0xec09:0x0]
     5           [0x3c000b8b2:0xec09:0x0]
     6           [0x34000a143:0xec09:0x0]
     7           [0x40000a143:0xec09:0x0]
     8           [0x44000a142:0xec09:0x0]
     9           [0x24000a143:0xec09:0x0]
    10           [0x2800091a4:0xec09:0x0]
    11           [0x4800091a3:0xec09:0x0]

I ran lfsck on all services (at least those started by the "--all" option), but that did not address this situation.

The problem files cannot be unlinked:

# rm simul_link.999
rm: cannot remove 'simul_link.999': No such file or directory


 Comments   
Comment by Andreas Dilger [ 31/Aug/16 ]

Can you check "lfs getstripe" on a few of the broken files, to see if the FIDs of the OST objects are unusual? I suspect that the directory is OK, but the error is coming from the OST, which does not have the objects in the MDT file's layout. That may still indicate a problem with the MDT or OST, but it will give a starting point.

Comment by Andreas Dilger [ 31/Aug/16 ]

Can you please check "lfs getstripe" on a few of the broken files? It may be that the error is coming from the OST and not the directory at all.

Comment by Peter Jones [ 31/Aug/16 ]

Assigning to Fan Yong for further investigation

Comment by Christopher Morrone [ 31/Aug/16 ]

Here is the result of lfs getstripe for files in that directory:

# lfs getstripe simul_link.2280
error opening simul_link.2280: Bad address (14)
llapi_semantic_traverse: Failed to open 'simul_link.2280': Bad address (14)
error: getstripe failed for simul_link.2280.
Comment by nasf (Inactive) [ 01/Sep/16 ]

Would you please collect the -1 level Lustre debug log on both the client and the MDT when you hit the "lfs getstripe simul_link.2280" failure? Since we do NOT know on which MDT the file simul_link.2280 resides (if you know, that is better), the logs have to be collected on all MDTs.

Thanks!
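
A minimal collection sketch for that, assuming the standard lctl debug workflow (the debug_mb value and file names are placeholders, and -1 level logging usually needs a larger trace buffer):

lctl set_param debug=-1                  # enable all debug masks on the client and on every MDS
lctl set_param debug_mb=1024             # assumption: enlarge the trace buffer so -1 level logs are not overwritten
lctl clear                               # empty the current debug buffer
lfs getstripe simul_link.2280            # reproduce the failure on the client
lctl dk /tmp/$(hostname)-getstripe.log   # dump the debug buffer to a file on each node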

Comment by Giuseppe Di Natale (Inactive) [ 19/Sep/16 ]

I collected -1 level Lustre logs on the client and for each MDT. They are in the tar file 'getstripelogs.tar.gz' which I attached to this issue.

The command I logged is:

lfs getstripe simul_link.898

The output of the command was:

error opening simul_link.898: Bad address (14)
llapi_semantic_traverse: Failed to open 'simul_link.898': Bad address (14)
error: getstripe failed for simul_link.898.

A grep seems to indicate that jet2 may be the log of interest, but I included all of them for completeness. Let me know if you need any other information.

Comment by nasf (Inactive) [ 20/Sep/16 ]

The log on the client (client-getstripe.log) shows that:

00800000:00000001:3.0:1474324139.597219:0:117923:0:(lmv_intent.c:276:lmv_intent_open()) Process entered
00800000:00000040:3.0:1474324139.597221:0:117923:0:(lustre_lmv.h:170:lmv_name_to_stripe_index()) name simul_link.898 hash_type 2 idx 1
00800000:00000040:3.0:1474324139.597223:0:117923:0:(lmv_obd.c:1715:lmv_locate_target_for_name()) locate on mds 1 [0x30000cf20:0x1:0x0]
00800000:00000002:3.0:1474324139.597224:0:117923:0:(lmv_intent.c:316:lmv_intent_open()) OPEN_INTENT with fid1=[0x30000cf20:0x1:0x0], fid2=[0x0:0x0:0x0], name='simul_link.898' -> mds #1
...

This means the client sent the intent-open RPC for [0x30000cf20:0x1:0x0]/simul_link.898 to mds#1.
The log on the mds1 (jet2-getstripe.log) shows that:

00000004:00000001:7.0:1474324139.598512:0:38638:0:(mdt_open.c:1198:mdt_reint_open()) Process entered
00000020:00000001:7.0:1474324139.598514:0:38638:0:(lprocfs_jobstats.c:272:lprocfs_job_stats_log()) Process entered
00000020:00000001:7.0:1474324139.598517:0:38638:0:(lprocfs_jobstats.c:323:lprocfs_job_stats_log()) Process leaving (rc=0 : 0 : 0)
00000004:00000002:7.0:1474324139.598518:0:38638:0:(mdt_open.c:1226:mdt_reint_open()) I am going to open [0x30000cf20:0x1:0x0]/(simul_link.898->[0x0:0x0:0x0]) cr_flag=01 mode=0100000 msg_flag=0x0
...
00080000:00000001:7.0:1474324139.598600:0:38638:0:(osd_index.c:395:osd_dir_lookup()) Process entered
00080000:00000001:7.0:1474324139.598639:0:38638:0:(osd_index.c:415:osd_dir_lookup()) Process leaving (rc=1 : 1 : 1)
...
00000004:00000001:7.0:1474324139.599521:0:38638:0:(osp_trans.c:469:osp_remote_sync()) Process leaving (rc=18446744073709551614 : -2 : fffffffffffffffe)
00000004:00000001:7.0:1474324139.599522:0:38638:0:(osp_object.c:591:osp_attr_get()) Process leaving via out (rc=18446744073709551614 : -2 : 0xfffffffffffffffe)
...

This means mds#1 received the intent-open RPC. It did the lookup first and found that the name entry "simul_link.898" exists on this MDT, but its FID refers to a remote MDT, so osp_attr_get() was triggered to fetch the object's attributes when initialising the object. Unfortunately, the remote MDT returned -2 (-ENOENT) to this MDT, i.e. "simul_link.898" is a dangling name entry. That is why the subsequent operation failed with -14 (-EFAULT).

Currently, I do not know what caused the dangling name entry, but I would suggest running namespace LFSCK to fix the related Lustre inconsistency. To be safe, you can run namespace LFSCK without the "-C" option first; that will detect how many dangling name entries are in the system but NOT auto-repair them. Then you can check whether they need to be fixed. If you think it is necessary to re-create the related lost MDT-objects, re-run the namespace LFSCK with "-C" specified.
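
A minimal sketch of that sequence, assuming the lctl lfsck_start interface and the lquake-MDT0000 device name seen elsewhere in this ticket (adjust to your MDT names):

lctl lfsck_start -M lquake-MDT0000 -t namespace -A       # without "-C": dangling entries are counted but lost MDT-objects are not re-created
lctl get_param -n mdd.lquake-MDT0000.lfsck_namespace     # inspect the dangling-entry counters once the scan completes
lctl lfsck_start -M lquake-MDT0000 -t namespace -A -C    # only if re-creating the lost MDT-objects is acceptable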

Comment by Giuseppe Di Natale (Inactive) [ 21/Sep/16 ]

We came up with an easier reproducer for this issue in case you need to collect more information. Details are below.

Create a striped directory for this test. cd to that directory and create a simple file:

echo "hello world" > afile

From there, create a script called 'linkme.sh' with the following contents:

#!/bin/bash
filename=$(hostname)_${RANDOM}
ln afile $filename

Now, using srun, we can run the script across many nodes/cores w/ no timeout. Example below:

srun -W 0 -N 47 -n $((47*36)) linkme.sh

The script ran for a bit, but eventually we started seeing "bad address" errors. I'll continue to try and collect more information.

Comment by Giuseppe Di Natale (Inactive) [ 22/Sep/16 ]

Ran an lfsck namespace with -C and got the following LBUG on multiple MDTs.

2016-09-22 10:04:23 [493341.943717] LustreError: 127771:0:(lfsck_namespace.c:4452:lfsck_namespace_double_scan()) ASSERTION( list_empty(&lad->lad_req_list) ) failed: 
2016-09-22 10:04:23 [493341.958848] LustreError: 127771:0:(lfsck_namespace.c:4452:lfsck_namespace_double_scan()) LBUG
2016-09-22 10:04:23 [493341.968781] Pid: 127771, comm: lfsck

We have the following call stack on two MDTs.

2016-09-22 10:03:52 Sep 22 10:03:52 [493315.464373] Kernel panic - not syncing: LBUG
2016-09-22 10:03:52 jet6 kernel: [49[493315.470430] CPU: 2 PID: 111809 Comm: lfsck Tainted: P           OE  ------------   3.10.0-327.28.2.1chaos.ch6.x86_64 #1
2016-09-22 10:03:52 3315.297027] Lus[493315.484175] Hardware name: Intel Corporation S2600WTTR/S2600WTTR, BIOS SE5C610.86B.01.01.0016.033120161139 03/31/2016
2016-09-22 10:03:52 treError: 111809[493315.497715]  ffffffffa079be0f 0000000055805053 ffff882757e4fc78 ffffffff8164cae7
2016-09-22 10:03:52 :0:(lfsck_namesp[493315.507701]  ffff882757e4fcf8 ffffffff81645adf ffffffff00000008 ffff882757e4fd08
2016-09-22 10:03:52 ace.c:4452:lfsck[493315.517684]  ffff882757e4fca8 0000000055805053 ffffffffa1070e70 0000000000000246
2016-09-22 10:03:52 _namespace_doubl[493315.527666] Call Trace:
2016-09-22 10:03:52 e_scan()) ASSERT[493315.532060]  [<ffffffff8164cae7>] dump_stack+0x19/0x1b
2016-09-22 10:03:52 ION( list_empty([493315.539478]  [<ffffffff81645adf>] panic+0xd8/0x1e7
2016-09-22 10:03:52 &lad->lad_req_li[493315.546501]  [<ffffffffa077fdeb>] lbug_with_loc+0xab/0xc0 [libcfs]
2016-09-22 10:03:52 st) ) failed: 
2016-09-22 10:03:52 [493315.555082]  [<ffffffffa102c2a6>] lfsck_namespace_double_scan+0x106/0x140 [lfsck]
2016-09-22 10:03:52 Sep 22 10:03:52 [493315.565122]  [<ffffffffa10234f9>] lfsck_double_scan+0x59/0x200 [lfsck]
2016-09-22 10:03:52 jet6 kernel: [49[493315.574086]  [<ffffffffa0d88fc4>] ? osd_zfs_otable_it_fini+0x64/0x110 [osd_zfs]
2016-09-22 10:03:52 3315.311863] Lus[493315.583931]  [<ffffffffa0d88fc4>] ? osd_zfs_otable_it_fini+0x64/0x110 [osd_zfs]
2016-09-22 10:03:52 treError: 111809[493315.593765]  [<ffffffff811c8bad>] ? kfree+0x12d/0x170
2016-09-22 10:03:52 :0:(lfsck_namesp[493315.601075]  [<ffffffffa1028044>] lfsck_master_engine+0x434/0x1310 [lfsck]
2016-09-22 10:03:52 ace.c:4452:lfsck[493315.610415]  [<ffffffff81015588>] ? __switch_to+0xf8/0x4d0
2016-09-22 10:03:52 _namespace_doubl[493315.618212]  [<ffffffff810bd4f0>] ? wake_up_state+0x20/0x20
2016-09-22 10:03:52 e_scan()) LBUG
2016-09-22 10:03:52 [493315.626108]  [<ffffffffa1027c10>] ? lfsck_master_oit_engine+0x1430/0x1430 [lfsck]
2016-09-22 10:03:52 [493315.636145]  [<ffffffff810a99bf>] kthread+0xcf/0xe0
2016-09-22 10:03:52 [493315.642238]  [<ffffffff810a98f0>] ? kthread_create_on_node+0x140/0x140
2016-09-22 10:03:52 [493315.650187]  [<ffffffff8165d9d8>] ret_from_fork+0x58/0x90
2016-09-22 10:03:52 [493315.656864]  [<ffffffff810a98f0>] ? kthread_create_on_node+0x140/0x140
2016-09-22 10:03:52 [493315.711916] drm_kms_helper: panic occurred, switching back to text console
2016-09-22 10:03:52 [493315.720378] ------------[ cut here ]------------
2016-09-22 10:03:52 [493315.726202] WARNING: at arch/x86/kernel/smp.c:124 native_smp_send_reschedule+0x5f/0x70()
2016-09-22 10:03:52 [493315.735902] Modules linked in: osp(OE) mdd(OE) lod(OE) mdt(OE) lfsck(OE) mgc(OE) osd_zfs(OE) lquota(OE) fid(OE) fld(OE) ptlrpc(OE) obdclass(OE) rpcsec_gss_krb5 ko2iblnd(OE) lnet(OE) sha512_generic crypto_null libcfs(OE) nfsv3 iTCO_wdt iTCO_vendor_support intel_powerclamp coretemp intel_rapl kvm mlx5_ib pcspkr mlx5_core sb_edac lpc_ich edac_core mfd_core mei_me mei zfs(POE) zunicode(POE) zavl(POE) zcommon(POE) znvpair(POE) ses enclosure ipmi_devintf spl(OE) zlib_deflate sg i2c_i801 ioatdma shpchp ipmi_si ipmi_msghandler acpi_power_meter acpi_cpufreq binfmt_misc ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_sa ib_mad ib_core ib_addr nfsd nfs_acl ip_tables auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache dm_round_robin sd_mod crc_t10dif crct10dif_generic crct10dif_pclmul crct10dif_common crc32_pclmul mgag200 crc32c_intel syscopyarea sysfillrect sysimgblt ghash_clmulni_intel i2c_algo_bit drm_kms_helper mxm_wmi ttm aesni_intel ixgbe lrw gf128mul ahci drm dca glue_helper mpt3sas libahci ptp i2c_core ablk_helper cryptd libata raid_class pps_core scsi_transport_sas mdio wmi sunrpc dm_mirror dm_region_hash dm_log scsi_transport_iscsi dm_multipath dm_mod
2016-09-22 10:03:52 [493315.859970] CPU: 2 PID: 0 Comm: swapper/2 Tainted: P           OE  ------------   3.10.0-327.28.2.1chaos.ch6.x86_64 #1
2016-09-22 10:03:52 [493315.872734] Hardware name: Intel Corporation S2600WTTR/S2600WTTR, BIOS SE5C610.86B.01.01.0016.033120161139 03/31/2016
2016-09-22 10:03:53 [493315.885407]  0000000000000000 bcf7d7e5812e0014 ffff883f7e683d78 ffffffff8164cae7
2016-09-22 10:03:53 [493315.894536]  ffff883f7e683db0 ffffffff8107d6d0 0000000000000000 ffff883f7e6967c0
2016-09-22 10:03:53 [493315.903668]  000000011d5cacb8 ffff883f7e6167c0 0000000000000002 ffff883f7e683dc0
2016-09-22 10:03:53 [493315.912796] Call Trace:
2016-09-22 10:03:53 [493315.916347]  <IRQ>  [<ffffffff8164cae7>] dump_stack+0x19/0x1b
2016-09-22 10:03:53 [493315.923621]  [<ffffffff8107d6d0>] warn_slowpath_common+0x70/0xb0
2016-09-22 10:03:53 [493315.931168]  [<ffffffff8107d81a>] warn_slowpath_null+0x1a/0x20
2016-09-22 10:03:53 [493315.938512]  [<ffffffff81048fdf>] native_smp_send_reschedule+0x5f/0x70
2016-09-22 10:03:53 [493315.946646]  [<ffffffff810cb04d>] trigger_load_balance+0x18d/0x250
2016-09-22 10:03:53 [493315.954390]  [<ffffffff810bbdd3>] scheduler_tick+0x103/0x150
2016-09-22 10:03:53 [493315.961553]  [<ffffffff810e5800>] ? tick_sched_handle.isra.14+0x60/0x60
2016-09-22 10:03:53 [493315.969775]  [<ffffffff81091a06>] update_process_times+0x66/0x80
2016-09-22 10:03:53 [493315.977304]  [<ffffffff810e57c5>] tick_sched_handle.isra.14+0x25/0x60
2016-09-22 10:03:53 [493315.985310]  [<ffffffff810e5841>] tick_sched_timer+0x41/0x70
2016-09-22 10:03:53 [493315.992432]  [<ffffffff810adeda>] __hrtimer_run_queues+0xea/0x2c0
2016-09-22 10:03:53 [493316.000042]  [<ffffffff810ae4e0>] hrtimer_interrupt+0xb0/0x1e0
2016-09-22 10:03:53 [493316.007351]  [<ffffffff8104be47>] local_apic_timer_interrupt+0x37/0x60
2016-09-22 10:03:53 [493316.015442]  [<ffffffff8166000f>] smp_apic_timer_interrupt+0x3f/0x60
2016-09-22 10:03:53 [493316.023338]  [<ffffffff8165e6dd>] apic_timer_interrupt+0x6d/0x80
2016-09-22 10:03:53 [493316.030848]  <EOI>  [<ffffffff810dd69c>] ? ktime_get+0x4c/0xd0
2016-09-22 10:03:53 [493316.038194]  [<ffffffff810b8da6>] ? finish_task_switch+0x56/0x180
2016-09-22 10:03:53 [493316.045803]  [<ffffffff81651df0>] __schedule+0x2e0/0x940
2016-09-22 10:03:53 [493316.052533]  [<ffffffff81653709>] schedule_preempt_disabled+0x39/0x90
2016-09-22 10:03:53 [493316.060533]  [<ffffffff810db1f4>] cpu_startup_entry+0x184/0x2d0
2016-09-22 10:03:53 [493316.067949]  [<ffffffff81049eea>] start_secondary+0x1ca/0x240
2016-09-22 10:03:53 [493316.075162] ---[ end trace 28897805122ddeee ]---

Filesystem info:
16 MDS, 4 OSS, running ZFS 0.7.0-0.3llnl and lustre 2.8.0 on a RHEL 7.2 based operating system (3.10.0-327.28.2.1chaos.ch6.x86_64).

Also worth noting, once we have a directory with files that exhibit this "bad address" error, the directory cannot be removed.

Let me know if you need more info.

Comment by Giuseppe Di Natale (Inactive) [ 23/Sep/16 ]

I'm going to attempt to bring our filesystem back up this afternoon, if you could let me know if you have everything you need, that'd be great! Thanks!

Comment by Peter Jones [ 23/Sep/16 ]

Joe

Fan Yong is based in China so may not see this question until Sunday evening by this time of day.

Peter

Comment by Giuseppe Di Natale (Inactive) [ 23/Sep/16 ]

Ah, thanks for letting me know, Peter. We are able to reproduce it if necessary, so I think it's safe to reboot our filesystem.

Comment by nasf (Inactive) [ 25/Sep/16 ]

I will make a patch to fix the namespace LFSCK assertion.

Also worth noting, once we have a directory with files that exhibit this "bad address" error, the directory cannot be removed.

That is because there are dangling name entries under the parent directory. The dangling name entries cannot be removed via the normal unlink/rmdir commands, so the parent directory is not empty. That is why the parent directory cannot be removed in such a case.

To resolve this, you have to run the namespace LFSCK with the "-C" option to fix the dangling name entries first, and then remove them.
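
In command form, a hedged sketch of that sequence (device name as in the earlier sketch, directory path from the description):

lctl lfsck_start -M lquake-MDT0000 -t namespace -A -C    # re-create the lost MDT-objects behind the dangling entries
lctl get_param -n mdd.lquake-MDT0000.lfsck_namespace     # wait until the status shows the scan has completed
rm -rf /p/lquake/casses1/opal-jet/simul_2                # the entries, and then the directory itself, can now be removed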

Comment by Gerrit Updater [ 25/Sep/16 ]

Fan Yong (fan.yong@intel.com) uploaded a new patch: http://review.whamcloud.com/22723
Subject: LU-8569 lfsck: cleanup lfsck requests list before exit
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 46dd7d98cb262fbbe1285b447cd763bfc80b27d4

Comment by Christopher Morrone [ 26/Sep/16 ]

But I would suggest running namespace LFSCK to fix the related Lustre inconsistency.

I had done that, but it did not fix the problem. lfsck ran to completion and did not assert when I ran it. The assertion is a new thing. I have no idea why it crashed this time around.

Comment by nasf (Inactive) [ 27/Sep/16 ]

I think you specified the "-C" option when you ran the namespace LFSCK, right?
Do you have the Lustre debug log (with "lfsck" debug enabled) from when LFSCK ran? That will record which inconsistencies were detected and repaired (or failed to be repaired). If you have not collected that information, can you show me the lfsck status lproc output?

Thanks!

Comment by Giuseppe Di Natale (Inactive) [ 28/Sep/16 ]

I went ahead and attached a log file called "lfsck_namespace_state-9-28-2016.log" which was obtained by running the following on each MDS:

pdsh -g mds 'lctl get_param -n mdd.$(ldev -l | grep lquake-MDT).lfsck_namespace' | dshbak -c

Worth noting: when I restarted the filesystem, I had to stop the lfsck namespace check because the kernel panics would continue to occur, since lfsck tried to pick up where it left off.

Also, we can currently reproduce the dangling name entry issue at will with the reproduction steps I provided in my Sept 21, 2016 comment (the one with linkme.sh). I think that still needs to be addressed.

I'm also going to break out the lfsck call stack issue into a separate ticket; it is unclear whether or not it is related.

Comment by Giuseppe Di Natale (Inactive) [ 28/Sep/16 ]

LU-8647 was created for the lfsck assertion that we started discussing in this ticket.

Comment by nasf (Inactive) [ 30/Sep/16 ]

According to the namespace LFSCK status, some dangling name entries should have been fixed:

# grep dangling lfsck_namespace_state-9-28-2016.log 
33:dangling_repaired: 423
92:dangling_repaired: 442
151:dangling_repaired: 431
210:dangling_repaired: 437
269:dangling_repaired: 406
328:dangling_repaired: 437
387:dangling_repaired: 440
446:dangling_repaired: 403
505:dangling_repaired: 511
564:dangling_repaired: 434
623:dangling_repaired: 432
682:dangling_repaired: 434
741:dangling_repaired: 540
800:dangling_repaired: 429
859:dangling_repaired: 435
918:dangling_repaired: 411

But there were still some failures when trying to repair the striped directories:

# grep failed lfsck_namespace_state-9-28-2016.log | grep -v 0
5:48:striped_shards_failed: 6
75:874:striped_shards_failed: 1

Unfortunately, without the detailed LFSCK Lustre kernel debug logs we cannot know what caused the LFSCK failure. If you can re-run the namespace LFSCK, please enable "lfsck" debug on the MDTs and collect the Lustre kernel debug logs.

Currently, since the namespace LFSCK failed to fix some inconsistencies, if you need to remove those dangling entries soon, one possible solution is to mount the backend as ZFS and remove those entries directly in ZFS mode. That will leave some stale OI mappings in the system, but apart from the wasted space it is almost harmless.
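
Enabling the "lfsck" debug mask mentioned above would look roughly like this on each MDS (a sketch, with the same buffer-size caveat as the earlier -1 level collection):

lctl set_param debug=+lfsck          # add the lfsck mask to the current debug settings
lctl clear                           # start from an empty buffer
# re-run the namespace LFSCK, then dump with: lctl dk /tmp/$(hostname)-lfsck.log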

Comment by Andreas Dilger [ 30/Sep/16 ]

It may also be possible to use "lfs rm" to remove dangling remote directory entries without trying to unlink the remote inode. That is intended for use in case of an MDT becoming permanently unavailable, but should also work in this case.
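
For reference, a heavily hedged sketch; the subcommand name is my assumption (recent lfs versions call it "lfs rm_entry" and document it for directory entries, so whether it covers these regular-file entries on 2.8 would need to be verified):

lfs rm_entry /p/lquake/casses1/opal-jet/simul_2/simul_link.999   # remove just the name entry, without touching the missing remote object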

Comment by nasf (Inactive) [ 30/Sep/16 ]

#!/bin/bash
filename=$(hostname)_${RANDOM}
ln afile $filename

I have tried the above script with multiple clients running in parallel, but cannot reproduce the trouble.
On the other hand, analysing your script: it creates a regular file 'afile', then repeatedly hard-links to it. If the hard link finally triggers the issue, then since nobody unlinks the source object 'afile', the only possible cause of a dangling name entry is that the FID stored in the hard-link name entry is wrong. But I cannot imagine how this can happen.

Giuseppe, would you please reproduce the issue the way you mentioned, with "-1" level Lustre kernel debug logs collected on the MDTs? Thanks!

Comment by Giuseppe Di Natale (Inactive) [ 05/Oct/16 ]

I can get you some logs soon. Our test system isn't happy right now. Working on getting it back up so I can reproduce this to get those logs. Stay tuned.

Comment by Brad Hoagland (Inactive) [ 12/Oct/16 ]

Hi dinatale2,
Were you able to get the test system up and reproduce?

Comment by Giuseppe Di Natale (Inactive) [ 13/Oct/16 ]

Still having issues with it. Will attempt to reproduce this ASAP.

Comment by Giuseppe Di Natale (Inactive) [ 14/Oct/16 ]

Logs are now attached to this incident. The file names are jet-link-logs-part[1-4].tar.gz. The part 1 tarball contains errors.log, which has a sampling of what shows up on the console, so you can use that to track down a specific file in the logs. Let me know if you need anything else.

Comment by Gerrit Updater [ 20/Oct/16 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/22723/
Subject: LU-8569 lfsck: cleanup lfsck requests list before exit
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 445da16c2ac0475b1c1077c822800b68cdbb7ce3

Comment by Peter Jones [ 20/Oct/16 ]

Landed for 2.9

Comment by Peter Jones [ 20/Oct/16 ]

Actually perhaps I was premature to mark this as resolved here. Fan Yong, what did the patch tracked under this ticket that just landed to master address? Is there still work to be tracked under this ticket?

Comment by Giuseppe Di Natale (Inactive) [ 21/Oct/16 ]

Peter,

There is still work being tracked under this ticket. The logs I posted last week are to help find a resolution to this issue.

The patch that landed was for LU-8647.

Comment by Peter Jones [ 22/Oct/16 ]

So LU-8647 was fixed by http://git.whamcloud.com/fs/lustre-release.git/commit/445da16c2ac0475b1c1077c822800b68cdbb7ce3 even though it used the LU-8569 JIRA reference in the commit message?

Comment by nasf (Inactive) [ 23/Oct/16 ]

Peter,

As you can see in the comment history, to keep the original LU-8569 issue clear, the new LFSCK test failure was split out of the LU-8569 description into the new ticket LU-8647. The patch http://review.whamcloud.com/22723/ was used to resolve the LU-8647 issue, but because it was pushed to Gerrit before LU-8647 was created, the patch still used the old ticket number.

So we can close the ticket LU-8647 as resolved. There is still some work to be done for LU-8569. I am investigating the huge logs.

Comment by Peter Jones [ 24/Oct/16 ]

Got it. For future reference it is possible to make adjustments to git commit messages when landing, so it would have been possible to use the correct JIRA reference without delaying things.

Comment by Di Wang [ 27/Oct/16 ]

I just looked at the debug log; it looks like the update log is too long, which does not seem right.

.............
0x23:47025: 200000020:00000040:9.0:1476399235.972447:0:154190:0:(update_trans.c:93:top_multiple_thandle_dump())  cookie 0x23:47025: 1

There are too many log cookies (> 1k) for this transaction, and each cookie can hold 32k update records. So I do not understand why link can generate such a big record size, even though the linkea size might be big in your test. (Do we limit the linkea size for ZFS?) The problem might be in sub_updates_write and related to this patch http://review.whamcloud.com/21334; I will check.

I suspect the following test might reproduce the problem; sigh, I do not have a ZFS environment here:

diff --git a/lustre/tests/sanity.sh b/lustre/tests/sanity.sh
index c61e3bc..0a3a82c 100755
--- a/lustre/tests/sanity.sh
+++ b/lustre/tests/sanity.sh
@@ -15196,6 +15196,29 @@ test_300q() {
 }
 run_test 300q "create remote directory under orphan directory"
 
+test_300r() {
+       [ $PARALLEL == "yes" ] && skip "skip parallel run" && return
+       [ $(lustre_version_code $SINGLEMDS) -lt $(version_code 2.7.55) ] &&
+               skip "Need MDS version at least 2.7.55" && return
+       [ $MDSCOUNT -lt 2 ] && skip "needs >= 2 MDTs" && return
+       local stripe_count
+       local file
+
+       mkdir $DIR/$tdir
+
+       $LFS setdirstripe -i1 -c3 $DIR/$tdir/remote_dir ||
+               error "set striped dir error"
+
+       touch $DIR/$tdir/$tfile
+       for ((i = 0; i < 50000; i++)); do
+               ln $DIR/$tdir/$tfile $DIR/$tdir/remote_dir/fffffffffffffffffffffffffffffffffffffffff-$i ||
+                       error "ln remote file fails"
+       done
+
+       return 0
+}
+run_test 300r "test remote ln under striped directory"
+
 prepare_remote_file() {
        mkdir $DIR/$tdir/src_dir ||
                error "create remote source failed"

Comment by Di Wang [ 27/Oct/16 ]

I just did some tests on ZFS, and it looks like the problem is that the linkEA on ZFS grows beyond the llog chunk size (32768), which our current update llog system cannot handle, i.e. one update operation (update op + its parameters) cannot be larger than the llog chunk size (32KB).

So is it ok to limit the linkea size here?
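
As a rough back-of-the-envelope check (assumptions flagged in the comments: ~18 bytes of fixed overhead per linkEA entry for the record length plus packed parent FID, and the long names from the test above):

chunk=32768                     # llog chunk size
name_len=45                     # approximate length of the test names above
per_entry=$((18 + name_len))    # assumption: 2-byte reclen + 16-byte packed FID + name bytes
echo $((chunk / per_entry))     # ~520 hard links before a single linkEA update no longer fits in one chunk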

Comment by Andreas Dilger [ 27/Oct/16 ]

Yes, I think it is reasonable to limit linkEA size in this case. The Linux kernel xattr API is also similarly limited by the size of individual xattrs, and ldiskfs has a 4KB limit for xattrs, so the Lustre code is already expecting that not all links will be stored for a given file.

Comment by Gerrit Updater [ 01/Nov/16 ]

Fan Yong (fan.yong@intel.com) uploaded a new patch: http://review.whamcloud.com/23500
Subject: LU-8569 linkea: linkEA size limitation
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 0d8fe108f7b7f267fa790320954fc55e996af964

Comment by Gerrit Updater [ 14/Nov/16 ]

Fan Yong (fan.yong@intel.com) uploaded a new patch: http://review.whamcloud.com/23741
Subject: LU-8569 lfsck: handle linkEA overflow
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 94f5d2fec9edb6e1e5359ceebea9882cb5bb2719

Comment by Gerrit Updater [ 01/Jan/17 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/23500/
Subject: LU-8569 linkea: linkEA size limitation
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: e760042016bb5b12f9b21568304c02711930720f

Comment by Giuseppe Di Natale (Inactive) [ 11/Jan/17 ]

Before this closes, can these patches also be ported to the 2.8FE branch?

Comment by Peter Jones [ 11/Jan/17 ]

Giuseppe

The ticket will be marked resolved when the patches land to master, but the ticket will remain on the LLNL priority list until the equivalent patches have been ported and landed to the 2.8 FE branch.

Peter

Comment by Gerrit Updater [ 18/Jan/17 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/23741/
Subject: LU-8569 lfsck: handle linkEA overflow
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 048a8740ae26e3406a7eab3bca383a90490cef93

Comment by Peter Jones [ 18/Jan/17 ]

All patches landed to master for 2.10. Ports to 2.8 and 2.9 FE branches will be tracked separately.

Comment by Giuseppe Di Natale (Inactive) [ 19/Jan/17 ]

Peter,

Are there tasks created so I can keep track of the 2.8 FE port?

Joe

Comment by Peter Jones [ 19/Jan/17 ]

We'll post the links on the ticket and mark with llnlfixready when it's ready for you to pick up

Comment by Giuseppe Di Natale (Inactive) [ 20/Jan/17 ]

Apologies Peter, I went ahead and created LU-9037 to keep track of the porting so those who are interested can keep track of its progress.
