[LU-8647] lfsck_namespace_double_scan()) ASSERTION( list_empty(&lad->lad_req_list) ) failed Created: 28/Sep/16  Updated: 18/Oct/17  Resolved: 23/Oct/16

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.8.0
Fix Version/s: Lustre 2.9.0

Type: Bug Priority: Critical
Reporter: Giuseppe Di Natale (Inactive) Assignee: nasf (Inactive)
Resolution: Fixed Votes: 0
Labels: llnl
Environment:

3.10.0-327.28.2.1chaos.ch6.x86_64


Attachments: Text File lfsck_namespace_state-9-28-2016.log    
Issue Links:
Related
is related to LU-8569 Sharded DNE directory full of files t... Resolved
is related to LU-10134 LBUG lfsck_namespace_double_scan()) A... Resolved
is related to LU-8863 LFSCK fails to complete, node cannot ... Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

Ran an lfsck namespace with -C and got the following LBUG on multiple MDTs.

2016-09-22 10:04:23 [493341.943717] LustreError: 127771:0:(lfsck_namespace.c:4452:lfsck_namespace_double_scan()) ASSERTION( list_empty(&lad->lad_req_list) ) failed: 
2016-09-22 10:04:23 [493341.958848] LustreError: 127771:0:(lfsck_namespace.c:4452:lfsck_namespace_double_scan()) LBUG
2016-09-22 10:04:23 [493341.968781] Pid: 127771, comm: lfsck

Have the following call stack on two MDTs.

2016-09-22 10:03:52 Sep 22 10:03:52 [493315.464373] Kernel panic - not syncing: LBUG
2016-09-22 10:03:52 jet6 kernel: [49[493315.470430] CPU: 2 PID: 111809 Comm: lfsck Tainted: P           OE  ------------   3.10.0-327.28.2.1chaos.ch6.x86_64 #1
2016-09-22 10:03:52 3315.297027] Lus[493315.484175] Hardware name: Intel Corporation S2600WTTR/S2600WTTR, BIOS SE5C610.86B.01.01.0016.033120161139 03/31/2016
2016-09-22 10:03:52 treError: 111809[493315.497715]  ffffffffa079be0f 0000000055805053 ffff882757e4fc78 ffffffff8164cae7
2016-09-22 10:03:52 :0:(lfsck_namesp[493315.507701]  ffff882757e4fcf8 ffffffff81645adf ffffffff00000008 ffff882757e4fd08
2016-09-22 10:03:52 ace.c:4452:lfsck[493315.517684]  ffff882757e4fca8 0000000055805053 ffffffffa1070e70 0000000000000246
2016-09-22 10:03:52 _namespace_doubl[493315.527666] Call Trace:
2016-09-22 10:03:52 e_scan()) ASSERT[493315.532060]  [<ffffffff8164cae7>] dump_stack+0x19/0x1b
2016-09-22 10:03:52 ION( list_empty([493315.539478]  [<ffffffff81645adf>] panic+0xd8/0x1e7
2016-09-22 10:03:52 &lad->lad_req_li[493315.546501]  [<ffffffffa077fdeb>] lbug_with_loc+0xab/0xc0 [libcfs]
2016-09-22 10:03:52 st) ) failed: 
2016-09-22 10:03:52 [493315.555082]  [<ffffffffa102c2a6>] lfsck_namespace_double_scan+0x106/0x140 [lfsck]
2016-09-22 10:03:52 Sep 22 10:03:52 [493315.565122]  [<ffffffffa10234f9>] lfsck_double_scan+0x59/0x200 [lfsck]
2016-09-22 10:03:52 jet6 kernel: [49[493315.574086]  [<ffffffffa0d88fc4>] ? osd_zfs_otable_it_fini+0x64/0x110 [osd_zfs]
2016-09-22 10:03:52 3315.311863] Lus[493315.583931]  [<ffffffffa0d88fc4>] ? osd_zfs_otable_it_fini+0x64/0x110 [osd_zfs]
2016-09-22 10:03:52 treError: 111809[493315.593765]  [<ffffffff811c8bad>] ? kfree+0x12d/0x170
2016-09-22 10:03:52 :0:(lfsck_namesp[493315.601075]  [<ffffffffa1028044>] lfsck_master_engine+0x434/0x1310 [lfsck]
2016-09-22 10:03:52 ace.c:4452:lfsck[493315.610415]  [<ffffffff81015588>] ? __switch_to+0xf8/0x4d0
2016-09-22 10:03:52 _namespace_doubl[493315.618212]  [<ffffffff810bd4f0>] ? wake_up_state+0x20/0x20
2016-09-22 10:03:52 e_scan()) LBUG
2016-09-22 10:03:52 [493315.626108]  [<ffffffffa1027c10>] ? lfsck_master_oit_engine+0x1430/0x1430 [lfsck]
2016-09-22 10:03:52 [493315.636145]  [<ffffffff810a99bf>] kthread+0xcf/0xe0
2016-09-22 10:03:52 [493315.642238]  [<ffffffff810a98f0>] ? kthread_create_on_node+0x140/0x140
2016-09-22 10:03:52 [493315.650187]  [<ffffffff8165d9d8>] ret_from_fork+0x58/0x90
2016-09-22 10:03:52 [493315.656864]  [<ffffffff810a98f0>] ? kthread_create_on_node+0x140/0x140
2016-09-22 10:03:52 [493315.711916] drm_kms_helper: panic occurred, switching back to text console
2016-09-22 10:03:52 [493315.720378] ------------[ cut here ]------------
2016-09-22 10:03:52 [493315.726202] WARNING: at arch/x86/kernel/smp.c:124 native_smp_send_reschedule+0x5f/0x70()
2016-09-22 10:03:52 [493315.735902] Modules linked in: osp(OE) mdd(OE) lod(OE) mdt(OE) lfsck(OE) mgc(OE) osd_zfs(OE) lquota(OE) fid(OE) fld(OE) ptlrpc(OE) obdclass(OE) rpcsec_gss_krb5 ko2iblnd(OE) lnet(OE) sha512_generic crypto_null libcfs(OE) nfsv3 iTCO_wdt iTCO_vendor_support intel_powerclamp coretemp intel_rapl kvm mlx5_ib pcspkr mlx5_core sb_edac lpc_ich edac_core mfd_core mei_me mei zfs(POE) zunicode(POE) zavl(POE) zcommon(POE) znvpair(POE) ses enclosure ipmi_devintf spl(OE) zlib_deflate sg i2c_i801 ioatdma shpchp ipmi_si ipmi_msghandler acpi_power_meter acpi_cpufreq binfmt_misc ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_sa ib_mad ib_core ib_addr nfsd nfs_acl ip_tables auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache dm_round_robin sd_mod crc_t10dif crct10dif_generic crct10dif_pclmul crct10dif_common crc32_pclmul mgag200 crc32c_intel syscopyarea sysfillrect sysimgblt ghash_clmulni_intel i2c_algo_bit drm_kms_helper mxm_wmi ttm aesni_intel ixgbe lrw gf128mul ahci drm dca glue_helper mpt3sas libahci ptp i2c_core ablk_helper cryptd libata raid_class pps_core scsi_transport_sas mdio wmi sunrpc dm_mirror dm_region_hash dm_log scsi_transport_iscsi dm_multipath dm_mod
2016-09-22 10:03:52 [493315.859970] CPU: 2 PID: 0 Comm: swapper/2 Tainted: P           OE  ------------   3.10.0-327.28.2.1chaos.ch6.x86_64 #1
2016-09-22 10:03:52 [493315.872734] Hardware name: Intel Corporation S2600WTTR/S2600WTTR, BIOS SE5C610.86B.01.01.0016.033120161139 03/31/2016
2016-09-22 10:03:53 [493315.885407]  0000000000000000 bcf7d7e5812e0014 ffff883f7e683d78 ffffffff8164cae7
2016-09-22 10:03:53 [493315.894536]  ffff883f7e683db0 ffffffff8107d6d0 0000000000000000 ffff883f7e6967c0
2016-09-22 10:03:53 [493315.903668]  000000011d5cacb8 ffff883f7e6167c0 0000000000000002 ffff883f7e683dc0
2016-09-22 10:03:53 [493315.912796] Call Trace:
2016-09-22 10:03:53 [493315.916347]  <IRQ>  [<ffffffff8164cae7>] dump_stack+0x19/0x1b
2016-09-22 10:03:53 [493315.923621]  [<ffffffff8107d6d0>] warn_slowpath_common+0x70/0xb0
2016-09-22 10:03:53 [493315.931168]  [<ffffffff8107d81a>] warn_slowpath_null+0x1a/0x20
2016-09-22 10:03:53 [493315.938512]  [<ffffffff81048fdf>] native_smp_send_reschedule+0x5f/0x70
2016-09-22 10:03:53 [493315.946646]  [<ffffffff810cb04d>] trigger_load_balance+0x18d/0x250
2016-09-22 10:03:53 [493315.954390]  [<ffffffff810bbdd3>] scheduler_tick+0x103/0x150
2016-09-22 10:03:53 [493315.961553]  [<ffffffff810e5800>] ? tick_sched_handle.isra.14+0x60/0x60
2016-09-22 10:03:53 [493315.969775]  [<ffffffff81091a06>] update_process_times+0x66/0x80
2016-09-22 10:03:53 [493315.977304]  [<ffffffff810e57c5>] tick_sched_handle.isra.14+0x25/0x60
2016-09-22 10:03:53 [493315.985310]  [<ffffffff810e5841>] tick_sched_timer+0x41/0x70
2016-09-22 10:03:53 [493315.992432]  [<ffffffff810adeda>] __hrtimer_run_queues+0xea/0x2c0
2016-09-22 10:03:53 [493316.000042]  [<ffffffff810ae4e0>] hrtimer_interrupt+0xb0/0x1e0
2016-09-22 10:03:53 [493316.007351]  [<ffffffff8104be47>] local_apic_timer_interrupt+0x37/0x60
2016-09-22 10:03:53 [493316.015442]  [<ffffffff8166000f>] smp_apic_timer_interrupt+0x3f/0x60
2016-09-22 10:03:53 [493316.023338]  [<ffffffff8165e6dd>] apic_timer_interrupt+0x6d/0x80
2016-09-22 10:03:53 [493316.030848]  <EOI>  [<ffffffff810dd69c>] ? ktime_get+0x4c/0xd0
2016-09-22 10:03:53 [493316.038194]  [<ffffffff810b8da6>] ? finish_task_switch+0x56/0x180
2016-09-22 10:03:53 [493316.045803]  [<ffffffff81651df0>] __schedule+0x2e0/0x940
2016-09-22 10:03:53 [493316.052533]  [<ffffffff81653709>] schedule_preempt_disabled+0x39/0x90
2016-09-22 10:03:53 [493316.060533]  [<ffffffff810db1f4>] cpu_startup_entry+0x184/0x2d0
2016-09-22 10:03:53 [493316.067949]  [<ffffffff81049eea>] start_secondary+0x1ca/0x240
2016-09-22 10:03:53 [493316.075162] ---[ end trace 28897805122ddeee ]---

Filesystem info:
16 MDS, 4 OSS, running ZFS 0.7.0-0.3llnl and lustre 2.8.0 on a RHEL 7.2 based operating system (3.10.0-327.28.2.1chaos.ch6.x86_64).

Also worth noting, once we have a directory with files that exhibit this "bad address" error, the directory cannot be removed.

Let me know if you need more info.



 Comments   
Comment by Giuseppe Di Natale (Inactive) [ 28/Sep/16 ]

Fan Yong's Comment:

I think you have specified "-C" option when run the namespace LFSCK completely, right?
Do you have the Lustre debug log (with "lfsck" debug enabled) when LFSCK ran? That will record which inconsistency have been detected and repaired (or failure). If you have not collected related information, can you show the me the lfsck status lproc output?

Thanks!

I went ahead and attached a log file called "lfsck_namespace_state-9-28-2016.log" which was obtained by running the following on each MDS:

pdsh -g mds 'lctl get_param -n mdd.$(ldev -l | grep lquake-MDT).lfsck_namespace' | dshbak -c

Worth noting, when I restarted the filesystem, I had to stop the lfsck namespace check because the kernel panics would continue to occur because lfsck tried picking up where it left off.

Yes, we were running lfsck namespace w/ -C

Also, we are creating the name dangling issue at will at the moment with the reproduction steps I provided in my Sept 21, 2016 comment (the one with the linkme.sh). I think that still needs to be addressed.

I'm also going to break out the lfsck call stack issue to a separate ticket, it is unclear whether or not it is related.

Comment by Giuseppe Di Natale (Inactive) [ 28/Sep/16 ]

Patch was also submitted for this. Can be found at http://review.whamcloud.com/#/c/22723/

Comment by nasf (Inactive) [ 23/Oct/16 ]

The patch http://review.whamcloud.com/22723/ has been landed for Lustre-2.9

Generated at Sat Feb 10 02:19:22 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.