[LU-8647] lfsck_namespace_double_scan()) ASSERTION( list_empty(&lad->lad_req_list) ) failed Created: 28/Sep/16 Updated: 18/Oct/17 Resolved: 23/Oct/16 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.8.0 |
| Fix Version/s: | Lustre 2.9.0 |
| Type: | Bug | Priority: | Critical |
| Reporter: | Giuseppe Di Natale (Inactive) | Assignee: | nasf (Inactive) |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | llnl | ||
| Environment: |
3.10.0-327.28.2.1chaos.ch6.x86_64 |
||
| Attachments: |
|
||||||||||||||||
| Issue Links: |
|
||||||||||||||||
| Severity: | 3 | ||||||||||||||||
| Rank (Obsolete): | 9223372036854775807 | ||||||||||||||||
| Description |
|
Ran an lfsck namespace with -C and got the following LBUG on multiple MDTs. 2016-09-22 10:04:23 [493341.943717] LustreError: 127771:0:(lfsck_namespace.c:4452:lfsck_namespace_double_scan()) ASSERTION( list_empty(&lad->lad_req_list) ) failed: 2016-09-22 10:04:23 [493341.958848] LustreError: 127771:0:(lfsck_namespace.c:4452:lfsck_namespace_double_scan()) LBUG 2016-09-22 10:04:23 [493341.968781] Pid: 127771, comm: lfsck Have the following call stack on two MDTs. 2016-09-22 10:03:52 Sep 22 10:03:52 [493315.464373] Kernel panic - not syncing: LBUG 2016-09-22 10:03:52 jet6 kernel: [49[493315.470430] CPU: 2 PID: 111809 Comm: lfsck Tainted: P OE ------------ 3.10.0-327.28.2.1chaos.ch6.x86_64 #1 2016-09-22 10:03:52 3315.297027] Lus[493315.484175] Hardware name: Intel Corporation S2600WTTR/S2600WTTR, BIOS SE5C610.86B.01.01.0016.033120161139 03/31/2016 2016-09-22 10:03:52 treError: 111809[493315.497715] ffffffffa079be0f 0000000055805053 ffff882757e4fc78 ffffffff8164cae7 2016-09-22 10:03:52 :0:(lfsck_namesp[493315.507701] ffff882757e4fcf8 ffffffff81645adf ffffffff00000008 ffff882757e4fd08 2016-09-22 10:03:52 ace.c:4452:lfsck[493315.517684] ffff882757e4fca8 0000000055805053 ffffffffa1070e70 0000000000000246 2016-09-22 10:03:52 _namespace_doubl[493315.527666] Call Trace: 2016-09-22 10:03:52 e_scan()) ASSERT[493315.532060] [<ffffffff8164cae7>] dump_stack+0x19/0x1b 2016-09-22 10:03:52 ION( list_empty([493315.539478] [<ffffffff81645adf>] panic+0xd8/0x1e7 2016-09-22 10:03:52 &lad->lad_req_li[493315.546501] [<ffffffffa077fdeb>] lbug_with_loc+0xab/0xc0 [libcfs] 2016-09-22 10:03:52 st) ) failed: 2016-09-22 10:03:52 [493315.555082] [<ffffffffa102c2a6>] lfsck_namespace_double_scan+0x106/0x140 [lfsck] 2016-09-22 10:03:52 Sep 22 10:03:52 [493315.565122] [<ffffffffa10234f9>] lfsck_double_scan+0x59/0x200 [lfsck] 2016-09-22 10:03:52 jet6 kernel: [49[493315.574086] [<ffffffffa0d88fc4>] ? osd_zfs_otable_it_fini+0x64/0x110 [osd_zfs] 2016-09-22 10:03:52 3315.311863] Lus[493315.583931] [<ffffffffa0d88fc4>] ? osd_zfs_otable_it_fini+0x64/0x110 [osd_zfs] 2016-09-22 10:03:52 treError: 111809[493315.593765] [<ffffffff811c8bad>] ? kfree+0x12d/0x170 2016-09-22 10:03:52 :0:(lfsck_namesp[493315.601075] [<ffffffffa1028044>] lfsck_master_engine+0x434/0x1310 [lfsck] 2016-09-22 10:03:52 ace.c:4452:lfsck[493315.610415] [<ffffffff81015588>] ? __switch_to+0xf8/0x4d0 2016-09-22 10:03:52 _namespace_doubl[493315.618212] [<ffffffff810bd4f0>] ? wake_up_state+0x20/0x20 2016-09-22 10:03:52 e_scan()) LBUG 2016-09-22 10:03:52 [493315.626108] [<ffffffffa1027c10>] ? lfsck_master_oit_engine+0x1430/0x1430 [lfsck] 2016-09-22 10:03:52 [493315.636145] [<ffffffff810a99bf>] kthread+0xcf/0xe0 2016-09-22 10:03:52 [493315.642238] [<ffffffff810a98f0>] ? kthread_create_on_node+0x140/0x140 2016-09-22 10:03:52 [493315.650187] [<ffffffff8165d9d8>] ret_from_fork+0x58/0x90 2016-09-22 10:03:52 [493315.656864] [<ffffffff810a98f0>] ? kthread_create_on_node+0x140/0x140 2016-09-22 10:03:52 [493315.711916] drm_kms_helper: panic occurred, switching back to text console 2016-09-22 10:03:52 [493315.720378] ------------[ cut here ]------------ 2016-09-22 10:03:52 [493315.726202] WARNING: at arch/x86/kernel/smp.c:124 native_smp_send_reschedule+0x5f/0x70() 2016-09-22 10:03:52 [493315.735902] Modules linked in: osp(OE) mdd(OE) lod(OE) mdt(OE) lfsck(OE) mgc(OE) osd_zfs(OE) lquota(OE) fid(OE) fld(OE) ptlrpc(OE) obdclass(OE) rpcsec_gss_krb5 ko2iblnd(OE) lnet(OE) sha512_generic crypto_null libcfs(OE) nfsv3 iTCO_wdt iTCO_vendor_support intel_powerclamp coretemp intel_rapl kvm mlx5_ib pcspkr mlx5_core sb_edac lpc_ich edac_core mfd_core mei_me mei zfs(POE) zunicode(POE) zavl(POE) zcommon(POE) znvpair(POE) ses enclosure ipmi_devintf spl(OE) zlib_deflate sg i2c_i801 ioatdma shpchp ipmi_si ipmi_msghandler acpi_power_meter acpi_cpufreq binfmt_misc ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_sa ib_mad ib_core ib_addr nfsd nfs_acl ip_tables auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache dm_round_robin sd_mod crc_t10dif crct10dif_generic crct10dif_pclmul crct10dif_common crc32_pclmul mgag200 crc32c_intel syscopyarea sysfillrect sysimgblt ghash_clmulni_intel i2c_algo_bit drm_kms_helper mxm_wmi ttm aesni_intel ixgbe lrw gf128mul ahci drm dca glue_helper mpt3sas libahci ptp i2c_core ablk_helper cryptd libata raid_class pps_core scsi_transport_sas mdio wmi sunrpc dm_mirror dm_region_hash dm_log scsi_transport_iscsi dm_multipath dm_mod 2016-09-22 10:03:52 [493315.859970] CPU: 2 PID: 0 Comm: swapper/2 Tainted: P OE ------------ 3.10.0-327.28.2.1chaos.ch6.x86_64 #1 2016-09-22 10:03:52 [493315.872734] Hardware name: Intel Corporation S2600WTTR/S2600WTTR, BIOS SE5C610.86B.01.01.0016.033120161139 03/31/2016 2016-09-22 10:03:53 [493315.885407] 0000000000000000 bcf7d7e5812e0014 ffff883f7e683d78 ffffffff8164cae7 2016-09-22 10:03:53 [493315.894536] ffff883f7e683db0 ffffffff8107d6d0 0000000000000000 ffff883f7e6967c0 2016-09-22 10:03:53 [493315.903668] 000000011d5cacb8 ffff883f7e6167c0 0000000000000002 ffff883f7e683dc0 2016-09-22 10:03:53 [493315.912796] Call Trace: 2016-09-22 10:03:53 [493315.916347] <IRQ> [<ffffffff8164cae7>] dump_stack+0x19/0x1b 2016-09-22 10:03:53 [493315.923621] [<ffffffff8107d6d0>] warn_slowpath_common+0x70/0xb0 2016-09-22 10:03:53 [493315.931168] [<ffffffff8107d81a>] warn_slowpath_null+0x1a/0x20 2016-09-22 10:03:53 [493315.938512] [<ffffffff81048fdf>] native_smp_send_reschedule+0x5f/0x70 2016-09-22 10:03:53 [493315.946646] [<ffffffff810cb04d>] trigger_load_balance+0x18d/0x250 2016-09-22 10:03:53 [493315.954390] [<ffffffff810bbdd3>] scheduler_tick+0x103/0x150 2016-09-22 10:03:53 [493315.961553] [<ffffffff810e5800>] ? tick_sched_handle.isra.14+0x60/0x60 2016-09-22 10:03:53 [493315.969775] [<ffffffff81091a06>] update_process_times+0x66/0x80 2016-09-22 10:03:53 [493315.977304] [<ffffffff810e57c5>] tick_sched_handle.isra.14+0x25/0x60 2016-09-22 10:03:53 [493315.985310] [<ffffffff810e5841>] tick_sched_timer+0x41/0x70 2016-09-22 10:03:53 [493315.992432] [<ffffffff810adeda>] __hrtimer_run_queues+0xea/0x2c0 2016-09-22 10:03:53 [493316.000042] [<ffffffff810ae4e0>] hrtimer_interrupt+0xb0/0x1e0 2016-09-22 10:03:53 [493316.007351] [<ffffffff8104be47>] local_apic_timer_interrupt+0x37/0x60 2016-09-22 10:03:53 [493316.015442] [<ffffffff8166000f>] smp_apic_timer_interrupt+0x3f/0x60 2016-09-22 10:03:53 [493316.023338] [<ffffffff8165e6dd>] apic_timer_interrupt+0x6d/0x80 2016-09-22 10:03:53 [493316.030848] <EOI> [<ffffffff810dd69c>] ? ktime_get+0x4c/0xd0 2016-09-22 10:03:53 [493316.038194] [<ffffffff810b8da6>] ? finish_task_switch+0x56/0x180 2016-09-22 10:03:53 [493316.045803] [<ffffffff81651df0>] __schedule+0x2e0/0x940 2016-09-22 10:03:53 [493316.052533] [<ffffffff81653709>] schedule_preempt_disabled+0x39/0x90 2016-09-22 10:03:53 [493316.060533] [<ffffffff810db1f4>] cpu_startup_entry+0x184/0x2d0 2016-09-22 10:03:53 [493316.067949] [<ffffffff81049eea>] start_secondary+0x1ca/0x240 2016-09-22 10:03:53 [493316.075162] ---[ end trace 28897805122ddeee ]--- Filesystem info: Also worth noting, once we have a directory with files that exhibit this "bad address" error, the directory cannot be removed. Let me know if you need more info. |
| Comments |
| Comment by Giuseppe Di Natale (Inactive) [ 28/Sep/16 ] |
|
Fan Yong's Comment: I think you have specified "-C" option when run the namespace LFSCK completely, right? Do you have the Lustre debug log (with "lfsck" debug enabled) when LFSCK ran? That will record which inconsistency have been detected and repaired (or failure). If you have not collected related information, can you show the me the lfsck status lproc output? Thanks! I went ahead and attached a log file called "lfsck_namespace_state-9-28-2016.log" which was obtained by running the following on each MDS: pdsh -g mds 'lctl get_param -n mdd.$(ldev -l | grep lquake-MDT).lfsck_namespace' | dshbak -c Worth noting, when I restarted the filesystem, I had to stop the lfsck namespace check because the kernel panics would continue to occur because lfsck tried picking up where it left off. Yes, we were running lfsck namespace w/ -C Also, we are creating the name dangling issue at will at the moment with the reproduction steps I provided in my Sept 21, 2016 comment (the one with the linkme.sh). I think that still needs to be addressed. I'm also going to break out the lfsck call stack issue to a separate ticket, it is unclear whether or not it is related. |
| Comment by Giuseppe Di Natale (Inactive) [ 28/Sep/16 ] |
|
Patch was also submitted for this. Can be found at http://review.whamcloud.com/#/c/22723/ |
| Comment by nasf (Inactive) [ 23/Oct/16 ] |
|
The patch http://review.whamcloud.com/22723/ has been landed for Lustre-2.9 |