[LU-9857] sanityn test_35: (lmv_obd.c:2122:stripe_dirent_next()) ASSERTION( hash <= end ) Created: 10/Aug/17 Updated: 04/Jun/20 Resolved: 12/May/18 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.11.0 |
| Fix Version/s: | Lustre 2.12.0 |
| Type: | Bug | Priority: | Major |
| Reporter: | Maloo | Assignee: | Lai Siyao |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None |
| Issue Links: |
|
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
This issue was created by maloo for Jinshan Xiong <jinshan.xiong@intel.com>

This issue relates to the following test suite run: https://testing.hpdd.intel.com/test_sets/5ab79694-7d98-11e7-9e43-5254006e85c2.

The sub-test test_35 failed with the following error:

test failed to respond and timed out

Please provide additional information about the failure here.

Info required for matching: sanityn 35

23:46:36:[ 7850.159072] LustreError: 29261:0:(libcfs_fail.h:165:cfs_race()) cfs_race id 318 sleeping
23:46:36:[ 7850.166021] LustreError: 30957:0:(libcfs_fail.h:170:cfs_race()) cfs_fail_race id 318 waking
23:46:36:[ 7850.170410] LustreError: 29261:0:(libcfs_fail.h:168:cfs_race()) cfs_fail_race id 318 awake, rc=0
23:46:36:[ 7850.174782] LustreError: 29261:0:(mdc_request.c:1328:mdc_read_page()) lustre-MDT0000-mdc-ffff880067c40000: [0x200007164:0x6:0x0] lock enqueue fails: rc = -4
23:46:36:[ 7850.182752] Lustre: dir [0x2c00013a0:0xa:0x0] stripe 1 readdir failed: -4, directory is partially accessed!
23:46:36:[ 7850.191827] LustreError: 29261:0:(lmv_obd.c:2122:stripe_dirent_next()) ASSERTION( hash <= end ) failed:
23:46:36:[ 7850.196278] LustreError: 29261:0:(lmv_obd.c:2122:stripe_dirent_next()) LBUG
23:46:36:[ 7850.199151] Pid: 29261, comm: ls
23:46:36:[ 7850.202731]
23:46:36:[ 7850.202731] Call Trace:
23:46:36:[ 7850.209235] [<ffffffffa070b7ee>] libcfs_call_trace+0x4e/0x60 [libcfs]
23:46:36:[ 7850.212997] [<ffffffffa070b87c>] lbug_with_loc+0x4c/0xb0 [libcfs]
23:46:36:[ 7850.216695] [<ffffffffa095f07c>] stripe_dirent_next+0x33c/0x922 [lmv]
23:46:36:[ 7850.220405] [<ffffffffa095f8d9>] lmv_striped_read_page.isra.29+0x277/0x5e8 [lmv]
23:46:36:[ 7850.223563] [<ffffffffa094c503>] lmv_read_page+0x4e3/0x550 [lmv]
23:46:36:[ 7850.226019] [<ffffffffa0c2d6a8>] ll_get_dir_page+0xc8/0x380 [lustre]
23:46:36:[ 7850.228433] [<ffffffffa0c6bf20>] ? ll_md_blocking_ast+0x0/0x730 [lustre]
23:46:36:[ 7850.230825] [<ffffffffa0c2dc34>] ll_dir_read+0x214/0x320 [lustre]
23:46:36:[ 7850.233247] [<ffffffff81212c00>] ? filldir+0x0/0xf0
23:46:36:[ 7850.235466] [<ffffffff81212c00>] ? filldir+0x0/0xf0
23:46:36:[ 7850.237609] [<ffffffffa0c2de59>] ll_readdir+0x119/0x4f0 [lustre]
23:46:36:[ 7850.239809] [<ffffffff81212c00>] ? filldir+0x0/0xf0
23:46:36:[ 7850.241879] [<ffffffff81212af0>] vfs_readdir+0xb0/0xe0
23:46:36:[ 7850.243937] [<ffffffff81212f15>] SyS_getdents+0x95/0x120
23:46:36:[ 7850.246042] [<ffffffff81697809>] system_call_fastpath+0x16/0x1b

The kernel on the client node crashed. |
| Comments |
| Comment by Jinshan Xiong (Inactive) [ 11/Aug/17 ] |
|
another occurrence: https://testing.hpdd.intel.com/test_sets/01bb5028-7e45-11e7-96ed-5254006e85c2 |
| Comment by John Hammond [ 11/Aug/17 ] |
|
https://testing.hpdd.intel.com/test_sessions/00c6ff85-4aa2-4b28-a0ae-55073c579736

This assertion was added by https://review.whamcloud.com/27663 |
| Comment by John Hammond [ 11/Aug/17 ] |
|
Lai, I understand that this change tries to return as many of the directory entries as possible when errors occur. However, stripe_dirent_next() must still propagate errors from mdc_read_page() up to the application. Currently it discards the return value of mdc_read_page(), and none of its callers use stripe_dirent_next()'s own return value. |
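For illustration, here is a minimal standalone sketch of the error-propagation pattern John describes; all names, types, and signatures below are hypothetical stand-ins, not the actual lmv_obd.c or mdc_request.c code.

/* Illustrative only -- not Lustre code.  When a per-stripe page read fails,
 * the error must be remembered and returned to the caller rather than being
 * discarded, so the application sees e.g. -EINTR (-4) instead of a silently
 * truncated directory listing. */
#include <errno.h>
#include <stdio.h>

struct stripe {
        int cached_rc;          /* first error seen while reading this stripe */
        int entries_left;       /* stand-in for "more dirents available"      */
};

/* stand-in for mdc_read_page(): returns 0 or a negative errno */
static int read_stripe_page(struct stripe *s)
{
        if (s->entries_left == 0)
                return -EINTR;  /* simulate the interrupted lock enqueue (-4) */
        s->entries_left--;
        return 0;
}

/* stand-in for stripe_dirent_next(): must not swallow the error */
static int stripe_next(struct stripe *s)
{
        int rc = read_stripe_page(s);

        if (rc < 0)
                s->cached_rc = rc;      /* keep it so callers can report it */
        return rc;
}

int main(void)
{
        struct stripe s = { .cached_rc = 0, .entries_left = 2 };
        int rc;

        while ((rc = stripe_next(&s)) == 0)
                printf("got an entry\n");

        /* propagate the stripe error up to the application */
        fprintf(stderr, "readdir stopped: rc = %d\n", rc);
        return rc == -EINTR ? 0 : 1;
}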
| Comment by Wang Shilong (Inactive) [ 13/Aug/17 ] |
|
https://testing.hpdd.intel.com/test_sets/ff7b28d4-8023-11e7-bbc6-5254006e85c2 |
| Comment by Lai Siyao [ 14/Aug/17 ] |
|
I see, I'll work on it. |
| Comment by Gerrit Updater [ 15/Aug/17 ] |
|
Lai Siyao (lai.siyao@intel.com) uploaded a new patch: https://review.whamcloud.com/28548 |
| Comment by Steve Guminski (Inactive) [ 15/Aug/17 ] |
|
Another on master: https://testing.hpdd.intel.com/test_sessions/d7870a08-73b3-4f95-898b-f4f0908c9214 |
| Comment by Gerrit Updater [ 28/Aug/17 ] |
|
Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/28548/ |
| Comment by Peter Jones [ 28/Aug/17 ] |
|
Landed for 2.11 |
| Comment by Alex Zhuravlev [ 31/Aug/17 ] |
|
https://testing.hpdd.intel.com/test_sets/7fe3bcd4-8d99-11e7-b4f3-5254006e85c2 |
| Comment by nasf (Inactive) [ 30/Nov/17 ] |
|
+1 on master: |
| Comment by Bruno Faccini (Inactive) [ 12/Dec/17 ] |
|
+1 on latest master: https://testing.hpdd.intel.com/test_sets/8182cffe-dee7-11e7-9840-52540065bddc |
| Comment by Andreas Dilger [ 13/Dec/17 ] |
|
Just hit this again on master, dated 2017-12-09: https://testing.hpdd.intel.com/test_sets/ebf98b0c-dd34-11e7-9c63-52540065bddc |
| Comment by Lai Siyao [ 19/Dec/17 ] |
|
I don't see this assertion in these failures, only a timeout, so I'm afraid it's a different issue. Unfortunately the timeout failures don't leave any logs; I'll try to reproduce it first. |
| Comment by Bruno Faccini (Inactive) [ 18/Jan/18 ] |
|
+1 with latest master at https://testing.hpdd.intel.com/test_sets/e3801c38-fc16-11e7-bd00-52540065bddc. |
| Comment by Jian Yu [ 19/Jan/18 ] |
|
+1 on master branch: |
| Comment by nasf (Inactive) [ 22/Jan/18 ] |
|
+1 on master: |
| Comment by Jian Yu [ 08/Feb/18 ] |
|
+1 on master branch: |
| Comment by Sarah Liu [ 14/Mar/18 ] |
|
on master tag-2.10.59 https://testing.hpdd.intel.com/test_sets/38bd33e4-2786-11e8-b6a0-52540065bddc |
| Comment by James Nunez (Inactive) [ 20/Apr/18 ] |
|
All recent failures look like they are for DNE configurations. Here are some recent failures with links to logs and core files:

Logs at https://testing.hpdd.intel.com/test_sets/614c9d28-444f-11e8-95c0-52540065bddc
Logs at https://testing.hpdd.intel.com/test_sets/1cf4c6b6-42a9-11e8-8f8a-52540065bddc
https://testing.hpdd.intel.com/test_sets/1cf4c6b6-42a9-11e8-8f8a-52540065bddc |
| Comment by Lai Siyao [ 24/Apr/18 ] |
|
James, the 2nd and 3rd links are duplicates. And the 1st looks to be another issue:

[10146.583339] Lustre: DEBUG MARKER: == sanityn test 16a: 500 iterations of dual-mount fsx ================================================ 03:55:24 (1524196524)
[10252.159310] NMI watchdog: BUG: soft lockup - CPU#1 stuck for 23s! [ll_ost_io00_003:23498]
[10252.159310] Modules linked in: osp(OE) ofd(OE) lfsck(OE) ost(OE) mgc(OE) osd_zfs(OE) lquota(OE) fid(OE) fld(OE) ksocklnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) libcfs(OE) rpcsec_gss_krb5 nfsv4 dns_resolver nfs fscache rpcrdma ib_isert iscsi_target_mod ib_iser libiscsi scsi_transport_iscsi ib_srpt target_core_mod crc_t10dif crct10dif_generic ib_srp scsi_transport_srp scsi_tgt ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_core dm_mod zfs(POE) zunicode(POE) zavl(POE) icp(POE) iosf_mbi crc32_pclmul ghash_clmulni_intel zcommon(POE) znvpair(POE) ppdev spl(OE) aesni_intel lrw gf128mul glue_helper ablk_helper cryptd joydev virtio_balloon pcspkr i2c_piix4 parport_pc parport nfsd nfs_acl lockd grace auth_rpcgss sunrpc ip_tables ext4 mbcache jbd2 ata_generic pata_acpi ata_piix virtio_blk cirrus drm_kms_helper libata syscopyarea sysfillrect 8139too sysimgblt fb_sys_fops ttm crct10dif_pclmul crct10dif_common drm crc32c_intel serio_raw virtio_pci virtio_ring virtio 8139cp i2c_core mii floppy
[10252.159310] CPU: 1 PID: 23498 Comm: ll_ost_io00_003 Tainted: P OE ------------ 3.10.0-693.21.1.el7_lustre.x86_64 #1
[10252.159310] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2007
[10252.159310] task: ffff880059e56eb0 ti: ffff880059538000 task.ti: ffff880059538000
[10252.159310] RIP: 0010:[<ffffffff81337f93>] [<ffffffff81337f93>] memset+0x33/0xb0
[10252.159310] RSP: 0018:ffff88005953b820 EFLAGS: 00010212
[10252.159310] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000fbf
[10252.159310] RDX: 0000000000100000 RSI: 0000000000000000 RDI: ffffc900140ec000
[10252.159310] RBP: ffff88005953b8c0 R08: ffff880036aab690 R09: 0000000000000000
[10252.159310] R10: ffffc9001402b000 R11: ffffea000116bc00 R12: ffffffffc04a7d3b
[10252.159310] R13: ffff88005953b7b0 R14: ffff880059a5b000 R15: ffffffffc056e667
[10252.159310] FS: 0000000000000000(0000) GS:ffff88007fd00000(0000) knlGS:0000000000000000
[10252.159310] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[10252.159310] CR2: 00007fd2559ae000 CR3: 0000000079462000 CR4: 00000000000606e0
[10252.159310] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[10252.159310] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[10252.159310] Call Trace:
[10252.159310] [<ffffffffc04b2cc7>] ? dbuf_read+0x397/0x9e0 [zfs]
[10252.159310] [<ffffffffc0575128>] ? zio_done+0x748/0xd20 [zfs]
[10252.159310] [<ffffffffc056eb8c>] ? zio_destroy+0x7c/0x80 [zfs]
[10252.159310] [<ffffffffc04b66d9>] dmu_buf_will_dirty+0x119/0x130 [zfs]
[10252.159310] [<ffffffffc04bcf15>] dmu_write_impl+0x45/0xd0 [zfs]
[10252.159310] [<ffffffffc04beb57>] dmu_write.part.7+0xa7/0x110 [zfs]
[10252.159310] [<ffffffffc04bed36>] dmu_assign_arcbuf+0x156/0x1a0 [zfs]
[10252.159310] [<ffffffffc10cfdcd>] osd_write_commit+0x46d/0xa00 [osd_zfs]
[10252.159310] [<ffffffffc120b29a>] ofd_commitrw_write+0xf9a/0x1d00 [ofd]
[10252.159310] [<ffffffffc120f112>] ofd_commitrw+0x4b2/0xa10 [ofd]
[10252.159310] [<ffffffffc0c36c39>] ? lprocfs_counter_add+0xf9/0x160 [obdclass]
[10252.159310] [<ffffffffc0ef8430>] tgt_brw_write+0x1180/0x1d70 [ptlrpc]
[10252.159310] INFO: rcu_sched detected stalls on CPUs/tasks:
[10252.159310] [<ffffffffc0e4b940>] ? target_send_reply_msg+0x170/0x170 [ptlrpc]
[10252.159310] [<ffffffffc0ef994a>] tgt_request_handle+0x92a/0x13b0 [ptlrpc]
[10252.159310] [<ffffffffc0e9da53>] ptlrpc_server_handle_request+0x253/0xab0 [ptlrpc]
[10252.159310] [<ffffffff810bdc4b>] ? __wake_up_common+0x5b/0x90
[10252.159310] [<ffffffffc0ea1202>] ptlrpc_main+0xa92/0x1f40 [ptlrpc]
[10252.159310] [<ffffffffc0ea0770>] ? ptlrpc_register_service+0xe90/0xe90 [ptlrpc]
[10252.159310] [<ffffffff810b4031>] kthread+0xd1/0xe0
[10252.159310] [<ffffffff810b3f60>] ? insert_kthread_work+0x40/0x40
[10252.159310] [<ffffffff816c0577>] ret_from_fork+0x77/0xb0
[10252.159310] [<ffffffff810b3f60>] ? insert_kthread_work+0x40/0x40
[10252.159310] Code: b8 01 01 01 01 01 01 01 01 48 0f af c1 41 89 f9 41 83 e1 07 75 70 48 89 d1 48 c1 e9 06 74 39 66 0f 1f 84 00 00 00 00 00 48 ff c9 <48> 89 07 48 89 47 08 48 89 47 10 48 89 47 18 48 89 47 20 48 89
[10252.159310] Kernel panic - not syncing: softlockup: hung tasks
[10274.555013] { 1[10252.159310] CPU: 1 PID: 23498 Comm: ll_ost_io00_003 Tainted: P OEL ------------ 3.10.0-693.21.1.el7_lustre.x86_64 #1
[10252.159310] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2007
[10252.159310] Call Trace:
[10252.159310] <IRQ> [<ffffffff816ae7c8>] dump_stack+0x19/0x1b
[10252.159310] [<ffffffff816a8634>] panic+0xe8/0x21f
[10252.159310] [<ffffffff8102d7cf>] ? show_regs+0x5f/0x210
[10252.159310] [<ffffffff811334e1>] watchdog_timer_fn+0x231/0x240
[10252.159310] [<ffffffff811332b0>] ? watchdog+0x40/0x40
[10252.159310] [<ffffffff810b8196>] __hrtimer_run_queues+0xd6/0x260
[10252.159310] [<ffffffff810b872f>] hrtimer_interrupt+0xaf/0x1d0
[10252.159310] [<ffffffff8105467b>] local_apic_timer_interrupt+0x3b/0x60
[10252.159310] [<ffffffff816c4e73>] smp_apic_timer_interrupt+0x43/0x60
[10252.159310] [<ffffffff816c1732>] apic_timer_interrupt+0x162/0x170
[10252.159310] <EOI> [<ffffffff81337f93>] ? memset+0x33/0xb0
[10252.159310] [<ffffffffc04b2cc7>] ? dbuf_read+0x397/0x9e0 [zfs]
[10252.159310] [<ffffffffc0575128>] ? zio_done+0x748/0xd20 [zfs]
[10252.159310] [<ffffffffc056eb8c>] ? zio_destroy+0x7c/0x80 [zfs]
[10252.159310] [<ffffffffc04b66d9>] dmu_buf_will_dirty+0x119/0x130 [zfs]
[10252.159310] [<ffffffffc04bcf15>] dmu_write_impl+0x45/0xd0 [zfs]
[10252.159310] [<ffffffffc04beb57>] dmu_write.part.7+0xa7/0x110 [zfs]
[10252.159310] [<ffffffffc04bed36>] dmu_assign_arcbuf+0x156/0x1a0 [zfs]
[10252.159310] [<ffffffffc10cfdcd>] osd_write_commit+0x46d/0xa00 [osd_zfs]
[10252.159310] [<ffffffffc120b29a>] ofd_commitrw_write+0xf9a/0x1d00 [ofd]
[10252.159310] [<ffffffffc120f112>] ofd_commitrw+0x4b2/0xa10 [ofd]
[10252.159310] [<ffffffffc0c36c39>] ? lprocfs_counter_add+0xf9/0x160 [obdclass]
[10252.159310] [<ffffffffc0ef8430>] tgt_brw_write+0x1180/0x1d70 [ptlrpc]
[10252.159310] [<ffffffffc0e4b940>] ? target_send_reply_msg+0x170/0x170 [ptlrpc]
[10252.159310] [<ffffffffc0ef994a>] tgt_request_handle+0x92a/0x13b0 [ptlrpc]
[10252.159310] [<ffffffffc0e9da53>] ptlrpc_server_handle_request+0x253/0xab0 [ptlrpc]
[10252.159310] [<ffffffff810bdc4b>] ? __wake_up_common+0x5b/0x90
[10252.159310] [<ffffffffc0ea1202>] ptlrpc_main+0xa92/0x1f40 [ptlrpc]
[10252.159310] [<ffffffffc0ea0770>] ? ptlrpc_register_service+0xe90/0xe90 [ptlrpc]
[10252.159310] [<ffffffff810b4031>] kthread+0xd1/0xe0
[10252.159310] [<ffffffff810b3f60>] ? insert_kthread_work+0x40/0x40
[10252.159310] [<ffffffff816c0577>] ret_from_fork+0x77/0xb0
[10252.159310] [<ffffffff810b3f60>] ? insert_kthread_work+0x40/0x40
[10274.555013] } (detected by 0, t=60299 jiffies, g=942723, c=942722, q=34)
[10274.555013] Task dump for CPU 1:
[10274.555013] ll_ost_io00_003 R running task 0 23498 2 0x00000088
[10274.555013] Call Trace:
[10274.555013] [<ffffffffc04bed36>] ? dmu_assign_arcbuf+0x156/0x1a0 [zfs]
[10274.555013] [<ffffffffc10cfdcd>] ? osd_write_commit+0x46d/0xa00 [osd_zfs]
[10274.555013] [<ffffffffc120b29a>] ? ofd_commitrw_write+0xf9a/0x1d00 [ofd]
[10274.555013] [<ffffffffc120f112>] ? ofd_commitrw+0x4b2/0xa10 [ofd]
[10274.555013] [<ffffffffc0c36c39>] ? lprocfs_counter_add+0xf9/0x160 [obdclass]
[10274.555013] [<ffffffffc0ef8430>] ? tgt_brw_write+0x1180/0x1d70 [ptlrpc]
[10274.555013] [<ffffffffc0e4b940>] ? target_send_reply_msg+0x170/0x170 [ptlrpc]
[10274.555013] [<ffffffffc0ef994a>] ? tgt_request_handle+0x92a/0x13b0 [ptlrpc]
[10274.555013] [<ffffffffc0e9da53>] ? ptlrpc_server_handle_request+0x253/0xab0 [ptlrpc]
[10274.555013] [<ffffffff810bdc4b>] ? __wake_up_common+0x5b/0x90
[10274.555013] [<ffffffffc0ea1202>] ? ptlrpc_main+0xa92/0x1f40 [ptlrpc]
[10274.555013] [<ffffffffc0ea0770>] ? ptlrpc_register_service+0xe90/0xe90 [ptlrpc]
[10274.555013] [<ffffffff810b4031>] ? kthread+0xd1/0xe0
[10274.555013] [<ffffffff810b3f60>] ? insert_kthread_work+0x40/0x40
[10274.555013] [<ffffffff816c0577>] ? ret_from_fork+0x77/0xb0
[10274.555013] [<ffffffff810b3f60>] ? insert_kthread_work+0x40/0x40
|
| Comment by Lai Siyao [ 26/Apr/18 ] |
|
I found something strange in the dump:

crash> struct lu_dirpage 0xffff8800a6631000
struct lu_dirpage {
ldp_hash_start = 16131858539135107072,
ldp_hash_end = 8850982375266543352,
ldp_flags = 0,
ldp_pad0 = 0,
ldp_entries = 0xffff8800a6631018
}
This shows that in one of the directory page reads, the start hash is larger than the end hash, which triggers the LASSERT in the code. I've added some debug code and am trying to reproduce it. |
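To make the failure mode concrete, here is a minimal standalone sketch of the invariant behind ASSERTION( hash <= end ), using the start/end values from the dump above. The struct and function names are illustrative stand-ins, not the actual lu_dirpage/lmv code.

/* Illustrative only -- not the Lustre source.  Each dirent hash taken from a
 * directory page is expected to fall inside that page's [start, end] window;
 * a page whose start hash exceeds its end hash (as in the dump) makes the
 * "hash <= end" check impossible to satisfy, so the program below aborts. */
#include <assert.h>
#include <stdint.h>

struct dirpage_window {                          /* stand-in for struct lu_dirpage */
        uint64_t hash_start;
        uint64_t hash_end;
};

static void check_entry(const struct dirpage_window *dp, uint64_t hash)
{
        assert(dp->hash_start <= dp->hash_end);  /* the precondition the dump violates */
        assert(hash >= dp->hash_start);
        assert(hash <= dp->hash_end);            /* analogue of ASSERTION( hash <= end ) */
}

int main(void)
{
        /* values copied from the crash dump above */
        struct dirpage_window dp = {
                .hash_start = 16131858539135107072ULL,
                .hash_end   = 8850982375266543352ULL,
        };

        check_entry(&dp, dp.hash_start);         /* trips the first assert */
        return 0;
}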
| Comment by Gerrit Updater [ 27/Apr/18 ] |
|
Lai Siyao (lai.siyao@intel.com) uploaded a new patch: https://review.whamcloud.com/32180 |
| Comment by Gerrit Updater [ 12/May/18 ] |
|
Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/32180/ |
| Comment by Peter Jones [ 12/May/18 ] |
|
Landed for 2.12 |
| Comment by Gerrit Updater [ 04/Jun/20 ] |
|
Olaf Faaland-LLNL (faaland1@llnl.gov) uploaded a new patch: https://review.whamcloud.com/38838 |
| Comment by Gerrit Updater [ 04/Jun/20 ] |
|
Olaf Faaland-LLNL (faaland1@llnl.gov) uploaded a new patch: https://review.whamcloud.com/38839 |