[LU-9857] sanityn test_35: (lmv_obd.c:2122:stripe_dirent_next()) ASSERTION( hash <= end ) Created: 10/Aug/17  Updated: 04/Jun/20  Resolved: 12/May/18

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.11.0
Fix Version/s: Lustre 2.12.0

Type: Bug Priority: Major
Reporter: Maloo Assignee: Lai Siyao
Resolution: Fixed Votes: 0
Labels: None

Issue Links:
Related
is related to LU-9923 lmv does not propagate read page erro... Open
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

This issue was created by maloo for Jinshan Xiong <jinshan.xiong@intel.com>

This issue relates to the following test suite run: https://testing.hpdd.intel.com/test_sets/5ab79694-7d98-11e7-9e43-5254006e85c2.

The sub-test test_35 failed with the following error:

test failed to respond and timed out

Please provide additional information about the failure here.

Info required for matching: sanityn 35

23:46:36:[ 7850.159072] LustreError: 29261:0:(libcfs_fail.h:165:cfs_race()) cfs_race id 318 sleeping
23:46:36:[ 7850.166021] LustreError: 30957:0:(libcfs_fail.h:170:cfs_race()) cfs_fail_race id 318 waking
23:46:36:[ 7850.170410] LustreError: 29261:0:(libcfs_fail.h:168:cfs_race()) cfs_fail_race id 318 awake, rc=0
23:46:36:[ 7850.174782] LustreError: 29261:0:(mdc_request.c:1328:mdc_read_page()) lustre-MDT0000-mdc-ffff880067c40000: [0x200007164:0x6:0x0] lock enqueue fails: rc = -4
23:46:36:[ 7850.182752] Lustre: dir [0x2c00013a0:0xa:0x0] stripe 1 readdir failed: -4, directory is partially accessed!
23:46:36:[ 7850.191827] LustreError: 29261:0:(lmv_obd.c:2122:stripe_dirent_next()) ASSERTION( hash <= end ) failed: 
23:46:36:[ 7850.196278] LustreError: 29261:0:(lmv_obd.c:2122:stripe_dirent_next()) LBUG
23:46:36:[ 7850.199151] Pid: 29261, comm: ls
23:46:36:[ 7850.202731] Call Trace:
23:46:36:[ 7850.209235]  [<ffffffffa070b7ee>] libcfs_call_trace+0x4e/0x60 [libcfs]
23:46:36:[ 7850.212997]  [<ffffffffa070b87c>] lbug_with_loc+0x4c/0xb0 [libcfs]
23:46:36:[ 7850.216695]  [<ffffffffa095f07c>] stripe_dirent_next+0x33c/0x922 [lmv]
23:46:36:[ 7850.220405]  [<ffffffffa095f8d9>] lmv_striped_read_page.isra.29+0x277/0x5e8 [lmv]
23:46:36:[ 7850.223563]  [<ffffffffa094c503>] lmv_read_page+0x4e3/0x550 [lmv]
23:46:36:[ 7850.226019]  [<ffffffffa0c2d6a8>] ll_get_dir_page+0xc8/0x380 [lustre]
23:46:36:[ 7850.228433]  [<ffffffffa0c6bf20>] ? ll_md_blocking_ast+0x0/0x730 [lustre]
23:46:36:[ 7850.230825]  [<ffffffffa0c2dc34>] ll_dir_read+0x214/0x320 [lustre]
23:46:36:[ 7850.233247]  [<ffffffff81212c00>] ? filldir+0x0/0xf0
23:46:36:[ 7850.235466]  [<ffffffff81212c00>] ? filldir+0x0/0xf0
23:46:36:[ 7850.237609]  [<ffffffffa0c2de59>] ll_readdir+0x119/0x4f0 [lustre]
23:46:36:[ 7850.239809]  [<ffffffff81212c00>] ? filldir+0x0/0xf0
23:46:36:[ 7850.241879]  [<ffffffff81212af0>] vfs_readdir+0xb0/0xe0
23:46:36:[ 7850.243937]  [<ffffffff81212f15>] SyS_getdents+0x95/0x120
23:46:36:[ 7850.246042]  [<ffffffff81697809>] system_call_fastpath+0x16/0x1b

The kernel on the client node crashed.
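For context, the trace shows lmv_striped_read_page() merging per-stripe directory pages in hash order while stripe_dirent_next() advances one stripe's cursor; the failed assertion enforces that an entry's hash never exceeds the end of that stripe's current page window. Below is a minimal userspace sketch of that invariant. The names mirror the stack trace, but the types and logic are reduced stand-ins for illustration only, not the actual lmv_obd.c code:

#include <assert.h>
#include <stdint.h>

struct stripe_stub {
	const uint64_t *hashes; /* entry hashes in this stripe's page */
	int nr, cur;            /* entry count and cursor within the page */
	uint64_t end;           /* page hash window end (ldp_hash_end) */
};

/* advance one stripe's cursor; the check mirrors LASSERT(hash <= end) */
static uint64_t stripe_next(struct stripe_stub *s)
{
	uint64_t hash = s->hashes[s->cur++];

	assert(hash <= s->end);
	return hash;
}

int main(void)
{
	static const uint64_t page[] = { 10, 20, 30 };
	struct stripe_stub s = { .hashes = page, .nr = 3, .cur = 0, .end = 25 };

	stripe_next(&s);   /* 10: within the window */
	stripe_next(&s);   /* 20: within the window */
	stripe_next(&s);   /* 30 > end: aborts, analogous to the LBUG */
	return 0;
}

With a bogus page window (an end hash smaller than an entry's hash), the third call aborts just as the LASSERT does on the client.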



 Comments   
Comment by Jinshan Xiong (Inactive) [ 11/Aug/17 ]

another occurrence: https://testing.hpdd.intel.com/test_sets/01bb5028-7e45-11e7-96ed-5254006e85c2

Comment by John Hammond [ 11/Aug/17 ]

https://testing.hpdd.intel.com/test_sessions/00c6ff85-4aa2-4b28-a0ae-55073c579736

This assertion was added by https://review.whamcloud.com/27663 LU-5106 readdir: improve striped readdir.

Comment by John Hammond [ 11/Aug/17 ]

Lai, I understand that this change is trying to return as many of the directory entries as possible in case of errors, but stripe_dirent_next() must still propagate errors from mdc_read_page() up to the application. Currently it discards the return value of mdc_read_page(), and none of its callers use its own return value.
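For illustration, a minimal userspace sketch of the propagation pattern being asked for. read_page() and next_entry() are hypothetical stand-ins for mdc_read_page() and stripe_dirent_next(); this is not the actual Lustre code:

#include <errno.h>
#include <stddef.h>

struct dirent_stub { unsigned long long hash; };

struct stripe_it {
	struct dirent_stub *ent; /* current entry, NULL when the page is exhausted */
	int rc;                  /* sticky error from the last page read */
};

/* stand-in for mdc_read_page(); simulates the -EINTR (rc = -4)
 * lock-enqueue failure seen in the console log above */
static int read_page(struct stripe_it *it)
{
	it->ent = NULL;
	return -EINTR;
}

/* stand-in for stripe_dirent_next(): record the read error instead
 * of discarding it, so callers can pass it up to the application */
static struct dirent_stub *next_entry(struct stripe_it *it)
{
	if (it->ent == NULL && it->rc == 0)
		it->rc = read_page(it);   /* keep rc; caller checks it->rc */
	return it->ent;                   /* NULL with it->rc != 0 => error */
}

int main(void)
{
	struct stripe_it it = { NULL, 0 };

	if (next_entry(&it) == NULL && it.rc != 0)
		return -it.rc;   /* exits 4: the error reached the caller */
	return 0;
}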

Comment by Wang Shilong (Inactive) [ 13/Aug/17 ]

https://testing.hpdd.intel.com/test_sets/ff7b28d4-8023-11e7-bbc6-5254006e85c2

Comment by Lai Siyao [ 14/Aug/17 ]

I see, I'll work on it.

Comment by Gerrit Updater [ 15/Aug/17 ]

Lai Siyao (lai.siyao@intel.com) uploaded a new patch: https://review.whamcloud.com/28548
Subject: LU-9857 lmv: stripe dir page may be released mistakenly
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: ff1e9e251f027c044d7853ff920dd31c4b3d1888

Comment by Steve Guminski (Inactive) [ 15/Aug/17 ]

Another on master:

https://testing.hpdd.intel.com/test_sessions/d7870a08-73b3-4f95-898b-f4f0908c9214

Comment by Gerrit Updater [ 28/Aug/17 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/28548/
Subject: LU-9857 lmv: stripe dir page may be released mistakenly
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: beadbad429aeffa66898bd62cc99e76290644ca8

Comment by Peter Jones [ 28/Aug/17 ]

Landed for 2.11

Comment by Alex Zhuravlev [ 31/Aug/17 ]

https://testing.hpdd.intel.com/test_sets/7fe3bcd4-8d99-11e7-b4f3-5254006e85c2

Comment by nasf (Inactive) [ 30/Nov/17 ]

+1 on master:
https://testing.hpdd.intel.com/test_sessions/e9580a21-afc8-493a-90f6-be7f0e47bc94
https://testing.hpdd.intel.com/test_logs/ca1d75b8-d53f-11e7-8027-52540065bddc/show_text

Comment by Bruno Faccini (Inactive) [ 12/Dec/17 ]

+1 on latest master : https://testing.hpdd.intel.com/test_sets/8182cffe-dee7-11e7-9840-52540065bddc

Comment by Andreas Dilger [ 13/Dec/17 ]

Just hit this again on master, dated 2017-12-09:

https://testing.hpdd.intel.com/test_sets/ebf98b0c-dd34-11e7-9c63-52540065bddc

Comment by Lai Siyao [ 19/Dec/17 ]

I don't see this assertion in all of the recent failures, only timeouts, so I'm afraid those are a different issue. Unluckily these timeout failures don't leave any logs; I'll try to reproduce it first.

Comment by Bruno Faccini (Inactive) [ 18/Jan/18 ]

+1 with latest master at https://testing.hpdd.intel.com/test_sets/e3801c38-fc16-11e7-bd00-52540065bddc.
Lai, to find the LBUG, just look for sanityn/test_35 related output in the client's console log.

Comment by Jian Yu [ 19/Jan/18 ]

+1 on master branch:
https://testing.hpdd.intel.com/test_sets/49b96ede-fcdd-11e7-a7cd-52540065bddc

Comment by nasf (Inactive) [ 22/Jan/18 ]

+1 on master:
https://testing.hpdd.intel.com/test_sets/44775e1e-fe03-11e7-a10a-52540065bddc

Comment by Jian Yu [ 08/Feb/18 ]

+1 on master branch:
https://testing.hpdd.intel.com/test_sets/6667a692-0cd5-11e8-bd00-52540065bddc

Comment by Sarah Liu [ 14/Mar/18 ]

on master tag-2.10.59

https://testing.hpdd.intel.com/test_sets/38bd33e4-2786-11e8-b6a0-52540065bddc

Comment by James Nunez (Inactive) [ 20/Apr/18 ]

All recent failures look like they are for DNE configurations. Here are some recent failures with links to logs and core files:

Logs at https://testing.hpdd.intel.com/test_sets/614c9d28-444f-11e8-95c0-52540065bddc
Core file can be found in /scratch/dumps/trevis-4vm8.trevis.hpdd.intel.com/10.9.4.36-2018-04-20-03:58:14

Logs at https://testing.hpdd.intel.com/test_sets/1cf4c6b6-42a9-11e8-8f8a-52540065bddc
Core file can be found in /scratch/dumps/trevis-44vm1.trevis.hpdd.intel.com/10.9.6.10-2018-04-18-01:35:14

https://testing.hpdd.intel.com/test_sets/1cf4c6b6-42a9-11e8-8f8a-52540065bddc
Core file can be found in /scratch/dumps/trevis-44vm1.trevis.hpdd.intel.com/10.9.6.10-2018-04-18-01:35:14

Comment by Lai Siyao [ 24/Apr/18 ]

James, the links for the 2nd and the 3rd are duplicates, and the 1st looks to be a different issue:

[10146.583339] Lustre: DEBUG MARKER: == sanityn test 16a: 500 iterations of dual-mount fsx ================================================ 03:55:24 (1524196524)
[10252.159310] NMI watchdog: BUG: soft lockup - CPU#1 stuck for 23s! [ll_ost_io00_003:23498]
[10252.159310] Modules linked in: osp(OE) ofd(OE) lfsck(OE) ost(OE) mgc(OE) osd_zfs(OE) lquota(OE) fid(OE) fld(OE) ksocklnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) libcfs(OE) rpcsec_gss_krb5 nfsv4 dns_resolver nfs fscache rpcrdma ib_isert iscsi_target_mod ib_iser libiscsi scsi_transport_iscsi ib_srpt target_core_mod crc_t10dif crct10dif_generic ib_srp scsi_transport_srp scsi_tgt ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_core dm_mod zfs(POE) zunicode(POE) zavl(POE) icp(POE) iosf_mbi crc32_pclmul ghash_clmulni_intel zcommon(POE) znvpair(POE) ppdev spl(OE) aesni_intel lrw gf128mul glue_helper ablk_helper cryptd joydev virtio_balloon pcspkr i2c_piix4 parport_pc parport nfsd nfs_acl lockd grace auth_rpcgss sunrpc ip_tables ext4 mbcache jbd2 ata_generic pata_acpi ata_piix virtio_blk cirrus drm_kms_helper libata syscopyarea sysfillrect 8139too sysimgblt fb_sys_fops ttm crct10dif_pclmul crct10dif_common drm crc32c_intel serio_raw virtio_pci virtio_ring virtio 8139cp i2c_core mii floppy
[10252.159310] CPU: 1 PID: 23498 Comm: ll_ost_io00_003 Tainted: P           OE  ------------   3.10.0-693.21.1.el7_lustre.x86_64 #1
[10252.159310] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2007
[10252.159310] task: ffff880059e56eb0 ti: ffff880059538000 task.ti: ffff880059538000
[10252.159310] RIP: 0010:[<ffffffff81337f93>]  [<ffffffff81337f93>] memset+0x33/0xb0
[10252.159310] RSP: 0018:ffff88005953b820  EFLAGS: 00010212
[10252.159310] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000fbf
[10252.159310] RDX: 0000000000100000 RSI: 0000000000000000 RDI: ffffc900140ec000
[10252.159310] RBP: ffff88005953b8c0 R08: ffff880036aab690 R09: 0000000000000000
[10252.159310] R10: ffffc9001402b000 R11: ffffea000116bc00 R12: ffffffffc04a7d3b
[10252.159310] R13: ffff88005953b7b0 R14: ffff880059a5b000 R15: ffffffffc056e667
[10252.159310] FS:  0000000000000000(0000) GS:ffff88007fd00000(0000) knlGS:0000000000000000
[10252.159310] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[10252.159310] CR2: 00007fd2559ae000 CR3: 0000000079462000 CR4: 00000000000606e0
[10252.159310] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[10252.159310] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[10252.159310] Call Trace:
[10252.159310]  [<ffffffffc04b2cc7>] ? dbuf_read+0x397/0x9e0 [zfs]
[10252.159310]  [<ffffffffc0575128>] ? zio_done+0x748/0xd20 [zfs]
[10252.159310]  [<ffffffffc056eb8c>] ? zio_destroy+0x7c/0x80 [zfs]
[10252.159310]  [<ffffffffc04b66d9>] dmu_buf_will_dirty+0x119/0x130 [zfs]
[10252.159310]  [<ffffffffc04bcf15>] dmu_write_impl+0x45/0xd0 [zfs]
[10252.159310]  [<ffffffffc04beb57>] dmu_write.part.7+0xa7/0x110 [zfs]
[10252.159310]  [<ffffffffc04bed36>] dmu_assign_arcbuf+0x156/0x1a0 [zfs]
[10252.159310]  [<ffffffffc10cfdcd>] osd_write_commit+0x46d/0xa00 [osd_zfs]
[10252.159310]  [<ffffffffc120b29a>] ofd_commitrw_write+0xf9a/0x1d00 [ofd]
[10252.159310]  [<ffffffffc120f112>] ofd_commitrw+0x4b2/0xa10 [ofd]
[10252.159310]  [<ffffffffc0c36c39>] ? lprocfs_counter_add+0xf9/0x160 [obdclass]
[10252.159310]  [<ffffffffc0ef8430>] tgt_brw_write+0x1180/0x1d70 [ptlrpc]
[10252.159310] INFO: rcu_sched detected stalls on CPUs/tasks:
[10252.159310]  [<ffffffffc0e4b940>] ? target_send_reply_msg+0x170/0x170 [ptlrpc]
[10252.159310]  [<ffffffffc0ef994a>] tgt_request_handle+0x92a/0x13b0 [ptlrpc]
[10252.159310]  [<ffffffffc0e9da53>] ptlrpc_server_handle_request+0x253/0xab0 [ptlrpc]
[10252.159310]  [<ffffffff810bdc4b>] ? __wake_up_common+0x5b/0x90
[10252.159310]  [<ffffffffc0ea1202>] ptlrpc_main+0xa92/0x1f40 [ptlrpc]
[10252.159310]  [<ffffffffc0ea0770>] ? ptlrpc_register_service+0xe90/0xe90 [ptlrpc]
[10252.159310]  [<ffffffff810b4031>] kthread+0xd1/0xe0
[10252.159310]  [<ffffffff810b3f60>] ? insert_kthread_work+0x40/0x40
[10252.159310]  [<ffffffff816c0577>] ret_from_fork+0x77/0xb0
[10252.159310]  [<ffffffff810b3f60>] ? insert_kthread_work+0x40/0x40
[10252.159310] Code: b8 01 01 01 01 01 01 01 01 48 0f af c1 41 89 f9 41 83 e1 07 75 70 48 89 d1 48 c1 e9 06 74 39 66 0f 1f 84 00 00 00 00 00 48 ff c9 <48> 89 07 48 89 47 08 48 89 47 10 48 89 47 18 48 89 47 20 48 89 
[10252.159310] Kernel panic - not syncing: softlockup: hung tasks
[10274.555013]  { 1
[10252.159310] CPU: 1 PID: 23498 Comm: ll_ost_io00_003 Tainted: P           OEL ------------   3.10.0-693.21.1.el7_lustre.x86_64 #1
[10252.159310] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2007
[10252.159310] Call Trace:
[10252.159310]  <IRQ>  [<ffffffff816ae7c8>] dump_stack+0x19/0x1b
[10252.159310]  [<ffffffff816a8634>] panic+0xe8/0x21f
[10252.159310]  [<ffffffff8102d7cf>] ? show_regs+0x5f/0x210
[10252.159310]  [<ffffffff811334e1>] watchdog_timer_fn+0x231/0x240
[10252.159310]  [<ffffffff811332b0>] ? watchdog+0x40/0x40
[10252.159310]  [<ffffffff810b8196>] __hrtimer_run_queues+0xd6/0x260
[10252.159310]  [<ffffffff810b872f>] hrtimer_interrupt+0xaf/0x1d0
[10252.159310]  [<ffffffff8105467b>] local_apic_timer_interrupt+0x3b/0x60
[10252.159310]  [<ffffffff816c4e73>] smp_apic_timer_interrupt+0x43/0x60
[10252.159310]  [<ffffffff816c1732>] apic_timer_interrupt+0x162/0x170
[10252.159310]  <EOI>  [<ffffffff81337f93>] ? memset+0x33/0xb0
[10252.159310]  [<ffffffffc04b2cc7>] ? dbuf_read+0x397/0x9e0 [zfs]
[10252.159310]  [<ffffffffc0575128>] ? zio_done+0x748/0xd20 [zfs]
[10252.159310]  [<ffffffffc056eb8c>] ? zio_destroy+0x7c/0x80 [zfs]
[10252.159310]  [<ffffffffc04b66d9>] dmu_buf_will_dirty+0x119/0x130 [zfs]
[10252.159310]  [<ffffffffc04bcf15>] dmu_write_impl+0x45/0xd0 [zfs]
[10252.159310]  [<ffffffffc04beb57>] dmu_write.part.7+0xa7/0x110 [zfs]
[10252.159310]  [<ffffffffc04bed36>] dmu_assign_arcbuf+0x156/0x1a0 [zfs]
[10252.159310]  [<ffffffffc10cfdcd>] osd_write_commit+0x46d/0xa00 [osd_zfs]
[10252.159310]  [<ffffffffc120b29a>] ofd_commitrw_write+0xf9a/0x1d00 [ofd]
[10252.159310]  [<ffffffffc120f112>] ofd_commitrw+0x4b2/0xa10 [ofd]
[10252.159310]  [<ffffffffc0c36c39>] ? lprocfs_counter_add+0xf9/0x160 [obdclass]
[10252.159310]  [<ffffffffc0ef8430>] tgt_brw_write+0x1180/0x1d70 [ptlrpc]
[10252.159310]  [<ffffffffc0e4b940>] ? target_send_reply_msg+0x170/0x170 [ptlrpc]
[10252.159310]  [<ffffffffc0ef994a>] tgt_request_handle+0x92a/0x13b0 [ptlrpc]
[10252.159310]  [<ffffffffc0e9da53>] ptlrpc_server_handle_request+0x253/0xab0 [ptlrpc]
[10252.159310]  [<ffffffff810bdc4b>] ? __wake_up_common+0x5b/0x90
[10252.159310]  [<ffffffffc0ea1202>] ptlrpc_main+0xa92/0x1f40 [ptlrpc]
[10252.159310]  [<ffffffffc0ea0770>] ? ptlrpc_register_service+0xe90/0xe90 [ptlrpc]
[10252.159310]  [<ffffffff810b4031>] kthread+0xd1/0xe0
[10252.159310]  [<ffffffff810b3f60>] ? insert_kthread_work+0x40/0x40
[10252.159310]  [<ffffffff816c0577>] ret_from_fork+0x77/0xb0
[10252.159310]  [<ffffffff810b3f60>] ? insert_kthread_work+0x40/0x40

[10274.555013] } (detected by 0, t=60299 jiffies, g=942723, c=942722, q=34)
[10274.555013] Task dump for CPU 1:
[10274.555013] ll_ost_io00_003 R  running task        0 23498      2 0x00000088
[10274.555013] Call Trace:
[10274.555013]  [<ffffffffc04bed36>] ? dmu_assign_arcbuf+0x156/0x1a0 [zfs]
[10274.555013]  [<ffffffffc10cfdcd>] ? osd_write_commit+0x46d/0xa00 [osd_zfs]
[10274.555013]  [<ffffffffc120b29a>] ? ofd_commitrw_write+0xf9a/0x1d00 [ofd]
[10274.555013]  [<ffffffffc120f112>] ? ofd_commitrw+0x4b2/0xa10 [ofd]
[10274.555013]  [<ffffffffc0c36c39>] ? lprocfs_counter_add+0xf9/0x160 [obdclass]
[10274.555013]  [<ffffffffc0ef8430>] ? tgt_brw_write+0x1180/0x1d70 [ptlrpc]
[10274.555013]  [<ffffffffc0e4b940>] ? target_send_reply_msg+0x170/0x170 [ptlrpc]
[10274.555013]  [<ffffffffc0ef994a>] ? tgt_request_handle+0x92a/0x13b0 [ptlrpc]
[10274.555013]  [<ffffffffc0e9da53>] ? ptlrpc_server_handle_request+0x253/0xab0 [ptlrpc]
[10274.555013]  [<ffffffff810bdc4b>] ? __wake_up_common+0x5b/0x90
[10274.555013]  [<ffffffffc0ea1202>] ? ptlrpc_main+0xa92/0x1f40 [ptlrpc]
[10274.555013]  [<ffffffffc0ea0770>] ? ptlrpc_register_service+0xe90/0xe90 [ptlrpc]
[10274.555013]  [<ffffffff810b4031>] ? kthread+0xd1/0xe0
[10274.555013]  [<ffffffff810b3f60>] ? insert_kthread_work+0x40/0x40
[10274.555013]  [<ffffffff816c0577>] ? ret_from_fork+0x77/0xb0
[10274.555013]  [<ffffffff810b3f60>] ? insert_kthread_work+0x40/0x40
Comment by Lai Siyao [ 26/Apr/18 ]

I found something strange in the dump:

crash> struct lu_dirpage 0xffff8800a6631000
struct lu_dirpage {
  ldp_hash_start = 16131858539135107072,
  ldp_hash_end = 8850982375266543352,
  ldp_flags = 0,
  ldp_pad0 = 0,
  ldp_entries = 0xffff8800a6631018
}

This shows that in one of the dir page reads, the page's start hash is bigger than its end hash, which is what trips the LASSERT. I've added some debug code and am trying to reproduce it.
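To make the corruption concrete, here is a minimal userspace check of the invariant against the dumped values. lu_dirpage_stub is a reduced stand-in carrying only the two hash fields quoted above, not the full struct lu_dirpage definition:

#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

struct lu_dirpage_stub {
	uint64_t ldp_hash_start;
	uint64_t ldp_hash_end;
};

int main(void)
{
	/* values copied from the crash> output above */
	struct lu_dirpage_stub ldp = {
		.ldp_hash_start = 16131858539135107072ULL,
		.ldp_hash_end   = 8850982375266543352ULL,
	};

	/* a sane page satisfies start <= end; this one does not, so any
	 * entry hash inside it can exceed "end" and trip the LASSERT */
	if (ldp.ldp_hash_start > ldp.ldp_hash_end)
		printf("corrupt window: start %" PRIu64 " > end %" PRIu64 "\n",
		       ldp.ldp_hash_start, ldp.ldp_hash_end);
	return 0;
}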

Comment by Gerrit Updater [ 27/Apr/18 ]

Lai Siyao (lai.siyao@intel.com) uploaded a new patch: https://review.whamcloud.com/32180
Subject: LU-9857 lmv: dir page is released while in use
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: dc6e38453c0fd50ce1b094d3aa348c0ab7efcdde

Comment by Gerrit Updater [ 12/May/18 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/32180/
Subject: LU-9857 lmv: dir page is released while in use
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: b51e8d6b53a3f3257fd9106ec5ec5eec302baa7f

Comment by Peter Jones [ 12/May/18 ]

Landed for 2.12

Comment by Gerrit Updater [ 04/Jun/20 ]

Olaf Faaland-LLNL (faaland1@llnl.gov) uploaded a new patch: https://review.whamcloud.com/38838
Subject: LU-9857 lmv: stripe dir page may be released mistakenly
Project: fs/lustre-release
Branch: b2_10
Current Patch Set: 1
Commit: fddce17bbad4fe379bedf479edb60034a69af977

Comment by Gerrit Updater [ 04/Jun/20 ]

Olaf Faaland-LLNL (faaland1@llnl.gov) uploaded a new patch: https://review.whamcloud.com/38839
Subject: LU-9857 lmv: dir page is released while in use
Project: fs/lustre-release
Branch: b2_10
Current Patch Set: 1
Commit: a489fb1afa69e5a695e26219ce0fb1d12b73a4f5
