[LU-8438] sanity test 182 hung Created: 26/Jul/16  Updated: 05/Aug/20  Resolved: 05/Aug/20

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.9.0
Fix Version/s: None

Type: Bug Priority: Major
Reporter: Jian Yu Assignee: WC Triage
Resolution: Cannot Reproduce Votes: 0
Labels: None

Attachments: Text File vmcore-dmesg-onyx-57vm3.txt    
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

The console log on the MDS showed:

Lustre: DEBUG MARKER: == sanity test 182: Test parallel modify metadata operations ========================================= 20:06:19 (1469502379)
BUG: soft lockup - CPU#0 stuck for 22s! [osp-syn-0-0:16414]
Modules linked in: osp(OE) mdd(OE) lod(OE) mdt(OE) lfsck(OE) mgs(OE) mgc(OE) osd_zfs(OE) lquota(OE) fid(OE) fld(OE) ksocklnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) sha512_generic crypto_null libcfs(OE) dm_mod rpcsec_gss_krb5 nfsv4 dns_resolver nfs fscache xprtrdma ib_isert iscsi_target_mod ib_iser libiscsi scsi_transport_iscsi ib_srpt target_core_mod crc_t10dif crct10dif_generic crct10dif_common ib_srp scsi_transport_srp scsi_tgt ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_sa ib_mad ppdev virtio_balloon pcspkr ib_core ib_addr parport_pc i2c_piix4 parport zfs(POE) zunicode(POE) zavl(POE) zcommon(POE) znvpair(POE) nfsd spl(OE) zlib_deflate nfs_acl auth_rpcgss lockd grace sunrpc ip_tables ext4 mbcache jbd2 ata_generic pata_acpi cirrus syscopyarea sysfillrect sysimgblt virtio_blk drm_kms_helper ttm ata_piix 8139too libata serio_raw drm virtio_pci virtio_ring
[    0.000000] Initializing cgroup subsys cpuset
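For context, "parallel modify metadata operations" means several processes hammering the same directory with creates, renames, setattrs and unlinks at once. The sketch below is a hypothetical illustration of that kind of workload; it is not the actual test_182() body from sanity.sh, and the worker count, file count and mount point are assumptions:

/*
 * Hypothetical sketch of a parallel metadata-modification workload.
 * Illustration only -- not the real sanity.sh test_182().
 */
#include <fcntl.h>
#include <stdio.h>
#include <sys/stat.h>
#include <sys/wait.h>
#include <unistd.h>

#define NPROC  8     /* parallel workers (assumed) */
#define NFILES 100   /* operations per worker (assumed) */

static void worker(const char *dir, int id)
{
    char src[512], dst[512];

    for (int i = 0; i < NFILES; i++) {
        snprintf(src, sizeof(src), "%s/f%d.%d", dir, id, i);
        snprintf(dst, sizeof(dst), "%s/r%d.%d", dir, id, i);

        int fd = open(src, O_CREAT | O_WRONLY, 0644);   /* create  */
        if (fd >= 0)
            close(fd);
        rename(src, dst);                                /* rename  */
        chmod(dst, 0600);                                /* setattr */
        unlink(dst);                                     /* unlink  */
    }
}

int main(int argc, char **argv)
{
    const char *dir = argc > 1 ? argv[1] : "/mnt/lustre/d182"; /* assumed path */

    mkdir(dir, 0755);
    for (int i = 0; i < NPROC; i++) {
        if (fork() == 0) {
            worker(dir, i);
            _exit(0);
        }
    }
    while (wait(NULL) > 0)
        ;   /* reap all workers */
    return 0;
}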

The stack backtrace on the MDS showed:

 jbd2/vda1-8     D ffff880036ba78e0     0   268      2 0x00000000  
  ffff880036ba7780 0000000000000046 ffff880079360000 ffff880036ba7fd8
  ffff880036ba7fd8 ffff880036ba7fd8 ffff880079360000 ffff88007fd147c0
  0000000000000000 7fffffffffffffff ffffffff81211940 ffff880036ba78e0
 Call Trace:
  [<ffffffff81211940>] ? generic_block_bmap+0x70/0x70
  [<ffffffff8163ba29>] schedule+0x29/0x70
  [<ffffffff81639719>] schedule_timeout+0x209/0x2d0
  [<ffffffff81058aaf>] ? kvm_clock_get_cycles+0x1f/0x30
  [<ffffffff81211940>] ? generic_block_bmap+0x70/0x70
  [<ffffffff8163b05e>] io_schedule_timeout+0xae/0x130
  [<ffffffff8163b0f8>] io_schedule+0x18/0x20
  [<ffffffff8121194e>] sleep_on_buffer+0xe/0x20
  [<ffffffff816398a0>] __wait_on_bit+0x60/0x90
  [<ffffffff81211940>] ? generic_block_bmap+0x70/0x70
  [<ffffffff81639957>] out_of_line_wait_on_bit+0x87/0xb0
  [<ffffffff810a6b60>] ? wake_atomic_t_function+0x40/0x40
  [<ffffffff81212e10>] ? _submit_bh+0x160/0x210 
  [<ffffffff81213848>] bh_submit_read+0x78/0x90
  [<ffffffffa01c43a7>] ext4_get_branch+0xd7/0x170 [ext4]
  [<ffffffffa01c4d5e>] ext4_ind_map_blocks+0xce/0x760 [ext4]
  [<ffffffffa01c6f8c>] ? __es_remove_extent+0x5c/0x300 [ext4]
  [<ffffffffa0181c1b>] ext4_map_blocks+0x9b/0x590 [ext4]
  [<ffffffffa01821cc>] _ext4_get_block+0xbc/0x1b0 [ext4]
  [<ffffffffa01822d6>] ext4_get_block+0x16/0x20 [ext4]
  [<ffffffff8121191b>] generic_block_bmap+0x4b/0x70
  [<ffffffff81212611>] ? alloc_buffer_head+0x21/0x70
  [<ffffffffa0181381>] ext4_bmap+0x81/0xf0 [ext4]
  [<ffffffff811f8c1e>] bmap+0x1e/0x30
  [<ffffffffa0169fc8>] jbd2_journal_bmap+0x28/0xa0 [jbd2]
  [<ffffffffa016a0b2>] jbd2_journal_next_log_block+0x72/0x80 [jbd2]
  [<ffffffffa0161668>] jbd2_journal_commit_transaction+0x798/0x19a0 [jbd2]
  [<ffffffff81013588>] ? __switch_to+0xf8/0x4b0 
  [<ffffffffa0166d79>] kjournald2+0xc9/0x260 [jbd2]
  [<ffffffff810a6ae0>] ? wake_up_atomic_t+0x30/0x30
  [<ffffffffa0166cb0>] ? commit_timeout+0x10/0x10 [jbd2]
  [<ffffffff810a5aef>] kthread+0xcf/0xe0
  [<ffffffff810a5a20>] ? kthread_create_on_node+0x140/0x140
  [<ffffffff816469d8>] ret_from_fork+0x58/0x90
  [<ffffffff810a5a20>] ? kthread_create_on_node+0x140/0x140
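Reading the bottom frames together: the jbd2 commit thread needs the next journal block, asks ext4 for its on-disk location (jbd2_journal_bmap -> ext4_bmap -> ext4_ind_map_blocks), and that lookup reads an indirect block synchronously via bh_submit_read(), which blocks until the I/O completes. Below is a hedged sketch of that synchronous read-and-wait pattern, based on what a 3.10-era fs/buffer.c does; it is a simplified illustration, not a verbatim excerpt:

/*
 * Kernel-context sketch (not standalone code): roughly the
 * bh_submit_read()/wait_on_buffer() pattern seen in the trace above.
 */
#include <linux/buffer_head.h>

static int sketch_submit_read(struct buffer_head *bh)
{
        if (buffer_uptodate(bh)) {      /* block already in memory */
                unlock_buffer(bh);
                return 0;
        }

        get_bh(bh);
        bh->b_end_io = end_buffer_read_sync;
        submit_bh(READ, bh);            /* queue the block read */
        wait_on_buffer(bh);             /* uninterruptible wait on BH_Lock --
                                         * the sleep_on_buffer/__wait_on_bit
                                         * frames in the backtrace */

        return buffer_uptodate(bh) ? 0 : -EIO;
}

If that read never completes on the backing device (vda1 in this VM), the journal commit, and with it the metadata operations the test issues, stalls indefinitely, which matches the hang observed here.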

Maloo report: https://testing.hpdd.intel.com/test_sets/bc7f8634-530a-11e6-bf87-5254006e85c2



 Comments   
Comment by Jian Yu [ 26/Jul/16 ]

This is affecting patch review testing on master branch.

Comment by Oleg Drokin [ 26/Jul/16 ]

So I see there was a crash and a crash dump was generated.
Can you please attach the vmcore-dmesg.txt from the crash here as a first step?

Comment by Jian Yu [ 26/Jul/16 ]

Sure, Oleg, please see the attached file.

The vmcore is under /scratch/dumps/onyx-57vm3.onyx.hpdd.intel.com/10.2.5.84-2016-07-25-20:07:03 on Onyx test cluster.

Comment by Jian Yu [ 26/Jul/16 ]

More failure instances on master branch:
https://testing.hpdd.intel.com/test_sets/46c88096-49ad-11e6-9f8e-5254006e85c2
https://testing.hpdd.intel.com/test_sets/7d846a60-4993-11e6-bf87-5254006e85c2
https://testing.hpdd.intel.com/test_sets/0a624bfa-48bd-11e6-8968-5254006e85c2

Comment by Andreas Dilger [ 05/Aug/20 ]

Closing old issue that has not been seen in a long time.
