[LU-8971] sanity test_99a: test failed to respond and timed out Created: 23/Dec/16  Updated: 30/Dec/16  Resolved: 30/Dec/16

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Critical
Reporter: Maloo Assignee: Bob Glossman (Inactive)
Resolution: Duplicate Votes: 0
Labels: None

Issue Links:
Duplicate
duplicates LU-6854 Setting page_writeback on a non-dirty... Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

This issue was created by maloo for Bob Glossman <bob.glossman@intel.com>

This issue relates to the following test suite run: https://testing.hpdd.intel.com/test_sets/1df6416a-c94d-11e6-a37a-5254006e85c2.

The sub-test test_99a failed with the following error:

test failed to respond and timed out

autotest log and syslog from Client 1 are missing so it's hard to see what went wrong.

Info required for matching: sanity 99a



 Comments   
Comment by Bob Glossman (Inactive) [ 27/Dec/16 ]

another on master, sles12sp2:
https://testing.hpdd.intel.com/test_sets/ef072448-cbed-11e6-8580-5254006e85c2

Comment by Bob Glossman (Inactive) [ 28/Dec/16 ]

seems to be faulting in the kernel API set_page_writeback()

fault stack trace:

02:34:38:[ 2687.910728] BUG: unable to handle kernel NULL pointer dereference at 00000000000000a8
02:34:38:[ 2687.913539] IP: [<ffffffff813344ea>] __percpu_counter_add+0xa/0x60
02:34:38:[ 2687.914460] PGD ea8067 PUD 8b067 PMD 0 
02:34:38:[ 2687.914460] Oops: 0000 [#1] SMP 
02:34:38:[ 2687.914460] Modules linked in: loop lustre(OEN) obdecho(OEN) mgc(OEN) lov(OEN) osc(OEN) mdc(OEN) lmv(OEN) fid(OEN) fld(OEN) ptlrpc_gss(OEN) ptlrpc(OEN) obdclass(OEN) ksocklnd(OEN) lnet(OEN) sha512_ssse3 sha512_generic libcfs(OEN) rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace sunrpc fscache af_packet iscsi_boot_sysfs ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm configfs ib_cm iw_cm ib_sa ib_mad ib_core ib_addr crct10dif_pclmul crc32_pclmul crc32c_intel jitterentropy_rng drbg ansi_cprng aesni_intel ppdev aes_x86_64 lrw gf128mul glue_helper ablk_helper cryptd 8139too joydev pcspkr virtio_balloon parport_pc 8139cp i2c_piix4 mii parport pvpanic acpi_cpufreq processor button ext4 crc16 jbd2 mbcache ata_generic virtio_blk floppy ata_piix ahci libahci cirrus drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm drm uhci_hcd ehci_hcd usbcore serio_raw libata virtio_pci virtio_ring virtio usb_common sg dm_multipath dm_mod scsi_dh_rdac scsi_dh_emc scsi_dh_alua scsi_mod autofs4
02:34:38:[ 2687.914460] Supported: No, Unsupported modules are loaded
02:34:38:[ 2687.914460] CPU: 0 PID: 8115 Comm: cvs Tainted: G           OE   N  4.4.21-84-default #1
02:34:38:[ 2687.914460] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2007
02:34:38:[ 2687.914460] task: ffff8800761ed6c0 ti: ffff88007b3dc000 task.ti: ffff88007b3dc000
02:34:38:[ 2687.914460] RIP: 0010:[<ffffffff813344ea>]  [<ffffffff813344ea>] __percpu_counter_add+0xa/0x60
02:34:38:[ 2687.914460] RSP: 0018:ffff88007b3df9f0  EFLAGS: 00010006
02:34:38:[ 2687.914460] RAX: 0000000000000001 RBX: 0000000000000088 RCX: 000000000000001b
02:34:38:[ 2687.914460] RDX: 0000000000000010 RSI: 0000000000000001 RDI: 0000000000000088
02:34:38:[ 2687.914460] RBP: ffff88004ffed8b8 R08: 0000000002080020 R09: 0000000000000000
02:34:38:[ 2687.914460] R10: 0000000000000000 R11: 0000000000000001 R12: 0000000000000001
02:34:38:[ 2687.914460] R13: ffff88004ffed750 R14: ffff88004ffed8d0 R15: 0000000000000000
02:34:38:[ 2687.914460] FS:  00007f26d9d37700(0000) GS:ffff88007fc00000(0000) knlGS:0000000000000000
02:34:38:[ 2687.914460] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
02:34:38:[ 2687.914460] CR2: 00000000000000a8 CR3: 0000000000023000 CR4: 00000000000406f0
02:34:38:[ 2687.914460] Stack:
02:34:38:[ 2687.914460]  ffffea00016fae00 ffff88004ffed8b8 ffff88007d3fec00 ffffffff8118f595
02:34:38:[ 2687.914460]  ffff8800732ae918 0000000000000287 000000007b3dfa38 ffff880037393c50
02:34:38:[ 2687.914460]  0000000000000068 ffff88004cf936e8 ffff88004d47dd30 ffff880037393c28
02:34:38:[ 2687.914460] Call Trace:
02:34:38:[ 2687.914460]  [<ffffffff8118f595>] __test_set_page_writeback+0xd5/0x1c0
02:34:38:[ 2687.914460]  [<ffffffffa0d51b2e>] vvp_page_prep_write+0x1e/0x90 [lustre]
02:34:38:[ 2687.914460]  [<ffffffffa0831995>] cl_page_invoke+0x55/0x160 [obdclass]
02:34:38:[ 2687.914460]  [<ffffffffa0831c6f>] cl_page_prep+0x2f/0x1d0 [obdclass]
02:34:38:[ 2687.914460]  [<ffffffffa0c49d06>] osc_io_submit+0x156/0x620 [osc]
02:34:38:[ 2687.914460]  [<ffffffffa083626a>] cl_io_submit_rw+0x5a/0x150 [obdclass]
02:34:38:[ 2687.914460]  [<ffffffffa0c9c2a4>] lov_io_submit+0x294/0x410 [lov]
02:34:38:[ 2687.914460]  [<ffffffffa083626a>] cl_io_submit_rw+0x5a/0x150 [obdclass]
02:34:38:[ 2687.914460]  [<ffffffffa0836f6b>] cl_io_submit_sync+0x8b/0x1b0 [obdclass]
02:34:38:[ 2687.914460]  [<ffffffffa0d55aeb>] vvp_io_write_commit+0x61b/0x8c0 [lustre]
02:34:38:[ 2687.914460]  [<ffffffffa0d561ac>] vvp_io_write_start+0x41c/0x5b0 [lustre]
02:34:38:[ 2687.914460]  [<ffffffffa0835f35>] cl_io_start+0x55/0x110 [obdclass]
02:34:38:[ 2687.914460]  [<ffffffffa0838033>] cl_io_loop+0x93/0x170 [obdclass]
02:34:38:[ 2687.914460]  [<ffffffffa0d0b5fe>] ll_file_io_generic+0x5ee/0x9e0 [lustre]
02:34:38:[ 2687.914460]  [<ffffffffa0d0bc09>] ll_file_write_iter+0x69/0xc0 [lustre]
02:34:38:[ 2687.914460]  [<ffffffff811fa659>] __vfs_write+0xa9/0xf0
02:34:38:[ 2687.914460]  [<ffffffff811fac4d>] vfs_write+0x9d/0x190
02:34:38:[ 2687.914460]  [<ffffffff811fb912>] SyS_write+0x42/0xa0
02:34:38:[ 2687.914460]  [<ffffffff815e142e>] entry_SYSCALL_64_fastpath+0x12/0x6d
02:34:38:[ 2687.914460] DWARF2 unwinder stuck at entry_SYSCALL_64_fastpath+0x12/0x6d
02:34:38:[ 2687.914460] 
02:34:38:[ 2687.914460] Leftover inexact backtrace:
02:34:38:[ 2687.914460] 
02:34:38:[ 2687.914460] Code: 63 35 7f 41 a1 00 89 c2 39 f0 7c cf 4c 89 ee 48 89 ef e8 0a ca 2a 00 48 89 d8 5b 5d 41 5c 41 5d c3 41 54 49 89 f4 55 53 48 89 fb <48> 8b 47 20 48 63 ca 65 8b 28 48 63 ed 48 01 f5 48 39 cd 7d 0a 
02:34:38:[ 2687.914460] RIP  [<ffffffff813344ea>] __percpu_counter_add+0xa/0x60
02:34:38:[ 2687.914460]  RSP <ffff88007b3df9f0>

I'm guessing there's something significantly different in the page handling primitives in linux 4.4 vs. earlier versions. Hard to say exactly what.

Comment by Bob Glossman (Inactive) [ 28/Dec/16 ]

I note that the llite/vvp_page.c source in the lustre tree looks quite different in the upstream lustre client found in drivers/staging/lustre in the linux 4.4 source compared to the llite/vvp_page.c in our current master.

In particular this includes our vvp_page_prep_write(), which calls set_page_writeback().

I strongly suspect some changes that went into the upstream lustre client need to come back into master. Probably needs some additional autoconf changes to work on both old and new kernels.

Comment by Bob Glossman (Inactive) [ 28/Dec/16 ]

I've been advised there's a fix in flight for this that never landed, LU-6854.
The mod did go upstream, but never landed in master.

Comment by Bob Glossman (Inactive) [ 30/Dec/16 ]

In the following test run of sles12sp2 rebased on top of the fix for LU-6854 this failure of test_99a is no longer seen. Other known bugs in sanity are still hit, but this one is gone.

https://testing.hpdd.intel.com/test_sets/114e1984-cece-11e6-af6a-5254006e85c2

Strong evidence that this problem is in fact a dup of LU-6854.

I'm going to close this ticket as a dup.

Generated at Sat Feb 10 02:22:08 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.