[LU-8971] sanity test_99a: test failed to respond and timed out Created: 23/Dec/16 Updated: 30/Dec/16 Resolved: 30/Dec/16 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Critical |
| Reporter: | Maloo | Assignee: | Bob Glossman (Inactive) |
| Resolution: | Duplicate | Votes: | 0 |
| Labels: | None | ||
| Issue Links: |
|
||||||||
| Severity: | 3 | ||||||||
| Rank (Obsolete): | 9223372036854775807 | ||||||||
| Description |
|
This issue was created by maloo for Bob Glossman <bob.glossman@intel.com> This issue relates to the following test suite run: https://testing.hpdd.intel.com/test_sets/1df6416a-c94d-11e6-a37a-5254006e85c2. The sub-test test_99a failed with the following error: test failed to respond and timed out autotest log and syslog from Client 1 are missing so it's hard to see what went wrong. Info required for matching: sanity 99a |
| Comments |
| Comment by Bob Glossman (Inactive) [ 27/Dec/16 ] |
|
another on master, sles12sp2: |
| Comment by Bob Glossman (Inactive) [ 28/Dec/16 ] |
|
seems to be faulting in the kernel API set_page_writeback() fault stack trace: 02:34:38:[ 2687.910728] BUG: unable to handle kernel NULL pointer dereference at 00000000000000a8 02:34:38:[ 2687.913539] IP: [<ffffffff813344ea>] __percpu_counter_add+0xa/0x60 02:34:38:[ 2687.914460] PGD ea8067 PUD 8b067 PMD 0 02:34:38:[ 2687.914460] Oops: 0000 [#1] SMP 02:34:38:[ 2687.914460] Modules linked in: loop lustre(OEN) obdecho(OEN) mgc(OEN) lov(OEN) osc(OEN) mdc(OEN) lmv(OEN) fid(OEN) fld(OEN) ptlrpc_gss(OEN) ptlrpc(OEN) obdclass(OEN) ksocklnd(OEN) lnet(OEN) sha512_ssse3 sha512_generic libcfs(OEN) rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace sunrpc fscache af_packet iscsi_boot_sysfs ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm configfs ib_cm iw_cm ib_sa ib_mad ib_core ib_addr crct10dif_pclmul crc32_pclmul crc32c_intel jitterentropy_rng drbg ansi_cprng aesni_intel ppdev aes_x86_64 lrw gf128mul glue_helper ablk_helper cryptd 8139too joydev pcspkr virtio_balloon parport_pc 8139cp i2c_piix4 mii parport pvpanic acpi_cpufreq processor button ext4 crc16 jbd2 mbcache ata_generic virtio_blk floppy ata_piix ahci libahci cirrus drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm drm uhci_hcd ehci_hcd usbcore serio_raw libata virtio_pci virtio_ring virtio usb_common sg dm_multipath dm_mod scsi_dh_rdac scsi_dh_emc scsi_dh_alua scsi_mod autofs4 02:34:38:[ 2687.914460] Supported: No, Unsupported modules are loaded 02:34:38:[ 2687.914460] CPU: 0 PID: 8115 Comm: cvs Tainted: G OE N 4.4.21-84-default #1 02:34:38:[ 2687.914460] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2007 02:34:38:[ 2687.914460] task: ffff8800761ed6c0 ti: ffff88007b3dc000 task.ti: ffff88007b3dc000 02:34:38:[ 2687.914460] RIP: 0010:[<ffffffff813344ea>] [<ffffffff813344ea>] __percpu_counter_add+0xa/0x60 02:34:38:[ 2687.914460] RSP: 0018:ffff88007b3df9f0 EFLAGS: 00010006 02:34:38:[ 2687.914460] RAX: 0000000000000001 RBX: 0000000000000088 RCX: 000000000000001b 02:34:38:[ 2687.914460] RDX: 0000000000000010 RSI: 0000000000000001 RDI: 0000000000000088 02:34:38:[ 2687.914460] RBP: ffff88004ffed8b8 R08: 0000000002080020 R09: 0000000000000000 02:34:38:[ 2687.914460] R10: 0000000000000000 R11: 0000000000000001 R12: 0000000000000001 02:34:38:[ 2687.914460] R13: ffff88004ffed750 R14: ffff88004ffed8d0 R15: 0000000000000000 02:34:38:[ 2687.914460] FS: 00007f26d9d37700(0000) GS:ffff88007fc00000(0000) knlGS:0000000000000000 02:34:38:[ 2687.914460] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 02:34:38:[ 2687.914460] CR2: 00000000000000a8 CR3: 0000000000023000 CR4: 00000000000406f0 02:34:38:[ 2687.914460] Stack: 02:34:38:[ 2687.914460] ffffea00016fae00 ffff88004ffed8b8 ffff88007d3fec00 ffffffff8118f595 02:34:38:[ 2687.914460] ffff8800732ae918 0000000000000287 000000007b3dfa38 ffff880037393c50 02:34:38:[ 2687.914460] 0000000000000068 ffff88004cf936e8 ffff88004d47dd30 ffff880037393c28 02:34:38:[ 2687.914460] Call Trace: 02:34:38:[ 2687.914460] [<ffffffff8118f595>] __test_set_page_writeback+0xd5/0x1c0 02:34:38:[ 2687.914460] [<ffffffffa0d51b2e>] vvp_page_prep_write+0x1e/0x90 [lustre] 02:34:38:[ 2687.914460] [<ffffffffa0831995>] cl_page_invoke+0x55/0x160 [obdclass] 02:34:38:[ 2687.914460] [<ffffffffa0831c6f>] cl_page_prep+0x2f/0x1d0 [obdclass] 02:34:38:[ 2687.914460] [<ffffffffa0c49d06>] osc_io_submit+0x156/0x620 [osc] 02:34:38:[ 2687.914460] [<ffffffffa083626a>] cl_io_submit_rw+0x5a/0x150 [obdclass] 02:34:38:[ 2687.914460] [<ffffffffa0c9c2a4>] lov_io_submit+0x294/0x410 [lov] 02:34:38:[ 2687.914460] [<ffffffffa083626a>] cl_io_submit_rw+0x5a/0x150 [obdclass] 02:34:38:[ 2687.914460] [<ffffffffa0836f6b>] cl_io_submit_sync+0x8b/0x1b0 [obdclass] 02:34:38:[ 2687.914460] [<ffffffffa0d55aeb>] vvp_io_write_commit+0x61b/0x8c0 [lustre] 02:34:38:[ 2687.914460] [<ffffffffa0d561ac>] vvp_io_write_start+0x41c/0x5b0 [lustre] 02:34:38:[ 2687.914460] [<ffffffffa0835f35>] cl_io_start+0x55/0x110 [obdclass] 02:34:38:[ 2687.914460] [<ffffffffa0838033>] cl_io_loop+0x93/0x170 [obdclass] 02:34:38:[ 2687.914460] [<ffffffffa0d0b5fe>] ll_file_io_generic+0x5ee/0x9e0 [lustre] 02:34:38:[ 2687.914460] [<ffffffffa0d0bc09>] ll_file_write_iter+0x69/0xc0 [lustre] 02:34:38:[ 2687.914460] [<ffffffff811fa659>] __vfs_write+0xa9/0xf0 02:34:38:[ 2687.914460] [<ffffffff811fac4d>] vfs_write+0x9d/0x190 02:34:38:[ 2687.914460] [<ffffffff811fb912>] SyS_write+0x42/0xa0 02:34:38:[ 2687.914460] [<ffffffff815e142e>] entry_SYSCALL_64_fastpath+0x12/0x6d 02:34:38:[ 2687.914460] DWARF2 unwinder stuck at entry_SYSCALL_64_fastpath+0x12/0x6d 02:34:38:[ 2687.914460] 02:34:38:[ 2687.914460] Leftover inexact backtrace: 02:34:38:[ 2687.914460] 02:34:38:[ 2687.914460] Code: 63 35 7f 41 a1 00 89 c2 39 f0 7c cf 4c 89 ee 48 89 ef e8 0a ca 2a 00 48 89 d8 5b 5d 41 5c 41 5d c3 41 54 49 89 f4 55 53 48 89 fb <48> 8b 47 20 48 63 ca 65 8b 28 48 63 ed 48 01 f5 48 39 cd 7d 0a 02:34:38:[ 2687.914460] RIP [<ffffffff813344ea>] __percpu_counter_add+0xa/0x60 02:34:38:[ 2687.914460] RSP <ffff88007b3df9f0> I'm guessing there's something significantly different in the page handling primitives in linux 4.4 vs. earlier versions. Hard to say exactly what. |
| Comment by Bob Glossman (Inactive) [ 28/Dec/16 ] |
|
I note that the llite/vvp_page.c source in the lustre tree looks quite different in the upstream lustre client found in drivers/staging/lustre in the linux 4.4 source compared to the llite/vvp_page.c in our current master. In particular this includes our vvp_page_prep_write(), which calls set_page_writeback(). I strongly suspect some changes that went into the upstream lustre client need to come back into master. Probably needs some additional autoconf changes to work on both old and new kernels. |
| Comment by Bob Glossman (Inactive) [ 28/Dec/16 ] |
|
I've been advised there's a fix in flight for this that never landed, |
| Comment by Bob Glossman (Inactive) [ 30/Dec/16 ] |
|
In the following test run of sles12sp2 rebased on top of the fix for https://testing.hpdd.intel.com/test_sets/114e1984-cece-11e6-af6a-5254006e85c2 Strong evidence that this problem is in fact a dup of I'm going to close this ticket as a dup. |