[LU-9469] conf-sanity test_61: test failed to respond and timed out Created: 08/May/17 Updated: 24/Oct/17 Resolved: 16/Oct/17 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.10.0 |
| Fix Version/s: | Lustre 2.11.0, Lustre 2.10.2 |
| Type: | Bug | Priority: | Minor |
| Reporter: | James Casper | Assignee: | Bob Glossman (Inactive) |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Environment: |
onyx-37, full |
||
| Issue Links: |
|
||||||||||||||||
| Severity: | 3 | ||||||||||||||||
| Rank (Obsolete): | 9223372036854775807 | ||||||||||||||||
| Description |
|
https://testing.hpdd.intel.com/test_sessions/b7669e10-d1f4-42ce-a573-31dd57859317 From MDS console: BUG: unable to handle kernel NULL pointer dereference at (null) Oops: 0000 [#1] SMP Followed by a trace that is very similar to the one seen in LU-6821: [<ffffffffa0764214>] ldiskfs_map_blocks+0x554/0x600 [ldiskfs] 21:19:41:[23773.866330] [<ffffffffa074aa81>] ldiskfs_xattr_set_entry+0x4a1/0x790 [ldiskfs] 21:19:41:[23773.866330] [<ffffffffa074b970>] ldiskfs_xattr_ibody_set+0x30/0x80 [ldiskfs] 21:19:41:[23773.866330] [<ffffffffa074bd55>] ldiskfs_xattr_set_handle+0x1c5/0x530 [ldiskfs] 21:19:41:[23773.866330] [<ffffffffa074c1fd>] ldiskfs_xattr_set+0x13d/0x1a0 [ldiskfs] 21:19:41:[23773.866330] [<ffffffff8121e4bb>] generic_setxattr+0x5b/0x70 21:19:41:[23773.866330] [<ffffffffa0e6bfd4>] osd_xattr_set+0x294/0xaf0 [osd_ldiskfs] 21:19:41:[23773.866330] [<ffffffffa109bf2f>] lod_sub_object_xattr_set+0x1df/0x430 [lod] 21:19:41:[23773.866330] [<ffffffffa108372f>] lod_xattr_set_internal+0xaf/0x2a0 [lod] 21:19:41:[23773.866330] [<ffffffffa1091574>] lod_xattr_set+0xe4/0x530 [lod] 21:19:41:[23773.866330] [<ffffffffa110277d>] mdd_xattr_set+0x6cd/0xe50 [mdd] 21:19:41:[23773.866330] [<ffffffffa0fbac30>] mdt_reint_setxattr+0x410/0xf60 [mdt] 21:19:41:[23773.866330] [<ffffffffa0fb967e>] mdt_reint_rec+0x6e/0x1e0 [mdt] 21:19:41:[23773.866330] [<ffffffffa0f9d087>] mdt_reint_internal+0x557/0x950 [mdt] 21:19:41:[23773.866330] [<ffffffffa0fa8051>] mdt_reint+0x61/0x120 [mdt] 21:19:41:[23773.866330] [<ffffffffa0bd1cb8>] tgt_request_handle+0x8b8/0x1290 [ptlrpc] 21:19:41:[23773.866330] [<ffffffffa0b7e573>] ptlrpc_server_handle_request+0x203/0xa00 [ptlrpc] 21:19:41:[23773.866330] [<ffffffffa0b82398>] ptlrpc_main+0xa38/0x1d80 [ptlrpc] 21:19:41:[23773.866330] [<ffffffff810999dd>] kthread+0xbd/0xe0 21:19:41:[23773.866330] [<ffffffff815e6f3f>] ret_from_fork+0x3f/0x70 |
| Comments |
| Comment by Peter Jones [ 10/May/17 ] |
|
Bob Can you please look into this one? Thanks Peter |
| Comment by Andreas Dilger [ 11/May/17 ] |
|
It is very likely that this is related to patch https://review.whamcloud.com/25595 " At the same time, it makes sense to port patch https://review.whamcloud.com/27045 |
| Comment by Bob Glossman (Inactive) [ 11/May/17 ] |
|
I have asked Yang Sheng to take a look. He was the author of https://review.whamcloud.com/25595 |
| Comment by Yang Sheng [ 12/May/17 ] |
|
From MDS log: 21:19:41:[23773.807317] Lustre: DEBUG MARKER: /usr/sbin/lctl mark save large xattr trusted.big on \/mnt\/lustre\/f61.conf-sanity 21:19:41:[23773.846525] Lustre: DEBUG MARKER: save large xattr trusted.big on /mnt/lustre/f61.conf-sanity 21:19:41:[23773.862509] BUG: unable to handle kernel NULL pointer dereference at (null) 21:19:41:[23773.864655] IP: [<ffffffffa0288a10>] jbd2_journal_file_inode+0x30/0xf0 [jbd2] 21:19:41:[23773.866330] PGD 0 21:19:41:[23773.866330] Oops: 0000 [#1] SMP 21:19:41:[23773.866330] Modules linked in: osp(OEN) mdd(OEN) lod(OEN) mdt(OEN) lfsck(OEN) mgs(OEN) mgc(OEN) osd_ldiskfs(OEN) lquota(OEN) fid(OEN) fld(OEN) ksocklnd(OEN) ptlrpc(OEN) obdclass(OEN) lnet(OEN) ldiskfs(OEN) libcfs(OEN) loop(E) rpcsec_gss_krb5(E) auth_rpcgss(E) nfsv4(E) dns_resolver(E) nfs(E) lockd(E) grace(E) sunrpc(E) fscache(E) af_packet(E) iscsi_boot_sysfs(E) ib_ipoib(E) rdma_ucm(E) ib_ucm(E) ib_uverbs(E) ib_umad(E) rdma_cm(E) It is a jinode issue. We lost control since some codes have changed. For https://review.whamcloud.com/25595, looks like sles12sp2 already has included it. For https://review.whamcloud.com/27045, It can be ported if we will port project quota to sles12sp2. Thanks, |
| Comment by Peter Jones [ 15/May/17 ] |
|
Yes. Let's bring the SLES12 SP2 patches up to date with the RHEL7.3 patches |
| Comment by Peter Jones [ 23/Aug/17 ] |
|
Any progress on this? |
| Comment by Bob Glossman (Inactive) [ 23/Aug/17 ] |
|
attempting a patch suggested by Yang Sheng that addressed a similar panic seen on el7.4 |
| Comment by Yang Sheng [ 23/Aug/17 ] |
|
It looks same as https://testing.hpdd.intel.com/sub_tests/3157a7c8-7814-11e7-a12c-5254006e85c2. 03:59:12:[20259.344620] BUG: unable to handle kernel NULL pointer dereference at (null) 03:59:12:[20259.345439] IP: [<ffffffffc0195c84>] jbd2_journal_file_inode+0x34/0xf0 [jbd2] 03:59:12:[20259.345439] PGD 0 03:59:12:[20259.345439] Oops: 0000 [#1] SMP 03:59:12:[20259.345439] Modules linked in: osp(OE) mdd(OE) lod(OE) mdt(OE) lfsck(OE) mgs(OE) mgc(OE) osd_ldiskfs(OE) lquota(OE) fid(OE) fld(OE) ksocklnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) ldiskfs(OE) libcfs(OE) loop dm_mod rpcsec_gss_krb5 nfsv4 dns_resolver nfs fscache rpcrdma ib_isert iscsi_target_mod ib_iser libiscsi scsi_transport_iscsi ib_srpt target_core_mod crc_t10dif crct10dif_generic ib_srp scsi_transport_srp scsi_tgt ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_core iosf_mbi crc32_pclmul ghash_clmulni_intel ppdev aesni_intel lrw gf128mul glue_helper ablk_helper cryptd pcspkr joydev virtio_balloon i2c_piix4 parport_pc parport nfsd nfs_acl lockd auth_rpcgss grace sunrpc ip_tables ext4 mbcache jbd2 ata_generic pata_acpi cirrus drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm virtio_blk drm ata_piix 8139too libata crct10dif_pclmul crct10dif_common crc32c_intel serio_raw 8139cp virtio_pci virtio_ring mii virtio i2c_core floppy [last unloaded: libcfs] 03:59:12:[20259.345439] CPU: 1 PID: 31105 Comm: mdt00_002 Tainted: G W OE ------------ 3.10.0-693.el7_lustre.x86_64 #1 03:59:12:[20259.345439] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2007 03:59:12:[20259.345439] task: ffff880064ab2f70 ti: ffff880064b58000 task.ti: ffff880064b58000 03:59:12:[20259.345439] RIP: 0010:[<ffffffffc0195c84>] [<ffffffffc0195c84>] jbd2_journal_file_inode+0x34/0xf0 [jbd2] 03:59:12:[20259.345439] RSP: 0018:ffff880064b5b6b0 EFLAGS: 00010246 03:59:12:[20259.345439] RAX: 0000000000000800 RBX: ffff880054ba9800 RCX: 0000080000200000 03:59:12:[20259.345439] RDX: 000000008002c823 RSI: 0000000000000000 RDI: ffff88007bd3d2a0 03:59:12:[20259.345439] RBP: ffff880064b5b6d8 R08: 000000000000c350 R09: ffffffffc07b75c3 03:59:12:[20259.345439] R10: 0000000000006b03 R11: 0000000000000000 R12: 0000000000000000 03:59:12:[20259.345439] R13: 0000000000000001 R14: 0000000000000000 R15: ffff8800796f2000 03:59:12:[20259.345439] FS: 0000000000000000(0000) GS:ffff88007fd00000(0000) knlGS:0000000000000000 03:59:12:[20259.345439] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 03:59:12:[20259.345439] CR2: 0000000000000000 CR3: 0000000079809000 CR4: 00000000000406e0 03:59:12:[20259.345439] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 03:59:12:[20259.345439] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 03:59:12:[20259.345439] Stack: 03:59:12:[20259.345439] ffff880000c28958 0000000000000000 0000000000000001 ffff880064b5b7c8 03:59:12:[20259.345439] 0000000000000001 ffff880064b5b760 ffffffffc07c3b4c 0000000000000000 03:59:12:[20259.345439] 0000000000c28958 000000007bd3d2a0 ffff88007bd3d2a0 0000000000000003 03:59:12:[20259.345439] Call Trace: 03:59:12:[20259.345439] [<ffffffffc07c3b4c>] ldiskfs_map_blocks+0x64c/0x6f0 [ldiskfs] 03:59:12:[20259.345439] [<ffffffffc077824c>] ldiskfs_xattr_set_entry+0x2cc/0x7c0 [ldiskfs] 03:59:12:[20259.345439] [<ffffffffc07791c5>] ldiskfs_xattr_ibody_set+0x35/0x80 [ldiskfs] 03:59:12:[20259.345439] [<ffffffffc07795a4>] ldiskfs_xattr_set_handle+0x1a4/0x500 [ldiskfs] 03:59:12:[20259.345439] [<ffffffffc0779a46>] ldiskfs_xattr_set+0x146/0x1c0 [ldiskfs] 03:59:12:[20259.345439] [<ffffffffc07b89cd>] ldiskfs_xattr_trusted_set+0x2d/0x30 [ldiskfs] 03:59:12:[20259.345439] [<ffffffff812273d2>] generic_setxattr+0x62/0x80 03:59:12:[20259.345439] [<ffffffffc0d2f43c>] osd_xattr_set+0x2cc/0xbf0 [osd_ldiskfs] 03:59:12:[20259.345439] [<ffffffffc06933e4>] ? libcfs_log_return+0x24/0x30 [libcfs] 03:59:12:[20259.345439] [<ffffffffc0faec00>] lod_sub_xattr_set+0x1f0/0x480 [lod] 03:59:12:[20259.345439] [<ffffffffc0f94ce9>] lod_xattr_set_internal+0xb9/0x2d0 [lod] 03:59:12:[20259.345439] [<ffffffffc0fa3ce1>] lod_xattr_set+0xf1/0x560 [lod] 03:59:12:[20259.345439] [<ffffffffc1010b05>] mdd_xattr_set+0x705/0xe80 [mdd] 03:59:12:[20259.345439] [<ffffffffc0ec2607>] ? mdt_version_save+0x67/0x120 [mdt] 03:59:12:[20259.345439] [<ffffffffc0ecdde0>] mdt_reint_setxattr+0x410/0xf80 [mdt] 03:59:12:[20259.345439] [<ffffffffc0696bc7>] ? libcfs_debug_msg+0x57/0x80 [libcfs] 03:59:12:[20259.345439] [<ffffffffc0ecc790>] mdt_reint_rec+0x80/0x210 [mdt] 03:59:12:[20259.345439] [<ffffffffc0eae31b>] mdt_reint_internal+0x5fb/0x9c0 [mdt] 03:59:12:[20259.345439] [<ffffffffc0eb9da7>] mdt_reint+0x67/0x140 [mdt] 03:59:12:[20259.345439] [<ffffffffc0b5d195>] tgt_request_handle+0x925/0x1370 [ptlrpc] 03:59:12:[20259.345439] [<ffffffffc0b06036>] ptlrpc_server_handle_request+0x236/0xa90 [ptlrpc] 03:59:12:[20259.345439] [<ffffffffc0696bc7>] ? libcfs_debug_msg+0x57/0x80 [libcfs] 03:59:12:[20259.345439] [<ffffffffc0b097d2>] ptlrpc_main+0xa92/0x1e40 [ptlrpc] 03:59:12:[20259.345439] [<ffffffff81029557>] ? __switch_to+0xd7/0x510 03:59:12:[20259.345439] [<ffffffff816a8f00>] ? __schedule+0x310/0x8b0 03:59:12:[20259.345439] [<ffffffffc0b08d40>] ? ptlrpc_register_service+0xe80/0xe80 [ptlrpc] 03:59:12:[20259.345439] [<ffffffff810b098f>] kthread+0xcf/0xe0 03:59:12:[20259.345439] [<ffffffff810b08c0>] ? insert_kthread_work+0x40/0x40 03:59:12:[20259.345439] [<ffffffff816b4f18>] ret_from_fork+0x58/0x90 03:59:12:[20259.345439] [<ffffffff810b08c0>] ? insert_kthread_work+0x40/0x40 03:59:12:[20259.345439] Code: 41 57 41 56 41 55 41 54 53 f6 47 1c 08 48 8b 1f 0f 85 91 00 00 00 48 85 db 0f 84 88 00 00 00 4c 8b 3b 45 8b 37 41 83 e6 02 75 7c <48> 39 1e 49 89 f4 74 7a 48 39 5e 08 74 74 4d 8d af a0 03 00 00 03:59:12:[20259.345439] RIP [<ffffffffc0195c84>] jbd2_journal_file_inode+0x34/0xf0 [jbd2] 03:59:12:[20259.345439] RSP <ffff880064b5b6b0> 03:59:12:[20259.345439] CR2: 0000000000000000 I have proposed a fix for that. Bob will port it to sles12. Thanks, |
| Comment by Gerrit Updater [ 23/Aug/17 ] |
|
Bob Glossman (bob.glossman@intel.com) uploaded a new patch: https://review.whamcloud.com/28665 |
| Comment by Gerrit Updater [ 23/Aug/17 ] |
|
Yang Sheng (yang.sheng@intel.com) uploaded a new patch: https://review.whamcloud.com/28666 |
| Comment by Yang Sheng [ 25/Aug/17 ] |
|
So looks like it is very easy to reproduce on sles12sp2. https://testing.hpdd.intel.com/test_sessions/cc1c2380-7d1d-4672-a3ba-0c0f4275f105 We need run fullset tests while kernel update. Else may lost chance to catch issue timely. Thanks, |
| Comment by Bob Glossman (Inactive) [ 25/Aug/17 ] |
|
YangSheng, You could help in a couple of ways:
Thanks, |
| Comment by Yang Sheng [ 25/Aug/17 ] |
|
Hi, Bob, Do you have a way to run all of tests group for sles12 sp2 instead of conf-sanity only? I afraid testing of sles12 is not enough. Maybe more issues will be revealed while full tests is running. I think just change Test-parameter of https://review.whamcloud.com/28665 is enough. My patch is just for test, it change nothing. Thanks, |
| Comment by Bob Glossman (Inactive) [ 25/Aug/17 ] |
|
I know your patch is just for test. want to just test with #29665 to see if original problem will occur with just conf-sanity if repeated enough times before trying anything else. The only way I can think of to run all normal test groups in #29665 is to have 5 separate Test-Parmaters lines, 1 for each of the 5 default test runs. I can do that, but want to have more confidence about conf-sanity first. Doing that seems likely to run into other, unrelated, pre-existing fails on sles12. |
| Comment by Gerrit Updater [ 13/Sep/17 ] |
|
Bob Glossman (bob.glossman@intel.com) uploaded a new patch: https://review.whamcloud.com/28968 |
| Comment by Gerrit Updater [ 16/Oct/17 ] |
|
Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/28665/ |
| Comment by Peter Jones [ 16/Oct/17 ] |
|
Landed for 2.11 |
| Comment by Gerrit Updater [ 24/Oct/17 ] |
|
John L. Hammond (john.hammond@intel.com) merged in patch https://review.whamcloud.com/28968/ |