[LU-9433] sanity-scrub test_6: Error in dmesg detected Created: 02/May/17 Updated: 10/Jun/17 Resolved: 10/Jun/17 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.10.0 |
| Fix Version/s: | Lustre 2.10.0 |
| Type: | Bug | Priority: | Critical |
| Reporter: | James Casper | Assignee: | nasf (Inactive) |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None |
| Environment: | onyx-48, full, DNE/ldiskfs |
| Issue Links: | |
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
https://testing.hpdd.intel.com/test_sessions/cb12c60c-613a-44b3-bfef-03c0651d2607

Note: This was also seen in v2.8:

This sanity-scrub subtest failure in test_6 was followed by the same failure in the next ~375 subtests (in sanity-scrub, sanity-benchmark, sanity-lfsck, sanityn, and sanity-hsm).

From test_log:

CMD: onyx-48vm1.onyx.hpdd.intel.com,onyx-48vm2,onyx-48vm3,onyx-48vm7,onyx-48vm8 dmesg
Kernel error detected:
[11155.947772] VFS: Busy inodes after unmount of dm-1. Self-destruct in 5 seconds. Have a nice day...
 sanity-scrub test_6: @@@@@@ FAIL: Error in dmesg detected
  Trace dump:
  = /usr/lib64/lustre/tests/test-framework.sh:4931:error()
  = /usr/lib64/lustre/tests/test-framework.sh:5212:run_one()
  = /usr/lib64/lustre/tests/test-framework.sh:5246:run_one_logged()
  = /usr/lib64/lustre/tests/test-framework.sh:5093:run_test()
  = /usr/lib64/lustre/tests/sanity-scrub.sh:773:main()

Note: The "VFS: Busy inodes after unmount" message is also present in the MDS2/MDS4 console log. |
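For context on the dmesg message itself: "VFS: Busy inodes after unmount ... Self-destruct in 5 seconds." is printed by generic_shutdown_super() in the kernel VFS when some inode of the filesystem still has an elevated reference count at unmount time, i.e. a kernel-side user took an inode reference and never dropped it. Below is a minimal, hypothetical C sketch of that kind of leak; demo_check_object() and its logic are made up for illustration and are not the actual Lustre/osd code.

#include <linux/fs.h>
#include <linux/errno.h>

/* Hypothetical example only -- not the actual Lustre/osd code. */
static int demo_check_object(struct super_block *sb, unsigned long ino)
{
	struct inode *inode;

	inode = ilookup(sb, ino);	/* on success this takes an inode reference */
	if (!inode)
		return -ENOENT;

	if (inode->i_nlink == 0)
		return -ENOENT;		/* BUG: returns without iput(), the reference leaks */

	/* ... examine the inode ... */

	iput(inode);			/* only this path drops the reference */
	return 0;
}

If a server-side path leaks a reference like this, the MDT unmount between subtests cannot evict the inode, the dmesg check in test-framework.sh flags the message, and every following subtest on that node would then fail the same check, which would match the ~375 cascading failures described above.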
| Comments |
| Comment by Jian Yu [ 05/May/17 ] |
|
More failure instances on the master branch: |
| Comment by Gerrit Updater [ 05/May/17 ] |
|
Wei Liu (wei3.liu@intel.com) uploaded a new patch: https://review.whamcloud.com/26967 |
| Comment by Sarah Liu [ 10/May/17 ] |
|
After adding umount/remount between each subtest, I cannot reproduce the problem. Any suggestions? Maloo report |
| Comment by James Casper [ 10/May/17 ] |
|
FYI: In tag 56, this was only seen in the following config: DNE/ldiskfs. In tag 57, sanity-scrub passed for DNE/ldiskfs. |
| Comment by Sarah Liu [ 10/May/17 ] |
|
Thank you, Jim. The debug patch I pushed uses the DNE/ldiskfs config. Please let me know if you see the bug in other configs or tests. Thanks. |
| Comment by James Nunez (Inactive) [ 11/May/17 ] |
|
We had a sanity-scrub failure last night at https://testing.hpdd.intel.com/test_sets/999eb6ae-3607-11e7-8847-5254006e85c2; sanity-scrub test 9 timed out. When we run the same test suite multiple times in one patch (for example, when we specify testlist=sanity-scrub,sanity-scrub), the logs get combined; see ATM-13. Thus, all the logs for all the sanity-scrub runs are combined and located at https://testing.hpdd.intel.com/test_sets/9a81d60a-3607-11e7-8847-5254006e85c2.

Looking at the test 9 MDS1/MDS3 console logs, about halfway through the log, we see

15:58:47:[15886.272894] Lustre: DEBUG MARKER: onyx-61vm3.onyx.hpdd.intel.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4
15:58:47:[15887.708303] Lustre: DEBUG MARKER: mkdir -p /mnt/lustre-mds3
15:58:47:[15887.978941] Lustre: DEBUG MARKER: test -b /dev/lvm-Role_MDS/P3
16:58:33:********** Timeout by autotest system **********
18:10:53:[ 2558.290504] Lustre: DEBUG MARKER: == sanity-scrub test 9: OI scrub speed control

Looking at the MDS2/MDS4 console log, if you search, you will find

15:59:31:[15925.512212] Lustre: DEBUG MARKER: /usr/sbin/lctl set_param -n mdd.lustre-MDT0003.lfsck_speed_limit 300
15:59:31:[15936.070002] NMI watchdog: BUG: soft lockup - CPU#1 stuck for 22s! [OI_scrub:28076]
15:59:31:[15936.070002] Modules linked in: osp(OE) mdd(OE) lod(OE) mdt(OE) lfsck(OE) mgc(OE) osd_ldiskfs(OE) lquota(OE) fid(OE) fld(OE) ksocklnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) ldiskfs(OE) libcfs(OE) dm_mod rpcsec_gss_krb5 nfsv4 dns_resolver nfs fscache rpcrdma ib_isert iscsi_target_mod ib_iser libiscsi scsi_transport_iscsi ib_srpt target_core_mod crc_t10dif crct10dif_generic ib_srp scsi_transport_srp scsi_tgt ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_core iosf_mbi crc32_pclmul ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper ablk_helper cryptd ppdev pcspkr virtio_balloon i2c_piix4 parport_pc parport nfsd auth_rpcgss nfs_acl lockd grace sunrpc ip_tables ext4 mbcache jbd2 ata_generic pata_acpi virtio_blk crct10dif_pclmul crct10dif_common cirrus crc32c_intel drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops 8139too ttm serio_raw virtio_pci virtio_ring virtio ata_piix drm libata i2c_core 8139cp mii floppy
15:59:31:[15936.070002] CPU: 1 PID: 28076 Comm: OI_scrub Tainted: G OE ------------ 3.10.0-514.16.1.el7_lustre.x86_64 #1
15:59:31:[15936.070002] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2007
15:59:31:[15936.070002] task: ffff880063478fb0 ti: ffff88007a8cc000 task.ti: ffff88007a8cc000
15:59:31:[15936.070002] RIP: 0010:[<ffffffffa0d24a09>] [<ffffffffa0d24a09>] osd_scrub_exec+0x79/0x5b0 [osd_ldiskfs]
15:59:31:[15936.070002] RSP: 0018:ffff88007a8cfc78 EFLAGS: 00000202
15:59:31:[15936.070002] RAX: 0000000000000000 RBX: ffff88007a8cfc37 RCX: 00000000555939a4
15:59:31:[15936.070002] RDX: ffff88007a8cfd78 RSI: ffff88005f5ce000 RDI: ffff880061399000
15:59:31:[15936.070002] RBP: ffff88007a8cfd08 R08: ffff88007a8cfd57 R09: 0000000000000004
15:59:31:[15936.070002] R10: ffff88007fd19a80 R11: ffffea0001e90800 R12: ffffffffa0d214d1
15:59:31:[15936.070002] R13: ffff88007a8cfc68 R14: ffff88005f5cf110 R15: ffff88007a420a00
15:59:31:[15936.070002] FS: 0000000000000000(0000) GS:ffff88007fd00000(0000) knlGS:0000000000000000
15:59:31:[15936.070002] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
15:59:31:[15936.070002] CR2: 00007f94144fa000 CR3: 00000000019be000 CR4: 00000000000406e0
15:59:31:[15936.070002] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
15:59:31:[15936.070002] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
15:59:31:[15936.070002] Stack:
15:59:31:[15936.070002] 0000000000000004 ffff88007a8cfd57 ffff88007a8cfdf0 0000000000000000
15:59:31:[15936.070002] ffff88007a8cfd57 ffff88007fd19a80 0000000000000004 ffff88007a8cfd57
15:59:31:[15936.070002] ffff880068c13800 ffff880068c13800 0000000000005404 ffff8800550dc000
15:59:31:[15936.070002] Call Trace:
15:59:31:[15936.070002] [<ffffffffa0d261e9>] osd_inode_iteration+0x499/0xcc0 [osd_ldiskfs]
15:59:31:[15936.070002] [<ffffffffa0d24990>] ? osd_ios_ROOT_scan+0x300/0x300 [osd_ldiskfs]
15:59:31:[15936.070002] [<ffffffffa0d20a20>] ? osd_preload_next+0xb0/0xb0 [osd_ldiskfs]
15:59:31:[15936.070002] [<ffffffffa0d27370>] osd_scrub_main+0x960/0xf30 [osd_ldiskfs]
15:59:31:[15936.070002] [<ffffffff810c54c0>] ? wake_up_state+0x20/0x20
15:59:31:[15936.070002] [<ffffffffa0d26a10>] ? osd_inode_iteration+0xcc0/0xcc0 [osd_ldiskfs]
15:59:31:[15936.070002] [<ffffffff810b0a4f>] kthread+0xcf/0xe0
15:59:31:[15936.070002] [<ffffffff810b0980>] ? kthread_create_on_node+0x140/0x140
15:59:31:[15936.070002] [<ffffffff81697318>] ret_from_fork+0x58/0x90
15:59:31:[15936.070002] [<ffffffff810b0980>] ? kthread_create_on_node+0x140/0x140
15:59:31:[15936.070002] Code: 07 74 4f 41 83 f9 02 74 78 48 89 ca 44 89 c9 e8 0e c6 ff ff 85 c0 41 89 c1 0f 84 a3 01 00 00 41 80 a7 74 14 00 00 fe 48 8b 4d d0 <65> 48 33 0c 25 28 00 00 00 0f 85 b1 03 00 00 48 83 c4 68 5b 41
15:59:31:[15936.070002] Kernel panic - not syncing: softlockup: hung tasks
15:59:31:[15936.070002] CPU: 1 PID: 28076 Comm: OI_scrub Tainted: G OEL ------------ 3.10.0-514.16.1.el7_lustre.x86_64 #1
15:59:31:[15936.070002] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2007
15:59:31:[15936.070002] ffffffff818da70d 00000000555939a4 ffff88007fd03e18 ffffffff81686d1f
15:59:31:[15936.070002] ffff88007fd03e98 ffffffff8168014a 0000000000000008 ffff88007fd03ea8
15:59:31:[15936.070002] ffff88007fd03e48 00000000555939a4 ffff88007fd03e67 0000000000000000
15:59:31:[15936.070002] Call Trace:
15:59:31:[15936.070002] <IRQ> [<ffffffff81686d1f>] dump_stack+0x19/0x1b
15:59:31:[15936.070002] [<ffffffff8168014a>] panic+0xe3/0x1f2
15:59:31:[15936.070002] [<ffffffff8112f3bc>] watchdog_timer_fn+0x20c/0x220
15:59:31:[15936.070002] [<ffffffff8112f1b0>] ? watchdog+0x50/0x50
15:59:31:[15936.070002] [<ffffffff810b4d72>] __hrtimer_run_queues+0xd2/0x260
15:59:31:[15936.070002] [<ffffffff810b5310>] hrtimer_interrupt+0xb0/0x1e0
15:59:31:[15936.070002] [<ffffffff81698e5c>] ? call_softirq+0x1c/0x30
15:59:31:[15936.070002] [<ffffffff81050fd7>] local_apic_timer_interrupt+0x37/0x60
15:59:31:[15936.070002] [<ffffffff81699acf>] smp_apic_timer_interrupt+0x3f/0x60
15:59:31:[15936.070002] [<ffffffff8169801d>] apic_timer_interrupt+0x6d/0x80
15:59:31:[15936.070002] <EOI> [<ffffffffa0d24a09>] ? osd_scrub_exec+0x79/0x5b0 [osd_ldiskfs]
15:59:31:[15936.070002] [<ffffffffa0d261e9>] osd_inode_iteration+0x499/0xcc0 [osd_ldiskfs]
15:59:31:[15936.070002] [<ffffffffa0d24990>] ? osd_ios_ROOT_scan+0x300/0x300 [osd_ldiskfs]
15:59:31:[15936.070002] [<ffffffffa0d20a20>] ? osd_preload_next+0xb0/0xb0 [osd_ldiskfs]
15:59:31:[15936.070002] [<ffffffffa0d27370>] osd_scrub_main+0x960/0xf30 [osd_ldiskfs]
15:59:31:[15936.070002] [<ffffffff810c54c0>] ? wake_up_state+0x20/0x20
15:59:31:[15936.070002] [<ffffffffa0d26a10>] ? osd_inode_iteration+0xcc0/0xcc0 [osd_ldiskfs]
15:59:31:[15936.070002] [<ffffffff810b0a4f>] kthread+0xcf/0xe0
15:59:31:[15936.070002] [<ffffffff810b0980>] ? kthread_create_on_node+0x140/0x140
15:59:31:[15936.070002] [<ffffffff81697318>] ret_from_fork+0x58/0x90
15:59:31:[15936.070002] [<ffffffff810b0980>] ? kthread_create_on_node+0x140/0x140

The stack trace is similar to |
| Comment by Peter Jones [ 11/May/17 ] |
|
Fan Yong, can you please advise on this one? Thanks, Peter |
| Comment by Gerrit Updater [ 19/May/17 ] |
|
Fan Yong (fan.yong@intel.com) uploaded a new patch: https://review.whamcloud.com/27212 |
| Comment by nasf (Inactive) [ 19/May/17 ] |
|
The patch https://review.whamcloud.com/27212 fixes an inode reference leak. |
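For illustration, the usual shape of such a fix is sketched below; this is hypothetical (demo_check_object_fixed() is made up, see the review above for the real change). Every exit is funneled through one cleanup label so the reference taken on lookup is always dropped, which is what prevents the "Busy inodes after unmount" message shown in the description.

#include <linux/fs.h>
#include <linux/errno.h>

/* Hypothetical example only -- not the actual patch. */
static int demo_check_object_fixed(struct super_block *sb, unsigned long ino)
{
	struct inode *inode;
	int rc = 0;

	inode = ilookup(sb, ino);	/* takes an inode reference on success */
	if (!inode)
		return -ENOENT;

	if (inode->i_nlink == 0) {
		rc = -ENOENT;
		goto out;		/* error paths now reach the iput() below */
	}

	/* ... examine the inode ... */

out:
	iput(inode);			/* reference is dropped on every path */
	return rc;
}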
| Comment by nasf (Inactive) [ 22/May/17 ] |
|
The patch https://review.whamcloud.com/27228 fixes the OI scrub soft lockup issue. |
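For reference, the general pattern behind this class of fix is sketched below; this is hypothetical (demo_scrub_thread() is made up, see the review above for the real change). The OI_scrub kthread iterates over every inode of the device, and if that loop never gives up the CPU, the watchdog reports "BUG: soft lockup - CPU#1 stuck for 22s! [OI_scrub:...]" after roughly 20 seconds and, with kernel.softlockup_panic set, panics the MDS as in the console log above. Yielding periodically inside the loop avoids pinning the CPU.

#include <linux/kthread.h>
#include <linux/sched.h>

/* Hypothetical example only -- not the actual patch. */
static int demo_scrub_thread(void *data)
{
	unsigned long nr_checked = 0;

	while (!kthread_should_stop()) {
		/* ... look up and verify the next OI mapping (omitted) ... */
		nr_checked++;

		/* Give the scheduler a chance every so often so a long scan
		 * cannot pin this CPU and trip the soft-lockup watchdog. */
		if ((nr_checked & 1023) == 0)
			cond_resched();
	}

	return 0;
}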
| Comment by Gerrit Updater [ 10/Jun/17 ] |
|
Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/27212/ |
| Comment by Peter Jones [ 10/Jun/17 ] |
|
Landed for 2.10 |