[LU-9433] sanity-scrub test_6: Error in dmesg detected Created: 02/May/17  Updated: 10/Jun/17  Resolved: 10/Jun/17

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.10.0
Fix Version/s: Lustre 2.10.0

Type: Bug Priority: Critical
Reporter: James Casper Assignee: nasf (Inactive)
Resolution: Fixed Votes: 0
Labels: None
Environment:

onyx-48, full, DNE/ldiskfs
EL7, master branch, v2.9.56.11, b3565


Issue Links:
Related
is related to LU-8279 sanity-scrub test_4b: @@@@@@ FAIL: Er... Resolved
is related to LU-5729 VFS: Busy inodes after unmount of loo... Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

https://testing.hpdd.intel.com/test_sessions/cb12c60c-613a-44b3-bfef-03c0651d2607

Note: This was also seen in v2.8: LU-8279

This sanity-scrub test_6 failure was followed by the same failure in the next ~375 subtests (in sanity-scrub, sanity-benchmark, sanity-lfsck, sanityn, and sanity-hsm).

From test_log:

CMD: onyx-48vm1.onyx.hpdd.intel.com,onyx-48vm2,onyx-48vm3,onyx-48vm7,onyx-48vm8 dmesg
Kernel error detected: [11155.947772] VFS: Busy inodes after unmount of dm-1. Self-destruct in 5 seconds.  Have a nice day...
 sanity-scrub test_6: @@@@@@ FAIL: Error in dmesg detected 
  Trace dump:
  = /usr/lib64/lustre/tests/test-framework.sh:4931:error()
  = /usr/lib64/lustre/tests/test-framework.sh:5212:run_one()
  = /usr/lib64/lustre/tests/test-framework.sh:5246:run_one_logged()
  = /usr/lib64/lustre/tests/test-framework.sh:5093:run_test()
  = /usr/lib64/lustre/tests/sanity-scrub.sh:773:main()

Note: The "VFS: Busy inodes after unmount" message is also present in the MDS2/MDS4 console log.



 Comments   
Comment by Jian Yu [ 05/May/17 ]

Another failure instance on master branch:
https://testing.hpdd.intel.com/test_sets/4215ab26-3090-11e7-8847-5254006e85c2

Comment by Gerrit Updater [ 05/May/17 ]

Wei Liu (wei3.liu@intel.com) uploaded a new patch: https://review.whamcloud.com/26967
Subject: LU-9433 test: rerun sanity-scrub to reproduce it
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 1889b2073c087be7328044cfa309996721717ff8

Comment by Sarah Liu [ 10/May/17 ]

After adding an umount/remount between each subtest, I cannot reproduce the problem. Any suggestions?

Maloo report
https://testing.hpdd.intel.com/test_sessions/74dc4308-1a49-494d-9472-c75d12430ca3

Comment by James Casper [ 10/May/17 ]

FYI: In tag 56, this was only seen in the DNE/ldiskfs config. In tag 57, sanity-scrub passed for DNE/ldiskfs.

Comment by Sarah Liu [ 10/May/17 ]

Thank you, Jim. The debug patch I pushed uses the DNE/ldiskfs config. Please let me know if you see the bug in other configs or tests. Thanks.

Comment by James Nunez (Inactive) [ 11/May/17 ]

We had a sanity-scrub failure last night at https://testing.hpdd.intel.com/test_sets/999eb6ae-3607-11e7-8847-5254006e85c2; sanity-scrub test 9 timed out. When we run the same test suite multiple times for one patch (for example, with testlist=sanity-scrub,sanity-scrub), the logs get combined (see ATM-13), so the logs for all of the sanity-scrub runs are located together at https://testing.hpdd.intel.com/test_sets/9a81d60a-3607-11e7-8847-5254006e85c2.

Looking at the test 9 MDS1/MDS3 console logs, about halfway through, we see

15:58:47:[15886.272894] Lustre: DEBUG MARKER: onyx-61vm3.onyx.hpdd.intel.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4
15:58:47:[15887.708303] Lustre: DEBUG MARKER: mkdir -p /mnt/lustre-mds3
15:58:47:[15887.978941] Lustre: DEBUG MARKER: test -b /dev/lvm-Role_MDS/P3
16:58:33:********** Timeout by autotest system **********
18:10:53:[ 2558.290504] Lustre: DEBUG MARKER: == sanity-scrub test 9: OI scrub speed control 

Searching the MDS2/MDS4 console log, you will find

15:59:31:[15925.512212] Lustre: DEBUG MARKER: /usr/sbin/lctl set_param -n mdd.lustre-MDT0003.lfsck_speed_limit 300
15:59:31:[15936.070002] NMI watchdog: BUG: soft lockup - CPU#1 stuck for 22s! [OI_scrub:28076]
15:59:31:[15936.070002] Modules linked in: osp(OE) mdd(OE) lod(OE) mdt(OE) lfsck(OE) mgc(OE) osd_ldiskfs(OE) lquota(OE) fid(OE) fld(OE) ksocklnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) ldiskfs(OE) libcfs(OE) dm_mod rpcsec_gss_krb5 nfsv4 dns_resolver nfs fscache rpcrdma ib_isert iscsi_target_mod ib_iser libiscsi scsi_transport_iscsi ib_srpt target_core_mod crc_t10dif crct10dif_generic ib_srp scsi_transport_srp scsi_tgt ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_core iosf_mbi crc32_pclmul ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper ablk_helper cryptd ppdev pcspkr virtio_balloon i2c_piix4 parport_pc parport nfsd auth_rpcgss nfs_acl lockd grace sunrpc ip_tables ext4 mbcache jbd2 ata_generic pata_acpi virtio_blk crct10dif_pclmul crct10dif_common cirrus crc32c_intel drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops 8139too ttm serio_raw virtio_pci virtio_ring virtio ata_piix drm libata i2c_core 8139cp mii floppy
15:59:31:[15936.070002] CPU: 1 PID: 28076 Comm: OI_scrub Tainted: G           OE  ------------   3.10.0-514.16.1.el7_lustre.x86_64 #1
15:59:31:[15936.070002] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2007
15:59:31:[15936.070002] task: ffff880063478fb0 ti: ffff88007a8cc000 task.ti: ffff88007a8cc000
15:59:31:[15936.070002] RIP: 0010:[<ffffffffa0d24a09>]  [<ffffffffa0d24a09>] osd_scrub_exec+0x79/0x5b0 [osd_ldiskfs]
15:59:31:[15936.070002] RSP: 0018:ffff88007a8cfc78  EFLAGS: 00000202
15:59:31:[15936.070002] RAX: 0000000000000000 RBX: ffff88007a8cfc37 RCX: 00000000555939a4
15:59:31:[15936.070002] RDX: ffff88007a8cfd78 RSI: ffff88005f5ce000 RDI: ffff880061399000
15:59:31:[15936.070002] RBP: ffff88007a8cfd08 R08: ffff88007a8cfd57 R09: 0000000000000004
15:59:31:[15936.070002] R10: ffff88007fd19a80 R11: ffffea0001e90800 R12: ffffffffa0d214d1
15:59:31:[15936.070002] R13: ffff88007a8cfc68 R14: ffff88005f5cf110 R15: ffff88007a420a00
15:59:31:[15936.070002] FS:  0000000000000000(0000) GS:ffff88007fd00000(0000) knlGS:0000000000000000
15:59:31:[15936.070002] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
15:59:31:[15936.070002] CR2: 00007f94144fa000 CR3: 00000000019be000 CR4: 00000000000406e0
15:59:31:[15936.070002] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
15:59:31:[15936.070002] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
15:59:31:[15936.070002] Stack:
15:59:31:[15936.070002]  0000000000000004 ffff88007a8cfd57 ffff88007a8cfdf0 0000000000000000
15:59:31:[15936.070002]  ffff88007a8cfd57 ffff88007fd19a80 0000000000000004 ffff88007a8cfd57
15:59:31:[15936.070002]  ffff880068c13800 ffff880068c13800 0000000000005404 ffff8800550dc000
15:59:31:[15936.070002] Call Trace:
15:59:31:[15936.070002]  [<ffffffffa0d261e9>] osd_inode_iteration+0x499/0xcc0 [osd_ldiskfs]
15:59:31:[15936.070002]  [<ffffffffa0d24990>] ? osd_ios_ROOT_scan+0x300/0x300 [osd_ldiskfs]
15:59:31:[15936.070002]  [<ffffffffa0d20a20>] ? osd_preload_next+0xb0/0xb0 [osd_ldiskfs]
15:59:31:[15936.070002]  [<ffffffffa0d27370>] osd_scrub_main+0x960/0xf30 [osd_ldiskfs]
15:59:31:[15936.070002]  [<ffffffff810c54c0>] ? wake_up_state+0x20/0x20
15:59:31:[15936.070002]  [<ffffffffa0d26a10>] ? osd_inode_iteration+0xcc0/0xcc0 [osd_ldiskfs]
15:59:31:[15936.070002]  [<ffffffff810b0a4f>] kthread+0xcf/0xe0
15:59:31:[15936.070002]  [<ffffffff810b0980>] ? kthread_create_on_node+0x140/0x140
15:59:31:[15936.070002]  [<ffffffff81697318>] ret_from_fork+0x58/0x90
15:59:31:[15936.070002]  [<ffffffff810b0980>] ? kthread_create_on_node+0x140/0x140
15:59:31:[15936.070002] Code: 07 74 4f 41 83 f9 02 74 78 48 89 ca 44 89 c9 e8 0e c6 ff ff 85 c0 41 89 c1 0f 84 a3 01 00 00 41 80 a7 74 14 00 00 fe 48 8b 4d d0 <65> 48 33 0c 25 28 00 00 00 0f 85 b1 03 00 00 48 83 c4 68 5b 41 
15:59:31:[15936.070002] Kernel panic - not syncing: softlockup: hung tasks
15:59:31:[15936.070002] CPU: 1 PID: 28076 Comm: OI_scrub Tainted: G           OEL ------------   3.10.0-514.16.1.el7_lustre.x86_64 #1
15:59:31:[15936.070002] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2007
15:59:31:[15936.070002]  ffffffff818da70d 00000000555939a4 ffff88007fd03e18 ffffffff81686d1f
15:59:31:[15936.070002]  ffff88007fd03e98 ffffffff8168014a 0000000000000008 ffff88007fd03ea8
15:59:31:[15936.070002]  ffff88007fd03e48 00000000555939a4 ffff88007fd03e67 0000000000000000
15:59:31:[15936.070002] Call Trace:
15:59:31:[15936.070002]  <IRQ>  [<ffffffff81686d1f>] dump_stack+0x19/0x1b
15:59:31:[15936.070002]  [<ffffffff8168014a>] panic+0xe3/0x1f2
15:59:31:[15936.070002]  [<ffffffff8112f3bc>] watchdog_timer_fn+0x20c/0x220
15:59:31:[15936.070002]  [<ffffffff8112f1b0>] ? watchdog+0x50/0x50
15:59:31:[15936.070002]  [<ffffffff810b4d72>] __hrtimer_run_queues+0xd2/0x260
15:59:31:[15936.070002]  [<ffffffff810b5310>] hrtimer_interrupt+0xb0/0x1e0
15:59:31:[15936.070002]  [<ffffffff81698e5c>] ? call_softirq+0x1c/0x30
15:59:31:[15936.070002]  [<ffffffff81050fd7>] local_apic_timer_interrupt+0x37/0x60
15:59:31:[15936.070002]  [<ffffffff81699acf>] smp_apic_timer_interrupt+0x3f/0x60
15:59:31:[15936.070002]  [<ffffffff8169801d>] apic_timer_interrupt+0x6d/0x80
15:59:31:[15936.070002]  <EOI>  [<ffffffffa0d24a09>] ? osd_scrub_exec+0x79/0x5b0 [osd_ldiskfs]
15:59:31:[15936.070002]  [<ffffffffa0d261e9>] osd_inode_iteration+0x499/0xcc0 [osd_ldiskfs]
15:59:31:[15936.070002]  [<ffffffffa0d24990>] ? osd_ios_ROOT_scan+0x300/0x300 [osd_ldiskfs]
15:59:31:[15936.070002]  [<ffffffffa0d20a20>] ? osd_preload_next+0xb0/0xb0 [osd_ldiskfs]
15:59:31:[15936.070002]  [<ffffffffa0d27370>] osd_scrub_main+0x960/0xf30 [osd_ldiskfs]
15:59:31:[15936.070002]  [<ffffffff810c54c0>] ? wake_up_state+0x20/0x20
15:59:31:[15936.070002]  [<ffffffffa0d26a10>] ? osd_inode_iteration+0xcc0/0xcc0 [osd_ldiskfs]
15:59:31:[15936.070002]  [<ffffffff810b0a4f>] kthread+0xcf/0xe0
15:59:31:[15936.070002]  [<ffffffff810b0980>] ? kthread_create_on_node+0x140/0x140
15:59:31:[15936.070002]  [<ffffffff81697318>] ret_from_fork+0x58/0x90
15:59:31:[15936.070002]  [<ffffffff810b0980>] ? kthread_create_on_node+0x140/0x140

The stack trace is similar to LU-9488.

Comment by Peter Jones [ 11/May/17 ]

Fan Yong

Can you please advise on this one?

Thanks

Peter

Comment by Gerrit Updater [ 19/May/17 ]

Fan Yong (fan.yong@intel.com) uploaded a new patch: https://review.whamcloud.com/27212
Subject: LU-9433 osd-ldiskfs: fix inode reference leak
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: d3204613fcc23ef97a632425d4ff49ce399b40ee

Comment by nasf (Inactive) [ 19/May/17 ]

The patch https://review.whamcloud.com/27212 fixes the inode reference leak.
As for the OI scrub soft lockup, it is another failure instance of LU-9488; I will make another patch to fix it.
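
For illustration only (the names below are hypothetical and not taken from the 27212 change): the classic pattern behind "Busy inodes after unmount" is an error path that returns without dropping an inode reference obtained from iget()/ldiskfs_iget():

    /* Hypothetical sketch of an inode reference leak; scan_one_inode() and
     * check_inode() are made-up names for illustration only. */
    static int scan_one_inode(struct super_block *sb, unsigned long ino)
    {
            struct inode *inode;
            int rc;

            inode = ldiskfs_iget(sb, ino);  /* takes a reference on success */
            if (IS_ERR(inode))
                    return PTR_ERR(inode);

            rc = check_inode(inode);
            if (rc != 0)
                    return rc;              /* BUG: the reference taken above is never dropped */

            iput(inode);                    /* every successful iget needs a matching iput */
            return 0;
    }

The fix for such a leak is to route every exit path through iput(), e.g. "if (rc != 0) goto out; ... out: iput(inode); return rc;".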

Comment by nasf (Inactive) [ 22/May/17 ]

Patch https://review.whamcloud.com/27228 fixes the OI scrub soft lockup issue.
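
For reference only (27228 itself may implement the fix differently), the usual remedy for an NMI soft lockup in a long kernel-side scan such as the OI scrub inode iteration is to yield the CPU periodically inside the loop:

    /* Illustrative sketch; scrub_ctx and process_inode() are hypothetical.
     * A per-inode loop that never schedules can keep a CPU busy for tens of
     * seconds and trip the soft lockup watchdog, as in the trace above. */
    #include <linux/sched.h>

    static void scan_all_inodes(struct scrub_ctx *ctx)
    {
            unsigned long ino;

            for (ino = ctx->first; ino < ctx->last; ino++) {
                    process_inode(ctx, ino);

                    /* Give other runnable tasks (and the watchdog) a chance. */
                    cond_resched();
            }
    }

Rate-limiting how often cond_resched() is called (e.g. every few hundred inodes) keeps the overhead negligible while still preventing the watchdog from firing.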

Comment by Gerrit Updater [ 10/Jun/17 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/27212/
Subject: LU-9433 osd-ldiskfs: fix inode reference leak
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 1cdb212683824ff24f8366c4e32efb559c46aee3

Comment by Peter Jones [ 10/Jun/17 ]

Landed for 2.10
