Details
- Type: Bug
- Resolution: Fixed
- Priority: Minor
- Affects Version/s: Lustre 2.5.0
- Environment: Lustre master build #1715; OpenSFS cluster with a combined MGS/MDS, a single OSS with two OSTs, and three clients: one running an HSM agent plus a Lustre client, one running robinhood/db plus a Lustre client, and one running just a Lustre client
- Severity: 3
- Rank: 10892
Description
The test results are at: https://maloo.whamcloud.com/test_sets/8e9cca2c-2c8b-11e3-85ee-52540035b04c
From the client test_log:
== sanity-hsm test 300: On disk coordinator state kept between MDT umount/mount == 14:22:47 (1380835367)
Stop coordinator and remove coordinator state at mount
mdt.scratch-MDT0000.hsm_control=shutdown
Changed after 0s: from '' to 'stopping'
Waiting 10 secs for update
Updated after 8s: wanted 'stopped' got 'stopped'
Failing mds1 on mds
Stopping /lustre/scratch/mdt0 (opts:) on mds
pdsh@c15: mds: ssh exited with exit code 1
reboot facets: mds1
Failover mds1 to mds
14:23:15 (1380835395) waiting for mds network 900 secs ...
14:23:15 (1380835395) network interface is UP
mount facets: mds1
Starting mds1: /dev/sda3 /lustre/scratch/mdt0
Started scratch-MDT0000
c15: mdc.scratch-MDT0000-mdc-*.mds_server_uuid in FULL state after 25 sec
Changed after 0s: from '' to 'enabled'
Waiting 20 secs for update
Waiting 10 secs for update
Update not seen after 20s: wanted 'stopped' got 'enabled'
sanity-hsm test_300: @@@@@@ FAIL: cdt state is not stopped
Trace dump:
= /usr/lib64/lustre/tests/test-framework.sh:4264:error_noexit()
= /usr/lib64/lustre/tests/test-framework.sh:4291:error()
= /usr/lib64/lustre/tests/sanity-hsm.sh:298:cdt_check_state()
= /usr/lib64/lustre/tests/sanity-hsm.sh:3063:test_300()
Attachments
- lustre-debug.log.tgz (1.22 MB)
- lustre-dmesg.txt (2 kB)
Activity
Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/12783/
Subject: LU-4065 tests: hsm copytool_cleanup improvement
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 73bca6c1f4923cdf673fa11486aec04ec3576051
There is a +1 from Andreas in Gerrit. Can somebody else do a review so this can move forward?
Yes, we hit this bug in a lot of sanity-hsm subtests: 402, 3, 106, ...
In our case it was a race between copytool_cleanup and cdt_set_mount_state, so copytool_cleanup usually failed.
About why test_300 may fail (my view; a sketch follows these steps):
1. cdt_set_mount_state sets the parameter using "lctl set_param -P".
2. "cdt_check_state stopped" waits until hsm_control becomes "stopped" (after cdt_shutdown and cdt_clear_mount_state).
3. While "cdt_check_state stopped" is still waiting for "stopped", the MGC retrieves and applies the configuration from the server, which still contains the hsm_control=enabled set in step 1.
"Also, did you find the detailed behavior you have described by analyzing the MGS/MDS nodes' debug logs?"
Yes.
About reproducing the problem: you may try to make a custom build with MGC_TIMEOUT_MIN_SECONDS = 10 or 15. If that does not break something else, it may help.
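For instance, a rough sketch of such a custom build, assuming the constant still lives in lustre/mgc/mgc_request.c (verify the define's location and current value in your tree before trying this):

    # Bump the minimum MGC requeue delay, then rebuild as usual.
    sed -i 's/\(#define[ \t]*MGC_TIMEOUT_MIN_SECONDS[ \t]*\)[0-9]*/\115/' \
        lustre/mgc/mgc_request.c
    sh autogen.sh && ./configure && make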
Sergei, thanks for this comment and patch, but I wonder if you hit the related issue/failure only when running the sanity-hsm/test_300 sub-test, or also with other uses of "cdt_set_mount_state()" within sanity-hsm? BTW, I don't understand why you refer to copytool_cleanup() in your description.
Also, did you find the detailed behavior you have described by analyzing the MGS/MDS nodes' debug logs?
And last, your patch failed auto-tests due to unrelated failures; I think you need to rebase it to avoid such issues.
Last, an auto-tests/Maloo query indicates that the issue for this ticket has been encountered only once (on 2014-11-19 15:49:17 UTC) during the past 4 months. I am presently analyzing that failure's associated MDS/MGS debug log, and in parallel I am trying to reproduce it in-house.
Hello, we hit the same error at Seagate and are now using the following solution: http://review.whamcloud.com/#/c/12783/.
Short explanation (a sketch follows the list):
1. "cdt_set_mount_state enabled" sets the parameter on the server using "lctl set_param -P".
2. copytool_cleanup sets hsm_control=shutdown and waits until hsm_control becomes "stopped".
3. While copytool_cleanup is waiting for "stopped", the MGC retrieves and applies the configuration from the server, which still contains the hsm_control=enabled set in step 1.
4. The MGC starts log processing with a delay of about 10 seconds (see mgc_requeue_thread).
So copytool_cleanup waits 20 seconds and still gets hsm_control=enabled, because the parameter was set back to "enabled" in step 3.
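For illustration, a hedged sketch of the kind of hardening this points at (the actual change in http://review.whamcloud.com/12783/ may differ): re-assert the shutdown while polling, and wait well past the ~10 s requeue delay, so a late MGC replay of "enabled" cannot win the race:

    # Run on the MDS; $FSNAME, the helper name, and the 60 s budget are
    # assumptions for this sketch, not the actual patch.
    cdt_shutdown_and_wait() {
        local param=mdt.$FSNAME-MDT0000.hsm_control
        local end=$((SECONDS + 60))
        while [ $SECONDS -lt $end ]; do
            lctl set_param $param=shutdown > /dev/null
            [ "$(lctl get_param -n $param)" = "stopped" ] && return 0
            sleep 2
        done
        return 1
    }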
James, according to this auto-test failure report, it is unclear to me how this happened ...
BTW, I ran a Maloo/auto-tests search for sanity-hsm/test_300 failures, and it seems the test runs pretty well; the failure you reported is the only one in the last 2 months:
Name | Status | Run at | Duration | Return code | Error | Test log | Bugs
test_300 | FAIL | 2014-09-05 01:47:58 UTC | 31 | 1 | Restart of mds1 failed! | Preview 10 |
test_300 | FAIL | 2014-09-02 13:14:59 UTC | 80 | 1 | hsm_control state is not 'stopped' on mds1 | Preview 10 |
test_300 | FAIL | 2014-08-27 01:19:32 UTC | 107 | 1 | post-failover df: 1 | Preview 10 |
test_300 | FAIL | 2014-08-21 19:55:39 UTC | 714 | 1 | import is not in FULL state | Preview 10 | LU-4018
test_300 | FAIL | 2014-07-01 22:43:31 UTC | 599 | 1 | hsm_control state is not 'stopped' on mds1 | Preview 10 |
test_300 | FAIL | 2014-07-01 10:47:59 UTC | 107 | 1 | hsm_control state is not 'stopped' on mds1 | Preview 10 |
test_300 | FAIL | 2014-06-21 14:32:23 UTC | 723 | 1 | import is not in FULL state | Preview 10 | LU-4018
test_300 | FAIL | 2014-06-02 17:16:48 UTC | 331 | 1 | hsm_control state is not 'stopped' on mds1 | Preview 10 |
test_300 | FAIL | 2014-05-28 05:46:52 UTC | 714 | 1 | import is not in FULL state | Preview 10 | LU-4018
test_300 | FAIL | 2014-05-27 09:56:15 UTC | 975 | 1 | import is not in FULL state | Preview 10 | LU-4018
test_300 | FAIL | 2014-05-22 19:10:52 UTC | 105 | 1 | hsm_control state is not 'stopped' on mds1 | Preview 10 | LU-4065
test_300 | FAIL | 2014-05-08 18:55:20 UTC | 716 | 1 | import is not in FULL state | Preview 10 | LU-4018
test_300 | FAIL | 2014-05-07 22:34:02 UTC | 746 | 1 | import is not in FULL state | Preview 10 | LU-4018
test_300 | FAIL | 2014-03-26 22:02:07 UTC | 77 | 1 | hsm_control state is not 'stopped' on mds1 | Preview 10 | LU-4125, LU-4065
test_300 | FAIL | 2014-03-11 20:01:49 UTC | 79 | 1 | hsm_control state is not 'stopped' on mds1 | Preview 10 | LU-4065, LU-4065
test_300 | FAIL | 2013-12-08 19:39:12 UTC | 755 | 1 | import is not in FULL state | | LU-4018, LU-4361
I've run into this error again while testing HSM with the 2.5.3-RC1 version of Lustre on the OpenSFS cluster. The results are at https://testing.hpdd.intel.com/test_sessions/60308f12-3472-11e4-995a-5254006e85c2
Patch landed to Master. Please reopen ticket if more work is needed.
Since I'm testing HSM on 2.5.1, I thought I would collect the information Bruno asked for on 09/Oct/13. I ran sanity-hsm test 300 alone and got the "cdt state not stopped on mds1" error. On the MDT, the first few commands succeeded, but the unmount hung and later crashed.
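For reference, a single sanity-hsm subtest can be run from an installed lustre-tests tree with the standard ONLY filter (the path below matches the trace dump earlier in this ticket; adjust to your setup):

    # Run only sanity-hsm test_300.
    cd /usr/lib64/lustre/tests
    ONLY=300 sh sanity-hsm.sh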
Here's what I see in dmesg captured by crash:
<4>Lustre: DEBUG MARKER: == sanity-hsm test 300: On disk coordinator state kept between MDT umount/mount == 09:30:44 (1394641844)
<6>Lustre: Disabling parameter general.mdt.lscratch-MDT0000.hsm_control in log params
<4>Lustre: Failing over lscratch-MDT0000
<3>LustreError: 137-5: lscratch-MDT0000_UUID: not available for connect from 192.168.2.118@o2ib (no target). If you are running an HA pair check that the target is mounted on the other server.
<4>Lustre: 16111:0:(client.c:1901:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1394641848/real 1394641848] req@ffff88039ae00800 x1462381936924596/t0(0) o251->MGC192.168.2.116@o2ib@0@lo:26/25 lens 224/224 e 0 to 1 dl 1394641854 ref 2 fl Rpc:XN/0/ffffffff rc 0/-1
<4>Lustre: server umount lscratch-MDT0000 complete
<6>LDISKFS-fs (sda3): mounted filesystem with ordered data mode. quota=on. Opts:
<3>LustreError: 137-5: lscratch-MDT0000_UUID: not available for connect from 192.168.2.119@o2ib (no target). If you are running an HA pair check that the target is mounted on the other server.
<4>Lustre: lscratch-MDT0000: used disk, loading
<6>format at mdt_coordinator.c:977:mdt_hsm_cdt_start doesn't end in newline
<3>LustreError: 16267:0:(mdt_coordinator.c:977:mdt_hsm_cdt_start()) lscratch-MDT0000: cannot take the layout locks needed for registered restore: -2
<4>Lustre: MGS: non-config logname received: params
<3>LustreError: 11-0: lscratch-MDT0000-lwp-MDT0000: Communicating with 0@lo, operation mds_connect failed with -11.
<6>Lustre: lscratch-MDD0000: changelog on
<4>Lustre: lscratch-MDT0000: Will be in recovery for at least 5:00, or until 5 clients reconnect
<6>Lustre: lscratch-MDT0000: Recovery over after 0:29, of 5 clients 5 recovered and 0 were evicted.
<4>Lustre: DEBUG MARKER: mdc.lscratch-MDT0000-mdc-*.mds_server_uuid in FULL state after 29 sec
<4>Lustre: MGS: non-config logname received: params
<4>Lustre: Skipped 4 previous similar messages
<4>Lustre: DEBUG MARKER: sanity-hsm test_300: @@@@@@ FAIL: hsm_control state is not 'stopped' on mds1
<4>Lustre: DEBUG MARKER: == sanity-hsm test complete, duration 92 sec == 09:32:06 (1394641926)
<4>Lustre: DEBUG MARKER: shutdown
<4>Lustre: DEBUG MARKER: set_param-d-P
<6>Lustre: Disabling parameter general.mdt.lscratch-MDT0000.hsm_control in log params
<3>LustreError: 17173:0:(mgs_handler.c:744:mgs_iocontrol()) MGS: setparam err: rc = 1
<4>Lustre: DEBUG MARKER: get_param
<4>Lustre: DEBUG MARKER: umount
<4>Lustre: Failing over lscratch-MDT0000
<3>LustreError: 137-5: lscratch-MDT0000_UUID: not available for connect from 192.168.2.118@o2ib (no target). If you are running an HA pair check that the target is mounted on the other server.
<4>------------[ cut here ]------------
<4>WARNING: at lib/list_debug.c:48 list_del+0x6e/0xa0() (Not tainted)
<4>Hardware name: X8DTT-H
<4>list_del corruption. prev->next should be ffff88039810de18, but was 5a5a5a5a5a5a5a5a
<4>Modules linked in: osp(U) mdd(U) lfsck(U) lod(U) mdt(U) mgs(U) mgc(U) fsfilt_ldiskfs(U) osd_ldiskfs(U) lquota(U) lustre(U) lov(U) osc(U) mdc(U) fid(U) fld(U) ko2iblnd(U) ptlrpc(U) obdclass(U) lnet(U) lvfs(U) sha512_generic sha256_generic crc32c_intel libcfs(U) ldiskfs(U) jbd2 nf_conntrack_ipv4 nf_defrag_ipv4 xt_state nf_conntrack iptable_filter ip_tables nfsd exportfs nfs lockd fscache auth_rpcgss nfs_acl sunrpc cpufreq_ondemand acpi_cpufreq freq_table mperf ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_addr ipv6 microcode iTCO_wdt iTCO_vendor_support serio_raw i2c_i801 i2c_core sg lpc_ich mfd_core mlx4_ib ib_sa ib_mad ib_core mlx4_en mlx4_core e1000e ptp pps_core ioatdma dca i7core_edac edac_core shpchp ext3 jbd mbcache sd_mod crc_t10dif pata_acpi ata_generic ata_piix dm_mirror dm_region_hash dm_log dm_mod [last unloaded: nf_conntrack]
<4>Pid: 17120, comm: hsm_cdtr Not tainted 2.6.32-431.5.1.el6_lustre.x86_64 #1
<4>Call Trace:
<4> [<ffffffff81071e27>] ? warn_slowpath_common+0x87/0xc0
<4> [<ffffffff81071f16>] ? warn_slowpath_fmt+0x46/0x50
<4> [<ffffffff81084220>] ? process_timeout+0x0/0x10
<4> [<ffffffff8129489e>] ? list_del+0x6e/0xa0
<4> [<ffffffff8109b681>] ? remove_wait_queue+0x31/0x50
<4> [<ffffffffa0e8ef6e>] ? mdt_coordinator+0xbce/0x16b0 [mdt]
<4> [<ffffffff810096f0>] ? __switch_to+0xd0/0x320
<4> [<ffffffff81065df0>] ? default_wake_function+0x0/0x20
<4> [<ffffffff81528090>] ? thread_return+0x4e/0x76e
<4> [<ffffffffa0e8e3a0>] ? mdt_coordinator+0x0/0x16b0 [mdt]
<4> [<ffffffff8109aee6>] ? kthread+0x96/0xa0
<4> [<ffffffff8100c20a>] ? child_rip+0xa/0x20
<4> [<ffffffff8109ae50>] ? kthread+0x0/0xa0
<4> [<ffffffff8100c200>] ? child_rip+0x0/0x20
<4>---[ end trace 4c18e762bb5c5a65 ]---
<4>------------[ cut here ]------------
<4>WARNING: at lib/list_debug.c:51 list_del+0x8d/0xa0() (Tainted: G W -- ------------- )
<4>Hardware name: X8DTT-H
<4>list_del corruption. next->prev should be ffff88039810de18, but was 5a5a5a5a5a5a5a5a
<4>Modules linked in: osp(U) mdd(U) lfsck(U) lod(U) mdt(U) mgs(U) mgc(U) fsfilt_ldiskfs(U) osd_ldiskfs(U) lquota(U) lustre(U) lov(U) osc(U) mdc(U) fid(U) fld(U) ko2iblnd(U) ptlrpc(U) obdclass(U) lnet(U) lvfs(U) sha512_generic sha256_generic crc32c_intel libcfs(U) ldiskfs(U) jbd2 nf_conntrack_ipv4 nf_defrag_ipv4 xt_state nf_conntrack iptable_filter ip_tables nfsd exportfs nfs lockd fscache auth_rpcgss nfs_acl sunrpc cpufreq_ondemand acpi_cpufreq freq_table mperf ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_addr ipv6 microcode iTCO_wdt iTCO_vendor_support serio_raw i2c_i801 i2c_core sg lpc_ich mfd_core mlx4_ib ib_sa ib_mad ib_core mlx4_en mlx4_core e1000e ptp pps_core ioatdma dca i7core_edac edac_core shpchp ext3 jbd mbcache sd_mod crc_t10dif pata_acpi ata_generic ata_piix dm_mirror dm_region_hash dm_log dm_mod [last unloaded: nf_conntrack]
<4>Pid: 17120, comm: hsm_cdtr Tainted: G W --------------- 2.6.32-431.5.1.el6_lustre.x86_64 #1
<4>Call Trace:
<4> [<ffffffff81071e27>] ? warn_slowpath_common+0x87/0xc0
<4> [<ffffffff81071f16>] ? warn_slowpath_fmt+0x46/0x50
<4> [<ffffffff81084220>] ? process_timeout+0x0/0x10
<4> [<ffffffff812948bd>] ? list_del+0x8d/0xa0
<4> [<ffffffff8109b681>] ? remove_wait_queue+0x31/0x50
<4> [<ffffffffa0e8ef6e>] ? mdt_coordinator+0xbce/0x16b0 [mdt]
<4> [<ffffffff810096f0>] ? __switch_to+0xd0/0x320
<4> [<ffffffff81065df0>] ? default_wake_function+0x0/0x20
<4> [<ffffffff81528090>] ? thread_return+0x4e/0x76e
<4> [<ffffffffa0e8e3a0>] ? mdt_coordinator+0x0/0x16b0 [mdt]
<4> [<ffffffff8109aee6>] ? kthread+0x96/0xa0
<4> [<ffffffff8100c20a>] ? child_rip+0xa/0x20
<4> [<ffffffff8109ae50>] ? kthread+0x0/0xa0
<4> [<ffffffff8100c200>] ? child_rip+0x0/0x20
<4>---[ end trace 4c18e762bb5c5a66 ]---
<4>Lustre: 17182:0:(client.c:1901:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1394641933/real 1394641933] req@ffff880392bf6c00 x1462381936924904/t0(0) o251->MGC192.168.2.116@o2ib@0@lo:26/25 lens 224/224 e 0 to 1 dl 1394641939 ref 2 fl Rpc:XN/0/ffffffff rc 0/-1
<4>general protection fault: 0000 [#1] SMP
<4>last sysfs file: /sys/devices/system/cpu/online
<4>CPU 7
<4>Modules linked in: osp(U) mdd(U) lfsck(U) lod(U) mdt(U) mgs(U) mgc(U) fsfilt_ldiskfs(U) osd_ldiskfs(U) lquota(U) lustre(U) lov(U) osc(U) mdc(U) fid(U) fld(U) ko2iblnd(U) ptlrpc(U) obdclass(U) lnet(U) lvfs(U) sha512_generic sha256_generic crc32c_intel libcfs(U) ldiskfs(U) jbd2 nf_conntrack_ipv4 nf_defrag_ipv4 xt_state nf_conntrack iptable_filter ip_tables nfsd exportfs nfs lockd fscache auth_rpcgss nfs_acl sunrpc cpufreq_ondemand acpi_cpufreq freq_table mperf ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_addr ipv6 microcode iTCO_wdt iTCO_vendor_support serio_raw i2c_i801 i2c_core sg lpc_ich mfd_core mlx4_ib ib_sa ib_mad ib_core mlx4_en mlx4_core e1000e ptp pps_core ioatdma dca i7core_edac edac_core shpchp ext3 jbd mbcache sd_mod crc_t10dif pata_acpi ata_generic ata_piix dm_mirror dm_region_hash dm_log dm_mod [last unloaded: nf_conntrack]
<4>
<4>Pid: 17120, comm: hsm_cdtr Tainted: G W --------------- 2.6.32-431.5.1.el6_lustre.x86_64 #1 Supermicro X8DTT-H/X8DTT-H
<4>RIP: 0010:[<ffffffff8128a039>] [<ffffffff8128a039>] strnlen+0x9/0x40
<4>RSP: 0018:ffff88039810dae0 EFLAGS: 00010286
<4>RAX: ffffffff817b5a4e RBX: ffff8803a9176000 RCX: 0000000000000002
<4>RDX: 5a5a5a5a5a5a5a66 RSI: ffffffffffffffff RDI: 5a5a5a5a5a5a5a66
<4>RBP: ffff88039810dae0 R08: 0000000000000073 R09: ffff88043839cd40
<4>R10: 0000000000000001 R11: 000000000000000f R12: ffff8803a9175134
<4>R13: 5a5a5a5a5a5a5a66 R14: 00000000ffffffff R15: 0000000000000000
<4>FS: 0000000000000000(0000) GS:ffff880045ce0000(0000) knlGS:0000000000000000
<4>CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b
<4>CR2: 0000000000448000 CR3: 000000082d703000 CR4: 00000000000007e0
<4>DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
<4>DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
<4>Process hsm_cdtr (pid: 17120, threadinfo ffff88039810c000, task ffff88082d41a080)
<4>Stack:
<4> ffff88039810db20 ffffffff8128b2f0 ffff880000033b48 ffff8803a9175134
<4><d> ffffffffa0ea175a ffffffffa0ea1758 ffff88039810dcb0 ffff8803a9176000
<4><d> ffff88039810dbc0 ffffffff8128c738 0000000000000004 0000000affffffff
<4>Call Trace:
<4> [<ffffffff8128b2f0>] string+0x40/0x100
<4> [<ffffffff8128c738>] vsnprintf+0x218/0x5e0
<4> [<ffffffffa052f27b>] ? cfs_set_ptldebug_header+0x2b/0xc0 [libcfs]
<4> [<ffffffffa053effa>] libcfs_debug_vmsg2+0x2ea/0xbc0 [libcfs]
<4> [<ffffffff81071e34>] ? warn_slowpath_common+0x94/0xc0
<4> [<ffffffff8100bb8e>] ? apic_timer_interrupt+0xe/0x20
<4> [<ffffffffa053f911>] libcfs_debug_msg+0x41/0x50 [libcfs]
<4> [<ffffffff81058d53>] ? __wake_up+0x53/0x70
<4> [<ffffffffa0e8f2af>] mdt_coordinator+0xf0f/0x16b0 [mdt]
<4> [<ffffffff810096f0>] ? __switch_to+0xd0/0x320
<4> [<ffffffff81065df0>] ? default_wake_function+0x0/0x20
<4> [<ffffffff81528090>] ? thread_return+0x4e/0x76e
<4> [<ffffffffa0e8e3a0>] ? mdt_coordinator+0x0/0x16b0 [mdt]
<4> [<ffffffff8109aee6>] kthread+0x96/0xa0
<4> [<ffffffff8100c20a>] child_rip+0xa/0x20
<4> [<ffffffff8109ae50>] ? kthread+0x0/0xa0
<4> [<ffffffff8100c200>] ? child_rip+0x0/0x20
<4>Code: 66 90 48 83 c2 01 80 3a 00 75 f7 48 89 d0 48 29 f8 c9 c3 66 66 66 66 66 66 2e 0f 1f 84 00 00 00 00 00 55 48 85 f6 48 89 e5 74 2e <80> 3f 00 74 29 48 83 ee 01 48 89 f8 eb 12 66 0f 1f 84 00 00 00
<1>RIP [<ffffffff8128a039>] strnlen+0x9/0x40
<4> RSP <ffff88039810dae0>
Landed for 2.8.0