LU-4065

sanity-hsm test_300 failure: 'cdt state is not stopped'

Details

    • Bug
    • Resolution: Fixed
    • Minor
    • Lustre 2.5.0
    • Lustre master build # 1715
      OpenSFS cluster with a combined MGS/MDS, a single OSS with two OSTs, and three client nodes: one running an agent plus a Lustre client, one running robinhood/db plus a Lustre client, and one running just as a Lustre client
    • 3
    • 10892

    Description

      The test results are at: https://maloo.whamcloud.com/test_sets/8e9cca2c-2c8b-11e3-85ee-52540035b04c

      From the client test_log:

      == sanity-hsm test 300: On disk coordinator state kept between MDT umount/mount == 14:22:47 (1380835367)
      Stop coordinator and remove coordinator state at mount
      mdt.scratch-MDT0000.hsm_control=shutdown
      Changed after 0s: from '' to 'stopping'
      Waiting 10 secs for update
      Updated after 8s: wanted 'stopped' got 'stopped'
      Failing mds1 on mds
      Stopping /lustre/scratch/mdt0 (opts:) on mds
      pdsh@c15: mds: ssh exited with exit code 1
      reboot facets: mds1
      Failover mds1 to mds
      14:23:15 (1380835395) waiting for mds network 900 secs ...
      14:23:15 (1380835395) network interface is UP
      mount facets: mds1
      Starting mds1:   /dev/sda3 /lustre/scratch/mdt0
      Started scratch-MDT0000
      c15: mdc.scratch-MDT0000-mdc-*.mds_server_uuid in FULL state after 25 sec
      Changed after 0s: from '' to 'enabled'
      Waiting 20 secs for update
      Waiting 10 secs for update
      Update not seen after 20s: wanted 'stopped' got 'enabled'
       sanity-hsm test_300: @@@@@@ FAIL: cdt state is not stopped 
        Trace dump:
        = /usr/lib64/lustre/tests/test-framework.sh:4264:error_noexit()
        = /usr/lib64/lustre/tests/test-framework.sh:4291:error()
        = /usr/lib64/lustre/tests/sanity-hsm.sh:298:cdt_check_state()
        = /usr/lib64/lustre/tests/sanity-hsm.sh:3063:test_300()
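
      For reference, the "Changed after ...", "Waiting N secs for update" and "Update not seen after ..." lines above come from the test framework repeatedly polling the coordinator state on the MDS until it matches the expected value or a timeout expires. A minimal sketch of that kind of polling loop, in the spirit of the test-framework.sh/sanity-hsm.sh helpers (the function name, arguments and defaults below are illustrative, not the exact framework code):

      #!/bin/bash
      # Illustrative sketch: poll hsm_control on the MDS until it reaches the
      # expected state or a timeout expires, echoing messages similar to the
      # test log above.
      wait_for_hsm_state() {
              local host=$1           # MDS hostname (example argument)
              local expected=$2       # e.g. "stopped"
              local maxwait=${3:-20}  # seconds before giving up
              local elapsed=0 state
              while [ $elapsed -le $maxwait ]; do
                      state=$(ssh $host "lctl get_param -n mdt.*.hsm_control" 2>/dev/null)
                      if [ "$state" = "$expected" ]; then
                              echo "Updated after ${elapsed}s: wanted '$expected' got '$state'"
                              return 0
                      fi
                      sleep 1
                      elapsed=$((elapsed + 1))
              done
              echo "Update not seen after ${maxwait}s: wanted '$expected' got '$state'"
              return 1
      }

      # Example (hypothetical host name):
      # wait_for_hsm_state mds1 stopped 20 || echo "cdt state is not stopped"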
      


        Activity

          [LU-4065] sanity-hsm test_300 failure: 'cdt state is not stopped'

          Landed for 2.8.0

          jgmitter Joseph Gmitter (Inactive) added a comment

          Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/12783/
          Subject: LU-4065 tests: hsm copytool_cleanup improvement
          Project: fs/lustre-release
          Branch: master
          Current Patch Set:
          Commit: 73bca6c1f4923cdf673fa11486aec04ec3576051

          gerrit Gerrit Updater added a comment

          There is a +1 from Andreas in Gerrit. Can somebody else do a review to move this forward?

          scherementsev Sergey Cheremencev added a comment

          Yes, we hit this bug in a lot of sanity-hsm subtests: 402, 3, 106 ...
          In our case it was a race between copytool_cleanup and cdt_set_mount_state, so usually copytool_cleanup failed.

          About why test_300 may fail (my view):

          1. cdt_set_mount_state sets the parameter using -P.
          2. "cdt_check_state stopped" waits for hsm_control to become "stopped" (after cdt_shutdown and cdt_clear_mount_state).
          3. At this moment (while "cdt_check_state stopped" waits for "stopped"), the MGC retrieves and applies the configuration from the server with hsm_control=enabled, which was set in step 1.

          Also, did you find the detailed behavior you have described by analyzing MGS/MDS nodes debug-log?

          yes

          About reproducing the problem: you may try to make a custom build with MGC_TIMEOUT_MIN_SECONDS = 10 or 15. If it does not break something else, it may help.
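
          For illustration, the race described in the steps above can be approximated by hand on a test MDS. This is only a rough sketch: the fsname "scratch" is an example, it must be run as root on a throwaway filesystem, and whether the poll actually sees the state flip back depends on the MGC log-processing timing.

          # Step 1: record hsm_control=enabled persistently in the MGS params log.
          lctl set_param -P mdt.scratch-MDT0000.hsm_control=enabled
          sleep 2
          # Step 2: ask the coordinator to shut down locally.
          lctl set_param mdt.scratch-MDT0000.hsm_control=shutdown
          # Step 3: poll for "stopped"; if the MGC replays the params log
          # (hsm_control=enabled) during this window, the state flips back
          # and the poll never sees "stopped".
          for i in $(seq 1 20); do
                  state=$(lctl get_param -n mdt.scratch-MDT0000.hsm_control)
                  echo "after ${i}s: hsm_control=$state"
                  [ "$state" = "stopped" ] && break
                  sleep 1
          done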

          scherementsev Sergey Cheremencev added a comment

          Sergei, thanks for this comment and patch, but I just wonder if you found the related issue/failure only when running the sanity-hsm/test_300 sub-test, or also with other uses of "cdt_set_mount_state()" within sanity-hsm? BTW, I don't understand why you refer to copytool_cleanup() in your description.

          Also, did you find the detailed behavior you have described by analyzing the MGS/MDS nodes' debug logs?

          And last, your patch has failed to pass auto-tests due to unrelated failures, and I think you need to rebase it to avoid such issues.

          Last, an auto-tests/Maloo query indicates that the issue for this ticket has been encountered only once (on 2014-11-19 15:49:17 UTC) during the past 4 months. I am presently analyzing this failure's associated MDS/MGS debug log, and in parallel I am trying to reproduce it in-house.

          bfaccini Bruno Faccini (Inactive) added a comment

          Hello, we faced the same error at Seagate and are now using the following solution: http://review.whamcloud.com/#/c/12783/.

          Short explanation:

          1. "cdt_set_mount_state enabled" sets params on server. It uses lctl set_param -P.
          2. copytool_cleanup set hsm_control=shutdown and waiting when hsm_control becames "stopped"(after "shutdown").
          3. At this moment(copytool_cleanup waits for "stopped") mgc retrieves and applys configuration from server with hsm_control=enabled that was set in step 1.
          4. Mgc starts log processing with delay about 10 seconds (see mgc_requeue_thread).
            copytool_cleanup waits 20 seconds and get hsm_control==enabled because this parameter was modified at step 3.
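
          One way to avoid this ordering problem is to delete the persistent hsm_control setting from the MGS params log before waiting for "stopped", so that a late config-log replay cannot switch the state back to "enabled" during the wait. Below is a simplified sketch of that idea (illustrative only, not necessarily the exact change made in http://review.whamcloud.com/12783):

          # Sketch of a cleanup that removes the persistent parameter first,
          # then shuts the coordinator down and waits for "stopped".
          hsm_shutdown_and_wait() {
                  local mdt=$1    # e.g. scratch-MDT0000 (example target name)
                  # Delete the persistent setting so no later config-log replay
                  # can switch hsm_control back to "enabled".
                  lctl set_param -P -d mdt.$mdt.hsm_control
                  # Request the shutdown on the running MDT.
                  lctl set_param mdt.$mdt.hsm_control=shutdown
                  # Wait for the coordinator to reach "stopped".
                  local i state
                  for i in $(seq 1 20); do
                          state=$(lctl get_param -n mdt.$mdt.hsm_control)
                          [ "$state" = "stopped" ] && return 0
                          sleep 1
                  done
                  echo "hsm_control still '$state' after 20s" >&2
                  return 1
          }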
          scherementsev Sergey Cheremencev added a comment

          James, according to this auto-test failure report it is unclear to me how this happened ...
          BTW, I ran a Maloo/auto-tests search for sanity-hsm/test_300 failures; it seems the test runs pretty well, and the failure you reported was the last and only one during the past 2 months:

          Name            Status          Run at                  Duration        Return code     Error                                           Test log        Bugs
          test_300        FAIL    2014-09-05 01:47:58 UTC         31              1               Restart of mds1 failed!                         Preview 10
          test_300        FAIL    2014-09-02 13:14:59 UTC         80              1               hsm_control state is not 'stopped' on mds1      Preview 10
          test_300        FAIL    2014-08-27 01:19:32 UTC         107             1               post-failover df: 1                             Preview 10
          test_300        FAIL    2014-08-21 19:55:39 UTC         714             1               import is not in FULL state                     Preview 10      LU-4018
          test_300        FAIL    2014-07-01 22:43:31 UTC         599             1               hsm_control state is not 'stopped' on mds1      Preview 10
          test_300        FAIL    2014-07-01 10:47:59 UTC         107             1               hsm_control state is not 'stopped' on mds1      Preview 10
          test_300        FAIL    2014-06-21 14:32:23 UTC         723             1               import is not in FULL state                     Preview 10      LU-4018
          test_300        FAIL    2014-06-02 17:16:48 UTC         331             1               hsm_control state is not 'stopped' on mds1      Preview 10
          test_300        FAIL    2014-05-28 05:46:52 UTC         714             1               import is not in FULL state                     Preview 10      LU-4018
          test_300        FAIL    2014-05-27 09:56:15 UTC         975             1               import is not in FULL state                     Preview 10      LU-4018
          test_300        FAIL    2014-05-22 19:10:52 UTC         105             1               hsm_control state is not 'stopped' on mds1      Preview 10      LU-4065
          test_300        FAIL    2014-05-08 18:55:20 UTC         716             1               import is not in FULL state                     Preview 10      LU-4018
          test_300        FAIL    2014-05-07 22:34:02 UTC         746             1               import is not in FULL state                     Preview 10      LU-4018
          test_300        FAIL    2014-03-26 22:02:07 UTC         77              1               hsm_control state is not 'stopped' on mds1      Preview 10      LU-4125, LU-4065
          test_300        FAIL    2014-03-11 20:01:49 UTC         79              1               hsm_control state is not 'stopped' on mds1      Preview 10      LU-4065, LU-4065
          test_300        FAIL    2013-12-08 19:39:12 UTC         755             1               import is not in FULL state                                     LU-4018, LU-4361
          
          bfaccini Bruno Faccini (Inactive) added a comment

          I've run into this error again while testing HSM with the 2.5.3-RC1 version of Lustre on the OpenSFS cluster. The results are at https://testing.hpdd.intel.com/test_sessions/60308f12-3472-11e4-995a-5254006e85c2

          jamesanunez James Nunez (Inactive) added a comment

          Patch landed to Master. Please reopen ticket if more work is needed.

          jlevi Jodi Levi (Inactive) added a comment

          Since I'm testing HSM on 2.5.1, I thought I would collect the information Bruno asked for on 09/Oct/13. I ran sanity-hsm test 300 alone and got the "cdt state not stopped on mds1" failure. On the MDT, the first few commands succeeded, but the unmount hung and the node later crashed.

          Here's what I see in dmesg captured by crash:

          <4>Lustre: DEBUG MARKER: == sanity-hsm test 300: On disk coordinator state kept between MDT umount/mount == 09:30:44 (1394641844)
          <6>Lustre: Disabling parameter general.mdt.lscratch-MDT0000.hsm_control in log params
          <4>Lustre: Failing over lscratch-MDT0000
          <3>LustreError: 137-5: lscratch-MDT0000_UUID: not available for connect from 192.168.2.118@o2ib (no target). If you are running an HA pair check that the target is mounted on the other server.
          <4>Lustre: 16111:0:(client.c:1901:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1394641848/real 1394641848]  req@ffff88039ae00800 x1462381936924596/t0(0) o251->MGC192.168.2.116@o2ib@0@lo:26/25 lens 224/224 e 0 to 1 dl 1394641854 ref 2 fl Rpc:XN/0/ffffffff rc 0/-1
          <4>Lustre: server umount lscratch-MDT0000 complete
          <6>LDISKFS-fs (sda3): mounted filesystem with ordered data mode. quota=on. Opts:
           
          <3>LustreError: 137-5: lscratch-MDT0000_UUID: not available for connect from 192.168.2.119@o2ib (no target). If you are running an HA pair check that the target is mounted on the other server.
          <4>Lustre: lscratch-MDT0000: used disk, loading
          <6>format at mdt_coordinator.c:977:mdt_hsm_cdt_start doesn't end in newline
          <3>LustreError: 16267:0:(mdt_coordinator.c:977:mdt_hsm_cdt_start()) lscratch-MDT0000: cannot take the layout locks needed for registered restore: -2
          <4>Lustre: MGS: non-config logname received: params
          <3>LustreError: 11-0: lscratch-MDT0000-lwp-MDT0000: Communicating with 0@lo, operation mds_connect failed with -11.
          <6>Lustre: lscratch-MDD0000: changelog on
          <4>Lustre: lscratch-MDT0000: Will be in recovery for at least 5:00, or until 5 clients reconnect
          <6>Lustre: lscratch-MDT0000: Recovery over after 0:29, of 5 clients 5 recovered and 0 were evicted.
          <4>Lustre: DEBUG MARKER: mdc.lscratch-MDT0000-mdc-*.mds_server_uuid in FULL state after 29 sec
          <4>Lustre: MGS: non-config logname received: params
          <4>Lustre: Skipped 4 previous similar messages
          <4>Lustre: DEBUG MARKER: sanity-hsm test_300: @@@@@@ FAIL: hsm_control state is not 'stopped' on mds1
          <4>Lustre: DEBUG MARKER: == sanity-hsm test complete, duration 92 sec == 09:32:06 (1394641926)
          <4>Lustre: DEBUG MARKER: shutdown
          <4>Lustre: DEBUG MARKER: set_param-d-P
          <6>Lustre: Disabling parameter general.mdt.lscratch-MDT0000.hsm_control in log params
          <3>LustreError: 17173:0:(mgs_handler.c:744:mgs_iocontrol()) MGS: setparam err: rc = 1
          <4>Lustre: DEBUG MARKER: get_param
          <4>Lustre: DEBUG MARKER: umount
          <4>Lustre: Failing over lscratch-MDT0000
          <3>LustreError: 137-5: lscratch-MDT0000_UUID: not available for connect from 192.168.2.118@o2ib (no target). If you are running an HA pair check that the target is mounted on the other server.
          <4>------------[ cut here ]------------
          <4>WARNING: at lib/list_debug.c:48 list_del+0x6e/0xa0() (Not tainted)
          <4>Hardware name: X8DTT-H
          <4>list_del corruption. prev->next should be ffff88039810de18, but was 5a5a5a5a5a5a5a5a
          <4>Modules linked in: osp(U) mdd(U) lfsck(U) lod(U) mdt(U) mgs(U) mgc(U) fsfilt_ldiskfs(U) osd_ldiskfs(U) lquota(U) lustre(U) lov(U) osc(U) mdc(U) fid(U) fld(U) ko2iblnd(U) ptlrpc(U) obdclass(U) lnet(U) lvfs(U) sha512_generic sha256_generic crc32c_intel libcfs(U) ldiskfs(U) jbd2 nf_conntrack_ipv4 nf_defrag_ipv4 xt_state nf_conntrack iptable_filter ip_tables nfsd exportfs nfs lockd fscache auth_rpc
          gss nfs_acl sunrpc cpufreq_ondemand acpi_cpufreq freq_table mperf ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_addr ipv6 microcode iTCO_wdt iTCO_vendor_support serio_raw i2c_i801 i2c_core sg lpc_ich mfd_core mlx4_ib ib_sa ib_mad ib_core mlx4_en mlx4_core e1000e ptp pps_core ioatdma dca i7core_edac edac_core shpchp ext3 jbd mbcache sd_mod crc_t10dif pata_acpi ata_generic ata_pi
          ix dm_mirror dm_region_hash dm_log dm_mod [last unloaded: nf_conntrack]
          <4>Pid: 17120, comm: hsm_cdtr Not tainted 2.6.32-431.5.1.el6_lustre.x86_64 #1
          <4>Call Trace:
          <4> [<ffffffff81071e27>] ? warn_slowpath_common+0x87/0xc0
          <4> [<ffffffff81071f16>] ? warn_slowpath_fmt+0x46/0x50
          <4> [<ffffffff81084220>] ? process_timeout+0x0/0x10
          <4> [<ffffffff8129489e>] ? list_del+0x6e/0xa0
          <4> [<ffffffff8109b681>] ? remove_wait_queue+0x31/0x50
          <4> [<ffffffffa0e8ef6e>] ? mdt_coordinator+0xbce/0x16b0 [mdt]
          <4> [<ffffffff810096f0>] ? __switch_to+0xd0/0x320
          <4> [<ffffffff81065df0>] ? default_wake_function+0x0/0x20
          <4> [<ffffffff81528090>] ? thread_return+0x4e/0x76e
          <4> [<ffffffffa0e8e3a0>] ? mdt_coordinator+0x0/0x16b0 [mdt]
          <4> [<ffffffff8109aee6>] ? kthread+0x96/0xa0
          <4> [<ffffffff8100c20a>] ? child_rip+0xa/0x20
          <4> [<ffffffff8109ae50>] ? kthread+0x0/0xa0
          <4> [<ffffffff8100c200>] ? child_rip+0x0/0x20
          <4>---[ end trace 4c18e762bb5c5a65 ]---
          <4>------------[ cut here ]------------
          <4>WARNING: at lib/list_debug.c:51 list_del+0x8d/0xa0() (Tainted: G        W  --
          -------------   )
          <4>Hardware name: X8DTT-H
          <4>list_del corruption. next->prev should be ffff88039810de18, but was 5a5a5a5a5a5a5a5a
          <4>Modules linked in: osp(U) mdd(U) lfsck(U) lod(U) mdt(U) mgs(U) mgc(U) fsfilt_ldiskfs(U) osd_ldiskfs(U) lquota(U) lustre(U) lov(U) osc(U) mdc(U) fid(U) fld(U) ko2iblnd(U) ptlrpc(U) obdclass(U) lnet(U) lvfs(U) sha512_generic sha256_generic crc32c_intel libcfs(U) ldiskfs(U) jbd2 nf_conntrack_ipv4 nf_defrag_ipv4 xt_state nf_conntrack iptable_filter ip_tables nfsd exportfs nfs lockd fscache auth_rpc
          gss nfs_acl sunrpc cpufreq_ondemand acpi_cpufreq freq_table mperf ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_addr ipv6 microcode iTCO_wdt iTCO_vendor_support serio_raw i2c_i801 i2c_core sg lpc_ich mfd_core mlx4_ib ib_sa ib_mad ib_core mlx4_en mlx4_core e1000e ptp pps_core ioatdma dca i7core_edac edac_core shpchp ext3 jbd mbcache sd_mod crc_t10dif pata_acpi ata_generic ata_pi
          ix dm_mirror dm_region_hash dm_log dm_mod [last unloaded: nf_conntrack]
          <4>Pid: 17120, comm: hsm_cdtr Tainted: G        W  ---------------    2.6.32-431.5.1.el6_lustre.x86_64 #1
          <4>Call Trace:
          <4> [<ffffffff81071e27>] ? warn_slowpath_common+0x87/0xc0
          <4> [<ffffffff81071f16>] ? warn_slowpath_fmt+0x46/0x50
          <4> [<ffffffff81084220>] ? process_timeout+0x0/0x10
          <4> [<ffffffff812948bd>] ? list_del+0x8d/0xa0
          <4> [<ffffffff8109b681>] ? remove_wait_queue+0x31/0x50
          <4> [<ffffffffa0e8ef6e>] ? mdt_coordinator+0xbce/0x16b0 [mdt]
          <4> [<ffffffff810096f0>] ? __switch_to+0xd0/0x320
          <4> [<ffffffff81065df0>] ? default_wake_function+0x0/0x20
          <4> [<ffffffff81528090>] ? thread_return+0x4e/0x76e
          <4> [<ffffffffa0e8e3a0>] ? mdt_coordinator+0x0/0x16b0 [mdt]
          <4> [<ffffffff8109aee6>] ? kthread+0x96/0xa0
          <4> [<ffffffff8100c20a>] ? child_rip+0xa/0x20
          <4> [<ffffffff8109ae50>] ? kthread+0x0/0xa0
          <4> [<ffffffff8100c200>] ? child_rip+0x0/0x20
          <4>---[ end trace 4c18e762bb5c5a66 ]---
          <4>Lustre: 17182:0:(client.c:1901:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1394641933/real 1394641933]  req@ffff880392bf6c00 x1462381936924904/t0(0) o251->MGC192.168.2.116@o2ib@0@lo:26/25 lens 224/224 e 0 to 1 dl 1394641939 ref 2 fl Rpc:XN/0/ffffffff rc 0/-1
          <4>general protection fault: 0000 [#1] SMP 
          <4>last sysfs file: /sys/devices/system/cpu/online
          <4>CPU 7 
          <4>Modules linked in: osp(U) mdd(U) lfsck(U) lod(U) mdt(U) mgs(U) mgc(U) fsfilt_ldiskfs(U) osd_ldiskfs(U) lquota(U) lustre(U) lov(U) osc(U) mdc(U) fid(U) fld(U) ko2iblnd(U) ptlrpc(U) obdclass(U) lnet(U) lvfs(U) sha512_generic sha256_generic crc32c_intel libcfs(U) ldiskfs(U) jbd2 nf_conntrack_ipv4 nf_defrag_ipv4 xt_state nf_conntrack iptable_filter ip_tables nfsd exportfs nfs lockd fscache auth_rpc
          gss nfs_acl sunrpc cpufreq_ondemand acpi_cpufreq freq_table mperf ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_addr ipv6 microcode iTCO_wdt iTCO_vendor_support serio_raw i2c_i801 i2c_core sg lpc_ich mfd_core mlx4_ib ib_sa ib_mad ib_core mlx4_en mlx4_core e1000e ptp pps_core ioatdma dca i7core_edac edac_core shpchp ext3 jbd mbcache sd_mod crc_t10dif pata_acpi ata_generic ata_pi
          ix dm_mirror dm_region_hash dm_log dm_mod [last unloaded: nf_conntrack]
          <4>
          <4>Pid: 17120, comm: hsm_cdtr Tainted: G        W  ---------------    2.6.32-431.5.1.el6_lustre.x86_64 #1 Supermicro X8DTT-H/X8DTT-H
          <4>RIP: 0010:[<ffffffff8128a039>]  [<ffffffff8128a039>] strnlen+0x9/0x40
          <4>RSP: 0018:ffff88039810dae0  EFLAGS: 00010286
          <4>RAX: ffffffff817b5a4e RBX: ffff8803a9176000 RCX: 0000000000000002
          <4>RDX: 5a5a5a5a5a5a5a66 RSI: ffffffffffffffff RDI: 5a5a5a5a5a5a5a66
          <4>RBP: ffff88039810dae0 R08: 0000000000000073 R09: ffff88043839cd40
          <4>R10: 0000000000000001 R11: 000000000000000f R12: ffff8803a9175134
          <4>R13: 5a5a5a5a5a5a5a66 R14: 00000000ffffffff R15: 0000000000000000
          <4>FS:  0000000000000000(0000) GS:ffff880045ce0000(0000) knlGS:0000000000000000
          <4>CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
          <4>CR2: 0000000000448000 CR3: 000000082d703000 CR4: 00000000000007e0
          <4>DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
          <4>DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
          <4>Process hsm_cdtr (pid: 17120, threadinfo ffff88039810c000, task ffff88082d41a080)
          <4>Stack:
          <4> ffff88039810db20 ffffffff8128b2f0 ffff880000033b48 ffff8803a9175134
          <4><d> ffffffffa0ea175a ffffffffa0ea1758 ffff88039810dcb0 ffff8803a9176000
          <4><d> ffff88039810dbc0 ffffffff8128c738 0000000000000004 0000000affffffff
          <4>Call Trace:
          <4> [<ffffffff8128b2f0>] string+0x40/0x100
          <4> [<ffffffff8128c738>] vsnprintf+0x218/0x5e0
          <4> [<ffffffffa052f27b>] ? cfs_set_ptldebug_header+0x2b/0xc0 [libcfs]
          <4> [<ffffffffa053effa>] libcfs_debug_vmsg2+0x2ea/0xbc0 [libcfs]
          <4> [<ffffffff81071e34>] ? warn_slowpath_common+0x94/0xc0
          <4> [<ffffffff8100bb8e>] ? apic_timer_interrupt+0xe/0x20
          <4> [<ffffffffa053f911>] libcfs_debug_msg+0x41/0x50 [libcfs]
          <4> [<ffffffff81058d53>] ? __wake_up+0x53/0x70
          <4> [<ffffffffa0e8f2af>] mdt_coordinator+0xf0f/0x16b0 [mdt]
          <4> [<ffffffff810096f0>] ? __switch_to+0xd0/0x320
          <4> [<ffffffff81065df0>] ? default_wake_function+0x0/0x20
          <4> [<ffffffff81528090>] ? thread_return+0x4e/0x76e
          <4> [<ffffffffa0e8e3a0>] ? mdt_coordinator+0x0/0x16b0 [mdt]
          <4> [<ffffffff8109aee6>] kthread+0x96/0xa0
          <4> [<ffffffff8100c20a>] child_rip+0xa/0x20
          <4> [<ffffffff8109ae50>] ? kthread+0x0/0xa0
          <4> [<ffffffff8100c200>] ? child_rip+0x0/0x20
          <4>Code: 66 90 48 83 c2 01 80 3a 00 75 f7 48 89 d0 48 29 f8 c9 c3 66 66 66 66 66 66 2e 0f 1f 84 00 00 00 00 00 55 48 85 f6 48 89 e5 74 2e <80> 3f 00 74 29 48 83 ee 01 48 89 f8 eb 12 66 0f 1f 84 00 00 00 
          <1>RIP  [<ffffffff8128a039>] strnlen+0x9/0x40
          <4> RSP <ffff88039810dae0>
          
          
          jamesanunez James Nunez (Inactive) added a comment

          People

            bfaccini Bruno Faccini (Inactive)
            jamesanunez James Nunez (Inactive)
            Votes: 0
            Watchers: 13
