[LU-4065] sanity-hsm test_300 failure: 'cdt state is not stopped' Created: 04/Oct/13 Updated: 10/Apr/17 Resolved: 02/Oct/15 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.5.0 |
| Fix Version/s: | Lustre 2.6.0, Lustre 2.5.2, Lustre 2.8.0 |
| Type: | Bug | Priority: | Minor |
| Reporter: | James Nunez (Inactive) | Assignee: | Bruno Faccini (Inactive) |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | HSM | ||
| Environment: |
Lustre master build # 1715 |
||
| Attachments: |
|
||||
| Issue Links: |
|
||||
| Severity: | 3 | ||||
| Rank (Obsolete): | 10892 | ||||
| Description |
|
The test results are at: https://maloo.whamcloud.com/test_sets/8e9cca2c-2c8b-11e3-85ee-52540035b04c

From the client test_log:

== sanity-hsm test 300: On disk coordinator state kept between MDT umount/mount == 14:22:47 (1380835367)
Stop coordinator and remove coordinator state at mount
mdt.scratch-MDT0000.hsm_control=shutdown
Changed after 0s: from '' to 'stopping'
Waiting 10 secs for update
Updated after 8s: wanted 'stopped' got 'stopped'
Failing mds1 on mds
Stopping /lustre/scratch/mdt0 (opts:) on mds
pdsh@c15: mds: ssh exited with exit code 1
reboot facets: mds1
Failover mds1 to mds
14:23:15 (1380835395) waiting for mds network 900 secs ...
14:23:15 (1380835395) network interface is UP
mount facets: mds1
Starting mds1: /dev/sda3 /lustre/scratch/mdt0
Started scratch-MDT0000
c15: mdc.scratch-MDT0000-mdc-*.mds_server_uuid in FULL state after 25 sec
Changed after 0s: from '' to 'enabled'
Waiting 20 secs for update
Waiting 10 secs for update
Update not seen after 20s: wanted 'stopped' got 'enabled'
sanity-hsm test_300: @@@@@@ FAIL: cdt state is not stopped
Trace dump:
= /usr/lib64/lustre/tests/test-framework.sh:4264:error_noexit()
= /usr/lib64/lustre/tests/test-framework.sh:4291:error()
= /usr/lib64/lustre/tests/sanity-hsm.sh:298:cdt_check_state()
= /usr/lib64/lustre/tests/sanity-hsm.sh:3063:test_300() |
| Comments |
| Comment by jacques-charles lafoucriere [ 06/Oct/13 ] |
|
We have recently seen that lctl set_param -P was not working as expected |
| Comment by Bruno Faccini (Inactive) [ 07/Oct/13 ] |
|
Hello JC, do you mean the cdt_state permanent/on-disk change may have failed? |
| Comment by Jinshan Xiong (Inactive) [ 07/Oct/13 ] |
|
Hi JC, if I remember correctly, we had a conversation about making procfs exported all the time so that we can set the parameter from the configuration. Can you refresh my memory on this? |
| Comment by jacques-charles lafoucriere [ 08/Oct/13 ] |
|
During recent "by hand" tests, we saw that lctl set_param -P was not working, just like in test 300, but Gerrit testing is usually OK. We have also seen with the latest master that sanity-hsm no longer even starts when run manually at our site. Thomas or Aurelien should open a ticket with the details. |
| Comment by John Hammond [ 08/Oct/13 ] |
|
JC, when you test locally do you install Lustre on all nodes or do you run from a build directory? Using 'lctl set_param -P' requires that lctl be installed at /usr/sbin/lctl on all nodes. See LU-4041. |
| Comment by Aurelien Degremont (Inactive) [ 09/Oct/13 ] |
|
I confirm. We are doing this in our test environment where sanity-hsm is run without Lustre utils being installed in standard paths. This is really LU-4041. |
| Comment by John Hammond [ 09/Oct/13 ] |
|
Using set_param ...hsm_control=shutdown on a CDT that is already stopped will create a state where the CDT cannot be started:

# llmount.sh
# lctl get_param mdt.lustre-MDT0000.hsm_control
mdt.lustre-MDT0000.hsm_control=stopped
# lctl set_param mdt.lustre-MDT0000.hsm_control=shutdown
mdt.lustre-MDT0000.hsm_control=shutdown
# lctl get_param mdt.lustre-MDT0000.hsm_control
mdt.lustre-MDT0000.hsm_control=stopping
# lctl set_param mdt.lustre-MDT0000.hsm_control=enabled
mdt.lustre-MDT0000.hsm_control=enabled
error: set_param: setting /proc/fs/lustre/mdt/lustre-MDT0000/hsm_control=enabled: Operation already in progress
# lctl get_param mdt.lustre-MDT0000.hsm_control
mdt.lustre-MDT0000.hsm_control=stopping
# lctl set_param mdt.lustre-MDT0000.hsm_control=disabled
mdt.lustre-MDT0000.hsm_control=disabled
# lctl get_param mdt.lustre-MDT0000.hsm_control
mdt.lustre-MDT0000.hsm_control=disabled
# lctl set_param mdt.lustre-MDT0000.hsm_control=enabled
mdt.lustre-MDT0000.hsm_control=enabled
# lctl get_param mdt.lustre-MDT0000.hsm_control
mdt.lustre-MDT0000.hsm_control=enabled
# pgrep hsm
#

Amusingly, if you try to remount the MDT to resolve this state then you get the following.

Message from syslogd@t at Oct 9 13:00:45 ... kernel:LustreError: 7626:0:(mdt_coordinator.c:417:hsm_cdt_procfs_fini()) ASSERTION( cdt->cdt_state == CDT_STOPPED ) failed:
Message from syslogd@t at Oct 9 13:00:45 ... kernel:LustreError: 7626:0:(mdt_coordinator.c:417:hsm_cdt_procfs_fini()) LBUG |
| Comment by John Hammond [ 09/Oct/13 ] |
|
I think we need some more protective logic in lprocfs_wr_hsm_cdt_control() and its callees. Shouldn't every write to hsm_control force a call to mdt_hsm_cdt_wakeup()? |
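As an illustration of that suggestion, here is a minimal sketch of write-side validation, assuming a simplified state model: the enum, cdt_control_write() and the cdt_wakeup() stub are hypothetical stand-ins, not the real lprocfs_wr_hsm_cdt_control() / mdt_hsm_cdt_wakeup() code in mdt_coordinator.c.

```c
/* Hypothetical sketch only -- simplified stand-ins for the hsm_control
 * write path, not the actual Lustre code. */
#include <errno.h>
#include <string.h>

enum cdt_state { CDT_STOPPED, CDT_RUNNING, CDT_DISABLED, CDT_STOPPING };

/* placeholder for the real wakeup of the coordinator thread */
static void cdt_wakeup(void) { }

/* Validate the requested transition before touching the state, and wake the
 * coordinator on every accepted write so it re-reads the state and actually
 * completes the transition. */
static int cdt_control_write(enum cdt_state *state, const char *cmd)
{
	if (strcmp(cmd, "shutdown") == 0) {
		if (*state == CDT_STOPPED)
			return 0;		/* nothing to stop, stay stopped */
		if (*state == CDT_STOPPING)
			return -EALREADY;	/* stop already in progress */
		*state = CDT_STOPPING;
	} else if (strcmp(cmd, "enabled") == 0) {
		if (*state == CDT_STOPPING)
			return -EBUSY;		/* must finish stopping first */
		*state = CDT_RUNNING;
	} else if (strcmp(cmd, "disabled") == 0) {
		if (*state == CDT_STOPPED || *state == CDT_STOPPING)
			return -EINVAL;		/* nothing running to disable */
		*state = CDT_DISABLED;
	} else {
		return -EINVAL;
	}

	cdt_wakeup();
	return 0;
}
```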
| Comment by Bruno Faccini (Inactive) [ 25/Oct/13 ] |
|
Assigned to me since I am working on a fix for |
| Comment by Bruno Faccini (Inactive) [ 25/Oct/13 ] |
|
Concerning the cdt_state that can easily become stale, I think I have a patch for it. The problem seems to be that both the shutdown/disabled commands assume the CDT is started and thus simply set cdt_state=STOPPING/DISABLED (expecting the CDT thread to check it and gracefully shut down or go to sleep!), so if you then umount/stop the MDT you simply trigger the ASSERTION( cdt->cdt_state == CDT_STOPPED ) in hsm_cdt_procfs_fini(). Better/enforced checks in lprocfs_wr_hsm_cdt_control() should fix this. Even if I don't think it is fully safe (against concurrent hsm_control/hsm_state updates …), my patch at http://review.whamcloud.com/8074 should prevent such inaccurate cdt_state values. |
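To make the stale-state scenario above concrete, here is a minimal sketch of the coordinator-thread side, again with hypothetical names rather than the real mdt_coordinator() loop: only this thread publishes CDT_STOPPED, so if it is not running, a blind write of STOPPING is never acknowledged and the umount-time assertion fires.

```c
/* Hypothetical sketch of the coordinator thread's role in the scenario
 * described above; the real loop in mdt_coordinator.c is more involved. */
#include <stdbool.h>

enum cdt_state { CDT_STOPPED, CDT_RUNNING, CDT_DISABLED, CDT_STOPPING };

static void coordinator_thread(enum cdt_state *state)
{
	while (true) {
		if (*state == CDT_STOPPING) {
			/* Only this thread moves STOPPING -> STOPPED, which is
			 * the state hsm_cdt_procfs_fini() asserts at umount.
			 * If the thread was never started, nobody ever does
			 * this and the state stays stale. */
			*state = CDT_STOPPED;
			return;
		}
		if (*state == CDT_RUNNING) {
			/* ... process queued HSM actions ... */
		}
		/* otherwise sleep until a write to hsm_control wakes us */
	}
}
```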
| Comment by Bruno Faccini (Inactive) [ 20/Nov/13 ] |
|
Patch has landed, but I wonder if we need to add an extra check/test in sanity-hsm to verify that even with multiple/random/replicated hsm_control settings (like John's series to trigger the LBUG), cdt_state is kept accurate? |
| Comment by James Nunez (Inactive) [ 05/Dec/13 ] |
|
sanity-hsm test 300 is still failing on the OpenSFS cluster with master build #1790. I've uploaded results from two runs of sanity-hsm where test 300 fails at https://maloo.whamcloud.com/test_sessions/b223c634-5d51-11e3-ad71-52540035b04c and https://maloo.whamcloud.com/test_sessions/643460bc-5d4e-11e3-956b-52540035b04c . Test 300 consistently fails on this cluster. |
| Comment by Bruno Faccini (Inactive) [ 09/Dec/13 ] |
|
In the 2 test sessions you just uploaded, sanity-hsm/test_300 seems to fail because the permanent/on-disk config llog setting ("lctl set_param -d -P mdt.<FSNAME>-MDT0000.hsm_control" in the sanity-hsm cdt_clear_mount_state() function) is mysteriously not re-read after the umount/re-mount:

Lustre: DEBUG MARKER: == sanity-hsm test 300: On disk coordinator state kept between MDT umount/mount == 12:30:45 (1386189045)
Lustre: Disabling parameter general.mdt.scratch-MDT0000.hsm_control in log params <<<<<--- config log set
Lustre: Failing over scratch-MDT0000
LustreError: 137-5: scratch-MDT0000_UUID: not available for connect from 192.168.2.112@o2ib (no target). If you are running an HA pair check that the target is mounted on the other server.
LustreError: Skipped 1 previous similar message
LustreError: 137-5: scratch-MDT0000_UUID: not available for connect from 192.168.2.112@o2ib (no target). If you are running an HA pair check that the target is mounted on the other server.
Lustre: 9522:0:(client.c:1903:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1386189056/real 1386189056] req@ffff8803cc092000 x1453343831874984/t0(0) o251->MGC192.168.2.108@o2ib@0@lo:26/25 lens 224/224 e 0 to 1 dl 1386189062 ref 2 fl Rpc:XN/0/ffffffff rc 0/-1
Lustre: server umount scratch-MDT0000 complete
LDISKFS-fs (sda3): mounted filesystem with ordered data mode. quota=on. Opts:
format at mdt_coordinator.c:977:mdt_hsm_cdt_start doesn't end in newline
LustreError: 9685:0:(mdt_coordinator.c:977:mdt_hsm_cdt_start()) scratch-MDT0000: cannot take the layout locks needed for registered restore: -2
Lustre: MGS: non-config logname received: params <<<<----- PB ??
Lustre: Skipped 6 previous similar messages
LustreError: 11-0: scratch-MDT0000-lwp-MDT0000: Communicating with 0@lo, operation mds_connect failed with -11.
Lustre: scratch-MDD0000: changelog on
Lustre: DEBUG MARKER: mdc.scratch-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: scratch-MDT0000: Will be in recovery for at least 5:00, or until 7 clients reconnect
LustreError: 9700:0:(ldlm_lib.c:1733:check_for_next_transno()) scratch-MDT0000: waking for gap in transno, VBR is OFF (skip: 77309411692, ql: 1, comp: 6, conn: 7, next: 77309411694, last_committed: 77309411691)
Lustre: scratch-MDT0000: Recovery over after 0:21, of 7 clients 7 recovered and 0 were evicted.
Lustre: DEBUG MARKER: sanity-hsm test_300: @@@@@@ FAIL: hsm_control state is not 'stopped' on mds1

But in fact I see the same on my test platform, where I am actually unable to reproduce any sanity-hsm/test_300 sub-test failure … James, I am not sure I remember correctly, but didn't you say that test_300 can even fail when run as the single sub-test of sanity-hsm? If so, can you run the following sequence manually with full debug-logs enabled on the MDS/MGS:

lctl set_param debug=-1
lctl clear
lctl mark shutdown
lctl set_param mdt.<FSNAME>-MDT0000.hsm_control=shutdown
lctl mark set_param-d-P
lctl set_param -d -P mdt.<FSNAME>-MDT0000.hsm_control
lctl mark get_param
lctl get_param mdt.<FSNAME>-MDT0000.hsm_control
lctl mark umount
umount <MDT>
lctl mark mount
mount <MDT>
sleep 10
lctl mark get_param
lctl get_param mdt.<FSNAME>-MDT0000.hsm_control |
| Comment by James Nunez (Inactive) [ 10/Dec/13 ] |
|
Yes, I can get test_300 to fail running just that test. I ran sanity-hsm test 300 and it failed, and then I ran the commands you asked for with the following output:

[root@c08 ~]# lctl set_param debug=-1
debug=-1
[root@c08 ~]# lctl clear
[root@c08 ~]# lctl mark shutdown
[root@c08 ~]# lctl set_param mdt.scratch-MDT0000.hsm_control=shutdown
mdt.scratch-MDT0000.hsm_control=shutdown
[root@c08 ~]# lctl mark set_param-d-P
[root@c08 ~]# lctl set_param -d -P mdt.scratch-MDT0000.hsm_control
error: executing set_param: Success
[root@c08 ~]# lctl mark get_param
[root@c08 ~]# lctl get_param mdt.scratch-MDT0000.hsm_control
mdt.scratch-MDT0000.hsm_control=stopped
[root@c08 ~]# umount /lustre/scratch/mdt0
[root@c08 ~]# lctl mark mount
[root@c08 ~]# mount -t lustre /dev/sda3 /lustre/scratch/mdt0
[root@c08 ~]# sleep 10
[root@c08 ~]# lctl mark get_param
[root@c08 ~]# lctl get_param mdt.scratch-MDT0000.hsm_control
mdt.scratch-MDT0000.hsm_control=enabled

dmesg on the MDS/MGS:

Lustre: DEBUG MARKER: shutdown
Lustre: MGS: non-config logname received: params
Lustre: Skipped 8 previous similar messages
Lustre: DEBUG MARKER: set_param-d-P
Lustre: Disabling parameter general.mdt.scratch-MDT0000.hsm_control in log params
LustreError: 7226:0:(mgs_handler.c:744:mgs_iocontrol()) MGS: setparam err: rc = 1
Lustre: DEBUG MARKER: get_param
Lustre: Failing over scratch-MDT0000
LustreError: 137-5: scratch-MDT0000_UUID: not available for connect from 192.168.2.113@o2ib (no target). If you are running an HA pair check that the target is mounted on the other server.
LustreError: Skipped 1 previous similar message
LustreError: 137-5: scratch-MDT0000_UUID: not available for connect from 192.168.2.112@o2ib (no target). If you are running an HA pair check that the target is mounted on the other server.
LustreError: 137-5: scratch-MDT0000_UUID: not available for connect from 192.168.2.115@o2ib (no target). If you are running an HA pair check that the target is mounted on the other server.
LustreError: Skipped 2 previous similar messages
Lustre: 7239:0:(client.c:1903:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1386634800/real 1386634800] req@ffff880409bd9800 x1453343832768544/t0(0) o251->MGC192.168.2.108@o2ib@0@lo:26/25 lens 224/224 e 0 to 1 dl 1386634806 ref 2 fl Rpc:XN/0/ffffffff rc 0/-1
Lustre: server umount scratch-MDT0000 complete
Lustre: DEBUG MARKER: mount
Lustre: 7244:0:(obd_mount.c:1246:lustre_fill_super()) VFS Op: sb ffff8803e594b400
LDISKFS-fs (sda3): mounted filesystem with ordered data mode. quota=on. Opts:
format at mdt_coordinator.c:977:mdt_hsm_cdt_start doesn't end in newline
LustreError: 7308:0:(mdt_coordinator.c:977:mdt_hsm_cdt_start()) scratch-MDT0000: cannot take the layout locks needed for registered restore: -2
LustreError: 11-0: scratch-MDT0000-lwp-MDT0000: Communicating with 0@lo, operation mds_connect failed with -11.
Lustre: scratch-MDD0000: changelog on
Lustre: scratch-MDT0000: Will be in recovery for at least 5:00, or until 4 clients reconnect
Lustre: 7284:0:(mdt_handler.c:449:mdt_pack_attr2body()) [0x200000002:0x1:0x0]: returning size 4096
Lustre: 7284:0:(mdt_handler.c:449:mdt_pack_attr2body()) [0x200000002:0x2:0x0]: returning size 4096
Lustre: 7286:0:(mdt_handler.c:449:mdt_pack_attr2body()) [0x200000002:0x1:0x0]: returning size 4096
Lustre: 7286:0:(mdt_handler.c:449:mdt_pack_attr2body()) [0x200000002:0x2:0x0]: returning size 4096
Lustre: 7324:0:(mdt_handler.c:449:mdt_pack_attr2body()) [0x200000002:0x1:0x0]: returning size 4096
Lustre: 7324:0:(mdt_handler.c:449:mdt_pack_attr2body()) [0x200000002:0x2:0x0]: returning size 4096
Lustre: MGS: non-config logname received: params
Lustre: Skipped 12 previous similar messages
Lustre: scratch-MDT0000: Recovery over after 0:15, of 4 clients 4 recovered and 0 were evicted.
Lustre: DEBUG MARKER: get_param |
| Comment by Bruno Faccini (Inactive) [ 10/Dec/13 ] |
|
Well, it seems that I forgot to ask you to collect the MDS Lustre debug-log at the end!! Sorry … BTW, concerning the new+strange "error: executing set_param: Success" and "LustreError: 7226:0:(mgs_handler.c:744:mgs_iocontrol()) MGS: setparam err: rc = 1" messages upon "lctl set_param -d -P mdt.<FSNAME>-MDT0000.hsm_control": after checking in the source code, they only indicate that the parameter's value was already the default/disabled. I may push a cosmetic change to avoid this. |
| Comment by James Nunez (Inactive) [ 10/Dec/13 ] |
|
Here are the logs you requested. |
| Comment by Bruno Faccini (Inactive) [ 11/Dec/13 ] |
|
James, I am afraid I forgot to ask you to also grow the debug log buffer size with "lctl set_param debug_mb=2048", because the debug-log is missing the traces for the first commands. Can you also "mark" the umount, which I forgot too? I would also like you to dump the config llog content after the "lctl set_param -d -P mdt.scratch-MDT0000.hsm_control"; to do so you need to mount the MDT separately as "-t ldiskfs" and then go/cd to the CONFIGS sub-dir to run the command "llog_reader params". Last, I created a separate ticket |
| Comment by Bruno Faccini (Inactive) [ 12/Dec/13 ] |
|
I requested access to the OpenSFS cluster and am waiting for it now, in order to work on the issue on the platform where you reproduce it! |
| Comment by James Nunez (Inactive) [ 11/Mar/14 ] |
|
Another instance of this failure at https://maloo.whamcloud.com/test_sets/7f2aafdc-a959-11e3-95fe-52540035b04c |
| Comment by Bruno Faccini (Inactive) [ 12/Mar/14 ] |
|
Humm, I thought (and hoped!) that this problem had disappeared, because this is the first failure occurrence of the sanity-hsm/test_300 subtest since mid-December according to maloo/auto-tests stats. I will try to find a way to reproduce it, but since I still strongly suspect it only occurs with a specific environment/platform, it may not be so easy. BTW James, the new instance you pointed to indicates details I am not familiar with: Test group = "acc-sm-c20", Arch/Lustre version = "2.5.1-RC3--PRISTINE-2.6.32-431.5.1.el6[_lustre].x86_64". Can you better describe the configuration/platform/... you were using? Is it again on the OpenSFS cluster? Also, the test log shows the following error: "format at mdt_coordinator.c:977:mdt_hsm_cdt_start doesn't end in newline", which is due to a missing newline at the end of the message passed to CERROR, with the additional consequence of 2 merged lines/traces in the Lustre debug-log. |
| Comment by James Nunez (Inactive) [ 12/Mar/14 ] |
|
Bruno, Yes, this was run on the OpenSFS platform with RHEL 6.5 Lustre 2.5.1 RC3, single MGS/MDS, single OSS with two OSTs, one node running the latest Robinhood, one agent/client and one client. I don't know why the test group is "acc-sm-c20"; I was just running sanity-hsm on its own with no other tests. |
| Comment by James Nunez (Inactive) [ 13/Mar/14 ] |
|
Since I'm testing HSM on 2.5.1, I thought I would collect the information Bruno asked for on 09/Oct/13. I ran sanity-hsm test 300 alone and got the "cdt state not stopped on mds1" On the MDT, the first few commands succeeded, but the unmount hung and later crashed. Here's what I see in dmesg captured by crash: <4>Lustre: DEBUG MARKER: == sanity-hsm test 300: On disk coordinator state kept between MDT umount/mount == 09:30:44 (1394641844) <6>Lustre: Disabling parameter general.mdt.lscratch-MDT0000.hsm_control in log params <4>Lustre: Failing over lscratch-MDT0000 <3>LustreError: 137-5: lscratch-MDT0000_UUID: not available for connect from 192.168.2.118@o2ib (no target). If you are running an HA pair check that the target is mounted on the other server. <4>Lustre: 16111:0:(client.c:1901:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1394641848/real 1394641848] req@ffff88039ae00800 x1462381936924596/t0(0) o251->MGC192.168.2.116@o2ib@0@lo:26/25 lens 224/224 e 0 to 1 dl 1394641854 ref 2 fl Rpc:XN/0/ffffffff rc 0/-1 <4>Lustre: server umount lscratch-MDT0000 complete <6>LDISKFS-fs (sda3): mounted filesystem with ordered data mode. quota=on. Opts: <3>LustreError: 137-5: lscratch-MDT0000_UUID: not available for connect from 192.168.2.119@o2ib (no target). If you are running an HA pair check that the target is mounted on the other server. <4>Lustre: lscratch-MDT0000: used disk, loading <6>format at mdt_coordinator.c:977:mdt_hsm_cdt_start doesn't end in newline <3>LustreError: 16267:0:(mdt_coordinator.c:977:mdt_hsm_cdt_start()) lscratch-MDT0000: cannot take the layout locks needed for registered restore: -2 <4>Lustre: MGS: non-config logname received: params <3>LustreError: 11-0: lscratch-MDT0000-lwp-MDT0000: Communicating with 0@lo, operation mds_connect failed with -11. <6>Lustre: lscratch-MDD0000: changelog on <4>Lustre: lscratch-MDT0000: Will be in recovery for at least 5:00, or until 5 clients reconnect <6>Lustre: lscratch-MDT0000: Recovery over after 0:29, of 5 clients 5 recovered and 0 were evicted. <4>Lustre: DEBUG MARKER: mdc.lscratch-MDT0000-mdc-*.mds_server_uuid in FULL state after 29 sec <4>Lustre: MGS: non-config logname received: params <4>Lustre: Skipped 4 previous similar messages <4>Lustre: DEBUG MARKER: sanity-hsm test_300: @@@@@@ FAIL: hsm_control state is not 'stopped' on mds1 <4>Lustre: DEBUG MARKER: == sanity-hsm test complete, duration 92 sec == 09:32:06 (1394641926) <4>Lustre: DEBUG MARKER: shutdown <4>Lustre: DEBUG MARKER: set_param-d-P <6>Lustre: Disabling parameter general.mdt.lscratch-MDT0000.hsm_control in log params <3>LustreError: 17173:0:(mgs_handler.c:744:mgs_iocontrol()) MGS: setparam err: rc = 1 <4>Lustre: DEBUG MARKER: get_param <4>Lustre: DEBUG MARKER: umount <4>Lustre: Failing over lscratch-MDT0000 <3>LustreError: 137-5: lscratch-MDT0000_UUID: not available for connect from 192.168.2.118@o2ib (no target). If you are running an HA pair check that the target is mounted on the other server. <4>------------[ cut here ]------------ <4>WARNING: at lib/list_debug.c:48 list_del+0x6e/0xa0() (Not tainted) <4>Hardware name: X8DTT-H <4>list_del corruption. 
prev->next should be ffff88039810de18, but was 5a5a5a5a5a5a5a5a <4>Modules linked in: osp(U) mdd(U) lfsck(U) lod(U) mdt(U) mgs(U) mgc(U) fsfilt_ldiskfs(U) osd_ldiskfs(U) lquota(U) lustre(U) lov(U) osc(U) mdc(U) fid(U) fld(U) ko2iblnd(U) ptlrpc(U) obdclass(U) lnet(U) lvfs(U) sha512_generic sha256_generic crc32c_intel libcfs(U) ldiskfs(U) jbd2 nf_conntrack_ipv4 nf_defrag_ipv4 xt_state nf_conntrack iptable_filter ip_tables nfsd exportfs nfs lockd fscache auth_rpc gss nfs_acl sunrpc cpufreq_ondemand acpi_cpufreq freq_table mperf ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_addr ipv6 microcode iTCO_wdt iTCO_vendor_support serio_raw i2c_i801 i2c_core sg lpc_ich mfd_core mlx4_ib ib_sa ib_mad ib_core mlx4_en mlx4_core e1000e ptp pps_core ioatdma dca i7core_edac edac_core shpchp ext3 jbd mbcache sd_mod crc_t10dif pata_acpi ata_generic ata_pi ix dm_mirror dm_region_hash dm_log dm_mod [last unloaded: nf_conntrack] <4>Pid: 17120, comm: hsm_cdtr Not tainted 2.6.32-431.5.1.el6_lustre.x86_64 #1 <4>Call Trace: <4> [<ffffffff81071e27>] ? warn_slowpath_common+0x87/0xc0 <4> [<ffffffff81071f16>] ? warn_slowpath_fmt+0x46/0x50 <4> [<ffffffff81084220>] ? process_timeout+0x0/0x10 <4> [<ffffffff8129489e>] ? list_del+0x6e/0xa0 <4> [<ffffffff8109b681>] ? remove_wait_queue+0x31/0x50 <4> [<ffffffffa0e8ef6e>] ? mdt_coordinator+0xbce/0x16b0 [mdt] <4> [<ffffffff810096f0>] ? __switch_to+0xd0/0x320 <4> [<ffffffff81065df0>] ? default_wake_function+0x0/0x20 <4> [<ffffffff81528090>] ? thread_return+0x4e/0x76e <4> [<ffffffffa0e8e3a0>] ? mdt_coordinator+0x0/0x16b0 [mdt] <4> [<ffffffff8109aee6>] ? kthread+0x96/0xa0 <4> [<ffffffff8100c20a>] ? child_rip+0xa/0x20 <4> [<ffffffff8109ae50>] ? kthread+0x0/0xa0 <4> [<ffffffff8100c200>] ? child_rip+0x0/0x20 <4>---[ end trace 4c18e762bb5c5a65 ]--- <4>------------[ cut here ]------------ <4>WARNING: at lib/list_debug.c:51 list_del+0x8d/0xa0() (Tainted: G W -- ------------- ) <4>Hardware name: X8DTT-H <4>list_del corruption. next->prev should be ffff88039810de18, but was 5a5a5a5a5a5a5a5a <4>Modules linked in: osp(U) mdd(U) lfsck(U) lod(U) mdt(U) mgs(U) mgc(U) fsfilt_ldiskfs(U) osd_ldiskfs(U) lquota(U) lustre(U) lov(U) osc(U) mdc(U) fid(U) fld(U) ko2iblnd(U) ptlrpc(U) obdclass(U) lnet(U) lvfs(U) sha512_generic sha256_generic crc32c_intel libcfs(U) ldiskfs(U) jbd2 nf_conntrack_ipv4 nf_defrag_ipv4 xt_state nf_conntrack iptable_filter ip_tables nfsd exportfs nfs lockd fscache auth_rpc gss nfs_acl sunrpc cpufreq_ondemand acpi_cpufreq freq_table mperf ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_addr ipv6 microcode iTCO_wdt iTCO_vendor_support serio_raw i2c_i801 i2c_core sg lpc_ich mfd_core mlx4_ib ib_sa ib_mad ib_core mlx4_en mlx4_core e1000e ptp pps_core ioatdma dca i7core_edac edac_core shpchp ext3 jbd mbcache sd_mod crc_t10dif pata_acpi ata_generic ata_pi ix dm_mirror dm_region_hash dm_log dm_mod [last unloaded: nf_conntrack] <4>Pid: 17120, comm: hsm_cdtr Tainted: G W --------------- 2.6.32-431.5.1.el6_lustre.x86_64 #1 <4>Call Trace: <4> [<ffffffff81071e27>] ? warn_slowpath_common+0x87/0xc0 <4> [<ffffffff81071f16>] ? warn_slowpath_fmt+0x46/0x50 <4> [<ffffffff81084220>] ? process_timeout+0x0/0x10 <4> [<ffffffff812948bd>] ? list_del+0x8d/0xa0 <4> [<ffffffff8109b681>] ? remove_wait_queue+0x31/0x50 <4> [<ffffffffa0e8ef6e>] ? mdt_coordinator+0xbce/0x16b0 [mdt] <4> [<ffffffff810096f0>] ? __switch_to+0xd0/0x320 <4> [<ffffffff81065df0>] ? default_wake_function+0x0/0x20 <4> [<ffffffff81528090>] ? thread_return+0x4e/0x76e <4> [<ffffffffa0e8e3a0>] ? 
mdt_coordinator+0x0/0x16b0 [mdt] <4> [<ffffffff8109aee6>] ? kthread+0x96/0xa0 <4> [<ffffffff8100c20a>] ? child_rip+0xa/0x20 <4> [<ffffffff8109ae50>] ? kthread+0x0/0xa0 <4> [<ffffffff8100c200>] ? child_rip+0x0/0x20 <4>---[ end trace 4c18e762bb5c5a66 ]--- <4>Lustre: 17182:0:(client.c:1901:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1394641933/real 1394641933] req@ffff880392bf6c00 x1462381936924904/t0(0) o251->MGC192.168.2.116@o2ib@0@lo:26/25 lens 224/224 e 0 to 1 dl 1394641939 ref 2 fl Rpc:XN/0/ffffffff rc 0/-1 <4>general protection fault: 0000 [#1] SMP <4>last sysfs file: /sys/devices/system/cpu/online <4>CPU 7 <4>Modules linked in: osp(U) mdd(U) lfsck(U) lod(U) mdt(U) mgs(U) mgc(U) fsfilt_ldiskfs(U) osd_ldiskfs(U) lquota(U) lustre(U) lov(U) osc(U) mdc(U) fid(U) fld(U) ko2iblnd(U) ptlrpc(U) obdclass(U) lnet(U) lvfs(U) sha512_generic sha256_generic crc32c_intel libcfs(U) ldiskfs(U) jbd2 nf_conntrack_ipv4 nf_defrag_ipv4 xt_state nf_conntrack iptable_filter ip_tables nfsd exportfs nfs lockd fscache auth_rpc gss nfs_acl sunrpc cpufreq_ondemand acpi_cpufreq freq_table mperf ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_addr ipv6 microcode iTCO_wdt iTCO_vendor_support serio_raw i2c_i801 i2c_core sg lpc_ich mfd_core mlx4_ib ib_sa ib_mad ib_core mlx4_en mlx4_core e1000e ptp pps_core ioatdma dca i7core_edac edac_core shpchp ext3 jbd mbcache sd_mod crc_t10dif pata_acpi ata_generic ata_pi ix dm_mirror dm_region_hash dm_log dm_mod [last unloaded: nf_conntrack] <4> <4>Pid: 17120, comm: hsm_cdtr Tainted: G W --------------- 2.6.32-431.5.1.el6_lustre.x86_64 #1 Supermicro X8DTT-H/X8DTT-H <4>RIP: 0010:[<ffffffff8128a039>] [<ffffffff8128a039>] strnlen+0x9/0x40 <4>RSP: 0018:ffff88039810dae0 EFLAGS: 00010286 <4>RAX: ffffffff817b5a4e RBX: ffff8803a9176000 RCX: 0000000000000002 <4>RDX: 5a5a5a5a5a5a5a66 RSI: ffffffffffffffff RDI: 5a5a5a5a5a5a5a66 <4>RBP: ffff88039810dae0 R08: 0000000000000073 R09: ffff88043839cd40 <4>R10: 0000000000000001 R11: 000000000000000f R12: ffff8803a9175134 <4>R13: 5a5a5a5a5a5a5a66 R14: 00000000ffffffff R15: 0000000000000000 <4>FS: 0000000000000000(0000) GS:ffff880045ce0000(0000) knlGS:0000000000000000 <4>CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b <4>CR2: 0000000000448000 CR3: 000000082d703000 CR4: 00000000000007e0 <4>DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 <4>DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 <4>Process hsm_cdtr (pid: 17120, threadinfo ffff88039810c000, task ffff88082d41a080) <4>Stack: <4> ffff88039810db20 ffffffff8128b2f0 ffff880000033b48 ffff8803a9175134 <4><d> ffffffffa0ea175a ffffffffa0ea1758 ffff88039810dcb0 ffff8803a9176000 <4><d> ffff88039810dbc0 ffffffff8128c738 0000000000000004 0000000affffffff <4>Call Trace: <4> [<ffffffff8128b2f0>] string+0x40/0x100 <4> [<ffffffff8128c738>] vsnprintf+0x218/0x5e0 <4> [<ffffffffa052f27b>] ? cfs_set_ptldebug_header+0x2b/0xc0 [libcfs] <4> [<ffffffffa053effa>] libcfs_debug_vmsg2+0x2ea/0xbc0 [libcfs] <4> [<ffffffff81071e34>] ? warn_slowpath_common+0x94/0xc0 <4> [<ffffffff8100bb8e>] ? apic_timer_interrupt+0xe/0x20 <4> [<ffffffffa053f911>] libcfs_debug_msg+0x41/0x50 [libcfs] <4> [<ffffffff81058d53>] ? __wake_up+0x53/0x70 <4> [<ffffffffa0e8f2af>] mdt_coordinator+0xf0f/0x16b0 [mdt] <4> [<ffffffff810096f0>] ? __switch_to+0xd0/0x320 <4> [<ffffffff81065df0>] ? default_wake_function+0x0/0x20 <4> [<ffffffff81528090>] ? thread_return+0x4e/0x76e <4> [<ffffffffa0e8e3a0>] ? 
mdt_coordinator+0x0/0x16b0 [mdt] <4> [<ffffffff8109aee6>] kthread+0x96/0xa0 <4> [<ffffffff8100c20a>] child_rip+0xa/0x20 <4> [<ffffffff8109ae50>] ? kthread+0x0/0xa0 <4> [<ffffffff8100c200>] ? child_rip+0x0/0x20 <4>Code: 66 90 48 83 c2 01 80 3a 00 75 f7 48 89 d0 48 29 f8 c9 c3 66 66 66 66 66 66 2e 0f 1f 84 00 00 00 00 00 55 48 85 f6 48 89 e5 74 2e <80> 3f 00 74 29 48 83 ee 01 48 89 f8 eb 12 66 0f 1f 84 00 00 00 <1>RIP [<ffffffff8128a039>] strnlen+0x9/0x40 <4> RSP <ffff88039810dae0> |
| Comment by Jodi Levi (Inactive) [ 29/May/14 ] |
|
Patch landed to Master. Please reopen ticket if more work is needed. |
| Comment by James Nunez (Inactive) [ 04/Sep/14 ] |
|
I've run into this error again while testing HSM with the 2.5.3-RC1 version of Lustre on the OpenSFS cluster. The results are at https://testing.hpdd.intel.com/test_sessions/60308f12-3472-11e4-995a-5254006e85c2 |
| Comment by Bruno Faccini (Inactive) [ 10/Nov/14 ] |
|
James, according to this auto-test failure report, it is unclear to me how this happened ...

Name | Status | Run at | Duration | Return code | Error | Test log | Bugs
test_300 | FAIL | 2014-09-05 01:47:58 UTC | 31 | 1 | Restart of mds1 failed! | Preview 10 |
test_300 | FAIL | 2014-09-02 13:14:59 UTC | 80 | 1 | hsm_control state is not 'stopped' on mds1 | Preview 10 |
test_300 | FAIL | 2014-08-27 01:19:32 UTC | 107 | 1 | post-failover df: 1 | Preview 10 |
test_300 | FAIL | 2014-08-21 19:55:39 UTC | 714 | 1 | import is not in FULL state | Preview 10 | LU-4018
test_300 | FAIL | 2014-07-01 22:43:31 UTC | 599 | 1 | hsm_control state is not 'stopped' on mds1 | Preview 10 |
test_300 | FAIL | 2014-07-01 10:47:59 UTC | 107 | 1 | hsm_control state is not 'stopped' on mds1 | Preview 10 |
test_300 | FAIL | 2014-06-21 14:32:23 UTC | 723 | 1 | import is not in FULL state | Preview 10 | LU-4018
test_300 | FAIL | 2014-06-02 17:16:48 UTC | 331 | 1 | hsm_control state is not 'stopped' on mds1 | Preview 10 |
test_300 | FAIL | 2014-05-28 05:46:52 UTC | 714 | 1 | import is not in FULL state | Preview 10 | LU-4018
test_300 | FAIL | 2014-05-27 09:56:15 UTC | 975 | 1 | import is not in FULL state | Preview 10 | LU-4018
test_300 | FAIL | 2014-05-22 19:10:52 UTC | 105 | 1 | hsm_control state is not 'stopped' on mds1 | Preview 10 | LU-4065
test_300 | FAIL | 2014-05-08 18:55:20 UTC | 716 | 1 | import is not in FULL state | Preview 10 | LU-4018
test_300 | FAIL | 2014-05-07 22:34:02 UTC | 746 | 1 | import is not in FULL state | Preview 10 | LU-4018
test_300 | FAIL | 2014-03-26 22:02:07 UTC | 77 | 1 | hsm_control state is not 'stopped' on mds1 | Preview 10 | LU-4125, LU-4065
test_300 | FAIL | 2014-03-11 20:01:49 UTC | 79 | 1 | hsm_control state is not 'stopped' on mds1 | Preview 10 | LU-4065, LU-4065
test_300 | FAIL | 2013-12-08 19:39:12 UTC | 755 | 1 | import is not in FULL state | | LU-4018, LU-4361 |
| Comment by Sergey Cheremencev [ 19/Nov/14 ] |
|
Hello, we faced the same error at Seagate and are now using the following solution: http://review.whamcloud.com/#/c/12783/. Short explanation:
|
| Comment by Bruno Faccini (Inactive) [ 22/Jan/15 ] |
|
Sergei, thanks for this comment and patch, but I just wonder if you found the related issue/failure only when running the sanity-hsm/test_300 sub-test, or also with other uses of "cdt_set_mount_state()" within sanity-hsm? BTW, I don't understand why you refer to copytool_cleanup() in your description. Also, did you find the detailed behavior you described by analyzing the MGS/MDS nodes' debug-logs? Your patch has failed to pass auto-tests due to unrelated failures, and I think you need to rebase it to avoid such issues. Last, an auto-tests/Maloo query indicates that the issue in this ticket has only been encountered once (on 2014-11-19 15:49:17 UTC) during the past 4 months. I am presently analyzing that failure's associated MDS/MGS debug-log, and in parallel I am trying to reproduce it in-house. |
| Comment by Sergey Cheremencev [ 26/Jan/15 ] |
|
Yes, we hit this bug in a lot of subtests in sanity-hsm: 402, 3, 106 ... About why test_300 may fail (my view):
Yes. About reproducing the problem: you may try to make a custom build with MGC_TIMEOUT_MIN_SECONDS = 10 or 15 (see the sketch below). If it does not break something else, it may help. |
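For illustration, the one-line change Sergey suggests might look like the following; the constant's exact location and default value are assumptions (it is believed to live in the MGC code, e.g. lustre/mgc/mgc_request.c), so verify against your tree before building.

```c
/* Hypothetical reproducer tweak, not a fix: enlarge the minimum MGC timeout
 * so config-log processing after remount is delayed longer, widening the
 * window in which the test sees a stale hsm_control value. */
#define MGC_TIMEOUT_MIN_SECONDS 15	/* default believed to be 5 */
```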
| Comment by Sergey Cheremencev [ 28/Sep/15 ] |
|
There is a +1 from Andreas in Gerrit. Can somebody else do a review to move this forward? |
| Comment by Gerrit Updater [ 02/Oct/15 ] |
|
Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/12783/ |
| Comment by Joseph Gmitter (Inactive) [ 02/Oct/15 ] |
|
Landed for 2.8.0 |