[LU-4065] sanity-hsm test_300 failure: 'cdt state is not stopped' Created: 04/Oct/13  Updated: 10/Apr/17  Resolved: 02/Oct/15

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.5.0
Fix Version/s: Lustre 2.6.0, Lustre 2.5.2, Lustre 2.8.0

Type: Bug Priority: Minor
Reporter: James Nunez (Inactive) Assignee: Bruno Faccini (Inactive)
Resolution: Fixed Votes: 0
Labels: HSM
Environment:

Lustre master build #1715
OpenSFS cluster with combined MGS/MDS, single OSS with two OSTs, and three clients: one acting as an HSM agent plus client, one running robinhood/db plus client, and one acting only as a Lustre client


Attachments: File lustre-debug.log.tgz     Text File lustre-dmesg.txt    
Issue Links:
Related
Severity: 3
Rank (Obsolete): 10892

 Description   

The test results are at: https://maloo.whamcloud.com/test_sets/8e9cca2c-2c8b-11e3-85ee-52540035b04c

From the client test_log:

== sanity-hsm test 300: On disk coordinator state kept between MDT umount/mount == 14:22:47 (1380835367)
Stop coordinator and remove coordinator state at mount
mdt.scratch-MDT0000.hsm_control=shutdown
Changed after 0s: from '' to 'stopping'
Waiting 10 secs for update
Updated after 8s: wanted 'stopped' got 'stopped'
Failing mds1 on mds
Stopping /lustre/scratch/mdt0 (opts:) on mds
pdsh@c15: mds: ssh exited with exit code 1
reboot facets: mds1
Failover mds1 to mds
14:23:15 (1380835395) waiting for mds network 900 secs ...
14:23:15 (1380835395) network interface is UP
mount facets: mds1
Starting mds1:   /dev/sda3 /lustre/scratch/mdt0
Started scratch-MDT0000
c15: mdc.scratch-MDT0000-mdc-*.mds_server_uuid in FULL state after 25 sec
Changed after 0s: from '' to 'enabled'
Waiting 20 secs for update
Waiting 10 secs for update
Update not seen after 20s: wanted 'stopped' got 'enabled'
 sanity-hsm test_300: @@@@@@ FAIL: cdt state is not stopped 
  Trace dump:
  = /usr/lib64/lustre/tests/test-framework.sh:4264:error_noexit()
  = /usr/lib64/lustre/tests/test-framework.sh:4291:error()
  = /usr/lib64/lustre/tests/sanity-hsm.sh:298:cdt_check_state()
  = /usr/lib64/lustre/tests/sanity-hsm.sh:3063:test_300()


 Comments   
Comment by jacques-charles lafoucriere [ 06/Oct/13 ]

We have recently seen that lctl set_param -P was not working as expected.

Comment by Bruno Faccini (Inactive) [ 07/Oct/13 ]

Hello JC,

Do you mean the cdt_state permanent/on-disk change may have failed?

Comment by Jinshan Xiong (Inactive) [ 07/Oct/13 ]

Hi JC, if I remember correctly, we had a conversation about making this procfs parameter exported all the time, so that we could set it from the configuration. Can you refresh my memory on this?

Comment by jacques-charles lafoucriere [ 08/Oct/13 ]

During recent manual tests, we saw that lctl set_param -P was not working, just as in test 300, although Gerrit testing is usually OK. We have also seen with the latest master that sanity-hsm no longer even starts when run manually at our site. Thomas or Aurelien should open a ticket with the details.

Comment by John Hammond [ 08/Oct/13 ]

JC, when you test locally do you install Lustre on all nodes or do you run from a build directory? Using 'lctl set_param -P' requires that lctl be installed at /usr/sbin/lctl on all nodes. See LU-4041.

Comment by Aurelien Degremont (Inactive) [ 09/Oct/13 ]

I confirm. We are doing this in our test environment where sanity-hsm is run without Lustre utils being installed in standard paths. This is really LU-4041.

Comment by John Hammond [ 09/Oct/13 ]

Using set_param ...hsm_control=shutdown on a CDT that is already stopped will create a state where the CDT cannot be started:

# llmount.sh
# lctl get_param mdt.lustre-MDT0000.hsm_control
mdt.lustre-MDT0000.hsm_control=stopped
# lctl set_param mdt.lustre-MDT0000.hsm_control=shutdown 
mdt.lustre-MDT0000.hsm_control=shutdown
# lctl get_param mdt.lustre-MDT0000.hsm_control
mdt.lustre-MDT0000.hsm_control=stopping
# lctl set_param mdt.lustre-MDT0000.hsm_control=enabled
mdt.lustre-MDT0000.hsm_control=enabled
error: set_param: setting /proc/fs/lustre/mdt/lustre-MDT0000/hsm_control=enabled: Operation already in progress
# lctl get_param mdt.lustre-MDT0000.hsm_control
mdt.lustre-MDT0000.hsm_control=stopping
# lctl set_param mdt.lustre-MDT0000.hsm_control=disabled
mdt.lustre-MDT0000.hsm_control=disabled
# lctl get_param mdt.lustre-MDT0000.hsm_control
mdt.lustre-MDT0000.hsm_control=disabled
# lctl set_param mdt.lustre-MDT0000.hsm_control=enabled
mdt.lustre-MDT0000.hsm_control=enabled
# lctl get_param mdt.lustre-MDT0000.hsm_control
mdt.lustre-MDT0000.hsm_control=enabled
# pgrep hsm
#

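The wedged state John demonstrates above can be captured in a toy model (hypothetical names, greatly simplified; this is not the actual Lustre code): the "shutdown" handler unconditionally sets the state to "stopping" and relies on a running coordinator thread to complete the transition to "stopped", so if no thread is running the state stays stuck and "enabled" is refused.

```python
# Toy model of the stuck CDT state; Cdt, write_shutdown and write_enabled
# are illustrative names, not the real mdt_coordinator.c API.

class Cdt:
    def __init__(self):
        self.state = "stopped"       # fresh mount, coordinator not running

    def write_shutdown(self):
        # Bug being modeled: no check that the coordinator is actually
        # running before asking it to stop.
        self.state = "stopping"      # only a live thread can finish this

    def write_enabled(self):
        if self.state == "stopping":
            # "Operation already in progress": the pending stop never
            # completes because there is no thread to complete it.
            raise RuntimeError("EALREADY")
        self.state = "enabled"

cdt = Cdt()
cdt.write_shutdown()                 # shutdown on an already-stopped CDT
assert cdt.state == "stopping"       # wedged: nothing will reach "stopped"
```

With the state wedged at "stopping", any subsequent attempt to enable the coordinator fails, matching the `Operation already in progress` error in the transcript above.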
Amusingly, if you try to remount the MDT to resolve this state, then you get the following.

Message from syslogd@t at Oct  9 13:00:45 ...
 kernel:LustreError: 7626:0:(mdt_coordinator.c:417:hsm_cdt_procfs_fini()) ASSERTION( cdt->cdt_state == CDT_STOPPED ) failed: 

Message from syslogd@t at Oct  9 13:00:45 ...
 kernel:LustreError: 7626:0:(mdt_coordinator.c:417:hsm_cdt_procfs_fini()) LBUG
Comment by John Hammond [ 09/Oct/13 ]

I think we need some more protective logic in lprocfs_wr_hsm_cdt_control() and its callees. Shouldn't every write to hsm_control force a call to mdt_hsm_cdt_wakeup()?

Comment by Bruno Faccini (Inactive) [ 25/Oct/13 ]

Assigned to me since I am working on a fix for LU-4093, where I tend to use cdt_shutdown/cdt_restart upon each copytool_cleanup in the sanity-hsm test suite; thus I am strongly concerned with cdt_state accuracy, and for sure I have already experienced the same LBUG as John!

Comment by Bruno Faccini (Inactive) [ 25/Oct/13 ]

Concerning the cdt_state that can easily become stale, I think I have a patch for it. The problem seems to be that both the shutdown and disabled commands assume the CDT is started, and thus simply set cdt_state=STOPPING/DISABLED (expecting the CDT thread to check it and gracefully shut down or go to sleep!). Then, if you umount/stop the MDT, you trigger the ASSERTION( cdt->cdt_state == CDT_STOPPED ) in hsm_cdt_procfs_fini(). Better/enforced checks in lprocfs_wr_hsm_cdt_control() should fix this.

Even though I don't think it is fully safe (against concurrent hsm_control/hsm_state updates …), my patch at http://review.whamcloud.com/8074 should prevent such inaccurate cdt_state values.
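The enforced checks Bruno describes can be sketched as a transition table that the control-file write handler consults before touching the state (a hypothetical simplification; the state names mirror the proc values, but this is not the actual patch):

```python
# Hypothetical sketch of a guarded hsm_control write handler: each command
# is only accepted from states where the transition makes sense, so writing
# "shutdown" to a stopped CDT is rejected instead of wedging the state.

VALID_FROM = {
    "shutdown": {"enabled", "disabled"},  # CDT must actually be running
    "enabled":  {"stopped"},              # must be fully stopped first
    "disabled": {"enabled"},              # only a running CDT can be paused
}

NEW_STATE = {"shutdown": "stopping", "enabled": "enabled",
             "disabled": "disabled"}

def hsm_cdt_control(state, command):
    """Return the new cdt state, or raise if the transition is invalid."""
    if state not in VALID_FROM.get(command, set()):
        raise ValueError(f"cannot apply '{command}' while CDT is '{state}'")
    return NEW_STATE[command]

assert hsm_cdt_control("enabled", "shutdown") == "stopping"
```

With such a guard, the sequence from John's transcript (shutdown while already stopped) fails immediately with a clear error rather than leaving cdt_state at "stopping" with no thread to complete it.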

Comment by Bruno Faccini (Inactive) [ 20/Nov/13 ]

The patch has landed, but I wonder if we need to add an extra check/test in sanity-hsm to verify that cdt_state stays accurate even with multiple/random/repeated hsm_control settings (like John's series that triggers the LBUG)?

Comment by James Nunez (Inactive) [ 05/Dec/13 ]

sanity-hsm test 300 is still failing on the OpenSFS cluster with master build #1790. I've uploaded results from two runs of sanity-hsm where test 300 fails at https://maloo.whamcloud.com/test_sessions/b223c634-5d51-11e3-ad71-52540035b04c and https://maloo.whamcloud.com/test_sessions/643460bc-5d4e-11e3-956b-52540035b04c .

Test 300 consistently fails on this cluster.

Comment by Bruno Faccini (Inactive) [ 09/Dec/13 ]

In the two test sessions you just uploaded, sanity-hsm/test_300 seems to fail because the permanent/on-disk config llog setting ("lctl set_param -d -P mdt.<FSNAME>-MDT0000.hsm_control" in the sanity-hsm cdt_clear_mount_state() function) is mysteriously not re-read after umount/re-mount.
At first I thought it could be a consequence of the following messages found in the MDS dmesg:

Lustre: DEBUG MARKER: == sanity-hsm test 300: On disk coordinator state kept between MDT umount/mount == 12:30:45 (1386189045)
Lustre: Disabling parameter general.mdt.scratch-MDT0000.hsm_control in log params  <<<<<--- config log set
Lustre: Failing over scratch-MDT0000
LustreError: 137-5: scratch-MDT0000_UUID: not available for connect from 192.168.2.112@o2ib (no target). If you are running an HA pair check that the target is mounted on the other server.
LustreError: Skipped 1 previous similar message
LustreError: 137-5: scratch-MDT0000_UUID: not available for connect from 192.168.2.112@o2ib (no target). If you are running an HA pair check that the target is mounted on the other server.
Lustre: 9522:0:(client.c:1903:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1386189056/real 1386189056]  req@ffff8803cc092000 x1453343831874984/t0(0) o251->MGC192.168.2.108@o2ib@0@lo:26/25 lens 224/224 e 0 to 1 dl 1386189062 ref 2 fl Rpc:XN/0/ffffffff rc 0/-1
Lustre: server umount scratch-MDT0000 complete
LDISKFS-fs (sda3): mounted filesystem with ordered data mode. quota=on. Opts: 
format at mdt_coordinator.c:977:mdt_hsm_cdt_start doesn't end in newline
LustreError: 9685:0:(mdt_coordinator.c:977:mdt_hsm_cdt_start()) scratch-MDT0000: cannot take the layout locks needed for registered restore: -2
Lustre: MGS: non-config logname received: params  <<<<----- PB ??
Lustre: Skipped 6 previous similar messages
LustreError: 11-0: scratch-MDT0000-lwp-MDT0000: Communicating with 0@lo, operation mds_connect failed with -11.
Lustre: scratch-MDD0000: changelog on
Lustre: DEBUG MARKER: mdc.scratch-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: scratch-MDT0000: Will be in recovery for at least 5:00, or until 7 clients reconnect
LustreError: 9700:0:(ldlm_lib.c:1733:check_for_next_transno()) scratch-MDT0000: waking for gap in transno, VBR is OFF (skip: 77309411692, ql: 1, comp: 6, conn: 7, next: 77309411694, last_committed: 77309411691)
Lustre: scratch-MDT0000: Recovery over after 0:21, of 7 clients 7 recovered and 0 were evicted.
Lustre: DEBUG MARKER: sanity-hsm test_300: @@@@@@ FAIL: hsm_control state is not 'stopped' on mds1

But in fact I see the same messages on my test platform, where I am currently unable to reproduce any sanity-hsm/test_300 failure …

James, I am not sure I remember correctly, but didn't you say that test_300 can fail even when run as the only sub-test of sanity-hsm? If so, can you run the following sequence manually with full debug logs enabled on the MDS/MGS:

lctl set_param debug=-1
lctl clear
lctl mark shutdown
lctl set_param mdt.<FSNAME>-MDT0000.hsm_control=shutdown
lctl mark set_param-d-P
lctl set_param -d -P mdt.<FSNAME>-MDT0000.hsm_control
lctl mark get_param
lctl get_param mdt.<FSNAME>-MDT0000.hsm_control
lctl mark umount
umount <MDT>
lctl mark mount
mount <MDT>
sleep 10
lctl mark get_param
lctl get_param mdt.<FSNAME>-MDT0000.hsm_control
Comment by James Nunez (Inactive) [ 10/Dec/13 ]

Yes, I can get test_300 to fail running just that test.

I ran sanity-hsm test 300 and it failed and then I ran the commands you asked for with the following output:

[root@c08 ~]# lctl set_param debug=-1
debug=-1
[root@c08 ~]# lctl clear
[root@c08 ~]# lctl mark shutdown
[root@c08 ~]# lctl set_param mdt.scratch-MDT0000.hsm_control=shutdown
mdt.scratch-MDT0000.hsm_control=shutdown
[root@c08 ~]# lctl mark set_param-d-P
[root@c08 ~]# lctl set_param -d -P mdt.scratch-MDT0000.hsm_control
error: executing set_param: Success
[root@c08 ~]# lctl mark get_param
 [root@c08 ~]# lctl get_param mdt.scratch-MDT0000.hsm_control
mdt.scratch-MDT0000.hsm_control=stopped
[root@c08 ~]# umount /lustre/scratch/mdt0
[root@c08 ~]# lctl mark mount
[root@c08 ~]# mount -t lustre /dev/sda3 /lustre/scratch/mdt0
[root@c08 ~]# sleep 10
[root@c08 ~]# lctl mark get_param
[root@c08 ~]# lctl get_param mdt.scratch-MDT0000.hsm_control
mdt.scratch-MDT0000.hsm_control=enabled

dmesg on the MDS/MGS:

Lustre: DEBUG MARKER: shutdown
Lustre: MGS: non-config logname received: params
Lustre: Skipped 8 previous similar messages
Lustre: DEBUG MARKER: set_param-d-P
Lustre: Disabling parameter general.mdt.scratch-MDT0000.hsm_control in log params
LustreError: 7226:0:(mgs_handler.c:744:mgs_iocontrol()) MGS: setparam err: rc = 1
Lustre: DEBUG MARKER: get_param
Lustre: Failing over scratch-MDT0000
LustreError: 137-5: scratch-MDT0000_UUID: not available for connect from 192.168.2.113@o2ib (no target). If you are running an HA pair check that the target is mounted on the other server.
LustreError: Skipped 1 previous similar message
LustreError: 137-5: scratch-MDT0000_UUID: not available for connect from 192.168.2.112@o2ib (no target). If you are running an HA pair check that the target is mounted on the other server.
LustreError: 137-5: scratch-MDT0000_UUID: not available for connect from 192.168.2.115@o2ib (no target). If you are running an HA pair check that the target is mounted on the other server.
LustreError: Skipped 2 previous similar messages
Lustre: 7239:0:(client.c:1903:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1386634800/real 1386634800]  req@ffff880409bd9800 x1453343832768544/t0(0) o251->MGC192.168.2.108@o2ib@0@lo:26/25 lens 224/224 e 0 to 1 dl 1386634806 ref 2 fl Rpc:XN/0/ffffffff rc 0/-1
Lustre: server umount scratch-MDT0000 complete
Lustre: DEBUG MARKER: mount
Lustre: 7244:0:(obd_mount.c:1246:lustre_fill_super()) VFS Op: sb ffff8803e594b400
LDISKFS-fs (sda3): mounted filesystem with ordered data mode. quota=on. Opts: 
format at mdt_coordinator.c:977:mdt_hsm_cdt_start doesn't end in newline
LustreError: 7308:0:(mdt_coordinator.c:977:mdt_hsm_cdt_start()) scratch-MDT0000: cannot take the layout locks needed for registered restore: -2
LustreError: 11-0: scratch-MDT0000-lwp-MDT0000: Communicating with 0@lo, operation mds_connect failed with -11.
Lustre: scratch-MDD0000: changelog on
Lustre: scratch-MDT0000: Will be in recovery for at least 5:00, or until 4 clients reconnect
Lustre: 7284:0:(mdt_handler.c:449:mdt_pack_attr2body()) [0x200000002:0x1:0x0]: returning size 4096
Lustre: 7284:0:(mdt_handler.c:449:mdt_pack_attr2body()) [0x200000002:0x2:0x0]: returning size 4096
Lustre: 7286:0:(mdt_handler.c:449:mdt_pack_attr2body()) [0x200000002:0x1:0x0]: returning size 4096
Lustre: 7286:0:(mdt_handler.c:449:mdt_pack_attr2body()) [0x200000002:0x2:0x0]: returning size 4096
Lustre: 7324:0:(mdt_handler.c:449:mdt_pack_attr2body()) [0x200000002:0x1:0x0]: returning size 4096
Lustre: 7324:0:(mdt_handler.c:449:mdt_pack_attr2body()) [0x200000002:0x2:0x0]: returning size 4096
Lustre: MGS: non-config logname received: params
Lustre: Skipped 12 previous similar messages
Lustre: scratch-MDT0000: Recovery over after 0:15, of 4 clients 4 recovered and 0 were evicted.
Lustre: DEBUG MARKER: get_param
Comment by Bruno Faccini (Inactive) [ 10/Dec/13 ]

Well, it seems I forgot to ask you to collect the MDS Lustre debug log at the end! Sorry …
Can you re-run the same sequence of commands and add "lctl dk /tmp/lustre-debug.log" at the end? Then please attach both the resulting lustre-debug.log file and the dmesg output.
Thanks again for your help.

BTW, about the new and strange "error: executing set_param: Success" and "LustreError: 7226:0:(mgs_handler.c:744:mgs_iocontrol()) MGS: setparam err: rc = 1" messages upon "lctl set_param -d -P mdt.<FSNAME>-MDT0000.hsm_control": after checking the source code, they simply indicate that the parameter's value was already the default (disabled). I may push a cosmetic change to avoid this.

Comment by James Nunez (Inactive) [ 10/Dec/13 ]

Here are the logs you requested.

Comment by Bruno Faccini (Inactive) [ 11/Dec/13 ]

James, I am afraid I forgot to ask you to also grow the debug log buffer size with "lctl set_param debug_mb=2048", because the debug log is missing the traces for the first commands.

Can you also "mark" the umount, I forgot too.

I would also like you to dump the config llog content after the "lctl set_param -d -P mdt.scratch-MDT0000.hsm_control"; to do so, you need to mount the MDT separately with "-t ldiskfs", then cd to the CONFIGS sub-directory and run the command "llog_reader params".

Last, I created a separate ticket, LU-4374, for the false error messages "error: executing set_param: Success" and "LustreError: 7226:0:(mgs_handler.c:744:mgs_iocontrol()) MGS: setparam err: rc = 1" that appear upon "lctl set_param -d -P mdt.<FSNAME>-MDT0000.hsm_control" when the default is already set.

Comment by Bruno Faccini (Inactive) [ 12/Dec/13 ]

I have requested access to the OpenSFS cluster and am now waiting for it, in order to work on the issue on the platform where you can reproduce it!

Comment by James Nunez (Inactive) [ 11/Mar/14 ]

Another instance of this failure at https://maloo.whamcloud.com/test_sets/7f2aafdc-a959-11e3-95fe-52540035b04c

Comment by Bruno Faccini (Inactive) [ 12/Mar/14 ]

Hmm, I thought (and hoped!) that this problem had disappeared, because this is the first failure of the sanity-hsm/test_300 sub-test since mid-December, according to Maloo auto-test stats.

I will try to find a way to reproduce it, but I still strongly suspect it occurs only on a specific environment/platform, so it may not be easy.

BTW James, the new instance you pointed to shows details that I am not familiar with: Test group = "acc-sm-c20", Arch/Lustre version = "2.5.1-RC3--PRISTINE-2.6.32-431.5.1.el6[_lustre].x86_64". Can you describe the configuration/platform you were using in more detail? Is it again the OpenSFS cluster?

Also, the test log shows the following error :

format at mdt_coordinator.c:977:mdt_hsm_cdt_start doesn't end in newline

which is due to a missing newline at the end of the message passed to CERROR, with the additional consequence of two merged lines/traces in the Lustre debug log.
This is still present in master and I pushed a patch to fix it at http://review.whamcloud.com/9597.

Comment by James Nunez (Inactive) [ 12/Mar/14 ]

Bruno,

Yes, this was run on the OpenSFS platform with RHEL 6.5 Lustre 2.5.1 RC3, single MGS/MDS, single OSS with two OSTs, one node running the latest Robinhood, one agent/client and one client. I don't know why the test group is "acc-sm-c20"; I was just running sanity-hsm on its own with no other tests.

Comment by James Nunez (Inactive) [ 13/Mar/14 ]

Since I'm testing HSM on 2.5.1, I thought I would collect the information Bruno asked for on 09/Oct/13. I ran sanity-hsm test 300 alone and got the "cdt state not stopped on mds1" failure. On the MDT, the first few commands succeeded, but the unmount hung and the node later crashed.

Here's what I see in dmesg captured by crash:

<4>Lustre: DEBUG MARKER: == sanity-hsm test 300: On disk coordinator state kept between MDT umount/mount == 09:30:44 (1394641844)
<6>Lustre: Disabling parameter general.mdt.lscratch-MDT0000.hsm_control in log params
<4>Lustre: Failing over lscratch-MDT0000
<3>LustreError: 137-5: lscratch-MDT0000_UUID: not available for connect from 192.168.2.118@o2ib (no target). If you are running an HA pair check that the target is mounted on the other server.
<4>Lustre: 16111:0:(client.c:1901:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1394641848/real 1394641848]  req@ffff88039ae00800 x1462381936924596/t0(0) o251->MGC192.168.2.116@o2ib@0@lo:26/25 lens 224/224 e 0 to 1 dl 1394641854 ref 2 fl Rpc:XN/0/ffffffff rc 0/-1
<4>Lustre: server umount lscratch-MDT0000 complete
<6>LDISKFS-fs (sda3): mounted filesystem with ordered data mode. quota=on. Opts:
 
<3>LustreError: 137-5: lscratch-MDT0000_UUID: not available for connect from 192.168.2.119@o2ib (no target). If you are running an HA pair check that the target is mounted on the other server.
<4>Lustre: lscratch-MDT0000: used disk, loading
<6>format at mdt_coordinator.c:977:mdt_hsm_cdt_start doesn't end in newline
<3>LustreError: 16267:0:(mdt_coordinator.c:977:mdt_hsm_cdt_start()) lscratch-MDT0000: cannot take the layout locks needed for registered restore: -2
<4>Lustre: MGS: non-config logname received: params
<3>LustreError: 11-0: lscratch-MDT0000-lwp-MDT0000: Communicating with 0@lo, operation mds_connect failed with -11.
<6>Lustre: lscratch-MDD0000: changelog on
<4>Lustre: lscratch-MDT0000: Will be in recovery for at least 5:00, or until 5 clients reconnect
<6>Lustre: lscratch-MDT0000: Recovery over after 0:29, of 5 clients 5 recovered and 0 were evicted.
<4>Lustre: DEBUG MARKER: mdc.lscratch-MDT0000-mdc-*.mds_server_uuid in FULL state after 29 sec
<4>Lustre: MGS: non-config logname received: params
<4>Lustre: Skipped 4 previous similar messages
<4>Lustre: DEBUG MARKER: sanity-hsm test_300: @@@@@@ FAIL: hsm_control state is not 'stopped' on mds1
<4>Lustre: DEBUG MARKER: == sanity-hsm test complete, duration 92 sec == 09:32:06 (1394641926)
<4>Lustre: DEBUG MARKER: shutdown
<4>Lustre: DEBUG MARKER: set_param-d-P
<6>Lustre: Disabling parameter general.mdt.lscratch-MDT0000.hsm_control in log params
<3>LustreError: 17173:0:(mgs_handler.c:744:mgs_iocontrol()) MGS: setparam err: rc = 1
<4>Lustre: DEBUG MARKER: get_param
<4>Lustre: DEBUG MARKER: umount
<4>Lustre: Failing over lscratch-MDT0000
<3>LustreError: 137-5: lscratch-MDT0000_UUID: not available for connect from 192.168.2.118@o2ib (no target). If you are running an HA pair check that the target is mounted on the other server.
<4>------------[ cut here ]------------
<4>WARNING: at lib/list_debug.c:48 list_del+0x6e/0xa0() (Not tainted)
<4>Hardware name: X8DTT-H
<4>list_del corruption. prev->next should be ffff88039810de18, but was 5a5a5a5a5a5a5a5a
<4>Modules linked in: osp(U) mdd(U) lfsck(U) lod(U) mdt(U) mgs(U) mgc(U) fsfilt_ldiskfs(U) osd_ldiskfs(U) lquota(U) lustre(U) lov(U) osc(U) mdc(U) fid(U) fld(U) ko2iblnd(U) ptlrpc(U) obdclass(U) lnet(U) lvfs(U) sha512_generic sha256_generic crc32c_intel libcfs(U) ldiskfs(U) jbd2 nf_conntrack_ipv4 nf_defrag_ipv4 xt_state nf_conntrack iptable_filter ip_tables nfsd exportfs nfs lockd fscache auth_rpc
gss nfs_acl sunrpc cpufreq_ondemand acpi_cpufreq freq_table mperf ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_addr ipv6 microcode iTCO_wdt iTCO_vendor_support serio_raw i2c_i801 i2c_core sg lpc_ich mfd_core mlx4_ib ib_sa ib_mad ib_core mlx4_en mlx4_core e1000e ptp pps_core ioatdma dca i7core_edac edac_core shpchp ext3 jbd mbcache sd_mod crc_t10dif pata_acpi ata_generic ata_pi
ix dm_mirror dm_region_hash dm_log dm_mod [last unloaded: nf_conntrack]
<4>Pid: 17120, comm: hsm_cdtr Not tainted 2.6.32-431.5.1.el6_lustre.x86_64 #1
<4>Call Trace:
<4> [<ffffffff81071e27>] ? warn_slowpath_common+0x87/0xc0
<4> [<ffffffff81071f16>] ? warn_slowpath_fmt+0x46/0x50
<4> [<ffffffff81084220>] ? process_timeout+0x0/0x10
<4> [<ffffffff8129489e>] ? list_del+0x6e/0xa0
<4> [<ffffffff8109b681>] ? remove_wait_queue+0x31/0x50
<4> [<ffffffffa0e8ef6e>] ? mdt_coordinator+0xbce/0x16b0 [mdt]
<4> [<ffffffff810096f0>] ? __switch_to+0xd0/0x320
<4> [<ffffffff81065df0>] ? default_wake_function+0x0/0x20
<4> [<ffffffff81528090>] ? thread_return+0x4e/0x76e
<4> [<ffffffffa0e8e3a0>] ? mdt_coordinator+0x0/0x16b0 [mdt]
<4> [<ffffffff8109aee6>] ? kthread+0x96/0xa0
<4> [<ffffffff8100c20a>] ? child_rip+0xa/0x20
<4> [<ffffffff8109ae50>] ? kthread+0x0/0xa0
<4> [<ffffffff8100c200>] ? child_rip+0x0/0x20
<4>---[ end trace 4c18e762bb5c5a65 ]---
<4>------------[ cut here ]------------
<4>WARNING: at lib/list_debug.c:51 list_del+0x8d/0xa0() (Tainted: G        W  --
-------------   )
<4>Hardware name: X8DTT-H
<4>list_del corruption. next->prev should be ffff88039810de18, but was 5a5a5a5a5a5a5a5a
<4>Modules linked in: osp(U) mdd(U) lfsck(U) lod(U) mdt(U) mgs(U) mgc(U) fsfilt_ldiskfs(U) osd_ldiskfs(U) lquota(U) lustre(U) lov(U) osc(U) mdc(U) fid(U) fld(U) ko2iblnd(U) ptlrpc(U) obdclass(U) lnet(U) lvfs(U) sha512_generic sha256_generic crc32c_intel libcfs(U) ldiskfs(U) jbd2 nf_conntrack_ipv4 nf_defrag_ipv4 xt_state nf_conntrack iptable_filter ip_tables nfsd exportfs nfs lockd fscache auth_rpc
gss nfs_acl sunrpc cpufreq_ondemand acpi_cpufreq freq_table mperf ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_addr ipv6 microcode iTCO_wdt iTCO_vendor_support serio_raw i2c_i801 i2c_core sg lpc_ich mfd_core mlx4_ib ib_sa ib_mad ib_core mlx4_en mlx4_core e1000e ptp pps_core ioatdma dca i7core_edac edac_core shpchp ext3 jbd mbcache sd_mod crc_t10dif pata_acpi ata_generic ata_pi
ix dm_mirror dm_region_hash dm_log dm_mod [last unloaded: nf_conntrack]
<4>Pid: 17120, comm: hsm_cdtr Tainted: G        W  ---------------    2.6.32-431.5.1.el6_lustre.x86_64 #1
<4>Call Trace:
<4> [<ffffffff81071e27>] ? warn_slowpath_common+0x87/0xc0
<4> [<ffffffff81071f16>] ? warn_slowpath_fmt+0x46/0x50
<4> [<ffffffff81084220>] ? process_timeout+0x0/0x10
<4> [<ffffffff812948bd>] ? list_del+0x8d/0xa0
<4> [<ffffffff8109b681>] ? remove_wait_queue+0x31/0x50
<4> [<ffffffffa0e8ef6e>] ? mdt_coordinator+0xbce/0x16b0 [mdt]
<4> [<ffffffff810096f0>] ? __switch_to+0xd0/0x320
<4> [<ffffffff81065df0>] ? default_wake_function+0x0/0x20
<4> [<ffffffff81528090>] ? thread_return+0x4e/0x76e
<4> [<ffffffffa0e8e3a0>] ? mdt_coordinator+0x0/0x16b0 [mdt]
<4> [<ffffffff8109aee6>] ? kthread+0x96/0xa0
<4> [<ffffffff8100c20a>] ? child_rip+0xa/0x20
<4> [<ffffffff8109ae50>] ? kthread+0x0/0xa0
<4> [<ffffffff8100c200>] ? child_rip+0x0/0x20
<4>---[ end trace 4c18e762bb5c5a66 ]---
<4>Lustre: 17182:0:(client.c:1901:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1394641933/real 1394641933]  req@ffff880392bf6c00 x1462381936924904/t0(0) o251->MGC192.168.2.116@o2ib@0@lo:26/25 lens 224/224 e 0 to 1 dl 1394641939 ref 2 fl Rpc:XN/0/ffffffff rc 0/-1
<4>general protection fault: 0000 [#1] SMP 
<4>last sysfs file: /sys/devices/system/cpu/online
<4>CPU 7 
<4>Modules linked in: osp(U) mdd(U) lfsck(U) lod(U) mdt(U) mgs(U) mgc(U) fsfilt_ldiskfs(U) osd_ldiskfs(U) lquota(U) lustre(U) lov(U) osc(U) mdc(U) fid(U) fld(U) ko2iblnd(U) ptlrpc(U) obdclass(U) lnet(U) lvfs(U) sha512_generic sha256_generic crc32c_intel libcfs(U) ldiskfs(U) jbd2 nf_conntrack_ipv4 nf_defrag_ipv4 xt_state nf_conntrack iptable_filter ip_tables nfsd exportfs nfs lockd fscache auth_rpc
gss nfs_acl sunrpc cpufreq_ondemand acpi_cpufreq freq_table mperf ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_addr ipv6 microcode iTCO_wdt iTCO_vendor_support serio_raw i2c_i801 i2c_core sg lpc_ich mfd_core mlx4_ib ib_sa ib_mad ib_core mlx4_en mlx4_core e1000e ptp pps_core ioatdma dca i7core_edac edac_core shpchp ext3 jbd mbcache sd_mod crc_t10dif pata_acpi ata_generic ata_pi
ix dm_mirror dm_region_hash dm_log dm_mod [last unloaded: nf_conntrack]
<4>
<4>Pid: 17120, comm: hsm_cdtr Tainted: G        W  ---------------    2.6.32-431.5.1.el6_lustre.x86_64 #1 Supermicro X8DTT-H/X8DTT-H
<4>RIP: 0010:[<ffffffff8128a039>]  [<ffffffff8128a039>] strnlen+0x9/0x40
<4>RSP: 0018:ffff88039810dae0  EFLAGS: 00010286
<4>RAX: ffffffff817b5a4e RBX: ffff8803a9176000 RCX: 0000000000000002
<4>RDX: 5a5a5a5a5a5a5a66 RSI: ffffffffffffffff RDI: 5a5a5a5a5a5a5a66
<4>RBP: ffff88039810dae0 R08: 0000000000000073 R09: ffff88043839cd40
<4>R10: 0000000000000001 R11: 000000000000000f R12: ffff8803a9175134
<4>R13: 5a5a5a5a5a5a5a66 R14: 00000000ffffffff R15: 0000000000000000
<4>FS:  0000000000000000(0000) GS:ffff880045ce0000(0000) knlGS:0000000000000000
<4>CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
<4>CR2: 0000000000448000 CR3: 000000082d703000 CR4: 00000000000007e0
<4>DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
<4>DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
<4>Process hsm_cdtr (pid: 17120, threadinfo ffff88039810c000, task ffff88082d41a080)
<4>Stack:
<4> ffff88039810db20 ffffffff8128b2f0 ffff880000033b48 ffff8803a9175134
<4><d> ffffffffa0ea175a ffffffffa0ea1758 ffff88039810dcb0 ffff8803a9176000
<4><d> ffff88039810dbc0 ffffffff8128c738 0000000000000004 0000000affffffff
<4>Call Trace:
<4> [<ffffffff8128b2f0>] string+0x40/0x100
<4> [<ffffffff8128c738>] vsnprintf+0x218/0x5e0
<4> [<ffffffffa052f27b>] ? cfs_set_ptldebug_header+0x2b/0xc0 [libcfs]
<4> [<ffffffffa053effa>] libcfs_debug_vmsg2+0x2ea/0xbc0 [libcfs]
<4> [<ffffffff81071e34>] ? warn_slowpath_common+0x94/0xc0
<4> [<ffffffff8100bb8e>] ? apic_timer_interrupt+0xe/0x20
<4> [<ffffffffa053f911>] libcfs_debug_msg+0x41/0x50 [libcfs]
<4> [<ffffffff81058d53>] ? __wake_up+0x53/0x70
<4> [<ffffffffa0e8f2af>] mdt_coordinator+0xf0f/0x16b0 [mdt]
<4> [<ffffffff810096f0>] ? __switch_to+0xd0/0x320
<4> [<ffffffff81065df0>] ? default_wake_function+0x0/0x20
<4> [<ffffffff81528090>] ? thread_return+0x4e/0x76e
<4> [<ffffffffa0e8e3a0>] ? mdt_coordinator+0x0/0x16b0 [mdt]
<4> [<ffffffff8109aee6>] kthread+0x96/0xa0
<4> [<ffffffff8100c20a>] child_rip+0xa/0x20
<4> [<ffffffff8109ae50>] ? kthread+0x0/0xa0
<4> [<ffffffff8100c200>] ? child_rip+0x0/0x20
<4>Code: 66 90 48 83 c2 01 80 3a 00 75 f7 48 89 d0 48 29 f8 c9 c3 66 66 66 66 66 66 2e 0f 1f 84 00 00 00 00 00 55 48 85 f6 48 89 e5 74 2e <80> 3f 00 74 29 48 83 ee 01 48 89 f8 eb 12 66 0f 1f 84 00 00 00 
<1>RIP  [<ffffffff8128a039>] strnlen+0x9/0x40
<4> RSP <ffff88039810dae0>

Comment by Jodi Levi (Inactive) [ 29/May/14 ]

Patch landed to Master. Please reopen ticket if more work is needed.

Comment by James Nunez (Inactive) [ 04/Sep/14 ]

I've run into this error again while testing HSM with the 2.5.3-RC1 version of Lustre on the OpenSFS cluster. The results are at https://testing.hpdd.intel.com/test_sessions/60308f12-3472-11e4-995a-5254006e85c2

Comment by Bruno Faccini (Inactive) [ 10/Nov/14 ]

James, according to this auto-test failure report, it is unclear to me how this happened ...
BTW, I ran a Maloo/auto-tests search for sanity-hsm/test_300 failures, and it seems the test runs pretty well: the failure you reported is the last/only one in the past 2 months:

Name            Status          Run at                  Duration        Return code     Error                                           Test log        Bugs
test_300        FAIL    2014-09-05 01:47:58 UTC         31              1               Restart of mds1 failed!                         Preview 10
test_300        FAIL    2014-09-02 13:14:59 UTC         80              1               hsm_control state is not 'stopped' on mds1      Preview 10
test_300        FAIL    2014-08-27 01:19:32 UTC         107             1               post-failover df: 1                             Preview 10
test_300        FAIL    2014-08-21 19:55:39 UTC         714             1               import is not in FULL state                     Preview 10      LU-4018
test_300        FAIL    2014-07-01 22:43:31 UTC         599             1               hsm_control state is not 'stopped' on mds1      Preview 10
test_300        FAIL    2014-07-01 10:47:59 UTC         107             1               hsm_control state is not 'stopped' on mds1      Preview 10
test_300        FAIL    2014-06-21 14:32:23 UTC         723             1               import is not in FULL state                     Preview 10      LU-4018
test_300        FAIL    2014-06-02 17:16:48 UTC         331             1               hsm_control state is not 'stopped' on mds1      Preview 10
test_300        FAIL    2014-05-28 05:46:52 UTC         714             1               import is not in FULL state                     Preview 10      LU-4018
test_300        FAIL    2014-05-27 09:56:15 UTC         975             1               import is not in FULL state                     Preview 10      LU-4018
test_300        FAIL    2014-05-22 19:10:52 UTC         105             1               hsm_control state is not 'stopped' on mds1      Preview 10      LU-4065
test_300        FAIL    2014-05-08 18:55:20 UTC         716             1               import is not in FULL state                     Preview 10      LU-4018
test_300        FAIL    2014-05-07 22:34:02 UTC         746             1               import is not in FULL state                     Preview 10      LU-4018
test_300        FAIL    2014-03-26 22:02:07 UTC         77              1               hsm_control state is not 'stopped' on mds1      Preview 10      LU-4125, LU-4065
test_300        FAIL    2014-03-11 20:01:49 UTC         79              1               hsm_control state is not 'stopped' on mds1      Preview 10      LU-4065, LU-4065
test_300        FAIL    2013-12-08 19:39:12 UTC         755             1               import is not in FULL state                                     LU-4018, LU-4361
Comment by Sergey Cheremencev [ 19/Nov/14 ]

Hello, we hit the same error at Seagate and are now using the following solution: http://review.whamcloud.com/#/c/12783/.

Short explanation:

  1. "cdt_set_mount_state enabled" sets the parameter on the server. It uses lctl set_param -P.
  2. copytool_cleanup sets hsm_control=shutdown and waits until hsm_control becomes "stopped" (after "shutdown").
  3. At this moment (while copytool_cleanup waits for "stopped"), the MGC retrieves and applies the configuration from the server, with the hsm_control=enabled that was set in step 1.
  4. The MGC starts log processing with a delay of about 10 seconds (see mgc_requeue_thread).
    copytool_cleanup waits 20 seconds and gets hsm_control==enabled because the parameter was modified in step 3.
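The race in the steps above can be illustrated with a toy timeline simulation (the timings are assumptions approximating the ~10 second MGC requeue delay and the 20 second wait described in the comment, not measured values):

```python
# Toy simulation of the race: the locally written "shutdown" takes effect
# quickly, but the MGC replays the earlier set_param -P "enabled" record
# about 10 seconds later, so a poll loop waiting for "stopped" ends up
# observing "enabled" again. All timings here are illustrative.

events = [              # (seconds, hsm_control value as seen on the MDT)
    (0,  "stopping"),   # copytool_cleanup writes hsm_control=shutdown
    (2,  "stopped"),    # coordinator thread finishes shutting down
    (10, "enabled"),    # MGC requeue thread replays the -P "enabled" record
]

def state_at(t):
    """Return the last event value applied at or before time t."""
    value = None
    for when, v in events:
        if when <= t:
            value = v
    return value

# Poll every 2 seconds for up to 20 seconds, like the test's wait loop.
seen = [state_at(t) for t in range(0, 21, 2)]
assert "stopped" in seen            # the wanted state was briefly reached...
assert state_at(20) == "enabled"    # ...but the delayed replay undid it
```

This matches the "Update not seen after 20s: wanted 'stopped' got 'enabled'" output in the original failure log: whether the poll wins depends on where in the 20 second window the delayed config replay lands.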
Comment by Bruno Faccini (Inactive) [ 22/Jan/15 ]

Sergey, thanks for this comment and patch, but I wonder whether you hit the related issue/failure only when running the sanity-hsm/test_300 sub-test, or also with other uses of cdt_set_mount_state() within sanity-hsm? BTW, I don't understand why you refer to copytool_cleanup() in your description.

Also, did you find the detailed behavior you have described by analyzing MGS/MDS nodes debug-log?

Also, your patch failed auto-tests due to unrelated failures; I think you need to rebase it to avoid such issues.

Last, a Maloo auto-test query indicates that the issue in this ticket has been encountered only once (on 2014-11-19 15:49:17 UTC) during the past 4 months. I am presently analyzing that failure's associated MDS/MGS debug log, and in parallel I am trying to reproduce it in-house.

Comment by Sergey Cheremencev [ 26/Jan/15 ]

Yes, we hit this bug in a lot of sanity-hsm sub-tests: 402, 3, 106 ...
In our case it was a race between copytool_cleanup and cdt_set_mount_state, so usually copytool_cleanup failed.

About why test_300 may fail (my view):

  1. cdt_set_mount_state sets the parameter using -P.
  2. "cdt_check_state stopped" waits until hsm_control becomes "stopped" (after cdt_shutdown and cdt_clear_mount_state).
  3. At this moment (while "cdt_check_state stopped" waits for "stopped"), the MGC retrieves and applies the configuration from the server, with the hsm_control=enabled that was set in step 1.

Also, did you find the detailed behavior you have described by analyzing MGS/MDS nodes debug-log?

yes

About reproducing the problem: you may try to make a custom build with MGC_TIMEOUT_MIN_SECONDS = 10 or 15. If it does not break something else, it may help.

Comment by Sergey Cheremencev [ 28/Sep/15 ]

There is a +1 from Andreas in Gerrit. Can somebody else review it so we can move forward?

Comment by Gerrit Updater [ 02/Oct/15 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/12783/
Subject: LU-4065 tests: hsm copytool_cleanup improvement
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 73bca6c1f4923cdf673fa11486aec04ec3576051

Comment by Joseph Gmitter (Inactive) [ 02/Oct/15 ]

Landed for 2.8.0

Generated at Sat Feb 10 01:39:20 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.