[LU-2059] mgc to backup configuration on osd-based llogs Created: 01/Oct/12  Updated: 29/May/17  Resolved: 29/May/17

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.4.0, Lustre 2.4.1, Lustre 2.5.0, Lustre 2.6.0
Fix Version/s: Lustre 2.4.0, Lustre 2.5.0, Lustre 2.6.0, Lustre 2.5.4

Type: Bug Priority: Critical
Reporter: Alex Zhuravlev Assignee: John Fuchs-Chesney (Inactive)
Resolution: Fixed Votes: 0
Labels: yuc2

Issue Links:
Duplicate
duplicates LU-2959 ASSERTION( cli->cl_mgc_configs_dir ) ... Closed
is duplicated by LU-1931 Test failure on test suite conf-sanit... Resolved
is duplicated by LU-3562 Test failure on test suite insanity, ... Resolved
is duplicated by LU-2959 ASSERTION( cli->cl_mgc_configs_dir ) ... Closed
Related
is related to LU-4121 re-enable tests disabled due to LU-2059 Resolved
is related to LU-5984 server_mgc_set_fs()) can't set_fs -17 Resolved
is related to LU-3347 (local_storage.c:872:local_oid_storag... Resolved
is related to LU-4565 recovery-mds-scale test_failover_ost:... Resolved
is related to LU-6395 One byte size shorter for name alloca... Resolved
is related to LU-2778 Test failure on test suite conf-sanit... Resolved
is related to LU-4274 mount of zfs ost fails: Is the MGS ru... Resolved
is related to LU-5796 MGS: non-config logname received: params Closed
is related to LU-2158 remove lvfs and fsfilt code Resolved
Severity: 3
Rank (Obsolete): 4308

 Comments   
Comment by Alex Zhuravlev [ 01/Oct/12 ]

mgc_process_cfg_log() does not make a local copy of the configuration if zfs is in use. This is because lsi_srv_mnt == NULL in:

        /* Copy the setup log locally if we can. Don't mess around if we're
           running an MGS though (logs are already local). */
        if (lctxt && lsi && IS_SERVER(lsi) &&
            (lsi->lsi_srv_mnt == cli->cl_mgc_vfsmnt) &&
            !IS_MGS(lsi) && lsi->lsi_srv_mnt) {

This is because mgc_copy_llog() still relies on lvfs; see mgc_llog_init():

        rc = llog_setup(NULL, obd, olg, LLOG_CONFIG_ORIG_CTXT, tgt,
                        &llog_lvfs_ops);

This should be changed to llog_osd_ops.
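
A minimal sketch of the change described here, assuming the surrounding mgc_llog_init() code stays as-is and only the handler operations passed to llog_setup() are swapped:

        /* Sketch: register the config llog context with the OSD-based
         * handler instead of the lvfs-based one, so that mgc_copy_llog()
         * works on both ldiskfs and zfs backends. */
        rc = llog_setup(NULL, obd, olg, LLOG_CONFIG_ORIG_CTXT, tgt,
                        &llog_osd_ops);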

Comment by Andreas Dilger [ 09/Oct/12 ]

http://review.whamcloud.com/4234 skips insanity.sh test_2 and test_4 for ZFS configs, and removes the skip for conf-sanity.sh test_5d, test_19b, test_21b, and test_27a for ldiskfs.

Comment by Andreas Dilger [ 09/Nov/12 ]

Moving over to the OSD API for the MGC backup will also allow us to delete a few thousand lines of obsolete code in lustre/lvfs/fsfilt*.c and lustre/lvfs/lvfs*.c.
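
As a rough illustration of that direction, here is a hedged sketch of an OSD-based copy step built around llog_backup(), the helper that appears in later stack traces in this ticket; the variable names are illustrative, not the actual patch:

        /* Sketch: after fetching the config log from the MGS, copy it
         * into a local llog context backed by the OSD, so the target
         * can still be started later if the MGS is unreachable. */
        rc = llog_backup(env, mgc, ctxt, lctxt, logname, logname);
        if (rc)
                CERROR("%s: failed to back up log %s: rc = %d\n",
                       mgc->obd_name, logname, rc);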

Comment by Mikhail Pershin [ 18/Jan/13 ]

http://review.whamcloud.com/5049

Comment by James A Simmons [ 11/Feb/13 ]

Patch needs to be rebased.

Comment by Mikhail Pershin [ 11/Feb/13 ]

Done

Comment by Nathaniel Clark [ 26/Feb/13 ]

Tests are listed for zfs as ALWAYS_EXCEPT under bug LU-2778.

Comment by Mikhail Pershin [ 27/Feb/13 ]

Maloo failed at replay-ost-single test_7. As far as I can see, that happens because the config logs are 44 blocks, bigger than the 40 blocks the test expects. We need to find out whether the logs have genuinely grown that large and the test must be fixed, or whether this is a patch issue.

Comment by James A Simmons [ 04/Mar/13 ]

Mikhail, does http://review.whamcloud.com/#change,5569 replace http://review.whamcloud.com/#change,5049?

Comment by Mikhail Pershin [ 04/Mar/13 ]

James, it is the same as 5049, but with debugging added to replay-ost-single to find out how big the config logs can be. See my previous comment.

Comment by James A Simmons [ 22/Mar/13 ]

Any updates on this?

Comment by Mikhail Pershin [ 30/Mar/13 ]

James, I've found object leaks with the latest patch. I am figuring out what is wrong.

Comment by James A Simmons [ 01/Apr/13 ]

Excellent, looking forward to your patch.

Comment by Peter Jones [ 01/Apr/13 ]

Is this a must-have fix for 2.4.0 or can it be tidied up in 2.4.1?

Comment by Peter Jones [ 02/Apr/13 ]

Dropping priority after chatting with Mike.

Comment by Christopher Morrone [ 02/Apr/13 ]

Could you elaborate? This doesn't seem to me to be the kind of thing we should be doing on maintenance branches.

Comment by Peter Jones [ 02/Apr/13 ]

Chris

Mike basically said that this ticket was created to track some technical-debt-style cleanup and that it would be OK to live with the present situation for 2.4. If, as you say, the changes are too large when ready, then they would end up in 2.5 rather than 2.4.1.

Peter

Comment by James A Simmons [ 24/Apr/13 ]

With the latest test I'm getting this error. I do have a few other patches running with this one, so tomorrow I will try a stock 2.3.64 with only this patch to see if the problem still exists.

Apr 24 17:05:13 barry-oss1 kernel: [ 481.702376] INFO: task mount.lustre:8891 blocked for more than 120 seconds.
Apr 24 17:05:13 barry-oss1 kernel: [ 481.720497] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Apr 24 17:05:13 barry-oss1 kernel: [ 481.736916] mount.lustre D 0000000000000001 0 8891 8890 0x00000080
Apr 24 17:05:13 barry-oss1 kernel: [ 481.751549] ffff8803e085b798 0000000000000086 0000000000000000 ffff8803c8cee000
Apr 24 17:05:13 barry-oss1 kernel: [ 481.766831] ffff8803e085b748 ffffc9001ecf3030 0000000000000246 0000000000000246
Apr 24 17:05:13 barry-oss1 kernel: [ 481.782091] ffff8803e06a7af8 ffff8803e085bfd8 000000000000fb88 ffff8803e06a7af8
Apr 24 17:05:13 barry-oss1 kernel: [ 481.797351] Call Trace:
Apr 24 17:05:13 barry-oss1 kernel: [ 481.809126] [<ffffffffa04586fe>] cfs_waitq_wait+0xe/0x10 [libcfs]
Apr 24 17:05:13 barry-oss1 kernel: [ 481.822999] [<ffffffffa05b5e43>] lu_object_find_at+0xb3/0x360 [obdclass]
Apr 24 17:05:13 barry-oss1 kernel: [ 481.837409] [<ffffffff81063310>] ? default_wake_function+0x0/0x20
Apr 24 17:05:13 barry-oss1 kernel: [ 481.851209] [<ffffffffa05b851a>] dt_locate_at+0x3a/0x140 [obdclass]
Apr 24 17:05:13 barry-oss1 kernel: [ 481.865294] [<ffffffffa0598333>] local_oid_storage_init+0x513/0x1200 [obdclass]
Apr 24 17:05:13 barry-oss1 kernel: [ 481.887428] [<ffffffffa0d390bc>] ? osd_index_ea_lookup+0x2fc/0x770 [osd_ldiskfs]
Apr 24 17:05:13 barry-oss1 kernel: [ 481.908544] [<ffffffffa058ff7a>] llog_osd_setup+0xfa/0x360 [obdclass]
Apr 24 17:05:13 barry-oss1 kernel: [ 481.922742] [<ffffffffa056a1d2>] llog_setup+0x352/0x920 [obdclass]
Apr 24 17:05:13 barry-oss1 kernel: [ 481.938678] [<ffffffffa0db2fe9>] mgc_set_info_async+0x1249/0x1960 [mgc]
Apr 24 17:05:13 barry-oss1 kernel: [ 481.953709] [<ffffffffa04682a1>] ? libcfs_debug_msg+0x41/0x50 [libcfs]
Apr 24 17:05:13 barry-oss1 kernel: [ 481.967880] [<ffffffffa05d7ab0>] server_mgc_set_fs+0x120/0x520 [obdclass]
Apr 24 17:05:13 barry-oss1 kernel: [ 481.988772] [<ffffffffa05de4e5>] server_start_targets+0x85/0x19c0 [obdclass]
Apr 24 17:05:13 barry-oss1 kernel: [ 482.003376] [<ffffffffa0458b2e>] ? cfs_free+0xe/0x10 [libcfs]
Apr 24 17:05:13 barry-oss1 kernel: [ 482.016577] [<ffffffffa05af3c5>] ? lustre_start_mgc+0x4b5/0x1ee0 [obdclass]
Apr 24 17:05:13 barry-oss1 kernel: [ 482.031047] [<ffffffffa057caf8>] ? class_new_export+0x718/0x940 [obdclass]
Apr 24 17:05:13 barry-oss1 kernel: [ 482.048057] [<ffffffffa05e09cc>] server_fill_super+0xbac/0x1660 [obdclass]
Apr 24 17:05:13 barry-oss1 kernel: [ 482.062188] [<ffffffffa05b0fc8>] lustre_fill_super+0x1d8/0x530 [obdclass]
Apr 24 17:05:13 barry-oss1 kernel: [ 482.076384] [<ffffffffa05b0df0>] ? lustre_fill_super+0x0/0x530 [obdclass]
Apr 24 17:05:13 barry-oss1 kernel: [ 482.090313] [<ffffffff811842bf>] get_sb_nodev+0x5f/0xa0
Apr 24 17:05:13 barry-oss1 kernel: [ 482.102667] [<ffffffffa05a8965>] lustre_get_sb+0x25/0x30 [obdclass]
Apr 24 17:05:13 barry-oss1 kernel: [ 482.123386] [<ffffffff811838fb>] vfs_kern_mount+0x7b/0x1b0
Apr 24 17:05:13 barry-oss1 kernel: [ 482.138730] [<ffffffff81183aa2>] do_kern_mount+0x52/0x130
Apr 24 17:05:13 barry-oss1 kernel: [ 482.152137] [<ffffffff811a3d52>] do_mount+0x2d2/0x8d0
Apr 24 17:05:13 barry-oss1 kernel: [ 482.166646] [<ffffffff811a43e0>] sys_mount+0x90/0xe0
Apr 24 17:05:13 barry-oss1 kernel: [ 482.178209] [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
Apr 24 17:05:15 barry-oss1 kernel: [ 483.945364] LustreError: 137-5: UUID 'lustre-OST0014_UUID' is not available for connect (no target)
Apr 24 17:05:15 barry-oss1 kernel: [ 483.962848] LustreError: Skipped 261 previous similar messages

Comment by James A Simmons [ 25/Apr/13 ]

Running plain 2.3.64 with this patch alone still produces this error.

Comment by Mikhail Pershin [ 25/Apr/13 ]

James, how do I reproduce this error? Some specific tests or sequence of actions? Is this the first mount?

Comment by James A Simmons [ 25/Apr/13 ]

This is the first mount.

Comment by Mikhail Pershin [ 25/Apr/13 ]

James, is that with zfs, or does it not matter?

Comment by James A Simmons [ 25/Apr/13 ]

ldiskfs. No zfs setup here.

Comment by Mikhail Pershin [ 25/Apr/13 ]

OK, I can reproduce that after the patch rebase over the latest master.

Comment by James A Simmons [ 25/Apr/13 ]

Time for git bisect

Comment by Mikhail Pershin [ 27/Apr/13 ]

OK, I found the conflict and solved it. Please try again; the patch has been re-pushed.

Comment by James A Simmons [ 29/Apr/13 ]

Yes, the patch is working great now. One note for testers of this patch: you need to unmount the file system and reformat the MGT. After you remount, the file system should be good to go.

Comment by Christopher Morrone [ 29/Apr/13 ]

That last comment about reformatting is disturbing. Hopefully that was only needed if you were running a previous version of change 5049?

Comment by James A Simmons [ 01/May/13 ]

The format of the changelog is altered by this patch, so you need to delete the old MGT change logs. Yes, reformatting my MGT was a bit heavy-handed. It is enough to unmount your file system and use tunefs.lustre --writeconf ... to erase the old configuration logs; then remount, and the file system should be good to go.

Comment by Christopher Morrone [ 01/May/13 ]

We aren't going to land something that requires all existing users of Lustre to writeconf their filesystems, right?

Comment by James A Simmons [ 20/May/13 ]

Looks like this is a 2.5 thing. Peter, can you change the affected version to 2.5?

Comment by Peter Jones [ 20/May/13 ]

OK, done.

Comment by Mikhail Pershin [ 20/May/13 ]

Christopher, the format of the changelog was not changed in this patch; the patch was just rebased over the latest master.

Comment by James A Simmons [ 05/Jul/13 ]

Testing with the latest patch, I'm getting the following error when attempting to bring up a file system.

Jul 5 11:36:30 barry-oss1 kernel: [ 530.526152] LDISKFS-fs (dm-9): recovery complete
Jul 5 11:36:30 barry-oss1 kernel: [ 530.531930] LDISKFS-fs (dm-9): mounted filesystem with ordered data mode. quota=on. Opts:
Jul 5 11:36:30 barry-oss1 kernel: [ 530.714659] LustreError: 3465:0:(fld_handler.c:123:fld_server_lookup()) ASSERTION( info != ((void *)0) ) failed:
Jul 5 11:36:30 barry-oss1 kernel: [ 530.725093] LustreError: 3465:0:(fld_handler.c:123:fld_server_lookup()) LBUG
Jul 5 11:36:30 barry-oss1 kernel: [ 530.732252] Pid: 3465, comm: mount.lustre
Jul 5 11:36:30 barry-oss1 kernel: [ 530.736331]
Jul 5 11:36:30 barry-oss1 kernel: [ 530.736332] Call Trace:
Jul 5 11:36:30 barry-oss1 kernel: [ 530.740414] [<ffffffffa0472895>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
Jul 5 11:36:30 barry-oss1 kernel: [ 530.747488] [<ffffffffa0472e97>] lbug_with_loc+0x47/0xb0 [libcfs]
Jul 5 11:36:30 barry-oss1 kernel: [ 530.753779] [<ffffffffa095dddd>] fld_server_lookup+0x2bd/0x3d0 [fld]
Jul 5 11:36:30 barry-oss1 kernel: [ 530.760312] [<ffffffff8119c87e>] ? generic_detach_inode+0x18e/0x1f0
Jul 5 11:36:30 barry-oss1 kernel: [ 530.766784] [<ffffffffa0d77d91>] osd_fld_lookup+0x71/0x1d0 [osd_ldiskfs]
Jul 5 11:36:30 barry-oss1 kernel: [ 530.773667] [<ffffffff8119c6e2>] ? iput+0x62/0x70
Jul 5 11:36:30 barry-oss1 kernel: [ 530.778559] [<ffffffffa0d77f8a>] osd_remote_fid+0x9a/0x280 [osd_ldiskfs]
Jul 5 11:36:30 barry-oss1 kernel: [ 530.785456] [<ffffffffa0d84ca1>] osd_index_ea_lookup+0x521/0x850 [osd_ldiskfs]
Jul 5 11:36:30 barry-oss1 kernel: [ 530.792949] [<ffffffffa05ed34f>] dt_lookup_dir+0x6f/0x130 [obdclass]
Jul 5 11:36:30 barry-oss1 kernel: [ 530.799515] [<ffffffffa05cbd25>] llog_osd_open+0x475/0xbb0 [obdclass]
Jul 5 11:36:30 barry-oss1 kernel: [ 530.806156] [<ffffffffa059832a>] llog_open+0xba/0x2c0 [obdclass]
Jul 5 11:36:30 barry-oss1 kernel: [ 530.812362] [<ffffffffa059bf81>] llog_backup+0x61/0x500 [obdclass]
Jul 5 11:36:30 barry-oss1 kernel: [ 530.818741] [<ffffffff81281d00>] ? sprintf+0x40/0x50
Jul 5 11:36:30 barry-oss1 kernel: [ 530.829312] [<ffffffffa0e046e2>] mgc_process_log+0x1192/0x18e0 [mgc]
Jul 5 11:36:30 barry-oss1 kernel: [ 530.841291] [<ffffffffa0dfe370>] ? mgc_blocking_ast+0x0/0x7e0 [mgc]
Jul 5 11:36:30 barry-oss1 kernel: [ 530.853282] [<ffffffffa074ccf0>] ? ldlm_completion_ast+0x0/0x960 [ptlrpc]
Jul 5 11:36:30 barry-oss1 kernel: [ 530.865730] [<ffffffffa0e062c4>] mgc_process_config+0x594/0xed0 [mgc]
Jul 5 11:36:30 barry-oss1 kernel: [ 530.877804] [<ffffffffa05e14a6>] lustre_process_log+0x256/0xaa0 [obdclass]
Jul 5 11:36:30 barry-oss1 kernel: [ 530.890296] [<ffffffffa05b0922>] ? class_name2dev+0x42/0xe0 [obdclass]
Jul 5 11:36:30 barry-oss1 kernel: [ 530.902398] [<ffffffff81167d73>] ? kmem_cache_alloc_trace+0x1a3/0x1b0
Jul 5 11:36:30 barry-oss1 kernel: [ 530.914407] [<ffffffffa05b09ce>] ? class_name2obd+0xe/0x30 [obdclass]
Jul 5 11:36:30 barry-oss1 kernel: [ 530.926353] [<ffffffffa0615121>] server_start_targets+0x1821/0x1a40 [obdclass]
Jul 5 11:36:30 barry-oss1 kernel: [ 530.939063] [<ffffffffa05e4ac3>] ? lustre_start_mgc+0x493/0x1e90 [obdclass]
Jul 5 11:36:30 barry-oss1 kernel: [ 530.951484] [<ffffffffa05b4638>] ? class_new_export+0x708/0x930 [obdclass]
Jul 5 11:36:30 barry-oss1 kernel: [ 530.963790] [<ffffffffa05dca10>] ? class_config_llog_handler+0x0/0x1820 [obdclass]
Jul 5 11:36:30 barry-oss1 kernel: [ 530.976755] [<ffffffffa0618ccc>] server_fill_super+0xbac/0x1a04 [obdclass]
Jul 5 11:36:30 barry-oss1 kernel: [ 530.989081] [<ffffffffa05e6698>] lustre_fill_super+0x1d8/0x530 [obdclass]
Jul 5 11:36:30 barry-oss1 kernel: [ 531.001275] [<ffffffffa05e64c0>] ? lustre_fill_super+0x0/0x530 [obdclass]
Jul 5 11:36:30 barry-oss1 kernel: [ 531.013381] [<ffffffff8118430f>] get_sb_nodev+0x5f/0xa0
Jul 5 11:36:30 barry-oss1 kernel: [ 531.023912] [<ffffffffa05de335>] lustre_get_sb+0x25/0x30 [obdclass]
Jul 5 11:36:30 barry-oss1 kernel: [ 531.035448] [<ffffffff8118394b>] vfs_kern_mount+0x7b/0x1b0
Jul 5 11:36:30 barry-oss1 kernel: [ 531.046172] [<ffffffff81183af2>] do_kern_mount+0x52/0x130
Jul 5 11:36:30 barry-oss1 kernel: [ 531.056713] [<ffffffff811a3d22>] do_mount+0x2d2/0x8d0
Jul 5 11:36:30 barry-oss1 kernel: [ 531.066857] [<ffffffff811a43b0>] sys_mount+0x90/0xe0
Jul 5 11:36:30 barry-oss1 kernel: [ 531.076861] [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
Jul 5 11:36:30 barry-oss1 kernel: [ 531.087802]

Comment by Nathaniel Clark [ 25/Jul/13 ]

Re-enable all tests EXCEPTed or skipped:
http://review.whamcloud.com/7113

Comment by Jian Yu [ 06/Sep/13 ]

Lustre build: http://build.whamcloud.com/job/lustre-b2_4/44/ (2.4.1 RC1)
Distro/Arch: RHEL6.4/x86_64
FSTYPE=zfs
MDSCOUNT=4

insanity test 1 failed as follows:

Starting mds2: -o user_xattr,acl  lustre-mdt2/mdt2 /mnt/mds2
CMD: wtm-29vm7 mkdir -p /mnt/mds2; mount -t lustre -o user_xattr,acl  		                   lustre-mdt2/mdt2 /mnt/mds2
wtm-29vm7: mount.lustre: mount lustre-mdt2/mdt2 at /mnt/mds2 failed: Input/output error
wtm-29vm7: Is the MGS running?
Start of lustre-mdt2/mdt2 on mds2 failed 5
 insanity test_1: @@@@@@ FAIL: test_1 failed with 2 

Console log on wtm-29vm7:

12:50:34:Lustre: DEBUG MARKER: mkdir -p /mnt/mds2; mount -t lustre -o user_xattr,acl  		                   lustre-mdt2/mdt2 /mnt/mds2
12:50:34:LustreError: 16886:0:(client.c:1052:ptlrpc_import_delay_req()) @@@ send limit expired   req@ffff88007073e000 x1445162105843372/t0(0) o253->MGC10.10.17.33@tcp@10.10.17.33@tcp:26/25 lens 4768/4768 e 0 to 0 dl 0 ref 2 fl Rpc:W/0/ffffffff rc 0/-1
12:50:34:LustreError: 16886:0:(obd_mount_server.c:1124:server_register_target()) lustre-MDT0001: error registering with the MGS: rc = -5 (not fatal)
12:50:34:LustreError: 16886:0:(client.c:1052:ptlrpc_import_delay_req()) @@@ send limit expired   req@ffff8800740bb400 x1445162105843380/t0(0) o101->MGC10.10.17.33@tcp@10.10.17.33@tcp:26/25 lens 328/344 e 0 to 0 dl 0 ref 2 fl Rpc:W/0/ffffffff rc 0/-1
12:50:34:LustreError: 16886:0:(client.c:1052:ptlrpc_import_delay_req()) @@@ send limit expired   req@ffff8800740bb400 x1445162105843384/t0(0) o101->MGC10.10.17.33@tcp@10.10.17.33@tcp:26/25 lens 328/344 e 0 to 0 dl 0 ref 2 fl Rpc:W/0/ffffffff rc 0/-1
12:50:34:LustreError: 15c-8: MGC10.10.17.33@tcp: The configuration from log 'lustre-MDT0001' failed (-5). This may be the result of communication errors between this node and the MGS, a bad configuration, or other errors. See the syslog for more information.
12:50:34:LustreError: 16886:0:(obd_mount_server.c:1258:server_start_targets()) failed to start server lustre-MDT0001: -5
12:50:34:LustreError: 16886:0:(obd_mount_server.c:1700:server_fill_super()) Unable to start targets: -5
12:50:34:LustreError: 16886:0:(obd_mount_server.c:849:lustre_disconnect_lwp()) lustre-MDT0000-lwp-MDT0001: Can't end config log lustre-client.

Maloo report: https://maloo.whamcloud.com/test_sets/059ee4a8-164c-11e3-aa2a-52540035b04c

Comment by Jodi Levi (Inactive) [ 18/Oct/13 ]

Re-enabling the tests is being tracked in LU-4121

Comment by Mikhail Pershin [ 18/Oct/13 ]

The patch was merged; is there more work to do in the context of this bug?

Comment by Nathaniel Clark [ 18/Nov/13 ]

Skipping insanity/10 because it fails in the same way as the other always-excepted tests fail:

http://review.whamcloud.com/8316

Comment by Andreas Dilger [ 18/Nov/13 ]

Reopening this bug, since it doesn't seem that the landed patch actually allows the OST to start when the MGS is down. I guess the config log might be replicated, but the MGC is somehow blocked waiting for the MGS?

Comment by Jodi Levi (Inactive) [ 18/Mar/14 ]

Mike,
Can you please comment on the insanity test_10 failures?
Thank you!

Comment by Mikhail Pershin [ 11/Apr/14 ]

Yes, I have to investigate this first.

Comment by Andreas Dilger [ 30/Apr/14 ]

I think the first step is to run a local test to verify manually that the OST is actually creating a local copy of the config llog on the first mount, and then check that the OST can start from the local llog if the MGS is not running.

Comment by Mikhail Pershin [ 05/May/14 ]

Well, I certainly did that when I made the initial patch for this ticket, which creates the local config llogs. Nevertheless, I agree this is the first step to take now, just to make sure it still works. Meanwhile, I tend to think you are right that the problem might be in locking or something else that causes the MGC to wait for the MGS.

Comment by Alex Zhuravlev [ 05/May/14 ]

Hmm, there were tests in conf-sanity to verify that the OST can start with an inactive MGS.

Comment by Andreas Dilger [ 07/May/14 ]

I noticed that several conf-sanity.sh tests are being skipped for ZFS because of LU-2059 "no local config for ZFS", but I think that is no longer true?

conf-sanity.sh: 5d, 19b, 21b, 27a
insanity.sh: 1, 2, 4, 10

There is also conf-sanity.sh test_50h that doesn't give a reason, and sanity-quota.sh test_7d.

It would be good to submit a patch or two to enable these tests along with a Test-parameters: line that runs the relevant test ten times to see what the success rate is.

Comment by Mikhail Pershin [ 13/May/14 ]

The local config logs are created as expected. Meanwhile, there was a problem with the 'params' log: it can be empty, and that case shouldn't be critical. I've enabled all excepted tests and they passed with both ldiskfs and zfs, except insanity test_1, which failed only with zfs, as follows:

May 13 14:56:02 nodez kernel: LustreError: 13691:0:(mgc_request.c:516:do_requeue()) failed processing log: -5
May 13 14:56:02 nodez kernel: LustreError: 13691:0:(mgc_request.c:516:do_requeue()) Skipped 2 previous similar messages
May 13 14:56:15 nodez kernel: INFO: task mount.lustre:15680 blocked for more than 120 seconds.
May 13 14:56:15 nodez kernel:      Tainted: P           ---------------    2.6.32 #11
May 13 14:56:15 nodez kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
May 13 14:56:15 nodez kernel: mount.lustre  D ffff880001e0f7e0  3592 15680  15679 0x00000080
May 13 14:56:15 nodez kernel: ffff8800241167f0 0000000000000082 0000000000000000 ffffffff814ffbf6
May 13 14:56:15 nodez kernel: ffff88002d7a3b38 ffffffff81a2d020 ffff880024116ac0 000000000000f7e0
May 13 14:56:15 nodez kernel: ffff88001eaabfd8 ffff880024116ac0 ffff8800241167f0 ffff88001eaab6c8
May 13 14:56:15 nodez kernel: Call Trace:
May 13 14:56:15 nodez kernel: [<ffffffff814ffbf6>] ? __mutex_lock_slowpath+0x1f6/0x2a0
May 13 14:56:15 nodez kernel: [<ffffffff814ff325>] ? schedule_timeout+0x1c5/0x280
May 13 14:56:15 nodez kernel: [<ffffffff81048804>] ? task_rq_lock+0x54/0xa0
May 13 14:56:15 nodez kernel: [<ffffffff810512cd>] ? set_cpus_allowed_ptr+0xad/0x1b0
May 13 14:56:15 nodez kernel: [<ffffffff814ff325>] ? schedule_timeout+0x1c5/0x280
May 13 14:56:15 nodez kernel: [<ffffffff814fe319>] ? wait_for_common+0x139/0x1a0
May 13 14:56:15 nodez kernel: [<ffffffff81050f80>] ? default_wake_function+0x0/0x20
May 13 14:56:15 nodez kernel: [<ffffffffa0eed770>] ? client_lwp_config_process+0x0/0x1540 [obdclass]
May 13 14:56:15 nodez kernel: [<ffffffffa0e72d9c>] ? llog_process_or_fork+0x2bc/0x4c0 [obdclass]
May 13 14:56:15 nodez kernel: [<ffffffffa0e72faf>] ? llog_process+0xf/0x20 [obdclass]
May 13 14:56:15 nodez kernel: [<ffffffffa0ea9f79>] ? class_config_parse_llog+0x1e9/0x340 [obdclass]
May 13 14:56:15 nodez kernel: [<ffffffffa0e78b02>] ? __llog_ctxt_put+0xf2/0x150 [obdclass]
May 13 14:56:15 nodez kernel: [<ffffffffa027948c>] ? mgc_process_cfg_log+0x56c/0xe90 [mgc]
May 13 14:56:15 nodez kernel: [<ffffffffa027bdfa>] ? mgc_process_log+0x59a/0x1390 [mgc]
May 13 14:56:15 nodez kernel: [<ffffffffa0dc7f6f>] ? libcfs_debug_msg+0x3f/0x50 [libcfs]
May 13 14:56:15 nodez kernel: [<ffffffffa027ccc5>] ? do_config_log_add+0xd5/0x5b0 [mgc]
May 13 14:56:15 nodez kernel: [<ffffffffa0276dd0>] ? mgc_blocking_ast+0x0/0x800 [mgc]
May 13 14:56:15 nodez kernel: [<ffffffffa0300240>] ? ldlm_completion_ast+0x0/0x910 [ptlrpc]
May 13 14:56:15 nodez kernel: [<ffffffffa027db8f>] ? mgc_process_config+0x69f/0x1290 [mgc]
May 13 14:56:15 nodez kernel: [<ffffffffa0ec1dc9>] ? lustre_process_log+0x219/0x1000 [obdclass]
May 13 14:56:15 nodez kernel: [<ffffffffa0ec2106>] ? lustre_process_log+0x556/0x1000 [obdclass]
May 13 14:56:15 nodez kernel: [<ffffffff811283e2>] ? kmem_cache_alloc_trace+0x1c2/0x1d0
May 13 14:56:15 nodez kernel: [<ffffffffa0ee89c6>] ? lustre_start_lwp+0x146/0x700 [obdclass]
May 13 14:56:15 nodez kernel: [<ffffffffa0ee8a1f>] ? lustre_start_lwp+0x19f/0x700 [obdclass]
May 13 14:56:15 nodez kernel: [<ffffffffa0eead66>] ? server_start_targets+0xeb6/0x1d30 [obdclass]
May 13 14:56:15 nodez kernel: [<ffffffffa0ebe4c6>] ? lustre_start_mgc+0x456/0x29b0 [obdclass]
May 13 14:56:15 nodez kernel: [<ffffffffa0eb6af0>] ? class_config_llog_handler+0x0/0x1e50 [obdclass]
May 13 14:56:15 nodez kernel: [<ffffffffa0ef1a3f>] ? server_fill_super+0xbff/0x15c0 [obdclass]
May 13 14:56:15 nodez kernel: [<ffffffffa0ec0a20>] ? lustre_fill_super+0x0/0x520 [obdclass]
May 13 14:56:15 nodez kernel: [<ffffffffa0ec0bf8>] ? lustre_fill_super+0x1d8/0x520 [obdclass]
May 13 14:56:15 nodez kernel: [<ffffffffa0ec0a20>] ? lustre_fill_super+0x0/0x520 [obdclass]
May 13 14:56:15 nodez kernel: [<ffffffffa0ec0a20>] ? lustre_fill_super+0x0/0x520 [obdclass]
May 13 14:56:15 nodez kernel: [<ffffffff8113c9a9>] ? get_sb_nodev+0x59/0xb0
May 13 14:56:15 nodez kernel: [<ffffffffa0eb8b5c>] ? lustre_get_sb+0x1c/0x30 [obdclass]
May 13 14:56:15 nodez kernel: [<ffffffff8113bf16>] ? vfs_kern_mount+0x76/0x1b0
May 13 14:56:15 nodez kernel: [<ffffffff8113c0c3>] ? do_kern_mount+0x53/0x130
May 13 14:56:15 nodez kernel: [<ffffffff81157d32>] ? do_mount+0x302/0x8f0
May 13 14:56:15 nodez kernel: [<ffffffff811583ba>] ? sys_mount+0x9a/0xf0
May 13 14:56:15 nodez kernel: [<ffffffff81009f42>] ? system_call_fastpath+0x16/0x1b
May 13 14:57:14 nodez kernel: LustreError: 137-5: lustre-MDT0000_UUID: not available for connect from 0@lo (no target). If you are running an HA pair check that the target is mounted on the other server.
May 13 14:57:14 nodez kernel: LustreError: Skipped 129 previous similar messages
May 13 14:57:24 nodez kernel: LustreError: 11-0: lustre-MDT0000-osp-MDT0001: Communicating with 0@lo, operation mds_connect failed with -19.
May 13 14:57:24 nodez kernel: LustreError: Skipped 51 previous similar messages
May 13 14:58:15 nodez kernel: INFO: task ll_cfg_requeue:13691 blocked for more than 120 seconds.
May 13 14:58:15 nodez kernel:      Tainted: P           ---------------    2.6.32 #11
May 13 14:58:15 nodez kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
May 13 14:58:15 nodez kernel: ll_cfg_requeu D ffff880001e0f7e0  4760 13691      2 0x00000080
May 13 14:58:15 nodez kernel: ffff88001f96c040 0000000000000046 0000000000000000 ffff880001e0f920
May 13 14:58:15 nodez kernel: 0000000000000018 ffffffff81a2d020 ffff88001f96c310 000000000000f7e0
May 13 14:58:15 nodez kernel: ffff880028f99fd8 ffff88001f96c310 ffff88001f96c040 ffff880001e0f920
May 13 14:58:15 nodez kernel: Call Trace:
May 13 14:58:15 nodez kernel: [<ffffffff812bc54a>] ? string+0x3a/0xf0
May 13 14:58:15 nodez kernel: [<ffffffff812bdbf7>] ? vsnprintf+0x357/0x600
May 13 14:58:15 nodez kernel: [<ffffffff814ffb17>] ? __mutex_lock_slowpath+0x117/0x2a0
May 13 14:58:15 nodez kernel: [<ffffffff8150031f>] ? mutex_lock+0x1f/0x40
May 13 14:58:15 nodez kernel: [<ffffffffa027b8c3>] ? mgc_process_log+0x63/0x1390 [mgc]
May 13 14:58:15 nodez kernel: [<ffffffff81501377>] ? _spin_lock_irqsave+0x17/0x20
May 13 14:58:15 nodez kernel: [<ffffffff81068353>] ? lock_timer_base+0x33/0x70
May 13 14:58:15 nodez kernel: [<ffffffff8106899e>] ? try_to_del_timer_sync+0x6e/0xd0
May 13 14:58:15 nodez kernel: [<ffffffffa0dc7f6f>] ? libcfs_debug_msg+0x3f/0x50 [libcfs]
May 13 14:58:15 nodez kernel: [<ffffffffa027ea8d>] ? mgc_requeue_thread+0x30d/0x7a0 [mgc]
May 13 14:58:15 nodez kernel: [<ffffffff81050f80>] ? default_wake_function+0x0/0x20
May 13 14:58:15 nodez kernel: [<ffffffffa027e780>] ? mgc_requeue_thread+0x0/0x7a0 [mgc]
May 13 14:58:15 nodez kernel: [<ffffffff81077e9e>] ? kthread+0x8e/0xa0
May 13 14:58:15 nodez kernel: [<ffffffff8100b0ea>] ? child_rip+0xa/0x20
May 13 14:58:15 nodez kernel: [<ffffffff81077e10>] ? kthread+0x0/0xa0
May 13 14:58:15 nodez kernel: [<ffffffff8100b0e0>] ? child_rip+0x0/0x20

That looks like a deadlock; I'm not sure why it happens with ZFS only, though.

Comment by Mikhail Pershin [ 13/May/14 ]

Patch in Gerrit with the last fixes related to 'params' log handling and the re-enabled tests:
http://review.whamcloud.com/10311

test_1 in insanity is still excluded; it failed due to a deadlock of some kind and needs more investigation. It may be worth filing another bug for that or reopening LU-3562.
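
For context, a minimal sketch of the 'params' log tolerance described above; the names cld_type and CONFIG_T_PARAMS are assumptions about the MGC code of this era, and this is not the verbatim change in the patch:

        /* Sketch: a missing or empty 'params' log is not an error;
         * only real failures should abort config log processing. */
        rc = mgc_process_log(obd, cld);
        if (rc == -ENOENT && cld->cld_type == CONFIG_T_PARAMS) {
                CDEBUG(D_MGC, "empty params log, ignoring: rc = %d\n", rc);
                rc = 0;
        }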

Comment by Jodi Levi (Inactive) [ 23/May/14 ]

Patch landed to master. Please reopen the ticket if more work is needed.

Comment by Nathaniel Clark [ 20/Jun/14 ]

Port to b2_4:
http://review.whamcloud.com/10771

Comment by Andreas Dilger [ 01/Oct/14 ]

sanity-quota test_7d is still being skipped for ZFS filesystems, reportedly because of this bug.

Comment by Andreas Dilger [ 01/Oct/14 ]

insanity.sh test_1 is also being skipped for ZFS filesystems due to this bug.

Comment by Jian Yu [ 25/Oct/14 ]

Here are the back-ported patches for Lustre b2_5 branch:
http://review.whamcloud.com/12427 (from http://review.whamcloud.com/10311)
http://review.whamcloud.com/12428 (from http://review.whamcloud.com/10589)

Comment by Jian Yu [ 11/Nov/14 ]

The above patches were merged into Lustre b2_5 branch for 2.5.4 release.

Comment by Andreas Dilger [ 28/Nov/14 ]

The conf-sanity test_57b is still being skipped, and patch http://review.whamcloud.com/10357, which enables it, is failing testing. This bug is definitely not fixed.

Comment by Mikhail Pershin [ 16/Dec/14 ]

Andreas, the patch you mentioned passed test_57b successfully. Did I miss something?

Comment by Mikhail Pershin [ 16/Dec/14 ]

As for insanity test_1, it is not an LU-2059 issue and has a different nature, more related to LU-3562. It is worth opening a separate ticket for this or reopening LU-3562.
