[LU-65] Interop testing results for 1.8.5.54 clients with Lustre 2.0.59 servers Created: 08/Feb/11  Updated: 28/Jun/11  Resolved: 29/Mar/11

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 1.8.6
Fix Version/s: Lustre 2.1.0

Type: Bug Priority: Blocker
Reporter: James A Simmons Assignee: nasf (Inactive)
Resolution: Fixed Votes: 0
Labels: None
Environment:

4 OSS nodes, each with 7 OSTs, and one MDS with one disk devoted to the MDT and another disk devoted to the MGS; all running Lustre 2.0.59. I have 4 clients running Lustre 1.8.5.54.


Attachments: Text File client.txt     File lustre_conf-sanity_test_55.1297196706.gz     File lustre_obdfilter-survey_test_1b.1297256596.gz     File lustre_replay-single_test_65a.1297196325.gz     File lustre_sanity_test_64b.1297190999.gz    
Severity: 3
Bugzilla ID: 21,367
Epic: interop, results, test
Rank (Obsolete): 5209

 Description   

These are the results of running the b1_8 acc_sm test. I also listed the results at https://bugzilla.lustre.org/show_bug.cgi?id=21367.

sanity 64b - bug 22703
sanity 72b - bug 24226
replay-single 65a - bug 19960
sanity-quota - totally broken. Does not work at all. Locks up clients

obdfilter-survey - locks the client up. No debug output
config-sanity 55,56,57 - no bug report yet. Seeing the following error

LustreError: 13996:0:(mdt_handler.c:4521:mdt_init0()) CMD Operation not allowed in IOP mode
LustreError: 13996:0:(obd_config.c:495:class_setup()) setup lustre-MDT0001 failed (-22)
LustreError: 13996:0:(obd_config.c:1338:class_config_llog_handler()) Err -22 on cfg command:
Lustre: cmd=cf003 0:lustre-MDT0001 1:lustre-MDT0001_UUID 2:1 3:lustre-MDT0001-mdtlov 4:f
LustreError: 15b-f: MGC10.36.230.2@o2ib: The configuration from log 'lustre-MDT0001'failed from the
MGS (-22). Make sure this client and the MGS are running compatible versions of Lustre.
LustreError: 15c-8: MGC10.36.230.2@o2ib: The configuration from log 'lustre-MDT0001' failed (-22).
This may be the result of communication errors between this node and the MGS, a bad configuration,
or other errors. See the syslog for more information.
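
For reference, the (-22) in the messages above is a negated errno value. A minimal sketch for decoding such return codes, assuming the host doing the analysis uses Linux errno numbering; the helper name `describe_rc` is hypothetical and not part of Lustre:

```python
import errno
import os


def describe_rc(rc: int) -> str:
    """Map a negative kernel-style return code to its errno name and text."""
    code = abs(rc)
    # errno.errorcode reflects the numbering of the host platform
    # (assumed here to match the Linux servers that produced the log).
    name = errno.errorcode.get(code, "UNKNOWN")
    return f"{name}: {os.strerror(code)}"


print(-22, describe_rc(-22))
```

With Linux numbering, 22 is EINVAL, which is consistent with a setup call rejecting its configuration.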

I will provide more info and logs very soon.



 Comments   
Comment by James A Simmons [ 08/Feb/11 ]

Here are some lustre logs produced for the failed runs.

Comment by Peter Jones [ 08/Feb/11 ]

Nasf

Can you please look into this one?

Thanks

Peter

Comment by James A Simmons [ 09/Feb/11 ]

Debug log from obdfilter-survey test 1b, from the 1.8.X test suite, on a 1.8 client.

Comment by James A Simmons [ 09/Feb/11 ]

No debug logs for obdfilter-survey, but I do have a dmesg that could be of some interest. For some reason it appears the client can't communicate with the OSS running 2.X.

Comment by James A Simmons [ 09/Feb/11 ]

Also, for obdfilter-survey test 2b, on the OSS running Lustre 2.X I'm seeing this in dmesg:

Lustre: DEBUG MARKER: == test 2b: Stripe F/S over the Network, async journal == 11:45:14 (1297269914)
LustreError: 137-5: UUID 'lustre-OST0008_UUIlustre-OST0008_UUID' is not available for connect (no target)
LustreError: Skipped 37 previous similar messages
LustreError: 14469:0:(ldlm_lib.c:2118:target_send_reply_msg()) @@@ processing error (19) req@ffff8104233e6400 x1355126983196980/t0(0) o-1><?>@<?>:0/0 lens 368/0 e 0 to 0 dl 1297270075 ref 1 fl Interpret:/ffffffff/ffffffff rc -19/-1
LustreError: 14469:0:(ldlm_lib.c:2118:target_send_reply_msg()) Skipped 42 previous similar messages

There is no lustre-OST0008_UUID on the OSS. Where is it getting this info?

Comment by nasf (Inactive) [ 10/Feb/11 ]

Patch for "sanity 72b - bug 24226" is in inspection:

http://review.whamcloud.com/#change,238

Comment by James A Simmons [ 10/Feb/11 ]

That patch falls short of behaving correctly. Try this.

touch $DIR/$tfile
chmod 777 $DIR/$tfile
chmod ug+s $DIR/$tfile
$RUNAS dd if=/dev/urandom of=$DIR/$tfile

While it runs, look at the file's permissions. You will notice the suid bits are still there. If you try this on an ext[3,4] file system, you will notice the suid bits are gone as soon as you start writing to the file. The current Lustre code handles the fixup of attr when the file is closed. It needs to be managed when the file is opened. I have a patch, but it needs more work, since there is the case of a suid file copied from a non-Lustre file system to a Lustre file system, which must preserve those suid bits. The case for removing the suid bits is very specific. As soon as I have it working I will post a patch.
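
One way to watch the bits while the dd runs is to poll the file's mode from a second shell. A minimal sketch, assuming a Python interpreter is available on the client; the helper names `setid_bits` and `path_setid_bits` are hypothetical and not part of the Lustre test suite:

```python
import os
import stat


def setid_bits(mode: int) -> tuple:
    """Return (suid_set, sgid_set) for a raw st_mode value."""
    return bool(mode & stat.S_ISUID), bool(mode & stat.S_ISGID)


def path_setid_bits(path: str) -> tuple:
    """Read the current mode of `path` and report its SUID/SGID bits."""
    return setid_bits(os.stat(path).st_mode)
```

Calling `path_setid_bits` on the test file repeatedly during the write reports (True, True) for as long as both bits survive, and (False, False) once the file system has stripped them.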

Comment by nasf (Inactive) [ 11/Feb/11 ]

The SUID/SGID bits should be removed as soon as you start to write to the file; that is the expected behavior. With the patch applied, Lustre 2.x behaves the same as ext3/4, not as you described. At least, I cannot reproduce what you describe (that is, the SUID/SGID bits persisting during writes until the file is closed). Have you verified my patch, or is your description based on test results against the old Lustre code?

Comment by James A Simmons [ 11/Feb/11 ]

I will give your patches a try on Monday. Right now I'm running some tests for Oleg.

Comment by James A Simmons [ 14/Feb/11 ]

Okay I did test your patch and it appears to work.

Comment by James A Simmons [ 14/Feb/11 ]

I have a bunch of patches to fix various parts of the 2.X test suite. Should I open a different bug for those patches?

Comment by nasf (Inactive) [ 14/Feb/11 ]

I think you can create some sub-tasks under this one; then they will be easier to track.

Comment by Peter Jones [ 21/Feb/11 ]

James

Personally, I think that it is easier to track issues when there is a 1:1 relationship between tickets and issues/fixes.

Peter

Comment by nasf (Inactive) [ 01/Mar/11 ]

>replay-single 65a - bug 19960

Sorry, I can not reproduce this failure. Can you show me an easy way to reproduce it? I think it is a duplicate of bug 22560, which has been fixed on master and lustre-1.8.5. Would you like to verify it again?

Comment by James A Simmons [ 01/Mar/11 ]

Just tried it. Also the test fails with 2.X clients with 2.X servers.

Comment by James A Simmons [ 01/Mar/11 ]

I believe I found the problem for replay-single 65a. Please look at patch http://review.whamcloud.com/#change,284

Comment by nasf (Inactive) [ 02/Mar/11 ]

> sanity 64b - bug 22703

I have made a patch for it:
http://review.whamcloud.com/#change,286 (for master)
http://review.whamcloud.com/#change,287 (for b1_8)

Comment by nasf (Inactive) [ 02/Mar/11 ]

> sanity-quota - totally broken. Does not work at all. Locks up clients

Are there any logs from the sanity-quota interoperability test run that locked up the client? The sanity-quota interoperability test passed in my local environment. I also checked Bugzilla and found a recent test result:

https://bugzilla.lustre.org/show_bug.cgi?id=24207#c4

That means sanity-quota interoperability works in a TCP environment but failed in the IB case because of bug 24055, and the related patch for bug 24055 has landed.

So could you check whether that patch was applied in your test? On the other hand, I think the patch for bug 24055 alone is not enough; you also need the above patch for bug 22703.

Thanks!

Comment by nasf (Inactive) [ 03/Mar/11 ]

> config-sanity 55,56,57 - no bug report yet.

According to the MDS-side log (lustre_conf-sanity_test_55.1297196706.gz), the system was not yet ready to accept the connection from client 10.36.230.36@o2ib.

==========
00000004:00000001:2.0:1297196688.391043:0:19245:0:(mdt_handler.c:2528:mdt_req_handle()) Process entered
00000004:00000001:2.0:1297196688.391045:0:19245:0:(mdt_handler.c:2482:mdt_unpack_req_pack_rep()) Process entered
00000004:00000001:2.0:1297196688.391046:0:19245:0:(mdt_handler.c:2503:mdt_unpack_req_pack_rep()) Process leaving (rc=0 : 0 : 0)
00010000:00000001:2.0:1297196688.391049:0:19245:0:(ldlm_lib.c:667:target_handle_connect()) Process entered
00010000:02000400:2.0:1297196688.391055:0:19245:0:(ldlm_lib.c:694:target_handle_connect()) lustre-MDT0000: temporarily refusing client connection from 10.36.230.36@o2ib
00010000:00000001:2.0:1297196688.406588:0:19245:0:(ldlm_lib.c:695:target_handle_connect()) Process leaving via out (rc=18446744073709551605 : -11 : 0xfffffffffffffff5)
00010000:00000001:2.0:1297196688.406591:0:19245:0:(ldlm_lib.c:1082:target_handle_connect()) Process leaving (rc=18446744073709551605 : -11 : fffffffffffffff5)
==========

That means the MDS returned "EAGAIN" to the client to tell it to retry later, which is the normal case. But in the log I cannot find any further communication between the client and the MDS after that, until the MDS reported the test_55 failure.
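
The rc values in the trace above are the same number printed three ways: as an unsigned 64-bit integer, as a signed integer, and in hex. A minimal sketch of the conversion, with a hypothetical helper name `as_signed64`:

```python
def as_signed64(value: int) -> int:
    """Reinterpret an unsigned 64-bit integer as a two's-complement signed one."""
    value &= (1 << 64) - 1          # keep only the low 64 bits
    if value >= (1 << 63):          # top bit set means a negative number
        return value - (1 << 64)
    return value


print(as_signed64(18446744073709551605))   # -11
print(as_signed64(0xfffffffffffffff5))     # -11
```

On Linux, errno 11 is EAGAIN, which matches the "temporarily refusing client connection" message.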

00000001:00000001:5.0:1297196705.949788:0:19367:0:(debug.c:439:libcfs_debug_mark_buffer()) ***************************************************
00000001:02000400:5.0:1297196705.949789:0:19367:0:(debug.c:440:libcfs_debug_mark_buffer()) DEBUG MARKER: conf-sanity test_55: @@@@@@ FAIL: client start failed
00000001:00000001:5.0:1297196705.963380:0:19367:0:(debug.c:441:libcfs_debug_mark_buffer()) ***************************************************

I need the client-side log to investigate what happened on the client after the MDS returned "EAGAIN".
James, are there any logs for that? The conf-sanity test_55 failure does not seem easy to reproduce.

Thanks!

Comment by James A Simmons [ 03/Mar/11 ]

For http://review.whamcloud.com/#change,286 the two sets of patches conflict. Is the second patch the only valid one?

Comment by nasf (Inactive) [ 03/Mar/11 ]

> For http://review.whamcloud.com/#change,286 the two sets of patches conflict. Is the second patch the only valid one?

Yes, set 2 is the right one.

Comment by James A Simmons [ 07/Mar/11 ]

Sorry I haven't been able to test. The build system is broken.

/data/buildsystem/jsimmons-head/rpmbuild/BUILD/lustre-2.0.59/lustre/lvfs/fsfilt-ldiskfs.c: In function 'fsfilt_ldiskfs_fid2dentry':
/data/buildsystem/jsimmons-head/rpmbuild/BUILD/lustre-2.0.59/lustre/lvfs/fsfilt-ldiskfs.c:2352: error: implicit declaration of function 'exportfs_decode_fh'
/data/buildsystem/jsimmons-head/rpmbuild/BUILD/lustre-2.0.59/lustre/lvfs/fsfilt-ldiskfs.c:2352: error: 'FILEID_INO32_GEN' undeclared (first use in this function)
/data/buildsystem/jsimmons-head/rpmbuild/BUILD/lustre-2.0.59/lustre/lvfs/fsfilt-ldiskfs.c:2352: error: (Each undeclared identifier is reported only once
/data/buildsystem/jsimmons-head/rpmbuild/BUILD/lustre-2.0.59/lustre/lvfs/fsfilt-ldiskfs.c:2352: error: for each function it appears in.)
cc1: warnings being treated as errors

Comment by nasf (Inactive) [ 22/Mar/11 ]

The patch "http://review.whamcloud.com/#change,286" has landed, so I think you can test with the latest code.

On the other hand, would you like to update your patch "http://review.whamcloud.com/#change,284" to make it more compatible?

Thanks

Comment by Build Master (Inactive) [ 23/Mar/11 ]

Integrated in reviews-centos5 #541
LU-65 ORNL Lustre 2.X testing

James Simmons : 43d727e089f1a1cf237da4251dc2aa661de05a0b
Files :

  • lustre/tests/conf-sanity.sh
  • libcfs/libcfs/darwin/darwin-proc.c
  • lustre/obdclass/class_obd.c
  • lustre/include/obd_support.h
  • libcfs/include/libcfs/Makefile.am
  • lustre/obdclass/darwin/darwin-sysctl.c
  • lustre/lvfs/lvfs_lib.c
  • lustre/obdfilter/filter.c
  • lustre/obdclass/linux/linux-sysctl.c
  • libcfs/libcfs/module.c
  • lustre/mdt/mdt_internal.h
  • libcfs/libcfs/Makefile.in
  • lustre/liblustre/tests/recovery_small.c
  • libcfs/libcfs/linux/linux-proc.c
  • libcfs/include/libcfs/libcfs.h
  • lustre/include/darwin/obd_support.h
  • lustre/include/linux/obd_support.h
  • lustre/tests/sanity-gss.sh
  • libcfs/libcfs/autoMakefile.am
  • libcfs/libcfs/fail.c
  • libcfs/include/libcfs/libcfs_fail.h
Comment by Brian Murrell (Inactive) [ 23/Mar/11 ]

FWIW, your Ubuntu reviews builds are failing due to LU-92. I've just submitted http://review.whamcloud.com/#change,356 to test a patch to fix this. If it passes its review testing, you could try importing that patch into your branch, putting it before your patch, to see if it resolves your Ubuntu build issue.

Comment by Peter Jones [ 29/Mar/11 ]

Believed resolved. ORNL will reopen or open a new ticket if their reproducer still has issues
