[LU-4713] Failure on test suite sanity test_237: check_fhandle_syscalls failed Created: 05/Mar/14  Updated: 14/May/14  Resolved: 14/May/14

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.6.0
Fix Version/s: Lustre 2.6.0

Type: Bug
Priority: Minor
Reporter: Maloo
Assignee: Bob Glossman (Inactive)
Resolution: Fixed
Votes: 0
Labels: patch
Environment:

server: lustre-master build # 1911 RHEL6 ldiskfs
client is SLES11 SP3


Severity: 3
Rank (Obsolete): 12955

 Description   

This issue was created by maloo for sarah <sarah@whamcloud.com>

This issue relates to the following test suite run: http://maloo.whamcloud.com/test_sets/a67ae282-a01c-11e3-ab91-52540035b04c.

The sub-test test_237 failed with the following error:

check_fhandle_syscalls failed

test log

== sanity test 237: Verify name_to_handle_at/open_by_handle_at syscalls ============================== 17:55:21 (1393466121)
name_by_handle_at(f237.sanity) error: Bad file descriptor
 sanity test_237: @@@@@@ FAIL: check_fhandle_syscalls failed 


 Comments   
Comment by Andreas Dilger [ 13/Mar/14 ]

Swapnil, could you please take a look at this bug, since it relates to the open_by_handle() code/test that you added.

Comment by Swapnil Pimpale (Inactive) [ 14/Mar/14 ]

Is there a way to reproduce this?

Comment by Bob Glossman (Inactive) [ 15/Apr/14 ]

As far as I can tell this is easy to reproduce. It is seen with any SLES11 SP3 client. On SLES11 SP3 with the default kernel .config, lustre is built with '#define HAVE_FHANDLE_SYSCALLS 1' and '#undef HAVE_FHANDLE_GLIBC_SUPPORT'. The test program check_fhandle_syscalls seems to build OK, but always returns an error when called.

Comment by Swapnil Pimpale (Inactive) [ 24/Apr/14 ]

The test failed with the following reason:
name_by_handle_at(f237.sanity) error: Bad file descriptor

I don't think it failed because HAVE_FHANDLE_SYSCALLS is defined and HAVE_FHANDLE_GLIBC_SUPPORT is not; in that case we explicitly define struct file_handle, __NR_name_to_handle_at and __NR_open_by_handle_at.
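
To illustrate the fallback path described above, here is a minimal sketch of calling name_to_handle_at through syscall(2) with a locally defined handle structure. It is not the code from check_fhandle_syscalls: the names local_file_handle, LOCAL_MAX_HANDLE_SZ and try_name_to_handle are hypothetical, and the sketch assumes the system headers provide SYS_name_to_handle_at (it returns -ENOSYS otherwise).

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>        /* AT_FDCWD */
#include <stdlib.h>
#include <sys/syscall.h>  /* SYS_name_to_handle_at, when the headers provide it */
#include <unistd.h>       /* syscall() */

/* Hypothetical local stand-ins, named so they cannot clash with glibc's own
 * struct file_handle / MAX_HANDLE_SZ when those are available. */
#define LOCAL_MAX_HANDLE_SZ 128

struct local_file_handle {
    unsigned int  handle_bytes;   /* in: buffer size, out: bytes used */
    int           handle_type;    /* filled in by the kernel */
    unsigned char f_handle[];     /* opaque handle data */
};

/* Returns 0 on success or a negative errno value on failure. */
static int try_name_to_handle(const char *path)
{
    assert(path != NULL);
#ifdef SYS_name_to_handle_at
    struct local_file_handle *fh;
    int mount_id = 0;
    int saved_errno;
    long rc;

    fh = calloc(1, sizeof(*fh) + LOCAL_MAX_HANDLE_SZ);
    if (fh == NULL)
        return -ENOMEM;
    fh->handle_bytes = LOCAL_MAX_HANDLE_SZ;

    /* Raw syscall: (dirfd, pathname, handle, mount_id, flags). */
    rc = syscall(SYS_name_to_handle_at, AT_FDCWD, path, fh, &mount_id, 0);
    saved_errno = errno;          /* save before free() can touch errno */
    free(fh);
    return rc == 0 ? 0 : -saved_errno;
#else
    return -ENOSYS;               /* headers do not know the syscall */
#endif
}
```

With a valid path and AT_FDCWD, a -EBADF result from a helper like this would suggest that the syscall number in use does not actually refer to name_to_handle_at on the running kernel, rather than a problem with the file itself.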

test_237 does the following:
echo "Test file_handle syscalls" > $DIR/$tfile
without checking whether the write succeeded. I think we should add a check here.

Comment by Swapnil Pimpale (Inactive) [ 24/Apr/14 ]

Patch with the above check: http://review.whamcloud.com/#/c/10088/

Comment by Bob Glossman (Inactive) [ 25/Apr/14 ]

A test run with your added check still fails and doesn't report a failure in the write. It still reports what looks like a failure in check_fhandle_syscalls: "name_by_handle_at(f237.sanity) error: Bad file descriptor"

I suspect the local definitions in the test program that are used in the case of !defined(HAVE_FHANDLE_GLIBC_SUPPORT) && defined(HAVE_FHANDLE_SYSCALLS) don't exactly match the definitions in the underlying kernel.

Comment by Bob Glossman (Inactive) [ 25/Apr/14 ]

In CentOS, where the feature doesn't exist, the test program always just reports success without really doing anything, and in SLES, where the feature does exist, the test always fails. I'm going to just disable the test for now. In its current state this test isn't adding any value and it's just making SLES test runs fail.

Comment by Bob Glossman (Inactive) [ 25/Apr/14 ]

patch to disable the test
http://review.whamcloud.com/10105

Comment by James A Simmons [ 25/Apr/14 ]

That is strange. The test should succeed on systems with kernel fhandle and lacking glibc support. That is the test setup I worked with. What do the config logs show for your SP3 system? Is HAVE_FHANDLE_GLIBC_SUPPORT set when it shouldn't be?

Comment by Bob Glossman (Inactive) [ 25/Apr/14 ]

No, I'm seeing HAVE_FHANDLE_GLIBC_SUPPORT #undef'ed. I can manually run check_fhandle_syscalls and it fails every time. As I said, I suspect the internal definitions in the test program don't match the real definitions in the underlying kernel. That's the only thing I can think of that might be a cause.

Comment by Bob Glossman (Inactive) [ 01/May/14 ]

Just as speculation I tried replacing the local definitions of syscall numbers and struct file_handle with ones from SLES #include files. I'm guessing the ones that are in there now are from RHEL instances. If I do that then the test passes. I think this confirms my theory about the local definitions in the test program not matching the underlying kernel.

However there's still a problem. I can't figure out how to define the correct syscall numbers for all builds. It seems to vary based on the arch type.

In arch/x86/include/asm/unistd_64.h it's:

#define __NR_name_to_handle_at 303
#define __NR_open_by_handle_at 304

while in arch/x86/include/asm/unistd_32.h it's:

#define __NR_name_to_handle_at 341
#define __NR_open_by_handle_at 342

and in include/asm-generic/unistd.h it's:

#define __NR_name_to_handle_at 264
#define __NR_open_by_handle_at 265

Since I'm on x86_64 I hard coded the values from unistd_64.h and it worked for me, but that's hardly a production worthy general solution.
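
For illustration, a minimal sketch of how the three sets of numbers listed above could be selected at compile time. This is not the contents of the patch: the helper names fh_name_to_handle_nr and fh_open_by_handle_nr are hypothetical, the choice of compiler-predefined macros is an assumption, and using aarch64 as the example of an asm-generic architecture is also an assumption not taken from this ticket.

```c
#include <assert.h>
#include <sys/syscall.h>  /* defines __NR_* already when the headers know them */

/* Only fall back to hard-coded values when the headers lack them. */
#ifndef __NR_name_to_handle_at
# if defined(__x86_64__) && !defined(__ILP32__)
   /* arch/x86/include/asm/unistd_64.h */
#  define __NR_name_to_handle_at 303
#  define __NR_open_by_handle_at 304
# elif defined(__i386__)
   /* arch/x86/include/asm/unistd_32.h */
#  define __NR_name_to_handle_at 341
#  define __NR_open_by_handle_at 342
# elif defined(__aarch64__)
   /* include/asm-generic/unistd.h (assumed to apply to this arch) */
#  define __NR_name_to_handle_at 264
#  define __NR_open_by_handle_at 265
# else
   /* Refuse to guess: a wrong number silently calls a different syscall. */
#  error "unknown architecture: no syscall numbers for the fhandle calls"
# endif
#endif

static_assert(__NR_name_to_handle_at > 0 && __NR_open_by_handle_at > 0,
              "fhandle syscall numbers must be positive");

/* Hypothetical accessors so callers do not use the macros directly. */
static long fh_name_to_handle_nr(void) { return __NR_name_to_handle_at; }
static long fh_open_by_handle_nr(void) { return __NR_open_by_handle_at; }
```

Stopping the build with #error on an unlisted architecture is a deliberate choice in this sketch: a guessed number would reproduce the kind of mismatch suspected in this ticket instead of exposing it at compile time.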

Comment by James A Simmons [ 01/May/14 ]

I wonder if there is a way to do header magic to make this work.

Comment by Bob Glossman (Inactive) [ 01/May/14 ]

It seems likely there is a way to do header magic to get what is needed, but I can't figure out how. The only method I can come up with so far is a horrible, ad-hoc series of #if tests hard coded in. Pretty sure that's not the way to go.

Comment by Bob Glossman (Inactive) [ 01/May/14 ]

fix for test program:
http://review.whamcloud.com/10189

I did the horrible sequence of #if's as I couldn't come up with a better scheme.
Please add review comments if you have a better idea.

Comment by James A Simmons [ 14/May/14 ]

Patch landed to master. Ticket can be closed, and reopened if needed.

Comment by Peter Jones [ 14/May/14 ]

Landed for 2.6
