[LU-4713] Failure on test suite sanity test_237: check_fhandle_syscalls failed Created: 05/Mar/14 Updated: 14/May/14 Resolved: 14/May/14 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.6.0 |
| Fix Version/s: | Lustre 2.6.0 |
| Type: | Bug | Priority: | Minor |
| Reporter: | Maloo | Assignee: | Bob Glossman (Inactive) |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | patch | ||
| Environment: |
server: lustre-master build # 1911 RHEL6 ldiskfs |
||
| Severity: | 3 |
| Rank (Obsolete): | 12955 |
| Description |
|
This issue was created by maloo for sarah <sarah@whamcloud.com>.
This issue relates to the following test suite run: http://maloo.whamcloud.com/test_sets/a67ae282-a01c-11e3-ab91-52540035b04c.
The sub-test test_237 failed with the following error:
test log
== sanity test 237: Verify name_to_handle_at/open_by_handle_at syscalls ============================== 17:55:21 (1393466121)
name_by_handle_at(f237.sanity) error: Bad file descriptor
sanity test_237: @@@@@@ FAIL: check_fhandle_syscalls failed |
| Comments |
| Comment by Andreas Dilger [ 13/Mar/14 ] |
|
Swapnil, could you please take a look at this bug, since it relates to the open_by_handle() code/test that you added. |
| Comment by Swapnil Pimpale (Inactive) [ 14/Mar/14 ] |
|
Is there a way to reproduce this? |
| Comment by Bob Glossman (Inactive) [ 15/Apr/14 ] |
|
As far as I can tell, this is easy to reproduce; it is seen with any sles11sp3 client. On sles11sp3 with the default kernel .config, Lustre is built with '#define HAVE_FHANDLE_SYSCALLS 1' and '#undef HAVE_FHANDLE_GLIBC_SUPPORT'. The test program check_fhandle_syscalls seems to build OK, but always returns an error when run. |
| Comment by Swapnil Pimpale (Inactive) [ 24/Apr/14 ] |
|
The test failed with the following reason: I don't think it failed because HAVE_FHANDLE_SYSCALLS is defined and HAVE_FHANDLE_GLIBC_SUPPORT is not, since in that case we explicitly define struct file_handle, __NR_name_to_handle_at and __NR_open_by_handle_at. test_237 does the following: |
| Comment by Swapnil Pimpale (Inactive) [ 24/Apr/14 ] |
|
Patch with the above check: http://review.whamcloud.com/#/c/10088/ |
| Comment by Bob Glossman (Inactive) [ 25/Apr/14 ] |
|
The test run with your added check still fails and doesn't report a failure in the write. It still reports what looks like a failure in check_fhandle_syscalls: "name_by_handle_at(f237.sanity) error: Bad file descriptor". I suspect the local definitions in the test program that are used in the case of !defined(HAVE_FHANDLE_GLIBC_SUPPORT) && defined(HAVE_FHANDLE_SYSCALLS) don't exactly match the definitions in the underlying kernel. |
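To make the suspicion above concrete, here is a minimal sketch of the pattern under discussion: a program that carries its own copy of the handle structure and calls the syscall by number. It is not the Lustre test program. The names local_file_handle, LOCAL_MAX_HANDLE_SZ and probe_name_to_handle are hypothetical, and the sketch takes the syscall number from the system headers rather than hard-coding it.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>        /* AT_FDCWD */
#include <sys/syscall.h>  /* __NR_name_to_handle_at, when the headers provide it */
#include <unistd.h>       /* syscall() */

/* Hypothetical local constant; the kernel's MAX_HANDLE_SZ is 128. */
#define LOCAL_MAX_HANDLE_SZ 128

/*
 * Hypothetical local mirror of the kernel's struct file_handle.
 * The two header fields must have the same size and order as the
 * kernel's, otherwise the kernel reads and writes the wrong offsets.
 */
struct local_file_handle {
	unsigned int  handle_bytes;  /* in: buffer size, out: bytes used */
	int           handle_type;   /* filled in by the kernel */
	unsigned char f_handle[LOCAL_MAX_HANDLE_SZ];
};

/* Returns 0 on success or a negative errno value on failure. */
static int probe_name_to_handle(const char *path)
{
#ifdef __NR_name_to_handle_at
	struct local_file_handle fh = { .handle_bytes = LOCAL_MAX_HANDLE_SZ };
	int mount_id = 0;

	/* If the number does not belong to this kernel and architecture,
	 * the kernel runs an unrelated syscall or reports ENOSYS. */
	if (syscall(__NR_name_to_handle_at, AT_FDCWD, path,
		    &fh, &mount_id, 0) < 0)
		return -errno;
	return 0;
#else
	(void)path;
	return -ENOSYS;
#endif
}
```

A caller can treat any negative return as "handle not obtained" and print it with strerror(-rc). Whether the call succeeds depends on the kernel and on the filesystem that holds the path.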
| Comment by Bob Glossman (Inactive) [ 25/Apr/14 ] |
|
In CentOS, where the feature doesn't exist, the test program always just reports success without really doing anything, and in SLES, where the feature does exist, the test always fails. So I'm going to just disable the test for now. In its current state this test isn't adding any value and it's just making SLES test runs fail. |
| Comment by Bob Glossman (Inactive) [ 25/Apr/14 ] |
|
patch to disable the test |
| Comment by James A Simmons [ 25/Apr/14 ] |
|
That is strange. The test should succeed on systems with kernel fhandle support and lacking glibc support. That is the test setup I worked with. What do the config logs show for your SP3 system? Is HAVE_FHANDLE_GLIBC_SUPPORT set when it shouldn't be? |
| Comment by Bob Glossman (Inactive) [ 25/Apr/14 ] |
|
No, I'm seeing HAVE_FHANDLE_GLIBC_SUPPORT #undef'ed. I can manually run check_fhandle_syscalls and it fails every time. As I said, I suspect the internal definitions in the test program don't match the real definitions in the underlying kernel. That's the only thing I can think of that might be a cause. |
| Comment by Bob Glossman (Inactive) [ 01/May/14 ] |
|
Just as speculation, I tried replacing the local definitions of the syscall numbers and struct file_handle with ones from the SLES #include files. I'm guessing the ones that are in there now are from RHEL instances. With that change the test passes. I think this confirms my theory about the local definitions in the test program not matching the underlying kernel.
However, there's still a problem: I can't figure out how to define the correct syscall numbers for all builds. They seem to vary based on the arch type.
In arch/x86/include/asm/unistd_64.h it's:
#define __NR_name_to_handle_at 303
while in arch/x86/include/asm/unistd_32.h it's:
#define __NR_name_to_handle_at 341
and in include/asm-generic/unistd.h it's:
#define __NR_name_to_handle_at 264
Since I'm on x86_64 I hard-coded the values from unistd_64.h and it worked for me, but that's hardly a production-worthy general solution. |
| Comment by James A Simmons [ 01/May/14 ] |
|
I wonder if there is a way to do header magic to make this work. |
| Comment by Bob Glossman (Inactive) [ 01/May/14 ] |
|
It seems likely there is a way to do header magic to get what is needed, but I can't figure out how. The only method I can come up with so far is a horrible, ad-hoc series of #if tests hard coded in. Pretty sure that's not the way to go. |
| Comment by Bob Glossman (Inactive) [ 01/May/14 ] |
|
fix for test program: I did the horrible sequence of #if's as I couldn't come up with a better scheme. |
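For illustration only, an #if sequence of the kind described above might look roughly like the following. This is a sketch, not the patch that landed. The name_to_handle_at numbers are the ones quoted earlier in this ticket. The open_by_handle_at numbers and the choice of arm64 as an example of the asm-generic numbering are additions from the upstream kernel syscall tables and are not taken from this ticket.

```c
#include <sys/syscall.h>  /* brings in any __NR_* the system headers define */

/*
 * Fall back to hard-coded numbers only when the headers lack them.
 * The values depend on the architecture, so each one needs its own branch.
 */
#ifndef __NR_name_to_handle_at
# if defined(__x86_64__)
#  define __NR_name_to_handle_at 303   /* arch/x86/include/asm/unistd_64.h */
#  define __NR_open_by_handle_at 304
# elif defined(__i386__)
#  define __NR_name_to_handle_at 341   /* arch/x86/include/asm/unistd_32.h */
#  define __NR_open_by_handle_at 342
# elif defined(__aarch64__)
#  define __NR_name_to_handle_at 264   /* include/asm-generic/unistd.h */
#  define __NR_open_by_handle_at 265
# else
#  error "no known syscall numbers for name_to_handle_at on this architecture"
# endif
#endif
```

Checking the system headers first means the hard-coded branches are used only on older userspace, and the #error branch makes an unhandled architecture fail at build time instead of calling the wrong syscall at run time.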
| Comment by James A Simmons [ 14/May/14 ] |
|
Patch landed to master. Ticket can be closed and be reopened if needed. |
| Comment by Peter Jones [ 14/May/14 ] |
|
Landed for 2.6 |