[LU-1957] Test failure on test suite sanity, subtest test_180b Created: 16/Sep/12 Updated: 03/Jan/20 Resolved: 18/Sep/19 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.3.0, Lustre 2.4.0 |
| Fix Version/s: | Lustre 2.13.0, Lustre 2.12.4 |
| Type: | Bug | Priority: | Blocker |
| Reporter: | Maloo | Assignee: | WC Triage |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | ldiskfs |
| Issue Links: |
|
| Severity: | 3 |
| Rank (Obsolete): | 4069 |
| Description |
|
This issue was created by Maloo for Li Wei <liwei@whamcloud.com>. This issue relates to the following test suite run: https://maloo.whamcloud.com/test_sets/f1cbcf24-fe85-11e1-b4cd-52540035b04c. The sub-test test_180b failed with the following error:
From the test output: == sanity test 180b: test obdecho directly on obdfilter == 03:16:00 (1347617760)
CMD: client-26vm4 lsmod | grep -q obdecho || { insmod /usr/lib64/lustre/obdecho/obdecho.ko || modprobe obdecho; }
CMD: client-26vm4 /usr/sbin/lctl dl
CMD: client-26vm4 /usr/sbin/lctl attach echo_client ec ec_uuid
CMD: client-26vm4 /usr/sbin/lctl --device ec setup lustre-OST0000
CMD: client-26vm4 /usr/sbin/lctl --device ec create 1
client-26vm4: error: create: #1 - Operation not supported
New object id is
CMD: client-26vm4 /usr/sbin/lctl --device ec cleanup
CMD: client-26vm4 /usr/sbin/lctl --device ec detach
obecho_create_test failed: 3
CMD: client-26vm4 rmmod obdecho
sanity test_180b: @@@@@@ FAIL: test_180b failed with 1
This was b2_3 with OFD and ZFS OSTs. Info required for matching: sanity 180b |
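For reference, the failing sequence in the output above amounts to roughly the following manual steps on the OSS node (a sketch reconstructed from the CMD lines logged above; the echo device name "ec" and the target "lustre-OST0000" are specific to this run):
# load obdecho, preferring a staged module and falling back to modprobe
lsmod | grep -q obdecho || { insmod /usr/lib64/lustre/obdecho/obdecho.ko || modprobe obdecho; }
# attach an echo client and point it at the OST's obdfilter/OFD device
/usr/sbin/lctl attach echo_client ec ec_uuid
/usr/sbin/lctl --device ec setup lustre-OST0000
# create one object; this is the step that fails here with "Operation not supported"
/usr/sbin/lctl --device ec create 1
# tear down again
/usr/sbin/lctl --device ec cleanup
/usr/sbin/lctl --device ec detach
rmmod obdecho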
| Comments |
| Comment by Li Wei (Inactive) [ 16/Sep/12 ] |
|
https://maloo.whamcloud.com/test_sets/29df0cfe-fedc-11e1-b4cd-52540035b04c This was b2_3 with OFD and ZFS OSTs. |
| Comment by Jian Yu [ 19/Sep/12 ] |
|
Lustre Build: http://build.whamcloud.com/job/lustre-b2_3/19 |
| Comment by Li Wei (Inactive) [ 23/Sep/12 ] |
|
https://maloo.whamcloud.com/test_sets/cbb8e36a-0490-11e2-bfd4-52540035b04c This was master with OFD and LDiskFS. |
| Comment by Nathaniel Clark [ 21/Feb/13 ] |
|
The zfs portion of this bug is possibly handled by |
| Comment by Keith Mannthey (Inactive) [ 26/Feb/13 ] |
|
Looking at Maloo, test 180b is still failing. An ldiskfs failure can be seen at https://maloo.whamcloud.com/sub_tests/33b45e16-7a59-11e2-b916-52540035b04c (it failed 4 times in the last 4 weeks on the ldiskfs patch review jobs). For both ZFS and ldiskfs there seems to be an error like:
CMD: wtm-19vm4 lsmod | grep -q obdecho || { insmod /usr/lib64/lustre/obdecho/obdecho.ko || modprobe obdecho; }
wtm-19vm4: insmod: can't read '/usr/lib64/lustre/obdecho/obdecho.ko': No such file or directory
There is no module? |
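To answer that quickly on a failing node, something like the following would show whether obdecho exists at all and where it lives (a sketch; the two paths are the locations mentioned in this ticket):
# which file would modprobe load, if any?
modinfo -n obdecho
# is there a staged copy in the install tree?
ls -l /usr/lib64/lustre/obdecho/obdecho.ko
# otherwise search the running kernel's module tree
find /lib/modules/$(uname -r) -name 'obdecho.ko*'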
| Comment by Keith Mannthey (Inactive) [ 26/Feb/13 ] |
|
Nope, there is a module; it just seems to not want to load on the OST:
12:11:05:Lustre: DEBUG MARKER: /usr/sbin/lctl --device ec create 1
12:11:05:LustreError: 21642:0:(ofd_obd.c:1191:ofd_create()) lustre-OST0000: Can't find FID Sequence 0x2: rc = -22
12:11:05:LustreError: 21642:0:(echo_client.c:2306:echo_create_object()) Cannot create objects: rc = -22
12:11:06:LustreError: 21642:0:(echo_client.c:2330:echo_create_object()) create object failed with: rc = -22 |
| Comment by Keith Mannthey (Inactive) [ 04/Mar/13 ] |
|
OK, current update: the "insmod: can't read" message is not part of the issue; it is just the way the test is written. It seems obdecho can sometimes be in /usr/lib64/lustre/obdecho/ and sometimes installed with the kernel. It is not yet clear to me whether the module from /usr/lib64/lustre/obdecho/ would differ from the one modprobe loads. In any case, the main error path is the one seen above, where "/usr/sbin/lctl --device ec create 1" fails to find the FID sequence. That is the real error. I am running continuous testing to recreate the FID sequence error. If I am unable to reproduce it overnight, I will submit an autotest job tomorrow. I would say it is a rare error at this point in time. |
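If it ever matters whether the staged copy and the packaged copy differ, comparing their version metadata is enough (a sketch, assuming both paths exist on the node):
# copy that modprobe would load
modinfo obdecho | grep -E 'filename|srcversion|vermagic'
# staged copy shipped under /usr/lib64/lustre
modinfo /usr/lib64/lustre/obdecho/obdecho.ko | grep -E 'filename|srcversion|vermagic'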
| Comment by Keith Mannthey (Inactive) [ 05/Mar/13 ] |
|
So far, 11k iterations of the test without a repro. It seems http://review.whamcloud.com/5307 landed on Feb 14th. The last known ldiskfs error was Feb 18th, so I doubt the error has been encountered since the patch landed. Patch 5307 is " I am inclined to say this issue has been fixed. I have emailed Wang Di. |
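A loop like the following is one way to hammer a single subtest (a sketch only; ONLY= is the standard sanity.sh knob for selecting a subtest, but the exact invocation depends on the local test-framework setup):
# run only test_180b in a loop until it fails
i=0
while ONLY=180b bash sanity.sh; do
	i=$((i + 1))
	echo "iteration $i passed"
done
echo "sanity.sh 180b failed after $i successful iterations"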
| Comment by Di Wang [ 05/Mar/13 ] |
|
Yes, with patch 5307 normal FIDs will only be used when DNE is enabled. But I do not understand why that would fix this problem; the echo client should always use sequence 2, whether or not OST FIDs are enabled. Probably I am missing something here. The debug logs from those failure links are very sparse. Keith, do you have a debug log you can post here? Thanks. |
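For anyone reproducing this, a fuller server-side debug log can be captured around the failing create with something like the following on the OSS (a sketch; adjust the debug mask, buffer size, and output path as needed):
# widen the debug mask and buffer, then clear old messages
lctl set_param debug=-1 debug_mb=512
lctl clear
# ... reproduce the failure, e.g. re-run the "lctl --device ec create 1" sequence ...
# dump the kernel debug buffer to a file for attaching to the ticket
lctl dk > /tmp/lu-1957-ost-debug.log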
| Comment by Keith Mannthey (Inactive) [ 05/Mar/13 ] |
|
It seems the error has not occurred for over two weeks; the last known error was Feb 18th. I do not have any local debug logs, as I have only been looking at the autotest logs. Below is the OST debug log from the Feb 18th run. In general I don't see anything more interesting than the following in the logs:
00000100:00100000:0.0:1360080302.580440:0:13196:0:(service.c:1976:ptlrpc_server_handle_request()) Handling RPC pname:cluuid+ref:pid:xid:nid:opc ll_ost00_002:lustre-MDT0000-mdtlov_UUID+5:3022:x1426144806071293:12345-10.10.16.242@tcp:400
00000100:00100000:0.0:1360080302.580447:0:13196:0:(service.c:2020:ptlrpc_server_handle_request()) Handled RPC pname:cluuid+ref:pid:xid:nid:opc ll_ost00_002:lustre-MDT0000-mdtlov_UUID+5:3022:x1426144806071293:12345-10.10.16.242@tcp:400 Request procesed in 7us (156us total) trans 0 rc 0/0
00000100:00100000:0.0:1360080302.580448:0:13196:0:(nrs_fifo.c:245:nrs_fifo_req_stop()) NRS stop fifo request from 12345-10.10.16.242@tcp, seq: 251
00000100:00100000:0.0:1360080302.580450:0:13196:0:(nrs_fifo.c:223:nrs_fifo_req_start()) NRS start fifo request from 12345-10.10.16.242@tcp, seq: 252
00000100:00100000:0.0:1360080302.580451:0:13196:0:(service.c:1976:ptlrpc_server_handle_request()) Handling RPC pname:cluuid+ref:pid:xid:nid:opc ll_ost00_002:lustre-MDT0000-mdtlov_UUID+5:3022:x1426144806071295:12345-10.10.16.242@tcp:400
00000100:00100000:0.0:1360080302.580456:0:13196:0:(service.c:2020:ptlrpc_server_handle_request()) Handled RPC pname:cluuid+ref:pid:xid:nid:opc ll_ost00_002:lustre-MDT0000-mdtlov_UUID+5:3022:x1426144806071295:12345-10.10.16.242@tcp:400 Request procesed in 5us (161us total) trans 0 rc 0/0
00000100:00100000:0.0:1360080302.580458:0:13196:0:(nrs_fifo.c:245:nrs_fifo_req_stop()) NRS stop fifo request from 12345-10.10.16.242@tcp, seq: 252
00002000:00100000:0.0:1360080302.620394:0:21137:0:(ofd_obd.c:135:ofd_parse_connect_data()) lustre-OST0000: cli ECHO_UUID/ffff88006041b800 ocd_connect_flags: 0x405000000068 ocd_version: 2033c00 ocd_grant: 0 ocd_index: 0 ocd_group 2
00002000:00100000:0.0:1360080302.620403:0:21137:0:(ofd_obd.c:234:ofd_parse_connect_data()) lustre-OST0000: cli (no nid) does not support OBD_CONNECT_CKSUM, CRC32 will be used
00002000:00080000:0.0:1360080302.620456:0:21137:0:(ofd_obd.c:317:ofd_obd_connect()) lustre-OST0000: get connection from MDS 2
00000001:02000400:0.0:1360080302.715088:0:21160:0:(debug.c:445:libcfs_debug_mark_buffer()) DEBUG MARKER: /usr/sbin/lctl --device ec create 1
00002000:00020000:0.0:1360080302.803633:0:21184:0:(ofd_obd.c:1178:ofd_create()) lustre-OST0000: Can't find oseq 0x2: -22
00008000:00020000:0.0:1360080302.803636:0:21184:0:(echo_client.c:2300:echo_create_object()) Cannot create objects: rc = -22
00008000:00020000:0.0:1360080302.804450:0:21184:0:(echo_client.c:2324:echo_create_object()) create object failed with: rc = -22
00000001:02000400:0.0:1360080302.902789:0:21207:0:(debug.c:445:libcfs_debug_mark_buffer()) DEBUG MARKER: /usr/sbin/lctl --device ec cleanup
00000100:00100000:0.0:1360080302.936335:0:5561:0:(client.c:1418:ptlrpc_send_new_req()) Sending RPC pname:cluuid:pid:xid:nid:opc ptlrpcd_0:29659cc7-7815-f2e4-6cf2-2103848e55b6:5561:1426144820745791:10.10.16.242@tcp:400 |
| Comment by Andreas Dilger [ 06/Mar/13 ] |
|
In my Maloo search, it does appear that sanity.sh test_180b is failing several times a day. The most recent failures are at: It looks like all of the failures are on ZFS. |
| Comment by Keith Mannthey (Inactive) [ 06/Mar/13 ] |
|
Correct, it fails on ZFS a lot. There has not been an ldiskfs failure since Feb 18th.
|
| Comment by Keith Mannthey (Inactive) [ 06/Mar/13 ] |
|
Are we all OK with bringing this down out of the blocker state? Perhaps close it as unreproducible for ldiskfs? |
| Comment by Andreas Dilger [ 07/Mar/13 ] |
|
Fixed for ldiskfs, use |
| Comment by Andreas Dilger [ 01/Oct/14 ] |
|
sanity.sh test_180 is still being skipped on ZFS filesystems due to this issue. If it was fixed by |
| Comment by James A Simmons [ 14/Aug/16 ] |
|
Really old blocker for an unsupported version |
| Comment by Andreas Dilger [ 29/May/17 ] |
|
Reopen to clear flags. |
| Comment by Andreas Dilger [ 26/Aug/19 ] |
|
We still always skip test_180 for ZFS targets due to ALWAYS_EXCEPT. A patch should be submitted to remove the subtests from ALWAYS_EXCEPT on the assumption that |
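For context, the skip comes from the ZFS-specific ALWAYS_EXCEPT handling near the top of sanity.sh, and the patch essentially needs to drop 180 from that list. Roughly (an illustrative sketch only, with approximate variable and helper names, not the exact upstream lines):
# sanity.sh (sketch): subtests listed in ALWAYS_EXCEPT are skipped unconditionally
ALWAYS_EXCEPT="$SANITY_EXCEPT"
# ZFS-specific exclusions; dropping 180 here is what re-enables test_180a/b/c on ZFS OSTs
if [ "$(facet_fstype ost1)" = "zfs" ]; then
	ALWAYS_EXCEPT="$ALWAYS_EXCEPT 180"
fi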
| Comment by Gerrit Updater [ 26/Aug/19 ] |
|
Andreas Dilger (adilger@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/35930 |
| Comment by Gerrit Updater [ 16/Sep/19 ] |
|
Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/35930/ |
| Comment by Andreas Dilger [ 18/Sep/19 ] |
|
Bug was fixed in 2.4.0, test enabled in 2.13.0. |
| Comment by Gerrit Updater [ 05/Dec/19 ] |
|
James Nunez (jnunez@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/36930 |
| Comment by Gerrit Updater [ 03/Jan/20 ] |
|
Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/36930/ |