[LU-1957] Test failure on test suite sanity, subtest test_180b Created: 16/Sep/12  Updated: 03/Jan/20  Resolved: 18/Sep/19

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.3.0, Lustre 2.4.0
Fix Version/s: Lustre 2.13.0, Lustre 2.12.4

Type: Bug Priority: Blocker
Reporter: Maloo Assignee: WC Triage
Resolution: Fixed Votes: 0
Labels: ldiskfs

Issue Links:
Related
is related to LU-2803 sanity.sh test_180 fails with zfs Resolved
Severity: 3
Rank (Obsolete): 4069

 Description   

This issue was created by maloo for Li Wei <liwei@whamcloud.com>

This issue relates to the following test suite run: https://maloo.whamcloud.com/test_sets/f1cbcf24-fe85-11e1-b4cd-52540035b04c.

The sub-test test_180b failed with the following error:

test_180b failed with 1

From the test output:

== sanity test 180b: test obdecho directly on obdfilter == 03:16:00 (1347617760)
CMD: client-26vm4 lsmod | grep -q obdecho ||  { insmod /usr/lib64/lustre/obdecho/obdecho.ko ||  modprobe obdecho; }
CMD: client-26vm4 /usr/sbin/lctl dl
CMD: client-26vm4 /usr/sbin/lctl attach echo_client ec ec_uuid
CMD: client-26vm4 /usr/sbin/lctl --device ec setup lustre-OST0000
CMD: client-26vm4 /usr/sbin/lctl --device ec create 1
client-26vm4: error: create: #1 - Operation not supported
New object id is 
CMD: client-26vm4 /usr/sbin/lctl --device ec  cleanup
CMD: client-26vm4 /usr/sbin/lctl --device ec  detach
obecho_create_test failed: 3
CMD: client-26vm4 rmmod obdecho
 sanity test_180b: @@@@@@ FAIL: test_180b failed with 1 

This was b2_3 with OFD and ZFS OSTs.

Info required for matching: sanity 180b



 Comments   
Comment by Li Wei (Inactive) [ 16/Sep/12 ]

https://maloo.whamcloud.com/test_sets/29df0cfe-fedc-11e1-b4cd-52540035b04c

This was b2_3 with OFD and ZFS OSTs.

Comment by Jian Yu [ 19/Sep/12 ]

Lustre Build: http://build.whamcloud.com/job/lustre-b2_3/19
USE_OFD=yes
OSTFSTYPE=zfs
LOAD_MODULES_REMOTE=true
https://maloo.whamcloud.com/test_sets/ff130926-0241-11e2-ab94-52540035b04c

Comment by Li Wei (Inactive) [ 23/Sep/12 ]

https://maloo.whamcloud.com/test_sets/cbb8e36a-0490-11e2-bfd4-52540035b04c

This was master with OFD and LDiskFS.

Comment by Nathaniel Clark [ 21/Feb/13 ]

The zfs portion of this bug is possibly handled by LU-2803

Comment by Keith Mannthey (Inactive) [ 26/Feb/13 ]

So from looking at Maloo, test 180b is still failing. An ldiskfs failure can be seen at https://maloo.whamcloud.com/sub_tests/33b45e16-7a59-11e2-b916-52540035b04c (it failed 4 times in the last 4 weeks on the ldiskfs patch review jobs).

For both zfs and ldiskfs it seems there is an error like:

CMD: wtm-19vm4 lsmod | grep -q obdecho ||  { insmod /usr/lib64/lustre/obdecho/obdecho.ko ||  modprobe obdecho; }
wtm-19vm4: insmod: can't read '/usr/lib64/lustre/obdecho/obdecho.ko': No such file or directory

There is no module?

Comment by Keith Mannthey (Inactive) [ 26/Feb/13 ]

Nope, there is a module; it just seems to not want to load on the OST:

12:11:05:Lustre: DEBUG MARKER: /usr/sbin/lctl --device ec create 1
12:11:05:LustreError: 21642:0:(ofd_obd.c:1191:ofd_create()) lustre-OST0000: Can't find FID Sequence 0x2: rc = -22
12:11:05:LustreError: 21642:0:(echo_client.c:2306:echo_create_object()) Cannot create objects: rc = -22
12:11:06:LustreError: 21642:0:(echo_client.c:2330:echo_create_object()) create object failed with: rc = -22
Comment by Keith Mannthey (Inactive) [ 04/Mar/13 ]

Ok, current update. The "insmod: can't read" thing is not part of the issue; it is just the way the test is written. It seems obdecho can sometimes be in /usr/lib64/lustre/obdecho/ and sometimes be installed with the kernel modules. It is not quite clear to me yet whether the module from /usr/lib64/lustre/obdecho/ would be different from the one that modprobe loads.
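To illustrate why the "insmod: can't read" message is harmless: the test framework's load command is an || chain, so a failed insmod simply falls through to modprobe. A minimal sketch of that behaviour in plain shell (the stub functions below are stand-ins, since the real insmod/modprobe need root):

```shell
#!/bin/sh
# Sketch of the test-framework load pattern, using stubs for insmod/modprobe
# (the real commands need root; these stand-ins only model the || fallback).
insmod_stub() {
    # Mimics the benign failure when the in-tree .ko is absent.
    echo "insmod: can't read '/usr/lib64/lustre/obdecho/obdecho.ko': No such file or directory" >&2
    return 1
}
modprobe_stub() {
    echo "obdecho loaded via modprobe"
}

# Same shape as: lsmod | grep -q obdecho || { insmod ... || modprobe obdecho; }
# A failed insmod prints its error, but the chain still succeeds via modprobe.
false || { insmod_stub || modprobe_stub; }
```

So the error message lands in the output even on runs where the module loads fine via the modprobe fallback.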

In any case, the main error path is the one seen above, where "/usr/sbin/lctl --device ec create 1" fails to find the FID sequence. This is the real error.

I am running endless testing to recreate the FID sequence error. I will submit an autotest job tomorrow if I am unable to repro overnight. I would say it is a rare error at this point in time.

Comment by Keith Mannthey (Inactive) [ 05/Mar/13 ]

So far 11k Iterations of the test without a repro.

It seems http://review.whamcloud.com/5307 was landed on Feb 14th. The last known ldiskfs error was Feb 18th, so I doubt the error has been encountered since the patch landed.

Patch 5307 is "LU-2775 osp: enable fid-on-OST only for DNE." Basically, we now only use FID-on-OST for DNE, and autotest runs should not be using the FID sequence on the OST.

I am inclined to say this issue has been fixed. I have emailed Wang Di.

Comment by Di Wang [ 05/Mar/13 ]

Yes, with patch 5307, normal FIDs will only be used when DNE is enabled. But I do not understand why this can fix this problem; the echo client should always use seq 2, no matter whether OST FID is enabled or not. Probably I am missing something here. The debug logs from those failure links are sparse. Keith, do you have a debug log you can post here? Thanks.

Comment by Keith Mannthey (Inactive) [ 05/Mar/13 ]

It seems the error has not occurred for over two weeks; the last known error was Feb 18th. I do not have any local debug logs, as I have only been looking at the autotest logs.

Below is the OST debug log from the Feb 18th run.
https://maloo.whamcloud.com/test_logs/1dcc313a-7a5b-11e2-b916-52540035b04c

In general I don't see anything more interesting than

00000100:00100000:0.0:1360080302.580440:0:13196:0:(service.c:1976:ptlrpc_server_handle_request()) Handling RPC pname:cluuid+ref:pid:xid:nid:opc ll_ost00_002:lustre-MDT0000-mdtlov_UUID+5:3022:x1426144806071293:12345-10.10.16.242@tcp:400
00000100:00100000:0.0:1360080302.580447:0:13196:0:(service.c:2020:ptlrpc_server_handle_request()) Handled RPC pname:cluuid+ref:pid:xid:nid:opc ll_ost00_002:lustre-MDT0000-mdtlov_UUID+5:3022:x1426144806071293:12345-10.10.16.242@tcp:400 Request procesed in 7us (156us total) trans 0 rc 0/0
00000100:00100000:0.0:1360080302.580448:0:13196:0:(nrs_fifo.c:245:nrs_fifo_req_stop()) NRS stop fifo request from 12345-10.10.16.242@tcp, seq: 251
00000100:00100000:0.0:1360080302.580450:0:13196:0:(nrs_fifo.c:223:nrs_fifo_req_start()) NRS start fifo request from 12345-10.10.16.242@tcp, seq: 252
00000100:00100000:0.0:1360080302.580451:0:13196:0:(service.c:1976:ptlrpc_server_handle_request()) Handling RPC pname:cluuid+ref:pid:xid:nid:opc ll_ost00_002:lustre-MDT0000-mdtlov_UUID+5:3022:x1426144806071295:12345-10.10.16.242@tcp:400
00000100:00100000:0.0:1360080302.580456:0:13196:0:(service.c:2020:ptlrpc_server_handle_request()) Handled RPC pname:cluuid+ref:pid:xid:nid:opc ll_ost00_002:lustre-MDT0000-mdtlov_UUID+5:3022:x1426144806071295:12345-10.10.16.242@tcp:400 Request procesed in 5us (161us total) trans 0 rc 0/0
00000100:00100000:0.0:1360080302.580458:0:13196:0:(nrs_fifo.c:245:nrs_fifo_req_stop()) NRS stop fifo request from 12345-10.10.16.242@tcp, seq: 252
00002000:00100000:0.0:1360080302.620394:0:21137:0:(ofd_obd.c:135:ofd_parse_connect_data()) lustre-OST0000: cli ECHO_UUID/ffff88006041b800 ocd_connect_flags: 0x405000000068 ocd_version: 2033c00 ocd_grant: 0 ocd_index: 0 ocd_group 2
00002000:00100000:0.0:1360080302.620403:0:21137:0:(ofd_obd.c:234:ofd_parse_connect_data()) lustre-OST0000: cli (no nid) does not support OBD_CONNECT_CKSUM, CRC32 will be used
00002000:00080000:0.0:1360080302.620456:0:21137:0:(ofd_obd.c:317:ofd_obd_connect()) lustre-OST0000: get connection from MDS 2
00000001:02000400:0.0:1360080302.715088:0:21160:0:(debug.c:445:libcfs_debug_mark_buffer()) DEBUG MARKER: /usr/sbin/lctl --device ec create 1
00002000:00020000:0.0:1360080302.803633:0:21184:0:(ofd_obd.c:1178:ofd_create()) lustre-OST0000: Can't find oseq 0x2: -22
00008000:00020000:0.0:1360080302.803636:0:21184:0:(echo_client.c:2300:echo_create_object()) Cannot create objects: rc = -22
00008000:00020000:0.0:1360080302.804450:0:21184:0:(echo_client.c:2324:echo_create_object()) create object failed with: rc = -22
00000001:02000400:0.0:1360080302.902789:0:21207:0:(debug.c:445:libcfs_debug_mark_buffer()) DEBUG MARKER: /usr/sbin/lctl --device ec  cleanup
00000100:00100000:0.0:1360080302.936335:0:5561:0:(client.c:1418:ptlrpc_send_new_req()) Sending RPC pname:cluuid:pid:xid:nid:opc ptlrpcd_0:29659cc7-7815-f2e4-6cf2-2103848e55b6:5561:1426144820745791:10.10.16.242@tcp:400

in the logs.

Comment by Andreas Dilger [ 06/Mar/13 ]

In my maloo search, it does appear that sanity.sh test_180b is failing several times a day:

https://maloo.whamcloud.com/sub_tests/query?utf8=%E2%9C%93&test_set[test_set_script_id]=f9516376-32bc-11e0-aaee-52540025f9ae&sub_test[sub_test_script_id]=14b0513a-32be-11e0-b685-52540025f9ae&sub_test[status]=FAIL&sub_test[query_bugs]=&test_session[test_host]=&test_session[test_group]=&test_session[user_id]=&test_session[query_date]=&test_session[query_recent_period]=2419200&test_node[os_type_id]=&test_node[distribution_type_id]=&test_node[architecture_type_id]=&test_node[file_system_type_id]=&test_node[lustre_branch_id]=&test_node_network[network_type_id]=&commit=Update+results

The most recent failures are at:
https://maloo.whamcloud.com/sub_tests/0b93da12-833a-11e2-85c9-52540035b04c
https://maloo.whamcloud.com/sub_tests/e26a1e3c-8242-11e2-ba47-52540035b04c

Looks like all of the failures are on ZFS.

Comment by Keith Mannthey (Inactive) [ 06/Mar/13 ]

Correct, it fails on ZFS a lot. There has not been an ldiskfs failure since Feb 18th.

LU-2803 ("sanity.sh test_180 fails with zfs") is a separate LU that tracks the zfs issue. Alex has a patch out for the issue. I have taken this LU to mean ldiskfs.

Comment by Keith Mannthey (Inactive) [ 06/Mar/13 ]

Are we all OK with bringing this down out of blocker state?

Perhaps close as unreproducible for ldiskfs?

Comment by Andreas Dilger [ 07/Mar/13 ]

Fixed for ldiskfs, use LU-2803 for the current ZFS failures.

Comment by Andreas Dilger [ 01/Oct/14 ]

sanity.sh test_180 is still being skipped on ZFS filesystems due to this issue. If it was fixed by LU-2803 then a patch should be submitted to re-enable the test.

Comment by James A Simmons [ 14/Aug/16 ]

Really old blocker for an unsupported version.

Comment by Andreas Dilger [ 29/May/17 ]

Reopen to clear flags.

Comment by Andreas Dilger [ 26/Aug/19 ]

We still always skip test_180 for ZFS targets due to ALWAYS_EXCEPT. A patch should be submitted to remove the subtests from ALWAYS_EXCEPT on the assumption that LU-2803 fixed that problem.
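For context, ALWAYS_EXCEPT in the Lustre test scripts is just a space-separated list of subtest numbers that the framework skips. A minimal sketch of that skip logic (the function and list contents here are illustrative, not the exact test-framework.sh internals):

```shell
#!/bin/sh
# Illustrative sketch of the ALWAYS_EXCEPT skip logic (names and list
# contents are hypothetical, not the exact test-framework.sh code).
# The landed patch simply removed "180" from the list in sanity.sh so
# the test_180 subtests run again.
ALWAYS_EXCEPT="42a 180"

should_skip() {
    for e in $ALWAYS_EXCEPT; do
        [ "$e" = "$1" ] && return 0
    done
    return 1
}

for t in 180 181; do
    if should_skip "$t"; then
        echo "SKIP: test_$t (ALWAYS_EXCEPT)"
    else
        echo "RUN: test_$t"
    fi
done
```

With "180" removed from the list, should_skip fails and the subtest runs, which is exactly what the patch below does for sanity.sh.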

Comment by Gerrit Updater [ 26/Aug/19 ]

Andreas Dilger (adilger@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/35930
Subject: LU-1957 tests: remove sanity test 180 from ALWAYS_EXCEPT
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 2b0119932f5abfd85f01ffac5614f41d5b9fe559

Comment by Gerrit Updater [ 16/Sep/19 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/35930/
Subject: LU-1957 tests: remove sanity test 180 from ALWAYS_EXCEPT
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 72b59b85a253e508ec1b192fbf8cad840ca6ff2c

Comment by Andreas Dilger [ 18/Sep/19 ]

Bug was fixed in 2.4.0, test enabled in 2.13.0.

Comment by Gerrit Updater [ 05/Dec/19 ]

James Nunez (jnunez@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/36930
Subject: LU-1957 tests: remove sanity test 180 from ALWAYS_EXCEPT
Project: fs/lustre-release
Branch: b2_12
Current Patch Set: 1
Commit: 3e47ed64ed481d59140fd74ff92d9f774d0e39da

Comment by Gerrit Updater [ 03/Jan/20 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/36930/
Subject: LU-1957 tests: remove sanity test 180 from ALWAYS_EXCEPT
Project: fs/lustre-release
Branch: b2_12
Current Patch Set:
Commit: c3d53269c5133e938b90f0f0488cddf29c35701b

Generated at Sat Feb 10 01:21:09 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.