[LU-4812] Interop 2.5.1<->2.6 failure on test suite sanity-hsm test_12c: request on 0x200000bd0:0x1c:0x0 is not SUCCEED on mds1
Created: 25/Mar/14  Updated: 14/Dec/21  Resolved: 14/Dec/21

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.6.0
Fix Version/s: None

Type: Bug
Priority: Minor
Reporter: Maloo
Assignee: Bruno Faccini (Inactive)
Resolution: Cannot Reproduce
Votes: 0
Labels: None
Environment:

server: 2.5.1
client: lustre-master build # 1945


Severity: 3
Rank (Obsolete): 13239

 Description   

This issue was created by maloo for sarah <sarah@whamcloud.com>

This issue relates to the following test suite run: http://maloo.whamcloud.com/test_sets/bce807b0-b242-11e3-a93f-52540035b04c.

The sub-test test_12c failed with the following error:

request on 0x200000bd0:0x1c:0x0 is not SUCCEED on mds1

Update not seen after 100s: wanted 'SUCCEED' got ''
 sanity-hsm test_12c: @@@@@@ FAIL: request on 0x200000bd0:0x1c:0x0 is not SUCCEED on mds1 
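
The "Update not seen after 100s: wanted 'SUCCEED' got ''" message reads like the result of a poll-until-timeout check. A minimal sketch of that pattern follows, assuming a hypothetical `get_status` callable; the function name, parameters, and return shape are illustrative and are not the test framework's actual helper.

```python
import time


def wait_update(get_status, wanted, timeout_s=100, interval_s=1.0,
                sleep=time.sleep, clock=time.monotonic):
    """Poll get_status() until it returns `wanted` or the timeout expires.

    Returns (ok, last_value). All names here are hypothetical; the clock and
    sleep functions are injectable so the loop can be exercised without waiting.
    """
    deadline = clock() + timeout_s
    while True:
        last = get_status()
        if last == wanted:
            return True, last
        if clock() >= deadline:
            # Report the last value seen; an empty string here matches "got ''".
            return False, last
        sleep(interval_s)
```

With a status source that only ever yields an empty string, the loop times out and reports `''` as the last value, which is the shape of the failure above.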


 Comments   
Comment by Jodi Levi (Inactive) [ 25/Mar/14 ]

Bruno,
Can you have a look at this one please and comment?
Thank you!

Comment by Bruno Faccini (Inactive) [ 27/Mar/14 ]

The subtest log shows that setstripe fails with ENODATA and that a later stat() fails with ERANGE:

........
error on ioctl 0x4008669a for '/mnt/lustre/d12c.sanity-hsm/f12c.sanity-hsm' (3): No data available
error: setstripe: create stripe file '/mnt/lustre/d12c.sanity-hsm/f12c.sanity-hsm' failed
5+0 records in
5+0 records out
5242880 bytes (5.2 MB) copied, 1.11764 s, 4.7 MB/s
Cannot stat /mnt/lustre/d12c.sanity-hsm/f12c.sanity-hsm: Numerical result out of range
CMD: client-13vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000bd0:0x1c:0x0'.*action='ARCHIVE'/ {print \$13}' | cut -f2 -d=
........
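
For readers who want to replay the status check in the CMD line above against saved `hsm.actions` output, here is a rough Python equivalent of the `awk ... {print $13} | cut -f2 -d=` pipeline. It is a sketch only: the function name is hypothetical, and it assumes, as the pipeline itself does, that the status is the 13th whitespace-separated field in `key=value` form.

```python
import re


def hsm_action_status(actions_text, fid, action="ARCHIVE"):
    """Mimic: awk '/<fid>.*action=<action>/ {print $13}' | cut -f2 -d=

    Returns one value per matching line (hypothetical helper, not Lustre code).
    """
    pattern = re.compile(re.escape(fid) + r".*action=" + re.escape(action))
    results = []
    for line in actions_text.splitlines():
        if not pattern.search(line):
            continue
        fields = line.split()  # awk's default whitespace field splitting
        field13 = fields[12] if len(fields) > 12 else ""
        parts = field13.split("=")
        # cut -f2 -d= prints the whole input when there is no '=' delimiter
        results.append(parts[1] if len(parts) > 1 else field13)
    return results
```

An empty list means no line matched the FID and action, in which case the original pipeline prints nothing; that would be consistent with the empty value in "wanted 'SUCCEED' got ''".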

Looking at the Lustre debug logs for both the MDS and the client, it seems that the MDS returns ERANGE (rc -34) from mdt_getxattr, but many debug traces are missing there:

00000100:00100000:0.0:1395452120.445468:0:14249:0:(nrs_fifo.c:182:nrs_fifo_req_get()) NRS start fifo request from 12345-10.10.4.213@tcp, seq: 516
00000100:00100000:0.0:1395452120.445527:0:14249:0:(service.c:2011:ptlrpc_server_handle_request()) Handling RPC pname:cluuid+ref:pid:xid:nid:opc mdt00_000:4a4a5079-10da-02d9-9c8b-596c927d64af+70:26950:x1463237424976160:12345-10.10.4.213@tcp:49
00000100:00100000:0.0:1395452120.445594:0:14249:0:(service.c:2055:ptlrpc_server_handle_request()) Handled RPC pname:cluuid+ref:pid:xid:nid:opc mdt00_000:4a4a5079-10da-02d9-9c8b-596c927d64af+70:26950:x1463237424976160:12345-10.10.4.213@tcp:49 Request procesed in 121us (294us total) trans 0 rc -34/-34
00000100:00100000:0.0:1395452120.445600:0:14249:0:(nrs_fifo.c:244:nrs_fifo_req_stop()) NRS stop fifo request from 12345-10.10.4.213@tcp, seq: 516
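
The `rc -34/-34` at the end of the "Handled RPC" line is a negated errno value; on Linux, 34 is ERANGE, which matches the "Numerical result out of range" message from stat() in the subtest log. The sketch below extracts and decodes that value from a saved debug log line. The helper name is hypothetical, and since the ticket does not say what distinguishes the two numbers in `rc a/b`, only the first one is read.

```python
import errno
import os
import re

# Matches the trailing "rc <a>/<b>" pair of a "Handled RPC" debug line.
RC_RE = re.compile(r"\brc (-?\d+)/(-?\d+)\s*$")


def decode_rc(line):
    """Return (rc, errno_name, message) for a line ending in 'rc a/b', else None.

    Hypothetical helper; errno names and messages are those of the local platform.
    """
    m = RC_RE.search(line)
    if m is None:
        return None
    rc = int(m.group(1))
    if rc >= 0:
        return rc, None, None
    code = -rc
    return rc, errno.errorcode.get(code, "UNKNOWN"), os.strerror(code)
```

Applied to the "Handled RPC" line above on a Linux host, the decoded name is `ERANGE`.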

Sarah, do you have more details about the 2.5.1 build so that I can try to reproduce this?

Comment by Bruno Faccini (Inactive) [ 31/Mar/14 ]

Sarah, I added you as a watcher on this ticket since I need your help to better qualify the platform and the problem.
Can you help?

Comment by Sarah Liu [ 31/Mar/14 ]

Hello Bruno,

Here is the link to the 2.5.1 build I used (the RHEL6 x86_64 server build):

http://build.whamcloud.com/job/lustre-b2_5/42/

Generated at Sat Feb 10 01:46:04 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.