[LU-2867] 2.1.4<->2.4.0 interop: parallel-scale test_compilebench: IOError: [Errno 71] Protocol error Created: 26/Feb/13  Updated: 07/Mar/13  Resolved: 07/Mar/13

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.4.0, Lustre 2.1.4
Fix Version/s: None

Type: Bug Priority: Blocker
Reporter: Jian Yu Assignee: nasf (Inactive)
Resolution: Duplicate Votes: 0
Labels: HB
Environment:

Lustre b2_1 client build: http://build.whamcloud.com/job/lustre-b2_1/176
Lustre master server build: http://build.whamcloud.com/job/lustre-master/1269
Distro/Arch: RHEL6.3/x86_64


Issue Links:
Duplicate
duplicates LU-2790 Failure to allocated osd keys leads t... Resolved
Severity: 3
Rank (Obsolete): 6933

 Description   

The parallel-scale test compilebench failed as follows:

IOError: [Errno 71] Protocol error
 parallel-scale test_compilebench: @@@@@@ FAIL: compilebench failed: 1 

The console log on client node client-19vm1 showed:

16:48:59:Lustre: DEBUG MARKER: ./compilebench -D /mnt/lustre/d0.compilebench -i 2 -r 2 --makej
16:54:22:LustreError: 23937:0:(pack_generic.c:413:lustre_msg_buf_v2()) msg ffff880042c05cc0 buffer[2] size 0 too small (required 40, opc=101)
16:54:22:LustreError: 23937:0:(layout.c:1659:__req_capsule_get()) @@@ Wrong buffer for field `dlm_lvb' (2 of 3) in format `LDLM_ENQUEUE_LVB': 0 vs. 40 (server)
16:54:22:  req@ffff880042c05800 x1427708753202028/t0(0) o101->lustre-OST0001-osc-ffff880075eeb000@10.10.2.223@tcp:28/4 lens 296/312 e 0 to 0 dl 1361581007 ref 1 fl Interpret:R/0/0 rc 0/0
16:54:22:LustreError: 23937:0:(pack_generic.c:413:lustre_msg_buf_v2()) msg ffff880042c05cc0 buffer[2] size 0 too small (required 40, opc=101)
16:54:22:LustreError: 23937:0:(pack_generic.c:413:lustre_msg_buf_v2()) Skipped 1 previous similar message
16:54:22:LustreError: 23937:0:(layout.c:1659:__req_capsule_get()) @@@ Wrong buffer for field `dlm_lvb' (2 of 3) in format `LDLM_ENQUEUE_LVB': 0 vs. 40 (server)
16:54:22:  req@ffff880042c05800 x1427708753202030/t0(0) o101->lustre-OST0001-osc-ffff880075eeb000@10.10.2.223@tcp:28/4 lens 296/312 e 0 to 0 dl 1361581023 ref 1 fl Interpret:R/0/0 rc 0/0
16:54:22:LustreError: 23937:0:(layout.c:1659:__req_capsule_get()) Skipped 1 previous similar message
16:54:22:Lustre: DEBUG MARKER: /usr/sbin/lctl mark  parallel-scale test_compilebench: @@@@@@ FAIL: compilebench failed: 1

Maloo report: https://maloo.whamcloud.com/test_sets/2bef1a90-7d79-11e2-85d0-52540035b04c
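
For context: errno 71 is EPROTO ("Protocol error"), and the "0 vs. 40" above is the size of the dlm_lvb buffer the server packed into the reply versus the size the b2_1 client requires for an OST extent lock, which matches sizeof(struct ost_lvb) (5 x __u64 = 40 bytes) on that branch. Below is a minimal stand-alone sketch of that comparison only; the struct layout and values are assumptions for illustration, the real checks are the lustre_msg_buf_v2()/__req_capsule_get() calls shown in the log.

/* Illustrative sketch only -- NOT Lustre code. It just models the size
 * check behind "size 0 too small (required 40)" / "0 vs. 40": the client
 * compares the LVB size it needs for an extent lock against the size the
 * server actually packed into the reply. */
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Assumed pre-2.4 (b2_1) on-wire LVB layout: 5 x 64-bit fields = 40 bytes. */
struct ost_lvb {
        uint64_t lvb_size;
        uint64_t lvb_mtime;
        uint64_t lvb_atime;
        uint64_t lvb_ctime;
        uint64_t lvb_blocks;
};

int main(void)
{
        size_t required = sizeof(struct ost_lvb); /* what the client needs */
        size_t packed = 0;                        /* what the server sent  */

        if (packed < required) {
                printf("buffer size %zu too small (required %zu)\n",
                       packed, required);
                return 71; /* EPROTO: seen as "IOError: [Errno 71] Protocol error" */
        }
        return 0;
}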



 Comments   
Comment by Jian Yu [ 27/Feb/13 ]

Hi Nasf,

Is this a duplicate of LU-2645?

Comment by nasf (Inactive) [ 27/Feb/13 ]

I do not think so. LU-2645 is the convert case, where the server gives a larger size than required. I need to investigate this one further.

Comment by Zhenyu Xu [ 28/Feb/13 ]

I think it must have something to do with the variable-sized LVB support (http://review.whamcloud.com/3965): on the server side, mdt_intent_policy() sets the DLM_LVB field of the server buffer to size 0:

mdt_intent_policy()
        } else {
                /* No intent was provided */
                LASSERT(pill->rc_fmt == &RQF_LDLM_ENQUEUE);
                req_capsule_set_size(pill, &RMF_DLM_LVB, RCL_SERVER, 0);
                rc = req_capsule_server_pack(pill);

Comment by nasf (Inactive) [ 06/Mar/13 ]

Bobijam, I do not think the code section you quoted above caused the failure, because the failed lock was an EXT (extent) lock, not an IBITS lock. In fact, the direct reason for the failure was that the OST ran out of memory.

The log on client:
====================
00000020:00010000:0.0:1361580860.953069:0:15378:0:(cl_lock.c:143:cl_lock_trace0()) enqueue lock: ffff880022365778@(2 ffff880078e1d500 1 0 0 1 1 0)(ffff8800457fa148/1/0) at cl_enqueue_try():1193
00010000:00010000:0.0:1361580860.953083:0:15378:0:(ldlm_lock.c:687:ldlm_lock_addref_internal_nolock()) ### ldlm_lock_addref(PR) ns: lustre-OST0001-osc-ffff880075eeb000 lock: ffff88007a0ced80/0xde77b732306ea406 lrc: 3/1,0 mode: -/PR res: 679272/0 rrc: 1 type: EXT [0->0] (req 0>0) flags: 0x0 remote: 0x0 expref: -99 pid: 15378 timeout 0
00010000:00010000:0.0:1361580860.953087:0:15378:0:(ldlm_request.c:835:ldlm_cli_enqueue()) ### client-side enqueue START, flags 1000
ns: lustre-OST0001-osc-ffff880075eeb000 lock: ffff88007a0ced80/0xde77b732306ea406 lrc: 3/1,0 mode: -/PR res: 679272/0 rrc: 1 type: EXT [0->18446744073709551615] (req 0>18446744073709551615) flags: 0x0 remote: 0x0 expref: -99 pid: 15378 timeout 0
00000020:00010000:0.0:1361580860.953095:0:15378:0:(cl_lock.c:143:cl_lock_trace0()) wait lock try: ffff880022365778@(3 ffff880078e1d500 1 2 0 1 1 0)(ffff8800457fa148/1/0) at cl_wait_try():1453
00000020:00010000:0.0:1361580860.953098:0:15378:0:(cl_lock.c:143:cl_lock_trace0()) disclosure lock: ffff880022365938@(3 ffff880078e1d500 2 1 0 1 1 0)(ffff8800457fa878/0/3) at cl_lock_disclosure():1718
00000020:00010000:0.0:1361580860.953101:0:15378:0:(cl_lock.c:143:cl_lock_trace0()) state wait lock: ffff880022365938@(2 ffff880078e1d500 1 1 0 1 1 0)(ffff8800457fa878/0/1) at cl_lock_state_wait():961
00000100:00100000:0.0:1361580860.953109:0:23937:0:(client.c:1434:ptlrpc_send_new_req()) Sending RPC pname:cluuid:pid:xid:nid:opc ptlrpcd:899b23b2-359c-4238-1ba8-ff7de7e0c945:23937:1427708753202028:10.10.2.223@tcp:101
00010000:00010000:0.0:1361580861.005537:0:23937:0:(ldlm_lock.c:687:ldlm_lock_addref_internal_nolock()) ### ldlm_lock_addref(PR) ns: lustre-OST0001-osc-ffff880075eeb000 lock: ffff88007a0ced80/0xde77b732306ea406 lrc: 6/2,0 mode: -/PR res: 679272/0 rrc: 1 type: EXT [0->18446744073709551615] (req 0>18446744073709551615) flags: 0x0 remote: 0x0 expref: -99 pid: 15378 timeout 0
00010000:00010000:0.0:1361580861.005549:0:23937:0:(ldlm_request.c:599:ldlm_cli_enqueue_fini()) ### client-side enqueue, new policy data ns: lustre-OST0001-osc-ffff880075eeb000 lock: ffff88007a0ced80/0xde77b732306ea406 lrc: 6/2,0 mode: -/PR res: 679272/0 rrc: 1 type: EXT [0->18446744073709551615] (req 0>18446744073709551615) flags: 0x0 remote: 0xaa3ee0fd27648cf2 expref: -99 pid: 15378 timeout 0
00000100:00020000:0.0:1361580861.005553:0:23937:0:(pack_generic.c:413:lustre_msg_buf_v2()) msg ffff880042c05cc0 buffer[2] size 0 too small (required 40, opc=101)
00000100:00020000:0.0:1361580861.006865:0:23937:0:(layout.c:1659:__req_capsule_get()) @@@ Wrong buffer for field `dlm_lvb' (2 of 3) in format `LDLM_ENQUEUE_LVB': 0 vs. 40 (server)
====================

Searching for the failed lock "0xde77b732306ea406" in the OST log, we can find:
====================
==> 00010000:00020000:0.0:1361580860.999665:0:1514:0:(ldlm_resource.c:1161:ldlm_resource_get()) lvbo_init failed for resource 679272: rc -12
00010000:00010000:0.0:1361580861.001913:0:1514:0:(ldlm_lockd.c:1282:ldlm_handle_enqueue0()) ### server-side enqueue handler, new lock created ns: filter-ffff88001721d000 lock: ffff880060a66000/0xaa3ee0fd27648cf2 lrc: 2/0,0 mode: -/PR res: 679272/0 rrc: 1 type: EXT [0->0] (req 0>0) flags: 0x0 nid: local remote: 0xde77b732306ea406 expref: -99 pid: 1514 timeout: 0 lvb_type: 0
00010000:00010000:0.0:1361580861.001930:0:1514:0:(ldlm_extent.c:314:ldlm_check_contention()) contended locks = 0
00010000:00010000:0.0:1361580861.001931:0:1514:0:(ldlm_extent.c:314:ldlm_check_contention()) contended locks = 0
00010000:00010000:0.0:1361580861.001936:0:1514:0:(ldlm_resource.c:1270:ldlm_resource_add_lock()) ### About to add this lock:
ns: filter-ffff88001721d000 lock: ffff880060a66000/0xaa3ee0fd27648cf2 lrc: 3/0,0 mode: PR/PR res: 679272/0 rrc: 1 type: EXT [0->18446744073709551615] (req 0->18446744073709551615) flags: 0x0 nid: 10.10.4.220@tcp remote: 0xde77b732306ea406 expref: 4042 pid: 1514 timeout: 0 lvb_type: 1
00010000:00010000:0.0:1361580861.001945:0:1514:0:(ldlm_lockd.c:1414:ldlm_handle_enqueue0()) ### server-side enqueue handler, sending reply(err=0, rc=0) ns: filter-ffff88001721d000 lock: ffff880060a66000/0xaa3ee0fd27648cf2 lrc: 3/0,0 mode: PR/PR res: 679272/0 rrc: 1 type: EXT [0->18446744073709551615] (req 0->18446744073709551615) flags: 0x0 nid: 10.10.4.220@tcp remote: 0xde77b732306ea406 expref: 4042 pid: 1514 timeout: 0 lvb_type: 1
00010000:00010000:0.0:1361580861.001952:0:1514:0:(ldlm_lockd.c:1447:ldlm_handle_enqueue0()) ### server-side enqueue handler END (lock ffff880060a66000, rc 0)
====================

As you can see, ldlm_resource_get() failed with -ENOMEM: ldlm_resource_get() ==> ofd_lvbo_init() ==> OBD_ALLOC_PTR(lvb). That means the "lvb" on the lock resource was NULL because there was not enough memory at that moment. But ldlm_handle_enqueue0() ignored that "lvb" error and went ahead; later it packed a zero-sized "lvb" into the reply. So it was NOT a variable-sized LVB issue, but a memory issue.
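
To make that failure mode concrete, here is a minimal stand-alone sketch; the struct, field names and lvbo_init() below are simplified stand-ins (assumptions for illustration), not the real ldlm/ofd code:

/* Illustrative sketch only -- NOT the actual Lustre functions. The struct
 * and lvbo_init() below are stand-ins for the lock resource and
 * ofd_lvbo_init(); they model how an ignored allocation failure ends up as
 * a zero-length LVB packed into the enqueue reply. */
#include <errno.h>
#include <stddef.h>
#include <stdio.h>

struct lvb_res {                /* stand-in for the lock resource   */
        void   *lvb_data;       /* NULL when the allocation failed  */
        size_t  lvb_len;        /* 0 when the allocation failed     */
};

/* stand-in for ofd_lvbo_init(): pretend OBD_ALLOC_PTR(lvb) returned NULL */
static int lvbo_init(struct lvb_res *res)
{
        res->lvb_data = NULL;   /* no memory for the lvb at that moment */
        res->lvb_len = 0;
        return -ENOMEM;         /* the "rc -12" seen in the OST log     */
}

int main(void)
{
        struct lvb_res res;
        int rc = lvbo_init(&res);

        if (rc != 0)
                fprintf(stderr, "lvbo_init failed for resource: rc %d\n", rc);
        /* ...but, as in ldlm_handle_enqueue0(), the error is not propagated:
         * the enqueue goes ahead and the reply LVB buffer gets res.lvb_len,
         * i.e. 0 bytes, while the 2.1 client insists on 40. */
        printf("packing an LVB of %zu bytes into the reply\n", res.lvb_len);
        return 0;
}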

Comment by nasf (Inactive) [ 07/Mar/13 ]

Cannot allocate "lvb" for ext lock

Comment by nasf (Inactive) [ 07/Mar/13 ]

It is a duplicate of LU-2790: failure to allocate the "lvb" for an extent lock.
