[LU-5496] fix for LU-5266 Created: 15/Aug/14  Updated: 28/Apr/15  Resolved: 02/Oct/14

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.6.0, Lustre 2.7.0
Fix Version/s: Lustre 2.7.0, Lustre 2.5.4

Type: Bug
Priority: Critical
Reporter: Vitaly Fertman
Assignee: Li Wei (Inactive)
Resolution: Fixed
Votes: 0
Labels: patch

Issue Links:
Related
is related to LU-5266 LBUG on Failover -ldlm_process_extent... Resolved
Severity: 3
Rank (Obsolete): 15337

 Description   

The fix in LU-5266 was not entirely correct: the check for a resend must be done
before the lock is removed from the resource; otherwise conflicting locks can be
granted in parallel.
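
A minimal sketch of the ordering described above, written as a toy model and not as Lustre code. All names here (toy_lock, toy_resource, toy_enqueue and the helpers) are hypothetical, and every lock is assumed to be exclusive. The point it illustrates is only that a resent request for an already granted lock returns before anything is unlinked, so the lock never disappears from the resource.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy model only: these types and helpers are hypothetical, not Lustre code. */
enum { TOY_MAX_LOCKS = 4 };

struct toy_lock {
        int  owner;        /* client that holds or requests the lock */
        bool on_resource;  /* true while linked into the granted list */
};

struct toy_resource {
        struct toy_lock *granted[TOY_MAX_LOCKS];
        size_t           ngranted;
};

static void toy_link(struct toy_resource *res, struct toy_lock *lk)
{
        assert(res->ngranted < TOY_MAX_LOCKS);
        res->granted[res->ngranted++] = lk;
        lk->on_resource = true;
}

static void toy_unlink(struct toy_resource *res, struct toy_lock *lk)
{
        for (size_t i = 0; i < res->ngranted; i++) {
                if (res->granted[i] == lk) {
                        /* Fill the hole with the last entry. */
                        res->granted[i] = res->granted[--res->ngranted];
                        break;
                }
        }
        lk->on_resource = false;
}

/* Every lock is exclusive here, so any other granted lock is a conflict. */
static bool toy_conflicts(const struct toy_resource *res,
                          const struct toy_lock *lk)
{
        for (size_t i = 0; i < res->ngranted; i++)
                if (res->granted[i] != lk)
                        return true;
        return false;
}

/* Returns 0 if the lock is (still) granted, -1 if it has to wait. */
static int toy_enqueue(struct toy_resource *res, struct toy_lock *lk,
                       bool resent)
{
        /* Resend check first: a lock that is already granted stays linked,
         * so competing enqueues keep seeing it as a conflict. */
        if (resent && lk->on_resource)
                return 0;

        toy_unlink(res, lk);
        if (toy_conflicts(res, lk))
                return -1;
        toy_link(res, lk);
        return 0;
}
```

In this model, granting lock A, handling the same request again with resent set, and then enqueueing lock B leaves A linked throughout, so B is refused. If the unlink were done before the resend check, A would be briefly absent from the resource, which is the window the description refers to.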



 Comments   
Comment by Vitaly Fertman [ 15/Aug/14 ]

code : http://review.whamcloud.com/#/c/11469/

Comment by Peter Jones [ 15/Aug/14 ]

Thanks Vitaly!

Comment by Li Wei (Inactive) [ 20/Aug/14 ]

For the record, one way this may manifest is:

Aug 19 19:13:34 lola-24 kernel: Lustre: 3728:0:(client.c:1926:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1408500806/real 1408500806]  req@ffff880ff523c800 x1476487427817788/t0(0) o101->soaked-OST0000-osc-ffff8810329a9800@192.168.1.102@o2ib:28/4 lens 328/400 e 0 to 1 dl 1408500814 ref 1 fl Rpc:X/0/ffffffff rc 0/-1
Aug 19 19:13:34 lola-24 kernel: Lustre: soaked-OST0000-osc-ffff8810329a9800: Connection to soaked-OST0000 (at 192.168.1.102@o2ib) was lost; in progress operations using this service will wait for recovery to complete
Aug 19 19:13:34 lola-24 kernel: Lustre: soaked-OST0000-osc-ffff8810329a9800: Connection restored to soaked-OST0000 (at 192.168.1.102@o2ib)
Aug 19 19:13:34 lola-24 kernel: LustreError: 11-0: soaked-OST0000-osc-ffff8810329a9800: Communicating with 192.168.1.102@o2ib, operation ldlm_enqueue failed with -12.

On the OSS:

00010000:00010000:28.0:1408500814.882478:0:5651:0:(ldlm_lockd.c:1268:ldlm_handle_enqueue0()) @@@ found existing lock cookie 0x840d55bbc87132f5  req@ffff88082f115050 x1476487427817788/t0(0) o101->c1d7cd54-55f6-0482-0887-cf6de8216f19@192.168.1.124@o2ib1:0/0 lens 328/0 e 0 to 0 dl 1408500821 ref 1 fl Interpret:/2/ffffffff rc 0/-1
[...]
00010000:00000001:28.0:1408500814.882522:0:5651:0:(ldlm_lock.c:441:ldlm_lock_destroy_nolock()) Process leaving
00010000:00000001:28.0:1408500814.882523:0:5651:0:(ldlm_lock.c:1685:ldlm_lock_enqueue()) Process leaving via out (rc=4294967284 : 4294967284 : 0xfffffff4)
00010000:00000001:28.0:1408500814.882525:0:5651:0:(ldlm_lockd.c:1338:ldlm_handle_enqueue0()) Process leaving via out (rc=4294967284 : 4294967284 : 0xfffffff4)
00010000:00010000:28.0:1408500814.882529:0:5651:0:(ldlm_lockd.c:1422:ldlm_handle_enqueue0()) ### server-side enqueue handler, sending reply(err=-12, rc=-12) ns: filter-soaked-OST0000_UUID lock: ffff880823f1bbc0/0x840d55bbc87132f5 lrc: 1/0,0 mode: PW/PW res: [0x236111:0x0:0x0].0 rrc: 1 type: EXT [0->18446744073709551615] (req 0->18446744073709551615) flags: 0x44000000000000 nid: 192.168.1.124@o2ib1 remote: 0xcc587d962f96a2f5 expref: 1019 pid: 5651 timeout: 0 lvb_type: 0
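
The rc=4294967284 (0xfffffff4) in the "Process leaving via out" lines is -12, that is -ENOMEM on Linux, shown as an unsigned 32-bit value; it matches the err=-12 in the reply line. A small sketch of the conversion, where rc_from_trace is a hypothetical helper name and not a Lustre function:

```c
#include <errno.h>
#include <stdint.h>

/* Hypothetical helper: reinterpret an rc that was printed as an unsigned
 * 32-bit number as the signed, errno-style value it stands for.
 * The arithmetic avoids an implementation-defined unsigned-to-signed cast. */
static int32_t rc_from_trace(uint32_t raw)
{
        if (raw <= (uint32_t)INT32_MAX)
                return (int32_t)raw;
        /* Two's complement: raw represents -(UINT32_MAX - raw) - 1. */
        return -(int32_t)(UINT32_MAX - raw) - 1;
}
```

With the value from the trace, rc_from_trace(4294967284u) evaluates to -12, which equals -ENOMEM where ENOMEM is defined as 12.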

The ENOMEM comes from here in ldlm_lock_enqueue():

        ldlm_resource_unlink_lock(lock);
        if (res->lr_type == LDLM_EXTENT && lock->l_tree_node == NULL) {
                if (node == NULL) {
                        ldlm_lock_destroy_nolock(lock);
                        GOTO(out, rc = -ENOMEM);
                }

                CFS_INIT_LIST_HEAD(&node->li_group);
                ldlm_interval_attach(node, lock);
                node = NULL;
        }

Comment by Peter Jones [ 28/Aug/14 ]

Landed for 2.7

Comment by Vitaly Fertman [ 28/Aug/14 ]

Heh, I did not manage to submit the second version before the first one landed, so here is a separate patch:
http://review.whamcloud.com/11644

Comment by Peter Jones [ 28/Aug/14 ]

Heh. ok.

Comment by Jodi Levi (Inactive) [ 02/Oct/14 ]

Patches have landed to Master
