Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Critical
    • Fix Version/s: Lustre 2.7.0, Lustre 2.5.4
    • Affects Version/s: Lustre 2.6.0, Lustre 2.7.0
    • Severity: 3
    • 15337

    Description

      The fix in LU-5266 was not entirely correct: check for a resend before removing
      the lock from the resource; otherwise, conflicting locks can be granted in parallel.

    Attachments

    Issue Links

    Activity

            [LU-5496] fix for LU-5266

            jlevi Jodi Levi (Inactive) added a comment - Patches have landed to Master
            pjones Peter Jones added a comment - Heh. ok.

            vitaly_fertman Vitaly Fertman added a comment - Heh, did not manage to submit the 2nd version before it landed, so here is a separate patch:
            remote: http://review.whamcloud.com/11644
            pjones Peter Jones added a comment - Landed for 2.7

            For the record, one way this issue may manifest:

            Aug 19 19:13:34 lola-24 kernel: Lustre: 3728:0:(client.c:1926:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1408500806/real 1408500806]  req@ffff880ff523c800 x1476487427817788/t0(0) o101->soaked-OST0000-osc-ffff8810329a9800@192.168.1.102@o2ib:28/4 lens 328/400 e 0 to 1 dl 1408500814 ref 1 fl Rpc:X/0/ffffffff rc 0/-1
            Aug 19 19:13:34 lola-24 kernel: Lustre: soaked-OST0000-osc-ffff8810329a9800: Connection to soaked-OST0000 (at 192.168.1.102@o2ib) was lost; in progress operations using this service will wait for recovery to complete
            Aug 19 19:13:34 lola-24 kernel: Lustre: soaked-OST0000-osc-ffff8810329a9800: Connection restored to soaked-OST0000 (at 192.168.1.102@o2ib)
            Aug 19 19:13:34 lola-24 kernel: LustreError: 11-0: soaked-OST0000-osc-ffff8810329a9800: Communicating with 192.168.1.102@o2ib, operation ldlm_enqueue failed with -12.
            

            On the OSS:

            00010000:00010000:28.0:1408500814.882478:0:5651:0:(ldlm_lockd.c:1268:ldlm_handle_enqueue0()) @@@ found existing lock cookie 0x840d55bbc87132f5  req@ffff88082f115050 x1476487427817788/t0(0) o101->c1d7cd54-55f6-0482-0887-cf6de8216f19@192.168.1.124@o2ib1:0/0 lens 328/0 e 0 to 0 dl 1408500821 ref 1 fl Interpret:/2/ffffffff rc 0/-1
            [...]
            00010000:00000001:28.0:1408500814.882522:0:5651:0:(ldlm_lock.c:441:ldlm_lock_destroy_nolock()) Process leaving
            00010000:00000001:28.0:1408500814.882523:0:5651:0:(ldlm_lock.c:1685:ldlm_lock_enqueue()) Process leaving via out (rc=4294967284 : 4294967284 : 0xfffffff4)
            00010000:00000001:28.0:1408500814.882525:0:5651:0:(ldlm_lockd.c:1338:ldlm_handle_enqueue0()) Process leaving via out (rc=4294967284 : 4294967284 : 0xfffffff4)
            00010000:00010000:28.0:1408500814.882529:0:5651:0:(ldlm_lockd.c:1422:ldlm_handle_enqueue0()) ### server-side enqueue handler, sending reply(err=-12, rc=-12) ns: filter-soaked-OST0000_UUID lock: ffff880823f1bbc0/0x840d55bbc87132f5 lrc: 1/0,0 mode: PW/PW res: [0x236111:0x0:0x0].0 rrc: 1 type: EXT [0->18446744073709551615] (req 0->18446744073709551615) flags: 0x44000000000000 nid: 192.168.1.124@o2ib1 remote: 0xcc587d962f96a2f5 expref: 1019 pid: 5651 timeout: 0 lvb_type: 0
            

            The ENOMEM comes from here in ldlm_lock_enqueue():

                    ldlm_resource_unlink_lock(lock);
                    if (res->lr_type == LDLM_EXTENT && lock->l_tree_node == NULL) {
                            if (node == NULL) {
                                    ldlm_lock_destroy_nolock(lock);
                                    GOTO(out, rc = -ENOMEM);
                            }
            
                            CFS_INIT_LIST_HEAD(&node->li_group);
                            ldlm_interval_attach(node, lock);
                            node = NULL;
                    }
            
            liwei Li Wei (Inactive) added a comment -
            pjones Peter Jones added a comment - Thanks Vitaly!
            vitaly_fertman Vitaly Fertman added a comment - code: http://review.whamcloud.com/#/c/11469/

            People

              Assignee: liwei Li Wei (Inactive)
              Reporter: vitaly_fertman Vitaly Fertman
              Votes: 0
              Watchers: 8

              Dates

                Created:
                Updated:
                Resolved: