Details
- Type: Bug
- Resolution: Fixed
- Priority: Major
- Fix Version: Lustre 2.13.0
- Labels: None
- Severity: 3
Description
These observations are drawn from this test failure and its logs (master plus a few unrelated patches):
https://testing.whamcloud.com/test_sets/7ed37e66-2b3a-11e9-90fb-52540065bddc
In sanity test 231a, dd fails for no apparent reason in the middle of a fairly innocuous test.
A disconnect/idle request is sent:
00000100:00100000:0.0:1549584426.013412:0:3658:0:(client.c:1631:ptlrpc_send_new_req()) Sending RPC pname:cluuid:pid:xid:nid:opc ptlrpcd_rcv:6a3d5a57-20e1-3c31-d0be-69db5e11ff93:3658:1624855787560768:10.9.5.221@tcp:9
It has not completed yet.
A lock request is made:
00010000:00010000:1.0:1549584426.079672:0:25756:0:(ldlm_request.c:1019:ldlm_cli_enqueue()) ### sending request ns: lustre-OST0006-osc-ffff8853e612d000 lock: ffff8853c1acd200/0x600e7a448b87755c lrc: 3/0,1 mode: -/PW res: [0x75c0:0x0:0x0].0x0 rrc: 2 type: EXT [0->4194303] (req 0>4194303) flags: 0x0 nid: local remote: 0x0 expref: -99 pid: 25756 timeout: 0 lvb_type: 1
00000100:00080000:1.0:1549584426.079679:0:25756:0:(client.c:1575:ptlrpc_send_new_req()) @@@ req waiting for recovery: (FULL != CONNECTING) req@ffff8853e63e3a80
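For reference, the "req waiting for recovery" message comes from the delay path in ptlrpc_send_new_req(); a simplified, paraphrased sketch (not verbatim, details vary between versions):

	spin_lock(&imp->imp_lock);
	...
	if (ptlrpc_import_delay_req(imp, req, &rc)) {
		/* the import cannot take the request yet (here it is
		 * CONNECTING because of the idle disconnect), so park the
		 * request on imp_delayed_list until the import is FULL again */
		DEBUG_REQ(D_HA, req, "req waiting for recovery: (%s != %s)",
			  ptlrpc_import_state_name(req->rq_send_state),
			  ptlrpc_import_state_name(imp->imp_state));
		list_add_tail(&req->rq_list, &imp->imp_delayed_list);
		spin_unlock(&imp->imp_lock);
		RETURN(0);
	}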
The disconnect comes back:
00000100:00080000:0.0:1549584426.491344:0:3658:0:(import.c:1677:ptlrpc_disconnect_idle_interpret()) @@@ inflight=2, refcount=6: rc = 0 req@ffff8853e63e2400 x1624855787560768/t0(0) o9->lustre-OST0006-osc-ffff8853e612d000@10.9.5.221@tcp:28/4 lens 224/192 e 0 to 0 dl 1549584432 ref 1 fl Interpret:RN/0/0 rc 0/0
It notices that a request has arrived in the meantime and reconnects immediately:
00000100:00080000:0.0:1549584426.491350:0:3658:0:(import.c:1693:ptlrpc_disconnect_idle_interpret()) ffff8853fbff1800 lustre-OST0006_UUID: changing import state from IDLE to NEW
00000100:00080000:0.0:1549584426.491361:0:3658:0:(import.c:683:ptlrpc_connect_import()) ffff8853fbff1800 lustre-OST0006_UUID: changing import state from NEW to CONNECTING
The delayed lock request is then rejected because of the import generation mismatch:
00000100:00080000:0.0:1549584426.491919:0:3658:0:(recover.c:223:ptlrpc_wake_delayed()) @@@ waking (set ffff88538df3fc80): req@ffff8853e63e3a80 x1624855787560912/t0(0) o101->lustre-OST0006-osc-ffff8853e612d000@10.9.5.221@tcp:28/4 lens 328/400 e 0 to 0 dl 0 ref 2 fl Rpc:W/0/ffffffff rc 0/-1
00000100:00100000:0.0:1549584426.491929:0:3658:0:(client.c:2061:ptlrpc_check_set()) Completed RPC pname:cluuid:pid:xid:nid:opc ptlrpcd_rcv:6a3d5a57-20e1-3c31-d0be-69db5e11ff93:3658:1624855787560928:10.9.5.221@tcp:8
00000100:00020000:1.0:1549584426.491944:0:25756:0:(client.c:1193:ptlrpc_import_delay_req()) @@@ req wrong generation: req@ffff8853e63e3a80 x1624855787560912/t0(0) o101->lustre-OST0006-osc-ffff8853e612d000@10.9.5.221@tcp:28/4 lens 328/400 e 0 to 0 dl 0 ref 2 fl Rpc:W/0/ffffffff rc 0/-1
00000100:00100000:1.0:1549584426.496060:0:25756:0:(client.c:2061:ptlrpc_check_set()) Completed RPC pname:cluuid:pid:xid:nid:opc dd:6a3d5a57-20e1-3c31-d0be-69db5e11ff93:25756:1624855787560912:10.9.5.221@tcp:101
00010000:00010000:1.0:1549584426.496066:0:25756:0:(ldlm_request.c:592:ldlm_cli_enqueue_fini()) ### client-side enqueue END (FAILED) ns: lustre-OST0006-osc-ffff8853e612d000 lock: ffff8853c1acd200/0x600e7a448b87755c lrc: 4/0,1 mode: -/PW res: [0x75c0:0x0:0x0].0x0 rrc: 2 type: EXT [0->4194303] (req 0>4194303) flags: 0x0 nid: local remote: 0x0 expref: -99 pid: 25756 timeout: 0 lvb_type: 1
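The "req wrong generation" rejection corresponds to a check in ptlrpc_import_delay_req() along these lines (simplified, paraphrased, not verbatim):

	} else if (req->rq_import_generation != imp->imp_generation) {
		/* the request was stamped with an older import generation,
		 * so fail it with -EIO instead of resending it */
		DEBUG_REQ(D_ERROR, req, "req wrong generation:");
		*status = -EIO;
	}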
Here's the problem.
ptlrpc_request_alloc_internal uses IMP_IDLE as a way to tell if it should start a new connection:
if (unlikely(imp->imp_state == LUSTRE_IMP_IDLE)) {
And increments the generation number as part of that:
if (imp->imp_state == LUSTRE_IMP_IDLE) {
	imp->imp_generation++;
	imp->imp_initiated_at = imp->imp_generation;
	imp->imp_state = LUSTRE_IMP_NEW;
	connect = 1;
After this, the generation stored in the request is set to match that of the import. So if this branch is not taken, the request is stamped with the existing import generation.
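As I read it, the stamping happens in the same ptlrpc_send_new_req() path sketched above, roughly (paraphrased, not verbatim):

	spin_lock(&imp->imp_lock);
	/* the request remembers the import generation it is sent under;
	 * a later generation bump makes this value stale */
	req->rq_import_generation = imp->imp_generation;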
However, the idle code (ptlrpc_disconnect_prep_req) sets the import state to connecting:
IMPORT_SET_STATE(imp, LUSTRE_IMP_CONNECTING);
Then it sends the request, and sets IMP_IDLE only once the import is fully idle, in ptlrpc_disconnect_idle_interpret:
/* DISCONNECT reply can be late and another connection can just
 * be initiated. so we have to abort disconnection. */
if (req->rq_import_generation == imp->imp_generation &&
    imp->imp_state != LUSTRE_IMP_CLOSED) {
	LASSERTF(imp->imp_state == LUSTRE_IMP_CONNECTING, "%s\n",
		 ptlrpc_import_state_name(imp->imp_state));
	imp->imp_state = LUSTRE_IMP_IDLE;
As a result, ptlrpc_request_alloc_internal does not notice that the import is in the process of disconnecting, and therefore does not start the reconnect early, as intended.
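To summarize the race as I understand it (timeline reconstructed from the log above):
- ptlrpcd_rcv starts the idle disconnect: the import goes to CONNECTING and the OST_DISCONNECT (opc 9) is sent.
- dd (pid 25756) allocates an LDLM enqueue: the import is CONNECTING, not IDLE, so ptlrpc_request_alloc_internal neither starts an early reconnect nor bumps imp_generation; the request is stamped with the old generation and parked on the delayed list (FULL != CONNECTING).
- The DISCONNECT reply arrives: imp_inflight > 1, so ptlrpc_disconnect_idle_interpret bumps imp_generation and reconnects.
- The woken enqueue no longer matches imp_generation and fails with -EIO ("req wrong generation"), which is presumably the error dd hits.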
But we do increment the generation in ptlrpc_disconnect_idle_interpret when another request is already in flight:
if (atomic_read(&imp->imp_inflight) > 1) {
	imp->imp_generation++;
	imp->imp_initiated_at = imp->imp_generation;
	IMPORT_SET_STATE_NOLOCK(imp, LUSTRE_IMP_NEW);
	connect = 1;
I am not really sure of the correct handling of import generation overall. My first instinct is that we should not increment it in this case, as we connect immediately...
But perhaps that's incorrect as well. I'm hoping that Alex will take a look at this.
Attachments
Issue Links
- is duplicated by
  - LU-11976 req wrong generation leading to I/O errors on 2.12 clients (Resolved)
  - LU-11958 sanity test_244: FAIL: sendfile+grouplock failed: sendfile_copy: assertion 'sret > 0' failed: sendfile failed: Input/output error (Closed)
- is related to
  - LU-11964 Heavy load and soft lockups on MDS with DOM (Resolved)
  - LU-11559 sanity-sec test_21: req wrong generation: (Resolved)
- is related to
  - LU-7236 OST connect and disconnect on demand (Resolved)