Details

    • Type: Question/Request
    • Resolution: Fixed
    • Priority: Critical
    • Affects Versions: Lustre 2.12.0, Lustre 2.10.8
    • Environment: Client 2.12.0 + patches, Server: 2.10.8 + patches, CentOS 7.6

    Description

      Hello! I'm tracking hung file access on Sherlock/Oak; perhaps someone can help.

      Clients are on Sherlock, running 2.12.0 + patches (https://github.com/stanford-rc/lustre/commits/4f7519966aebb21589e145b010dcdfea6ced6670). The problem only seems to happen for files on the Oak servers, which are running 2.10.8 + patches (https://github.com/stanford-rc/lustre/commits/73a88a805990aed3c35d5247cc886a7cfc1c527f).

      The problem originates with the following kind of error on the OSS (here 10.0.2.110@o2ib5, running 2.10.8+):

      Jul 11 12:19:58 oak-io3-s2 kernel: LustreError: 177680:0:(ldlm_lib.c:3239:target_bulk_io()) @@@ timeout on bulk WRITE after 100+0s  req@ffff883e01b05450 x1631536567940480/t0(0) o4->83eaf162-e2fc-0143-0a04-bb96aa68816a@10.9.105.47@o2ib4:647/0 lens 4008/448 e 0 to 0 dl 1562873297 ref 1 fl Interpret:/2/0 rc 0/0
      

      And on the corresponding client, here (sh-105-47) 10.9.105.47@o2ib4, running 2.12.0+:

      Jul 11 12:18:17 sh-105-47.int kernel: Lustre: 97415:0:(client.c:2132:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1562872096/real 1562872096]  req@ffff8880aa5acb00 x1631536567940480/t0(0) o4->oak-OST00
      Jul 11 12:18:17 sh-105-47.int kernel: Lustre: oak-OST0075-osc-ffff8864d1205000: Connection to oak-OST0075 (at 10.0.2.110@o2ib5) was lost; in progress operations using this service will wait for recovery to complete
      Jul 11 12:18:17 sh-105-47.int kernel: Lustre: oak-OST0075-osc-ffff8864d1205000: Connection restored to 10.0.2.110@o2ib5 (at 10.0.2.110@o2ib5)
      

      Another example: a file is blocked on sh-24-05 (10.8.24.5@o2ib6). I cannot access it even locally (without routers) on an Oak client:

      [root@oak-gw02 ~]# file /oak/stanford/groups/tpd/cpeng/Kitaev_v2/t1_3.0_t2_0.1_k_1.0_J_0.0/delta_0.0833/Lx_48_Ly_3/m_6000/status.45458106.out
      ...hanging...
      

      But stat does work:

      [root@oak-gw02 ~]# stat /oak/stanford/groups/tpd/cpeng/Kitaev_v2/t1_3.0_t2_0.1_k_1.0_J_0.0/delta_0.0833/Lx_48_Ly_3/m_6000/status.45458106.out
        File: ‘/oak/stanford/groups/tpd/cpeng/Kitaev_v2/t1_3.0_t2_0.1_k_1.0_J_0.0/delta_0.0833/Lx_48_Ly_3/m_6000/status.45458106.out’
        Size: 68141     	Blocks: 144        IO Block: 4194304 regular file
      Device: d8214508h/3626059016d	Inode: 1044835495315506970  Links: 1
      Access: (0660/-rw-rw----)  Uid: (346234/ cpeng18)   Gid: ( 6267/ oak_tpd)
      Access: 2019-07-12 10:08:55.000000000 -0700
      Modify: 2019-07-10 06:40:11.000000000 -0700
      Change: 2019-07-10 06:40:11.000000000 -0700
       Birth: -
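
      A way to probe whether a plain read on such a file hangs, without leaving a shell stuck, might be a bounded read with timeout(1), e.g. (rc=124 indicates the read did not complete in time):

      timeout 30 dd if=/oak/stanford/groups/tpd/cpeng/Kitaev_v2/t1_3.0_t2_0.1_k_1.0_J_0.0/delta_0.0833/Lx_48_Ly_3/m_6000/status.45458106.out of=/dev/null bs=1M count=1; echo "rc=$?"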
      

      The client logs show many disconnects from OST003b:

      Jul 12 09:27:38 sh-24-05.int kernel: Lustre: oak-OST003b-osc-ffff9f3b06215000: Connection to oak-OST003b (at 10.0.2.106@o2ib5) was lost; in progress operations using this service will wait for recovery to complete
      Jul 12 09:27:38 sh-24-05.int kernel: Lustre: oak-OST003b-osc-ffff9f3b06215000: Connection restored to 10.0.2.106@o2ib5 (at 10.0.2.106@o2ib5)
      Jul 12 09:37:40 sh-24-05.int kernel: Lustre: 91043:0:(client.c:2132:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1562948858/real 1562948858]  req@ffff9f4315b34800 x1635098786669312/t0(0) o4->oak-OST003b-osc-ffff9f3b06215000@10.0.2.106@o2ib5:6/4 lens 568/448 e 4 to 1 dl 1562949459 ref 2 fl Rpc:X/2/ffffffff rc 0/-1
      Jul 12 09:37:40 sh-24-05.int kernel: Lustre: oak-OST003b-osc-ffff9f3b06215000: Connection to oak-OST003b (at 10.0.2.106@o2ib5) was lost; in progress operations using this service will wait for recovery to complete
      Jul 12 09:37:40 sh-24-05.int kernel: Lustre: oak-OST003b-osc-ffff9f3b06215000: Connection restored to 10.0.2.106@o2ib5 (at 10.0.2.106@o2ib5)
      Jul 12 09:47:41 sh-24-05.int kernel: Lustre: 91043:0:(client.c:2132:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1562949460/real 1562949460]  req@ffff9f4315b34800 x1635098786669312/t0(0) o4->oak-OST003b-osc-ffff9f3b06215000@10.0.2.106@o2ib5:6/4 lens 568/448 e 4 to 1 dl 1562950061 ref 2 fl Rpc:X/2/ffffffff rc 0/-1
      Jul 12 09:47:41 sh-24-05.int kernel: Lustre: oak-OST003b-osc-ffff9f3b06215000: Connection to oak-OST003b (at 10.0.2.106@o2ib5) was lost; in progress operations using this service will wait for recovery to complete
      Jul 12 09:47:42 sh-24-05.int kernel: Lustre: oak-OST003b-osc-ffff9f3b06215000: Connection restored to 10.0.2.106@o2ib5 (at 10.0.2.106@o2ib5)
      Jul 12 09:57:42 sh-24-05.int kernel: Lustre: 91043:0:(client.c:2132:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1562950061/real 1562950061]  req@ffff9f4315b34800 x1635098786669312/t0(0) o4->oak-OST003b-osc-ffff9f3b06215000@10.0.2.106@o2ib5:6/4 lens 568/448 e 4 to 1 dl 1562950662 ref 2 fl Rpc:X/2/ffffffff rc 0/-1
      Jul 12 09:57:42 sh-24-05.int kernel: Lustre: oak-OST003b-osc-ffff9f3b06215000: Connection to oak-OST003b (at 10.0.2.106@o2ib5) was lost; in progress operations using this service will wait for recovery to complete
      Jul 12 09:57:42 sh-24-05.int kernel: Lustre: oak-OST003b-osc-ffff9f3b06215000: Connection restored to 10.0.2.106@o2ib5 (at 10.0.2.106@o2ib5)
      Jul 12 10:07:43 sh-24-05.int kernel: Lustre: 91043:0:(client.c:2132:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1562950662/real 1562950662]  req@ffff9f4315b34800 x1635098786669312/t0(0) o4->oak-OST003b-osc-ffff9f3b06215000@10.0.2.106@o2ib5:6/4 lens 568/448 e 4 to 1 dl 1562951263 ref 2 fl Rpc:X/2/ffffffff rc 0/-1
      Jul 12 10:07:43 sh-24-05.int kernel: Lustre: oak-OST003b-osc-ffff9f3b06215000: Connection to oak-OST003b (at 10.0.2.106@o2ib5) was lost; in progress operations using this service will wait for recovery to complete
      Jul 12 10:07:43 sh-24-05.int kernel: Lustre: oak-OST003b-osc-ffff9f3b06215000: Connection restored to 10.0.2.106@o2ib5 (at 10.0.2.106@o2ib5)
      

      This file is indeed striped on OST index 59 (0x3b):

      [root@oak-gw02 ~]# lfs getstripe /oak/stanford/groups/tpd/cpeng/Kitaev_v2/t1_3.0_t2_0.1_k_1.0_J_0.0/delta_0.0833/Lx_48_Ly_3/m_6000/status.45458106.out
      /oak/stanford/groups/tpd/cpeng/Kitaev_v2/t1_3.0_t2_0.1_k_1.0_J_0.0/delta_0.0833/Lx_48_Ly_3/m_6000/status.45458106.out
      lmm_stripe_count:  1
      lmm_stripe_size:   1048576
      lmm_pattern:       1
      lmm_layout_gen:    0
      lmm_stripe_offset: 59
      	obdidx		 objid		 objid		 group
      	    59	        330567	      0x50b47	  0x1ac0000400
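
      For reference, OST index 59 is just the decimal form of 0x3b, and the OSS currently serving that target can be double-checked from any client through the OSC connection parameter. A minimal sketch (assuming the ost_conn_uuid parameter is exposed on this client build):

      # decimal OST index 59 is hex 0x3b
      printf 'OST%04x\n' 59
      # which NID currently serves this OST, as seen from a client
      lctl get_param osc.oak-OST003b-*.ost_conn_uuid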
      

      OST003b is on oak-io2-s2 (10.0.2.106@o2ib5) and the matching logs are:

      Jul 12 09:49:22 oak-io2-s2 kernel: LustreError: 337216:0:(ldlm_lib.c:3239:target_bulk_io()) @@@ timeout on bulk WRITE after 100+0s  req@ffff883e20b73c50 x1635098786669312/t0(0) o4->6a1f7b6d-2e7f-b8ba-0d05-a94a9c66ed24@10.8.24.5@o2ib6:246/
      Jul 12 09:49:22 oak-io2-s2 kernel: LustreError: 337216:0:(ldlm_lib.c:3239:target_bulk_io()) Skipped 33 previous similar messages
      Jul 12 09:51:11 oak-io2-s2 kernel: Lustre: oak-OST0059: Bulk IO read error with 35757990-c9b9-0d25-75ab-51daa53c34a7 (at 10.9.101.8@o2ib4), client will retry: rc -110
      Jul 12 09:51:11 oak-io2-s2 kernel: Lustre: Skipped 10 previous similar messages
      Jul 12 09:51:48 oak-io2-s2 kernel: Lustre: oak-OST004b: Connection restored to 7d38b821-565e-3c2d-7913-4b3451580d80 (at 10.9.108.19@o2ib4)
      Jul 12 09:51:48 oak-io2-s2 kernel: Lustre: Skipped 53 previous similar messages
      Jul 12 09:55:54 oak-io2-s2 kernel: Lustre: oak-OST0043: Client ed05c118-550c-6d50-a22a-90d69ede3f9d (at 10.8.25.4@o2ib6) reconnecting
      Jul 12 09:55:54 oak-io2-s2 kernel: Lustre: Skipped 30 previous similar messages
      Jul 12 09:57:51 oak-io2-s2 kernel: Lustre: oak-OST0053: Bulk IO write error with 3a089bf9-b1d9-abd2-566d-ba0d59f75130 (at 10.8.12.2@o2ib6), client will retry: rc = -110
      Jul 12 09:57:51 oak-io2-s2 kernel: Lustre: Skipped 22 previous similar messages
      Jul 12 09:58:18 oak-io2-s2 kernel: Lustre: oak-OST003b: haven't heard from client aca6646c-58d6-0250-4505-9a1b573dc90f (at 10.9.112.3@o2ib4) in 227 seconds. I think it's dead, and I am evicting it. exp ffff883e5ef6d800, cur 1562950698 exp
      Jul 12 09:58:18 oak-io2-s2 kernel: Lustre: Skipped 23 previous similar messages
      Jul 12 09:58:29 oak-io2-s2 kernel: Lustre: oak-OST0053: haven't heard from client aca6646c-58d6-0250-4505-9a1b573dc90f (at 10.9.112.3@o2ib4) in 227 seconds. I think it's dead, and I am evicting it. exp ffff883bc3f1b000, cur 1562950709 exp
      Jul 12 09:59:23 oak-io2-s2 kernel: LustreError: 337286:0:(ldlm_lib.c:3239:target_bulk_io()) @@@ timeout on bulk WRITE after 100+0s  req@ffff883bb150d850 x1635098786669312/t0(0) o4->6a1f7b6d-2e7f-b8ba-0d05-a94a9c66ed24@10.8.24.5@o2ib6:92/0
      Jul 12 09:59:23 oak-io2-s2 kernel: LustreError: 337286:0:(ldlm_lib.c:3239:target_bulk_io()) Skipped 33 previous similar messages
      Jul 12 10:01:12 oak-io2-s2 kernel: Lustre: oak-OST0059: Bulk IO read error with 35757990-c9b9-0d25-75ab-51daa53c34a7 (at 10.9.101.8@o2ib4), client will retry: rc -110
      Jul 12 10:01:12 oak-io2-s2 kernel: Lustre: Skipped 10 previous similar messages
      Jul 12 10:01:49 oak-io2-s2 kernel: Lustre: oak-OST004b: Connection restored to 7d38b821-565e-3c2d-7913-4b3451580d80 (at 10.9.108.19@o2ib4)
      Jul 12 10:01:49 oak-io2-s2 kernel: Lustre: Skipped 29 previous similar messages
      Jul 12 10:05:55 oak-io2-s2 kernel: Lustre: oak-OST0043: Client ed05c118-550c-6d50-a22a-90d69ede3f9d (at 10.8.25.4@o2ib6) reconnecting
      Jul 12 10:05:55 oak-io2-s2 kernel: Lustre: Skipped 28 previous similar messages
      Jul 12 10:07:52 oak-io2-s2 kernel: Lustre: oak-OST0053: Bulk IO write error with 3a089bf9-b1d9-abd2-566d-ba0d59f75130 (at 10.8.12.2@o2ib6), client will retry: rc = -110
      Jul 12 10:07:52 oak-io2-s2 kernel: Lustre: Skipped 20 previous similar messages
      Jul 12 10:09:23 oak-io2-s2 kernel: LustreError: 337506:0:(ldlm_lib.c:3239:target_bulk_io()) @@@ timeout on bulk WRITE after 100+0s  req@ffff881d0aba5450 x1635098786669312/t0(0) o4->6a1f7b6d-2e7f-b8ba-0d05-a94a9c66ed24@10.8.24.5@o2ib6:693/
      Jul 12 10:09:23 oak-io2-s2 kernel: LustreError: 337506:0:(ldlm_lib.c:3239:target_bulk_io()) Skipped 31 previous similar messages
      Jul 12 10:11:13 oak-io2-s2 kernel: Lustre: oak-OST0059: Bulk IO read error with 35757990-c9b9-0d25-75ab-51daa53c34a7 (at 10.9.101.8@o2ib4), client will retry: rc -110
      Jul 12 10:11:13 oak-io2-s2 kernel: Lustre: Skipped 10 previous similar messages
      Jul 12 10:11:50 oak-io2-s2 kernel: Lustre: oak-OST004b: Connection restored to 7d38b821-565e-3c2d-7913-4b3451580d80 (at 10.9.108.19@o2ib4)
      Jul 12 10:11:50 oak-io2-s2 kernel: Lustre: Skipped 74 previous similar messages
      Jul 12 10:15:56 oak-io2-s2 kernel: Lustre: oak-OST0043: Client ed05c118-550c-6d50-a22a-90d69ede3f9d (at 10.8.25.4@o2ib6) reconnecting
      Jul 12 10:15:56 oak-io2-s2 kernel: Lustre: Skipped 28 previous similar messages
      Jul 12 10:17:53 oak-io2-s2 kernel: Lustre: oak-OST0053: Bulk IO write error with 3a089bf9-b1d9-abd2-566d-ba0d59f75130 (at 10.8.12.2@o2ib6), client will retry: rc = -110
      Jul 12 10:17:53 oak-io2-s2 kernel: Lustre: Skipped 20 previous similar messages
      Jul 12 10:19:24 oak-io2-s2 kernel: LustreError: 337169:0:(ldlm_lib.c:3239:target_bulk_io()) @@@ timeout on bulk WRITE after 100+0s  req@ffff883bcdb45450 x1635098786669312/t0(0) o4->6a1f7b6d-2e7f-b8ba-0d05-a94a9c66ed24@10.8.24.5@o2ib6:539/
      Jul 12 10:19:24 oak-io2-s2 kernel: LustreError: 337169:0:(ldlm_lib.c:3239:target_bulk_io()) Skipped 31 previous similar messages
      

      The workaround we have found is to reboot the clients that are doing this. Otherwise, access to (some) files that are open on these clients just hangs from anywhere, leading to failed jobs.
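
      Before rebooting, one way to narrow down which client is the culprit might be to pull the client NID out of the OSS 'timeout on bulk' messages and then look at that client's import and lock state for the affected OST. A sketch (standard lctl parameters, shown here only as an illustration):

      # on the OSS: count the client NIDs named in recent bulk timeouts
      journalctl -k | grep 'timeout on bulk' | grep -o '[0-9.]*@o2ib[0-9]*' | sort | uniq -c | sort -rn

      # on a suspect client: import state / in-flight RPCs and lock count for the affected OST
      lctl get_param osc.oak-OST003b-*.import
      lctl get_param ldlm.namespaces.oak-OST003b-*.lock_count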

      The majority of the timeout messages are 'timeout on bulk WRITE', but I also see a few 'timeout on bulk READ'. You can see below that they come from all OSSes/OSTs and from several clients.

      oak-io2-s2: Jul 11 20:47:58 oak-io2-s2 kernel: LustreError: 337506:0:(ldlm_lib.c:3239:target_bulk_io()) @@@ timeout on bulk WRITE after 100+0s  req@ffff880064249450 x1635098786669312/t0(0) o4->6a1f7b6d-2e7f-b8ba-0d05-a94a9c66ed24@10.8.24.5@o2ib6:173/0 lens 568/448 e 0 to 0 dl 1562903778 ref 1 fl Interpret:/2/0 rc 0/0
      oak-io2-s1: Jul 11 20:50:48 oak-io2-s1 kernel: LustreError: 269496:0:(ldlm_lib.c:3239:target_bulk_io()) @@@ timeout on bulk WRITE after 100+0s  req@ffff881fdd355850 x1631642338211952/t0(0) o4->2849578e-5060-1d9c-ed8e-2c6ecf8dbfdc@10.9.103.21@o2ib4:618/0 lens 520/448 e 0 to 0 dl 1562903468 ref 1 fl Interpret:/2/0 rc 0/0
      oak-io1-s1: Jul 11 20:47:28 oak-io1-s1 kernel: LustreError: 346440:0:(ldlm_lib.c:3239:target_bulk_io()) @@@ timeout on bulk WRITE after 100+0s  req@ffff883e8e740450 x1634178743345136/t0(0) o4->2e50e863-cfe9-6b65-cc33-1b949391a2a5@10.9.109.69@o2ib4:418/0 lens 9248/448 e 0 to 0 dl 1562903268 ref 1 fl Interpret:/2/0 rc 0/0
      oak-io3-s2: Jul 11 20:52:03 oak-io3-s2 kernel: LustreError: 160887:0:(ldlm_lib.c:3239:target_bulk_io()) @@@ timeout on bulk WRITE after 100+0s  req@ffff8835760a5850 x1631812409859184/t0(0) o4->7b520521-bd66-0427-aeea-fed858d66fc7@10.8.30.13@o2ib6:418/0 lens 544/448 e 0 to 0 dl 1562904023 ref 1 fl Interpret:/2/0 rc 0/0
      oak-io1-s2: Jul 11 20:54:08 oak-io1-s2 kernel: LustreError: 263511:0:(ldlm_lib.c:3239:target_bulk_io()) @@@ timeout on bulk WRITE after 100+0s  req@ffff881b99d74850 x1631536876509328/t0(0) o4->88f0afe1-d8a4-2cd9-5b8e-a652402618f0@10.9.103.16@o2ib4:125/0 lens 4584/448 e 0 to 0 dl 1562903730 ref 1 fl Interpret:/2/0 rc 0/0
      oak-io1-s1: Jul 11 20:57:33 oak-io1-s1 kernel: LustreError: 333909:0:(ldlm_lib.c:3239:target_bulk_io()) @@@ timeout on bulk WRITE after 100+0s  req@ffff883e47c9dc50 x1634178743345136/t0(0) o4->2e50e863-cfe9-6b65-cc33-1b949391a2a5@10.9.109.69@o2ib4:268/0 lens 9248/448 e 0 to 0 dl 1562903873 ref 1 fl Interpret:/2/0 rc 0/0
      oak-io2-s2: Jul 11 20:57:59 oak-io2-s2 kernel: LustreError: 337756:0:(ldlm_lib.c:3239:target_bulk_io()) @@@ timeout on bulk WRITE after 100+0s  req@ffff88003532c450 x1635098786669312/t0(0) o4->6a1f7b6d-2e7f-b8ba-0d05-a94a9c66ed24@10.8.24.5@o2ib6:19/0 lens 568/448 e 0 to 0 dl 1562904379 ref 1 fl Interpret:/2/0 rc 0/0
      oak-io3-s1: Jul 11 20:58:00 oak-io3-s1 kernel: LustreError: 166558:0:(ldlm_lib.c:3239:target_bulk_io()) @@@ timeout on bulk WRITE after 100+0s  req@ffff88005bd8d050 x1631571779573440/t0(0) o4->65acd50a-6ff2-3a7f-4743-551235e323f1@10.8.30.22@o2ib6:293/0 lens 10728/448 e 0 to 0 dl 1562903898 ref 1 fl Interpret:/2/0 rc 0/0
      oak-io2-s1: Jul 11 21:00:53 oak-io2-s1 kernel: LustreError: 269486:0:(ldlm_lib.c:3239:target_bulk_io()) @@@ timeout on bulk WRITE after 100+0s  req@ffff881c9332cc50 x1631642338211952/t0(0) o4->2849578e-5060-1d9c-ed8e-2c6ecf8dbfdc@10.9.103.21@o2ib4:468/0 lens 520/448 e 0 to 0 dl 1562904073 ref 1 fl Interpret:/2/0 rc 0/0
      oak-io3-s2: Jul 11 21:02:04 oak-io3-s2 kernel: LustreError: 160141:0:(ldlm_lib.c:3239:target_bulk_io()) @@@ timeout on bulk WRITE after 100+0s  req@ffff8838d7b30450 x1631812409859184/t0(0) o4->7b520521-bd66-0427-aeea-fed858d66fc7@10.8.30.13@o2ib6:264/0 lens 544/448 e 0 to 0 dl 1562904624 ref 1 fl Interpret:/2/0 rc 0/0
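
      A rough per-OSS breakdown of READ vs WRITE bulk timeouts can be obtained with something like this (assuming a clush node group @oss covering the OSS nodes):

      clush -w @oss "journalctl -k | grep -c 'timeout on bulk WRITE'"
      clush -w @oss "journalctl -k | grep -c 'timeout on bulk READ'"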
      

      We noticed yesterday that it looks like some client is holding a write lock (the one in the error message), and when we reboot that client, it is finally evicted and access to the file is restored. Do you think there are any client patches in 2.12.1 or 2.12.2 that could help with this? Any other suggestions? Thanks!

      Stephane

      Attachments

        Activity

          [LU-12543] timeout on bulk READ/WRITE
          pjones Peter Jones added a comment -

          ok - thanks


          sthiell Stephane Thiell added a comment -

          To us, it is resolved now that we have gotten rid of all 2.12.0-based clients. There is no sign of the issue with newer "vanilla" 2.12.3 clients, nor when using Amir's patches from LU-12906/LU-12907 on top of 2.12.3, which we are rolling out to all clients at the moment. Logs on the servers (Oak, 2.10.8_3) are much cleaner now. Note that this issue has also never occurred with the direct-attached 2.10.8 clients that we have on Oak, so it must have been a regression in 2.12.0. Thanks!

          pjones Peter Jones added a comment -

          ok, so can we consider this ticket resolved?


          sthiell Stephane Thiell added a comment -

          It looks like it was a 2.12.0 client issue, which is resolved with Lustre 2.12.3. Indeed, as we have progressively upgraded our clients on Sherlock to 2.12.3, we have seen the 'timeout on bulk READ/WRITE' errors disappear on Oak (the Lustre servers running 2.10.8).


          sthiell Stephane Thiell added a comment -

          We continue to drain and reboot the 2.12 clients that are hitting this, and we think it is helping. At this time, one OSS (`oak-io3-s1`) stopped logging these timeouts on Jul 14 18:25:47; the others are still showing issues.

          [root@oak-hn01 sthiell.root]# clush -w@oss journalctl -n 100000 -k \| grep timeout \| tail -1
          oak-io1-s2: Jul 15 15:46:14 oak-io1-s2 kernel: LustreError: 263524:0:(ldlm_lib.c:3239:target_bulk_io()) @@@ timeout on bulk WRITE after 100+0s  req@ffff883d2b628450 x1631773946505376/t0(0) o4->ca61ebcc-8fe9-6078-6900-091596d38ce5@10.8.31.1@o2ib6:324/0 lens 544/448 e 0 to 0 dl 1563230844 ref 1 fl Interpret:/2/0 rc 0/0
          oak-io1-s1: Jul 15 15:47:28 oak-io1-s1 kernel: LustreError: 333887:0:(ldlm_lib.c:3239:target_bulk_io()) @@@ timeout on bulk READ after 100+0s  req@ffff881ff9aa7050 x1631702845192384/t0(0) o3->e4074bcb-d3a9-e7ce-f0ff-c9299e12af69@10.9.101.39@o2ib4:346/0 lens 488/432 e 0 to 0 dl 1563230866 ref 1 fl Interpret:H/2/0 rc 0/0
          oak-io2-s1: Jul 15 15:48:05 oak-io2-s1 kernel: LustreError: 269437:0:(ldlm_lib.c:3239:target_bulk_io()) @@@ timeout on bulk WRITE after 100+0s  req@ffff881c5deb6850 x1631596458047008/t0(0) o4->a28ed951-f21d-a025-1515-b542dcf02373@10.9.103.20@o2ib4:382/0 lens 5080/448 e 0 to 0 dl 1563230902 ref 1 fl Interpret:/2/0 rc 0/0
          oak-io3-s1: Jul 14 18:25:47 oak-io3-s1 kernel: LustreError: 13530:0:(ldlm_lib.c:3239:target_bulk_io()) @@@ timeout on bulk WRITE after 100+0s  req@ffff882e3b8fa050 x1631682526197744/t0(0) o4->b0caabfb-5248-65f6-c4d2-6b3aa02959f9@10.8.25.10@o2ib6:444/0 lens 8352/448 e 0 to 0 dl 1563153954 ref 1 fl Interpret:/2/0 rc 0/0
          oak-io3-s2: Jul 15 15:50:29 oak-io3-s2 kernel: LustreError: 230686:0:(ldlm_lib.c:3239:target_bulk_io()) @@@ timeout on bulk WRITE after 100+0s  req@ffff880b8e9f3850 x1631448489493376/t0(0) o4->d42a1e0f-df41-3c8d-23b2-9af81cad77b3@10.8.20.35@o2ib6:254/0 lens 544/448 e 0 to 0 dl 1563231529 ref 1 fl Interpret:/2/0 rc 0/0
          oak-io2-s2: Jul 15 15:48:20 oak-io2-s2 kernel: LustreError: 337169:0:(ldlm_lib.c:3239:target_bulk_io()) @@@ timeout on bulk WRITE after 100+0s  req@ffff883bcc6d1050 x1631774515146240/t0(0) o4->ca61ebcc-8fe9-6078-6900-091596d38ce5@10.8.31.1@o2ib6:124/0 lens 544/448 e 0 to 0 dl 1563231399 ref 1 fl Interpret:/2/0 rc 0/0
          
          [root@oak-hn01 sthiell.root]# clush -w@oss journalctl -n 100000 -k \| grep Bulk \| tail -1
          oak-io1-s2: Jul 15 15:54:48 oak-io1-s2 kernel: Lustre: oak-OST0019: Bulk IO write error with ca61ebcc-8fe9-6078-6900-091596d38ce5 (at 10.8.31.1@o2ib6), client will retry: rc = -110
          oak-io2-s1: Jul 15 15:52:02 oak-io2-s1 kernel: Lustre: oak-OST0038: Bulk IO write error with a28ed951-f21d-a025-1515-b542dcf02373 (at 10.9.103.20@o2ib4), client will retry: rc = -110
          oak-io3-s2: Jul 15 15:50:29 oak-io3-s2 kernel: Lustre: oak-OST006b: Bulk IO write error with d42a1e0f-df41-3c8d-23b2-9af81cad77b3 (at 10.8.20.35@o2ib6), client will retry: rc = -110
          oak-io1-s1: Jul 15 15:55:36 oak-io1-s1 kernel: Lustre: oak-OST0010: Bulk IO write error with a28ed951-f21d-a025-1515-b542dcf02373 (at 10.9.103.20@o2ib4), client will retry: rc = -110
          oak-io3-s1: Jul 14 18:27:35 oak-io3-s1 kernel: Lustre: oak-OST0072: Bulk IO write error with b0caabfb-5248-65f6-c4d2-6b3aa02959f9 (at 10.8.25.10@o2ib6), client will retry: rc = -110
          oak-io2-s2: Jul 15 15:55:06 oak-io2-s2 kernel: Lustre: oak-OST0053: Bulk IO read error with 9bb2d237-f2ac-86e5-17d4-840dd14480cb (at 10.8.16.4@o2ib6), client will retry: rc -110
          
          sthiell Stephane Thiell added a comment - edited

          The only thing I can see, which is strange, is on the Oak/Sherlock routers (FDR/FDR and FDR/EDR): there are a lot of PUT_NACKs from the clients. I don't see that on the Fir routers (EDR/FDR and EDR/EDR). The routers are all the same, running 2.12.0 + the patch for LU-12065 (lnd: increase CQ entries):

          Jul 12 15:05:34 sh-rtr-oak-1-2.int kernel: LNet: 9516:0:(o2iblnd_cb.c:408:kiblnd_handle_rx()) PUT_NACK from 10.8.16.4@o2ib6
          Jul 12 15:25:36 sh-rtr-oak-1-2.int kernel: LNet: 9517:0:(o2iblnd_cb.c:408:kiblnd_handle_rx()) PUT_NACK from 10.8.16.4@o2ib6
          Jul 12 15:35:37 sh-rtr-oak-1-2.int kernel: LNet: 9516:0:(o2iblnd_cb.c:408:kiblnd_handle_rx()) PUT_NACK from 10.8.16.4@o2ib6
          Jul 12 15:45:38 sh-rtr-oak-1-2.int kernel: LNet: 9517:0:(o2iblnd_cb.c:408:kiblnd_handle_rx()) PUT_NACK from 10.8.16.4@o2ib6
          Jul 12 15:55:39 sh-rtr-oak-1-2.int kernel: LNet: 9516:0:(o2iblnd_cb.c:408:kiblnd_handle_rx()) PUT_NACK from 10.8.16.4@o2ib6
          Jul 12 15:55:39 sh-rtr-oak-1-2.int kernel: LNet: 9516:0:(o2iblnd_cb.c:408:kiblnd_handle_rx()) Skipped 1 previous similar message
          

          Client:

          Jul 12 15:35:37 sh-16-04.int kernel: Lustre: 28485:0:(client.c:2132:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1562970336/real 1562970336]  req@ffff8d383429bc00 x1631585825749152/t0(0) o3->oak-OST005
          Jul 12 15:35:37 sh-16-04.int kernel: Lustre: oak-OST0053-osc-ffff8d40987c0800: Connection to oak-OST0053 (at 10.0.2.106@o2ib5) was lost; in progress operations using this service will wait for recovery to complete
          Jul 12 15:35:37 sh-16-04.int kernel: Lustre: oak-OST0053-osc-ffff8d40987c0800: Connection restored to 10.0.2.106@o2ib5 (at 10.0.2.106@o2ib5)
          Jul 12 15:45:38 sh-16-04.int kernel: Lustre: 28485:0:(client.c:2132:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1562970937/real 1562970937]  req@ffff8d383429bc00 x1631585825749152/t0(0) o3->oak-OST005
          Jul 12 15:45:38 sh-16-04.int kernel: Lustre: oak-OST0053-osc-ffff8d40987c0800: Connection to oak-OST0053 (at 10.0.2.106@o2ib5) was lost; in progress operations using this service will wait for recovery to complete
          Jul 12 15:45:38 sh-16-04.int kernel: Lustre: oak-OST0053-osc-ffff8d40987c0800: Connection restored to 10.0.2.106@o2ib5 (at 10.0.2.106@o2ib5)
          Jul 12 15:55:39 sh-16-04.int kernel: Lustre: 28485:0:(client.c:2132:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1562971538/real 1562971538]  req@ffff8d383429bc00 x1631585825749152/t0(0) o3->oak-OST005
          Jul 12 15:55:39 sh-16-04.int kernel: Lustre: oak-OST0053-osc-ffff8d40987c0800: Connection to oak-OST0053 (at 10.0.2.106@o2ib5) was lost; in progress operations using this service will wait for recovery to complete
          Jul 12 15:55:39 sh-16-04.int kernel: Lustre: oak-OST0053-osc-ffff8d40987c0800: Connection restored to 10.0.2.106@o2ib5 (at 10.0.2.106@o2ib5)
          

          Server (2.10), lctl dk:

          00000020:02000400:7.0:1562971037.407655:0:336668:0:(tgt_handler.c:2046:tgt_brw_read()) oak-OST0053: Bulk IO read error with 9bb2d237-f2ac-86e5-17d4-840dd14480cb (at 10.8.16.4@o2ib6), client will retry: rc -110
          00010000:02000400:24.0:1562971538.368057:0:337070:0:(ldlm_lib.c:779:target_handle_reconnect()) oak-OST0053: Client 9bb2d237-f2ac-86e5-17d4-840dd14480cb (at 10.8.16.4@o2ib6) reconnecting
          00000100:02000000:24.0:1562971538.368078:0:337070:0:(import.c:1541:ptlrpc_import_recovery_state_machine()) oak-OST0053: Connection restored to 9bb2d237-f2ac-86e5-17d4-840dd14480cb (at 10.8.16.4@o2ib6)
          00010000:00020000:9.0:1562971638.368632:0:336721:0:(ldlm_lib.c:3239:target_bulk_io()) @@@ timeout on bulk READ after 100+0s  req@ffff883b9b290850 x1631585825749152/t0(0) o3->9bb2d237-f2ac-86e5-17d4-840dd14480cb@10.8.16.4@o2ib6:583/0 lens 488/432 e 0 to 0 dl 1562972138 ref 1 fl Interpret:/2/0 rc 0/0
          00000020:02000400:9.0:1562971638.368665:0:336721:0:(tgt_handler.c:2046:tgt_brw_read()) oak-OST0053: Bulk IO read error with 9bb2d237-f2ac-86e5-17d4-840dd14480cb (at 10.8.16.4@o2ib6), client will retry: rc -110
          00010000:02000400:43.0:1562972139.327922:0:318983:0:(ldlm_lib.c:779:target_handle_reconnect()) oak-OST0053: Client 9bb2d237-f2ac-86e5-17d4-840dd14480cb (at 10.8.16.4@o2ib6) reconnecting
          00000100:02000000:43.0:1562972139.327938:0:318983:0:(import.c:1541:ptlrpc_import_recovery_state_machine()) oak-OST0053: Connection restored to 9bb2d237-f2ac-86e5-17d4-840dd14480cb (at 10.8.16.4@o2ib6)
          00010000:00020000:12.0:1562972239.327640:0:337289:0:(ldlm_lib.c:3239:target_bulk_io()) @@@ timeout on bulk READ after 100+0s  req@ffff881ccbed2c50 x1631585825749152/t0(0) o3->9bb2d237-f2ac-86e5-17d4-840dd14480cb@10.8.16.4@o2ib6:429/0 lens 488/432 e 0 to 0 dl 1562972739 ref 1 fl Interpret:/2/0 rc 0/0
          00000020:02000400:12.0:1562972239.327693:0:337289:0:(tgt_handler.c:2046:tgt_brw_read()) oak-OST0053: Bulk IO read error with 9bb2d237-f2ac-86e5-17d4-840dd14480cb (at 10.8.16.4@o2ib6), client will retry: rc -110
          

          1562971037 = Friday, July 12, 2019 3:37:17 PM
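
          The epoch timestamps from lctl dk convert directly with GNU date, e.g.:

          date -d @1562971037
          # Fri Jul 12 15:37:17 PDT 2019 (local timezone)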


          ashehata Amir Shehata (Inactive) added a comment -

          I looked at the stack traces. All LNet/LND threads seem idle.

          Are there any LNetErrors indicating message transmit failures? The timeouts in the logs above don't necessarily point to a network problem; the FS might simply not be responding to RPCs for some reason.

          Do you see any LND/LNet timeouts?
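
          For example, a rough filter over the kernel logs on the clients, routers and OSS nodes would be something like:

          journalctl -k | grep -E 'LNetError|LNet:.*Timed out'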


          sthiell Stephane Thiell added a comment -

          We've dumped a client that was stuck, but this likely won't help much. It is in sh-24-05_foreach.bt.log. I only see one Lustre thread waiting on a write, but that's it:

          PID: 170838  TASK: ffff9f3919301040  CPU: 11  COMMAND: "DMRG"
           #0 [ffff9f348006b988] __schedule at ffffffffa4b68972
           #1 [ffff9f348006ba10] schedule at ffffffffa4b68e19
           #2 [ffff9f348006ba20] cl_sync_io_wait at ffffffffc0d405dd [obdclass]
           #3 [ffff9f348006bab0] cl_io_submit_sync at ffffffffc0d40828 [obdclass]
           #4 [ffff9f348006baf8] vvp_io_write_commit at ffffffffc109c440 [lustre]
           #5 [ffff9f348006bb58] vvp_io_write_start at ffffffffc109cbe6 [lustre]
           #6 [ffff9f348006bbc0] cl_io_start at ffffffffc0d3faf8 [obdclass]
           #7 [ffff9f348006bbe8] cl_io_loop at ffffffffc0d41ee1 [obdclass]
           #8 [ffff9f348006bc58] ll_file_io_generic at ffffffffc1052742 [lustre]
           #9 [ffff9f348006bd70] ll_file_aio_write at ffffffffc10534c2 [lustre]
          #10 [ffff9f348006bde0] ll_file_write at ffffffffc1053734 [lustre]
          #11 [ffff9f348006bec8] vfs_write at ffffffffa4641810
          #12 [ffff9f348006bf08] sys_write at ffffffffa464262f
          #13 [ffff9f348006bf50] system_call_fastpath at ffffffffa4b75ddb
          

          But maybe that can help determine whether it is a client issue or not.

          BTW, I'm not sure it is an LNet issue either.
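
          For reference, this kind of all-task backtrace dump can be collected with the crash utility against the live kernel (or a vmcore), roughly as follows; the vmlinux path shown is the usual CentOS debuginfo location:

          crash /usr/lib/debug/lib/modules/$(uname -r)/vmlinux
          crash> foreach bt > sh-24-05_foreach.bt.log
          # (echo t > /proc/sysrq-trigger would dump similar stacks to the kernel log instead)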

          pjones Peter Jones added a comment -

          Amir

          Can you please advise?

          Thanks

          Peter


          People

            Assignee: ashehata Amir Shehata (Inactive)
            Reporter: sthiell Stephane Thiell
