Details
Type: Question/Request
Resolution: Fixed
Priority: Critical
Affects versions: Lustre 2.12.0, Lustre 2.10.8
Environment: Client 2.12.0 + patches, Server: 2.10.8 + patches, CentOS 7.6
Description
Hello! I'm tracking hung file access on Sherlock/Oak, perhaps someone can help.
Clients are on Sherlock and running 2.12.0 + patches (https://github.com/stanford-rc/lustre/commits/4f7519966aebb21589e145b010dcdfea6ced6670). The problem only seems to happen for files on the Oak servers, which are running 2.10.8 + patches (https://github.com/stanford-rc/lustre/commits/73a88a805990aed3c35d5247cc886a7cfc1c527f).
The problem originates with the following kind of error on the OSS (here 10.0.2.110@o2ib5, running 2.10.8+):
Jul 11 12:19:58 oak-io3-s2 kernel: LustreError: 177680:0:(ldlm_lib.c:3239:target_bulk_io()) @@@ timeout on bulk WRITE after 100+0s req@ffff883e01b05450 x1631536567940480/t0(0) o4->83eaf162-e2fc-0143-0a04-bb96aa68816a@10.9.105.47@o2ib4:647/0 lens 4008/448 e 0 to 0 dl 1562873297 ref 1 fl Interpret:/2/0 rc 0/0
And on the corresponding client, here sh-105-47 (10.9.105.47@o2ib4), running 2.12.0+:
Jul 11 12:18:17 sh-105-47.int kernel: Lustre: 97415:0:(client.c:2132:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1562872096/real 1562872096] req@ffff8880aa5acb00 x1631536567940480/t0(0) o4->oak-OST00
Jul 11 12:18:17 sh-105-47.int kernel: Lustre: oak-OST0075-osc-ffff8864d1205000: Connection to oak-OST0075 (at 10.0.2.110@o2ib5) was lost; in progress operations using this service will wait for recovery to complete
Jul 11 12:18:17 sh-105-47.int kernel: Lustre: oak-OST0075-osc-ffff8864d1205000: Connection restored to 10.0.2.110@o2ib5 (at 10.0.2.110@o2ib5)
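For reference, the state of the corresponding OSC import on the client can be checked like this (a minimal sketch; the OST name is taken from the log above and would need to be adapted):

# on the affected client (sh-105-47), dump the OSC import for the OST that
# keeps flapping; this shows the connection state and any in-flight requests
lctl get_param osc.oak-OST0075-osc-*.import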
Another example: a file is blocked on sh-24-05 (10.8.24.5@o2ib6). I cannot access it even locally (without routers) on an Oak client:
[root@oak-gw02 ~]# file /oak/stanford/groups/tpd/cpeng/Kitaev_v2/t1_3.0_t2_0.1_k_1.0_J_0.0/delta_0.0833/Lx_48_Ly_3/m_6000/status.45458106.out
...hanging...
But stat does work:
[root@oak-gw02 ~]# stat /oak/stanford/groups/tpd/cpeng/Kitaev_v2/t1_3.0_t2_0.1_k_1.0_J_0.0/delta_0.0833/Lx_48_Ly_3/m_6000/status.45458106.out
  File: ‘/oak/stanford/groups/tpd/cpeng/Kitaev_v2/t1_3.0_t2_0.1_k_1.0_J_0.0/delta_0.0833/Lx_48_Ly_3/m_6000/status.45458106.out’
  Size: 68141      Blocks: 144        IO Block: 4194304 regular file
Device: d8214508h/3626059016d   Inode: 1044835495315506970  Links: 1
Access: (0660/-rw-rw----)  Uid: (346234/ cpeng18)   Gid: ( 6267/ oak_tpd)
Access: 2019-07-12 10:08:55.000000000 -0700
Modify: 2019-07-10 06:40:11.000000000 -0700
Change: 2019-07-10 06:40:11.000000000 -0700
 Birth: -
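To see where the hanging file(1) process is blocked, one option is to dump its kernel stack (a sketch; the pgrep pattern is only illustrative and assumes the command is still running):

# find the hung process and dump its kernel stack (requires root)
pid=$(pgrep -f 'file /oak/stanford/groups/tpd' | head -1)
cat /proc/"$pid"/stack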
The client logs show many disconnects from OST003b:
Jul 12 09:27:38 sh-24-05.int kernel: Lustre: oak-OST003b-osc-ffff9f3b06215000: Connection to oak-OST003b (at 10.0.2.106@o2ib5) was lost; in progress operations using this service will wait for recovery to complete
Jul 12 09:27:38 sh-24-05.int kernel: Lustre: oak-OST003b-osc-ffff9f3b06215000: Connection restored to 10.0.2.106@o2ib5 (at 10.0.2.106@o2ib5)
Jul 12 09:37:40 sh-24-05.int kernel: Lustre: 91043:0:(client.c:2132:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1562948858/real 1562948858] req@ffff9f4315b34800 x1635098786669312/t0(0) o4->oak-OST003b-osc-ffff9f3b06215000@10.0.2.106@o2ib5:6/4 lens 568/448 e 4 to 1 dl 1562949459 ref 2 fl Rpc:X/2/ffffffff rc 0/-1
Jul 12 09:37:40 sh-24-05.int kernel: Lustre: oak-OST003b-osc-ffff9f3b06215000: Connection to oak-OST003b (at 10.0.2.106@o2ib5) was lost; in progress operations using this service will wait for recovery to complete
Jul 12 09:37:40 sh-24-05.int kernel: Lustre: oak-OST003b-osc-ffff9f3b06215000: Connection restored to 10.0.2.106@o2ib5 (at 10.0.2.106@o2ib5)
Jul 12 09:47:41 sh-24-05.int kernel: Lustre: 91043:0:(client.c:2132:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1562949460/real 1562949460] req@ffff9f4315b34800 x1635098786669312/t0(0) o4->oak-OST003b-osc-ffff9f3b06215000@10.0.2.106@o2ib5:6/4 lens 568/448 e 4 to 1 dl 1562950061 ref 2 fl Rpc:X/2/ffffffff rc 0/-1
Jul 12 09:47:41 sh-24-05.int kernel: Lustre: oak-OST003b-osc-ffff9f3b06215000: Connection to oak-OST003b (at 10.0.2.106@o2ib5) was lost; in progress operations using this service will wait for recovery to complete
Jul 12 09:47:42 sh-24-05.int kernel: Lustre: oak-OST003b-osc-ffff9f3b06215000: Connection restored to 10.0.2.106@o2ib5 (at 10.0.2.106@o2ib5)
Jul 12 09:57:42 sh-24-05.int kernel: Lustre: 91043:0:(client.c:2132:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1562950061/real 1562950061] req@ffff9f4315b34800 x1635098786669312/t0(0) o4->oak-OST003b-osc-ffff9f3b06215000@10.0.2.106@o2ib5:6/4 lens 568/448 e 4 to 1 dl 1562950662 ref 2 fl Rpc:X/2/ffffffff rc 0/-1
Jul 12 09:57:42 sh-24-05.int kernel: Lustre: oak-OST003b-osc-ffff9f3b06215000: Connection to oak-OST003b (at 10.0.2.106@o2ib5) was lost; in progress operations using this service will wait for recovery to complete
Jul 12 09:57:42 sh-24-05.int kernel: Lustre: oak-OST003b-osc-ffff9f3b06215000: Connection restored to 10.0.2.106@o2ib5 (at 10.0.2.106@o2ib5)
Jul 12 10:07:43 sh-24-05.int kernel: Lustre: 91043:0:(client.c:2132:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1562950662/real 1562950662] req@ffff9f4315b34800 x1635098786669312/t0(0) o4->oak-OST003b-osc-ffff9f3b06215000@10.0.2.106@o2ib5:6/4 lens 568/448 e 4 to 1 dl 1562951263 ref 2 fl Rpc:X/2/ffffffff rc 0/-1
Jul 12 10:07:43 sh-24-05.int kernel: Lustre: oak-OST003b-osc-ffff9f3b06215000: Connection to oak-OST003b (at 10.0.2.106@o2ib5) was lost; in progress operations using this service will wait for recovery to complete
Jul 12 10:07:43 sh-24-05.int kernel: Lustre: oak-OST003b-osc-ffff9f3b06215000: Connection restored to 10.0.2.106@o2ib5 (at 10.0.2.106@o2ib5)
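Note that it is the same request xid (x1635098786669312) timing out roughly every 10 minutes; a quick sketch to confirm the resend loop on the client (the log path may differ):

# count occurrences of the stuck request xid in the client syslog
grep -c 'x1635098786669312' /var/log/messages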
This file is indeed striped on OST index 59 (003b):
[root@oak-gw02 ~]# lfs getstripe /oak/stanford/groups/tpd/cpeng/Kitaev_v2/t1_3.0_t2_0.1_k_1.0_J_0.0/delta_0.0833/Lx_48_Ly_3/m_6000/status.45458106.out
/oak/stanford/groups/tpd/cpeng/Kitaev_v2/t1_3.0_t2_0.1_k_1.0_J_0.0/delta_0.0833/Lx_48_Ly_3/m_6000/status.45458106.out
lmm_stripe_count:  1
lmm_stripe_size:   1048576
lmm_pattern:       1
lmm_layout_gen:    0
lmm_stripe_offset: 59
        obdidx           objid           objid           group
            59          330567         0x50b47      0x1ac0000400
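To double-check from a client which OSS currently backs this OST, something like the following should work (a sketch; assumes lctl access on the client):

# show the NID/UUID of the server connection used by the oak-OST003b OSC
lctl get_param osc.oak-OST003b-osc-*.ost_conn_uuid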
OST003b is on oak-io2-s2 (10.0.2.106@o2ib5) and the matching logs are:
Jul 12 09:49:22 oak-io2-s2 kernel: LustreError: 337216:0:(ldlm_lib.c:3239:target_bulk_io()) @@@ timeout on bulk WRITE after 100+0s req@ffff883e20b73c50 x1635098786669312/t0(0) o4->6a1f7b6d-2e7f-b8ba-0d05-a94a9c66ed24@10.8.24.5@o2ib6:246/
Jul 12 09:49:22 oak-io2-s2 kernel: LustreError: 337216:0:(ldlm_lib.c:3239:target_bulk_io()) Skipped 33 previous similar messages
Jul 12 09:51:11 oak-io2-s2 kernel: Lustre: oak-OST0059: Bulk IO read error with 35757990-c9b9-0d25-75ab-51daa53c34a7 (at 10.9.101.8@o2ib4), client will retry: rc -110
Jul 12 09:51:11 oak-io2-s2 kernel: Lustre: Skipped 10 previous similar messages
Jul 12 09:51:48 oak-io2-s2 kernel: Lustre: oak-OST004b: Connection restored to 7d38b821-565e-3c2d-7913-4b3451580d80 (at 10.9.108.19@o2ib4)
Jul 12 09:51:48 oak-io2-s2 kernel: Lustre: Skipped 53 previous similar messages
Jul 12 09:55:54 oak-io2-s2 kernel: Lustre: oak-OST0043: Client ed05c118-550c-6d50-a22a-90d69ede3f9d (at 10.8.25.4@o2ib6) reconnecting
Jul 12 09:55:54 oak-io2-s2 kernel: Lustre: Skipped 30 previous similar messages
Jul 12 09:57:51 oak-io2-s2 kernel: Lustre: oak-OST0053: Bulk IO write error with 3a089bf9-b1d9-abd2-566d-ba0d59f75130 (at 10.8.12.2@o2ib6), client will retry: rc = -110
Jul 12 09:57:51 oak-io2-s2 kernel: Lustre: Skipped 22 previous similar messages
Jul 12 09:58:18 oak-io2-s2 kernel: Lustre: oak-OST003b: haven't heard from client aca6646c-58d6-0250-4505-9a1b573dc90f (at 10.9.112.3@o2ib4) in 227 seconds. I think it's dead, and I am evicting it. exp ffff883e5ef6d800, cur 1562950698 exp
Jul 12 09:58:18 oak-io2-s2 kernel: Lustre: Skipped 23 previous similar messages
Jul 12 09:58:29 oak-io2-s2 kernel: Lustre: oak-OST0053: haven't heard from client aca6646c-58d6-0250-4505-9a1b573dc90f (at 10.9.112.3@o2ib4) in 227 seconds. I think it's dead, and I am evicting it. exp ffff883bc3f1b000, cur 1562950709 exp
Jul 12 09:59:23 oak-io2-s2 kernel: LustreError: 337286:0:(ldlm_lib.c:3239:target_bulk_io()) @@@ timeout on bulk WRITE after 100+0s req@ffff883bb150d850 x1635098786669312/t0(0) o4->6a1f7b6d-2e7f-b8ba-0d05-a94a9c66ed24@10.8.24.5@o2ib6:92/0
Jul 12 09:59:23 oak-io2-s2 kernel: LustreError: 337286:0:(ldlm_lib.c:3239:target_bulk_io()) Skipped 33 previous similar messages
Jul 12 10:01:12 oak-io2-s2 kernel: Lustre: oak-OST0059: Bulk IO read error with 35757990-c9b9-0d25-75ab-51daa53c34a7 (at 10.9.101.8@o2ib4), client will retry: rc -110
Jul 12 10:01:12 oak-io2-s2 kernel: Lustre: Skipped 10 previous similar messages
Jul 12 10:01:49 oak-io2-s2 kernel: Lustre: oak-OST004b: Connection restored to 7d38b821-565e-3c2d-7913-4b3451580d80 (at 10.9.108.19@o2ib4)
Jul 12 10:01:49 oak-io2-s2 kernel: Lustre: Skipped 29 previous similar messages
Jul 12 10:05:55 oak-io2-s2 kernel: Lustre: oak-OST0043: Client ed05c118-550c-6d50-a22a-90d69ede3f9d (at 10.8.25.4@o2ib6) reconnecting
Jul 12 10:05:55 oak-io2-s2 kernel: Lustre: Skipped 28 previous similar messages
Jul 12 10:07:52 oak-io2-s2 kernel: Lustre: oak-OST0053: Bulk IO write error with 3a089bf9-b1d9-abd2-566d-ba0d59f75130 (at 10.8.12.2@o2ib6), client will retry: rc = -110
Jul 12 10:07:52 oak-io2-s2 kernel: Lustre: Skipped 20 previous similar messages
Jul 12 10:09:23 oak-io2-s2 kernel: LustreError: 337506:0:(ldlm_lib.c:3239:target_bulk_io()) @@@ timeout on bulk WRITE after 100+0s req@ffff881d0aba5450 x1635098786669312/t0(0) o4->6a1f7b6d-2e7f-b8ba-0d05-a94a9c66ed24@10.8.24.5@o2ib6:693/
Jul 12 10:09:23 oak-io2-s2 kernel: LustreError: 337506:0:(ldlm_lib.c:3239:target_bulk_io()) Skipped 31 previous similar messages
Jul 12 10:11:13 oak-io2-s2 kernel: Lustre: oak-OST0059: Bulk IO read error with 35757990-c9b9-0d25-75ab-51daa53c34a7 (at 10.9.101.8@o2ib4), client will retry: rc -110
Jul 12 10:11:13 oak-io2-s2 kernel: Lustre: Skipped 10 previous similar messages
Jul 12 10:11:50 oak-io2-s2 kernel: Lustre: oak-OST004b: Connection restored to 7d38b821-565e-3c2d-7913-4b3451580d80 (at 10.9.108.19@o2ib4)
Jul 12 10:11:50 oak-io2-s2 kernel: Lustre: Skipped 74 previous similar messages
Jul 12 10:15:56 oak-io2-s2 kernel: Lustre: oak-OST0043: Client ed05c118-550c-6d50-a22a-90d69ede3f9d (at 10.8.25.4@o2ib6) reconnecting
Jul 12 10:15:56 oak-io2-s2 kernel: Lustre: Skipped 28 previous similar messages
Jul 12 10:17:53 oak-io2-s2 kernel: Lustre: oak-OST0053: Bulk IO write error with 3a089bf9-b1d9-abd2-566d-ba0d59f75130 (at 10.8.12.2@o2ib6), client will retry: rc = -110
Jul 12 10:17:53 oak-io2-s2 kernel: Lustre: Skipped 20 previous similar messages
Jul 12 10:19:24 oak-io2-s2 kernel: LustreError: 337169:0:(ldlm_lib.c:3239:target_bulk_io()) @@@ timeout on bulk WRITE after 100+0s req@ffff883bcdb45450 x1635098786669312/t0(0) o4->6a1f7b6d-2e7f-b8ba-0d05-a94a9c66ed24@10.8.24.5@o2ib6:539/
Jul 12 10:19:24 oak-io2-s2 kernel: LustreError: 337169:0:(ldlm_lib.c:3239:target_bulk_io()) Skipped 31 previous similar messages
The workaround we have found is to reboot the clients that are doing this. Otherwise, access to (some) files that are open on these clients just hangs from anywhere, leading to failed jobs.
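An alternative to a full client reboot might be to manually evict the stuck client export from the OSS; this is untested here and assumes the evict_client parameter on these 2.10.8 OSS nodes accepts the nid: syntax:

# on the OSS serving the affected OST (here oak-io2-s2), evict all exports
# from the stuck client NID; untested, adapt OST name and client NID
lctl set_param obdfilter.oak-OST003b.evict_client=nid:10.8.24.5@o2ib6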
The majority of the timeout messages are "timeout on bulk WRITE", but I also see a few "timeout on bulk READ". You can see below that they come from all OSSes/OSTs and from several clients.
oak-io2-s2: Jul 11 20:47:58 oak-io2-s2 kernel: LustreError: 337506:0:(ldlm_lib.c:3239:target_bulk_io()) @@@ timeout on bulk WRITE after 100+0s req@ffff880064249450 x1635098786669312/t0(0) o4->6a1f7b6d-2e7f-b8ba-0d05-a94a9c66ed24@10.8.24.5@o2ib6:173/0 lens 568/448 e 0 to 0 dl 1562903778 ref 1 fl Interpret:/2/0 rc 0/0
oak-io2-s1: Jul 11 20:50:48 oak-io2-s1 kernel: LustreError: 269496:0:(ldlm_lib.c:3239:target_bulk_io()) @@@ timeout on bulk WRITE after 100+0s req@ffff881fdd355850 x1631642338211952/t0(0) o4->2849578e-5060-1d9c-ed8e-2c6ecf8dbfdc@10.9.103.21@o2ib4:618/0 lens 520/448 e 0 to 0 dl 1562903468 ref 1 fl Interpret:/2/0 rc 0/0
oak-io1-s1: Jul 11 20:47:28 oak-io1-s1 kernel: LustreError: 346440:0:(ldlm_lib.c:3239:target_bulk_io()) @@@ timeout on bulk WRITE after 100+0s req@ffff883e8e740450 x1634178743345136/t0(0) o4->2e50e863-cfe9-6b65-cc33-1b949391a2a5@10.9.109.69@o2ib4:418/0 lens 9248/448 e 0 to 0 dl 1562903268 ref 1 fl Interpret:/2/0 rc 0/0
oak-io3-s2: Jul 11 20:52:03 oak-io3-s2 kernel: LustreError: 160887:0:(ldlm_lib.c:3239:target_bulk_io()) @@@ timeout on bulk WRITE after 100+0s req@ffff8835760a5850 x1631812409859184/t0(0) o4->7b520521-bd66-0427-aeea-fed858d66fc7@10.8.30.13@o2ib6:418/0 lens 544/448 e 0 to 0 dl 1562904023 ref 1 fl Interpret:/2/0 rc 0/0
oak-io1-s2: Jul 11 20:54:08 oak-io1-s2 kernel: LustreError: 263511:0:(ldlm_lib.c:3239:target_bulk_io()) @@@ timeout on bulk WRITE after 100+0s req@ffff881b99d74850 x1631536876509328/t0(0) o4->88f0afe1-d8a4-2cd9-5b8e-a652402618f0@10.9.103.16@o2ib4:125/0 lens 4584/448 e 0 to 0 dl 1562903730 ref 1 fl Interpret:/2/0 rc 0/0
oak-io1-s1: Jul 11 20:57:33 oak-io1-s1 kernel: LustreError: 333909:0:(ldlm_lib.c:3239:target_bulk_io()) @@@ timeout on bulk WRITE after 100+0s req@ffff883e47c9dc50 x1634178743345136/t0(0) o4->2e50e863-cfe9-6b65-cc33-1b949391a2a5@10.9.109.69@o2ib4:268/0 lens 9248/448 e 0 to 0 dl 1562903873 ref 1 fl Interpret:/2/0 rc 0/0
oak-io2-s2: Jul 11 20:57:59 oak-io2-s2 kernel: LustreError: 337756:0:(ldlm_lib.c:3239:target_bulk_io()) @@@ timeout on bulk WRITE after 100+0s req@ffff88003532c450 x1635098786669312/t0(0) o4->6a1f7b6d-2e7f-b8ba-0d05-a94a9c66ed24@10.8.24.5@o2ib6:19/0 lens 568/448 e 0 to 0 dl 1562904379 ref 1 fl Interpret:/2/0 rc 0/0
oak-io3-s1: Jul 11 20:58:00 oak-io3-s1 kernel: LustreError: 166558:0:(ldlm_lib.c:3239:target_bulk_io()) @@@ timeout on bulk WRITE after 100+0s req@ffff88005bd8d050 x1631571779573440/t0(0) o4->65acd50a-6ff2-3a7f-4743-551235e323f1@10.8.30.22@o2ib6:293/0 lens 10728/448 e 0 to 0 dl 1562903898 ref 1 fl Interpret:/2/0 rc 0/0
oak-io2-s1: Jul 11 21:00:53 oak-io2-s1 kernel: LustreError: 269486:0:(ldlm_lib.c:3239:target_bulk_io()) @@@ timeout on bulk WRITE after 100+0s req@ffff881c9332cc50 x1631642338211952/t0(0) o4->2849578e-5060-1d9c-ed8e-2c6ecf8dbfdc@10.9.103.21@o2ib4:468/0 lens 520/448 e 0 to 0 dl 1562904073 ref 1 fl Interpret:/2/0 rc 0/0
oak-io3-s2: Jul 11 21:02:04 oak-io3-s2 kernel: LustreError: 160141:0:(ldlm_lib.c:3239:target_bulk_io()) @@@ timeout on bulk WRITE after 100+0s req@ffff8838d7b30450 x1631812409859184/t0(0) o4->7b520521-bd66-0427-aeea-fed858d66fc7@10.8.30.13@o2ib6:264/0 lens 544/448 e 0 to 0 dl 1562904624 ref 1 fl Interpret:/2/0 rc 0/0
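In case it helps, a rough way to count these bulk timeouts per client NID on an OSS (a sketch; the log path and regex may need adapting):

# aggregate 'timeout on bulk' errors per client NID in an OSS syslog
grep 'timeout on bulk' /var/log/messages | grep -oE '[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+@o2ib[0-9]+' | sort | uniq -c | sort -rn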
We noticed yesterday that it looks like some client is holding a write lock (the one in the error message), and when we reboot the client, it is finally evicted and access to the file is restored. Do you think there are any client patches in 2.12.1 or 2.12.2 that could help with this? Any other suggestions? Thanks!
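For reference, the locks a suspect client still holds for the affected OST can be inspected like this (a sketch; note that lru_size=clear only drops unused locks, so it likely won't release a lock pinned by the stuck bulk write):

# on the suspect client (e.g. sh-24-05), count LDLM locks held for the OST
lctl get_param ldlm.namespaces.oak-OST003b-osc-*.lock_count
# optionally flush unused locks from the LRU (does not cancel locks in use)
lctl set_param ldlm.namespaces.oak-OST003b-osc-*.lru_size=clear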
Stephane
ok - thanks