[LU-12502] application failed as Segmentation fault Created: 01/Jul/19 Updated: 29/Nov/19 |
|
| Status: | Reopened |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.13.0 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Minor |
| Reporter: | Sarah Liu | Assignee: | Hongchao Zhang |
| Resolution: | Unresolved | Votes: | 0 |
| Labels: | soak | ||
| Environment: |
lustre-master-next-ib build #121 |
||
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
Many applications failed with a segmentation fault, for example 290230-iorfpp.out:
Summary:
api = POSIX
test filename = /mnt/soaked/soaktest/test/iorfpp/290230/iorfpp_file
access = file-per-process
pattern = segmented (1 segment)
ordering in a file = sequential offsets
ordering inter file=random task offsets >= 1, seed=0
clients = 16 (2 per node)
repetitions = 1
xfersize = 27.36 MiB
blocksize = 25.38 GiB
aggregate filesize = 406.14 GiB
[soak-18:106287] *** Process received signal ***
[soak-18:106288] *** Process received signal ***
[soak-18:106287] Signal: Segmentation fault (11)
[soak-18:106287] Signal code: Address not mapped (1)
[soak-18:106287] Failing at address: (nil)
[soak-18:106288] Signal: Segmentation fault (11)
[soak-18:106288] Signal code: Address not mapped (1)
[soak-18:106288] Failing at address: (nil)
[soak-18:106287] [ 0] /usr/lib64/libpthread.so.0(+0xf5d0)[0x7f124cb7d5d0]
[soak-18:106287] *** End of error message ***
[soak-18:106288] [ 0] /usr/lib64/libpthread.so.0(+0xf5d0)[0x7efcf51c75d0]
[soak-18:106288] *** End of error message ***
srun: error: soak-18: tasks 2-3: Segmentation fault
srun: Terminating job step 290230.0
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
slurmstepd: error: *** STEP 290230.0 ON soak-17 CANCELLED AT 2019-06-30T09:33:24 ***
|
| Comments |
| Comment by Patrick Farrell (Inactive) [ 01/Jul/19 ] |
|
Thanks very much, Sarah. Can you provide dmesg from this node covering this time? |
| Comment by Sarah Liu [ 01/Jul/19 ] |
|
soak-18 dmesg around that time:
[52761.184570] Lustre: 11041:0:(client.c:2215:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1561870097/real 1561870097] req@ffff89465e654800 x1637687364931840/t0(0) o400->soaked-OST000f-osc-ffff894d2c2d6800@192.168.1.107@o2ib:28/4 lens 224/224 e 0 to 1 dl 1561870141 ref 1 fl Rpc:XNQr/0/ffffffff rc 0/-1
[52761.217758] Lustre: soaked-OST000f-osc-ffff894d2c2d6800: Connection to soaked-OST000f (at 192.168.1.107@o2ib) was lost; in progress operations using this service will wait for recovery to complete
[52786.127967] Lustre: 11037:0:(client.c:2215:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1561870122/real 1561870122] req@ffff89465e655580 x1637687365398656/t0(0) o400->soaked-OST000f-osc-ffff894d2c2d6800@192.168.1.107@o2ib:28/4 lens 224/224 e 0 to 1 dl 1561870166 ref 1 fl Rpc:XNQr/0/ffffffff rc 0/-1
[53113.093634] Lustre: soaked-OST000f-osc-ffff894d2c2d6800: Connection restored to 192.168.1.106@o2ib (at 192.168.1.106@o2ib)
[53121.992315] LustreError: 11-0: soaked-OST000b-osc-ffff894d2c2d6800: operation ost_write to node 192.168.1.106@o2ib failed: rc = -19
[53122.003204] Lustre: soaked-OST000b-osc-ffff894d2c2d6800: Connection to soaked-OST000b (at 192.168.1.106@o2ib) was lost; in progress operations using this service will wait for recovery to complete
[53122.025032] LustreError: Skipped 2 previous similar messages
[53157.928339] Lustre: soaked-OST000b-osc-ffff894d2c2d6800: Connection restored to 192.168.1.107@o2ib (at 192.168.1.107@o2ib)
[53162.673455] Lustre: soaked-OST000f-osc-ffff894d2c2d6800: Connection to soaked-OST000f (at 192.168.1.106@o2ib) was lost; in progress operations using this service will wait for recovery to complete
[53167.218496] LustreError: 11-0: soaked-OST0007-osc-ffff894d2c2d6800: operation ost_write to node 192.168.1.106@o2ib failed: rc = -19
[53167.231708] LustreError: Skipped 6 previous similar messages
[53167.234629] Lustre: soaked-OST0007-osc-ffff894d2c2d6800: Connection to soaked-OST0007 (at 192.168.1.106@o2ib) was lost; in progress operations using this service will wait for recovery to complete
[53176.123410] Lustre: soaked-OST000f-osc-ffff894d2c2d6800: Connection restored to 192.168.1.107@o2ib (at 192.168.1.107@o2ib)
[53184.179153] Lustre: soaked-OST0007-osc-ffff894d2c2d6800: Connection restored to 192.168.1.107@o2ib (at 192.168.1.107@o2ib)
[55528.861770] Lustre: 11038:0:(client.c:2215:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1561872901/real 1561872901] req@ffff894b45d88900 x1637687455318016/t0(0) o400->MGC192.168.1.108@o2ib@192.168.1.108@o2ib:26/25 lens 224/224 e 0 to 1 dl 1561872908 ref 1 fl Rpc:XNQr/0/ffffffff rc 0/-1
[55528.893711] LustreError: 166-1: MGC192.168.1.108@o2ib: Connection to MGS (at 192.168.1.108@o2ib) was lost; in progress operations using this service will fail
[55531.665930] Lustre: 11020:0:(client.c:2215:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1561872900/real 1561872900] req@ffff89492bfcc800 x1637687455310208/t0(0) o103->soaked-MDT0000-mdc-ffff894d2c2d6800@192.168.1.108@o2ib:17/18 lens 472/224 e 0 to 1 dl 1561872911 ref 1 fl Rpc:XQr/0/ffffffff rc 0/-1
[55531.699162] Lustre: soaked-MDT0000-mdc-ffff894d2c2d6800: Connection to soaked-MDT0000 (at 192.168.1.108@o2ib) was lost; in progress operations using this service will wait for recovery to complete
[55531.699162] Lustre: soaked-MDT0000-mdc-ffff894d2c2d6800: Connection to soaked-MDT0000 (at 192.168.1.108@o2ib) was lost; in progress operations using this service will wait for recovery to complete
[55544.229633] Lustre: 98548:0:(client.c:2215:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1561872880/real 1561872880] req@ffff8949c96aa400 x1637687455075456/t0(0) o101->soaked-MDT0000-mdc-ffff894d2c2d6800@192.168.1.108@o2ib:12/10 lens 584/1664 e 0 to 1 dl 1561872924 ref 2 fl Rpc:XQr/0/ffffffff rc 0/-1
[55544.262921] Lustre: 98548:0:(client.c:2215:ptlrpc_expire_one_request()) Skipped 1 previous similar message
[55933.134307] Lustre: Evicted from MGS (at MGC192.168.1.108@o2ib_0) after server handle changed from 0x8f4f40a74d9d44ad to 0x436eb31673ba0526
[55933.148584] Lustre: MGC192.168.1.108@o2ib: Connection restored to MGC192.168.1.108@o2ib_0 (at 192.168.1.108@o2ib)
[55939.214312] LustreError: 11014:0:(client.c:3113:ptlrpc_replay_interpret()) @@@ status 301, old was 0 req@ffff8948ff40e300 x1637687434482304/t2641409129512(2641409129512) o101->soaked-MDT0000-mdc-ffff894d2c2d6800@192.168.1.108@o2ib:12/10 lens 576/568 e 0 to 0 dl 1561873362 ref 2 fl Interpret:RPQU/4/0 rc 301/301
[55975.798452] Lustre: soaked-MDT0000-mdc-ffff894d2c2d6800: Connection restored to 192.168.1.108@o2ib (at 192.168.1.108@o2ib)
[56148.102577] Lustre: 11025:0:(client.c:2215:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1561873521/real 1561873521] req@ffff894d2d365a00 x1637687463644480/t0(0) o103->soaked-MDT0001-mdc-ffff894d2c2d6800@192.168.1.109@o2ib:17/18 lens 1512/224 e 0 to 1 dl 1561873528 ref 1 fl Rpc:XQr/0/ffffffff rc 0/-1
[56148.135899] Lustre: soaked-MDT0001-mdc-ffff894d2c2d6800: Connection to soaked-MDT0001 (at 192.168.1.109@o2ib) was lost; in progress operations using this service will wait for recovery to complete
[56398.951209] LustreError: 167-0: soaked-MDT0001-mdc-ffff894d2c2d6800: This client was evicted by soaked-MDT0001; in progress operations using this service will fail.
[56398.967987] LustreError: 99173:0:(file.c:233:ll_close_inode_openhandle()) soaked-clilmv-ffff894d2c2d6800: inode [0x2401437fc:0x1f0ae:0x0] mdc close failed: rc = -5
[56398.992491] Lustre: soaked-MDT0001-mdc-ffff894d2c2d6800: Connection restored to 192.168.1.108@o2ib (at 192.168.1.108@o2ib)
[56421.437040] Lustre: soaked-MDT0001-mdc-ffff894d2c2d6800: Connection to soaked-MDT0001 (at 192.168.1.108@o2ib) was lost; in progress operations using this service will wait for recovery to complete
[56421.474309] LustreError: 167-0: soaked-MDT0001-mdc-ffff894d2c2d6800: This client was evicted by soaked-MDT0001; in progress operations using this service will fail.
[56421.502558] Lustre: soaked-MDT0001-mdc-ffff894d2c2d6800: Connection restored to 192.168.1.109@o2ib (at 192.168.1.109@o2ib)
[57263.952065] Lustre: 11039:0:(client.c:2215:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1561874636/real 1561874636] req@ffff894c9536bf00 x1637687484812160/t0(0) o103->soaked-MDT0001-mdc-ffff894d2c2d6800@192.168.1.109@o2ib:17/18 lens 328/224 e 0 to 1 dl 1561874644 ref 1 fl Rpc:XQr/0/ffffffff rc 0/-1
[57273.851618] Lustre: 99598:0:(client.c:2215:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1561874636/real 1561874636] req@ffff894535841200 x1637687484811264/t0(0) o36->soaked-MDT0001-mdc-ffff894d2c2d6800@192.168.1.109@o2ib:12/10 lens 496/912 e 0 to 1 dl 1561874654 ref 2 fl Rpc:XQr/0/ffffffff rc 0/-1
[57597.077220] LustreError: 167-0: soaked-MDT0001-mdc-ffff894d2c2d6800: This client was evicted by soaked-MDT0001; in progress operations using this service will fail.
[57597.140513] Lustre: soaked-MDT0001-mdc-ffff894d2c2d6800: Connection restored to 192.168.1.109@o2ib (at 192.168.1.109@o2ib)
[60446.844551] Lustre: 11018:0:(client.c:2215:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1561877812/real 1561877812] req@ffff8945b6c74800 x1637687545868544/t0(0) o400->soaked-MDT0001-mdc-ffff894d2c2d6800@192.168.1.109@o2ib:12/10 lens 224/224 e 0 to 1 dl 1561877827 ref 1 fl Rpc:XNQr/0/ffffffff rc 0/-1
[60446.877832] Lustre: 11018:0:(client.c:2215:ptlrpc_expire_one_request()) Skipped 1 previous similar message
[60446.888649] Lustre: soaked-MDT0001-mdc-ffff894d2c2d6800: Connection to soaked-MDT0001 (at 192.168.1.109@o2ib) was lost; in progress operations using this service will wait for recovery to complete
[60753.796836] LustreError: 167-0: soaked-MDT0001-mdc-ffff894d2c2d6800: This client was evicted by soaked-MDT0001; in progress operations using this service will fail.
[60753.826032] Lustre: soaked-MDT0001-mdc-ffff894d2c2d6800: Connection restored to 192.168.1.109@o2ib (at 192.168.1.109@o2ib)
[63893.594833] Lustre: 11016:0:(client.c:2215:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1561881263/real 1561881263] req@ffff8947b96bda00 x1637687586473344/t0(0) o103->soaked-MDT0000-mdc-ffff894d2c2d6800@192.168.1.108@o2ib:17/18 lens 328/224 e 0 to 1 dl 1561881273 ref 1 fl Rpc:XQr/0/ffffffff rc 0/-1
[63893.628018] Lustre: soaked-MDT0000-mdc-ffff894d2c2d6800: Connection to soaked-MDT0000 (at 192.168.1.108@o2ib) was lost; in progress operations using this service will wait for recovery to complete
[63896.931016] Lustre: 11018:0:(client.c:2215:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1561881270/real 1561881270] req@ffff894535eaa880 x1637687586473472/t0(0) o400->MGC192.168.1.108@o2ib@192.168.1.108@o2ib:26/25 lens 224/224 e 0 to 1 dl 1561881277 ref 1 fl Rpc:XNQr/0/ffffffff rc 0/-1
[63896.962938] LustreError: 166-1: MGC192.168.1.108@o2ib: Connection to MGS (at 192.168.1.108@o2ib) was lost; in progress operations using this service will fail
[64198.045899] Lustre: Evicted from MGS (at MGC192.168.1.108@o2ib_0) after server handle changed from 0x436eb31673ba0526 to 0x64a87ae2ca1b751d
[64198.060003] Lustre: MGC192.168.1.108@o2ib: Connection restored to MGC192.168.1.108@o2ib_0 (at 192.168.1.108@o2ib)
[64205.060484] LustreError: 11014:0:(client.c:3113:ptlrpc_replay_interpret()) @@@ status 301, old was 0 req@ffff894822167500 x1637687549774400/t2645704454707(2645704454707) o101->soaked-MDT0000-mdc-ffff894d2c2d6800@192.168.1.108@o2ib:12/10 lens 576/568 e 0 to 0 dl 1561881606 ref 2 fl Interpret:RPQU/4/0 rc 301/301
[64205.091242] LustreError: 11014:0:(client.c:3113:ptlrpc_replay_interpret()) Skipped 1 previous similar message
[64252.095152] Lustre: soaked-MDT0000-mdc-ffff894d2c2d6800: Connection restored to 192.168.1.108@o2ib (at 192.168.1.108@o2ib)
[65590.873997] Lustre: 11014:0:(client.c:2215:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1561882964/real 1561882964] req@ffff89453b240d80 x1637687644371584/t0(0) o9->soaked-OST000c-osc-ffff894d2c2d6800@192.168.1.104@o2ib:28/4 lens 224/224 e 0 to 1 dl 1561882970 ref 1 fl Rpc:XNQr/0/ffffffff rc 0/-1
[65968.654567] Lustre: soaked-OST0008-osc-ffff894d2c2d6800: Connection restored to 192.168.1.105@o2ib (at 192.168.1.105@o2ib)
[65993.848227] Lustre: soaked-OST0000-osc-ffff894d2c2d6800: Connection restored to 192.168.1.105@o2ib (at 192.168.1.105@o2ib)
[66019.989947] Lustre: soaked-OST0004-osc-ffff894d2c2d6800: Connection restored to 192.168.1.105@o2ib (at 192.168.1.105@o2ib)
[66049.558720] Lustre: soaked-OST0008-osc-ffff894d2c2d6800: Connection to soaked-OST0008 (at 192.168.1.105@o2ib) was lost; in progress operations using this service will wait for recovery to complete
[66049.578231] Lustre: Skipped 1 previous similar message
[66050.764854] Lustre: soaked-OST0008-osc-ffff894d2c2d6800: Connection restored to 192.168.1.104@o2ib (at 192.168.1.104@o2ib)
[66050.777189] Lustre: Skipped 1 previous similar message
[66051.488551] LustreError: 11-0: soaked-OST0004-osc-ffff894d2c2d6800: operation ldlm_enqueue to node 192.168.1.105@o2ib failed: rc = -107
[66051.502145] Lustre: soaked-OST0004-osc-ffff894d2c2d6800: Connection to soaked-OST0004 (at 192.168.1.105@o2ib) was lost; in progress operations using this service will wait for recovery to complete
[66059.049901] Lustre: soaked-OST0000-osc-ffff894d2c2d6800: Connection restored to 192.168.1.104@o2ib (at 192.168.1.104@o2ib)
[66067.926132] LustreError: 11-0: soaked-OST000c-osc-ffff894d2c2d6800: operation ldlm_enqueue to node 192.168.1.105@o2ib failed: rc = -107
[66067.939745] Lustre: soaked-OST000c-osc-ffff894d2c2d6800: Connection to soaked-OST000c (at 192.168.1.105@o2ib) was lost; in progress operations using this service will wait for recovery to complete
[66084.677172] Lustre: soaked-OST000c-osc-ffff894d2c2d6800: Connection restored to 192.168.1.104@o2ib (at 192.168.1.104@o2ib)
[66084.689514] Lustre: Skipped 1 previous similar message
[67142.355034] Lustre: 11031:0:(client.c:2215:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1561884476/real 1561884476] req@ffff894595c01200 x1637687680489408/t0(0) o400->soaked-OST000f-osc-ffff894d2c2d6800@192.168.1.107@o2ib:28/4 lens 224/224 e 0 to 1 dl 1561884522 ref 1 fl Rpc:XNQr/0/ffffffff rc 0/-1
[67142.388222] Lustre: 11031:0:(client.c:2215:ptlrpc_expire_one_request()) Skipped 3 previous similar messages |
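The dmesg excerpt above mixes RPC timeouts, disconnects, reconnects and evictions, which makes the sequence hard to follow by eye. Below is a minimal sketch of how the connection events could be pulled into a timeline, assuming the log has been saved as plain text. The function name `connection_events` and the labels `lost`, `restored` and `evicted` are hypothetical and not part of Lustre; the patterns are based only on the message wording quoted in this ticket and may need adjusting for other Lustre versions.

```python
import re

# Kernel uptime stamp that starts each dmesg entry, e.g. "[52761.217758]".
# RPC fields such as "[sent 1561870097/real 1561870097]" do not match it.
STAMP = re.compile(r"\[\s*(\d+\.\d+)\]")

# Hypothetical labels mapped to phrases copied from messages in this ticket.
PATTERNS = [
    ("evicted", re.compile(r"This client was evicted by (\S+);")),
    ("lost", re.compile(r"Connection to (\S+) \(at [^)]*\) was lost")),
    ("restored", re.compile(r"(\S+): Connection restored to")),
]


def connection_events(text):
    """Return (uptime_seconds, label, target) tuples in log order.

    Works whether entries are one per line or run together, because
    entries are delimited by their uptime stamps rather than by newlines.
    """
    events = []
    stamps = list(STAMP.finditer(text))
    for i, stamp in enumerate(stamps):
        end = stamps[i + 1].start() if i + 1 < len(stamps) else len(text)
        # Collapse whitespace so wrapped entries still match the patterns.
        body = " ".join(text[stamp.end():end].split())
        for label, pattern in PATTERNS:
            hit = pattern.search(body)
            if hit:
                events.append((float(stamp.group(1)), label, hit.group(1)))
                break
    return events
```

For example, `for t, label, target in connection_events(open("dmesg.txt").read()): print(t, label, target)` would list the events in order; `dmesg.txt` is a placeholder file name.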
| Comment by Patrick Farrell (Inactive) [ 12/Jul/19 ] |
|
Sarah, is there a place I can see a timeline of our failover events? Something like "node X failed over at time Y". I'm not having much luck with the segfaults yet. They are happening a lot, and there are no obvious errors associated with them in any of the logs on the clients. They just occur and the jobs fail. Can you help me understand the precise version of master (or master-next?) we are running, and the last version (of master/master-next) that did not have this problem? |
| Comment by Sarah Liu [ 18/Jul/19 ] |
|
Hi Patrick, the failovers triggered each day are recorded in /scratch/results/soak/daily/date/soak-trig.xx. Soak is currently running lustre-master-next-ib, build version=2.12.55_58_g2c7b19e |
| Comment by Patrick Farrell (Inactive) [ 16/Aug/19 ] |
|
Sarah, Do we know if this is still happening? James mentioned failure rates were back down, but I'm not sure if this went away or not. |
| Comment by Sarah Liu [ 19/Aug/19 ] |
|
Hi Patrick, the last time we saw this problem was on 7/18/19; it has not shown up since. |
| Comment by Patrick Farrell (Inactive) [ 19/Aug/19 ] |
|
Hm, OK, cool. That's a bit of a mystery! Then let's say we'll leave this open and if it happens again on this or another branch, we'll investigate. And if nothing else happens for a while, we'll close it. |
| Comment by Patrick Farrell (Inactive) [ 04/Sep/19 ] |
|
Not clear why, but the problem went away. |
| Comment by Sarah Liu [ 26/Nov/19 ] |
|
Saw this kind of error again in 2.13.0 testing, build b2_13-ib #4:
less 354725-iorssf.out
[soak-24:04336] *** Process received signal ***
[soak-24:04336] Signal: Segmentation fault (11)
[soak-24:04336] Signal code: Address not mapped (1)
[soak-24:04336] Failing at address: (nil)
[soak-24:04336] [ 0] /lib64/libpthread.so.0(+0xf5f0)[0x7f809d09b5f0]
[soak-24:04336] *** End of error message ***
[1] At file byte offset 45464659968, comparing 2787532srun: error: soak-24: task 1: Segmentation fault
srun: Terminating job step 354725.0
slurmstepd: error: *** STEP 354725.0 ON soak-23 CANCELLED AT 2019-11-24T09:58:33 ***
[3] At file byte offset 97619398656, comparing 27875328-[4] At file byte offset 121759432704, comparin[0] At file byte offset 22941394944, comparing 27875328-byte trans[9] At file byte offset 243964870656,[7] At file byte offset 195071545344,[6] At file byte offset 170624882688,[8] At file byte offset 219518208000,srun: error: soak-27: task 4: Terminated
srun: error: soak-23: task 0: Terminated
[5] At file byte offset 146178220032,[2] At file byte offset 73172736000, comparing 27875328-srun: error: soak-26: task 3: Terminated
srun: error: soak-30: task 7: Terminated
soak-24 console log around the time the failure occurred (106200):
[106090.686894] LNetError: 7387:0:(lib-msg.c:481:lnet_handle_local_failure()) Skipped 6 previous similar messages
[106090.698463] Lustre: 3694:0:(client.c:2219:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1574589214/real 1574589220] req@ffff8b3452054380 x1650971283944320/t0(0) o400->MGC192.168.1.108@o2ib@192.168.1.108@o2ib:26/25 lens 224/224 e 0 to 1 dl 1574589221 ref 1 fl Rpc:eXNQr/0/ffffffff rc 0/-1 job:''
[106090.731460] LustreError: 166-1: MGC192.168.1.108@o2ib: Connection to MGS (at 192.168.1.108@o2ib) was lost; in progress operations using this service will fail
[106090.747966] Lustre: soaked-MDT0000-mdc-ffff8b3dc0326000: Connection to soaked-MDT0000 (at 192.168.1.108@o2ib) was lost; in progress operations using this service will wait for recovery to complete
[106294.636791] LNet: 3673:0:(o2iblnd_cb.c:3397:kiblnd_check_conns()) Timed out tx for 192.168.1.108@o2ib: 1 seconds
[106294.648267] LNet: 3673:0:(o2iblnd_cb.c:3397:kiblnd_check_conns()) Skipped 1 previous similar message
[106366.737896] LustreError: 3684:0:(client.c:3117:ptlrpc_replay_interpret()) @@@ status 301, old was 0 req@ffff8b2e585a9200 x1650970765880960/t103080343732(103080343732) o101->soaked-MDT0000-mdc-ffff8b3dc0326000@192.168.1.109@o2ib:12/10 lens 576/600 e 0 to 0 dl 1574589678 ref 2 fl Interpret:RPQU/4/0 rc 301/301 job:''
[106369.637915] LNet: 3673:0:(o2iblnd_cb.c:3397:kiblnd_check_conns()) Timed out tx for 192.168.1.108@o2ib: 18 seconds
[106369.649474] LNet: 3673:0:(o2iblnd_cb.c:3397:kiblnd_check_conns()) Skipped 19 previous similar messages
[106384.092083] Lustre: soaked-MDT0000-mdc-ffff8b3dc0326000: Connection restored to 192.168.1.109@o2ib (at 192.168.1.109@o2ib)
[106384.104524] Lustre: Skipped 1 previous similar message
[106391.825769] Lustre: Evicted from MGS (at 192.168.1.109@o2ib) after server handle changed from 0x36b0a40c5ba0a734 to 0xdd60b175004c3d06
[106413.652576] LustreError: 11-0: soaked-MDT0000-mdc-ffff8b3dc0326000: operation mds_close to node 192.168.1.109@o2ib failed: rc = -107
[106413.666012] Lustre: soaked-MDT0000-mdc-ffff8b3dc0326000: Connection to soaked-MDT0000 (at 192.168.1.109@o2ib) was lost; in progress operations using this service will wait for recovery to complete
[106442.078742] LustreError: 167-0: soaked-MDT0000-mdc-ffff8b3dc0326000: This client was evicted by soaked-MDT0000; in progress operations using this service will fail.
[106442.098484] LustreError: 7621:0:(file.c:234:ll_close_inode_openhandle()) soaked-clilmv-ffff8b3dc0326000: inode [0x20000bf8d:0xd4b:0x0] mdc close failed: rc = -5
[106442.122890] Lustre: soaked-MDT0000-mdc-ffff8b3dc0326000: Connection restored to 192.168.1.108@o2ib (at 192.168.1.108@o2ib)
[106442.135320] Lustre: Skipped 1 previous similar message
[106449.002262] Lustre: 3691:0:(client.c:2219:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1574589571/real 1574589571] req@ffff8b3093e4b180 x1650971377795584/t0(0) o400->MGC192.168.1.108@o2ib@192.168.1.109@o2ib:26/25 lens 224/224 e 0 to 1 dl 1574589578 ref 1 fl Rpc:XNQr/0/ffffffff rc 0/-1 job:''
[106449.034917] Lustre: 3691:0:(client.c:2219:ptlrpc_expire_one_request()) Skipped 1 previous similar message
[106449.045737] LustreError: 166-1: MGC192.168.1.108@o2ib: Connection to MGS (at 192.168.1.109@o2ib) was lost; in progress operations using this service will fail |
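The console log is stamped in seconds since boot, while the RPC dumps carry epoch seconds (`sent`, `real`, `dl`) and Slurm reports wall-clock time, so lining up the segfault with the eviction takes some arithmetic. Below is a minimal sketch of one way to relate the three. It assumes that a `ptlrpc_expire_one_request()` message is printed close to its `dl` deadline, which this ticket does not confirm, so the result is only an estimate good to a few seconds. The function names are hypothetical, and times are reported in UTC because the node's time zone is not stated here.

```python
from datetime import datetime, timezone
import re

# One console entry: uptime stamp at the start, "dl <epoch>" later in the entry.
ENTRY = re.compile(r"\[\s*(\d+\.\d+)\].*?\bdl (\d+)\b")


def epoch_to_utc(seconds):
    """Convert epoch seconds (as in 'sent', 'real', 'dl') to an aware UTC datetime."""
    return datetime.fromtimestamp(seconds, tz=timezone.utc)


def estimate_boot_epoch(entry):
    """Estimate the boot time in epoch seconds from a single expire-request entry.

    ASSUMPTION: the entry was logged at roughly its 'dl' deadline.
    Returns None if the entry has no uptime stamp or no 'dl' field.
    """
    match = ENTRY.search(entry)
    if match is None:
        return None
    uptime = float(match.group(1))
    deadline = int(match.group(2))
    return deadline - uptime


def uptime_to_utc(uptime, boot_epoch):
    """Map a console uptime stamp to UTC using an estimated boot epoch."""
    return epoch_to_utc(boot_epoch + uptime)
```

Passing one expire-request entry to `estimate_boot_epoch()` and then calling `uptime_to_utc()` on the stamp of an eviction message gives an approximate wall-clock time that can be compared with the `CANCELLED AT` time in the job output, provided both are interpreted in the same time zone.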
| Comment by Peter Jones [ 26/Nov/19 ] |
|
Hongchao, could you please advise? Thanks, Peter |
| Comment by Hongchao Zhang [ 29/Nov/19 ] |
|
It could be caused by the eviction of the client. Are there debug logs of the client and the MDT? |