|
from the logs, it seems the bad network is the reason to cause eviction, take OST0016 as an example,
1, the server side:
output of ' grep "OST0016" oss6/server-log.133910* |grep "evict" '
oss6/server-log.1339107443.28502.txt:00000100:00080000:9.0:1339107438.377395:0:28621:0:(service.c:829:ptlrpc_update_export_timer()) lustre-OST0016: Think about evicting 192.168.115.106@o2ib1 from 1339107065
oss6/server-log.1339107515.28820.txt:00000100:00080000:5.0:1339107514.029089:0:28820:0:(pinger.c:642:ping_evictor_main()) evicting all exports of obd lustre-OST0016 older than 1339107364
oss6/server-log.1339107515.28820.txt:00000100:02000400:5.0:1339107514.029092:0:28820:0:(pinger.c:668:ping_evictor_main()) lustre-OST0016: haven't heard from client b437de75-d7b5-c8e1-cda3-0fc115566ca1 (at 192.168.113.136@o2ib1) in 430 seconds. I think it's dead, and I am evicting it. exp ffff8802eceaac00, cur 1339107514 expire 1339107364 last 1339107084
oss6/server-log.1339107515.28820.txt:00000100:02000400:5.0:1339107514.029197:0:28820:0:(pinger.c:668:ping_evictor_main()) lustre-OST0016: haven't heard from client 502ca4cf-aa86-f823-123d-b12e548e49ba (at 192.168.114.16@o2ib1) in 425 seconds. I think it's dead, and I am evicting it. exp ffff880489661000, cur 1339107514 expire 1339107364 last 1339107089
oss6/server-log.1339107515.28820.txt:00000100:02000400:5.0:1339107514.029308:0:28820:0:(pinger.c:668:ping_evictor_main()) lustre-OST0016: haven't heard from client ad17b65f-2e05-a801-a642-28187f3ab454 (at 192.168.114.12@o2ib1) in 425 seconds. I think it's dead, and I am evicting it. exp ffff8801ec20f400, cur 1339107514 expire 1339107364 last 1339107089
oss6/server-log.1339107515.28820.txt:00000100:02000400:5.0:1339107514.029382:0:28820:0:(pinger.c:668:ping_evictor_main()) lustre-OST0016: haven't heard from client 5d184491-373a-66ab-0e22-d1fb789b66ee (at 192.168.114.13@o2ib1) in 425 seconds. I think it's dead, and I am evicting it. exp ffff88053cc38400, cur 1339107514 expire 1339107364 last 1339107089
oss6/server-log.1339108693.6659.txt:00000100:00080000:4.0:1339108687.119017:0:28626:0:(service.c:829:ptlrpc_update_export_timer()) lustre-OST0016: Think about evicting 192.168.113.111@o2ib1 from 1339107938
the OST0016 evicts these clients for it don't send ping requests to keep alive
2, the client side
the excerpt of output ' grep "OST0016" hy260.debug.log hy281/* '
...
hy260.debug.log:00000100:00000400:1.0:1339107138.280452:0:19715:0:(client.c:1780:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1339107114/real 1339107138] req@ffff8801d4d6e000 x1404138588850748/t0(0) o400->lustre-OST0016-osc-ffff880226efb000@192.168.127.65@o2ib1:28/4 lens 192/192 e 0 to 1 dl 1339107220 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
hy260.debug.log:00000100:00100000:1.0:1339107138.280460:0:19715:0:(client.c:1823:ptlrpc_expire_one_request()) @@@ TIMEOUT-NORESEND: req@ffff8801d4d6e000 x1404138588850748/t0(0) o400->lustre-OST0016-osc-ffff880226efb000@192.168.127.65@o2ib1:28/4 lens 192/192 e 0 to 1 dl 1339107220 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
hy260.debug.log:00000100:02000400:1.0:1339107138.280465:0:19715:0:(import.c:170:ptlrpc_set_import_discon()) lustre-OST0016-osc-ffff880226efb000: Connection to lustre-OST0016 (at 192.168.127.65@o2ib1) was lost; in progress operations using this service will wait for recovery to complete
hy260.debug.log:00000100:00080000:1.0:1339107138.280466:0:19715:0:(import.c:180:ptlrpc_set_import_discon()) ffff880110a9c000 lustre-OST0016_UUID: changing import state from FULL to DISCONN
hy260.debug.log:00000100:00080000:1.0:1339107138.280467:0:19715:0:(import.c:415:ptlrpc_fail_import()) lustre-OST0016_UUID: waking up pinger
hy260.debug.log:00000100:00080000:4.0:1339107138.280493:0:19717:0:(recover.c:71:ptlrpc_initiate_recovery()) lustre-OST0016_UUID: starting recovery
hy260.debug.log:00000100:00080000:4.0:1339107138.280494:0:19717:0:(import.c:626:ptlrpc_connect_import()) ffff880110a9c000 lustre-OST0016_UUID: changing import state from DISCONN to CONNECTING
hy260.debug.log:00000100:00080000:4.0:1339107138.280496:0:19717:0:(import.c:482:import_select_connection()) lustre-OST0016-osc-ffff880226efb000: connect to NID 192.168.127.65@o2ib1 last attempt 4309519608
hy260.debug.log:00000100:00080000:4.0:1339107138.280499:0:19717:0:(import.c:561:import_select_connection()) lustre-OST0016-osc-ffff880226efb000: import ffff880110a9c000 using connection 192.168.127.65@o2ib1/192.168.127.65@o2ib1
hy260.debug.log:00000100:00100000:4.0:1339107138.280505:0:19717:0:(import.c:725:ptlrpc_connect_import()) @@@ (re)connect request (timeout 5) req@ffff8801dee51c00 x1404138588850806/t0(0) o8->lustre-OST0016-osc-ffff880226efb000@192.168.127.65@o2ib1:28/4 lens 368/512 e 0 to 0 dl 0 ref 1 fl New:N/0/ffffffff rc 0/-1
hy260.debug.log:00000100:00080000:4.0:1339107138.280707:0:19717:0:(pinger.c:258:ptlrpc_pinger_process_import()) not pinging lustre-OST0016_UUID (in recovery or recovery disabled: CONNECTING)
hy260.debug.log:00000100:00080000:4.0:1339107138.281232:0:19717:0:(pinger.c:258:ptlrpc_pinger_process_import()) not pinging lustre-OST0016_UUID (in recovery or recovery disabled: CONNECTING)
hy260.debug.log:00000100:00000400:2.0:1339107138.283718:0:19716:0:(client.c:1780:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1339107138/real 1339107138] req@ffff8801dee51c00 x1404138588850806/t0(0) o8->lustre-OST0016-osc-ffff880226efb000@192.168.127.65@o2ib1:28/4 lens 368/512 e 0 to 1 dl 1339107188 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
hy260.debug.log:00000100:00100000:2.0:1339107138.283723:0:19716:0:(client.c:1812:ptlrpc_expire_one_request()) @@@ err 110, sent_state=CONNECTING (now=CONNECTING) req@ffff8801dee51c00 x1404138588850806/t0(0) o8>lustre-OST0016-osc-ffff880226efb000@192.168.127.65@o2ib1:28/4 lens 368/512 e 0 to 1 dl 1339107188 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
hy260.debug.log:00000100:00080000:2.0:1339107138.283728:0:19716:0:(import.c:1111:ptlrpc_connect_interpret()) ffff880110a9c000 lustre-OST0016_UUID: changing import state from CONNECTING to DISCONN
hy260.debug.log:00000100:00080000:2.0:1339107138.283729:0:19716:0:(import.c:1157:ptlrpc_connect_interpret()) recovery of lustre-OST0016_UUID on 192.168.127.65@o2ib1 failed (-110)
...
the client try several times before it can reach the server!
|