[LU-7820] jobs crash with llite_lib.c:2309:ll_prep_inode()) new_inode -fatal: rc -5 Created: 26/Feb/16  Updated: 24/Jan/17  Resolved: 24/Jan/17

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.8.0
Fix Version/s: None

Type: Bug Priority: Critical
Reporter: Frank Heckes (Inactive) Assignee: WC Triage
Resolution: Won't Fix Votes: 0
Labels: soak
Environment:

lola
build: https://build.hpdd.intel.com/job/lustre-b2_8/8/


Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

The error occurs during soak testing of build '20160224' (b2_8 RC2) (see:
https://wiki.hpdd.intel.com/pages/viewpage.action?title=Soak+Testing+on+Lola&spaceKey=Releases#SoakTestingonLola-20150224). DNE is enabled.
The MDSes were formatted with ldiskfs and the OSTs with ZFS. The MDSes run in an active-active HA failover configuration.

Application mdtest (one file per process) jobs crash with the following errors:

  JOBID          ERROR MESSAGE
-- 445604 :  201602 25 15:08:35 : Process 1(lola-31.lola.whamcloud.com): FAILED in main, Unable to change to test directory: Input/output error
-- 445605 :  201602 25 15:07:42 : Process 3(lola-32.lola.whamcloud.com): FAILED in main, Unable to change to test directory: Input/output error
-- 445415 :  201602 25 11:27:11 : Process 3(lola-34.lola.whamcloud.com): FAILED in main, Unable to change to test directory: Input/output error
-- 445416 :  201602 25 11:28:45 : Process 3(lola-32.lola.whamcloud.com): FAILED in main, Unable to change to test directory: Input/output error
-- 445270 :  201602 25 08:05:01 : Process 4(lola-31.lola.whamcloud.com): FAILED in main, Unable to change to test directory: Input/output error
-- 445271 :  201602 25 08:04:34 : Process 1(lola-29.lola.whamcloud.com): FAILED in main, Unable to change to test directory: Input/output error

The following Lustre errors on the MDS and client nodes can be correlated with these job failures:

---- Incident 25 15:08:35 ----
lola-11.log:Feb 25 15:08:35 lola-11 kernel: Lustre: soaked-MDT0006: Connection restored to 300cd577-7ec5-3892-b093-9d631f897cda (at 192.168.1.131@o2ib100)
lola-11.log:Feb 25 15:08:35 lola-11 kernel: Lustre: Skipped 254 previous similar messages
lola-31.log:Feb 25 15:08:35 lola-31 kernel: LustreError: 167-0: soaked-MDT0006-mdc-ffff88086597e800: This client was evicted by soaked-MDT0006; in progress operations using this service will fail.
lola-31.log:Feb 25 15:08:35 lola-31 kernel: LustreError: 120434:0:(llite_lib.c:2309:ll_prep_inode()) new_inode -fatal: rc -5
lola-31.log:Feb 25 15:08:35 lola-31 kernel: Lustre: soaked-MDT0006-mdc-ffff88086597e800: Connection restored to 192.168.1.111@o2ib10 (at 192.168.1.111@o2ib10)
---- Incident 25 15:07:42 ----
lola-32.log:Feb 25 15:07:42 lola-32 kernel: LustreError: 167-0: soaked-MDT0006-mdc-ffff88082f4c4000: This client was evicted by soaked-MDT0006; in progress operations using this service will fail.
lola-32.log:Feb 25 15:07:42 lola-32 kernel: LustreError: 133347:0:(llite_lib.c:2309:ll_prep_inode()) new_inode -fatal: rc -5
lola-32.log:Feb 25 15:07:42 lola-32 kernel: LustreError: 133347:0:(llite_lib.c:2309:ll_prep_inode()) Skipped 2 previous similar messages
lola-32.log:Feb 25 15:07:42 lola-32 kernel: Lustre: soaked-MDT0006-mdc-ffff88082f4c4000: Connection restored to 192.168.1.111@o2ib10 (at 192.168.1.111@o2ib10)
---- Incident 25 11:27:11 ----
lola-31.log:Feb 25 11:27:11 lola-31 kernel: LustreError: 105033:0:(llite_lib.c:2309:ll_prep_inode()) new_inode -fatal: rc -4
lola-34.log:Feb 25 11:27:11 lola-34 kernel: LustreError: 167-0: soaked-MDT0002-mdc-ffff88102fa38000: This client was evicted by soaked-MDT0002; in progress operations using this service will fail.
lola-34.log:Feb 25 11:27:11 lola-34 kernel: LustreError: 105947:0:(llite_lib.c:2309:ll_prep_inode()) new_inode -fatal: rc -5
lola-34.log:Feb 25 11:27:11 lola-34 kernel: Lustre: soaked-MDT0002-mdc-ffff88102fa38000: Connection restored to 192.168.1.109@o2ib10 (at 192.168.1.109@o2ib10)
---- Incident 25 11:28:45 ----
lola-32.log:Feb 25 11:28:45 lola-32 kernel: LustreError: 167-0: soaked-MDT0002-mdc-ffff88082f4c4000: This client was evicted by soaked-MDT0002; in progress operations using this service will fail.
lola-32.log:Feb 25 11:28:45 lola-32 kernel: LustreError: 117554:0:(llite_lib.c:2309:ll_prep_inode()) new_inode -fatal: rc -5
lola-32.log:Feb 25 11:28:45 lola-32 kernel: Lustre: soaked-MDT0002-mdc-ffff88082f4c4000: Connection restored to 192.168.1.109@o2ib10 (at 192.168.1.109@o2ib10)
lola-32.log:Feb 25 11:28:45 lola-32 kernel: LustreError: 117554:0:(llite_lib.c:2309:ll_prep_inode()) Skipped 2 previous similar messages
---- Incident 25 08:05:01 ----
lola-31.log:Feb 25 08:05:01 lola-31 kernel: LustreError: 167-0: soaked-MDT0002-mdc-ffff88086597e800: This client was evicted by soaked-MDT0002; in progress operations using this service will fail.
lola-31.log:Feb 25 08:05:01 lola-31 kernel: LustreError: 89849:0:(file.c:180:ll_close_inode_openhandle()) soaked-clilmv-ffff88086597e800: inode [0x28000bf82:0x69f4:0x0] mdc close failed: rc = -5
lola-31.log:Feb 25 08:05:01 lola-31 kernel: LustreError: 91182:0:(llite_lib.c:2309:ll_prep_inode()) new_inode -fatal: rc -5
lola-31.log:Feb 25 08:05:01 lola-31 kernel: Lustre: soaked-MDT0002-mdc-ffff88086597e800: Connection restored to 192.168.1.109@o2ib10 (at 192.168.1.109@o2ib10)
---- Incident 25 08:04:34 ----
lola-29.log:Feb 25 08:04:34 lola-29 kernel: LustreError: 167-0: soaked-MDT0002-mdc-ffff880871eec800: This client was evicted by soaked-MDT0002; in progress operations using this service will fail.
lola-29.log:Feb 25 08:04:34 lola-29 kernel: LustreError: 1037:0:(file.c:180:ll_close_inode_openhandle()) soaked-clilmv-ffff880871eec800: inode [0x28000bf82:0x66f3:0x0] mdc close failed: rc = -5
lola-29.log:Feb 25 08:04:34 lola-29 kernel: LustreError: 1043:0:(vvp_io.c:1519:vvp_io_init()) soaked: refresh file layout [0x28000a816:0x1c0e2:0x0] error -5.
lola-29.log:Feb 25 08:04:34 lola-29 kernel: Lustre: soaked-MDT0002-mdc-ffff880871eec800: Connection restored to 192.168.1.109@o2ib10 (at 192.168.1.109@o2ib10)
lola-29.log:Feb 25 08:04:34 lola-29 kernel: LustreError: 1037:0:(file.c:180:ll_close_inode_openhandle()) Skipped 3 previous similar messages
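
The rc values in the log lines above appear to be negated Linux errno codes, which would tie the kernel-side "rc -5" to the "Input/output error" reported by mdtest. The following is a minimal sketch for decoding them, assuming it runs on a Linux host with the standard errno numbering; the helper name describe_rc is hypothetical and not part of Lustre.

```python
import errno
import os


def describe_rc(rc):
    """Map a negative kernel-style return code to (errno name, message).

    Assumption: rc is a negated errno value, as is conventional for
    Linux kernel code. The message text comes from the local C library.
    """
    code = abs(rc)
    name = errno.errorcode.get(code, "UNKNOWN")
    return name, os.strerror(code)


if __name__ == "__main__":
    # The two rc values that appear in the ll_prep_inode() messages.
    for rc in (-5, -4):
        name, message = describe_rc(rc)
        print(f"rc {rc}: {name} ({message})")
```

Under that assumption, rc -5 decodes to EIO and the single rc -4 on lola-31 decodes to EINTR.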

The errors happened after the following MDS failovers:

mds_failover     : 2016-02-25 14:52:36,099 - 2016-02-25 14:59:44,541     lola-11
mds_failover     : 2016-02-25 11:06:59,431 - 2016-02-25 11:16:18,956     lola-9
mds_failover     : 2016-02-25 07:45:03,939 - 2016-02-25 07:54:18,970     lola-9

Is the eviction an expected part of the workflow?



 Comments   
Comment by Cliff White (Inactive) [ 24/Jan/17 ]

Old issue from 2.8

Generated at Sat Feb 10 02:12:12 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.