[LU-518] replay-single test_45: Can't lstat /mnt/lustre/f45: Cannot send after transport endpoint shutdown Created: 20/Jul/11  Updated: 16/Aug/16  Resolved: 16/Aug/16

Status: Closed
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.0.0
Fix Version/s: Lustre 2.1.0

Type: Bug Priority: Minor
Reporter: Jian Yu Assignee: WC Triage
Resolution: Won't Fix Votes: 0
Labels: None
Environment:

Lustre Tag: v2_0_65_0
Lustre Build: http://newbuild.whamcloud.com/job/lustre-master/204/
e2fsprogs Build: http://newbuild.whamcloud.com/job/e2fsprogs-master/42/
Distro/Arch: RHEL6/x86_64 (in-kernel OFED, kernel version: 2.6.32-131.2.1.el6)
ENABLE_QUOTA=yes
FAILURE_MODE=HARD

MGS/MDS Nodes: client-2-ib(active), client-4-ib(passive)
\ /
1 combined MGS/MDT

OSS Nodes: fat-amd-1-ib(active), fat-amd-2-ib(active)
\ /
OST1 (active in fat-amd-1-ib)
OST2 (active in fat-amd-2-ib)
OST3 (active in fat-amd-1-ib)
OST4 (active in fat-amd-2-ib)
OST5 (active in fat-amd-1-ib)
OST6 (active in fat-amd-2-ib)

Client Nodes: fat-amd-3-ib, client-6-ib


Severity: 3
Bugzilla ID: 22981
Rank (Obsolete): 5988

 Description   

replay-single test 45 failed as follows:

== replay-single test 45: Handle failed close == 07:35:08 (1311086108)
multiop /mnt/lustre/f45 vO_c
TMPPIPE=/tmp/multiop_open_wait_pipe.17755
Can't lstat /mnt/lustre/f45: Cannot send after transport endpoint shutdown
 replay-single test_45: @@@@@@ FAIL: test_45 failed with 2 
Dumping lctl log to /home/yujian/test_logs/2011-07-19/054342/replay-single.test_45.*.1311086110.log

Dmesg on the client node fat-amd-3-ib:

Lustre: DEBUG MARKER: == replay-single test 45: Handle failed close == 07:35:08 (1311086108)
Lustre: setting import lustre-MDT0000_UUID INACTIVE by administrator request
LustreError: 18821:0:(file.c:155:ll_close_inode_openhandle()) inode 144115440136749057 mdc close failed: rc = -108
LustreError: 18826:0:(client.c:1057:ptlrpc_import_delay_req()) @@@ IMP_INVALID  req@ffff8804079b3400 x1374766420142719/t0(0) o-1->lustre-MDT0000_UUID@192.168.4.2@o2ib:12/10 lens 544/880 e 0 to 0 dl 0 ref 2 fl Rpc:/ffffffff/ffffffff rc 0/-1
LustreError: 18826:0:(mdc_locks.c:722:mdc_enqueue()) ldlm_cli_enqueue: -108
LustreError: 18826:0:(file.c:2165:ll_inode_revalidate_fini()) failure -108 inode 29
Lustre: DEBUG MARKER: replay-single test_45: @@@@@@ FAIL: test_45 failed with 2
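As a side note, the rc = -108 in the dmesg above is a negated Linux errno value (assuming the usual kernel convention of returning -errno), and decoding it yields exactly the message in the ticket title:

```python
import errno
import os

# rc = -108 in the dmesg above, negated per the kernel -errno convention.
rc = -108
name = errno.errorcode[-rc]   # symbolic errno name
message = os.strerror(-rc)    # human-readable message

print(name, "->", message)
# ESHUTDOWN -> Cannot send after transport endpoint shutdown
```

The same decoding applies to the -110 (ETIMEDOUT) seen later during recovery.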

Maloo report: https://maloo.whamcloud.com/test_sets/7f371672-b281-11e0-b33f-52540025f9af

This is a known issue on the Lustre master branch: bug 22981. Some instances were reported in bug 20997.



 Comments   
Comment by Zhenyu Xu [ 21/Jul/11 ]

For an unknown reason, the recovery reconnect request failed, leaving the MDC import to the MDS invalid from that point on.

Client 2 (fat-amd-3-ib) debug log:

1311086109.335453:0:18824:0:(recover.c:276:ptlrpc_set_import_active()) setting import lustre-MDT0000_UUID VALID
1311086109.335467:0:18824:0:(import.c:167:ptlrpc_set_import_discon()) lustre-MDT0000-mdc-ffff880117b9d400: Connection to service lustre-MDT0000 via nid 192.168.4.4@o2ib was lost; in progress operations using this service will wait for recovery to complete.
1311086109.335474:0:18824:0:(import.c:177:ptlrpc_set_import_discon()) ffff880218423000 lustre-MDT0000_UUID: changing import state from FULL to DISCONN
1311086109.335482:0:18824:0:(import.c:621:ptlrpc_connect_import()) ffff880218423000 lustre-MDT0000_UUID: changing import state from DISCONN to CONNECTING
1311086109.335489:0:18824:0:(import.c:478:import_select_connection()) lustre-MDT0000-mdc-ffff880117b9d400: connect to NID 192.168.4.4@o2ib last attempt 4305440055
1311086109.335495:0:18824:0:(import.c:478:import_select_connection()) lustre-MDT0000-mdc-ffff880117b9d400: connect to NID 192.168.4.2@o2ib last attempt 4305416002
1311086109.335505:0:18824:0:(import.c:550:import_select_connection()) Changing connection for lustre-MDT0000-mdc-ffff880117b9d400 to 192.168.4.2@o2ib/192.168.4.2@o2ib
1311086109.335509:0:18824:0:(import.c:556:import_select_connection()) lustre-MDT0000-mdc-ffff880117b9d400: import ffff880218423000 using connection 192.168.4.2@o2ib/192.168.4.2@o2ib
1311086109.335543:0:18824:0:(import.c:720:ptlrpc_connect_import()) @@@ (re)connect request (timeout 5) req@ffff880319ce9400 x1374766420142717/t0(0) o-1->lustre-MDT0000_UUID@192.168.4.2@o2ib:12/10 lens 368/392 e 0 to 0 dl 0 ref 1 fl New:N/ffffffff/ffffffff rc 0/-1
1311086109.335575:0:18824:0:(recover.c:344:ptlrpc_recover_import_no_retry()) lustre-MDT0000_UUID: recovery started, waiting
1311086109.335587:0:15748:0:(client.c:1392:ptlrpc_send_new_req()) Sending RPC pname:cluuid:pid:xid:nid:opc ptlrpcd-rcv:93a2f5a2-2e73-de1c-53c5-2e51c11c95b2:15748:1374766420142717:192.168.4.2@o2ib:38
1311086109.336967:0:15748:0:(client.c:1775:ptlrpc_expire_one_request()) @@@ Request x1374766420142717 sent from lustre-MDT0000-mdc-ffff880117b9d400 to NID 192.168.4.2@o2ib has failed due to network error: [sent 1311086109] [real_sent 1311086109] [current 1311086109] [deadline 26s] [delay -26s] req@ffff880319ce9400 x1374766420142717/t0(0) o-1->lustre-MDT0000_UUID@192.168.4.2@o2ib:12/10 lens 368/392 e 0 to 1 dl 1311086135 ref 1 fl Rpc:XN/ffffffff/ffffffff rc 0/-1
1311086109.336994:0:15748:0:(client.c:1807:ptlrpc_expire_one_request()) @@@ err 110, sent_state=CONNECTING (now=CONNECTING) req@ffff880319ce9400 x1374766420142717/t0(0) o-1>lustre-MDT0000_UUID@192.168.4.2@o2ib:12/10 lens 368/392 e 0 to 1 dl 1311086135 ref 1 fl Rpc:XN/ffffffff/ffffffff rc 0/-1
1311086109.337009:0:15748:0:(import.c:1120:ptlrpc_connect_interpret()) ffff880218423000 lustre-MDT0000_UUID: changing import state from CONNECTING to DISCONN
1311086109.337015:0:15748:0:(import.c:1166:ptlrpc_connect_interpret()) recovery of lustre-MDT0000_UUID on 192.168.4.2@o2ib failed (-110)
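The import state changes recorded in the log above can be sketched as a tiny state machine (illustrative Python only, not Lustre code; state and event names are taken from the log lines):

```python
# Illustrative sketch (not Lustre code) of the import state transitions
# seen in the debug log: the reconnect attempt fails with -110 (ETIMEDOUT),
# so the import falls back to DISCONN, and subsequent requests against it
# are refused with -108 (ESHUTDOWN).
transitions = {
    ("FULL", "connection lost"): "DISCONN",
    ("DISCONN", "connect attempt"): "CONNECTING",
    ("CONNECTING", "connect failed"): "DISCONN",
}

state = "FULL"
for event in ("connection lost", "connect attempt", "connect failed"):
    state = transitions[(state, event)]

print(state)  # the import never returns to FULL
```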

Comment by James A Simmons [ 16/Aug/16 ]

Old ticket for unsupported version

Generated at Sat Feb 10 01:07:50 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.