[LU-2354] replay-single test_89 failed to remount the OST Created: 19/Jan/12  Updated: 09/Jan/20  Resolved: 09/Jan/20

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.4.0
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: Li Wei (Inactive) Assignee: Li Wei (Inactive)
Resolution: Cannot Reproduce Votes: 0
Labels: None

Issue Links:
Related
Severity: 3
Rank (Obsolete): 2880

 Description   

https://maloo.whamcloud.com/test_sets/5c8be67a-4223-11e1-9650-5254004bbbd3

== replay-single test 89: no disk space leak on late ost connection ================================== 06:57:48 (1326898668)
Waiting for orphan cleanup...
10+0 records in
10+0 records out
10485760 bytes (10 MB) copied, 0.432742 seconds, 24.2 MB/s
Stopping /mnt/ost1 (opts:)
Failing mds1 on node client-28vm7
Stopping /mnt/mds1 (opts:)
affected facets: mds1
Failover mds1 to client-28vm7
06:58:10 (1326898690) waiting for client-28vm7 network 900 secs ...
06:58:10 (1326898690) network interface is UP
Starting mds1: -o user_xattr,acl  /dev/lvm-MDS/P0 /mnt/mds1
client-28vm7: debug=0x33f0404
client-28vm7: subsystem_debug=0xffb7e3ff
client-28vm7: debug_mb=32
Started lustre-MDT0000
Starting ost1:   /dev/lvm-OSS/P0 /mnt/ost1
client-28vm8: mount.lustre: mount /dev/mapper/lvm--OSS-P0 at /mnt/ost1 failed: Transport endpoint is not connected
mount -t lustre  /dev/lvm-OSS/P0 /mnt/ost1
Start of /dev/lvm-OSS/P0 on ost1 failed 107
Starting client: client-28vm5.lab.whamcloud.com: -o user_xattr,acl,flock client-28vm7@tcp:/lustre /mnt/lustre
debug=0x33f0404
subsystem_debug=0xffb7e3ff
debug_mb=32


 Comments   
Comment by Li Wei (Inactive) [ 01/Feb/12 ]

From the OSS debug log:

00000020:01000004:0.0:1326898692.027873:0:29847:0:(obd_mount.c:374:lustre_start_mgc()) MGC10.10.4.170@tcp: Set MGC reconnect 1
10000000:01000000:0.0:1326898692.027875:0:29847:0:(mgc_request.c:873:mgc_set_info_async()) InitRecov MGC10.10.4.170@tcp 1/d0:i0:r0:or0:FULL

An existing MGC was found and reused during the OST mount; its import state was FULL.

00000020:01000004:0.0:1326898692.027877:0:29847:0:(obd_mount.c:837:server_start_targets()) starting target lustre-OST0000
00000020:01000004:0.0:1326898692.027881:0:29847:0:(obd_mount.c:814:server_register_target()) Registration , fs=lustre, 10.10.4.171@tcp, index=0000, flags=0x2
10000000:01000000:0.0:1326898692.027883:0:29847:0:(mgc_request.c:887:mgc_set_info_async()) register_target  0x2
10000000:01000000:0.0:1326898692.027900:0:29847:0:(mgc_request.c:838:mgc_target_register()) register
00000100:00100000:0.0:1326898692.027907:0:29847:0:(client.c:1332:ptlrpc_send_new_req()) Sending RPC pname:cluuid:pid:xid:nid:opc mount.lustre:f41320d4-71d5-d208-e36e-4a91bdddc25a:29847:1391341666777465:10.10.4.170@tcp:253

The OST was trying to register with the MGS.

00000100:02020000:0.0:1326898692.029258:0:29847:0:(client.c:1062:ptlrpc_check_status()) 11-0: an error occurred while communicating with 10.10.4.170@tcp. The mgs_target_reg operation failed with -107
00000100:00080000:0.0:1326898692.029261:0:29847:0:(recover.c:214:ptlrpc_request_handle_notconn()) import MGC10.10.4.170@tcp of MGS@MGC10.10.4.170@tcp_0 abruptly disconnected: reconnecting
00000100:02020000:0.0:1326898692.029264:0:29847:0:(import.c:177:ptlrpc_set_import_discon()) 166-1: MGC10.10.4.170@tcp: Connection to service MGS via nid 10.10.4.170@tcp was lost; in progress operations using this service will fail.
00000100:00080000:0.0:1326898692.029266:0:29847:0:(import.c:180:ptlrpc_set_import_discon()) ffff88006b5ff800 MGS: changing import state from FULL to DISCONN
10000000:01000000:0.0:1326898692.029268:0:29847:0:(mgc_request.c:952:mgc_import_event()) import event 0x808001
00000100:00080000:0.0:1326898692.029269:0:29847:0:(recover.c:223:ptlrpc_request_handle_notconn()) import MGS@MGC10.10.4.170@tcp_0 for MGC10.10.4.170@tcp not replayable, auto-deactivating
00000100:00080000:0.0:1326898692.029270:0:29847:0:(import.c:207:ptlrpc_deactivate_and_unlock_import()) setting import MGS INVALID
[...]
00000100:00100000:0.0:1326898692.029286:0:29847:0:(client.c:2591:ptlrpc_abort_inflight()) @@@ inflight  req@ffff88003cdacc00 x1391341666777465/t0(0) o-1->MGS@MGC10.10.4.170@tcp_0:26/25 lens 4736/192 e 0 to 0 dl 1326898709 ref 2 fl Rpc:R/ffffffff/ffffffff rc 0/-1

Because the MDT/MGS had restarted, the OST's registration request was answered with -ENOTCONN (-107). The MGC import dropped from FULL to DISCONN, and since an MGS import is not replayable, it was marked INVALID and deactivated. The in-flight registration RPC failed as well and was never resent.

00000100:00100000:0.0:1326898692.029336:0:29847:0:(client.c:1666:ptlrpc_check_set()) Completed RPC pname:cluuid:pid:xid:nid:opc mount.lustre:f41320d4-71d5-d208-e36e-4a91bdddc25a:29847:1391341666777465:10.10.4.170@tcp:253
00000020:02020000:0.0:1326898692.029341:0:29847:0:(obd_mount.c:846:server_start_targets()) 0-0: lustre-OST0000: Required registration failed: -107
00000020:00020000:0.0:1326898692.029863:0:29847:0:(obd_mount.c:1171:lustre_server_mount()) Unable to start targets: -107

The registration failure then propagated up the stack, causing the OST mount to fail with -107.

Comment by Andreas Dilger [ 09/Jan/20 ]

Close old ticket.

Generated at Sat Feb 10 01:24:30 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.