[LU-5658] sanity test_17n: destroy remote dir error 0 Created: 24/Sep/14  Updated: 16/Jul/15  Resolved: 15/Oct/14

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.7.0
Fix Version/s: None

Type: Bug Priority: Critical
Reporter: Maloo Assignee: Di Wang
Resolution: Duplicate Votes: 0
Labels: None

Severity: 3
Rank (Obsolete): 15857

 Description   

This issue was created by maloo for Amir Shehata <amir.shehata@intel.com>

Seeing the following errors:

LustreError: 11-0: MGC10.1.4.222@tcp: Communicating with 10.1.4.222@tcp, operation obd_ping failed with -107.
LustreError: 166-1: MGC10.1.4.222@tcp: Connection to MGS (at 10.1.4.222@tcp) was lost; in progress operations using this service will fail
LustreError: 8827:0:(mgc_request.c:517:do_requeue()) failed processing log: -5

LustreError: 11-0: lustre-MDT0001-mdc-ffff88007981bc00: Communicating with 10.1.4.218@tcp, operation mds_statfs failed with -107.
LustreError: Skipped 1 previous similar message
Lustre: lustre-MDT0001-mdc-ffff88007981bc00: Connection to lustre-MDT0001 (at 10.1.4.218@tcp) was lost; in progress operations using this service will wait for recovery to complete
LustreError: 4138:0:(client.c:2802:ptlrpc_replay_interpret()) @@@ status 301, old was 0  req@ffff880079cf9000 x1479964115341320/t4294967394(4294967394) o101->lustre-MDT0001-mdc-ffff88007981bc00@10.1.4.218@tcp:12/10 lens 592/544 e 0 to 0 dl 1411404116 ref 2 fl Interpret:RP/4/0 rc 301/301

This issue relates to the following test suite run: https://testing.hpdd.intel.com/test_sets/c65731d8-42b4-11e4-b87c-5254006e85c2.



 Comments   
Comment by Jodi Levi (Inactive) [ 26/Sep/14 ]

Di,
Can you please have a look at this one and comment?
Thank you!

Comment by Andreas Dilger [ 26/Sep/14 ]

It looks like this is a recent regression.

Comment by Di Wang [ 06/Oct/14 ]

Hmm, I saw this in the MDS console messages:

Lustre: Skipped 3 previous similar messages
Lustre: lustre-MDT0001: Recovery over after 0:04, of 5 clients 5 recovered and 0 were evicted.
LustreError: 4554:0:(lod_lov.c:698:validate_lod_and_idx()) lustre-MDT0001-mdtlov: bad idx: 4 of 32
Lustre: DEBUG MARKER: /usr/sbin/lctl mark  sanity test_17n: @@@@@@ FAIL: destroy remote dir error 0 
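
For context, the "bad idx: 4 of 32" message is an OST-index validation failure against the MDT's layout target table. Below is a hypothetical userspace C sketch of that kind of check (not the real lod_lov.c code): an index can fit inside the table yet still be rejected because no target was ever configured at that slot, which is what replaying a stale config log that lacks the OST records would produce.

#include <errno.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical model of an MDT's layout target table: a fixed number of
 * slots, only some of which were populated by config log records. */
struct tgt_table {
        int  size;          /* number of slots, e.g. 32 */
        bool present[32];   /* was slot i set up by a config record? */
};

/* Sketch of the kind of validation that logs "bad idx: N of M": the index
 * must be inside the table AND the slot must actually hold a target.
 * Replaying a stale local config log that predates the OST additions
 * leaves the slot empty, so the check fails even though 4 < 32. */
static int validate_idx_sketch(const struct tgt_table *tbl, int idx)
{
        if (idx < 0 || idx >= tbl->size || !tbl->present[idx]) {
                fprintf(stderr, "bad idx: %d of %d\n", idx, tbl->size);
                return -EINVAL;
        }
        return 0;
}

int main(void)
{
        struct tgt_table tbl = { .size = 32 };

        /* Stale config: only targets 0-3 were recorded before the local
         * copy was taken, so index 4 is missing. */
        for (int i = 0; i < 4; i++)
                tbl.present[i] = true;

        return validate_idx_sketch(&tbl, 4) ? 1 : 0;
}
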
Comment by Di Wang [ 07/Oct/14 ]

http://review.whamcloud.com/#/c/12202

Comment by Di Wang [ 15/Oct/14 ]

Just checked the debug log; the sequence of events seems to be this:

1. MDT0/MGS restarts.

2. MDT1 restarts and tries to reuse the MGC (note: the MGC is shared by MDT1/MDT2/MDT3), which has been evicted by the MGS because of 1. So MDT1 cannot fetch the config log through the MGC and falls back to the local config log (see the sketch after step 3).

00000100:02000400:0.0:1411404071.644549:0:3397:0:(import.c:950:ptlrpc_connect_interpret()) Evicted from MGS (at 10.1.4.222@tcp) after server handle changed from 0x82e1444051c34df2 to 0x82e1444051c3568f
......
10000000:01000000:1.0:1411404071.645599:0:11810:0:(mgc_request.c:1866:mgc_process_log()) Can't get cfg lock: -108
10000000:01000000:1.0:1411404071.645900:0:11810:0:(mgc_request.c:1777:mgc_process_cfg_log()) Failed to get MGS log lustre-MDT0001, using local copy for now, will try to update later.

3. Unfortunately, the local config log is stale and does not include the OSTs yet, which causes the problem.
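
For illustration, a minimal userspace C sketch of the step-2 fallback described above: when the shared MGC cannot take the config lock (it has just been evicted, hence the -108), processing continues with the local copy of the config log instead of retrying. All function names here are hypothetical stand-ins, not the actual mgc_request.c code.

#include <errno.h>
#include <stdio.h>

/* Hypothetical stand-in: the MGC was just evicted by the restarted MGS,
 * so taking the config lock fails with -ESHUTDOWN (-108), as in the log. */
static int mgc_enqueue_config_lock(const char *logname)
{
        (void)logname;
        return -ESHUTDOWN;
}

/* Hypothetical stand-in: replay the locally cached copy of the config
 * log.  If the copy predates the OST additions, it is stale. */
static int process_local_config_copy(const char *logname)
{
        printf("Failed to get MGS log %s, using local copy for now\n", logname);
        return 0;
}

/* Hypothetical stand-in: fetch and process the current log from the MGS. */
static int process_mgs_config_log(const char *logname)
{
        printf("processed %s from MGS\n", logname);
        return 0;
}

/* Sketch of the step-2 fallback: on lock failure the code does not retry;
 * it continues with the (possibly stale) local configuration. */
static int mgc_process_cfg_log_sketch(const char *logname)
{
        int rc = mgc_enqueue_config_lock(logname);

        if (rc != 0) {
                fprintf(stderr, "Can't get cfg lock: %d\n", rc);
                return process_local_config_copy(logname);
        }
        return process_mgs_config_log(logname);
}

int main(void)
{
        return mgc_process_cfg_log_sketch("lustre-MDT0001") ? 1 : 0;
}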

Comment by Di Wang [ 15/Oct/14 ]

Hmm, we should probably retry the MGC request instead of failing it in step 2. So the fix for LU-5420 (http://review.whamcloud.com/#/c/11258/) should fix this issue as well.
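
A rough userspace C sketch of the retry idea, with hypothetical names that do not reflect the actual LU-5420 patch: keep retrying the config lock for a while (giving the MGC time to reconnect after the eviction) and only fall back to the possibly stale local copy if every attempt fails.

#include <stdio.h>
#include <unistd.h>

/* Hypothetical stand-in for taking the config lock on the MGS.  Here it
 * fails twice (as if the shared MGC is still reconnecting after being
 * evicted) and then succeeds, just to exercise the retry loop. */
static int mgc_enqueue_config_lock(const char *logname)
{
        static int calls;

        (void)logname;
        return (++calls < 3) ? -108 : 0;  /* -108 == -ESHUTDOWN in the log */
}

/* Sketch of "retry instead of failing": try the MGS a few times with a
 * short delay, and only fall back to the local config copy if every
 * attempt fails. */
static int mgc_get_cfg_lock_with_retry(const char *logname, int max_tries)
{
        int rc = -1;

        for (int i = 0; i < max_tries; i++) {
                rc = mgc_enqueue_config_lock(logname);
                if (rc == 0)
                        return 0;

                fprintf(stderr, "Can't get cfg lock for %s: %d, retry %d/%d\n",
                        logname, rc, i + 1, max_tries);
                sleep(1); /* give the MGC time to reconnect */
        }
        return rc; /* caller would fall back to the local copy here */
}

int main(void)
{
        int rc = mgc_get_cfg_lock_with_retry("lustre-MDT0001", 5);

        printf("config lock rc = %d\n", rc);
        return rc ? 1 : 0;
}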

Comment by Di Wang [ 15/Oct/14 ]

Duplicate of LU-5420.

Comment by Andreas Dilger [ 15/Oct/14 ]

Di, can you please look into why this patch is failing conf-sanity? That is preventing it from landing:
https://maloo.whamcloud.com/test_sessions/d2739d04-50c8-11e4-aa89-5254006e85c2
https://maloo.whamcloud.com/test_sessions/18ef3062-50de-11e4-ac0f-5254006e85c2

Comment by Di Wang [ 15/Oct/14 ]

Andreas, I just updated the patch; please have a look.

Comment by Jinshan Xiong (Inactive) [ 16/Oct/14 ]

new occurrence: https://testing.hpdd.intel.com/test_sets/21a318ac-54f0-11e4-92b6-5254006e85c2

Comment by nasf (Inactive) [ 29/Oct/14 ]

Another failure instance:
https://testing.hpdd.intel.com/test_sets/83f29682-5ec7-11e4-a2a3-5254006e85c2
