[LU-5658] sanity test_17n: destroy remote dir error 0 Created: 24/Sep/14 Updated: 16/Jul/15 Resolved: 15/Oct/14 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.7.0 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Critical |
| Reporter: | Maloo | Assignee: | Di Wang |
| Resolution: | Duplicate | Votes: | 0 |
| Labels: | None | ||
| Severity: | 3 |
| Rank (Obsolete): | 15857 |
| Description |
|
This issue was created by maloo for Amir Shehata <amir.shehata@intel.com>

Seeing the following errors:

LustreError: 11-0: MGC10.1.4.222@tcp: Communicating with 10.1.4.222@tcp, operation obd_ping failed with -107.
LustreError: 166-1: MGC10.1.4.222@tcp: Connection to MGS (at 10.1.4.222@tcp) was lost; in progress operations using this service will fail
LustreError: 8827:0:(mgc_request.c:517:do_requeue()) failed processing log: -5
LustreError: 11-0: lustre-MDT0001-mdc-ffff88007981bc00: Communicating with 10.1.4.218@tcp, operation mds_statfs failed with -107.
LustreError: Skipped 1 previous similar message
Lustre: lustre-MDT0001-mdc-ffff88007981bc00: Connection to lustre-MDT0001 (at 10.1.4.218@tcp) was lost; in progress operations using this service will wait for recovery to complete
LustreError: 4138:0:(client.c:2802:ptlrpc_replay_interpret()) @@@ status 301, old was 0 req@ffff880079cf9000 x1479964115341320/t4294967394(4294967394) o101->lustre-MDT0001-mdc-ffff88007981bc00@10.1.4.218@tcp:12/10 lens 592/544 e 0 to 0 dl 1411404116 ref 2 fl Interpret:RP/4/0 rc 301/301

This issue relates to the following test suite run: https://testing.hpdd.intel.com/test_sets/c65731d8-42b4-11e4-b87c-5254006e85c2. |
| Comments |
| Comment by Jodi Levi (Inactive) [ 26/Sep/14 ] |
|
Di, |
| Comment by Andreas Dilger [ 26/Sep/14 ] |
|
It looks like this is a recent regression. |
| Comment by Di Wang [ 06/Oct/14 ] |
|
Hmm, I saw this on the MDS console:

Lustre: Skipped 3 previous similar messages
Lustre: lustre-MDT0001: Recovery over after 0:04, of 5 clients 5 recovered and 0 were evicted.
LustreError: 4554:0:(lod_lov.c:698:validate_lod_and_idx()) lustre-MDT0001-mdtlov: bad idx: 4 of 32
Lustre: DEBUG MARKER: /usr/sbin/lctl mark sanity test_17n: @@@@@@ FAIL: destroy remote dir error 0 |
| Comment by Di Wang [ 07/Oct/14 ] |
| Comment by Di Wang [ 15/Oct/14 ] |
|
Just checked the debug log; the sequence appears to be:

1. MDT0/MGS restarts.
2. MDT1 restarts and tries to reuse the MGC (note: the MGC is shared by MDT1/MDT2/MDT3), which has been evicted by the MGS (because of 1). So MDT1 cannot get the config log through the MGC, and it falls back to the local config log.

00000100:02000400:0.0:1411404071.644549:0:3397:0:(import.c:950:ptlrpc_connect_interpret()) Evicted from MGS (at 10.1.4.222@tcp) after server handle changed from 0x82e1444051c34df2 to 0x82e1444051c3568f
......
10000000:01000000:1.0:1411404071.645599:0:11810:0:(mgc_request.c:1866:mgc_process_log()) Can't get cfg lock: -108
10000000:01000000:1.0:1411404071.645900:0:11810:0:(mgc_request.c:1777:mgc_process_cfg_log()) Failed to get MGS log lustre-MDT0001, using local copy for now, will try to update later.

3. Unfortunately, the local config log is stale and does not yet include the OSTs, which causes the failure. |
| Comment by Di Wang [ 15/Oct/14 ] |
|
Hmm, we should probably retry the config log fetch via the MGC instead of failing it in step 2. So the fix for |
| Comment by Di Wang [ 15/Oct/14 ] |
|
Duplicate with |
| Comment by Andreas Dilger [ 15/Oct/14 ] |
|
Di, can you please look into why this patch is failing conf-sanity? That is preventing it from landing: |
| Comment by Di Wang [ 15/Oct/14 ] |
|
Andreas, I just updated the patch; please take a look. |
| Comment by Jinshan Xiong (Inactive) [ 16/Oct/14 ] |
|
new occurrence: https://testing.hpdd.intel.com/test_sets/21a318ac-54f0-11e4-92b6-5254006e85c2 |
| Comment by nasf (Inactive) [ 29/Oct/14 ] |
|
Another failure instance: |