[LU-5298] The lwp device cannot be started when we migrate from Lustre 2.1 to Lustre 2.4 Created: 07/Jul/14 Updated: 01/Jul/16 Resolved: 01/Jul/16 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.4.3 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Minor |
| Reporter: | Bruno Travouillon (Inactive) | Assignee: | Niu Yawei (Inactive) |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None |
| Environment: |
RHEL6 w/ patched kernel for Lustre server |
||
| Issue Links: |
|
| Epic/Theme: | Quota |
| Severity: | 3 |
| Rank (Obsolete): | 14784 |
| Description |
|
We have an issue with the quotas on our filesystems after the upgrade from Lustre 2.1.6 to Lustre 2.4.3. The quotas have been successfully enabled on all target devices using lctl conf_param.

Check of the quota_slave.info on the MDT:

# lctl get_param osd-*.*.quota_slave.info
osd-ldiskfs.scratch-MDT0000.quota_slave.info=
target name:    scratch-MDT0000
pool ID:        0
type:           md
quota enabled:  ug
conn to master: not setup yet
space acct:     ug
user uptodate:  glb[0],slv[0],reint[1]
group uptodate: glb[0],slv[0],reint[1]

We can see that the connection to the QMT is not set up yet.

By looking at the code, it seems that the lwp device cannot be started when we migrate from Lustre 2.1 to Lustre 2.4. In lustre/obdclass/obd_mount_server.c:

/**
 * Retrieve MDT nids from the client log, then start the lwp device.
 * there are only two scenarios which would include mdt nid.
 * 1.
 * marker 5 (flags=0x01, v2.1.54.0) lustre-MDT0000 'add mdc' xxx-
 * add_uuid nid=192.168.122.162@tcp(0x20000c0a87aa2) 0: 1:192.168.122.162@tcp
 * attach 0:lustre-MDT0000-mdc 1:mdc 2:lustre-clilmv_UUID
 * setup 0:lustre-MDT0000-mdc 1:lustre-MDT0000_UUID 2:192.168.122.162@tcp
 * add_uuid nid=192.168.172.1@tcp(0x20000c0a8ac01) 0: 1:192.168.172.1@tcp
 * add_conn 0:lustre-MDT0000-mdc 1:192.168.172.1@tcp
 * modify_mdc_tgts add 0:lustre-clilmv 1:lustre-MDT0000_UUID xxxx
 * marker 5 (flags=0x02, v2.1.54.0) lustre-MDT0000 'add mdc' xxxx-
 * 2.
 * marker 7 (flags=0x01, v2.1.54.0) lustre-MDT0000 'add failnid' xxxx-
 * add_uuid nid=192.168.122.2@tcp(0x20000c0a87a02) 0: 1:192.168.122.2@tcp
 * add_conn 0:lustre-MDT0000-mdc 1:192.168.122.2@tcp
 * marker 7 (flags=0x02, v2.1.54.0) lustre-MDT0000 'add failnid' xxxx-
 **/
static int client_lwp_config_process(const struct lu_env *env,
                                     struct llog_handle *handle,
                                     struct llog_rec_hdr *rec, void *data)
{
[...]
	/* Don't try to connect old MDT server without LWP support,
	 * otherwise, the old MDT could regard this LWP client as
	 * a normal client and save the export on disk for recovery.
	 *
	 * This usually happen when rolling upgrade. LU-3929 */
	if (marker->cm_vers < OBD_OCD_VERSION(2, 3, 60, 0))
		GOTO(out, rc = 0);

The function checks the MDT server version recorded in the llog. I checked the client llog on the MGS of the scratch filesystem:

#09 (224)marker 5 (flags=0x01, v2.1.6.0) scratch-MDT0000 'add mdc' Sat Jul 5 14:40:44 2014-
#10 (088)add_uuid nid=192.168.122.41@tcp(0x20000c0a87a29) 0: 1:192.168.122.41@tcp
#11 (128)attach 0:scratch-MDT0000-mdc 1:mdc 2:scratch-clilmv_UUID
#12 (144)setup 0:scratch-MDT0000-mdc 1:scratch-MDT0000_UUID 2:192.168.122.41@tcp
#13 (168)modify_mdc_tgts add 0:scratch-clilmv 1:scratch-MDT0000_UUID 2:0 3:1 4:scratch-MDT0000-mdc_UUID
#14 (224)marker 5 (flags=0x02, v2.1.6.0) scratch-MDT0000 'add mdc' Sat Jul 5 14:40:44 2014-

The marker version is 2.1.6.0, which is below OBD_OCD_VERSION(2, 3, 60, 0), so the lwp device is never started.

After a writeconf on the filesystem, the llog has been updated and the device scratch-MDT0000-mdc is now registered with version 2.4.3.0.
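To make the failure mode concrete, here is a small shell sketch of the version comparison (the real OBD_OCD_VERSION macro lives in the Lustre headers and, to my understanding, packs one byte per version field; the arithmetic below is an illustration of that layout, not the actual source):

```shell
# Sketch of OBD_OCD_VERSION packing: one byte each for
# major.minor.patch.fix (assumed layout, mirroring lustre_idl.h).
obd_ocd_version() {
    echo $(( ($1 << 24) + ($2 << 16) + ($3 << 8) + $4 ))
}

threshold=$(obd_ocd_version 2 3 60 0)   # cutoff used by client_lwp_config_process()
old_marker=$(obd_ocd_version 2 1 6 0)   # marker version in our pre-writeconf llog
new_marker=$(obd_ocd_version 2 4 3 0)   # marker version after writeconf

if [ "$old_marker" -lt "$threshold" ]; then
    echo "v2.1.6.0 marker fails the check: lwp device is not started"
fi
if [ "$new_marker" -ge "$threshold" ]; then
    echo "v2.4.3.0 marker passes the check: lwp device is started"
fi
```

This is why a writeconf "fixes" the problem: it rewrites the markers with the version of the currently running server.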
The llog after the writeconf:

#09 (224)marker 6 (flags=0x01, v2.4.3.0) scratch-MDT0000 'add mdc' Sat Jul 5 15:19:27 2014-
#10 (088)add_uuid nid=192.168.122.41@tcp(0x20000c0a87a29) 0: 1:192.168.122.41@tcp
#11 (128)attach 0:scratch-MDT0000-mdc 1:mdc 2:scratch-clilmv_UUID
#12 (144)setup 0:scratch-MDT0000-mdc 1:scratch-MDT0000_UUID 2:192.168.122.41@tcp
#13 (168)modify_mdc_tgts add 0:scratch-clilmv 1:scratch-MDT0000_UUID 2:0 3:1 4:scratch-MDT0000-mdc_UUID
#14 (224)marker 6 (flags=0x02, v2.4.3.0) scratch-MDT0000 'add mdc' Sat Jul 5 15:19:27 2014-

Check of the quota_slave.info:

# lctl get_param osd-*.*.quota_slave.info
osd-ldiskfs.scratch-MDT0000.quota_slave.info=
target name:    scratch-MDT0000
pool ID:        0
type:           md
quota enabled:  none
conn to master: setup
space acct:     ug
user uptodate:  glb[0],slv[0],reint[0]
group uptodate: glb[0],slv[0],reint[0]

The connection to the master is now set up, but the writeconf has reset the quota settings to "none", so quota enforcement must be enabled again:

# lctl conf_param scratch.quota.mdt=ug
# lctl conf_param scratch.quota.ost=ug
# lctl get_param osd-*.*.quota_slave.info
osd-ldiskfs.scratch-MDT0000.quota_slave.info=
target name:    scratch-MDT0000
pool ID:        0
type:           md
quota enabled:  ug
conn to master: setup
space acct:     ug
user uptodate:  glb[1],slv[1],reint[0]
group uptodate: glb[1],slv[1],reint[0]

The same behavior is observed on the OSTs. It would be better to:
I think this issue can be related to |
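For reference, the workaround applied above (a writeconf, followed by re-enabling the quota enforcement that the writeconf erased) can be sketched as the following admin procedure. The device paths are placeholders and the ordering follows standard Lustre practice; these are not commands copied from this ticket:

```shell
# Hypothetical recovery sketch for the "scratch" filesystem; device
# paths are placeholders. Unmount all clients and targets first.

# 1. Regenerate the configuration llogs (run on each server for its targets):
tunefs.lustre --writeconf /dev/mapper/mdt_dev    # on the MDS
tunefs.lustre --writeconf /dev/mapper/ost_dev    # on each OSS, for every OST

# 2. Remount the targets (MGS/MDT first, then the OSTs), then re-enable
#    quota enforcement, which the writeconf has reset to "none":
lctl conf_param scratch.quota.mdt=ug
lctl conf_param scratch.quota.ost=ug

# 3. Verify that each quota slave has connected to the quota master:
lctl get_param osd-*.*.quota_slave.info
```

The writeconf requires a full filesystem outage, which is why simply rejecting the stale version check, as discussed in the comments below, would be preferable.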
| Comments |
| Comment by Johann Lombardi (Inactive) [ 07/Jul/14 ] |
|
For reference, the issue was introduced by |
| Comment by John Fuchs-Chesney (Inactive) [ 07/Jul/14 ] |
|
Niu, |
| Comment by Niu Yawei (Inactive) [ 08/Jul/14 ] |
|
We did some upgrade tests, which show that the config log won't be converted automatically, which means the fix of |
| Comment by Jian Yu [ 08/Jul/14 ] |
Hi Johann, we did not perform writeconf in the rolling upgrade tests. Unfortunately, quotas were not tested in the rolling upgrade tests, which is why the rolling upgrade from 2.1.6 to 2.4.2 passed: https://testing.hpdd.intel.com/test_sets/12d033c6-6bd7-11e3-a73e-52540035b04c (the test output was lost due to the Maloo cutover). We tested quotas in the clean upgrade tests (conf-sanity test 32*); however, writeconf was performed in those tests. So, while fixing this bug, we also need to improve our tests. |
| Comment by Johann Lombardi (Inactive) [ 08/Jul/14 ] |
|
Niu, yes, I think we should revert the patch. A workaround for the initial problem would be to check whether OBD_CONNECT_LIGHTWEIGHT is set back in the connect reply and disconnect if it is not. Yujian, thanks for your reply. |
| Comment by Niu Yawei (Inactive) [ 08/Jul/14 ] |
conf-sanity test 32 requires a writeconf to generate correct NIDs for the devices, so we probably need to verify quota in manual test cases?
So there will be a window that could leave a stub in the last_rcvd file. Actually, I think denying the LWP connection on an old server is a simple and robust solution (we have the patch in http://review.whamcloud.com/#/c/8086/); the drawback is that the customer has to upgrade the MDS to a newer 2.1 release before rolling upgrade to 2.4. Is that acceptable? |
| Comment by Johann Lombardi (Inactive) [ 08/Jul/14 ] |
Right, but the window should be pretty small and the side effect isn't that serious (i.e. one has to wait for the recovery timer to expire). The benefit is that it works with any version < 2.4.0.
I think it is difficult to impose this constraint now that 2.4.0 was released a while ago. Actually, the two "solutions" are not incompatible and we can do both. |
| Comment by Bruno Travouillon (Inactive) [ 24/Oct/14 ] |
|
Revert " Thanks. |
| Comment by Niu Yawei (Inactive) [ 01/Jul/16 ] |
|
The patch that led to this problem has been reverted. |