Data Migration and Import Examples

May 21, 2008

Tutorials

This article is intended as an accompaniment to the podcast of the same name, it is a simple walkthrough of some examples of the techniques mentioned in the podcast. It is much easier to visualise queries and data structures if written down, so here we look at an example of each of the points mentioned. I’ve taken a classic example of an employees table which needs to be populated with the clients existing data, which arrives in a different format. Its a common scenario and the ideas illustrated here are also applicable to many other situations and types of data.

(Data generated by http://www.generatedata.com)

Here’s our incoming data structure:

+-----------+-----------------------+------+-----+---------+----------------+
| Field     | Type                  | Null | Key | Default | Extra          |
+-----------+-----------------------+------+-----+---------+----------------+
| id        | mediumint(8) unsigned | NO   | PRI | NULL    | auto_increment |
| Name      | varchar(255)          | YES  |     | NULL    |                |
| Dept      | varchar(255)          | YES  |     | NULL    |                |
| Join_Date | varchar(50)           | YES  |     | NULL    |                |
| Email     | varchar(255)          | YES  |     | NULL    |                |
| Comments  | text                  | YES  |     | NULL    |                |
+-----------+-----------------------+------+-----+---------+----------------+

The target system looks like this:

+---------------+--------------+------+-----+---------+----------------+
| Field         | Type         | Null | Key | Default | Extra          |
+---------------+--------------+------+-----+---------+----------------+
| employee_id   | int(11)      | NO   | PRI | NULL    | auto_increment |
| first_name    | varchar(50)  | YES  |     | NULL    |                |
| last_name     | varchar(50)  | YES  |     | NULL    |                |
| department_id | int(11)      | YES  |     | NULL    |                |
| join_date     | datetime     | YES  |     | NULL    |                |
| email         | varchar(255) | YES  |     | NULL    |                |
| comments      | text         | YES  |     | NULL    |                |
+---------------+--------------+------+-----+---------+----------------+

+---------------+--------------+------+-----+---------+----------------+
| Field         | Type         | Null | Key | Default | Extra          |
+---------------+--------------+------+-----+---------+----------------+
| department_id | int(11)      | NO   | PRI | NULL    | auto_increment |
| name          | varchar(255) | YES  |     | NULL    |                |
+---------------+--------------+------+-----+---------+----------------+

Taking each of the incoming fields in turn, we need to:

  • id becomes employee_id and the datatype changes slightly
  • Name gets split up into first_name and last_name
  • Dept needs to get pulled out into a lookup table and the the department_id pushed back into this table
  • Join_date needs its string format parsing and turning into a lookup
  • Email and comments columns don’t need changing

Converting Data Type – Employee ID

This is pretty simple, just add a new column onto the Employees table and then populate it.

alter table Employees add column employee_id int;
update Employees set employee_id = id;

Splitting Strings – Name Field

Happily the generated data that I used has only first name and then surname, reliably separated by a space. If your incoming data is in a clean format like this then you are lucky – it can be much more complicated especially if a full name contains more than two words, deciding which name the middle word belongs to can be a nightmare. To get the name separated is awkward as mysql doesn’t have split() but I started off with this select statement:

select Name ,
substring_index(Name,' ',1) as first_name,
substring_index(substring_index(Name,' ',2),' ',-1) as last_name
from Employees
limit 10;

+--------------------+------------+-------------+
| Name               | first_name | last_name   |
+--------------------+------------+-------------+
| Anika Lott         | Anika      | Lott        |
| Jenette Gibson     | Jenette    | Gibson      |
| Lacy Barton        | Lacy       | Barton      |
| Noelani Peck       | Noelani    | Peck        |
| Lacy Prince        | Lacy       | Prince      |
| Glenna Rowland     | Glenna     | Rowland     |
| Isabella Hall      | Isabella   | Hall        |
| Hadley Fitzpatrick | Hadley     | Fitzpatrick |
| Elaine Steele      | Elaine     | Steele      |
| Kathleen Davidson  | Kathleen   | Davidson    |
+--------------------+------------+-------------+

The abbreviated output saves me waiting for a lot of records to be processed when really I just need a small sample to check my formatting is correct. This looks spot on so the select statement can be adapted to make an update statement after we’ve created our new columns, as shown below:

alter table Employees add column first_name varchar(50);
alter table Employees add column last_name varchar(50);

update Employees set
first_name = substring_index(Name,' ',1),
last_name = substring_index(substring_index(Name,' ',2),' ',-1)

String Dates to Real Dates

We’ve already got a column called join_date in our Employees table so the new one needs a new name:

alter table Employees add column join_date_formatted datetime;

Again a select statement first, and then converting to an update statement:

select join_date ,
str_to_date(join_date, '%m-%d-%y') as machine_date
from Employees
limit 10;

update Employees set join_date_formatted = str_to_date(join_date, '%m-%d-%y');

MySQL will default the time element of the timestamp to midnight on the relevant day.

Enum (or almost-enum) data becomes Lookup Table

First of all, let’s get a grip on what values are actually in this table to start with.

select distinct Dept from Employees;

+--------------------------+
| Dept                     |
+--------------------------+
| Human Resources          |
| Research and Development |
| Accounts                 |
| Finances                 |
| Quality Assurance        |
| Accounting               |
| Tech Support             |
| LegalPayroll             |
| QA                       |
| Sales and Marketing      |
| Customer Service         |
| Advertising              |
+--------------------------+

Looking at this table we’ve got a few things gone awry – “LegalPayroll” should be “Legal and Payroll”, and we’ve got “Accounts” and “Accounting” plus “Quality Assurance” and “QA”, but these are actually the same department. With such a small dataset it would be as well to manually create the lookup table we need, and I’m going to use numeric indexes for mine.

+---------------+--------------------------+
| department_id | name                     |
+---------------+--------------------------+
|             1 | Human Resources          |
|             2 | Research and Development |
|             3 | Accounting               |
|             4 | Finances                 |
|             5 | Quality Assurance        |
|             6 | Tech Support             |
|             7 | Legal and Payroll        |
|             8 | Sales and Marketing      |
|             9 | Customer Service         |
|            10 | Advertising              |
+---------------+--------------------------+

Next we create the new column on the old Employees table and then populate it with the “easy” entries.

alter table Employees add column department_id int;
update Employees set department_id = (select department.department_id from department where Employees.Dept = department.name); 

Looking at the records that didn’t get updated, we still have a few things to fix:

select Name, Dept from Employees where department_id is null;
+--------------------+--------------+
| Name               | Dept         |
+--------------------+--------------+
| Lacy Barton        | Accounts     |
| Isabella Hall      | Accounts     |
| Kathleen Davidson  | Accounts     |
| Chiquita Rodriguez | LegalPayroll |
| Emi Ochoa          | LegalPayroll |
| Wanda Sullivan     | QA           |
| Yen Wilder         | QA           |
| Echo Hopkins       | LegalPayroll |
| Germane Stout      | Accounts     |
| Evangeline Oconnor | Accounts     |
| Hadley Ochoa       | QA           |
| Carly Riley        | LegalPayroll |
| Destiny Brock      | QA           |
| Mariam Baldwin     | Accounts     |
+--------------------+--------------+

update Employees set department_id = 3 where Dept = "Accounts";
update Employees set department_id = 5 where Dept = "QA";
update Employees set department_id = 7 where Dept = "LegalPayroll";

The Final Step

Finally, we populate the main employee table with all our newly-prepared and reformatted data. Here’s the query:

insert into employee select id, first_name, last_name, department_id, join_date_formatted, Email, Comments from Employees;

About Cal Evans

Many moons ago, at the tender age of 14, Cal touched his first computer. (We're using the term "computer" loosely here, it was a TRS-80 Model 1) Since then his life has never been the same. He graduated from TRS-80s to Commodores and eventually to IBM PC's. For the past 10 years Cal has worked with PHP and MySQL on Linux OSX, and when necessary, Windows. He has built on a variety of projects ranging in size from simple web pages to multi-million dollar web applications. When not banging his head on his monitor, attempting a blood sacrifice to get a particular piece of code working, he enjoys building and managing development teams using his widely imitated but never patented management style of "management by wandering around". Cal is currently based in Nashville, TN and is gainfully unemployed as the Chief Marketing Officer of Blue Parabola, LLC. Cal is happily married to wife 1.28, the lovely and talented Kathy. Together they have 2 kids who were both bright enough not to pursue a career in IT. Cal blogs at http://blog.calevans.com and is the founder and host of Day Camp 4 Developers

View all posts by Cal Evans

3 Responses to “Data Migration and Import Examples”

  1. inkyts Says:

    Thank you.

  2. matt_bk Says:

    good example – thank You.

  3. bugmenot5 Says:

    thanks for the great tutorial