Techniques for Data Cleaning and Integration in Excel

By: Daniel Harris on March 30, 2017

So you’ve recently discovered that your job title should actually be “data janitor.” Welcome to the exciting world of digital business!

I’ve spent years scrubbing various types of data in Excel, from market research surveys to exports from CRM and web analytics platforms.

In this article, I’ll share some tricks I’ve learned along the way, including some tips for manipulating text strings from the veteran Excel jockeys on our advertising team. We’ll also take a very brief look at how data discovery tools can speed up data cleaning.

Here’s what we’ll cover:


De-Duping Across Columns with EXACT

Integrating Spreadsheets and Coding Data with VLOOKUP

Trimming and Cleaning Text

The TRIM Function

PROPER/LOWER

Combining Text From Multiple Cells With the & Operator

Text to Columns for Trimming URLs

Extracting Text

Text to Columns for Extracting Part of a URL

‘Nth’ Word Extraction

Data Cleaning Tools That Are Quicker Than Excel

De-Duping Across Columns with EXACT

It seems obvious to use the remove duplicates feature in Excel to de-dupe your data:

microsoft-excel-de-duping

Used with permission from Microsoft

The problem is that duplicate values often occur in different columns.

For example, we record URLs for the pages visitors land on, as well as the pages they convert on. If the visitor converts on the same page as they land on, then this URL value is the same. In such a case, de-duping has to be done with formulas.

One of the most useful formulas for this purpose is EXACT. The syntax is incredibly simple:

=EXACT(cellref1,cellref2)

Example: =EXACT(A1,B1)

The catch is that EXACT only works with two cell references. To test a range of cells with EXACT, you’ll need to use an AND operator:

=IF(AND(EXACT(cellref1,cellref1:cellref2),”OK”,”ALERT!”)

Example: =IF(AND(EXACT(AI2,AI2:AK2)),”OK”,”ALERT!”)

Note: You’ll also need to enter the formula by holding down the CTRL and SHIFT keys while hitting the ENTER key. If you do this correctly, you’ll see brackets around the formula in the formula window (it doesn’t work if you simply type the brackets before typing the formula).

exact-formula-results

Results from using the EXACT formula

Integrating Spreadsheets and Coding Data with VLOOKUP

If you’ve worked in a marketing department long enough, someone’s probably cracked that lame joke about creating “the one spreadsheet to rule them all”:

3d-model-of-lotr-ring

3D Model of the One Ring by Xander

Typically, what this means is that you’re going to have to mush a lot of spreadsheets together into one. I like to call this “spreadsheet data integration”—that is, if data in spreadsheets can ever really be considered integrated.

This can be a tediously manual process if you’re copying and pasting values. Thankfully, lookup tables provide an alternative.

Typically, you won’t be using lookup tables to integrate a few small spreadsheets. Instead, they’re useful when you’re trying to integrate spreadsheets with tens of thousands of rows.

If you’re not familiar with lookup tables, they essentially allow you to look down an entire column to match the value in a given cell. For instance, you can look down column “A” to see if there’s any cell matching the text in cell B2 in another spreadsheet tab.

You can then retrieve values in cells adjacent to the matching cell in A. For example, say that you’re looking to match a URL in cell B2 with a URL in column A:

retrieve-values-from-matching-cells

Once the VLOOKUP formula has found the URL in column A, you can fill in values from columns B, C and D.

VLOOKUP syntax is a bit complex. A typical formula looks like this:

=(f2,’Lookup tab’!$A$2:$D$11910,4,FALSE)

In this example:

  • F2 is the cell being tested

  • $A$2 is the top left corner of the lookup table

  • $D$11910 is the bottom right corner of the table

  • 4 means that the formula will return values from the fourth column from the left, column D in this example

So, the formula looks in column A for the URL, and once it finds the URL, it returns the appropriate “market” classification.

VLOOKUP tables are a great way to quickly integrate new data into existing classification schemes.

Learn more about this time-saving Excel data cleaning technique with this tutorial:

Trimming and Cleaning Text

Mario Mendez is senior advertising programs specialist here at Software Advice. In analyzing our paid search efforts, he deals with quite a bit of text, and has some handy tricks to share for cleaning up text in Excel.

The TRIM Function

Mendez explains that “if for some reason you accidentally put an extra space in a cell, you can use the trim function to remove it”:

=TRIM(cellref1)

Example: =TRIM(A1)

trim-function

PROPER/LOWER

Another neat trick Mendez highlights is the use of functions to clean up capitalization.

The PROPER function applies title case capitalization to text strings in sentence case:

=PROPER(cellref1)

Example: =PROPER(A1)

proper-function

The LOWER function changes everything to lowercase:

=LOWER(cellref1)

Example: =LOWER(a1)

lower-function

Combining Text From Multiple Cells With the & Operator

You’re probably familiar with what it means to merge cells. Essentially, this function combines two adjacent cells into one cell. If you merge B1 with A1, then the text in B1 is added to cell A1, and B1 loses any text in it.

There are two problems with merging:

  1. You might not want to lose the text in B1

  2. Merging can make your columns messy, if you’re not careful

Mendez shares a workaround: the & operator.

=cellref1&cellref2

Example: =A1&” “B1

Note: The quotes with the space inside are there to add a space between the text values from the merged cells; otherwise, they’ll run together as you can see in cell “C1” below.

excel-combine-text-from-multiple-cells

Text to Columns for Trimming URLs

Another common data cleaning issue in digital businesses is trimming URLs or extracting information from them.

For instance, your site may serve different versions of the same page, which can lead to confusion when integrating data for web analytics.

Take the following example:

https://www.softwareadvice.com/crm/

https://www.softwareadvice.com/crm/?layout=var\_so0

https://www.softwareadvice.com/crm/?layout=var\_so1

These are all versions of the same URL. To make them identical, which is necessary for data integration, we need to remove the trailing “layout” part. This can be done with text to columns:

remove-trailing-url

Typically, you’ll choose the “delimited” option in order to separate text on a character:

separate-text-using-delimited-option

In this case, you’ll then choose to separate on “?” to remove the trailing part of the URL:

remove-trailing-url-to-standardize

The good stuff is now in one column and the garbage is in another:

separated-trailing-url-into-separate-column

Extracting Text

Text extraction is a bit more advanced than simply trimming and cleaning strings. In text extraction, you’re trying to separate specific parts of a string from the rest of the string. Here are two fairly different techniques for doing this, one of which leverages text to columns.

Text to Columns for Extracting Part of a URL

Sometimes, you need to extract part of a URL. One trick for doing this was developed by Jennifer Sakaida, our advertising programs manager, for a large-scale web scraping project. In it, we dumped tons of web page source code into Excel to extract significant information.

The technique involves replacing parts of text strings with special characters. The carrot (“^”) is a good one to use for this purpose, since it’s rarely found in HTML.

For instance, say we want to extract product names from a long list of URLs such as the following:

https://www.softwareadvice.com/crm/salesforce-profile/

https://www.softwareadvice.com/crm/sugarcrm-profile/

https://www.softwareadvice.com/crm/infusionsoft-crm-profile/

We need to look at what all these URLs have in common. It’s clear that the part we want, the brand names, comes right after “/CRM/” in each URL. Thus we can replace “/CRM/” with “^” using find/replace to get the following list:

http://www.softwareadvice.com^salesforce-profile/

http://www.softwareadvice.com^sugarcrm-profile/

http://www.softwareadvice.com^infusionsoft-crm-profile/

We can also see that the part after the brand name is the same in all URLs, so we use find/replace to substitute “^” for “-profile/” in order to get the following list:

http://www.softwareadvice.com^salesforce^

http://www.softwareadvice.com^sugarcrm^

http://www.softwareadvice.com^infusionsoft^

At this point, we can use text to columns on “^” to get just the brand names:

convert-text-to-columns

‘Nth’ Word Extraction

The fanciest trick Mendez shares is for extracting the Nth word from a text string.

This is a long formula:

=TRIM(MID(SUBSTITUTE(cellref1,” “,REPT(” “,LEN(cellref1))),(cellref2)*LEN(cellref1)+1,LEN(cellref1)))

Cellref2 is the part of the formula that controls which word in the string you’re extracting. You paste the text string in cellref1. Then, in cellref2, enter the number of the word you want to extract (0 for the first word, 1 for the second, 2 for the third etc.)

Example: =TRIM(MID(SUBSTITUTE(A1,” “,REPT(” “,LEN(A1))),(B1)*LEN(A1)+1,LEN(A1)))

In the following example, you can see how the numbers in column B control which word in the string gets extracted:

control-words-in-the-string

Mendez explains that he uses the formula “to see if there’s a common theme” in text strings he’s analyzing. This is a particularly useful formula for preparing text for analysis in digital marketing and advertising contexts, since keyword-driven approaches to search engine optimization (SEO) and search engine marketing (SEM) remain standard.

Data Cleaning Tools That Are Quicker Than Excel

If you’re spending a good chunk of your workday on data scrubbing tasks, it may be time to consider tools other than Excel. There’s a whole class of software, known as self-service data preparation tools, for speeding up the tedious work of data cleaning and integration.

Such tools offer advanced capabilities such as “fuzzy matching” (matching fields that are similar but not identical by leveraging machine-learning technology) and text extraction from difficult formats such as PDF. We survey their capabilities in our report on data blending.

Additionally, take a look at our self-service data preparation guide, which gives an overview of how these differ from traditional analytical tools.

If you have questions about self-service data preparation tools or if you’d like to share your own data cleaning techniques, you can reach me at danielharris@softwareadvice.com.