Web Data Mining: Not Just Data Mining on the Web

Some words and phrases have obvious meanings, but many others don’t. A “carpet” is not a pet that lives in a car. And cars, by the way, drive on parkways and park on driveways. Languages are messy (some more than others), but perhaps the messiest of them all is the language of IT.

The fuzziness of IT definitions creates, most notably, great confusion among software buyers. At Software Advice, we’re reminded of this fact many times a day, as we help buyers parse vendor definitions into concrete and accurate software selection options.

Software for business intelligence (BI) is relatively new to the small and midsize business (SMB) space, and, perhaps because of its newness, it relies on a fair amount of sometimes ambiguous terminology.

In this report, we clear up the confusion around two common BI terms: web data mining and data mining. Many readers look at these terms and assume they’re related, but they’re about as related as a car and a pet are to a carpet.

What Is Web Data Mining Software?

There’s a great deal of data online, but most of it is hidden from view. (If you haven’t done this before, you can right-click on a webpage and select “View Page Source” to see some of the information at work behind the scenes.)

While most of the data you’ll find there doesn’t have any competitive value, some of it does. The challenge is finding and collecting the valuable data. Web data mining software is one of many BI tools used to overcome this challenge. Web data mining tools go by a variety of names: spiders, scrapers, crawlers and data extraction tools are among the most common. Most importantly:

Web data mining software is used for the collection of data. On its own, it doesn’t determine which data should be collected, where to collect it from or what it all means.

Those more in-depth analytical functionalities are typically found in larger BI software platforms.
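
To make the idea of "collection" concrete, here is a minimal Python sketch of pulling specific values out of page source. It is an illustration only, not how any particular product works. The `PAGE_SOURCE` markup, the `price` class name and the `PriceCollector` class are all hypothetical, and the sketch uses only Python's standard `html.parser` module.

```python
from html.parser import HTMLParser


class PriceCollector(HTMLParser):
    """Collect the text of every element whose class attribute is 'price'."""

    def __init__(self):
        super().__init__()
        self._in_price = False
        self.prices = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the opening tag
        if dict(attrs).get("class") == "price":
            self._in_price = True

    def handle_data(self, data):
        # Keep only the text that immediately follows a matching opening tag
        if self._in_price:
            self.prices.append(data.strip())
            self._in_price = False


# Hypothetical page source, standing in for what "View Page Source" shows
PAGE_SOURCE = """
<html><body>
  <div class="product"><h2>Widget</h2><span class="price">$19.99</span></div>
  <div class="product"><h2>Gadget</h2><span class="price">$34.50</span></div>
</body></html>
"""

collector = PriceCollector()
collector.feed(PAGE_SOURCE)
print(collector.prices)  # ['$19.99', '$34.50']
```

Note that the code only gathers values. A person still had to decide that prices were worth collecting and where on the page to find them.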

Common Functionality of Web Data Mining Software

Web data mining software automates the process of collecting online information. These tools vary by vendor and often go by different names, but they share basic underlying functionality. Common functions include:

Scraping agents. Also known as “crawlers,” these instruction sets determine which websites to crawl, what information to extract, what to do with the information and how often it should be collected. Web data mining applications typically allow users to create and save many different crawlers, each tailored to a specific type of collection.

Scheduling. To get the most recent data, web scrapers must visit sites frequently. Scrapers can monitor and download information whenever they detect updates or new content. Alternatively, scrapers can mine data at set intervals, for example, once a day, once a month or at the start of every quarter.

Data handling. Web data mining produces a lot of information, and companies need to consider how they’ll manage and store it. Some web data mining solutions include data handling functions that automatically organize collected data and store it wherever the tool is configured to, for example, on a local server or in the cloud.

Screenshot scraping. Sometimes a single screenshot is worth a thousand lines of mined data. In these cases, web data mining software with screenshot functionality can save the day. It finds, creates and saves screenshots of select web pages. This is especially helpful when comparing design, layout and product placement on competing sites.
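
The functions above can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not a real product's design. The `ScrapingAgent` class, its field names, the example URL and the output file name are all hypothetical. Downloading the page is left out, so the sketch runs without a network connection.

```python
import json
import re
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class ScrapingAgent:
    """A saved instruction set: where to look, what to extract, how often, where to store."""
    name: str
    url: str                 # which site to crawl (hypothetical)
    pattern: str             # what to extract, written as a regular expression
    interval: timedelta      # how often to collect
    output_path: str         # where collected records would be stored
    last_run: Optional[datetime] = None

    def is_due(self, now: datetime) -> bool:
        # An agent that has never run is always due
        return self.last_run is None or now - self.last_run >= self.interval

    def extract(self, page_source: str) -> list:
        # Return every match of the pattern in the downloaded page
        return re.findall(self.pattern, page_source)

    def to_record(self, values: list, now: datetime) -> str:
        # One JSON line per collection run, ready to append to output_path
        return json.dumps(
            {"agent": self.name, "collected_at": now.isoformat(), "values": values}
        )


agent = ScrapingAgent(
    name="competitor-prices",
    url="https://example.com/catalog",
    pattern=r"\$(\d+\.\d{2})",
    interval=timedelta(days=1),
    output_path="prices.jsonl",
)

now = datetime(2024, 1, 2, 9, 0)
if agent.is_due(now):
    # A stand-in for the page a real tool would download from agent.url
    page = '<span class="price">$19.99</span><span class="price">$34.50</span>'
    values = agent.extract(page)
    record = agent.to_record(values, now)
    agent.last_run = now
    print(record)
```

Keeping the schedule and the storage location inside the agent reflects the idea that each saved crawler is tailored to one type of collection. A real tool would add downloading, error handling and change detection on top of this.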

So What Is Data Mining Software?

Now that you have a better understanding of web data mining software, let’s talk about how it differs from data mining software. In our Buyer’s Guide, we define data mining software as follows:

“Data mining software allows users to apply semi-automated and predictive analyses to parse raw data and find new ways to look at information. For example, e-commerce companies use these applications to analyze visitor demographics and discover how to deliver a better customer experience.”

If you’re experienced with BI applications, then you’re probably already familiar with this definition of data mining. But what if, like many first-time BI software buyers, you’re not? You might assume you know what it means because you know what data is and you know what mining is. Putting the two together should give you an accurate understanding of the combined term, right?

Not necessarily. It depends on your mining experience, or lack thereof.

Let’s consider an analogy…

Pictured below is a gold mine in Australia, which looks like a very large hole in the ground. Now here’s the crux of the issue: What was removed to make this giant hole? Many, many, many tons of dirt. That dirt was later processed and the gold separated out, but the mining activity refers to the collection of the dirt that contains the gold, not the collection of the gold itself.

[Image: gold mine in Australia]

This distinction is important. Refer again to the definition of data mining above. Data mining is the process of extracting the valuable information from your existing data. In other words, in the parlance of business intelligence, you’ve already mined the many tons of dirt, and you use data mining software to process it into a few small bits of gold.

On the surface, based on the language alone, most people expect web data mining software to function similarly to plain data mining software. But as we’ve shown above, these two tools serve very different purposes.
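
For contrast with collection, here is a minimal Python sketch of the "processing" side. It starts from data that has already been gathered and looks for something useful in it. The visitor records and the `conversion_by_group` function are hypothetical, and a simple per-group rate stands in for the far more sophisticated, often predictive, analyses that real data mining software applies.

```python
from collections import defaultdict

# Hypothetical, already-collected visitor records (the "dirt")
visits = [
    {"age_group": "18-24", "purchased": False},
    {"age_group": "18-24", "purchased": True},
    {"age_group": "25-34", "purchased": True},
    {"age_group": "25-34", "purchased": True},
    {"age_group": "35-44", "purchased": False},
    {"age_group": "35-44", "purchased": False},
]


def conversion_by_group(records):
    """Return the share of visits that ended in a purchase, per age group."""
    totals = defaultdict(int)
    purchases = defaultdict(int)
    for record in records:
        totals[record["age_group"]] += 1
        purchases[record["age_group"]] += int(record["purchased"])
    return {group: purchases[group] / totals[group] for group in totals}


rates = conversion_by_group(visits)
best_group = max(rates, key=rates.get)
print(rates)       # {'18-24': 0.5, '25-34': 1.0, '35-44': 0.0}
print(best_group)  # 25-34
```

Nothing here touches the web. The code works only on records that already exist, which is the distinction the gold mine analogy draws.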


Web data mining and data mining are two BI applications that differ far more than their similar names suggest. While “web data mining” refers to the collection of large amounts of data from the internet, “data mining” refers to the extraction of valuable insight from large datasets.

Note: Laws regarding the use of web data mining software vary by locality, intent and degree. Companies should seek legal counsel prior to engaging in web data mining to minimize the possibility of legal action.

Mine image by Brian Voon Yee Yap used under CC BY-SA 3.0.

