Creating A Web Scraper with Node.js
Learn how to extract information from any website easily!
Introduction
The term "Web Scraping" refers to the extraction of data from a website quickly and accurately. The web scraping software may directly access the World Wide Web using the Hypertext Transfer Protocol or a web browser. While web scraping can be done manually by a software user, the term typically refers to automated processes implemented using a bot or web crawler. It is a form of copying in which specific data is gathered and copied from the web, typically into a central local database or spreadsheet, for later retrieval or analysis.
Practical use case:
You're working at a company that asked you to make a list of companies offering a particular product or service. It could be things like:
Real Estate
Cryptocurrency
Latest news, etc.
Throughout this article, we'll be exploring how web scrapers work as well as the node packages that make it possible.
What we are building 🚀
Websites hold a lot of information, from the services a website offers to the videos and images embedded in it, and sometimes we want only a specific type of information from these websites. With this web scraper, we simply type in the website's URL, target a particular tag, class, or ID (found in the browser's developer tools) containing the elements we want, extract the data, and transfer it into a document so we can save the just-extracted information.
For review or practice, the code to build this web-scraper is provided at the end of this article.😁
Prerequisites
The following assumptions are made about the reader using this tutorial:
Basic knowledge of Javascript and Node.js
Knowledge of the command line/terminal
Finally, an open mind
Getting Started
Before we get started extracting data from websites, we need to install some node packages that would help us achieve our goal. To do this, head over to the Node Package Manager website and note the following packages as we would be using them throughout this project. The four important packages for this article are highlighted below:
Axios
Cheerio
Express
Nodemon
Let's Begin 👨‍💻
We would start by going to our command line/terminal and type in code that would create a new directory and initialize a new node project inside that directory:
mkdir web-scraper
cd web-scraper
npm init -y
The command above creates a new directory called "web-scraper", switches to it, and initializes a new Node.js project by choosing its default settings with the -y flag.
Next, we want to install the node packages we would be using. We can do so by running:
npm install express axios cheerio nodemon
This command would install the latest version of the packages highlighted.
You can read more about these packages here
Scripts📝
Moving forward, we need two scripts: the first is our index.js file containing our logic, and the second is a start script in our auto-generated package.json file.
Let's add the script to the package.json file in our directory (you can open this file in your code editor). In the package.json file, under "scripts", create a script called start by typing:
"start": "nodemon index.js"
This script tells the nodemon dependency we installed to watch our index.js file and restart it whenever the file changes.
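For reference, the scripts and dependencies sections of the package.json file might end up looking roughly like this (your name, version numbers, and dependency versions will differ):

```json
{
  "name": "web-scraper",
  "version": "1.0.0",
  "scripts": {
    "start": "nodemon index.js"
  },
  "dependencies": {
    "axios": "^1.0.0",
    "cheerio": "^1.0.0",
    "express": "^4.18.0",
    "nodemon": "^2.0.0"
  }
}
```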
Requiring Our Packages📦
In order to build this web scraper, we need to require the packages we installed. Inside our index.js file, we would start off by typing the code below:
const axios = require("axios");
const cheerio = require("cheerio");
const express = require("express");
This would give us access to these packages and their exports inside our script.
Creating Our Server💻
Next, we need to call express and save its return value to a variable; I'll name mine "app" by typing:
const app = express();
Express provides the listen() method, which starts a server on a given port number. We need to pick this port number ourselves so our server knows where to listen. We would listen on this port and log a confirmation message to the console to show that the server is up and running on it.
We would update our previous index.js file with the new code sample below:
const PORT = 8000;
const axios = require("axios");
const cheerio = require("cheerio");
const express = require("express");
const app = express();
app.listen(PORT, () => console.log(`Server is currently running on port ${PORT}`))
With the code above, we have included our PORT number (8000) and used the listen() method to start our server on that port.
We just need to run our index.js script from the command line/terminal by typing:
npm run start
This would run the start script from the package.json file in our directory, and if all things work accordingly, we should get the output below:
With this, we have set up a working server running on PORT 8000 successfully!
Scraping The Web 🧹
To start scraping websites, we are going to use the axios package. Axios works by taking a URL, sending an HTTP request to it, and returning the response data; in our case, that data is the HTML we want to work with.
We would do this by creating a 'url' variable containing the website we want to extract data from, and passing it to axios. As an example, we would be using the Punch Newspapers official website. Here's how:
const url = 'https://www.punchng.com'
axios(url)
Since the axios package is promise-based, we can chain methods onto it: the then() method handles the response once the promise resolves, and the catch() method handles errors by logging them to the console. We can do this by typing the code below:
const url = 'https://www.punchng.com'
axios(url)
.then(response => {
const html = response.data
console.log(html)
})
.catch(err => {
console.log(err)
})
The result of the code above would be the entire HTML of Punch Newspapers' home page. Here's a snapshot of the HTML:
This is great! But how do we start picking out specific elements like buttons, classes, etc. from our HTML? To achieve this, we need cheerio to load our HTML, which we can do using the load() method that comes with the cheerio package.
We would update our index.js file with the code below:
const url = 'https://www.punchng.com'
axios(url)
.then(response => {
const html = response.data
const $ = cheerio.load(html);
})
.catch(err => {
console.log(err)
})
The update made above to our index.js file uses cheerio to load our HTML data and store its value inside the '$' variable. With this, we can start to target any kind of content on the page.
Up next, We need to head over to our website, and inspect the page for the exact information we want. Here's what the website looks like:
All that's left is to inspect the page using the shortcut Ctrl + Shift + I. This would open the developer tools. Once you have them open, it should look similar to this:
Now, all we need to do is go through the website, bearing in mind the exact type of content we want to scrape. In this article, we would be scraping the headline of each news story as well as the URL for reading more about that story.
Here's what it looks like in my browser:
The image on the right shows the underlined headline that we would be inspecting using our Chrome dev tools on the left. As we can see from the code on the left, the headline "Suspended Accountant..." is wrapped inside an a tag, which is a child element of an h3 parent tag with a class value of entry-title. If you go ahead and inspect the website, you would notice that all the headlines carry that particular class. This is what we would target in our code.
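Based on that inspection, the markup for each headline looks roughly like the snippet below (simplified for illustration; the real page has more attributes and nesting, and the href shown here is made up):

```html
<h3 class="entry-title">
  <a href="https://punchng.com/example-story/">
    Suspended Accountant...
  </a>
</h3>
```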
Since we are looking for the headline and link of every story on the page, we simply need to target the entry-title class and loop over each item containing that class. This can be done by updating our index.js with the code below:
const url = 'https://www.punchng.com'
axios(url)
.then(response => {
const html = response.data
const $ = cheerio.load(html);
$(".entry-title", html).each(function() {
//Empty code block
})
})
.catch(err => {
console.log(err)
})
All this code does is target the entry-title class (make sure you include the "." since we are targeting a class). As a second argument, we pass the html variable containing our response data from axios. On that selection, we call the .each() method, which takes a callback function where we specify exactly what we want from each matched element.
Now, all that's left is to specify exactly what we want inside our callback function. To do this, update your index.js file with the code below:
const url = 'https://www.punchng.com'
axios(url)
.then(response => {
const html = response.data
const $ = cheerio.load(html);
$(".entry-title", html).each(function() {
const headline = $(this).text();
const url = $(this).find('a').attr('href')
})
})
.catch(err => {
console.log(err)
})
The changes we made create a variable called "headline" containing the text of each entry-title element. We also create another variable to capture the URL for the headline, using the find() method to look for the anchor tag (note: find() can take any tag, not just the anchor tag). We then use the attr() method to read an attribute of that tag, in this case the href attribute.
With this, we have stored the value of our headline, as well as the link to the news inside our "headline" and "url" variables respectively.
All that's left is to create an array to store the results of this search. To do this, let's create an empty array called articles, then push the "headline" and "url" of each item into the array as an object. Finally, we would log the result to the console.
Let's update our index.js with the code below to achieve this:
const url = "https://punchng.com/";
axios(url)
.then((response) => {
const html = response.data;
const $ = cheerio.load(html);
const articles = [];
$(".entry-title", html).each(function () {
const headline = $(this).text();
const url = $(this).find("a").attr("href");
articles.push({
headline,
url,
});
});
console.log(articles);
})
.catch((err) => {
console.log(err);
});
Our final index.js file should look something like this:
//Requiring our packages
const axios = require("axios");
const express = require("express");
const cheerio = require("cheerio");
const PORT = 8000;
const app = express();
//Web scraping
const url = "https://punchng.com/";
axios(url)
.then((res) => {
const data = res.data;
const $ = cheerio.load(data);
const articles = [];
$(".entry-title", data).each(function () {
const headline = $(this).text();
const url = $(this).find("a").attr("href");
articles.push({
headline,
url,
});
});
console.log(articles);
})
.catch((err) => {
console.log(err);
});
app.listen(PORT, () => {
console.log(`Server is currently running on port ${PORT}`);
});
Now, all we need to do in order to get our array of information is go to our command line/terminal and type:
npm run start
If you have been following along, you should have a similar result to what I have below:
The image above shows the result of our work: an array of objects, each with "headline" and "url" key-value pairs.
Great work so far! We have gotten what we are looking for, but how do we extract this information from our terminal and then save the result to a separate file that we can always have access to whenever we need it?
Well, all we need to do is restart our terminal and then run the command below:
npm run start > result.txt
This would store the result of our script inside an automatically generated text file called "result". Once node is done storing the results inside our text file, we should have something like this:
Conclusion
Throughout this article, we covered a bit about how web scraping works with Node.js, and we did this by scraping a newspaper website for its headlines as well as the links to those headlines. You could do a lot more with the axios and cheerio packages; we've only covered the bare minimum in this article.
In case you might have missed a step or two, the full code for this tutorial is also hosted on GitHub.
Also, feel free to ask me questions on concepts that weren't so clear, I'd be more than happy to help.😊
Thanks for reading!