Creating A Web Scraper with Node.js

Learn how to extract information from any website easily!

Introduction

The term "Web Scraping" refers to the extraction of data from a website quickly and accurately. The web scraping software may directly access the World Wide Web using the Hypertext Transfer Protocol or a web browser. While web scraping can be done manually by a software user, the term typically refers to automated processes implemented using a bot or web crawler. It is a form of copying in which specific data is gathered and copied from the web, typically into a central local database or spreadsheet, for later retrieval or analysis.

Practical use case:

You're working at a company that has asked you to compile a list of companies offering a particular product or service. It could be things like:

  • Real Estate

  • Cryptocurrency

  • Latest News, etc.

Throughout this article, we'll explore how web scrapers work, as well as the Node packages that make them possible.

What we are building 🚀

Websites hold a lot of information, from the services a website offers to the videos and images embedded in it, and sometimes we want only a specific type of information from these websites. With this web scraper, we simply type in the website's URL, target a particular tag, class, or id (found via the browser's developer tools) containing the elements we want, extract the data, and transfer it into a document so the just-extracted information is saved.

For review or practice, the code to build this web-scraper is provided at the end of this article.😁

Prerequisites

The following assumptions are made about the reader using this tutorial:

  • Basic knowledge of JavaScript and Node.js

  • Knowledge of the command line/terminal

  • Finally, an open mind

Getting Started

Before we get started extracting data from websites, we need to install some Node packages that will help us achieve our goal. To do this, head over to the Node Package Manager website and note the following packages, as we will be using them throughout this project. The four important packages for this article are highlighted below:

  • Axios: a promise-based HTTP client for Node.js and the browser

  • Cheerio: a fast, server-side implementation of core jQuery for parsing and traversing HTML

  • Express: a minimal and flexible web framework for Node.js

  • Nodemon: a utility that automatically restarts a Node application when file changes are detected

Let's Begin 👨‍💻

We'll start by going to our command line/terminal and typing in the commands below to create a new directory and initialize a new Node project inside it:

mkdir web-scraper
cd web-scraper
npm init -y

The command above creates a new directory called "web-scraper", switches to it, and initializes a new Node.js project by choosing its default settings with the -y flag.
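For reference, the auto-generated package.json should look roughly like this (the exact defaults can vary slightly between npm versions):

{
  "name": "web-scraper",
  "version": "1.0.0",
  "description": "",
  "main": "index.js",
  "scripts": {
    "test": "echo \"Error: no test specified\" && exit 1"
  },
  "keywords": [],
  "author": "",
  "license": "ISC"
}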

Next, we want to install the Node packages we'll be using. We can do so by running:

npm install express axios cheerio nodemon

This command installs the latest versions of the packages highlighted above.

You can read more about each of these packages on the npm website.

Scripts📝

Moving forward, we need to create two scripts: the first is our index.js script containing our logic, and the other goes in our auto-generated package.json file.

Let's create a script in the package.json file in our directory (you can open this file in your code editor). In the package.json file, under "scripts", we'll create a script called start by typing:

"start": "nodemon index.js"

This script tells the nodemon dependency we installed to watch for changes to our index.js file and restart the server automatically.
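In context, the "scripts" section of our package.json should now look something like this (the auto-generated test entry may differ on your machine):

"scripts": {
  "test": "echo \"Error: no test specified\" && exit 1",
  "start": "nodemon index.js"
},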

Requiring Our Packages📦

In order to build this web scraper, we need to require the packages we installed. Inside our index.js file, we start off by typing the code below:

const axios = require("axios");
const cheerio = require("cheerio");
const express = require("express");

This gives us access to these packages and their exports inside our script.

Creating Our Server💻

Next, we need to call express and save its return value to a variable. I'll save mine as "app" by typing:

const app = express();

The Express app gives us a listen() method that binds our server to a particular port number. We have to choose this port number ourselves so the server knows where to listen. We'll listen on this port and log a confirmation message to the console to show that the server is up and running.

We would update our previous index.js file, with the new code sample below:

const PORT = 8000;
const axios = require("axios");
const cheerio = require("cheerio");
const express = require("express");

const app = express();
app.listen(PORT, () => console.log(`Server is currently running on port ${PORT}`))

With the code above, we have defined our PORT number (8000) and used the listen() method to start the server on that port.

We just need to run our index.js script in our command line/terminal by typing:

npm run start

This runs the start script from the package.json file in our directory, and if everything works correctly, we should get the output below:

[Screenshot: terminal output showing "Server is currently running on port 8000"]

With this, we have successfully set up a working server running on port 8000!

Scraping The Web 🧹

To start scraping websites, we are going to use the axios package. Axios works by taking a URL, visiting it, and returning the response data from it. In this case, the data we get back is the HTML of the page, which we can then work with.

We'll do this by creating a 'url' variable containing the website we want to extract data from and passing it to axios. As an example, we'll be using the Punch Newspapers official website.

Here's how:

const url = 'https://www.punchng.com'
axios(url)

Since the axios package is promise-based, we can chain on it by applying the then() method, which lets us handle the response as soon as the promise resolves, along with some error handling using the catch() method to log any errors to the console. We can do this by typing the code below:

const url = 'https://www.punchng.com'
axios(url)
  .then(response => {
    const html = response.data
    console.log(html)
  })
  .catch(err => {
    console.log(err)
  })
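As a side note, if you prefer async/await over promise chaining, a minimal equivalent sketch would be:

const url = 'https://www.punchng.com'

// Same request as above, written with async/await instead of .then()/.catch()
async function fetchHtml() {
  try {
    const response = await axios(url);
    console.log(response.data);
  } catch (err) {
    console.log(err);
  }
}

fetchHtml();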

Either way, the result of the code above is the entire HTML of the Punch Newspapers home page. Here's a snapshot of the HTML:

[Screenshot: a snippet of the raw HTML logged to the console]

This is great! But how do we start picking out specific elements, like buttons, classes, etc., found inside our HTML? To achieve this, we need cheerio to load our HTML. We can do so using the load() method that comes with the cheerio package.

We would update our index.js file with the code below:

const url = 'https://www.punchng.com'
axios(url)
  .then(response => {
    const html = response.data
    const $ = cheerio.load(html);
  })
  .catch(err => {
    console.log(err)
  })

The update made above to our index.js file uses cheerio to load our HTML data and store the resulting object in the '$' variable. With this, we can start to target any type of content on the page.
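Since the '$' variable works much like jQuery, familiar selector patterns apply. Here are a few illustrative calls (the selectors below are generic examples, not specific to the Punch website):

const pageTitle = $("title").text();          // select by tag and read its text
const intro = $("#intro").html();             // select by id and read its inner HTML
const matches = $(".entry-title").length;     // select by class and count the matches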

Up next, we need to head over to our website and inspect the page for the exact information we want. Here's what the website looks like:

[Screenshot: the Punch Newspapers home page]

All that's left is to inspect the page using the shortcut Ctrl + Shift + I. This opens up the developer tools. Once you have it open, it should look similar to this:

[Screenshot: the Punch home page with the developer tools panel open]

Now, all we need to do is go through the website, bearing in mind the exact type of content we want to scrape off it. In this article, we'll be scraping the headline of each news story as well as the URL where you can read more about it.

Here's what it looks like in my browser:

[Screenshot: Chrome dev tools on the left showing the h3.entry-title markup, with the corresponding underlined headline on the right]

The image on the right shows the underlined headline that we are inspecting with the Chrome dev tools on the left. As we can see from the code on the left, the headline "Suspended Accountant..." is wrapped inside an a tag, which is a child element of its h3 parent tag with a class value of entry-title. If you go ahead and inspect the website, you'll notice that all the headlines have that particular class on them. This is what we'll target in our code.

Since we are looking for the headline and link of every story on the page, we simply need to target the entry-title class and visit each item carrying that class. This can be done by updating our index.js with the code below:

const url = 'https://www.punchng.com'
axios(url)
  .then(response => {
    const html = response.data
    const $ = cheerio.load(html);
    $(".entry-title", html).each(function () {
      // Empty code block
    })
  })
  .catch(err => {
    console.log(err)
  })

All this code does is target the entry-title class (make sure you include the "." since we are targeting a class). As a second argument, we pass the html variable containing our response data from axios. On the result, we call the each() method, which takes a callback function in which we'll specify exactly what we want from each match.

Now, all that's left for us is to specify exactly what we want inside our callback function. To do this, update your index.js file with the code below:

const url = 'https://www.punchng.com'

axios(url)
  .then(response => {
    const html = response.data
    const $ = cheerio.load(html);

    $(".entry-title", html).each(function () {
      const headline = $(this).text();
      const url = $(this).find('a').attr('href')
    })
  })
  .catch(err => {
    console.log(err)
  })

The changes we made simply create a variable called "headline", which contains the text belonging to the entry-title element. We also create another variable to hold the URL for the headline, using the find() method to look for the anchor tag inside it (note: find() can take any tag, not just the anchor tag). Then we use the attr() method to read the attribute we want from that tag, in this case the href attribute.

With this, we have stored the value of our headline, as well as the link to the news inside our "headline" and "url" variables respectively.
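To make the find()/attr() pattern more concrete, here are a couple of hypothetical variations you could use inside the same each() callback to target other tags and attributes:

const imageSrc = $(this).find("img").attr("src");    // read an image's source URL
const linkTitle = $(this).find("a").attr("title");   // read a link's title attribute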

All that's left is to create an array to store the results of this search. To do this, let's create an empty array called articles, then push the "headline" and "url" of each match into it as an object. Finally, we log the result to the console.

Let's update our index.js with the code below to achieve this:

const url = "https://punchng.com/";
axios(url)
  .then((response) => {
    const html = response.data;
    const $ = cheerio.load(html);
    const articles = [];
    $(".entry-title", html).each(function () {
      const headline = $(this).text();
      const url = $(this).find("a").attr("href");
      articles.push({
        headline,
        url,
      });
    });
    console.log(articles);
  })
  .catch((err) => {
    console.log(err);
  });

Our final index.js file should look something like this:

//Requiring our packages
const axios = require("axios");
const express = require("express");
const cheerio = require("cheerio");
const PORT = 8000;
const app = express();

//Web scraping
const url = "https://punchng.com/";
axios(url)
  .then((res) => {
    const data = res.data;
    const $ = cheerio.load(data);
    const articles = [];
    $(".entry-title", data).each(function () {
      const headline = $(this).text();
      const url = $(this).find("a").attr("href");
      articles.push({
        headline,
        url,
      });
    });
    console.log(articles);
  })
  .catch((err) => {
    console.log(err);
  });

app.listen(PORT, () => {
  console.log(`Server is currently running on port ${PORT}`);
});
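As an optional extension (not part of the original script), we could also put the Express server to work and serve the scraped results over HTTP. A minimal sketch of such a route, added above app.listen(), might look like this (the /articles path is just an example name):

// Hypothetical route: scrape on request and respond with the articles as JSON
app.get("/articles", (req, res) => {
  axios(url)
    .then((response) => {
      const $ = cheerio.load(response.data);
      const articles = [];
      $(".entry-title", response.data).each(function () {
        articles.push({
          headline: $(this).text(),
          url: $(this).find("a").attr("href"),
        });
      });
      res.json(articles);
    })
    .catch((err) => res.status(500).json({ error: err.message }));
});

For this tutorial, though, logging to the console is all we need.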

Now, all we need to do in order to get our array of information is go to our command line/terminal and type:

npm run start

If you have been following along, you should have a result similar to what I have below:

[Screenshot: console output showing an array of objects with "headline" and "url" keys]

The image above shows the result of our work, which is essentially an array of objects, each with "headline" and "url" key-value pairs.

Great work so far! We have gotten what we were looking for, but how do we extract this information from our terminal and save the result to a separate file that we can access whenever we need it?

Well, all we need to do is stop the running server (Ctrl + C) and then type in the command below:

npm run start > result.txt

This stores the output of our script inside an automatically generated text file called "result.txt". Since nodemon keeps watching for changes, press Ctrl + C once the results have been written. We should then have something like this:

[Screenshot: the contents of the result.txt file]
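Alternatively, instead of shell redirection, we could write the results to a file from within the script itself using Node's built-in fs module. A minimal sketch (the result.json filename is just an example) would be:

const fs = require("fs");

// Inside the .then() callback, after the articles array has been filled:
fs.writeFile("result.json", JSON.stringify(articles, null, 2), (err) => {
  if (err) console.log(err);
  else console.log("Articles saved to result.json");
});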

Conclusion

Throughout this article, we covered a bit of how web scraping works with Node.js, and we did this by scraping a newspaper website for its headlines as well as the links to those headlines. You can do a lot more with the axios and cheerio packages; we've only covered the bare minimum in this article.

In case you might have missed a step or two, the full code for this tutorial is also hosted on GitHub.

Also, feel free to ask me questions on concepts that weren't so clear, I'd be more than happy to help.😊

Thanks for reading!