Amazon is the most popular e-commerce website for web scrapers, with billions of product pages scraped every month.
It also has a huge database of product reviews, which can be very useful for market research and competitor monitoring.
You can extract relevant data from Amazon websites and save it in spreadsheet or JSON format. You can also automate the process of regularly updating your data.
Scraping Amazon product reviews isn’t always easy, especially when a login is required. In this guide, you will learn how to scrape Amazon product reviews after logging in. Learn the process of logging in, parsing review data, and exporting reviews to CSV.
Let’s get started.
Prerequisites and project setup
We will collect Amazon reviews using Puppeteer, a Node.js library. Make sure Node.js is installed on your system. If not, visit the official Node.js website and install it.
After installing Node.js, install Puppeteer. Puppeteer is a Node.js library that provides a high-level, user-friendly API for automating tasks and interacting with dynamic web pages.
Next, let’s install and configure Puppeteer.
Open a terminal and create a new folder with any name. (In my case it is amazon_reviews).
mkdir amazon_reviews
Change the current directory to the folder created above.
cd amazon_reviews
You are now in the correct directory. Run the following command to initialize the project and create a package.json file:
npm init -y
Finally, install Puppeteer using the following command:
npm install puppeteer
The process looks like this:
Next, open the folder in your favorite code editor and create a new JavaScript file (index.js). The folder should now contain index.js alongside package.json, package-lock.json, and the node_modules directory created by npm.
Everything was set up successfully. Now you’re ready to code your scraper.
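The snippets in the following steps use a page object without showing where it comes from. As a minimal sketch (not necessarily how the original project is organized), a small wrapper can launch the browser, hand a page to your scraping code, and always close the browser afterwards. The name withBrowserPage and the launcher parameter are assumptions made for illustration; in a real run you would pass in the module returned by require("puppeteer").

```javascript
// Hypothetical helper: opens a browser page, runs `task` with it,
// and closes the browser even if `task` throws.
// `launcher` is expected to be the Puppeteer module, i.e. require("puppeteer").
async function withBrowserPage(launcher, task, launchOptions = {}) {
  const browser = await launcher.launch(launchOptions);
  try {
    const page = await browser.newPage();
    return await task(page);
  } finally {
    // Runs on success and on failure, so no browser process is left behind.
    await browser.close();
  }
}

// Intended usage in index.js (commented out so this sketch runs without Puppeteer):
// const puppeteer = require("puppeteer");
// withBrowserPage(puppeteer, async (page) => {
//   // login and scraping steps go here
// }, { headless: false });
```

Passing the launcher in as an argument keeps the helper easy to try out with a stand-in object before pointing it at a real browser.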
Note: Make sure you have an account with Amazon so you can proceed with the rest of this tutorial.
Step 1: Visit the public page
We will scrape reviews for the product listed below and extract the author's name, review title, and date for each review.
The product URL is: https://www.amazon.com/ENHANCE-Headphone-Customizable-Lighting-Flexible/dp/B07DR59JLP/
The scraper will first log in to Amazon and then navigate to the product URL to collect the reviews.
Step 2: Scrape behind the login
Amazon uses a multi-step login process: the user enters a username or email, clicks the Continue button, enters the password, and finally submits it. The email and password fields are on separate pages.
Use the selector input[name=email] to enter your email ID.
Then use the selector input[id=continue] to click the Continue button.
You are then taken to the password page. Use the selector input[name=password] to enter the password.
Finally, use the selector input[id=signInSubmit] to click the "Sign in" button.
Here is the code for the login process:
const selectors = {
  emailid: 'input[name=email]',
  password: 'input[name=password]',
  continue: 'input[id=continue]',
  signin: 'input[id=signInSubmit]',
};
// Open the sign-in page and submit the email address.
await page.goto(signinURL);
await page.waitForSelector(selectors.emailid);
await page.type(selectors.emailid, "satyam@gmail.com", { delay: 100 });
await page.click(selectors.continue);
// The password field is on the next page.
await page.waitForSelector(selectors.password);
await page.type(selectors.password, "mypassword", { delay: 100 });
// Start waiting for the navigation before clicking so a fast redirect is not missed.
await Promise.all([
  page.waitForNavigation(),
  page.click(selectors.signin),
]);
The code follows the steps described above. It goes to the sign-in URL, enters the email ID, and clicks the Continue button. It then enters the password, clicks the "Sign in" button, and waits for the resulting navigation to finish.
Once the sign-in process is complete, the scraper can navigate to the product page to collect reviews.
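Hard-coding an email and password in the script is easy to leak by accident. As a rough sketch, the same login steps can be wrapped in a function that receives the credentials as arguments. The function name login, the LOGIN_SELECTORS constant, and the options object are hypothetical names introduced here; the selectors are the ones from this guide, and the sign-in URL is left as a parameter.

```javascript
// Selectors taken from the login walkthrough above.
const LOGIN_SELECTORS = {
  email: "input[name=email]",
  continueButton: "input[id=continue]",
  password: "input[name=password]",
  signIn: "input[id=signInSubmit]",
};

// Hypothetical helper: performs the two-page sign-in on a Puppeteer page.
async function login(page, { signinURL, email, password }) {
  if (!email || !password) {
    throw new Error("Missing email or password");
  }
  await page.goto(signinURL);
  await page.waitForSelector(LOGIN_SELECTORS.email);
  await page.type(LOGIN_SELECTORS.email, email, { delay: 100 });
  await page.click(LOGIN_SELECTORS.continueButton);
  await page.waitForSelector(LOGIN_SELECTORS.password);
  await page.type(LOGIN_SELECTORS.password, password, { delay: 100 });
  // Begin waiting for navigation before the click so the redirect is not missed.
  await Promise.all([
    page.waitForNavigation(),
    page.click(LOGIN_SELECTORS.signIn),
  ]);
}
```

You could then call it with values read from the environment, for example login(page, { signinURL, email: process.env.AMAZON_EMAIL, password: process.env.AMAZON_PASSWORD }), where the two variable names are only examples.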
Step 3: Parse the review data
You have successfully logged in and are now on the product page you want to scrape. Next, let's parse the review data.
This page contains various reviews. They are contained within a parent div with the ID cm-cr-dp-review-list, which holds all reviews for the current page. If you want to access more reviews, you will need to use pagination to navigate to the next page.
This parent div has multiple child divs, and each child div holds one review. To extract the reviews you can use the selector #cm-cr-dp-review-list div.review.
const selectors = {
  allReviews: '#cm-cr-dp-review-list div.review',
  authorName: 'div[data-hook="genome-widget"] span.a-profile-name',
  reviewTitle: '[data-hook=review-title]>span:not([class])',
  reviewDate: 'span[data-hook=review-date]',
};
The allReviews selector first finds the element with the ID cm-cr-dp-review-list and then matches every div element with the class review inside it.
The following code snippet first navigates to the product URL, waits for the selector to load, and then stores all review elements in the reviewElements variable.
await page.goto(productURL);
await page.waitForSelector(selectors.allReviews);
const reviewElements = await page.$$(selectors.allReviews);
Next, let’s extract the author name, review title, and date.
To parse the author name, you can use the selector div[data-hook="genome-widget"] span.a-profile-name. This selector first finds the div element whose data-hook attribute is set to genome-widget, because the name is inside this div. It then finds the span element with the class name a-profile-name. This is the element that contains the author's name.
const author = await reviewElement.$eval(selectors.authorName, (element) => element.textContent);
To parse the title of a review, you can use the CSS selector [data-hook="review-title"] > span:not([class]). This selector matches span elements that are direct children of the [data-hook="review-title"] element and that do not have a class attribute.
const title = await reviewElement.$eval(selectors.reviewTitle, (element) => element.textContent);
To parse the date, you can use the CSS selector span[data-hook="review-date"]. This selector matches the span element whose data-hook attribute is set to review-date. This is the element that contains the review date.
const rawReviewDate = await reviewElement.$eval(selectors.reviewDate, (element) => element.textContent);
Note that this retrieves the entire text, including the location, not just the date. So you need to extract the date from the text using a regular expression pattern.
Then combine all the data into a reviewData object and push it to the final list, reviewsData.
// Matches dates such as "August 12, 2023".
const datePattern = /(\w+\s\d{1,2},\s\d{4})/;
const match = rawReviewDate.match(datePattern);
const reviewDate = match ? match[0].replace(',', '') : "Date not found";
const reviewData = {
  author,
  title,
  reviewDate,
};
reviewsData.push(reviewData);
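To see what the date pattern does on its own, here is a small standalone sketch. The function name extractReviewDate and the sample strings are made up for illustration and are only assumed to resemble the text Amazon shows next to a review.

```javascript
// Month name, day, comma, four-digit year, e.g. "August 12, 2023".
const DATE_PATTERN = /(\w+\s\d{1,2},\s\d{4})/;

// Hypothetical helper: pulls the date out of the raw review-date text.
function extractReviewDate(rawText) {
  const match = rawText.match(DATE_PATTERN);
  // Drop the comma so the value does not introduce an extra CSV column later.
  return match ? match[0].replace(",", "") : "Date not found";
}

// Illustrative inputs (assumed format, not captured from a real page).
console.log(extractReviewDate("Reviewed in the United States on August 12, 2023"));
// prints "August 12 2023"
console.log(extractReviewDate("Reviewed in the United Kingdom on 12 August 2023"));
// prints "Date not found"
```

The second call shows a limitation worth keeping in mind: the pattern expects the month before the day, so dates written in other orders fall through to the fallback string.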
These parsing steps run for every review on the current page. Here is the complete loop that parses the data:
// reviewsData is the array that collects the results, e.g. const reviewsData = [];
for (const reviewElement of reviewElements) {
  const author = await reviewElement.$eval(selectors.authorName, (element) => element.textContent);
  const title = await reviewElement.$eval(selectors.reviewTitle, (element) => element.textContent);
  const rawReviewDate = await reviewElement.$eval(selectors.reviewDate, (element) => element.textContent);
  // Matches dates such as "August 12, 2023".
  const datePattern = /(\w+\s\d{1,2},\s\d{4})/;
  const match = rawReviewDate.match(datePattern);
  const reviewDate = match ? match[0].replace(',', '') : "Date not found";
  const reviewData = {
    author,
    title,
    reviewDate,
  };
  reviewsData.push(reviewData);
}
Wonderful! The relevant data has been successfully parsed and is now stored in reviewsData as a list of JSON-style objects.
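One thing to be aware of in the loop above: $eval throws an error when the selector matches nothing, so a single review with a missing title would stop the whole run. A more defensive variant might look like the sketch below. The helper names textOrFallback and parseReview, and the fallback strings, are assumptions made for this example.

```javascript
// Hypothetical helper: returns the trimmed text of the first match,
// or `fallback` when $eval fails (for example because nothing matched).
async function textOrFallback(element, selector, fallback = "") {
  try {
    const text = await element.$eval(selector, (node) => node.textContent);
    return text.trim();
  } catch (error) {
    // Note: this also hides other errors; log `error` here while debugging.
    return fallback;
  }
}

// Hypothetical helper: parses one review element into a plain object.
async function parseReview(reviewElement, selectors) {
  return {
    author: await textOrFallback(reviewElement, selectors.authorName, "Unknown author"),
    title: await textOrFallback(reviewElement, selectors.reviewTitle, "Untitled"),
    rawReviewDate: await textOrFallback(reviewElement, selectors.reviewDate, ""),
  };
}
```

Inside the loop you would call parseReview(reviewElement, selectors) and then run the date pattern on rawReviewDate as before.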
Step 4: Export reviews to CSV
Reviews are parsed in JSON format, making them somewhat human readable. Converting this data to CSV format makes it easier to read and use for other purposes.
There are many ways to convert JSON data to CSV, but we’ll use a simple and effective approach. Here is a simple code snippet to convert JSON to CSV.
const fs = require("fs");

// Header row, followed by one line per review.
let csvContent = "Author,Title,Date\n";
for (const review of reviewsData) {
  const { author, title, reviewDate } = review;
  csvContent += `${author},"${title}",${reviewDate}\n`;
}
const csvFileName = "amazon_reviews.csv";
fs.writeFileSync(csvFileName, csvContent, "utf8");
The output of the CSV file is as follows:
And that's it!
The complete code uploaded to GitHub can be found here.
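One caveat with the simple CSV snippet in this step: a title that contains a double quote, or an author name that contains a comma, will shift the columns. The sketch below shows one common way to handle this by quoting fields and doubling embedded quotes. The function names escapeCsvField and reviewsToCsv are illustrative and are not part of the original code.

```javascript
// Quote a field only when it contains a comma, a quote, or a line break,
// and double any embedded quotes.
function escapeCsvField(value) {
  const text = String(value);
  if (/[",\r\n]/.test(text)) {
    return '"' + text.replace(/"/g, '""') + '"';
  }
  return text;
}

// Hypothetical helper: turns the reviews array into CSV text with a header row.
function reviewsToCsv(reviews) {
  const rows = [["Author", "Title", "Date"]];
  for (const { author, title, reviewDate } of reviews) {
    rows.push([author, title, reviewDate]);
  }
  return rows.map((row) => row.map(escapeCsvField).join(",")).join("\n") + "\n";
}

// Made-up sample data to show the escaping.
const sample = [{ author: "Doe, Jane", title: 'Great "bass"', reviewDate: "August 12 2023" }];
console.log(reviewsToCsv(sample));
```

The returned string can be passed to fs.writeFileSync in place of csvContent.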
Conclusion
In this guide, you learned how to use Puppeteer to scrape Amazon product reviews after login. You learned how to log in, parse relevant data, and save it to a CSV file.
For further practice, you can use pagination to extract all reviews on all pages.
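As a starting point for that exercise, here is a rough sketch of a pagination loop. It rests on several assumptions: the caller supplies a parsePage function that returns the reviews for the current page, the caller supplies the CSS selector of the "next page" link (nextSelector, not given in this guide), and clicking that link triggers a full page navigation. The name scrapeAllPages is hypothetical.

```javascript
// Hypothetical helper: collects reviews page by page until there is no
// "next" link or `maxPages` pages have been read.
async function scrapeAllPages(page, parsePage, nextSelector, maxPages = 5) {
  const allReviews = [];
  for (let pageNumber = 1; pageNumber <= maxPages; pageNumber++) {
    allReviews.push(...(await parsePage(page)));
    if (pageNumber === maxPages) break;

    // page.$ resolves to null when nothing matches the selector.
    const nextLink = await page.$(nextSelector);
    if (!nextLink) break;

    // Assumes the click causes a navigation; adjust if the site updates in place.
    await Promise.all([page.waitForNavigation(), nextLink.click()]);
  }
  return allReviews;
}
```

The maxPages limit is there so that a wrong selector or an unexpected page layout cannot keep the loop running indefinitely.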