
Amazon product pages contain public data that can be used for price tracking, market research, and competitor analysis. When you scrape Amazon product information, this data is retrieved directly from the page HTML. This guide shows how to scrape data from Amazon using Python, focusing on publicly visible product information and the steps required to collect structured results.
Amazon product pages expose a limited set of publicly visible fields that can be collected directly from the page. When you scrape Amazon product data, the focus is on information that appears in the standard product detail view rather than content tied to user accounts or internal systems.
The following fields are commonly available on Amazon product pages:
- Product title
- Price
- Rating
- Review count
- Images
- Product description
All of these fields are part of the page content rendered for regular visitors and can be accessed directly from the product page. This article does not rely on official APIs, and it does not involve login sessions, account specific information, or any private data. The scope is limited to publicly visible Amazon product data that can be retrieved directly from product pages.
Amazon product pages present core product information directly in the HTML returned to the browser. An initial page request often returns a complete document structure that contains the main product details.
At the same time, product data is not exposed through a single consistent structure. Some values are stored in HTML attributes, while others appear inside embedded script blocks. Page structure can also vary based on region, product category, or page version. As a result, DOM selectors do not behave as stable interfaces.
For individual product pages, HTTP requests combined with HTML parsing provide a reasonable starting point. Most scraping failures are caused by how requests are made rather than by changes to the page structure itself. This guide focuses on using requests and BeautifulSoup to extract publicly visible data from Amazon product pages.
Note: For extended scraping workflows, maintaining consistent access often requires managing request distribution. Tools that support this are discussed in later sections.
Use Python 3.x for this guide. Create a virtual environment to keep dependencies isolated.
```shell
python3 -m venv venv
source venv/bin/activate
pip install requests beautifulsoup4
```

Run a quick check:

```python
import requests
from bs4 import BeautifulSoup

print("Environment ready")
```

Start by fetching HTML from a public product page. Replace the placeholder ASIN with a real product ID from Amazon.
```python
import requests

url = "https://www.amazon.com/dp/REPLACE_WITH_REAL_ASIN"
response = requests.get(url, timeout=10)

print("Status:", response.status_code)
print("Length:", len(response.text))
print("Preview:", response.text[:150])
```

During early tests, you may encounter:
- CAPTCHA pages
- 503 responses
- Partial or empty content
These outcomes are common and do not mean your code is broken.
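One way to absorb these transient failures is to retry with exponential backoff instead of giving up on the first 503 or CAPTCHA page. The sketch below assumes a lowercase "captcha" marker appears somewhere in blocked responses; adjust the check to what you actually observe:

```python
import random
import time

import requests

def backoff_delay(attempt, base_delay=2.0):
    # exponential backoff (2s, 4s, 8s, ...) plus up to 1s of random jitter
    return base_delay * (2 ** attempt) + random.random()

def fetch_with_retry(url, headers=None, max_attempts=4):
    """Fetch a page, retrying on 503s and suspected CAPTCHA responses."""
    for attempt in range(max_attempts):
        try:
            response = requests.get(url, headers=headers, timeout=10)
            blocked = response.status_code == 503 or "captcha" in response.text.lower()
            if not blocked:
                return response
        except requests.RequestException:
            pass  # treat network errors like blocked responses and retry
        time.sleep(backoff_delay(attempt))
    return None  # caller decides how to handle a persistently blocked URL
```

Returning `None` after the final attempt lets the calling code skip the URL and continue rather than crash mid-run.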
Some pages return more complete HTML when the request resembles a browser.
```python
import requests

url = "https://www.amazon.com/dp/REPLACE_WITH_REAL_ASIN"
headers = {
    "User-Agent": "Mozilla/5.0",
    "Accept-Language": "en-US,en;q=0.9",
}
response = requests.get(url, headers=headers, timeout=10)

print("Status:", response.status_code)
print("Length:", len(response.text))
print("Preview:", response.text[:150])
```

If the response length increases and the preview resembles a normal product page, the request is working as expected.
Convert the raw HTML into a queryable structure.
```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(response.text, "html.parser")
print("Page title:", soup.title.text if soup.title else "None")
```

This parsing step turns raw HTML into an object you can search and extract values from. For a quick explanation of what this step means, see our guide on data parsing.
Each field follows the same pattern: locate the selector, extract the value, and validate the output.
```python
def clean_text(element):
    return element.get_text(strip=True) if element else None

title_tag = soup.select_one("#productTitle")
title = clean_text(title_tag)
print("Title:", title)

rating_tag = soup.select_one("span.a-icon-alt")
rating = clean_text(rating_tag)

review_tag = soup.select_one("#acrCustomerReviewText")
review_count = clean_text(review_tag)

print("Rating:", rating)
print("Review count:", review_count)
```

If the returned value does not resemble a rating, the HTML response may be incomplete.
```python
price_tag = soup.select_one("span.a-price span.a-offscreen")
price = clean_text(price_tag)
print("Price:", price)
```

Prices may be missing or structured differently. Always allow missing values so the script continues running; this matters when scraping Amazon prices across many products. For deeper handling of price variations, see our article on how to scrape prices.
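If prices will later be compared or aggregated, it helps to convert the raw string into a numeric type. A minimal sketch, assuming US-style formatting with "." as the decimal separator (other marketplaces need locale-specific handling):

```python
import re
from decimal import Decimal, InvalidOperation

def parse_price(text):
    """Convert a scraped price string such as '$1,299.99' to a Decimal."""
    if not text:
        return None
    cleaned = re.sub(r"[^\d.]", "", text)  # drop currency symbols and commas
    try:
        return Decimal(cleaned)
    except InvalidOperation:
        return None  # e.g. empty string after cleaning, or malformed input
```

`Decimal` avoids the rounding surprises of binary floats, which matters for price tracking over time.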
```python
image_urls = []
for img in soup.select("img"):
    src = img.get("src")
    if src and src.startswith("https://") and not src.endswith(".svg") and ".gif" not in src:
        image_urls.append(src)

image_urls = list(dict.fromkeys(image_urls))  # de-duplicate, keep order
print("Images found:", len(image_urls))
print("Sample:", image_urls[:2])
```

If you need deeper handling later, see our guide on scraping images from websites.
```python
desc_tag = soup.select_one("#productDescription")
description = clean_text(desc_tag)
print("Description:", (description[:100] + "...") if description else "None")
```

Store the extracted fields in a CSV file.
```python
import csv

with open("product.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["URL", "Title", "Price", "Rating", "ReviewCount", "ImageCount"])
    writer.writerow([url, title, price, rating, review_count, len(image_urls)])

print("Saved: product.csv")
```

Start from a search results page and extract product links.
```python
import re
import requests
from bs4 import BeautifulSoup

headers = {
    "User-Agent": "Mozilla/5.0",
    "Accept-Language": "en-US,en;q=0.9",
}
search_url = "https://www.amazon.com/s?k=laptop"
search_response = requests.get(search_url, headers=headers, timeout=10)

print("Search status:", search_response.status_code)
print("Search length:", len(search_response.text))

search_soup = BeautifulSoup(search_response.text, "html.parser")
links = []
for a in search_soup.select('a[href*="/dp/"], a[href*="/gp/product/"]'):
    href = a.get("href", "")
    match = re.search(r"/(?:dp|gp/product)/([A-Z0-9]{10})", href)
    if match:
        links.append("https://www.amazon.com/dp/" + match.group(1))

links = list(dict.fromkeys(links))  # de-duplicate, keep order
print("Product links found:", len(links))
print("Sample:", links[:3])
```

If no links are returned, the HTML response may be incomplete. You can reuse the same extraction logic from Step 5 for each product link.
Scraping a small number of Amazon product pages usually works without issues. Requests return complete HTML, selectors match as expected, and extracted fields look correct, which often creates the impression that the scraping logic is reliable.
As scraping continues and requests become repetitive or long running, a different set of problems begins to appear. These failures tend to follow consistent patterns as access accumulates.
Common issues include:
- Rate limiting: As request volume grows, responses may slow down or return incomplete HTML, even for pages that previously loaded correctly.
- IP blocking: Repeated access from the same network source can lead to rejected responses or redirects, interrupting data collection.
- CAPTCHA challenges: Product pages may be replaced by verification screens, leaving expected fields missing from the HTML.
- Layout variation: Page structure can shift across regions, categories, or versions, causing selectors to stop matching consistently.
- Long-running instability: Jobs that run for extended periods may degrade over time, producing partial results even when the scraping logic remains unchanged.
These issues rarely indicate problems with parsing logic. They emerge as access patterns accumulate and scraping moves beyond a small number of isolated requests.
Long-term scraping stability depends less on how data is extracted and more on how access behaves over time. When instability appears, the root cause is usually not selector logic, but how repeated requests are handled.
Several factors play a central role in maintaining stability:
- Request pacing: Conservative, predictable timing helps reduce response degradation during sustained scraping runs.
- Session consistency: Keeping related requests aligned improves continuity when repeatedly loading product pages or paginated results.
- IP distribution and rotation: Traffic from a single source tends to fail sooner. Distributing requests across multiple origins more closely resembles normal access patterns, which is why many teams rely on web scraping proxies at scale.
- Access environment: Stability depends on the overall access setup rather than individual script changes. Approaches that incorporate real user IP distribution, including residential proxies, are commonly used to maintain consistent results as volume increases.
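In practice, these factors translate into reusing a single `requests.Session` and routing it through a proxy endpoint. The gateway URL and credentials below are placeholders, not real provider values; substitute the details from your proxy provider's dashboard:

```python
import requests

# Placeholder gateway and credentials -- replace with your provider's values.
PROXY_URL = "http://USERNAME:PASSWORD@gateway.example.com:8000"

def make_session(proxy_url=PROXY_URL):
    """Build a session that sends all traffic through one proxy endpoint.

    Reusing a session keeps headers, cookies, and connection state
    consistent across related product-page requests.
    """
    session = requests.Session()
    session.proxies = {"http": proxy_url, "https": proxy_url}
    session.headers.update({
        "User-Agent": "Mozilla/5.0",
        "Accept-Language": "en-US,en;q=0.9",
    })
    return session
```

With rotating gateways, each new session (or each request through a rotating endpoint) exits from a different IP, while a sticky-session endpoint keeps one IP for related requests.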
For scraping workflows that require long-running stability, IPcook offers high-quality, affordable proxy access with consistent request behavior. By routing requests through a large pool of real residential IPs, IPcook helps reduce repeated access patterns while supporting rotation and session continuity during extended scraping runs.
- Entry plans start at $3.2 for 1 GB, with lower per-GB rates as traffic volume increases, down to $0.5 per GB
- 55M+ residential IPs across 185+ locations, helping requests blend into normal user traffic
- Flexible IP rotation and sticky sessions (up to 24 hours) to keep related product-page requests aligned
- Pay-as-you-go pricing with non-expiring traffic, suitable for burst runs as well as long-running scraping workloads
You can currently start with 100 MB of free residential proxy traffic to verify access consistency, response completeness, and request behavior under real Amazon scraping conditions before scaling further.
Scraping Amazon product data at scale is rarely limited by parsing logic. As request volume increases, reliability depends on how stable responses remain and how closely request behavior reflects normal user traffic. When those patterns repeat or drift, partial HTML and blocked access gradually appear, even when extraction code stays unchanged.
To maintain consistent results during extended runs, distributed residential IPs provide a stronger and more reliable access foundation. IPcook offers large-scale residential proxy coverage with rotation and session control that keeps Amazon scraping stable as workloads expand, while remaining cost-efficient for sustained data collection.