Understanding the Problem and the Solution
When web scraping, one of the common challenges is dealing with dynamic content that changes based on user input or selection. In this case, we’re trying to scrape a table of historical data from a website, but the page URL stays the same whether the default or a custom date range is selected, so the data for other ranges cannot be reached simply by changing the URL.
The provided Python code attempts to solve this problem using BeautifulSoup for parsing HTML and requests for making HTTP requests. However, it seems that the approach taken by the original poster has some limitations, and we need to dive deeper into the technical aspects of web scraping and JavaScript/AJAX interactions to find a more reliable solution.
Understanding the Website’s Technology Stack
To start with, let’s take a closer look at the website’s technology stack. The URL https://id.investing.com/commodities/gold-historical-data points us towards a webpage that uses JavaScript and AJAX for data fetching. Upon further inspection, we can see that the webpage contains two main parts:
- Static HTML Content: The static HTML content includes the table structure, CSS styles, and initial data.
- JavaScript/AJAX Interactions: The JavaScript/AJAX interactions are responsible for dynamic data updates based on user input or selections.
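A quick way to see this split is to parse the raw HTML offline and check what is already in the markup versus what JavaScript fills in later. The fragment below is invented for illustration (it is not the real page): the table skeleton and seed rows live in the static HTML, while the container that AJAX updates starts out empty.

```python
from bs4 import BeautifulSoup

# Hypothetical static HTML mimicking the split described above:
# the table skeleton and initial rows are in the markup, while
# rows for other date ranges arrive later via AJAX.
static_html = """
<table id="curr_table">
  <tr><th>Tanggal</th><th>Terakhir</th></tr>
  <tr><td>08/12/2020</td><td>1.847,10</td></tr>
</table>
<div id="ajax-updated-rows"></div>
"""

soup = BeautifulSoup(static_html, "html.parser")
table = soup.find("table", id="curr_table")
print(len(table.find_all("tr")))  # → 2: the seed rows are already present
print(soup.find("div", id="ajax-updated-rows").text.strip() == "")  # → True: AJAX target starts empty
```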
Understanding the Requests Made to the Server
Upon further inspection of the provided code, we can see that two main requests are made:
- Initial Request: A GET request is sent to https://id.investing.com/instruments/HistoricalDataAjax with a User-Agent header.
- Subsequent Request: An additional POST request is sent to the same URL with various parameters, including st_date, end_date, and others.
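These two requests can be sketched offline, without hitting the server, by building a requests.PreparedRequest and inspecting exactly what would be sent. The parameter values below mirror the ones used later in this article.

```python
import requests

url = "https://id.investing.com/instruments/HistoricalDataAjax"
headers = {
    "User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:80.0) Gecko/20100101 Firefox/80.0",
    "X-Requested-With": "XMLHttpRequest",
}
payload = {"curr_id": "8830", "st_date": "01/01/2010", "end_date": "12/31/2010"}

# Prepare (but do not send) the POST so we can inspect what
# would actually go over the wire.
prepared = requests.Request("POST", url, data=payload, headers=headers).prepare()

print(prepared.method)                             # POST
print("st_date=01%2F01%2F2010" in prepared.body)   # True: dates are form-encoded
print(prepared.headers["X-Requested-With"])        # XMLHttpRequest
```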
Understanding How to Interact with the Server
To solve this problem, we need to understand how to interact with the server. In this case, we can see that the website uses an AJAX endpoint for data fetching. This means that our requests should also be in a format suitable for an AJAX request.
In addition, we need to parse the HTML response from the server correctly using BeautifulSoup. We will use the requests library to make HTTP requests and BeautifulSoup to parse the HTML content.
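The parsing step itself is straightforward. Here is a minimal, self-contained sketch, with an invented HTML fragment standing in for a real server response:

```python
from bs4 import BeautifulSoup

# Invented fragment shaped like the historical-data table.
html = """
<table>
  <tr><th>Tanggal</th><th>Terakhir</th><th>Pembukaan</th></tr>
  <tr><td>08/12/2020</td><td>1.847,10</td><td>1.855,60</td></tr>
  <tr><td>07/12/2020</td><td>1.866,00</td><td>1.842,40</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
rows = [
    [cell.get_text(strip=True) for cell in row.find_all(["td", "th"])]
    for row in soup.find("table").find_all("tr")
]
print(rows[0])  # header row: ['Tanggal', 'Terakhir', 'Pembukaan']
print(rows[1])  # first data row
```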
A Deeper Look at the JavaScript/AJAX Interactions
Let’s take a closer look at the JavaScript/AJAX interactions on the website:
- Request Parameters: The POST request sent by our code contains various parameters, including st_date and end_date. However, we need to make sure that these parameters are in the correct format.
- Header Values: We also observe that certain header values are set on our requests, such as User-Agent and X-Requested-With.
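The code in this article passes st_date and end_date as MM/DD/YYYY strings. Rather than hand-building those strings, datetime.strftime keeps the formatting explicit; the helper name below is our own invention:

```python
from datetime import date

def to_request_date(d: date) -> str:
    # The date parameters in this article use MM/DD/YYYY.
    return d.strftime("%m/%d/%Y")

print(to_request_date(date(2010, 1, 1)))    # 01/01/2010
print(to_request_date(date(2010, 12, 31)))  # 12/31/2010
```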
Implementing a Solution
To implement a solution, we can start by making an initial GET request to fetch the HTML content and establish a session; the AJAX endpoint itself can be identified from the browser’s network inspector.
Next, we’ll create our POST requests to that endpoint with the required parameters and headers, making sure the date parameters are in the MM/DD/YYYY format the endpoint expects.
Here is a sample implementation of our solution:
import requests
from bs4 import BeautifulSoup

# The page we want to scrape and the AJAX endpoint that serves its table
page_url = 'https://id.investing.com/commodities/gold-historical-data'
ajax_url = 'https://id.investing.com/instruments/HistoricalDataAjax'

headers = {
    'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:80.0) Gecko/20100101 Firefox/80.0',
    'X-Requested-With': 'XMLHttpRequest',  # marks the requests as AJAX
}

# Use a session so cookies from the initial GET carry over to the POSTs
session = requests.Session()
session.get(page_url, headers=headers)

# Parameters identifying the gold futures instrument and its table
payload = {
    'curr_id': '8830',
    'smlID': '300004',
    'header': 'Data+Historis+Emas+Berjangka',
}

for year in range(2010, 2021):
    payload['st_date'] = f'01/01/{year}'    # MM/DD/YYYY
    payload['end_date'] = f'12/31/{year}'
    response = session.post(ajax_url, data=payload, headers=headers)
    soup = BeautifulSoup(response.text, 'lxml')
    table = soup.find('table')
    for row in table.find_all('tr')[1:]:    # skip the header row
        row_data = [item.text for item in row.find_all(['td', 'th'])]
        print(row_data)
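Rather than just printing each row, you will usually want to persist the results. A small sketch with the standard library csv module, using sample rows in place of the scraped row_data lists (swap the in-memory buffer for a real file handle in practice):

```python
import csv
import io

# Sample rows standing in for the row_data lists collected above.
rows = [
    ["Tanggal", "Terakhir", "Pembukaan"],
    ["08/12/2020", "1.847,10", "1.855,60"],
    ["07/12/2020", "1.866,00", "1.842,40"],
]

buffer = io.StringIO()  # e.g. open("gold.csv", "w", newline="") for a real file
writer = csv.writer(buffer)
writer.writerows(rows)

print(buffer.getvalue().splitlines()[0])  # Tanggal,Terakhir,Pembukaan
```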
Conclusion
In this article, we explored a web scraping problem involving a website with dynamic content. We delved into the technical aspects of the website’s technology stack and requests made to the server.
We implemented a solution using Python, with requests for HTTP and BeautifulSoup for parsing the HTML. Our code targets the site’s AJAX endpoint, sends POST requests with correctly formatted parameters and headers, and prints out the scraped rows.
With this knowledge, you can tackle similar web scraping problems involving JavaScript/AJAX interactions. Always remember to inspect your target website’s technology stack and requests to fully understand how to scrape its content correctly.
Last modified on 2024-12-09