Detecting when a target for scraping has been updated on GCP

January 15, 2024

One website I ping for a variety of data points doesn't have a regular update schedule, and because I'm only concerned with specific data on the page, I can't simply rely on detecting a change in the entire contents of the page. Where do I start?

Well, I know that I've been pinging the page every day at a specific time, saving the contents into a data frame, and then writing that to a CSV in GCS for the given date. Then, when pulling the data, if two or more days have the same data, I'll use SQL to select only the unique rows.
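If those daily CSVs are exposed to SQL through something like a BigQuery table, that dedup step could look roughly like the sketch below; the project, dataset, table, and column names are placeholders, not the actual setup:

from google.cloud import bigquery

client = bigquery.Client()

# Collapse identical rows captured on different days into one
query = """
    SELECT DISTINCT A, B, C, D
    FROM `my-project.my_dataset.daily_scrapes`
"""
unique_rows = client.query(query).to_dataframe()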

Now, in the long term I'd like to be more efficient and not save repetitive data: save only new and updated data, and possibly run the job multiple times a day to see how the data changes over time. The approach I've come up with is to:

  1. Get the most recent file from a given GCS bucket
  2. Load the contents of that file into a data frame
  3. Scrape the current data and load it into a data frame
  4. Reduce both data frames to the relevant columns, normalize the data by casting all values to strings, then run Pandas equals() to detect whether the data frames are equal
  5. If equal, do nothing. If different, save the new data with the runtime timestamp.

Getting the most recent file from GCS:

from datetime import datetime, timezone

from google.cloud import storage


def get_most_recent_file(bucket_name, subfolder_name):
    """Return the name of the most recently updated blob under a subfolder."""
    storage_client = storage.Client()
    bucket = storage_client.get_bucket(bucket_name)

    most_recent_file = None
    most_recent_date = datetime.min.replace(tzinfo=timezone.utc)

    # Make sure the prefix only matches objects inside the subfolder
    if not subfolder_name.endswith('/'):
        subfolder_name += '/'

    # Keep the blob with the newest last-updated timestamp
    for blob in bucket.list_blobs(prefix=subfolder_name):
        if blob.updated > most_recent_date:
            most_recent_date = blob.updated
            most_recent_file = blob.name

    return most_recent_file
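For reference, a call might look something like this (the bucket and subfolder names are placeholders):

most_recent = get_most_recent_file('my-scrape-bucket', 'data')
# e.g. 'data/2024-01-14_09-00-00.csv', or None if the subfolder is empty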

Next, to load the file into a data frame:

import io

import pandas as pd
from google.cloud import storage


def load_file_into_dataframe(bucket_name, file_name):
    """Download a CSV blob from GCS and read it into a pandas data frame."""
    storage_client = storage.Client()
    bucket = storage_client.get_bucket(bucket_name)
    blob = bucket.blob(file_name)

    # Download the blob into an in-memory byte stream
    byte_stream = io.BytesIO()
    blob.download_to_file(byte_stream)
    byte_stream.seek(0)

    return pd.read_csv(byte_stream)

The rest of the logic, including reducing the columns, casting data types, comparing the data frames, and uploading the new data to GCS, is something like:

# df holds the freshly scraped data, mostRecent is the file name returned by
# get_most_recent_file, bucket is the GCS bucket object, and currentTimestamp
# is the runtime timestamp string.

# Reduce to the columns worth comparing and normalize everything to strings
tempDf = df[['A','B','C','D']]
tempDf = tempDf.astype(str)

if mostRecent:
    # Load the previous snapshot and normalize it the same way
    mostRecentData = load_file_into_dataframe(bucket_name, mostRecent)
    mostRecentDataTemp = mostRecentData[['A','B','C','D']]
    mostRecentDataTemp = mostRecentDataTemp.astype(str)

    # Only write a new file when the data has actually changed
    if not mostRecentDataTemp.equals(tempDf):
        datasource = currentTimestamp + '.csv'
        df.to_csv(datasource, index=False)
        blob = bucket.blob('data/' + datasource)
        blob.upload_from_filename(datasource)
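For context, here is a rough sketch of how those pieces could be wired together. The URL, bucket name, and the use of pandas.read_html for the scrape are placeholders for illustration, not the actual site or parsing logic:

from datetime import datetime, timezone

import pandas as pd
from google.cloud import storage

bucket_name = 'my-scrape-bucket'       # placeholder bucket name
url = 'https://example.com/data-page'  # placeholder target URL

# Scrape the current data; read_html stands in for whatever parsing the page needs
df = pd.read_html(url)[0]

# Timestamp used to name the file if the data turns out to be new
currentTimestamp = datetime.now(timezone.utc).strftime('%Y-%m-%d_%H-%M-%S')

# Most recent snapshot already sitting in the bucket's data/ subfolder
mostRecent = get_most_recent_file(bucket_name, 'data')

storage_client = storage.Client()
bucket = storage_client.get_bucket(bucket_name)

# ...followed by the column reduction, comparison, and conditional upload above.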
