Site migrations are often a necessary but complex undertaking. One of the most challenging aspects is ensuring that all important pages from the old site are properly redirected to relevant pages on the new site.
I’ve developed a Python script for Google Colab that makes this process much easier. The notebook works with crawl data from tools like Screaming Frog or Sitebulb: you upload the origin and destination crawl files, and it automates URL redirect mapping by finding the closest content match between pages on the origin and destination sites, based on the columns you select from your crawl data.
Here is the Automate URL Redirects For Site Migrations Google Colab notebook: https://colab.research.google.com/drive/1Y4msGtQf44IRzCotz8KMy0oawwZ2yIbT
How the Python Script Automates URL Matching
Under the hood, the script leverages several powerful Python libraries:
- pandas: For data manipulation and analysis
- sentence-transformers: For generating embeddings (numerical representations) of the page text content
- faiss: For efficient similarity search between the page embeddings
Faiss (Facebook AI Similarity Search) is a library developed by Facebook Research for quickly searching and clustering dense vectors. It can find the nearest neighbours (most similar vectors) to a given query vector, even in datasets with millions of vectors. This makes it perfect for our use case of finding the closest content matches across large sites.
The script follows these key steps:
- Vectorizes the text content of each origin and destination page into a high-dimensional space using a pre-trained language model (sentence-transformers). This captures the semantic meaning of the content.
- Builds a faiss index from the destination page vectors to enable fast similarity searches.
- For each origin page vector, searches the faiss index to find the closest destination page vector.
- Outputs the mapping between origin and destination URLs and the similarity score.
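As a rough illustration, here is a minimal sketch of that pipeline. The model name and sample texts are placeholders rather than the notebook’s exact defaults:

```python
import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

# Placeholder content; in the notebook this comes from your crawl CSVs.
origin_texts = ["Red running shoes | Example Store", "Contact us"]
destination_texts = ["Running shoes in red | Example Store", "Get in touch"]

# Any sentence-transformers model works here; this one is a common default.
model = SentenceTransformer("all-MiniLM-L6-v2")

# L2-normalised embeddings make inner product equivalent to cosine similarity.
origin_vecs = model.encode(origin_texts, normalize_embeddings=True)
dest_vecs = model.encode(destination_texts, normalize_embeddings=True)

# Build a flat inner-product index over the destination vectors.
index = faiss.IndexFlatIP(int(dest_vecs.shape[1]))
index.add(np.asarray(dest_vecs, dtype="float32"))

# For each origin vector, retrieve the single closest destination vector.
scores, ids = index.search(np.asarray(origin_vecs, dtype="float32"), 1)
for origin, score, j in zip(origin_texts, scores[:, 0], ids[:, 0]):
    print(f"{origin} -> {destination_texts[j]} (similarity {score:.2f})")
```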
How to Use The Script to Automate Redirect Matching
If you have no development experience, the Google Colab notebook may look complicated. But running the script and successfully matching URLs doesn’t require any development experience. If you run into any issues, send me a message on LinkedIn and I’ll be glad to help you out.
- Prepare your origin and destination URL and content data in two CSV files. Use Screaming Frog (as described later) or another method to gather this data. The CSVs should have a column for the URL plus any content fields you want to include in the matching (e.g., title, H1).
- Ensure the following:
- The column containing the URLs is named “Address”.
- Do a find and replace to remove the root domain so that the script matches only relative URLs (one way to do this with pandas is sketched after this list).
- The columns in the origin and destination files are the same and appear in the same order.
- Upload the origin and destination CSVs to your Google Drive.
- Open the Colab notebook containing the redirect matching script.
- Select “Run All” or run the cells one by one. The first cell will take a little while as the required packages are being installed.
- Upload your origin.csv and destination.csv files.
- Once they are processed, you will be able to select which columns to include in the content matching by adjusting the ‘selected_columns’ list (an example follows this list). Run this cell.
- Run the next code cell to perform the content matching. The script will:
- Combine the selected columns into a single ‘content’ column for each dataframe
- Generate embeddings for each page’s combined content using sentence-transformers
- Build a faiss index from the destination embeddings
- For each origin embedding, find the closest destination embedding using the faiss index
- Calculate the cosine similarity between the matched embeddings
- Output a dataframe with the origin URL, matched destination URL, and similarity score (assembling this dataframe is sketched after this list)
- Review the output dataframe to see the automatically generated redirect mapping. Spot-check a sample of the mappings, especially for key pages (a quick way to surface the weakest matches is sketched after this list).
- If needed, adjust the selected columns and re-run the script to refine the results.
- Export the final redirect mapping to a CSV for implementation.
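If you’d like to see these steps in code, here are a few minimal sketches. First, one way to do the root-domain find and replace with pandas; the file names and example.com domain are placeholders:

```python
import pandas as pd

# "example.com" is a placeholder; swap in your own root domain.
for path in ("origin.csv", "destination.csv"):
    df = pd.read_csv(path)
    df["Address"] = df["Address"].str.replace(
        r"^https?://(www\.)?example\.com", "", regex=True
    )
    df.to_csv(path, index=False)
```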
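Next, column selection and the combined ‘content’ column, continuing from the previous sketch (pandas already imported, crawl files loaded as origin_df and destination_df). The column names are examples; use the headers from your own crawl export:

```python
# Example headers; use the ones from your own crawl export.
selected_columns = ["Title 1", "H1-1", "Meta Description 1"]

# Combine the selected columns into a single text field per page.
for df in (origin_df, destination_df):
    df["content"] = (
        df[selected_columns].fillna("").astype(str).agg(" ".join, axis=1)
    )
```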
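After the faiss search (see the sketch earlier in the post), the mapping can be assembled from the returned scores and indices; with normalised embeddings and an inner-product index, the scores are the cosine similarities:

```python
# scores and ids come from index.search(); each row is one origin page.
matches_df = pd.DataFrame({
    "origin_url": origin_df["Address"].values,
    "matched_url": destination_df["Address"].iloc[ids[:, 0]].values,
    "similarity_score": scores[:, 0],
})
```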
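Finally, for the review and export steps, sorting by the lowest similarity surfaces the mappings most worth a manual check (the output file name is a placeholder):

```python
# Weakest matches first; these deserve the closest manual review.
print(matches_df.sort_values("similarity_score").head(20))

# Export the final mapping for implementation.
matches_df.to_csv("redirect_mapping.csv", index=False)
```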
Gathering Data with Screaming Frog and Sitebulb
While you can compile the input CSVs manually, I recommend using Screaming Frog to automate the crawling and data extraction process.
Using Screaming Frog Custom Extraction
With its Custom Extraction feature, you can easily pull in not just standard elements like titles and headings, but also any other text on the page using CSS selectors. For example, you could extract all text within the <body> tag, specific <p> tags, or text within elements with certain classes or IDs. This flexibility allows you to include product descriptions, SKUs, or any other content in the matching process.
Tips for using Screaming Frog and Sitebulb to gather crawl data:
- Use Regex or XPath in Custom Extraction to pull in body classes/IDs and elements like SKU numbers or product details.
- Filter out non-200 response URLs.
- Split your crawl data into segments. For example, match the product pages first, then the service pages, and so on.
- Use the Ahrefs integration when crawling to extract the top keywords and include them in the matching.
With a bit of creativity, the possibilities are nearly limitless. You could even use Custom Extraction to pull page content from the HTML source of your old site (if it’s no longer live) by uploading the files to Screaming Frog via list mode.
Caveats When Automating Redirects
While this automated matching can be a huge time saver, it’s not a complete substitute for manual review. I always recommend spot-checking a sample of the mapped redirects, especially for the most important pages. The script also works best when the origin and destination sites have similar content and structure.
Your Feedback is Appreciated
If you have any other tips or ideas to share, let me know in the comments! And if you’re planning a site migration but unsure where to start, feel free to get in touch.
Since taking on feedback from the script’s first iteration, I’ve been excited to see it featured on the Search With Candour podcast, Search Engine Land and across LinkedIn. Thanks to everyone who provided feedback. Happy migrating!