Say you're tasked with analyzing a website's performance and you need to find all the files a web page must download in order to load properly. In this tutorial, I will help you accomplish that by building a Python tool that extracts all the script and CSS file links from a given web page.
We will be using requests to fetch the page and BeautifulSoup as an HTML parser. If you don't have them installed in your Python environment, install them with pip (pip install requests beautifulsoup4).
Let's start off by initializing the HTTP session and setting the User-Agent to that of a regular browser rather than a Python bot.
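The session setup can be sketched as follows. The URL and the exact User-Agent string are placeholders chosen for illustration, not values from the original tutorial:

```python
import requests

# Hypothetical target page; replace it with the site you want to analyze.
url = "https://example.com"

# A Session reuses connections and sends the same headers on every request.
session = requests.Session()

# Assumed browser-like User-Agent string; any current browser value would do.
session.headers["User-Agent"] = (
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/44.0.2403.157 Safari/537.36"
)
```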
Now, to download the HTML content of that web page, all we need to do is call the session.get() method, which returns a response object. We are interested only in the HTML code, not the entire response.
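A minimal sketch of this step is shown below. The helper name get_soup is made up for this example, and the raise_for_status() call is an extra safety check that the text above does not mention:

```python
from bs4 import BeautifulSoup


def get_soup(session, url):
    """Download a page and parse its HTML (hypothetical helper, no retries)."""
    response = session.get(url)
    # Extra check (an assumption): fail on 4xx/5xx instead of parsing an error page.
    response.raise_for_status()
    # response.content is the raw body; the rest of the response is ignored.
    return BeautifulSoup(response.content, "html.parser")
```

With a requests session and a URL at hand, calling get_soup(session, url) would return the parsed document that the following steps work on.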
Now that we have our soup, let's extract all script and CSS files. We use the soup.find_all() method, which returns all the HTML soup objects that match the tag and attributes passed.
Basically, we are searching for script tags that have the src attribute, which usually links to the JavaScript files required by the website.
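Here is one way the script extraction might look. To keep the example runnable offline, it parses a small inline HTML sample instead of a live page; the sample markup, the base URL, and the function name extract_script_urls are all illustrative assumptions:

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup


def extract_script_urls(soup, base_url):
    """Collect absolute URLs of external scripts (hypothetical helper)."""
    script_urls = []
    for script in soup.find_all("script"):
        src = script.get("src")
        # Inline scripts have no src attribute, so they are skipped.
        if src:
            script_urls.append(urljoin(base_url, src))
    return script_urls


# Made-up sample page used only for demonstration.
sample_html = """
<html><head>
<script src="/js/app.js"></script>
<script>console.log("inline");</script>
<script src="https://cdn.example.com/lib.js"></script>
</head></html>
"""
soup = BeautifulSoup(sample_html, "html.parser")
script_files = extract_script_urls(soup, "https://example.com/index.html")
print(script_files)
```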
Similarly, we can use it to extract CSS files.
As you may know, CSS files are referenced by the href attribute of link tags. We use the urljoin() function to make sure each link is absolute (i.e., a full URL rather than a relative path such as /js/script.js).
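The CSS step can be sketched the same way. Note one deliberate addition that the text does not describe: this version keeps only link tags whose rel attribute contains "stylesheet", so icons and other linked resources are left out. The sample HTML and the name extract_css_urls are again assumptions made for the example:

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup


def extract_css_urls(soup, base_url):
    """Collect absolute URLs of linked stylesheets (hypothetical helper)."""
    css_urls = []
    for link in soup.find_all("link"):
        href = link.get("href")
        # BeautifulSoup returns rel as a list of tokens, e.g. ["stylesheet"].
        rel = [token.lower() for token in (link.get("rel") or [])]
        # Assumption: filter on rel="stylesheet" to drop icons and similar links.
        if href and "stylesheet" in rel:
            # urljoin resolves relative paths against the page URL.
            css_urls.append(urljoin(base_url, href))
    return css_urls


# Made-up sample page used only for demonstration.
sample_html = """
<html><head>
<link rel="stylesheet" href="css/main.css">
<link rel="icon" href="/favicon.ico">
<link rel="stylesheet" href="https://cdn.example.com/theme.css">
</head></html>
"""
soup = BeautifulSoup(sample_html, "html.parser")
css_files = extract_css_urls(soup, "https://example.com/blog/index.html")
print(css_files)
```

The relative path css/main.css is resolved against the page's own directory, while the already absolute CDN link passes through urljoin() unchanged.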
Finally, let's print the total number of script and CSS files and write the links into separate files.
Once you execute it, two files will appear, one for the JavaScript links and the other for the CSS links:
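A possible shape for this last step is sketched below. The helper names save_links and report, and the out_dir parameter, are invented for this example; the output file names match the ones used in this tutorial:

```python
import os


def save_links(links, path):
    """Write one link per line to a text file."""
    with open(path, "w", encoding="utf-8") as f:
        for link in links:
            f.write(link + "\n")


def report(script_files, css_files, out_dir="."):
    """Print the totals and store both lists in separate files."""
    print("Total script files in the page:", len(script_files))
    print("Total CSS files in the page:", len(css_files))
    save_links(script_files, os.path.join(out_dir, "javascript_files.txt"))
    save_links(css_files, os.path.join(out_dir, "css_files.txt"))
```

Calling report(script_files, css_files) with the two lists collected earlier would write both files into the current directory.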
css_files.txt
javascript_files.txt
Alright, in the end, I encourage you to extend this code into a more sophisticated audit tool that can identify the different files and their sizes, and maybe suggest ways to optimize the website!
As a challenge, try downloading all these files and storing them on your local disk (this tutorial can help).
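As a starting point for the size check, one option (not part of the original code) is to send a HEAD request for each file and read the Content-Length header. The function name remote_size is hypothetical, and the approach assumes the server reports that header; many servers omit it, or report the compressed size:

```python
def remote_size(session, file_url):
    """Return the size in bytes reported by the server, or None if unknown.

    `session` is expected to be a requests.Session (or a compatible object).
    """
    # HEAD fetches only the headers, so the file itself is not downloaded.
    response = session.head(file_url, allow_redirects=True)
    length = response.headers.get("Content-Length")
    # The header may be missing or malformed, so fall back to None.
    if length is not None and length.isdigit():
        return int(length)
    return None
```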
I have another tutorial that shows how you can extract all website links.
Furthermore, if the website you're analyzing ends up banning your IP address, you will need to use a proxy server.
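With requests, routing a session through a proxy could look roughly like this. The proxy address below is a placeholder from a documentation-only IP range, so substitute a proxy you actually have access to:

```python
import requests

session = requests.Session()

# Hypothetical proxy address (placeholder); replace it with a real proxy.
proxy = "http://203.0.113.10:8080"

# Every request made through this session now goes through the proxy.
session.proxies.update({"http": proxy, "https": proxy})
```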
Related: How to Automate Login using Selenium in Python.
Happy Scraping ♥