For my first blog post, I wanted to talk about something I’m really excited about. Throughout my coursework in my IT degree program at GSC, I’ve picked up quite a bit of C++ and C# (though I need to shake the rust off my C#…). Before starting this degree, however, I had learned quite a bit of Python 3 on my own. Of course, when I started school again I was so busy I abandoned my Python practice and stuck to the languages that were part of the curriculum. Well, this past Spring I completed my final programming course, and I felt like I could finally get back into learning programming languages that weren’t directly tied to my degree program.
After almost a three year break, I decided to pick up Python again… and wow did I hate it. It is so different in structure from the C based languages I had been working with. The idea of writing functions and loops with nothing to separate them from the body of the code aside from a few indents — really, no braces?? The whole experience really threw me for a while loop (I’ll see myself out).
In all seriousness, a few hours and a very simple dice rolling RNG program were all I needed to feel back at home again writing Python. It’s amazing how foreign an old language can look at first, and with a little practice feel totally natural again. With the rust loose, I decided I wanted a real challenge, so I set out to write a security tool in Python 3.
Having already written a simple port scanner and a simple encryption tool in C++ (I’m sure I’ll blog about those at some point too), I decided to write a simple webpage scrubbing tool. My goal for this project was to write a program that would accept a URL and return that webpage’s source code, formatted in a way that was easy to read, with features that would allow the user to search through the source for key words and display the line of code that key word appeared in. I also wanted the program to save the source code automatically to a .html file to a specified location within the working directory.
The resulting program was WebScrubr v1. I spent two to three long afternoons getting this first version to where I wanted it. It didn’t take me more than a couple hours to get the program to pull and display source code, but like with every other program that I’ve written, it’s the refining and attention to detail that takes the longest. I had the basic functions working properly, but I wanted it to be user friendly and not just shut down after it had run its course. Plus, when I would display the source in the command-line… let’s just say it was less than readable. So that was my first task: make the html easy to read. I tried parsing through the file and setting newlines at various markers. I tried setting a character limit for each line. I tried combinations of those two ideas and then scrapped it. I couldn’t quite figure out how to make it look nice. The attempts weren’t without progress though; I had managed to get source to print out in separate lines (rather than one ridiculously long line). This made searching through and displaying lines much easier, which is the feature I added next. Still though, it didn’t look right. I wanted the source code to look like it does when you view a page’s source directly from the web. To finally solve this, I decided to implement @wention‘s BeautifulSoup4 repository. This is really an incredible tool, and it allowed me to used one function, prettify(), to get everything looking exactly the way I wanted.
So now I had a program that would take a URL, spit out a local .html file that looked pretty, and could search for key words and spit out corresponding lines from the source code. Not I just had to polish the rough edges. I wrote a menu system, and for the final touch I added @pwaller‘s Pyfiglet to spice up the opening banner. That evening (either the third or fourth night working on this), after running it a couple dozen more times to look for bugs, I finally posted it to GitHub.
Of course, other than the very simple dice rolling program, this was my way of getting back into writing Python. As much as I hated the structure at first, by the end of this project I was back to loving the simplicity of Python. As of the time of posting this, it was a little over a month ago that I originally published WebScrubr v1. Looking back at it, I would really love to make this more command friendly, allowing the URL/IP to be passed as an argument when calling the program in Power Shell/Terminal (otherwise that shebang really isn’t living up to its full potential). I’d like to work on my structure too, I think some classes are in order here, and I think the README could use a little work. All in all though, I’m really happy with how the first version of this program turned out. If you’re interested, please check out my GitHub profile, you’ll see WebScrubr and a few of the other programs I mentioned in this post.
Again, thanks for stopping by!
-Kyle