I needed to parse the html of each report into a format that is not crazy html.
Once parsed, the data needed to be entered into a database.
For the database, I selected the NoSQL database MongoDB. A discourse on the differences between NoSQL and SQL databases is beyond the scope of this post, but this Stack Overflow answer provides useful links comparing the NoSQL databases MongoDB and CouchDB, and it also gives a succinct summary of why I chose a NoSQL database over a SQL one.
Namely, choose a NoSQL database “for most things that you would do with MySQL or PostgreSQL, but having predefined columns really holds you back.”
That’s me to a “T.” I need basic SQL functionality but I can’t use predefined columns because each accident report is a distinct document with an unknown structure.
One document may have only a single vehicle (unit) involved in the crash whereas the next document may have ten units. One document could have a section for “Property Damage,” or even multiple sections for damaged property, while most vehicle crashes result in no property damage at all.
So, I needed a database that was flexible and could expand and contract with each report.
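To make the "expand and contract" idea concrete, here is a minimal Ruby sketch of two reports with different shapes. The field names (`case_number`, `units`, `property_damage`) and all values are placeholders I made up, not the real report schema, and plain hashes stand in for database documents so the example runs without a database.

```ruby
require 'json'

# Two hypothetical crash reports. Field names and values are placeholders.
single_unit_report = {
  case_number: 'EXAMPLE-001',
  units: [{ unit_number: 1 }]
}

multi_unit_report = {
  case_number: 'EXAMPLE-002',
  units: (1..10).map { |n| { unit_number: n } },
  property_damage: [{ description: 'placeholder' }]
}

reports = [single_unit_report, multi_unit_report]

# Summarize a report without assuming that every section exists.
def summarize(report)
  {
    case_number: report[:case_number],
    unit_count: report.fetch(:units, []).length,
    damage_count: report.fetch(:property_damage, []).length
  }
end

summaries = reports.map { |report| summarize(report) }
puts JSON.pretty_generate(summaries)
```

Neither report needs a column it doesn't use, which is the whole appeal of a document store for this kind of data.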
Having selected MongoDB, I next needed to install it.
This process was absolutely painless thanks to the Mac OS X package manager, Homebrew.
If you develop on Mac OS and don’t use Homebrew, you’re making your life needlessly difficult.
I thought these were dumb when I was growing up, but I thought of them again when I re-re-began to work on parsing the reports from http://accidentreports.iowa.gov.
I had never made much progress because I had never really wanted to make much progress, i.e. I had not decided, definitively, to “do or do not”.
So, I got up early last Saturday, brewed some coffee and started hacking.
Although the actual Iowa Accident Reports are difficult to parse, fetching the reports themselves is pretty straightforward.
I figured I’d start there. Fetch all the reports, store them locally and work from that base.
The URL for each report breaks down to a base URL and URL id number.
Getting all the report pages is simply a matter of cycling through all the id numbers and then downloading the corresponding report.
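As a rough sketch of that idea in Ruby: the path and query parameter in `BASE_URL` are placeholders, since the real URL layout isn't spelled out here, and `report_url` and `report_urls` are names I've made up for illustration.

```ruby
# Placeholder URL layout: swap in the real report path and parameter.
BASE_URL = 'http://accidentreports.iowa.gov/placeholder-path?id='

# Build the URL for a single report id.
def report_url(url_id)
  "#{BASE_URL}#{url_id}"
end

# Lazily build URLs for a range of ids so nothing is materialized up front.
def report_urls(first_id, last_id)
  (first_id..last_id).lazy.map { |url_id| report_url(url_id) }
end

report_urls(0, 2).each { |url| puts url }
```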
Easy right?
Not so fast.
For some reason, reports don’t actually start appearing until the id number 29734. Consequently, the reports only stretch back to July 12, 2005.
Now, this situation raises our first data journalism questions. Are there more reports? What happened to the reports prior to July 12, 2005? Can I get those reports? Why aren’t those reports online?
We might want to contact the government offices now and ask them these very questions.
But, having worked with this data previously, I know that I’ll have to contact the Iowa State Patrol at some point anyway because of a problem with the geographic data associated with each report. So, I’ve made note of these questions and put them in a safe, fire-proof place for a later date.
To deal with those empty reports, I decided to check whether the Law Enforcement Case Number field is empty.
If the check returns false, a Law Enforcement Case Number is present and I’ll download the file; if it returns true, the field is empty, there’s no report, and we move on to the next URL id number.
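The check described above could be sketched roughly like this. It is not the original snippet: the markup it assumes (the label in one table cell, the value in the next) is a guess, and `case_number` and `empty_report?` are hypothetical helper names.

```ruby
# Assumed markup: the label "Law Enforcement Case Number" sits in one table
# cell and its value in the next. The real report HTML may well differ.
CASE_NUMBER_PATTERN =
  %r{Law Enforcement Case Number:?\s*</td>\s*<td[^>]*>(.*?)</td>}mi

# Return the case number text, or an empty string when none is found.
def case_number(html)
  match = html.match(CASE_NUMBER_PATTERN)
  match ? match[1].gsub('&nbsp;', ' ').strip : ''
end

# True when the report has no case number, i.e. there is nothing to save.
def empty_report?(html)
  case_number(html).empty?
end

filled = '<tr><td>Law Enforcement Case Number:</td><td>EXAMPLE-123</td></tr>'
blank  = '<tr><td>Law Enforcement Case Number:</td><td>&nbsp;</td></tr>'

puts empty_report?(filled)
puts empty_report?(blank)
```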
While I might be able to reasonably assume that all URL id numbers prior to 29734 are empty (I spot checked quite a few), it’s easy for me to have the script start at zero and run through all possible crash reports, so that’s what I’ll do.
Eventually, I’d like to automate this script to start and stop on its own and to run every two weeks to check for new reports, but that’s for later. For now, I hard coded the script to start at 0 and stop at 63304, which, as of this writing, is the most recent full accident report.
The script is straightforward. A fetch_page method grabs the page. A save_page method checks for the presence of the Law Enforcement Case Number and, if it is present, calls write_to_raw_file, passing in the HTML of the fetched page. Finally, a loop_through_pages method cycles through the url_id numbers and calls save_page for each one.
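Here is a sketch of how those four methods might fit together. The method names come from the description above; everything else is an assumption, including the `ReportFetcher` class, the placeholder `BASE_URL`, the table-cell markup used for the case number check, and the injectable `fetcher`, which lets the example run against stub pages instead of the live site.

```ruby
require 'net/http'
require 'uri'
require 'fileutils'
require 'tmpdir'

class ReportFetcher
  # Placeholder: the real report URL layout is an assumption here.
  BASE_URL = 'http://accidentreports.iowa.gov/placeholder-path?id='
  # Assumed markup: label in one table cell, value in the next.
  CASE_NUMBER_PATTERN =
    %r{Law Enforcement Case Number:?\s*</td>\s*<td[^>]*>(.*?)</td>}mi

  # fetcher is any callable that takes a URL and returns HTML. It defaults
  # to a plain HTTP GET but can be replaced with a stub.
  def initialize(output_dir:, fetcher: nil)
    @output_dir = output_dir
    @fetcher = fetcher || ->(url) { Net::HTTP.get(URI(url)) }
  end

  def fetch_page(url_id)
    @fetcher.call("#{BASE_URL}#{url_id}")
  end

  def case_number_present?(html)
    match = html.match(CASE_NUMBER_PATTERN)
    !match.nil? && !match[1].gsub('&nbsp;', ' ').strip.empty?
  end

  def write_to_raw_file(url_id, html)
    FileUtils.mkdir_p(@output_dir)
    File.write(File.join(@output_dir, "#{url_id}.html"), html)
  end

  # Save the page only when it carries a case number; report what happened.
  def save_page(url_id)
    html = fetch_page(url_id)
    return false unless case_number_present?(html)

    write_to_raw_file(url_id, html)
    true
  end

  # Walk the id range and return how many pages were saved.
  def loop_through_pages(first_id, last_id)
    (first_id..last_id).count { |url_id| save_page(url_id) }
  end
end

# Stub pages keyed by id, so the demo never touches the network.
stub_pages = {
  0 => '<tr><td>Law Enforcement Case Number:</td><td>&nbsp;</td></tr>',
  1 => '<tr><td>Law Enforcement Case Number:</td><td>EXAMPLE-123</td></tr>'
}
stub_fetcher = ->(url) { stub_pages.fetch(url[/\d+\z/].to_i) }

Dir.mktmpdir do |dir|
  fetcher = ReportFetcher.new(output_dir: dir, fetcher: stub_fetcher)
  saved = fetcher.loop_through_pages(0, 1)
  puts "saved #{saved} of 2 pages"
end
```

Against the real site you would drop the stub, and it would be polite to sleep briefly between requests.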
I ran the script and it took a while, ~10 hours, but it worked like a charm.
Next on my list is to get a list of the Law Enforcement Case Numbers and associated X and Y coordinate values.
My previous experience with this data taught me that the geo information is cut off for each of the reports. I contacted the government officials about this problem in 2011 and they said they’d be happy to update the geo information if I provided them a list of case numbers and X and Y values.
That’s pretty helpful and I wish I had been in a position to jump on that opportunity when it was presented. (Alas…le Sigh…)
They might not be so helpful this time. We’ll see.
I had just started my first day as a data journalist at the venerable Des Moines Register, and my new boss, the equally venerable @jameswilkerson, and I were chatting about web scraping when he mentioned the Iowa State Patrol’s Crash site. James said that he and my predecessor at the Register, the also venerable @mikejcorey, had kicked the tires on the idea of scraping the site but had deemed it impractical/undoable/just not worth it.
Being the cocky young buck that I was two years ago and wanting to make a good impression, I said “It shouldn’t be too hard to scrape that.” James looked at me quizzically (or like I was nuts) and showed me the HTML.
Gross for sure. But not impossible. But gross. And hard.
But, worth it.
So, I took a spin at processing the html, but I didn’t get far/never really tried and was soon distracted by my new job, vacations, college football…..etc.
Then, layoffs struck the Des Moines Register and folks were scattered to the winds.
I never forgot about those Iowa State Patrol Crash reports, though, and even mentioned them again to @mikejcorey at NICAR 2012. He gave me @jameswilkerson’s “You’re nuts” look, but agreed that that data would be great to get and mentioned some questions/analysis he’d like to run on the data.
So, I dusted off that Ruby script and set to work processing that cringe-worthy HTML. I made good progress but was distracted by my lack of a certain goal. I wasn’t sure what I wanted to do with the accident report data once it was processed, so I again set the script aside.
This time that Ruby script might have remained forever dusty and forgotten if not for a course of events set in motion by this talk by Ben Welsh (@palewire), a database producer at the LA Times, at NICAR 2013.
Ben’s talk was great and made an impression that I mentioned to Ben’s LA Times colleague, Ken Schwencke (@schwanksta), in the lobby of our hotel on the final day of the conference.
Ken told me that Ben had done the analysis for that project in Django. Ken didn’t expound on that point, as it didn’t really need any explanation. I knew exactly what he was talking about.
I had never thought about doing the analysis for a story or project within a Web framework, be it Django or Rails, but the obviousness of this idea and the fact that I didn’t do it made me feel like I’d been riding a bicycle with triangle wheels for the last few years.
A few weeks after NICAR 2013, this article on How The Data Sausage Gets Made by Jacob Harris (@harrisj) of the New York Times and a subsequent conversation with Troy Thibodeaux (@tthibo) of the Associated Press reinforced the idea of doing all the data work for a project within one’s framework of choice.
At this point, I was obsessed and walking around like Howard Hughes muttering “The way of the future…”. And, I knew my Iowa Crash Site reports project was a perfect test case with which to experiment. This project would have it all: scraping, a FOIA (or at least some back and forth with government officials), mapping, graphs and even a NoSQL database.
Perfect. (Plus, the college football season was far enough off that I could finish before the first kickoff….)
I deployed my first official Rails application this weekend. (By official, I mean it actually serves a purpose in its existence.)
ReVeraFilms.com started out as a static site for a production company based in Los Angeles, and I had no real desire to take it any further once I had finished it.
After I had finished Michael Hartl’s Rails Tutorial book, I began thinking about a good first project. I’ve worked extensively with data and databases before and even wrote some models and a Rake import for an aborted application in 2010. So, I felt confident with my backend skills in Rails, confident enough, at least, to get started without too much fear.
Where I knew I needed more practice in Rails was in the front-end matter, i.e. the asset pipeline, partials, layouts, views and the relationship of all and sundry. I was tired of nibbling around the edges of Rails, though, so I was reluctant to start another “tutorial/exercise” type project.
Enter ReVeraFilms.com.
I figured I’d just convert a static html site to Rails and that would give me some practice on the front-end portion of Rails.
It turned out to be a great idea and went pretty smoothly. (I even wrote a few tests.)
The only real problem was a niggling error related to the Asset Pipeline. The application.js file already requires jQuery, and I mistakenly loaded the library a second time along with the js files for the AnythingSlider. So, I kept getting a mysterious error and the AnythingSlider didn’t work at all.
These sorts of errors are exactly what makes learning a framework so difficult and exactly why I wanted to (and am glad I did) start small and manageable.
What wasn’t a breeze was getting the domain name of reverafilms.com to point to the Heroku app. Bluehost doesn’t allow configuration of CNAME records unless the person also hosts his or her site on Bluehost. So, first I had to transfer the domain name registration to GoDaddy.
Then, I had to muck around in the guts of my Rackspace account to get the proper settings, change them in the GoDaddy DNS zone editor, then wait an hour or so for the changes to take effect to see if I had it right. If not, I had to return to the GoDaddy zone editor and try it again. (I now have no problem admitting that I have a lot to learn when it comes to DNS blah, blah, blah.)
Late Sunday afternoon, my apparent triumph was cut short when an email I had sent to the owner of the site (personx@reverafilms.com) was returned as undeliverable. I thought, hoped, and prayed that maybe I just needed to point the right record in GoDaddy to the MX server on Rackspace. Thankfully, I was right and there was only a little downtime for the email addresses associated with the site.
All in all, it was a great exercise. The lesson: never underestimate the value of fully implementing a seemingly trivial version of a project. You can always add refinements, refactor, build out, etc. But, taking a trivial version of a project from 0 to fully implemented forces you to deal with that last, most difficult 10 percent of a project, but on a more digestible scale.
I’ve been messing around with d3 as much as possible lately and I’m impressed. I’ve a backlog of posts about this great library but I thought I’d start off small and give a shout out to Scott Murray (@alignedleft) and his great tutorials.
Scott also wrote a book and I’m working my way through the electronic version. But, if you’re interested in d3, just start with Scott’s tutorials. d3 is not necessarily easy, especially if you’re pretty new to JavaScript, but it allows you to set clearly defined goals for projects to move your skills forward. Everyone can find some real numbers to make a simple graph.
My first real project was to graph the Nebraska football wins over the history of the program. (I got this data from Wikipedia.)
Scott’s tutorials gave me enough knowledge to work through this example project and helped clear up some confusion I had regarding scales.
I plan on expanding this example Husker graphic to include all Big Ten teams and to add some interactivity as well.
Also, if you’re just starting out, don’t worry so much about “data” just yet. After working through most of another d3 book, Mike Dewar’s Getting Started With d3, I came to realize that the proverbial “rest of the iceberg” with d3 is the data parsing. You can make spectacular visuals with d3, as long as you can parse your data into the format that you want. Enter the scripting language of your choice….
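As one example of that parsing step, here is a small Ruby sketch that turns comma-separated text into the array-of-objects JSON that d3 examples tend to consume. The rows are made-up placeholder values, not real win totals, and `parse_rows` and `to_d3_points` are names I've invented for illustration.

```ruby
require 'json'

# Placeholder rows, not real win totals. Season 3 is missing its value.
raw = <<~TEXT
  season,wins
  1,5
  2,8
  3,
  4,10
TEXT

# Split simple comma-separated text into an array of string-keyed hashes.
def parse_rows(text)
  header, *lines = text.lines.map(&:strip).reject(&:empty?)
  keys = header.split(',')
  lines.map { |line| keys.zip(line.split(',', -1)).to_h }
end

# Drop rows with no wins value and convert the rest to integers.
def to_d3_points(rows)
  rows.reject { |row| row['wins'].to_s.empty? }
      .map { |row| { season: Integer(row['season']), wins: Integer(row['wins']) } }
end

points = to_d3_points(parse_rows(raw))
puts JSON.generate(points)
```

The naive `split(',')` only works for data without quoted commas; anything messier calls for a proper CSV parser.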
Nevertheless, for now, just check out Scott’s tutorials. They really are fantastic. As for data, follow Scott’s lead and hard code your values in an array. (Always start small, I say.)
Regarding Scott’s book, I can attest that it’s on the same level as the tutorials. So, if you dig the tutorials, you’ll dig the book too.
Also, if you have any questions on the Husker graphic above, hit me up in the comments.
Late last year, I wrote a script for a reporter using Facebook’s Graph API. The reporter was working on a much larger health-care related project and he was wondering whether he could use the now defunct site Openbook to find folks who had had adverse experiences with certain surgical procedures and/or medical conditions.
I said sure, he could use Openbook, but because that site is built on top of Facebook’s own platform, he’d be better off just using the Facebook data directly.
Could I write this script for him, he asked.
Sure, I said.
It didn’t take long to write the script, and the only real difficulty came with what to use as a delimiter. Unfiltered Facebook status updates are a burbling cesspool of every sort of character imaginable, and the first versions of the script yielded garbled Facebook statuses. Sanitizing the raw statuses solved this problem and I set the script to run daily and let it run a few weeks. After I had ~1000 lines of data, I converted the output to an Excel spreadsheet and sent the data to the reporter. We looked at the spreadsheet together, talked about removing duplicates and filtering the data to get what he wanted, then went back to our lives.
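The exact sanitizing rules aren't described above, so treat the following as one plausible Ruby sketch rather than what the script actually did. It assumes a tab delimiter and strips anything from a status that could break a one-record-per-line file; `sanitize_status` and `to_line` are hypothetical names.

```ruby
DELIMITER = "\t" # assumed delimiter

# Make a status safe for a one-record-per-line, tab-delimited file.
def sanitize_status(text)
  text.to_s
      .scrub('')                       # drop invalid byte sequences
      .gsub(/[[:cntrl:]\s]+/, ' ')     # tabs, newlines, control chars, runs of spaces
      .strip
end

# Join sanitized fields into a single delimited line.
def to_line(fields)
  fields.map { |field| sanitize_status(field) }.join(DELIMITER)
end

raw_status = "Line one\nline two\twith a tab   and spaces\u0000"
puts to_line(['placeholder-id', raw_status])
```

Because every tab and newline inside a field is replaced before joining, the delimiter can only ever appear between fields.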
A couple weeks later I found out he had reached out to a number of individuals via Facebook, heard back from probably half and ended up interviewing a few of them.
I forgot about this script until yesterday when I was cleaning out my server. Those ~1000 lines of data had grown into 209,520, an impressive number of Facebook statuses, even if only 25 percent are distinct statuses.
The effort I put into gathering these 200,000+ sources was minimal, probably an hour or so, at most, and the effort the reporter put into getting in touch with each individual was minimal as well. He said he focused on those individuals who were angry in their statuses and seemed like they wanted to talk or had issues they felt needed to be told.
An afternoon of work and 200,000+ possible sources got me wondering how else we could be using Facebook in reporting. Any ideas?
ReVeraFilms.com is finished, at least to the extent that I’m okay with where it’s at. The company’s owner can use it without being ashamed and I can move on to other projects. The goal of the site was simply to make an attractive online “business card” for an independent film and television production company based in Los Angeles.
My personal goal was to complete a more complex design for a site as well as build a server from scratch and then host the site on said server. This task took far longer than I want to admit (for a lot of reasons), but I was greatly aided by the Slicehost articles. (The need for custom email addresses for the domain was accomplished through Slicehost/Rackspace’s email hosting service. A service I can’t speak highly enough about. If you need email, look into it. It’s cheap too.) Once that was completed, I needed to settle on a design. I took the approach of “I’m not sure where this is going, but I’ll know it when I get there.”
I wanted the site to be somewhat unique from a visual standpoint. All the site needed to do was look good as a backdrop to a display of video clips and still images. I knew we’d use YouTube to host the videos since that’s the easiest way to ensure folks who are sent links to videos can view the videos without running into browser/plugin type problems. That settled, when the company’s owner sent me the logo, I knew I wanted to somehow base the site around the logo. I imagined that logo on a blank, black TV screen late at night. Or, a dark movie theater, just before the show begins…. I wanted the site to capture that feeling of expectant promise somehow. I browsed the Web and stormed my brain and eventually settled on using the 3D Parallax background effect from css-tricks.com.
Having jiggered that into submission, I browsed the Web and stormed my brain some more for a way to display the video clips. Enter the AnythingSlider jQuery Plugin. (Again from CSS-Tricks. Thanks twice gents!)
All in all, I can say that 1. I’m no designer, but, 2. I am proud of how the site turned out, and the process has definitely spurred me on to take learning design more seriously. To that end, I’ve already begun exploring Twitter’s Bootstrap. It’s an intriguing prospect to be able to design attractive, professional-looking sites at a rapid pace. Another lesson I learned from this process was to just let go and move on from a project. There are tweaks I’d like to make, rejiggerings, additions, deletions, etc. I’ve held off though. I’ve held off even making a few very basic adjustments, like shifting everything down 15 pixels or so from the top. Enough is enough. I’ll come back to this without doubt, but I figure if I move on, I’ll learn more and be better for it for ReVeraFilms.com v.2.