I’ve long had a fascination with TwitterBots. I mean…they’re robots, after all. So, I decided to hack together a Bot to auto-tweet jobs from this Google spreadsheet.
It proved to be my most satisfying coding exercise yet.
The attributes of the class were straightforward: some jobs to tweet (:jobs), a configuration file (:twitter_config) to access the Twitter account of the bot (setting this up is a whole other post…), a Google Drive session to access the spreadsheet (:gdrive_session), the jobs spreadsheet itself (:job_worksheet), and an instance of bitly (:bitly) to shorten the addresses that link back to the full job descriptions.
The class’s attributes are initialized when Twitterbot.new is called, and an instance of Twitter is configured (Twitter.configure) to send the tweet once it is written.
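A minimal sketch of that skeleton, with placeholder values standing in for the real credentials and gem objects (the actual script wires these up from the twitter, google_drive and bitly gems):

```ruby
class Twitterbot
  attr_accessor :jobs, :twitter_config, :gdrive_session, :job_worksheet, :bitly

  def initialize
    @jobs = []
    # Placeholders: the real values come from a config file and the
    # google_drive and bitly gems, which this sketch doesn't load.
    @twitter_config = { consumer_key: "...", consumer_secret: "..." }
    @gdrive_session = nil
    @job_worksheet  = nil
    @bitly          = nil
  end
end
```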
The jobs worksheet is parsed by the make_jobs method. This method accesses the rows of the spreadsheet and, using a regular expression, plucks out all the jobs for 2013.
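The shape of that method can be sketched without the Google Drive session; the row layout below is hypothetical, but the select-by-regex step is the heart of it:

```ruby
# Hypothetical spreadsheet rows: [date, title, link]
rows = [
  ["2013-02-11", "Data Reporter",  "http://example.com/1"],
  ["2012-12-01", "Old Posting",    "http://example.com/2"],
  ["2013-04-22", "News Developer", "http://example.com/3"]
]

# Keep only the 2013 postings.
jobs = rows.select { |row| row.first =~ /\A2013/ }
```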
The only real tricky part of the script was when I used the bitly gem to shorten the provided link associated with each job.
Sometimes folks had pasted an already shortened link into the Google spreadsheet. Calling bitly’s shorten method on an already shortened link returns an error instead of the unchanged, already shortened bitly link (which was the behavior I had anticipated).
The shorten_link method returns a shortened link if the link hasn’t already been shortened; if it has, the method returns an error message. The unless statement checks whether shorten_link returns 'BitlyError' or not. If not, I go ahead and shorten the link; if so, I do nothing and simply insert the already shortened link. (N.B. I’m not sure what happens if the link was already shortened by another service. Also, submitting a patch to the bitly gem repo to get around the begin/end block here is a great idea…)
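Here is a sketch of that control flow, with a stub standing in for the bitly gem’s client (the gem does raise BitlyError on a failed shorten, but bitly_shorten and its return values here are made up for illustration):

```ruby
class BitlyError < StandardError; end

# Stub for the bitly client: raises, as the gem does, when asked to
# shorten a link that is already a bit.ly link.
def bitly_shorten(url)
  raise BitlyError, "ALREADY_A_BITLY_LINK" if url.include?("bit.ly")
  "http://bit.ly/abc123"   # a real client returns a fresh short link
end

def shorten_link(url)
  begin
    bitly_shorten(url)
  rescue BitlyError
    url   # already shortened: insert the link unchanged
  end
end
```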
Constructing the actual tweet, then, is simply a matter of picking out the necessary array indices, arranging them in the correct order and adding some text for hashtags. I added a timestamp because I was originally tweeting every couple hours and Twitter won’t let identical tweets occur within a certain timeframe. That amount of tweeting got a little ridiculous, though. So, I changed the cronjob to once a day. I just left the timestamp in because I like it.
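A sketch of that assembly step, with a hypothetical row layout and hashtag; the strftime timestamp at the end is what keeps successive tweets distinct:

```ruby
# Hypothetical job row: [date, title, organization, shortened link]
job = ["2013-04-22", "News Developer", "Example Paper", "http://bit.ly/abc123"]

# Pick out the fields, add a hashtag, and append a timestamp so
# repeated runs never produce two identical tweets.
tweet = "#{job[1]} at #{job[2]} #{job[3]} #jobs #{Time.now.strftime('%m/%d %H:%M')}"
```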
Overall, after a couple weeks to ponder it, the class is a bit messier than I would like. Ideally, I’d just initialize the bot and it would take care of everything. Also, it doesn’t feel right to configure an instance of Twitter when the bot is initialized but then call Twitter.update outside the class.
Parsing the accident reports proved easier than I had anticipated.
There are eight sections of each report in which I’m interested: location (LOCATION), vehicles involved (UNIT), persons injured (PERSON_INJURED), the environment of the accident (ACCIDENT_ENVIR), the provided diagram of the accident (DIAGRAM) (which I’m not fetching quite yet), the narrative description of the accident (NARRATIVE), the details of the responding officer (OFFICER) and any properties damaged (PROPERTY_DAMAGE). These sections are represented by constants at the top of the script and by an HTML table or tables within the accident report.
The main method within the parsing script is the ‘dispatcher’ method at the bottom of the script. This method loads all the downloaded files and opens each with the HTML parsing library Nokogiri. Each HTML table is then analyzed to discover what type of table it is. If it’s a table with LOCATION data, the td elements of the table are passed to the parse_location method. If it’s a table with UNIT information, the tds are passed to the parse_unit method.
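Stripped of the Nokogiri specifics, the dispatch pattern looks like this (the markers and parse methods are simplified stand-ins for the script’s constants and real parsers):

```ruby
LOCATION = "LOCATION"
UNIT     = "UNIT"

# Simplified stand-ins for the real parse methods.
def parse_location(tds)
  { location: tds.first }
end

def parse_unit(tds)
  { unit: tds.first }
end

# Route a table's td elements to the parser that matches its type.
def dispatch(table_type, tds)
  case table_type
  when LOCATION then parse_location(tds)
  when UNIT     then parse_unit(tds)
  end
end
```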
The main object of the script is the @rpt_hash instance variable. Each individual parsing method cycles through the td elements of the table for which it was designed and drills down into each td element to extract the relevant text. This text is then assigned as a value to a key within the @rpt_hash.
The script relies on the powerful HTML parsing of the Nokogiri library. For an excellent tutorial on this library, see The Bastard’s Book of Ruby.
In essence, the code tds.css('td').children.text.strip.chomp first tells Nokogiri where to look: the td elements (css('td')), then the nodes inside those elements (.children). Next it gets the text (.text), throws away any whitespace surrounding that text (.strip) and, finally, throws away any trailing newline (.chomp).
The whole script is really just a bunch of assigning text as values to hash keys: @rpt_hash[:time_of_accident] = tds.css('td').children.text.strip.chomp. Because we’re going to use this data to do some analysis on the Iowa State Patrol’s response to accidents on Iowa highways, I wanted to take a look at all the data pieces as I went. It was more time consuming this way, but I was able to learn some things about the data. One, for example, was that we are going to have to do some serious work on the dates in each report, but more on that later….
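The tail of that chain is ordinary String cleanup, so it behaves the same on any string:

```ruby
raw   = "  4:35 PM \n"
clean = raw.strip.chomp   # .strip trims surrounding whitespace; .chomp would drop a trailing newline
```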
Some tables, UNITS, for example, occurred in multiples, i.e. while a crash could only involve one LOCATION or one NARRATIVE, any crash could involve multiple vehicles and/or multiple persons injured.
For these tables, instead of assigning each individual data point to a key within the main @rpt_hash object, I assigned each data point to a key within a temporary hash: tmp_hsh[:unit] = tds.css('td').text.rm_space_tab_nline. Once all the unit tables or persons injured tables of a report were parsed, I pushed each temporary hash into a holder object (@units << tmp_hsh) whose values then go into the main @rpt_hash. In this way, I preserved the nested structure that makes JSON so useful.
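In miniature, with made-up field values, the temporary-hash-then-push pattern looks like this:

```ruby
units = []   # plays the role of @units

# One temporary hash per UNIT table, pushed onto the holder array so the
# report keeps one nested entry per vehicle.
[["Ford F-150"], ["Honda Civic"]].each do |tds|
  tmp_hsh = {}
  tmp_hsh[:unit] = tds.first
  units << tmp_hsh
end
```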
Finally, I had the initial idea to accrue all the data from the reports into one giant @rpt_hash, then dump this data into Mongo all at once with the mongoimport command. That approach didn’t work very well. Mongo literally choked.
So, instead, I saved every @rpt_hash for each report separately (File.open….), then called mongoimport on each .json file I just saved (system…). This approach will work much better in the long run because, eventually, I won’t run the parsing script on every accident report, but, instead, just the new ones I haven’t previously saved.
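A sketch of that save-then-import step; the field names are invented, and the mongoimport call (commented out here) uses that command’s standard --db/--collection/--file flags:

```ruby
require 'json'
require 'tmpdir'

rpt_hash = { case_number: "2013-001", time: "4:35 PM" }   # hypothetical report
path = File.join(Dir.tmpdir, "#{rpt_hash[:case_number]}.json")

# One .json file per report...
File.open(path, "w") { |f| f.write(rpt_hash.to_json) }

# ...then hand each file to mongoimport:
# system("mongoimport --db crashes --collection reports --file #{path}")
```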
Finally, I reset all the little holder objects (@units, @injuries, @properties and @rpt_hash) before beginning the next accident report.
Next on my list is to clean the dates. The dates are somewhat irregular within each accident report and their format needs to be standardized or else any analysis involving Dates and Time will be difficult.
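The kind of normalization I have in mind, sketched with hypothetical sample formats: try one expected layout first, fall back to Date.parse, and emit a single standard string:

```ruby
require 'date'

# Hypothetical examples of the irregular formats a report might use.
raw_dates = ["07/12/2005", "2005-07-12", "12 Jul 2005"]

clean = raw_dates.map do |d|
  begin
    Date.strptime(d, "%m/%d/%Y")   # try the US-style layout first
  rescue ArgumentError
    Date.parse(d)                  # fall back to Ruby's flexible parser
  end.strftime("%Y-%m-%d")
end
```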
I needed to parse the html of each report into a format that is not crazy html.
Once parsed, the data needed to be entered into a database.
For the database, I selected the NoSQL database mongoDB. A discourse on the differences between a NoSQL and a SQL database is beyond the scope of this post, but this Stack Overflow answer not only provides useful links comparing the NoSQL databases mongoDB and couchDB, but also gives a succinct summary of why I chose a NoSQL db over a SQL db.
Namely, choose a NoSQL db “for most things that you would do with MySQL or PostgreSQL, but [where] having predefined columns really holds you back.”
That’s me to a “T.” I need basic SQL functionality but I can’t use predefined columns because each accident report is a distinct document with an unknown structure.
One document may have only a single vehicle (unit) involved in the crash whereas the next document may have ten units. One document could have a section for “Property Damage” or even multiple sections for damaged property, but, on the other hand, most vehicle crashes resulted in no property damage.
So, I needed a database that was flexible and could expand and contract with each report.
Having selected mongoDB, I next needed to install it.
This process was absolutely painless thanks to the Mac OS X package manager, Homebrew.
If you develop on Mac OS and don’t use Homebrew, you’re making your life needlessly difficult.
I thought these were dumb when I was growing up, but I thought of them again when I re-re-began to work on parsing the reports from http://accidentreports.iowa.gov.
I had never made much progress because I had never really wanted to make much progress, i.e. I had not decided, definitively, to “do or do not”.
So, I got up early last Saturday, brewed some coffee and started hacking.
Although the actual Iowa Accident Reports are difficult to parse, fetching the reports themselves is pretty straightforward.
I figured I’d start there. Fetch all the reports, store them locally and work from that base.
The URL for each report breaks down to a base URL and URL id number.
Getting all the report pages is simply a matter of cycling through all the id numbers and then downloading the corresponding report.
Not so fast.
For some reason, reports don’t actually start appearing until the id number 29734. Consequently, the reports only stretch back to July 12, 2005.
Now, this situation raises our first data journalism questions. Are there more reports? What happened to the reports prior to July 12, 2005? Can I get those reports? Why aren’t those reports online?
We might want to contact the government offices now and ask them these very questions.
But, having worked with this data previously, I know that I’ll have to contact the Iowa State Patrol at some point anyway because of a problem with the geographic data associated with each report. So, I’ve made note of these questions and put them in a safe, fire-proof place for a later date.
To deal with those empty reports, I decided to check for the presence of the Law Enforcement Case Number:
If this returns “false”, a Law Enforcement Case Number is present and I’ll download the file; if it’s empty (returns true), there’s no report and we move on to the next URL id number.
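The check reduces to an emptiness test; here is a sketch with a plain string match standing in for the script’s actual lookup:

```ruby
def empty_report?(html)
  # true when the page carries no case number, i.e. there's no report
  html[/Law Enforcement Case Number/].nil?
end
```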
While I might be able to reasonably assume that all URL id numbers prior to 29734 are empty (I spot checked quite a few), it’s easy for me to have the script start at zero and run through all possible crash reports, so that’s what I’ll do.
Eventually, I’d like to automate this script to start and stop on its own and to run every two weeks to check for new reports, but that’s for later. For now, I hard coded the script to start at 0 and stop at 63304, which, as of this writing, is the most recent full accident report.
The script is straightforward: a fetch_page method grabs the page; a save_page method checks for the presence of the Law Enforcement Case Number and, if present, calls write_to_raw_file, passing in the html of the fetched page; and a loop_through_pages method cycles through the url_id numbers and calls save_page.
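The control flow can be sketched with the fetcher injected as a lambda, so the loop is visible without hitting the live site (the URL pattern is a guess; only the base domain comes from the post):

```ruby
BASE_URL = "http://accidentreports.iowa.gov"   # report path/params are hypothetical

# Stand-in for save_page + write_to_raw_file: keep pages with a case number.
def save_page(html, saved)
  saved << html if html.include?("Law Enforcement Case Number")
end

def loop_through_pages(first_id, last_id, saved, fetcher)
  (first_id..last_id).each do |url_id|
    html = fetcher.call("#{BASE_URL}?id=#{url_id}")   # fetch_page stand-in
    save_page(html, saved)
  end
end
```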
I ran the script and it took a while, ~10 hours, but it worked like a charm.
Next on my list is to get a list of the Law Enforcement Case Numbers and associated X and Y coordinate values.
My previous experience with this data taught me that the geo information is cut off for each of the reports. I contacted the government officials about this problem in 2011 and they said they’d be happy to update the geo information if I provided them a list of case numbers and X and Y values.
That’s pretty helpful and I wish I had been in a position to jump on that opportunity when it was presented. (Alas…le Sigh…)
They might not be so helpful this time. We’ll see.
I had just started my first day as a data journalist at the venerable Des Moines Register and my new boss and I, the equally venerable @jameswilkerson, were chatting about web scraping and he mentioned the Iowa State Patrol’s Crash site. James said that he and my predecessor at the Register, the also venerable @mikejcorey, had kicked the tires on the idea of scraping the site but had deemed it impractical/undoable/just not worth it.
Being the cocky young buck that I was two years ago and wanting to make a good impression, I said “It shouldn’t be too hard to scrape that.” James looked at me quizzically (or like I was nuts) and showed me the HTML.
Gross for sure. But not impossible. But gross. And hard.
But, worth it.
So, I took a spin at processing the html, but I didn’t get far/never really tried and was soon distracted by my new job, vacations, college football…..etc.
Then, layoffs struck the Des Moines Register and folks were scattered to the winds.
I never forgot about those Iowa State Patrol Crash reports, though, and even mentioned them again to @mikejcorey at NICAR 2012. He gave me @jameswilkerson’s “You’re nuts” look, but agreed that that data would be great to get and mentioned some questions/analysis he’d like to run on the data.
So, I dusted off that Ruby script and set to work processing that cringe-worthy HTML. I made good progress but was distracted by my lack of a certain goal. I wasn’t sure what I wanted to do with the accident report data once it was processed, so I again set the script aside.
This time that Ruby script might have remained forever dusty and forgotten if not for a course of events set in motion by this talk by Ben Welsh (@palewire), a database producer at the LA Times, at NICAR 2013.
Ben’s talk was great and made an impression that I mentioned to Ben’s LA Times colleague, Ken Schwencke (@schwanksta), in the lobby of our hotel on the final day of the conference.
Ken told me that Ben had done the analysis for that project in Django. Ken didn’t expound on that point, as it didn’t really need any explanation. I knew exactly what he was talking about.
I had never thought about doing the analysis for a story or project within a Web framework, be it Django or Rails, but the obviousness of this idea, and the fact that I hadn’t been doing it, made me feel like I’d been riding a bicycle with triangular wheels for the last few years.
A few weeks after NICAR 2013, this article on How The Data Sausage Gets Made by Jacob Harris (@harrisj) of the New York Times and a subsequent conversation with Troy Thibodeaux (@tthibo) of the Associated Press reinforced the idea of doing all the data work for a project within one’s framework of choice.
At this point, I was obsessed and walking around like Howard Hughes muttering “The way of the future…”. And, I knew my Iowa Crash Site reports project was a perfect test case with which to experiment. This project would have it all: scraping, a FOIA (or at least some back and forth with government officials), mapping, graphs and even a NoSQL database.
Perfect. (Plus, the college football season was far enough off that I could finish before the first kickoff….)
I deployed my first official Rails application this weekend. (By official, I mean it actually serves a purpose in its existence.)
ReVeraFilms.com started out as a static site for a production company based in Los Angeles, and I had no real desire to take it any further once I had finished it.
After I had finished Michael Hartl’s Rails Tutorial book, I began thinking about a good first project. I’ve worked extensively with data and databases before and even wrote some models and a Rake import for an aborted application in 2010. So, I felt confident with my backend skills in Rails, confident enough, at least, to get started without too much fear.
Where I knew I needed more practice in Rails was in the front-end matter, i.e. the asset pipeline, partials, layouts, views and the relationship of all and sundry. I was tired of nibbling around the edges of Rails, though, so I was reluctant to start another “tutorial/exercise” type project.
I figured I’d just convert a static html site to Rails and that would give me some practice on the front-end portion of Rails.
It turned out to be a great idea and went pretty smoothly. (I even wrote a few tests.)
The only real problem was a niggling error related to the Asset Pipeline. The application.js file already requires jQuery, and I mistakenly called the library again along with the js files for the Anything Slider. So, I kept getting a mysterious error and the Anything Slider didn’t work at all.
These sorts of errors are exactly what makes learning a framework so difficult and exactly why I wanted to (and am glad I did) start small and manageable.
What wasn’t a breeze was getting the domain name reverafilms.com to point to the Heroku app. Bluehost doesn’t allow configuration of CNAME records unless you also host your site with Bluehost. So, first I had to transfer the domain name registration to GoDaddy.
Then, I had to muck around in the guts of my Rackspace account to get the proper settings, change them in the GoDaddy DNS zone editor, then wait an hour or so for the changes to take effect to see if I had it right. If not, I returned to the GoDaddy zone editor and tried again. (I now have no problem admitting that I have a lot to learn when it comes to DNS blah, blah, blah.) Late Sunday afternoon, my apparent triumph was cut short when an email I had sent to the owner of the site (firstname.lastname@example.org) was returned as undeliverable. I thought, hoped, and prayed that maybe I just needed to point the right record in GoDaddy to the MX server on Rackspace. Thankfully, I was right and there was only a little downtime for the email addresses associated with the site.
All in all, it was a great exercise. The lesson: never underestimate the value of fully implementing a seemingly trivial version of a project. You can always add refinements, refactor, build out, etc. But, taking a trivial version of a project from 0 to fully implemented forces you to deal with that last, most difficult 10 percent of a project, but on a more digestible scale.
I’ve been messing around with d3 as much as possible lately and I’m impressed. I’ve a backlog of posts about this great library, but I thought I’d start off small and give a shout out to Scott Murray (@alignedleft) and his great tutorials.
My first real project was to graph the Nebraska football wins over the history of the program. (I got this data from Wikipedia.)
Scott’s tutorials gave me enough knowledge to work through this example project and helped clear up some confusion I had regarding scales.
I plan on expanding this example Husker graphic to include all Big Ten teams and to add some interactivity as well.
Also, if you’re just starting out, don’t worry so much about “data” just yet. After working through most of another d3 book, Mike Dewar’s Getting Started With d3, I came to realize that the proverbial “rest of the iceberg” with d3 is the data parsing. You can make spectacular visuals with d3, as long as you can parse your data into the format that you want. Enter the scripting language of your choice….
Nevertheless, for now, just check out Scott’s tutorials. They really are fantastic. As for data, follow Scott’s lead and hard code your values in an array. (Always, start small, I say.)
Regarding Scott’s book, I can attest that it’s on the same level as the tutorials. So, if you dig the tutorials, you’ll dig the book too.
Also, if you have any questions on the Husker graphic above, hit me up in the comments.
Late last year, I wrote a script for a reporter using Facebook’s Graph API. The reporter was working on a much larger health-care related project, and he was wondering whether he could use the now defunct site Openbook to find folks who had had adverse experiences with certain surgical procedures and/or medical conditions.
I said sure, he could use Openbook, but because that site is built on top of Facebook’s own platform, he’d be better off just using the Facebook data directly.
Could I write this script for him, he asked.
Sure, I said.
It didn’t take long to write the script, and the only real difficulty came in choosing a delimiter. Unfiltered Facebook status updates are a burbling cesspool of every sort of character imaginable, and the first versions of the script yielded garbled Facebook statuses. Sanitizing the raw statuses solved this problem, so I set the script to run daily and let it run for a few weeks. After I had ~1,000 lines of data, I converted the output to an Excel spreadsheet and sent the data to the reporter. We looked at the spreadsheet together, talked about removing duplicates and filtering the data to get what he wanted, then went back to our lives.
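The sanitizing step amounted to scrubbing the characters that collided with the delimiter; here is a sketch with made-up statuses and a simplified cleaning rule, writing CSV so quoting handles the rest:

```ruby
require 'csv'

# Hypothetical raw statuses; embedded newlines were among the delimiter-breakers.
statuses = ["I had knee surgery, and it went badly\n:(", "a normal status"]

# Collapse newlines to spaces before writing each status out.
rows = statuses.map { |s| s.gsub(/[\r\n]+/, " ") }

csv = CSV.generate { |out| rows.each { |r| out << [r] } }
```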
A couple weeks later I found out he had reached out to a number of individuals via Facebook, heard back from probably half and ended up interviewing a few of them.
I forgot about this script until yesterday when I was cleaning out my server. Those ~1,000 lines of data had grown to 209,520, an impressive number of Facebook statuses, even if only 25 percent of them are distinct.
The effort I put into gathering these 200,000+ sources was minimal, probably an hour or so, at most, and the effort the reporter put into getting in touch with each individual was minimal as well. He said he focused on those individuals who were angry in their statuses and seemed like they wanted to talk or had issues they felt needed to be told.
An afternoon of work and 200,000+ possible sources got me wondering how else we could be using Facebook in reporting. Any ideas?