Our talented graphics guy, Mike Fisher, came to me a few weeks ago with a PDF of the above image. The image showed all the annexations into the City of San Antonio since 1936. (The big square in the middle of the image shows the extent of San Antonio in 1936.)
Fisher asked me if I could separate out all the annexations by year so that he could build an animation that showed how San Antonio has grown through the years. I told him I could if he could get me a shapefile. A few phone calls later, shapefile in hand, he returned.
I previewed the shapefile in QGIS and took a look at the file’s attributes. Each shape had a field called “annexation” which indicated the year that particular geography was brought into the city. I also noticed that there was a “type” field which classified an annexation as an addition (‘Annexation’) to the city or a subtraction (‘Disannexation’). (How this actually figured into the final graphic is a longer story….)
I used the handy shp2pgsql command to load the shapefile into a spatially enabled postgresql database.
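The invocation looked something like this; the SRID, file, table and database names below are placeholders for illustration, not the ones I actually used:

```shell
# Convert the shapefile to SQL and pipe the result into PostgreSQL/PostGIS.
# -s 4326 (the SRID) and every name here is a stand-in.
shp2pgsql -s 4326 annexations.shp public.annexations > annexations.sql
psql -d dbname -U username -f annexations.sql
```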
Then, I wrote a Python script that looped over the database twice. In each pass, the script selected either all the annexations or disannexations and grouped them by year.
Next, the script looped over each year in the resultset and selected, then exported, those shapes that were annexed (or disannexed) in that particular year. Shp2pgsql’s sibling, pgsql2shp, made this export easy.
I exported each shape along this file path: Annexation(or Disannexation) directory/year/year.shp. Then, I exported each resulting shapefile as a .dxf file to a DXF directory within the Annexation or Disannexation folder. The .dxf file was the final desired product. That format is readable by Illustrator and is what our graphics guy needed to make the final animation.
The most challenging part of this script, really, was manipulating the files and filepaths. I had a definite idea of how I wanted the results to be structured: Annexation/year/year.shp, Disannexation/year/year.shp and then Annexation/DXF/year.dxf and Disannexation/DXF/year.dxf, and moving between directories from within a script can be confusing. My patience, though, was rewarded by a portable script with a structured file output.
import psycopg2
import sys
import os
import subprocess


def trunc(f, n):
    '''Truncates/pads a float f to n decimal places without rounding'''
    slen = len('%.*f' % (n, f))
    return str(f)[:slen]


base_dir = os.path.dirname(__file__)
con = None
annexes = ['Disannexation', 'Annexation']

for annex in annexes:
    # Grab the distinct years for this annexation type.
    try:
        con = psycopg2.connect(database='dbname', user='username')
        cur = con.cursor()
        cur.execute("SELECT annexation FROM tablename "
                    "WHERE type = %s GROUP BY annexation", (annex,))
        years = cur.fetchall()
        cur.close()
    except psycopg2.DatabaseError, e:
        print 'Error %s' % e
        sys.exit(1)
    finally:
        if con:
            con.close()

    for year in years:
        # fetchall() returns 1-tuples, so pull the value out before truncating.
        str_year = trunc(year[0], 0)
        if not os.path.exists(annex + "/" + str_year):
            os.makedirs(annex + "/" + str_year)
        # Export this year's shapes: Annexation(or Disannexation)/year/year.shp
        subprocess.call("pgsql2shp -f " + annex + "/" + str_year + "/" + str_year +
                        ".shp -h localhost -u username dbname "
                        "'SELECT * FROM tablename WHERE annexation = " + str_year + "'",
                        shell=True)
        if not os.path.exists(annex + "/DXF"):
            os.makedirs(annex + "/DXF")
        # Convert the shapefile to .dxf in the DXF directory, then pop back up.
        os.chdir(os.path.join(base_dir, annex, str_year))
        subprocess.call("ogr2ogr -f 'DXF' ../DXF/" + str_year + ".dxf " +
                        str_year + ".shp", shell=True)
        os.chdir("../..")
It’s not a robot at all but an automated manipulation of Twitter facilitated by the Twitter API, and it’s all about communication. The Twitter API (or any API) is a set of commands so your code and Twitter’s code can interact with predictable results.
The key is to start small. You have to first register your “application” with Twitter at apps.twitter.com. In the process of registering your application, you will get four crazy looking strings: your consumer_key, consumer_secret, oauth_token and oauth_token_secret. These four strings help Twitter know that you are you and not some other bot with malicious intent.
(Registering your application and getting all the particulars just so can be confusing. I always google for examples to mimic or just to point me in the right direction. This blog post helped me out.)
@newsapps_jobs was written in Ruby and so the above configuration follows the convention specified by the Ruby Twitter Gem. Whichever language you choose for your bot, there is almost assuredly a library that will help you interact with Twitter’s API.
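For reference, that convention looked roughly like this in the version of the gem I was using (the four strings are placeholders from apps.twitter.com, not real credentials; newer versions of the gem configure a Twitter::REST::Client instead):

```ruby
require 'twitter'

# Module-level configuration, as the older twitter gem expected it.
Twitter.configure do |config|
  config.consumer_key       = "YOUR_CONSUMER_KEY"
  config.consumer_secret    = "YOUR_CONSUMER_SECRET"
  config.oauth_token        = "YOUR_OAUTH_TOKEN"
  config.oauth_token_secret = "YOUR_OAUTH_TOKEN_SECRET"
end

# The basics: if this tweets, your bot is talking to the API correctly.
Twitter.update("Hello, world. I am a bot.")
```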
As soon as you verify that your bot can do the basics, you’re on your way.
@newsapps_jobs uses a number of different Ruby libraries to get its job done: the bitly gem to shorten links, the google-drive gem to access the Google spreadsheet, the twitter gem for tweeting and the whenever gem to manage the cron job for timely tweeting. All of these gems have their own documentation, which can be found online.
The complete code for the bot can be found here and, previously, I wrote about the bot here.
Why a bot?
I wrote this bot in the summer of 2012. I felt the state of affairs regarding jobs in the newsapps/data journalism world was unruly at best, chaotic at worst. The best offering up to that point was this embedded google spreadsheet. It was a decent solution, a start, but, obviously, not an endpoint. I thought that the bite-sized information delivery of Twitter was perfect for a jobs board. With a job ad, the information necessary for a reader to assess their level of personal interest easily fits within 140 characters: job title, location and employer are really all a person needs to know if they want to click. Scrolling the tweets on your Twitter client would take seconds even if you checked every day. Easy! So I set to work. (I also knew that writing a Twitter bot was a great exercise! And fun!)
The tally so far: the bot has 170 followers, 252 tweets and I’ve shared the spreadsheet with 16 folks. I’m not sure, though, if the bot has actually connected employer with future employee.
Nevertheless, in the week before NICAR 2014, I did see this Twitter account appear.
I encourage you to write your own bot. It’s fun and great practice. If you have any questions, comments or concerns or are struggling with your bot, hit me up in the comments. I’m helpful.
I’ve long had a fascination with TwitterBots. I mean…they’re robots, after all. So, I decided to hack together a Bot to auto-tweet jobs from this Google spreadsheet.
It proved to be my most satisfying coding exercise, yet.
The attributes of the class were straightforward: some jobs to tweet (:jobs), a configuration file (:twitter_config) to access the Twitter account of the bot (setting this up is a whole other post…), a Google drive session to access the spreadsheet (:gdrive_session), the jobs spreadsheet itself (:job_worksheet), and an instance of bitly (:bitly) to shorten the addresses that link back to the full job descriptions.
The class’s attributes are initialized when Twitterbot.new is called and an instance of Twitter is configured (Twitter.configure) to tweet the tweet after it is written.
The jobs worksheet is parsed by the make_jobs method. This method accesses the rows of the spreadsheet and, using a regular expression, plucks out all the jobs for 2013.
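A minimal sketch of that idea, with the spreadsheet rows stubbed out as plain arrays (the real method reads them through the google-drive gem, and the column layout here is an assumption for illustration):

```ruby
# Stubbed rows standing in for the Google spreadsheet:
# [date, title, location, link].
rows = [
  ["1/15/2013", "News Apps Developer", "Chicago", "http://example.com/job1"],
  ["11/2/2012", "Data Reporter",       "Austin",  "http://example.com/job2"],
  ["3/4/2013",  "Graphics Editor",     "NYC",     "http://example.com/job3"],
]

# Pluck out all the jobs whose date column matches 2013, regex-style.
def make_jobs(rows)
  rows.select { |row| row[0] =~ /\/2013\z/ }
end

jobs = make_jobs(rows)
puts jobs.length  # two of the three stub rows are from 2013
```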
The only real tricky part of the script was when I used the bitly gem to shorten the provided link associated with each job.
Sometimes folks had pasted an already shortened link into the Google spreadsheet. Calling bitly’s shorten method on an already shortened link returns an error instead of the unchanged, already shortened bitly link (which was the behavior I had anticipated).
The shorten_link method returns a shortened link if the link hasn’t already been shortened. If the link has already been shortened, the shorten_link method returns an error message. The unless statement checks whether shorten_link returns ‘BitlyError’ or not. If not, I go ahead and shorten the link; if so, I do nothing and simply insert the already shortened link. (N.B. I’m not sure what happens if the link is already shortened but by another service. Also, submitting a patch to the bitly gem repo to get around the begin/end block here is a great idea…)
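The pattern looks roughly like this; `fake_bitly_shorten` is a stub standing in for the real bitly client, and the `BitlyError` class here is a local stand-in for the gem’s error:

```ruby
# Local stand-in for the error the bitly gem raises on an already-short link.
class BitlyError < StandardError; end

# Stub client: raises if asked to shorten a bit.ly link, like the real gem did.
def fake_bitly_shorten(url)
  raise BitlyError if url.include?("bit.ly")
  "http://bit.ly/abc123"
end

# Try to shorten; if the gem complains, hand back the error name instead.
def shorten_link(url)
  begin
    fake_bitly_shorten(url)
  rescue BitlyError
    "BitlyError"
  end
end

# Use the original link unless shortening actually succeeded.
def link_for_tweet(url)
  shortened = shorten_link(url)
  unless shortened == "BitlyError"
    shortened
  else
    url
  end
end

puts link_for_tweet("http://example.com/long/job/posting")
puts link_for_tweet("http://bit.ly/already")
```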
Constructing the actual tweet, then, is simply a matter of picking out the necessary array indices, arranging them in the correct order and adding some text for hashtags. I added a timestamp because I was originally tweeting every couple hours and Twitter won’t let identical tweets occur within a certain timeframe. That amount of tweeting got a little ridiculous, though. So, I changed the cronjob to once a day. I just left the timestamp in because I like it.
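Assembled from a row, the tweet looks something like this; the column indices, hashtags and timestamp format are illustrative assumptions, not the bot’s exact output:

```ruby
require 'time'

# A stub job row: [date, title, location, link]; indices are assumptions.
row = ["1/15/2013", "News Apps Developer", "Chicago", "http://bit.ly/abc123"]

# Pick out the indices, arrange them, add hashtags, and tack on a timestamp
# so repeated runs never produce byte-identical tweets.
stamp = Time.now.strftime("%H:%M")
tweet = "#{row[1]} - #{row[2]} #{row[3]} #ddj #newsnerdjobs #{stamp}"

puts tweet
```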
Overall, after a couple weeks to ponder it, the class is a bit messier than I would like. Ideally, I’d just initialize the bot and it would take care of everything. Also, it doesn’t feel right to configure an instance of Twitter when the bot is initialized but then call Twitter.update outside the class.
Parsing the accident reports proved easier than I had anticipated.
There are eight sections of each report in which I’m interested: location (LOCATION), vehicles involved (UNIT), persons injured (PERSON_INJURED), the environment of the accident (ACCIDENT_ENVIR), the provided diagram of the accident (DIAGRAM) (which I’m not fetching quite yet), the narrative description of the accident (NARRATIVE), the details of the responding officer (OFFICER) and any properties damaged (PROPERTY_DAMAGE). These sections are represented by constants at the top of the script and an html table or tables within the accident report.
The main method within the parsing script is the ‘dispatcher’ method at the bottom of the script. This method loads all the downloaded files and opens each with the HTML parsing library Nokogiri. Each HTML table is then analyzed to discover what type of table it is. If it’s a table with LOCATION data, the td elements of the table are passed to the parse_location method. If it’s a table with UNIT information, the tds are passed to the parse_unit method.
The main object of the script is the @rpt_hash instance variable. Each individual parsing method cycles through the td elements of the table for which it was designed and drills down into each td element to extract the relevant text. This text is then assigned as a value to a key within the @rpt_hash.
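A stripped-down sketch of that dispatch pattern, with the Nokogiri parsing elided and each table reduced to a section label plus its cell text (the labels, field names and sample values are all illustrative):

```ruby
# Each 'table' reduced to [section_label, cell_values]; the real script
# discovers the label by inspecting the table's HTML with Nokogiri.
tables = [
  ["LOCATION", ["Polk County", "I-80 MM 137"]],
  ["UNIT",     ["2004 Ford F-150", "Northbound"]],
]

# One parse_* method per table type, each filling in the report hash.
def parse_location(tds, rpt_hash)
  rpt_hash[:county]  = tds[0]
  rpt_hash[:roadway] = tds[1]
end

def parse_unit(tds, rpt_hash)
  rpt_hash[:unit]      = tds[0]
  rpt_hash[:direction] = tds[1]
end

# The dispatcher: look at what kind of table it is, hand the tds over.
rpt_hash = {}
tables.each do |label, tds|
  case label
  when "LOCATION" then parse_location(tds, rpt_hash)
  when "UNIT"     then parse_unit(tds, rpt_hash)
  end
end

puts rpt_hash.inspect
```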
The script relies on the powerful HTML parsing abilities of the Nokogiri library. For an excellent tutorial on this library, see The Bastard’s Book of Ruby.
In essence, this code, 'tds.css('td').children.text.strip.chomp', first tells Nokogiri where to look: the td elements (css('td')); then grabs what’s inside those elements (.children); next gets the text (.text); then throws away any white space around that text (.strip); and finally throws away any trailing newline (.chomp).
The whole script is really just a bunch of assigning text as values to hash keys: @rpt_hash[:time_of_accident] = tds.css('td').children.text.strip.chomp. Because we’re going to use this data to do some analysis on the Iowa State Patrol’s response to accidents on Iowa highways, I wanted to take a look at all the data pieces as I went. It was more time consuming this way, but I was able to learn some things about the data. One, for example, was that we are going to have to do some serious work on the dates in each report, but more on that later….
Some tables, UNIT tables, for example, occurred in multiples; i.e., while a crash could only involve one LOCATION or one NARRATIVE, any crash could involve multiple vehicles and/or multiple persons injured.
For these tables, for every accident report, instead of assigning each individual data point to a key within the main @rpt_hash object, I assigned each data point to a key within a temporary hash: tmp_hsh[:unit] = tds.css('td').text.rm_space_tab_nline. When all the unit tables or persons injured tables are parsed, I pushed all the assigned values into the main @rpt_hash object, @units << tmp_hsh. In this way, I preserved the nested structure that makes JSON so useful.
Finally, I had the initial idea to accrue all the data from the reports into one giant @rpt_hash, then dump this data into Mongo all at once with the mongoimport command. That approach didn’t work very well. Mongo literally choked.
So, instead, I saved every @rpt_hash for each report separately (File.open….), then called mongoimport on each .json file I just saved (system…). This approach will work much better in the long run because, eventually, I won’t run the parsing script on every accident report, but, instead, just the new ones I haven’t previously saved.
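Roughly, per report, it comes down to this (the file naming and the db/collection names are assumptions; `system` fails harmlessly here if mongoimport isn’t installed):

```ruby
require 'json'
require 'tmpdir'

# A stub report hash standing in for a fully parsed @rpt_hash.
rpt_hash = { :case_number => "2013-001234",
             :units => [{ :unit => "Ford F-150" }] }

saved = nil
Dir.mktmpdir do |dir|
  # Save this report's hash to its own .json file...
  path = File.join(dir, "#{rpt_hash[:case_number]}.json")
  File.open(path, "w") { |f| f.write(rpt_hash.to_json) }

  # ...then hand that one file to mongoimport (db/collection names assumed).
  system("mongoimport --db crashes --collection reports --file #{path}")

  saved = File.read(path)
end

puts saved
```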
Finally, I reset all the little holder objects, @units, @injuries, @properties and @rpt_hash before beginning the next accident report.
Next on my list is to clean the dates. The dates are somewhat irregular within each accident report and their format needs to be standardized or else any analysis involving Dates and Time will be difficult.
I needed to parse the html of each report into a format that is not crazy html.
Once parsed, the data needed to be entered into a database.
For the database, I selected the NoSQL database mongoDB. A discourse on the differences between a NoSQL and a SQL database is beyond the scope of this post, but this Stack Overflow answer not only provides useful links comparing the NoSQL databases mongoDB and couchDB, but also provides a succinct summary of why I chose a NoSQL db over a SQL db.
Namely: choose a NoSQL db “for most things that you would do with MySQL or PostgreSQL, but having predefined columns really holds you back.”
That’s me to a “T.” I need basic SQL functionality but I can’t use predefined columns because each accident report is a distinct document with an unknown structure.
One document may have only a single vehicle (unit) involved in the crash whereas the next document may have ten units. One document could have a section for “Property Damage” or even multiple sections for damaged property, but, on the other hand, most vehicle crashes resulted in no property damage.
So, I needed a database that was flexible and could expand and contract with each report.
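Concretely, two reports might look like this side by side, which is awkward to shoehorn into fixed columns but trivial for a document store (the field names and values are illustrative):

```ruby
# One crash with a single unit and no property damage...
report_a = { :case_number => "A-1",
             :units => [{ :unit => "Sedan" }] }

# ...the next with several units plus a property-damage section.
report_b = {
  :case_number => "B-2",
  :units => [{ :unit => "Semi" }, { :unit => "Pickup" }, { :unit => "SUV" }],
  :property_damage => [{ :property => "Guardrail" }],
}

# Same collection, different shapes: no predefined columns required.
puts report_a[:units].length
puts report_b[:units].length
```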
Having selected mongoDB, I next needed to install it.
This process was absolutely painless thanks to the Mac OS X package manager, Homebrew.
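At the time, that amounted to this (the formula has since moved to a MongoDB-maintained tap, so the exact name may differ today):

```shell
brew update
brew install mongodb
```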
If you develop on Mac OS and don’t use Homebrew, you’re making your life needlessly difficult.
I thought these were dumb when I was growing up, but I thought of them again when I re-re-began to work on parsing the reports from http://accidentreports.iowa.gov.
I had never made much progress because I had never really wanted to make much progress, i.e. I had not decided, definitively, to “do or do not”.
So, I got up early last Saturday, brewed some coffee and started hacking.
Although the actual Iowa Accident Reports are difficult to parse, fetching the reports themselves is pretty straightforward.
I figured I’d start there. Fetch all the reports, store them locally and work from that base.
The URL for each report breaks down to a base URL and URL id number.
Getting all the report pages is simply a matter of cycling through all the id numbers and then downloading the corresponding report.
Not so fast.
For some reason, reports don’t actually start appearing until the id number 29734. Consequently, the reports only stretch back to July 12, 2005.
Now, this situation raises our first data journalism questions. Are there more reports? What happened to the reports prior to July 12, 2005? Can I get those reports? Why aren’t those reports online?
We might want to contact the government offices now and ask them these very questions.
But, having worked with this data previously, I know that I’ll have to contact the Iowa State Patrol at some point anyway because of a problem with the geographic data associated with each report. So, I’ve made note of these questions and put them in a safe, fire-proof place for a later date.
To deal with those empty reports, I decided to check for the presence of the Law Enforcement Case Number:
If this returns “false”, a Law Enforcement Case Number is present and I’ll download the file; if it returns “true”, the page is empty, there’s no report, and we move on to the next URL id number.
While I might be able to reasonably assume that all URL id numbers prior to 29734 are empty (I spot checked quite a few), it’s easy for me to have the script start at zero and run through all possible crash reports, so that’s what I’ll do.
Eventually, I’d like to automate this script to start and stop on its own and to run every two weeks to check for new reports, but that’s for later. For now, I hard coded the script to start at 0 and stop at 63304, which, as of this writing, is the most recent full accident report.
The script is straightforward: a fetch_page method grabs the page; a save_page method checks for the presence of the Law Enforcement Case Number and, if present, calls write_to_raw_file, passing in the html of the fetched page; and a loop_through_pages method cycles through the url_id numbers and calls save_page.
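The shape of it, with the HTTP fetch stubbed out so the structure is visible (the stub stands in for a real Net::HTTP or open-uri call, and the fake “empty page” logic is purely for illustration):

```ruby
# Stub fetcher: returns fake html; the real fetch_page does an HTTP GET
# against the report URL for this id. Even ids play the role of real reports.
def fetch_page(url_id)
  if url_id.even?
    "<b>Law Enforcement Case Number</b> 2013-#{url_id}"
  else
    "<html>empty report page</html>"
  end
end

# Save only pages that actually contain a case number.
def save_page(url_id, store)
  html = fetch_page(url_id)
  return if html.scan("Law Enforcement Case Number").empty?
  store[url_id] = html  # the real script calls write_to_raw_file(html)
end

# Cycle through the url id range, trying each one.
def loop_through_pages(range, store)
  range.each { |url_id| save_page(url_id, store) }
end

store = {}
loop_through_pages(0..9, store)
puts store.keys.inspect  # only the non-empty (even) ids survive
```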
I ran the script and it took a while, ~10 hours, but it worked like a charm.
Next on my list is to get a list of the Law Enforcement Case Numbers and associated X and Y coordinate values.
My previous experience with this data taught me that the geo information is cut off for each of the reports. I contacted the government officials about this problem in 2011 and they said they’d be happy to update the geo information if I provided them a list of case numbers and X and Y values.
That’s pretty helpful and I wish I had been in a position to jump on that opportunity when it was presented. (Alas…le Sigh…)
They might not be so helpful this time. We’ll see.
I had just started my first day as a data journalist at the venerable Des Moines Register and my new boss and I, the equally venerable @jameswilkerson, were chatting about web scraping and he mentioned the Iowa State Patrol’s Crash site. James said that he and my predecessor at the Register, the also venerable @mikejcorey, had kicked the tires on the idea of scraping the site but had deemed it impractical/undoable/just not worth it.
Being the cocky young buck that I was two years ago and wanting to make a good impression, I said “It shouldn’t be too hard to scrape that.” James looked at me quizzically (or like I was nuts) and showed me the HTML.
Gross for sure. But not impossible. But gross. And hard.
But, worth it.
So, I took a spin at processing the html, but I didn’t get far (never really tried) and was soon distracted by my new job, vacations, college football, etc.
Then, layoffs struck the Des Moines Register and folks were scattered to the winds.
I never forgot about those Iowa State Patrol Crash reports, though, and even mentioned them again to @mikejcorey at NICAR 2012. He gave me @jameswilkerson’s “You’re nuts” look, but agreed that that data would be great to get and mentioned some questions/analysis he’d like to run on the data.
So, I dusted off that Ruby script and set to work processing that cringe-worthy HTML. I made good progress but was distracted by my lack of a certain goal. I wasn’t sure what I wanted to do with the accident report data once it was processed, so I again set the script aside.
This time that Ruby script might have remained forever dusty and forgotten if not for a course of events set in motion by this talk by Ben Welsh (@palewire), a database producer at the LA Times, at NICAR 2013.
Ben’s talk was great and made an impression that I mentioned to Ben’s LA Times colleague, Ken Schwencke (@schwanksta), in the lobby of our hotel on the final day of the conference.
Ken told me that Ben had done the analysis for that project in Django. Ken didn’t expound on that point, as it didn’t really need any explanation. I knew exactly what he was talking about.
I had never thought about doing the analysis for a story or project within a Web framework, be it Django or Rails, but the obviousness of this idea and the fact that I didn’t do it made me feel like I’d been riding a bicycle with triangle wheels for the last few years.
A few weeks after NICAR 2013, this article on How The Data Sausage Gets Made by Jacob Harris (@harrisj) of the New York Times and a subsequent conversation with Troy Thibodeaux (@tthibo) of the Associated Press reinforced the idea of doing all the data work for a project within one’s framework of choice.
At this point, I was obsessed and walking around like Howard Hughes muttering “The way of the future…”. And, I knew my Iowa Crash Site reports project was a perfect test case with which to experiment. This project would have it all: scraping, a FOIA (or at least some back and forth with government officials), mapping, graphs and even a NoSQL database.
Perfect. (Plus, the college football season was far enough off that I could finish before the first kickoff….)