This summer, I had the opportunity to be a part of Washington State’s effort to make government data accessible and more meaningful to the average individual. I collaborated with a State of Washington agency, the Department of Labor and Industries, to create intuitive displays of labor and industry data. In addition to having an agency mentor, I worked alongside several industry mentors from Socrata, an open-data platform that played a key role in making the visualizations reproducible.
When it comes to producing visualizations, there is no shortage of business intelligence software available to the public, much of which comes at no cost. Microsoft Power BI and Tableau Desktop are excellent offerings for manipulating and displaying data, and both offer an easy-to-use graphical interface. I chose instead to take a slightly more “challenging” route, one that would require a bit of programming. R is a programming language that emerged in 1993 as an implementation of the preexisting S language. It is frequently used by statisticians and data scientists for data analysis, and recent studies suggest that R’s popularity is growing at a rapid pace. Having already used R for data visualization and business analytics in my senior semester at Gonzaga University, I had a solid foundation of visualizations to draw from. I was already familiar with the powerful charting packages ggplot2, rCharts, and googleVis, so it made sense to build on this knowledge base. Beyond this, I wanted to improve my programming abilities in R and expose myself to a wider range of packages and functions.
Being able to easily update the visualizations to reflect new data was a primary concern for the project. Having the data hosted on Socrata’s public portal provided a huge benefit: version control. While exchanging data with the agency, I ran into multiple revisions of the same spreadsheets, each with its own discrepancies. Quite humorously, some of my more common file names included finalcopy.xlsx and finalcopy1.xlsx. As you can imagine, working out which spreadsheet was the most up-to-date became a major inconvenience. With the data hosted on Socrata, there is only ever one working copy of the dataset. When new annual data emerges, the online dataset can be updated with ease, and rerunning the R script I created reproduces the visualizations. A completely separate bonus that comes with using Socrata is transparency, in that the data is made accessible to the public. Anyone can access the data and explore it for themselves. As a result, open data directly encourages innovation and ushers in new opportunities.
The data I was provided pertained mainly to workers’ compensation in the State of Washington. This included injury claims, workers’ compensation rates over time, and claimant demographics. The tabbed section below shows GIFs of visualizations from each of these three areas of interest. Because the R visualizations are interactive, I used GIFs to demonstrate their functionality. To view and interact with the full set of visualizations, visit my “sandbox” demo site here.
The internship was the first of its kind, and I am honored to have taken part in it. One facet of the internship that set it apart from other learning experiences was the ability to work entirely online through collaboration tools such as Skype and Lync. Although the internship was based in Olympia, I was able to work comfortably from my home in Colorado. This gave me a higher degree of flexibility and allowed me to earn an income through another job at the same time.
The R packages most significant to this project are as follows:
- RSocrata: Provides easy interaction with Socrata open data. A dataset URL is passed as a parameter, and an R data frame is returned.
- dplyr: A “grammar of data manipulation”, dplyr is a very handy package when it comes to filtering, arranging, grouping, and analyzing data.
- reshape: Allows you to quickly restructure and aggregate data using only two functions: melt and cast.
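To give a sense of how these packages fit together, here is a minimal sketch rather than the actual project script. The data frame, its column names (`year`, `industry`, `n_claims`), and the portal URL in the comment are all hypothetical stand-ins. In a real run, the data frame would come from `RSocrata::read.socrata()` instead of being typed in by hand.

```r
library(dplyr)

# Hypothetical stand-in for the data frame a Socrata dataset would return.
# In the real workflow this would look something like (placeholder URL):
#   claims_df <- RSocrata::read.socrata("https://data.example.gov/resource/xxxx-xxxx.json")
claims_df <- data.frame(
  year     = c(2012, 2012, 2013, 2013),
  industry = c("Construction", "Retail", "Construction", "Retail"),
  n_claims = c(120, 80, 110, 95),
  stringsAsFactors = FALSE
)

# dplyr: filter rows, group by industry, total the claims, and sort.
by_industry <- claims_df %>%
  filter(year >= 2012) %>%
  group_by(industry) %>%
  summarise(total_claims = sum(n_claims)) %>%
  arrange(desc(total_claims))

# reshape: melt to long form, then cast to one column per year.
# The functions are called with the reshape:: prefix so that attaching
# the package does not mask dplyr functions such as rename().
molten <- reshape::melt(claims_df,
                        id.vars = c("industry", "year"),
                        measure.vars = "n_claims")
wide <- reshape::cast(molten, industry ~ year, fun.aggregate = sum)

print(by_industry)
print(wide)
```

Because the script starts from the dataset rather than from a local spreadsheet, rerunning it after the online data is updated should regenerate the same summaries from the new numbers.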
Ultimately, the intent of these displays was to help people come to a better understanding of Washington State, and it’s been exciting to see that come to fruition. I am glad that I was able to contribute to Washington’s efforts to make visualized data more accessible, not only to Washington State citizens but to people of all states and nationalities.