Continuing with the theme of plotting choropleths, here is an interesting problem I encountered. The details are below, but the end product is a static visualization like the one shown here ... Out Of Reach: National Low Income Housing Coalition.
I recently came across a common problem: visualizing simple state-level data captured in PDFs as a choropleth. The data is from the following well-researched report on housing data.
I need the data extracted from the PDF page imaged below.
I found that tabulapdf/tabula works quite well for extracting the data. Yeah! Even on a Windows machine!
Now that we have the data, let us create the state choropleth.
Creating state-level choropleths is well explained here, and it is easy and cool!
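For readers who want the overall shape of that step, here is a minimal sketch. It assumes the choroplethr package (which is where state.regions appears to come from) and its state_choropleth function, which expects a data frame with a region column of lower-case state names and a value column. The values below are placeholders, not the housing data.

```r
# Placeholder data: one row per state plus DC, with made-up values.
df <- data.frame(
  region = c(tolower(state.name), "district of columbia"),
  value = seq_len(51),
  stringsAsFactors = FALSE
)

# Only draw the map if choroplethr is installed (assumption: this is the
# mapping package in use).
if (requireNamespace("choroplethr", quietly = TRUE)) {
  library(choroplethr)
  print(state_choropleth(df, title = "Placeholder values", legend = "Value"))
}
```

The plotting call is guarded so the data-frame part still runs on a machine without the package.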
However, here is the interesting problem. The data in state.regions does not match exactly with the dataset we have at hand.
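A quick way to see which names fail to line up is a set difference. The sketch below uses made-up extracted names and, as a stand-in for the region column of state.regions, base R's state.name in lower case plus the District of Columbia. Both are assumptions for illustration.

```r
# Hypothetical names as they might come out of the PDF extraction.
extracted <- c("alabama", "dist. of columbia", "new york", "n. carolina")

# Stand-in for the reference names (assumption: lower-case full names).
standard_states <- c(tolower(state.name), "district of columbia")

# Names in the extracted data with no exact counterpart in the reference.
unmatched <- setdiff(extracted, standard_states)
print(unmatched)
```

Anything listed in unmatched needs either cleaning or fuzzy matching before it can be mapped.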
So, instead of manually correcting the data until it matches, let us try an algorithmic approach.
R packages that seem to be available to accomplish this task are:
- markvanderloo/stringdist. Well explained at Approximate text matching with the stringdist package
- R: String Metrics
Replacing patterns in the data worked quite well using gsub, explained here.
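As an illustration of that kind of cleanup, the sketch below normalizes a few made-up names with base R only. The specific patterns (stray footnote markers, repeated spaces) are assumptions about what PDF extraction tends to leave behind, not the exact replacements used for this dataset.

```r
# Hypothetical raw names with typical extraction debris.
raw <- c("Alabama ", "New  York", "District of Columbia1", "HAWAII*")

clean <- tolower(raw)
clean <- gsub("[^a-z ]", "", clean)  # drop digits, asterisks, punctuation
clean <- gsub(" +", " ", clean)      # collapse runs of spaces
clean <- trimws(clean)               # strip leading and trailing spaces

print(clean)
```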
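Using fuzzy matching, I was able to match all the states using this pretty neat bit of code:

    library(data.table)   # dt is a data.table; := adds columns by reference
    library(stringdist)

    # Index of the closest standard name within an edit distance of 5 (NA if none)
    dt[, state_num := amatch(State1, standard_states, maxDist = 5)]
    # Look up the matched name and use it as the region column for the map
    dt[, state_final := standard_states[state_num]]
    dt[, region := state_final]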
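A maxDist of 5 is fairly generous, so it is worth checking what each name was actually matched to. The following is a small sketch with made-up input names and a stand-in reference list (base R's state.name in lower case plus the District of Columbia). It uses amatch and stringdist from the stringdist package to list each match alongside its distance.

```r
library(stringdist)

# Stand-in reference list and hypothetical misspelled inputs (assumptions).
standard_states <- c(tolower(state.name), "district of columbia")
raw_names <- c("alabma", "new yrok", "dist of columbia", "puerto rico")

# Closest reference name within the allowed distance, NA when none qualifies.
idx <- amatch(raw_names, standard_states, maxDist = 5)
result <- data.frame(
  raw = raw_names,
  matched = standard_states[idx],
  stringsAsFactors = FALSE
)

# Distance between each raw name and its match (NA where nothing matched).
result$distance <- stringdist(result$raw, result$matched)

# Largest distances first: these are the matches most worth eyeballing.
print(result[order(-result$distance), ])
```

The last input is deliberately not a state, so it shows how a non-matching entry surfaces in the table. Rows with an NA match or a large distance are the ones to fix by hand.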
With those changes, the final map is shown below.