In these “strange times,” running has become a lifeline to the outdoors. It is one of the few legitimate excuses to venture outside of my efficiently-sized apartment. I started running in graduate school to manage stress and, even as my physical body continues to deteriorate, I continue to use running to shore up my mental stability. As the severity of the COVID-19 situation raises the stress floor across the nation, maintaining--or even developing--a simple running routine is restorative.
I use the Strava phone app to track my runs. This app records times and distance traveled, which are posted to a social-media-esque timeline for others to see. I chose this app after very little market research, but it seems to function well most of the time and is popular enough that many of my friends also use it. My favorite feature of the app is the post-run map. At the end of each session, it shows a little map built from the GPS coordinates collected throughout my jog.
This feature is not without its flaws. In 2018, Strava published a heatmap of all its users’ data, which included routes mapping overseas US military bases. Publishing your current location data is a huge operational security (OPSEC) violation. Strangers could easily identify your common routes and even get a good idea of where you live. I recommend updating your privacy settings to only show runs to confirmed friends.
With all that said, I wanted to create my own OPSEC-violating heatmap. Essentially, can I plot all of the routes that I have run in the past 18 months on a single map? Yes! Thanks to the regulations in Europe’s GDPR, many apps have made all your data available to you, the person who actually created the data. This includes Strava, which allows you to export your entire account. It is your data so you should have access to it.
If you use Strava, it is simple to download all of your information. Just log in to your account via a web browser, go to settings, then my account, and, under “Download or Delete Your Account,” select “Get Started.” Strava will email you a .zip folder with all of your information. This folder is chock full of all kinds of goodies, but the real nuggets are in the “activities” folder. Here you will find a list of files with 10-digit names, each one representing an activity. You did all of these!
These files are stored in the GPS Exchange (GPX) file format, which tracks your run as a sequence of points. The latitude and longitude points are coupled with both the time and elevation at that point. Strava uses this raw information to calculate all your run statistics! With this data an enterprising young developer could make their own run-tracking application.
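To make the format concrete, here is a minimal Python sketch (my own code is in R, so this is only an illustration) that reads the track points out of a GPX document and adds up the distance between them. The sample track is made up, the function names are hypothetical, and I am assuming the files use the standard GPX 1.1 namespace; check the root element of your own export.

```python
import math
import xml.etree.ElementTree as ET

# Standard GPX 1.1 namespace. Assumption: your exported files declare this one.
GPX_NS = {"gpx": "http://www.topografix.com/GPX/1/1"}

# A made-up three-point track, only for illustration.
SAMPLE_GPX = """<gpx version="1.1" creator="example" xmlns="http://www.topografix.com/GPX/1/1">
  <trk><trkseg>
    <trkpt lat="38.8893" lon="-77.0502"><ele>10.0</ele><time>2020-04-01T12:00:00Z</time></trkpt>
    <trkpt lat="38.8903" lon="-77.0502"><ele>11.0</ele><time>2020-04-01T12:00:40Z</time></trkpt>
    <trkpt lat="38.8913" lon="-77.0502"><ele>12.0</ele><time>2020-04-01T12:01:20Z</time></trkpt>
  </trkseg></trk>
</gpx>"""


def read_track_points(gpx_text):
    """Return one dict (lat, lon, ele, time) per <trkpt> in a GPX string."""
    root = ET.fromstring(gpx_text)
    points = []
    for trkpt in root.iterfind(".//gpx:trkpt", GPX_NS):
        ele = trkpt.find("gpx:ele", GPX_NS)
        time = trkpt.find("gpx:time", GPX_NS)
        points.append({
            "lat": float(trkpt.get("lat")),
            "lon": float(trkpt.get("lon")),
            # Elevation and time are optional in GPX, so guard against missing tags.
            "ele": float(ele.text) if ele is not None else None,
            "time": time.text if time is not None else None,
        })
    return points


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters on a spherical Earth (radius 6,371 km)."""
    radius_m = 6371000.0
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    return 2 * radius_m * math.asin(math.sqrt(a))


def track_length_m(points):
    """Sum the distances between consecutive track points."""
    return sum(
        haversine_m(a["lat"], a["lon"], b["lat"], b["lon"])
        for a, b in zip(points, points[1:])
    )


points = read_track_points(SAMPLE_GPX)
total_m = track_length_m(points)  # roughly 222 m for the sample track
```

Pace, elevation gain, and the other statistics would follow the same pattern: walk the points in order and compare each one with the next.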
But that’s not me. Instead, I am doing something much simpler: plotting all of the routes simultaneously on a single map. Here is what that looks like:
Again, this is a huge OPSEC violation so please do not be creepy. However, the routes are repetitive enough that it is not too revealing. Each red line represents a route that I ran. Each line is 80% transparent, so lighter pink lines were run less frequently than darker red lines. You can see that I run through East Potomac Park frequently. Massachusetts Avenue is a huge thoroughfare as well. I focused the map on the downtown Washington, D.C. area. I used the sp and OpenStreetMap packages in R for plotting.
The well-trodden paths on the map are not really surprising, but the map does give me some ideas for ways to expand my route repertoire. My runs are centered tightly around the National Mall. I need to give SW and NE DC a little more love. I should also do some runs in Rosslyn (but the hills) or try to head south towards the airport on the Virginia side of the river.
What did we learn from this exercise? Very little. This is an example of using a person’s own available data. What other websites also allow total data downloads? How can that data be visualized? Make yourself aware of where your data exists in the digital world and, if you can, use that data to learn something about your real world.
My R code is available on GitHub.
Note: Eagle-eyed readers may be able to identify a route where I walked across water. Is this an error or am I the Second Coming? Who can say?
Since 1973, the American Association for the Advancement of Science (AAAS) has facilitated the Science & Technology Policy fellowship (STPF). The goal of the program is to infuse scientific thinking into the political decision-making process and to develop a workforce that is knowledgeable in both policymaking and science. Intuitively, it makes sense to place evidence-focused scientists in the government to support key decision makers.
Each year, doctoral-level scientists are placed throughout the federal government for one- to two-year fellowships. Initially the program placed scientists exclusively in the Legislative Branch, but as the program grew, placements in the Executive Branch became more common. In 2019, hundreds of scientists were placed in 21 different agencies throughout the federal government. As one of those fellows, I wanted to create a Microsoft Excel-based directory of current fellows. However, what began as a project to develop a simple CSV file turned into a visual exploration of the historic and current composition of the AAAS STPF program. Below are some of my observations.
Data was collected from the publicly available Fellow Directory.
In the beginning of the STPF program, 100% of fellows were placed in the Legislative Branch. This continued until around 1980, when the first Executive Branch fellows were placed in the State Department, the Executive Office of the President (EOP), and the Environmental Protection Agency (EPA). In 1986, the number of Executive Branch fellows overtook the number of Legislative Branch fellows for the first time.
Since those initial Executive Branch placements, fellows have found homes in 43 different organizations. The U.S. Senate has had the largest total number of fellows while the U.S. Agency for International Development (USAID) is the Executive Branch agency that has had the most placements. Unfortunately, for the clarity of the figure, agencies with fewer than twenty total fellow placements were grouped into a single "other" category. Despite the mundane label, this category represents the strength and diversity of the AAAS STPF. The "other" category encompasses 25 different agencies including the Bureau of Labor Statistics, the World Bank, the Bill and Melinda Gates Foundation, and the RAND Corporation. In 2017, fellows were placed in 24 different organizations, the most diverse of any year.
The total number of fellows has dramatically increased over the past 45 years (as seen in the grey bar plot at the bottom of the figure). The initial cohort of congressional fellows in 1973 had just seven enterprising scientists. Compare that to 2013 when a total of 282 fellows were selected and placed. This year (2019) tied 2014 for the second highest number of placements with 268 fellows.
One of the most striking observations is the trend in placements at USAID. In 1982 USAID began to sponsor AAAS Executive Branch fellows, with one placement. Placements at USAID quickly grew, ballooning to over 50% of total fellow placements in 1992. However, just as rapidly, the placement fraction at USAID decreased during the 2000s despite only a small increase in the overall number of fellows. This trend ultimately began to reverse in 2010, and a large share of the increase in the total number of fellows found placement opportunities at USAID. The reader is left to craft their own explanatory narrative.
One thing is clear from the data: the AAAS STPF is as strong as it has ever been. Placement numbers are close to all-time highs and fellows are represented at a robust number of agencies. Only time will tell if the experience these fellows gain will help them achieve the program's mission "to develop and execute solutions to address societal challenges."
If you want to learn more about the history of the STPF, including statistics for each class, AAAS has an interactive timeline on their website.
An unexpected surprise during the analysis was the discovery that Dr. Rodney McKay and John Sheppard (both of Stargate Atlantis fame) were STP fellows. Or--more likely--the developer for the Fellows Directory was a fan of the show. Unfortunately, as a Canadian citizen, Dr. McKay would be ineligible for the AAAS STPF.
Recently at brunch someone made a statement about there being only one person with a PhD in the US House of Representatives. This did not seem probable to me and after some Googling, I found that the House Library conveniently maintains a list of doctoral degree holders in the 116th House.
Though there is only one hard science PhD in the House (Bill Foster, D-IL; Physics), there are other STEM doctorate holders in the House, including two psychologists, a mathematician, and a monogastric nutritionist. There are also quite a few other doctorate holders, most of whom are in political science (obviously), but also a Doctor of Ministry from Alabama (Guess the political party!).
Overall, 21 doctorate holders is a small fraction of the House (only 4.8%), especially compared to the 157 members who are lawyers. Given the wide-reaching and technical nature of the government and the laws that regulate it, it may be advantageous to increase the number of scientists represented in Congress. While that is a decision ultimately for each state's voters, there are a number of programs aimed at increasing the involvement of scientists in government policy.
As an infographic-making exercise I would consider this a mixed success. I think it conveys the information effectively, but lacks a certain je ne sais quoi in the aesthetics department. My little emoji heads especially could use some work. Any graphic designers out there, please reach out with tips.
The House Library maintains lists of lawyers, military service members, medical professionals, as well as other specialties in their membership profile. I am going to download these lists as a baseline for the analysis of future Congresses.
A little over halfway through the year and the US Food and Drug Administration (FDA) appears to be on track for either a big year of new drug approvals or... not. The number of new molecular entities (NMEs) approved by FDA's Center for Drug Evaluation and Research (CDER) is equal to the number approved at this point of the year in 2016 and only two product approvals behind both 2018 and 2015. Despite starting the year off with the longest federal shutdown in history, the FDA is keeping pace with past years.
However, the figure demonstrates another important fact: approval numbers mid-year do not correlate strongly with year-end approvals. While the number of mid-year approvals was similar in 2016 and 2018, the year-end totals were wildly different. In 2018, CDER approved a record 59 NMEs, while in 2016 it approved fewer than half of that number. Additionally, in 2017, the number of NME approvals at mid-year was much higher than in any other year, but the year finished in line with the number of approvals in 2015 and well below the number of approvals in 2018.
It seems that the future could go either way. There could be a dramatic uptick in the CDER approval rate as in 2018 (perhaps from shutdown-delayed applications) or the rate could slow to a crawl like in 2016.
Remember Super Bowl LI you guys? It happened, at minimum, five days ago and of course Tom Brady won what was actually one of the best Super Bowls in recent memory. Football, however, is only one half of the Super Bowl Sunday coin. The other half is the 60 second celebrations of capitalism: the Super Bowl Commercials. Everyone has a list of favorites. Forbes has a list. Cracked has a video. But it is no longer politically correct in this Great country to hand out participation trophies, so someone needs to decide who actually won the Advertisement Game.
To tackle (AHAHA) this question I turned to the infinite online data repository, Google Trends, which tracks online search traffic. Using a list of commercials compiled during the game (AKA I got zero bathroom breaks) I downloaded the relative search volume in the United States for each company/product relative to the first commercial I saw for Google Home. [Author’s note: Only commercials shown in Nebraska, before the 4th quarter when my stream was cut, are included]. Here’s an example of what that looked like:
The search traffic for a product instantly increased when a commercial was shown! You can see exactly in which hour a commercial was shown based on the traffic spike. Using the traffic spike as ground zero, I added up search traffic 24 hours prior to and after the commercial to see if the ad significantly increased the public’s interest in the product.
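For the curious, the before-and-after tally can be sketched in a few lines of Python. This is a rough illustration, not the script I actually ran: the hourly numbers are invented, the function names are hypothetical, and I am assuming that "percent of traffic after" means the after-total divided by the combined before-and-after total, with the spike hour itself left out of both.

```python
def before_after_totals(hourly, window=24):
    """Locate the peak hour and sum the traffic in the windows on either side of it.

    hourly: relative search volume, one value per hour, oldest first.
    Returns (peak_index, total_before, total_after). The peak hour is excluded.
    """
    # Index of the largest value; ties resolve to the earliest hour.
    peak = max(range(len(hourly)), key=hourly.__getitem__)
    before = sum(hourly[max(0, peak - window):peak])
    after = sum(hourly[peak + 1:peak + 1 + window])
    return peak, before, after


def percent_after(before, after):
    """Share of the combined traffic that arrived after the peak, in percent."""
    total = before + after
    return 100.0 * after / total if total else float("nan")


# Invented series: a quiet day, a one-hour spike, then a slightly busier day.
hourly = [2] * 24 + [100] + [5] * 24
peak, before, after = before_after_totals(hourly)
share = percent_after(before, after)  # above 50 means more traffic after than before
```

Under this definition a product with a share above 50% gained interest after its ad, and one below 50% lost it.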
Below is a plot of each commercial, with the percent of search traffic after the commercial on the vertical axis and the highest peak search volume on the horizontal. If you look closely you will see that some of them are labeled. If a point is below the dotted line the product had less search traffic after the commercial than before (not good).
On average 86% of products had more traffic after their Super Bowl ad than before it. But there are no participation trophies in the world of marketing and the clear winner is 84 Lumber. Damn. They are really in a league of their own (another sports reference!). Almost no one was searching for them before the Super Bowl but oh boy was everyone searching for them afterwards. They used the ole only-show-half-of-a-commercial trick where you need to see what happens next but can only do that by going to their website. Turns out it’s a construction supplies company.
Pizza Hut had a pretty large spike during their commercial, but it actually was not their largest search volume of the night. Turns out most people are searching for pizza BEFORE the Super Bowl. Stranger Things 2 also drew a lot of searches for obvious reasons. We all love making small children face existential Lovecraftian horrors.
Other people loved the tightly-clad white knight Mr. Clean and his sensual mopping moves. The Fate of the Furious commercial drew lots of searches, most likely of people trying to decipher WTF the plot is about. Finally there was the lovable Avocados from Mexico commercial. No one was searching for Avocados from Mexico before the Super Bowl, but now, like, a couple of people are searching for them. Win.
So congratulations 84 Lumber on your victory in the Advertisement Game. I’m sure this will set a dangerous precedent for the half-ads in Super Bowl LII.
It’s possible to find play-by-play win probability graphs for every NFL game, but that does not tell me much about how the game itself was played. Additionally, I only sporadically have time to actually WATCH a game so using play-by-play data, R, and Inkscape I threw together this visualization of every play in this past Sunday’s game between the Kansas City Chiefs and New Orleans Saints. Why isn’t this done more often?
The Department of Pharmacology at Creighton University School of Medicine is small, but mighty. There are only 10 professors or principal investigators (PIs) in the department, but this small size has its advantages. Or at least that is what we tell ourselves. A recent paper in Nature argued that bigger is not always better when it comes to labs and we are putting that to the test. Ideally with a smaller faculty, there would be more collaboration. Everyone knows what everyone else is doing, more or less, so they can more efficiently leverage the various expertise found throughout the department.
To measure how interconnected the pharmacology department was I created a network analysis visualization based on who published with whom. Using NCBI’s FLink tool I downloaded a list of the publications in the PubMed database for each PI in the pharmacology department at CU. A quick script in R formatted the authors and created a two-column “edge list” for each author, basically a list of every connection. This was imported into the free, open-source network analysis program Gephi which crunched the numbers and produced a stunning map of the connections in the pharmacology dept:
Gephi automatically determines similar clusters (seen as different colors) which are unsurprisingly centered on the various PIs in the department since those are the publications I was looking at. Dr. Murray, the department chair, has the most connections, also known as the highest degree, at 292, followed by Dr. Abel. Drs. Dravid and Scofield are ranked 2nd and 3rd respectively for betweenness centrality, after Dr. Murray. They are the gatekeepers that connect Drs. Abel, Bockman, and Tu to Dr. Murray. Each point’s size is proportional to its eigenvector centrality, similar to Google’s PageRank metric of importance.
I was a bit surprised at how dispersed the department was. 60% of the PIs could be connected, and many have strong relationships. However, the rest are floating on their own islands. Dr. Oldenburg is relatively new so this is not surprising. The Simeones (who are married) are closely connected. Also unsurprising.
This was a quick and dirty analysis and a few of the finer points slipped through the cracks. Some of the names are common in PubMed (especially Tu), so I did my best to filter what was there and only look at publications affiliated with Creighton. Unfortunately this filters out publications from other institutions by the same author. Also not everyone is attributed the same way on every manuscript. This is especially true for Drs. KA Simeone and Gelineau-Van Waes who have published under different last names, but also because sometimes a middle name is given and sometimes it is omitted. I tried my best to standardize the spellings for each PI, but with over 700 nodes I could not double check every author to ensure there were not duplicates elsewhere. If more than one PI shows up on a paper, that paper may show up under both searches. This should not increase the number of edges, but would affect the “strength” of those connections.
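The edge list and the degree count are simple enough to sketch by hand. The Python below is an illustration only (my formatting script was in R, and Gephi did the real number crunching): the author names are placeholders, the function names are hypothetical, and the Source/Target/Weight header is the column layout I am assuming for a spreadsheet-style import.

```python
import csv
import io
from collections import Counter, defaultdict
from itertools import combinations


def build_edge_list(papers):
    """Count co-authorships: one weighted edge per unordered pair of authors.

    papers: a list of author lists, one list per publication.
    """
    edges = Counter()
    for authors in papers:
        # Sorting makes ("A", "B") and ("B", "A") collapse to the same key.
        for a, b in combinations(sorted(set(authors)), 2):
            edges[(a, b)] += 1
    return edges


def degree(edges):
    """Number of distinct co-authors per person (the node's degree)."""
    neighbors = defaultdict(set)
    for a, b in edges:
        neighbors[a].add(b)
        neighbors[b].add(a)
    return {name: len(others) for name, others in neighbors.items()}


def to_csv(edges):
    """Serialize the edge list as CSV text with a Source,Target,Weight header."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["Source", "Target", "Weight"])
    for (a, b), weight in sorted(edges.items()):
        writer.writerow([a, b, weight])
    return buffer.getvalue()


# Placeholder publications, not real data.
papers = [
    ["PI One", "Coauthor X", "Coauthor Y"],
    ["PI One", "Coauthor X"],
    ["PI Two"],  # a single-author paper contributes no edges
]
edges = build_edge_list(papers)
degrees = degree(edges)
```

Note that a paper found under two different PI searches would be counted twice here, which inflates the edge weight but not the number of edges or the degree.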
The connections are about what I had imagined. The brain people are on one side, everyone else is on the other. Expanding the search to include the papers from coauthors outside of the pharmacology department might discover more interesting connections. Just for fun I went ahead and pulled the data for every paper on PubMed with a Creighton affiliation. I could not even find my department on the visualization without searching for it. It is massive. The breadth of Creighton’s interconnected-ness forces me to marvel at how vast the community of scientists must truly be. So many people working to improve the body of knowledge of the human race. We are really just small bacteria in a very large petri dish.
I don’t have cable. So I did not get the chance to watch the Grammys this year. I was, however, happy to hear that Taylor Swift won the Grammy for Album of the Year for 1989 (since I recently wrote a post about how great she is). When I was writing the aforementioned post I did notice that she was nominated, but I felt pretty confident the National Academy of Recording Arts and Sciences would give it to Kendrick Lamar’s To Pimp a Butterfly. This is Swift’s second Grammy for Album of the Year (she also won for Fearless as we all know).
Since the data have already been scraped from the Billboard Hot 100, I might as well get some mileage out of them. For each week since November of 2014 (around when 1989 and To Pimp a Butterfly were released) I assigned any song by any of the five artists nominated for Album of the Year a point value from 1 to 100 based on its position in the Hot 100. Songs ranked number 1 were given 100 points, songs ranked 100 were given 1 point, et cetera. Then for each artist I added up the point values for each week, the results of which you can see below:
Notice someone missing from this visual? None of the songs from the Alabama Shakes’ album Sound & Color made it to the Hot 100. This is despite the fact that their album was on the Billboard 200 for album sales for 26 weeks, peaking at number 1. Chris Stapleton has a little purple blip around December, seven months after Traveller was released. Kendrick makes it on here and there, but the graph is clearly dominated by Taylor Swift and The Weeknd. The Weeknd has by far the highest peaks, but Taylor proves her popularity with the largest total area under the curve, 13,676 “points” vs. The Weeknd’s 11,156 “points”. TSwift also has the highest average per week, though not by much.
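The scoring scheme described above boils down to "101 minus the chart position, summed per artist per week." Here is a small Python sketch of that idea; the chart entries are placeholders rather than real Billboard data, and the function names are hypothetical.

```python
from collections import defaultdict


def chart_points(position):
    """Convert a Hot 100 position to points: #1 earns 100, #100 earns 1."""
    if not 1 <= position <= 100:
        raise ValueError("position must be between 1 and 100")
    return 101 - position


def weekly_artist_points(entries):
    """Sum points per (week, artist).

    entries: iterable of (week, artist, position) tuples, one per charting song.
    """
    totals = defaultdict(int)
    for week, artist, position in entries:
        totals[(week, artist)] += chart_points(position)
    return dict(totals)


def total_by_artist(weekly_totals):
    """Collapse the weekly totals into one number per artist (the area under the curve)."""
    totals = defaultdict(int)
    for (_week, artist), points in weekly_totals.items():
        totals[artist] += points
    return dict(totals)


# Placeholder entries, not real chart data.
entries = [
    ("week 1", "Artist A", 1),
    ("week 1", "Artist A", 40),
    ("week 1", "Artist B", 100),
    ("week 2", "Artist A", 3),
]
weekly = weekly_artist_points(entries)
overall = total_by_artist(weekly)
```

An artist with no songs on the chart in a given week simply has no entry for that week, which is why the Alabama Shakes never appear.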
Fun fact: Swift’s “Bad Blood” was minimally successful until she added some bars by Kendrick Lamar...which went on to win the Grammy for Best Music Video. Does that say more about Taylor or Kendrick?
We can debate whether song popularity should be the metric by which we measure the value of an album. Obviously a lot of people thought Sound & Color was a world-class album despite its absence from the Hot 100. In fact the National Academy of Recording Arts and Sciences insists that Album of the Year is to “honor artistic achievement, technical proficiency and overall excellence in the recording industry, without regard to album sales or chart position.” However, of those nominated this year, they did pick the one with the best album sales and chart position.
“If you're horrible to me, I'm going to write a song about it, and you won't like it. That's how I operate.” – Taylor Swift
Taylor Swift. Love her, hate her, love her, or love her. There really is not much of a choice as everyone seems to be constantly showering Ms. Swift with praise. While her fans are eager to show Tay some love she also has no problem sharing that love with others. As Gawker points out, she has dated what appears to be every man in the universe.
Regardless of Taylor Allison Swift’s extracurricular activities you cannot deny that she is a powerhouse in the music industry. Of Taylor’s five studio albums, three have sold over a million copies in their first week making Taylor the first female artist to have done so. That is remarkable considering that I cannot recall the last time I actually bought any music. John Lennon famously claimed that The Beatles were “more popular than Jesus.” TSwift is not quite that popular, but she is close according to Google Search traffic:
Whether Taylor or The Beatles before her, these musicians are having a large impact on the lives of the kids. Perhaps an impact comparable to The JC. (Justin Bieber actually IS a more popular search term than Jesus over the same time span, which should be worrisome.)
Being popular is nothing new for TSwift and we can measure that popularity using the charts produced by Billboard. Taylor first appeared on the Billboard Hot 100 on September 23, 2006 with her song “Tim McGraw” off of her debut album Taylor Swift. Billboard started tracking the hits of the day on its list Best Sellers in Stores way back in 1939. Shortly thereafter it started publishing two other lists, Most Played by Jockeys and Most Played in Jukeboxes. In 1958 Billboard published its magnum opus, the Hot 100, which coalesced these various charts into a single definitive ranking across genres.
Today the Hot 100 is composed of three components: sales (35-45%), airplay (30-40%) and streaming (20-30%) which can vary weekly in order to hit the target of 100 songs. The Streaming Songs chart is the newest and began in 2007 with data from AOL and Yahoo. It expanded to include services such as Spotify in January of 2013 and one month later added YouTube views. The Digital Songs chart tracks sales data for digital downloads while the Radio Music chart tracks radio airplay audience impressions. Both charts are generated from data furnished by Nielsen Music. Additionally Billboard publishes its YouTube data as a separate chart on its website.
The Billboard charts provide a rich, publicly available data set. Using the tools over at Kimono I scraped the available data from the Hot 100, Radio Music, Digital Songs, Streaming Songs and YouTube charts since 2000. Now that I have data on hundreds of artists over the last one and a half decades I feel like anything is possible. Taylor Swift is our test case because (1) she is very popular, (2) many of her songs make it onto the Hot 100, and (3) her most recent album, 1989, was released after Spotify and YouTube were included on the Streaming Songs chart. Here is what 1989’s travel through the Hot 100 looks like:
Yes it’s a bunch of lines and dots, but they tell some interesting stories. Compare the difference between “Shake It Off” and its debut at the #1 spot with the slow methodical rise of “Style.” Or the resurgence of “Bad Blood” after the addition of Kendrick Lamar and the release of its star-studded music video. Overall, nine of the 15 songs from Swift’s fifth album appeared on the Hot 100. At first it seemed odd that Taylor’s songs would drop off the chart once they dropped to position #50. Was this some brilliant marketing strategy? Is it not worth promoting a song anymore after it has fallen so far? Unfortunately the answer is much more mundane. Billboard labels a song as “recurrent” and removes it once it has spent 20 weeks on the chart and has fallen below the 50th position.
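The recurrent rule explains the abrupt exits, and it is easy to express in code. The sketch below is a guess at the rule as I understand it, not Billboard's official logic: I am assuming a song is dropped in the first week where it has been charting for more than 20 weeks and sits below position 50, and the function name and example run are made up.

```python
def week_marked_recurrent(positions, min_weeks=20, cutoff=50):
    """Return the 1-based chart week in which a song would be dropped, or None.

    positions: the song's chart position for each consecutive week on the chart.
    Assumption: a song is dropped once it has charted for more than `min_weeks`
    weeks and its position is numerically greater than `cutoff`.
    """
    for week, position in enumerate(positions, start=1):
        if week > min_weeks and position > cutoff:
            return week
    return None


# Made-up chart run: ten weeks at #1, ten weeks at #30, then a slow slide.
run = [1] * 10 + [30] * 10 + [45, 52, 60]
dropped_in = week_marked_recurrent(run)  # week 22, the first week below #50 after week 20

# A new song that debuts low is not affected by the rule.
newcomer = week_marked_recurrent([60, 55, 48, 40])  # None
```

On the chart this looks like a line that simply stops around position 50, which is exactly the pattern in Taylor's older singles.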
In November 2014 Swift pulled her music from the popular streaming site Spotify. This severely crippled her ability to gain on the charts through the streaming component of the Hot 100. TSwift did agree to have her music featured on Apple Music’s new service which launched June 30, 2015 after some controversy regarding its 3-month free trial period. Her ability to sway the actions of the second most valuable public company legitimizes her powerhouse status and further validates all of this work.
Swift can make up for her refusal to stream by dominating on another “streaming” service, YouTube. Below is a breakdown of her trends on each of the Billboard charts I analyzed. While the Hot 100 ranks songs 1-100, the other charts only display the top 25 songs in each category.
(NB physical sales of a CD are also a component of the Hot 100 ranking, but are not included above.)
Some takeaways: Each song had a huge increase in digital downloads when 1989 was released at the end of October, but only “Blank Space” was able to maintain that (“Shake It Off” was released prior to 1989). However, the increase in digital downloads was enough for both “Bad Blood” and “Style” to make it onto the Hot 100. YouTube views are almost always above streaming views. This makes sense since Taylor did not allow streaming on other services. If “Shake It Off” and “Blank Space” had not been labeled as recurrent they would probably still be on the Hot 100 based on their YouTube views which are very persistent. YouTube views were a leading indicator for “Bad Blood.” It made it to the top 5 of the YouTube chart a week before charting on digital downloads, the Hot 100, or radio songs. It would appear that TSwift’s mastery of music videos can single-handedly make a single.
Finally I calculated the area under the curve (AUC) for each of Taylor’s songs off of 1989 and compared it to the number of YouTube views for each song. The AUC was calculated by assigning a 100 point value to position 1, 99 points to position 2, etc., and multiplying by a song’s weeks at that position. Without access to other streaming services like Spotify or Tidal, and given the visual correlation between the YouTube and Streaming Songs charts, I expected a strong relationship between YouTube views and success on the charts. The regression for TSwift was statistically significant. However, Justin Bieber, who does not have the same qualms about distributing his music on Spotify, had an even stronger correlation. This might have something to do with the fact that J-Biebs has a music video for every song off of his new album (although many of them have relatively few views) while Taylor only has videos for six of her songs. Would more music videos add buoyancy to some of her other songs? I suspect that it would.
I should quickly note that analyzing the Billboard Hot 100 is not a novel idea. On his blog Modern Insights (and minor observations) Michael Kling has plotted similar analyses looking at the rise and fall of artists and songs in the Billboard Hot 100. Also very recently Cristian Cibils from Stanford University published a brief paper where he used machine learning to predict a song’s future Hot 100 trajectory based on past chart performance. I would recommend you check their posts out. Next Big Sound is a company that tracks streaming and social media to provide unique insights into what makes songs popular and actually makes money doing it.
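Both halves of that comparison are short calculations. The Python below is a sketch under my own assumptions: the AUC is the sum of (101 minus position) over a song's weeks on the chart, the relationship is summarized with a plain Pearson correlation rather than the full regression, and every number is invented for illustration.

```python
import math


def chart_auc(weekly_positions):
    """Area under the curve for one song: 100 points for a week at #1, 99 at #2, and so on."""
    return sum(101 - position for position in weekly_positions)


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of the same length, at least 2 long")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Invented chart runs and view counts (in millions), one entry per song.
chart_runs = {
    "Song A": [1, 1, 2, 5, 9],
    "Song B": [40, 35, 30, 45],
    "Song C": [90, 85],
}
views_millions = {"Song A": 900, "Song B": 300, "Song C": 40}

songs = sorted(chart_runs)
aucs = [chart_auc(chart_runs[s]) for s in songs]
views = [views_millions[s] for s in songs]
r = pearson_r(views, aucs)  # close to 1 when views and chart success move together
```

Spending two weeks at the same position is the same as adding that position's points twice, so summing week by week matches the "points times weeks at that position" description.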
This is just a quick overview of some of the interesting features that can be investigated using data from Billboard. Full disclosure: I did most of this as a part of WNYC’s Note to Self’s Infomagical week. Tweet me which other artists you think might have interesting stories to uncover. Taylor Swift is here to stay, and while she is here she will continue to win over our hearts and most of the Grammys.
“En un agujero en el suelo, vivía un hobbit” (“In a hole in the ground there lived a hobbit”)
Did you ever get your report card and have to masterfully change some of those F’s to B’s on the bus ride home? I doubt that this has ever successfully fooled anyone, but this movie trope illustrates that report cards can be terrifying. They do, however, provide a useful piece of information: feedback. Without feedback, it is difficult to tell if learning is happening. Am I closer to mastery now than I was when I started?
Well, over a year ago I travelled to Mexico to visit my good friend Dave in Chiapas. I had worked my way through Duolingo and even did Rosetta Stone for a bit, so while I would not have considered myself fluent, I figured I could get by. This assumption was quickly smashed against the rocks of reality at Customs and I spent the remainder of the trip staring blankly at anyone who had the misfortune of attempting to speak to me.
The trip was still a blast, but I made a commitment to improve my Spanish over the next year. In the Spanish language section of Barnes & Noble I picked up a copy of El Hobbit by the legendary J.R.R. Tolkien. Maybe you’ve heard of it. I had been wanting to reread it for a while so why not try in Spanish. My goal was to read one page per day and finish the book within a year. At 283 pages, this seemed more than achievable.
Just last week I finally learned how the story ended (happily ever after), but I had no way of knowing if I actually improved my Spanish at all. There was no feedback.
There was one piece of information that I could leverage to my advantage: the number of words I translated per page. In an effort to actually retain some information I would write the translation of words or phrases I did not know over the words in the book. So I counted these up for each page. You can see the breakdown of the page totals sorted by chapter in the graph below. The average for each chapter is shown in red.
Overall the chapters had a statistically significant downward trend! I may have learned something after all. This works out to be about 2/5 of a word improvement per chapter on average, from an average of 20 words per page in chapter 1 to just under 12 words per page in chapter 19.
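For anyone who wants to grade their own homework, the trend is just the slope of a least-squares line through the chapter averages. The Python sketch below uses an idealized straight line between the two endpoints mentioned above (20 words per page in chapter 1, 12 in chapter 19), not my actual page counts, and the function name is hypothetical.

```python
def ols_slope_intercept(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of the same length, at least 2 long")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Idealized chapter averages: a straight line from 20 (chapter 1) to 12 (chapter 19).
chapters = list(range(1, 20))
averages = [20 - (8 / 18) * (chapter - 1) for chapter in chapters]

slope, intercept = ols_slope_intercept(chapters, averages)
# slope is about -0.44 words per page per chapter, i.e. roughly 2/5 of a word
```

Real page counts are much noisier than this straight line, which is why the significance test matters more than the slope itself.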
I am still far from being a fluent speaker. In fact this did not teach me anything about speaking Spanish. But at least I have a report card to show my parents.
P.S. When I told my dad about this he responded, “Son, you’re a huge nerd.” Thanks pops.