Showing posts with label visualization. Show all posts

Saturday, February 27, 2016

Scraping crowd-sourced shake reports to produce a cumulative shake map for Oklahoma earthquakes

Last year, Oklahoma had more earthquakes than ever before. In 2015, the Oklahoma Geological Survey (OGS) counted 5,691 earthquakes[1] centered in the state. That’s 270 more quakes than Oklahoma experienced in 2014.

Along with more reports of earthquakes came more reports of earthquake damage[2]. In one of the worst earthquake swarms of 2015, a chimney was torn from a house in Edmond[3], and an exterior wall of bricks came tumbling down from an apartment complex in northeast Oklahoma City[4].

Much has been learned since Oklahoma's earthquake surge began in 2009.

Scientists now link these earthquakes to the injection of wastewater into deep disposal wells[5]. Water exists naturally in the earth along with oil and gas deposits, and when the oil and gas are drawn from the earth, the water comes with them. This water is separated from the oil and gas and is disposed of in deep wells. Because these quakes are caused by human activities, they are known as “induced earthquakes.”[6]

Many questions remain, however. Namely, what are the long-term effects of having so many small earthquakes so frequently? And how is it possible to compare the impact of these quakes across Oklahoma? The United States Geological Survey produces damage estimates automatically after significant earthquakes, but it does not produce damage estimates for swarms of smaller earthquakes, which may last for months or years.
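One way to compare the impact of small, frequent quakes is to accumulate the crowd-sourced reports themselves: bin each report onto a lat/lon grid and sum the reported intensities per cell. A minimal sketch, assuming each scraped report has already been reduced to a (lat, lon, intensity) tuple; the function name and cell size are illustrative:

```python
import math
from collections import defaultdict

def accumulate_reports(reports, cell_deg=0.1):
    """Bin crowd-sourced shake reports onto a lat/lon grid and sum the
    reported intensity in each cell, producing the raw values behind a
    cumulative shake map.

    reports  -- iterable of (lat, lon, intensity) tuples
    cell_deg -- grid cell size in degrees
    Returns {(lat_index, lon_index): cumulative_intensity}.
    """
    grid = defaultdict(float)
    for lat, lon, intensity in reports:
        # Integer grid indices; math.floor keeps southern/western
        # hemispheres consistent (floor, not truncation toward zero).
        cell = (math.floor(lat / cell_deg), math.floor(lon / cell_deg))
        grid[cell] += intensity
    return dict(grid)
```

A map layer would then color each cell by its cumulative value, so reports from many small quakes pile up in the same cells over months or years.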

Monday, December 19, 2011

A breakthrough in data visualization, what it means for data journalism, predicting the news

Earlier this month, the National Science Foundation announced a new system to help researchers make sense of stores of scientific papers, and potentially find the “next big thing.”

The Action Science Explorer, or ASE, developed jointly by University of Michigan and University of Maryland faculty, takes a difficult cognitive task -- backtracking through paper citations to identify a breakthrough -- and “offloads” it to the much easier task of perceiving density in network visualizations. In other words, it takes mounds of difficult-to-digest research and uses social network analysis techniques and graphing to make the information immediately recognizable.

The ASE visually represents papers and concepts as they appear over time, identifies the moment where fields branched out and flourished, and also finds moments where other research became obsolete or lost. It also identifies emerging fields of study:

“Users can quickly appreciate the strength of relationships between groups of papers and see bridging papers that bring together established fields. Even more potent for those studying emerging fields is the capacity to explore an evolutionary visualization using a temporal slider. Temporal visualizations can show the appearance of an initial paper, the gradual increase in papers that cite it, and sometimes the explosion of activity for ‘hot’ topics. Other temporal phenomena are the bridging of communities, fracturing of research topics, and sometimes the demise of a hypotheses.”
(from the ASE tech report)


The ASE researchers say this software has potential in the fields of linguistics, biology and sociology, writing “Both students and educators must have access to accurate surveys of previous work, ranging from short summaries to in-depth historical notes. Government decision-makers must learn about different scientific fields to determine funding priorities.”

But suppose data journalists use similar tools to analyze legislation over time, to forecast future bills and political alliances. Clusters would indicate where certain provisions failed, where lobbyists and special interests had influenced legislation the most, and possibly how those interests would proceed in the future. Instead of conducting reactionary reporting, or relying on too-late intelligence that lets legislation slip through unnoticed, reporters could use the system to help guide questions and investigations.

In September, computer scientist Kalev Leetaru here on the University of Illinois campus did something just as remarkable. He compiled more than 100 million media reports, text-mined and crunched them in a supercomputer, and was able to chart and even predict the instability in Libya and Egypt.

Impressively, Leetaru was also able to use those news reports to estimate the location of al-Qaeda leader Osama bin Laden to within 200 km. From BBC News, which reported on Leetaru’s research:
The computer event analysis model appears to give forewarning of major events, based on deteriorating sentiment.
However, in the case of this study, its analysis is applied to things that have already happened.
According to Kalev Leetaru, such a system could easily be adapted to work in real time, giving an element of foresight.
"That's the next stage," said Mr Leetaru, who is already working on developing the technology.
"It looks like a stock ticker in many regards and you know what direction it has been heading the last few minutes and you want to know where it is heading in the next few."
“Predictive reporting” or “news forecasting” could prove invaluable to digital newsrooms, where seconds mean the difference between breaking the news and just being one of the reporting mob. And if news agencies work on integrating advances in computer and information science into the office, instead of just reporting on them, it could enhance reporting across the entire organization.

Wednesday, November 30, 2011

Being a More Versatile Journalist: Data Journalism Veteran Steve Doig Wants Journalists to Know Statistics

Aerial photograph of the devastation from Hurricane Andrew in 1992. Steve Doig, who was a reporter for the Miami Herald at the time, used his data journalism chops to survey the damage and write a Pulitzer Prize-winning exposé on construction malpractice. Earlier this year, I asked him what aspiring data journalists should be learning.

I cringe when bloggers begin a post by apologizing to readers for a lack of updates. This is partly because most people do, or should, understand that the gig doesn’t pay. But mostly, every word you waste on explaining your absence is one more chance for a reader to lose interest and go somewhere else. So I’ll just say it’s been an eventful couple of months, and tell you why it’s actually relevant to this blog.

Having just finished a master’s in journalism at the University of Illinois, I was extremely lucky to find a National Science Foundation grant that is training better K-12 science teachers.

At the grant, we do this by teaching lessons in entrepreneurial leadership to science teachers. That translates into experiences like students constructing their own spectrophotometers, or high school students manufacturing their own biofuel, or even collaborations where high school students set up demonstrations on electricity for grade school students to work through.

It’s a radical but practical approach that aims to improve the nation’s competitiveness in science teaching. In January, results from the National Assessment of Educational Progress, known as the Nation’s Report Card, showed that 47 percent of all high school seniors in the country are deficient in the sciences.

Why would an NSF grant want a journalist? For one, I understood their language: as a former mechanical engineering undergraduate, I had taken chemistry, physics, calculus, and statistics courses. Second, they wanted someone experienced in conducting interviews (i.e., collecting data) and translating the information into an easily digestible form (i.e., not only helping write reports for the NSF but also writing for public dissemination).

That was all they were looking for initially, until I mentioned I had worked with NodeXL, a template that turns Microsoft Excel into a tool for analyzing social networks. I was introduced to the program by Brant Houston, in his investigative reporting class at the university. The Excel plug-in comes in handy during an investigation when you need to do things like plot the flow of money or political influence within organizations or among groups of people. As it turns out, the grant was conducting a first-of-its-kind analysis of teaching networks and needed someone with my expertise.

The moral of this story could be that if you develop skills beyond traditional journalism in undergraduate/graduate school, it’s easier to parlay your skills into a new career when the journalism jobs market tanks. But the fact is I’m still practicing journalism, albeit during my off-hours.

I recently submitted an investigation of a local church with more than $100,000 in tax liens to a Knight Foundation-funded community news website. The investigation required digging up and looking through nonprofit tax records, federal tax liens, city ordinances, and even credit union call reports. It stemmed from a legal notice I stumbled upon in the aforementioned investigative reporting class.

Rather, this is why a journalist should learn data journalism: to become a more versatile investigator.

When I was teaching introductory journalism classes to freshman and sophomore university students, I wanted them to know exactly why it’s useful to have computer and data journalism skills. So I put together a presentation on data journalism for a lecture of about 100 students, and asked data journalism veteran Steve Doig, currently the Knight Chair at the Walter Cronkite School of Journalism, for a few bits of advice.

Monday, September 12, 2011

What improved word clouds reveal in Obama, Bernanke jobs and economy speeches

The above is a word cloud using President Obama’s Sept. 8 address to Congress. As is customary with word clouds, the more times a word occurs in a text, the larger the font size in the cloud. Even if you weren’t aware of the nature of the speech, it’s obvious from the cloud that Obama’s address to Congress dealt with “jobs” in “America.”
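The frequency-to-size mapping behind a word cloud is simple to reproduce. A minimal sketch in Python (not the tool used for the cloud above), scaling each word’s count linearly to a font size:

```python
import re
from collections import Counter

def word_sizes(text, min_pt=12, max_pt=72):
    """Count word occurrences and scale each count linearly to a font
    size, the way a basic word cloud sizes its words."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)
    lo, hi = min(counts.values()), max(counts.values())
    span = (hi - lo) or 1  # avoid dividing by zero if all counts match
    return {w: min_pt + (max_pt - min_pt) * (c - lo) / span
            for w, c in counts.items()}
```

Real word-cloud tools add stopword removal and layout, but the sizing step is just this linear (sometimes square-root) scaling of counts.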

But word clouds have limits. Seth Duncan, analytics director for the digital public relations firm WCG, wrote in 2010 that the simplicity of the word cloud could contribute to a decline in reading comprehension. In his post, “Word Clouds and the Cognitive Decline of PR and Marketing,” Duncan wrote that he strongly believed “that the word cloud is the biggest enemy of deep reading and lowest form of artificial intelligence in marketing and PR.”

“You can read the content very quickly (because they don’t contain much information) and they have a unique look. I also think that word clouds can provide useful information for SEM or SEO planning. But people are fooling themselves if they think that a word cloud offers a satisfactory summary of hundreds or thousands of pages of text,” he wrote.

NYU political science PhD student Drew Conway has a related beef with word clouds. Conway looked at a word cloud, essentially a plot of words in three dimensions (x, y, and font size), and saw a missed opportunity. “They are meant to summarize a single statistic—word frequency—yet they use a two dimensional space to express that,” he wrote.

His solution came from his background in statistics, which often involves comparing two sets of data. For his improved word cloud, he compared two speeches by political figures and used the x-axis to describe the similarity between them. To accomplish this, he used R, the free, open-source statistical programming environment, which has data-mining and graphics-plotting features, along with some custom code.
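Conway worked in R, but the core idea is just a signed comparison of word frequencies between the two texts, which is easy to sketch in Python. Here each word gets an x-position between -1 (used only in the first speech) and +1 (used only in the second); the function name is illustrative, not Conway’s code:

```python
import re
from collections import Counter

def comparative_positions(text_a, text_b):
    """For each word in either speech, compute an x-coordinate in
    [-1, 1]: -1 means the word appears only in speech A, +1 only in
    speech B, and 0 means it is used at the same rate in both."""
    def freqs(text):
        words = re.findall(r"[a-z']+", text.lower())
        counts = Counter(words)
        total = sum(counts.values())
        return {w: c / total for w, c in counts.items()}

    fa, fb = freqs(text_a), freqs(text_b)
    positions = {}
    for word in set(fa) | set(fb):
        a, b = fa.get(word, 0.0), fb.get(word, 0.0)
        positions[word] = (b - a) / (a + b)  # signed share difference
    return positions
```

Plotting each word at its x-position, with font size still tied to overall frequency, gives the “better word cloud”: shared vocabulary clusters in the middle, and each speaker’s distinctive words pull to the edges.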

But what to compare the Obama jobs speech to? That same day, bankers and business executives at the Economic Club of Minnesota waited eagerly to hear Fed Chairman Ben Bernanke outline what the Fed would do to alleviate economic concerns.

Obama and Bernanke were speaking to two very different audiences, and had different objectives. Obama was speaking to a Congress hell bent on being re-elected and an anxious, under-employed American public. Meanwhile, Bernanke was speaking to titans of industry and banking. These differences shouldn’t be an excuse not to compare the two speeches; rather, both speakers are components of the administration weighing in on essentially the same issue.

Differences in their speeches could signal a difference in opinion and discord about an appropriate response, while similarities could point to ideas with a measure of political support. If nothing else, it’s worth looking at how two high-ranking officials in an administration tailor speeches on economic issues to two different audiences.

Here’s what those two speeches look like in Conway’s “better word cloud.” Click to see the plot in a higher resolution.

Thursday, September 8, 2011

South-side children have greatest exposure to lead in Chicago, health department data shows

This interactive heat map, compiled using Chicago Department of Public Health data, GIS files, and Google Fusion Tables, shows where the children with the highest rates of elevated blood lead levels in Chicago live. Data are from 2010.

Chicago Department of Public Health data shows that children in the poorer, industrialized south of Chicago are more likely to have dangerous levels of lead in their bodies than children in more affluent neighborhoods.

The data, obtained by a FOIA request from the health department, shows the levels of lead the agency found in children 17 and under in the city of Chicago. Most children tested for lead, however, were under 6 years old.

“An EBL or elevated blood lead level, is defined… as the child’s highest venous test with a result of 6 or more micrograms lead (Pb) per deciliter blood,” the health department wrote.

According to the EPA, there is no safe level for lead in the human bloodstream. At 10 micrograms per deciliter of blood, children can develop symptoms such as “lowered intelligence, reading and learning disabilities, impaired hearing, reduced attention span, hyperactivity, and antisocial behavior.”

The most recent results are from 2010, but the file contains annual results back to 2005. They were compiled with the help of an epidemiologist in the department.

“Multiple blood lead tests were determined using an algorithm that matches children by name, date of birth and sex, while allowing for common typographical and data entry (eg, reversing first and last name) errors for blood lead tests conducted within a calendar year,” the health department wrote.
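The matching step the department describes can be illustrated with a toy version. This sketch uses hypothetical field names, and it covers only the reversed-name case the department mentions; their full algorithm also tolerates typographical errors, which is omitted here:

```python
def same_child(rec_a, rec_b):
    """Decide whether two blood-lead test records belong to the same
    child.  Each record is a dict with 'first', 'last', 'dob', 'sex'.
    Matches on date of birth and sex, treating reversed first/last
    names (a common data-entry error) as the same child."""
    if rec_a['dob'] != rec_b['dob'] or rec_a['sex'] != rec_b['sex']:
        return False
    a = (rec_a['first'].lower(), rec_a['last'].lower())
    b = (rec_b['first'].lower(), rec_b['last'].lower())
    return a == b or a == (b[1], b[0])  # exact or reversed-name match
```

Deduplicating on a rule like this is what lets the department report one EBL result per child per year instead of one per test.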

The interactive heat map at the top of the post shows the rate at which children in each of Chicago’s 77 communities reported elevated levels of lead.

The Englewood community has the highest EBL rate: 9.15 percent of the children who were tested for lead came back with a positive EBL. Neighborhoods on the north end of Chicago had EBL rates between 0.8 percent and 3.31 percent.
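The rates behind the map reduce to a simple per-community percentage: children with a positive EBL divided by children tested. A quick sketch of that aggregation, with illustrative data:

```python
def community_ebl_rates(tests):
    """tests: iterable of (community, has_ebl) pairs, one per tested
    child.  Returns the percent of tested children in each community
    whose highest venous test met the EBL threshold."""
    totals, positives = {}, {}
    for community, has_ebl in tests:
        totals[community] = totals.get(community, 0) + 1
        positives[community] = positives.get(community, 0) + bool(has_ebl)
    return {c: 100.0 * positives[c] / totals[c] for c in totals}
```

Note the denominator is children *tested*, not all children in the community, which is why testing coverage matters when comparing neighborhoods.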

Wednesday, September 7, 2011

Visualization - is there injustice in Pilsen?

This visualization was produced as part of a series about Pilsen, a Chicago neighborhood, and its struggle against pollution. Parts one, two and three of that series have been published on

Tuesday, August 9, 2011

Breaking down the downgrade: distilling the message with visualizations and context.

Markets, both domestic and foreign, wasted no time reacting to the news that Standard & Poor’s, one of the three major companies that rate the solvency of nations, had downgraded the United States credit rating from AAA to AA+.

S&P released its report on the downgrade on Friday, Aug. 5, after American markets were closed. Overseas markets were the first to move, with Japan’s Nikkei index dropping 2.2 percent. A sell-off sent China’s mainland Shanghai market down 2.2 percent, and Hong Kong’s Hang Seng index flirted with a 7 percent drop before settling down 4.5 percent for the day.

When it came time for America’s markets to open the following Monday, the Dow lost several hundred points in the first hour of trading and ended down 512 points, or 4.3 percent. It was the biggest one-day drop since Dec. 1, 2008, the Wall Street Journal reported, and among the top 10 biggest one-day DJIA declines ever. Crude oil prices also fell amid concerns about lower demand.

But what does the S&P report actually say? How can we distill and best represent it? The following word cloud identifies the dominant words in the document, with the size of each word reflecting how often it appears.