Policy by the Numbers

Data for sound policymaking from Google and friends



International Broadband Pricing Study: Updated Dataset

Thursday, March 20, 2014

Fei Xue is a Staff Analyst at Google.

For the last couple years, Google has worked with Communications Chambers to produce a dataset of retail broadband Internet service prices. We released the first dataset in August 2012 and updated it again in May 2013. This dataset enables international comparisons over time and can potentially be used to evaluate the efficacy of particular public policies on consumer prices.

Today, we’re happy to announce the third edition of this dataset. This release expands coverage and improves quality: roughly 3,000 mobile plans and more than 1,800 fixed plans from major ISPs across about 100 countries.

  • Price observations for fixed broadband plans can be found here.
  • Mobile broadband prices can be found here.
  • Explanatory notes here and ancillary data is here.
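If you want to dig in right away, below is a minimal sketch (in Python, using pandas) of the kind of cross-country comparison the fixed-broadband file supports. The filename and column names are assumptions made for illustration, not the dataset's documented schema.

    import pandas as pd

    # Hypothetical filename and columns; see the explanatory notes for the real schema.
    fixed = pd.read_csv("fixed_broadband_prices.csv")

    # A crude "price per advertised Mbps" metric, compared across countries.
    usd_per_mbps = fixed["price_usd_per_month"] / fixed["download_mbps"]
    by_country = usd_per_mbps.groupby(fixed["country"]).median().sort_values()
    print(by_country.head(10))  # the ten cheapest countries by this rough metric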

We received a lot of positive feedback after the first two releases, and we hope this dataset is useful for regulators, policy makers, academics and advocates in making informed, data-driven decisions.

Visualizing Online Takedown Requests: Challenge Winners!

Thursday, July 11, 2013

Alexandra Pappas is Community and Events Coordinator at Visualizing.org.

In our latest challenge, designers and creative coders visualized Google's Transparency Report with the aim of adding context or insight to our understanding of the openness of the internet. In the report, Google discloses the number of requests received from copyright owners and governments to remove information from their services. Of the fantastic projects that were submitted—check them all out in our gallery—judges selected projects that best made sense of the complexity of the data, offered innovation in approach and design, and compelled us to explore more.

Congratulations to Frontwise on their winning project, Google Online Takedown Requests Browser. Judges appreciated its beautiful design and its ample functionality for discovering patterns and trends, including filters by dataset, time period, copyright owner, and target domain. Additionally, the monthly overview, which arranges requests and targeted domains or products by time and volume across an outer and an inner ring, presented the data neatly and effectively.

Simon Schulz was awarded second place for Country Based Google Transparency Report. As the only project to offer a detailed breakdown of the data by country rather than a more summary approach, judges felt the project provided an important point of view, one that could be a nice complement to the Transparency Report itself.

Prism by Felix Gonda takes third place for its focus—breaking down the volume of the data by country of origin, reason for request, and Google product—and fluid interactivity that allowed easy exploration and comparison. Judges also noted its polish and creative solution.

Frontwise, Simon, and Felix will receive $3250, $1250, and $500 respectively for their great work. Thank you to Google, our jurors, and all participants!

Want to try your hand at another project? Take a look at our Visualizing Hospital Price Data challenge, offering $30,000 in prizes. We look forward to seeing your work!

DataEDGE: A New Vision for Data Science

Thursday, May 16, 2013

Steven Weber is a professor in the School of Information and Political Science department at UC Berkeley.

It's commonly said that most people overestimate the impact of technology in the short term, and underestimate its impact over the longer term.

Where is Big Data in 2013? Starting to get very real, in our view, and right on the cusp of underestimation in the long term. The short term hype cycle is (thankfully) burning itself out, and the profound changes that data science can and will bring to human life are just now coming into focus. It may be that Data Science is right now about where the Internet itself was in 1993 or so. That's roughly when it became clear that the World Wide Web was a wind that would blow across just about every sector of the modern economy while transforming foundational things we thought were locked in about human relationships, politics, and social change. It's becoming a reasonable bet that Data Science is set to do the same—again, and perhaps even more profoundly—over the next decade. Just possibly, more quickly than that.

There are important differences which have equally come into focus. Let's face it: Data Science is just plain hard to do, in a way that the Web was not. Data is technically harder, from a hardware and a software perspective. It's intellectually harder, because the expertise and disciplines needed to work with this kind of data span (at a minimum) computer science, statistics, mathematics, and—controversially—domain expertise in the area of application. And it will be harder to manage issues of ethics, privacy, and access, precisely because the data revolution is, well, really a revolution.

Can data, no matter how big, change the world for the better? It may be the case that in some fields of human endeavor and behavior, the scientific analysis of big data by itself will create such powerful insights that change will simply have to happen, that businesses will deftly re-organize, that health care will remake itself for efficiency and better outcomes, that people will adopt new behaviors that make them happier, healthier, more prosperous and peaceful. Maybe. But almost everything we know about technology and society across human history argues that it won't be so straightforward.

Data Science is becoming mature enough to grapple confidently and creatively with humans, with organizations, with the power of archaic conventions that societies are stuck following. The field is broadening to a place where data science is becoming as much a social scientific endeavor as a technical one. The next generation of world class data scientists will need the technical skills to work with huge amounts of data, the analytical skills to understand how it is embedded in business and society, and the design and storytelling skills to pull these insights together and use them to motivate change.

What skills, knowledge, and experience do you and your organization need to thrive in a data-intensive economy? Come join senior industry and academic leaders at DataEDGE at UC Berkeley on May 30-31 to engage in what will be a lively and important conversation aimed at answering today's questions about the data science revolution—and formulating tomorrow's.

Celebrating data-driven innovation in Brussels

Monday, April 8, 2013

Sylwia Giepmans-Stepien is a Public Policy and Government Relations Analyst for Google in Brussels.

We now create as much information every two days as we did from the dawn of civilization up until 2003. And this rich flow is destined to accelerate. McKinsey projects 40% growth annually in global data generated. To showcase the potential of data for Europe’s economy and society, we recently teamed up with the European Innovation and Technology Foundation, the Bavarian Representation to the European Union and Euronews.

The forum, Data-Driven Innovation: The New Imperative for Growth, debated how data can improve the delivery of public services, provide accurate healthcare diagnosis, and generate higher business productivity. Androulla Vassiliou, European commissioner for education, culture and multilingualism, and Neelie Kroes, European commissioner in charge of the digital agenda, both called for unleashing a Big Data revolution in Europe. "This is the new frontier of the information age," Vassiliou said. "In the current path to stimulate European growth and jobs, there has never been a more critical time to harness the potential of data."

Androulla Vassiliou and Alfred Spector

Senior representatives of the education, research, policy and business communities presented compelling evidence of how data could address big societal challenges. Computer-powered DNA sequencing opens the possibility of accelerating medical diagnoses. Online college courses could revolutionize education. Google's own Vice President for Research Alfred Spector showed how we use data for products such as Google Translate.

Data is also powering entrepreneurs. New online business models that make sense of data include social-media-powered startups such as the news organiser Storify. Its founder Xavier Damman explained how established organisations and top politicians, including the BBC, the White House and UK Prime Minister David Cameron, use his company’s services to share knowledge from different online data sources, including Twitter, Google+, and traditional media websites.

The concluding panel looked at the ethical aspects of collecting, sharing and using data. Among other examples, they discussed how organizations such as DataKind are bringing together data scientists and NGOs to address social problems ranging from dirty water to urban sprawl. While speakers stressed that data-driven innovation is not based exclusively on data about people, they acknowledged that all data, regardless of source and type, requires making tough ethical choices.

The Innovation Forum aims to put data-driven innovation on the Brussels policy agenda. As well as focusing on privacy and data protection, we also need to harness the unprecedented economic potential of data.

Imagining Better Cities through Apps

Wednesday, April 3, 2013

Adrienne St. Aubin is a Policy Analyst at Google.

Google is excited to sponsor this year’s international AppMyCity! Prize from the New Cities Foundation, celebrating mobile applications that improve the urban experience, connect people, and make cities more fun, vibrant, sustainable places.

We're bullish on the value of open public data to inspire innovation and improve citizens' daily lives. Last year Francisca Rojas of Harvard Kennedy School’s Transparency Policy Project highlighted the positive impact of open transit data on the number of transit apps developed—and the indication that more people are likely to utilize public transportation systems when apps help improve the experience via real-time information. Imagine the possibilities for other kinds of public data like health, employment, education, environmental, demographic and cultural info.

The first step toward generating value from public data is for governments to make data available in machine-readable formats, not just PDFs or image files, and ensure it stays up to date. No one wants to build or use an app that shows out-of-date schedules or last year’s parking zones. But governments aren’t the only ones who have a responsibility here, even though they are the generators and keepers of the data. Developers and citizens have a role to play too, by using what’s out there, giving feedback about how it can be improved, and growing the demand side of the market.

Of course, the value of open data isn’t just about apps. But creating and using apps is one of the most concrete ways we can engage with the public information around us. Imagine together how it can make our communities—and the world—a better place.

About the AppMyCity! Prize

Entries are now being accepted at www.appmycity.org and the submission deadline is April 26, 2013. The New Cities Foundation will announce ten semi-finalists on April 30, 2013. This list will be assessed by a panel of expert judges, who will select the three finalists. The finalists will be announced on May 7, 2013.

Three AppMyCity! Prize finalists will be invited to attend the New Cities Foundation’s New Cities Summit in São Paulo June 4-6 to present their project to an international audience of urban leaders, thinkers and innovators, and the winner will receive 5,000 USD to support further development of the app.

Visualization: The Atlantic's Class-divided Cities Series

Tuesday, April 2, 2013

In January, Atlantic editor Richard Florida kicked off a series of posts called the "Class-divided Cities." Each post includes an analysis and map visualizations of socio-economic polarization within different areas of US cities.

This divide is seen most clearly in where members of each class live. A recent report from the Pew Research Center found that residential segregation between upper- and lower-income households has risen in 27 of America's 30 largest metros over the past several decades. Compounding this polarization between rich and poor neighborhoods, the share of middle-income neighborhoods has declined substantially.
[...]
To get a better sense of the scale of the divide in American cities, my research team at the Martin Prosperity Institute — relying on data from the U.S. Census Bureau's American Community Survey — plotted and mapped the residential locations of today's three major classes: the shrinking middle of blue-collar workers in manufacturing, transportation, and maintenance; the rising numbers of highly paid knowledge, professional, and creative workers in the creative class; and the even larger and faster-growing ranks of lower-paid, lower-skill service workers. For the next few weeks, I'll be exploring the various divides in some of America's largest cities and metros.

The series began with New York, and yesterday, San Francisco became the 11th.

List of city analyses, in the order in which they were posted:

Data Privacy Day: Google’s approach to government requests for user data

Monday, January 28, 2013

Data Privacy Day is recognized every year on January 28 in the US, Canada, and many EU countries (27, according to Wikipedia). In honor of Data Privacy Day 2013, Google SVP and Chief Legal Officer David Drummond wrote for the official Google blog about how Google handles government requests for data. We're reposting the text from his post, Google's approach to government requests for user data.

Today, January 28, is Data Privacy Day, when the world recognizes the importance of preserving your online privacy and security.

If it's like most other days, Google—like many companies that provide online services to users—will receive dozens of letters, faxes and emails from government agencies and courts around the world requesting access to our users' private account information. Typically this happens in connection with government investigations.

It's important for law enforcement agencies to pursue illegal activity and keep the public safe. We're a law-abiding company, and we don't want our services to be used in harmful ways. But it's just as important that laws protect you against overly broad requests for your personal information.

To strike this balance, we're focused on three initiatives that I'd like to share, so you know what Google is doing to protect your privacy and security.

First, for several years we have advocated for updating laws like the U.S. Electronic Communications Privacy Act, so the same protections that apply to your personal documents that you keep in your home also apply to your email and online documents. We’ll continue this effort strongly in 2013 through our membership in the Digital Due Process coalition and other initiatives.

Second, we'll continue our long-standing strict process for handling these kinds of requests. When government agencies ask for our users’ personal information—like what you provide when you sign up for a Google Account, or the contents of an email—our team does several things:

  • We scrutinize the request carefully to make sure it satisfies the law and our policies. For us to consider complying, it generally must be made in writing, signed by an authorized official of the requesting agency and issued under an appropriate law.
  • We evaluate the scope of the request. If it's overly broad, we may refuse to provide the information or seek to narrow the request. We do this frequently.
  • We notify users about legal demands when appropriate so that they can contact the entity requesting it or consult a lawyer. Sometimes we can’t, either because we’re legally prohibited (in which case we sometimes seek to lift gag orders or unseal search warrants) or we don’t have their verified contact information.
  • We require that government agencies conducting criminal investigations use a search warrant to compel us to provide a user's search query information and private content stored in a Google Account—such as Gmail messages, documents, photos and YouTube videos. We believe a warrant is required by the Fourth Amendment to the U.S. Constitution, which prohibits unreasonable search and seizure and overrides conflicting provisions in ECPA.

And third, we work hard to provide you with information about government requests. Today, for example, we've added a new section to our Transparency Report that answers many questions you might have. And last week we released data showing that government requests continue to rise, along with additional details on the U.S. legal processes—such as subpoenas, court orders and warrants—that governments use to compel us to provide this information.

We're proud of our approach, and we believe it’s the right way to make sure governments can pursue legitimate investigations while we do our best to protect your privacy and security.

Posted by David Drummond, Senior Vice President and Chief Legal Officer

Visualization: Foreign aid, corruption and internet use

Wednesday, January 16, 2013

A few weeks ago, we posted the winner of the second Google/Guardian Datastore data visualization competition. For the next few Wednesdays, we'll share other competition entries here. Today's featured entry, "Foreign aid, corruption and internet use," is an interactive chart created by Nikhil Sonnad.

The chart below shows OECD data on the total amount given -- since 1960 -- to every aid recipient country. Two other data points underlie the simple bar graph: the 2011 Corruption Perceptions Index produced by Transparency International, and rates of internet use per 100 people, provided by the World Bank (for 2011 or the bank's most recent figures). You can use the menu below to manipulate how these data points affect the order and coloring of the chart.
The image below shows corruption index data (sort) and internet access data (color). Click the image to play with Nikhil's visualization.
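For anyone curious how the sort-and-color mechanic might be reproduced, here is a rough Python sketch with invented numbers; the real chart draws on the OECD, Transparency International, and World Bank figures described above.

    import matplotlib.pyplot as plt
    import pandas as pd

    # Invented values standing in for the real OECD / TI / World Bank data.
    df = pd.DataFrame({
        "country": ["Country A", "Country B", "Country C"],
        "total_aid": [120.0, 80.0, 45.0],       # cumulative aid received since 1960
        "corruption_index": [2.5, 6.0, 4.0],    # 2011 Corruption Perceptions Index score
        "internet_per_100": [10, 70, 35],       # internet users per 100 people
    })

    df = df.sort_values("corruption_index")     # the "sort" menu choice
    colors = plt.cm.viridis((df["internet_per_100"] / 100).to_numpy())  # the "color" menu choice
    plt.bar(df["country"], df["total_aid"], color=colors)
    plt.ylabel("Total aid received")
    plt.show()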

Data Hangout on Air #5: Alex Howard

Monday, December 17, 2012

Today we hung out with Alex Howard to talk about the big stories and trends in data from 2012 and get his outlook for 2013. Alex touches on some important policy issues relating to data, including privacy and security, identity, and ownership.

Why economic policymakers in the UK should listen to what the data are telling them

Thursday, December 13, 2012

Hasan Bakhshi is Director, Creative Economy and Juan Mateos-Garcia is Research Fellow in Nesta’s Policy & Research Unit.

Boston Consulting Group says that the UK's Internet economy is the largest in the G20 as a percentage of GDP—8.3% compared to an average of 4.1%. The UK is a nation of e-shoppers, with almost two-thirds of consumers reporting having purchased goods or services online in the previous three months. That’s almost twice the average for the countries in the Eurozone. The UK was also the first country in the world where online advertising spend overtook TV advertising in 2009.

Yet, a recent study of digital readiness from Booz & Co. puts the UK in an unimpressive twelfth place and ranks the nation eighteenth on average broadband connection speeds. They find that two-thirds of UK SMEs have "little or no presence online." Only 14% of UK SMEs sell online, compared with 30% in Norway. Eurostat data confirm that UK businesses are not among the leading pack in e-commerce markets.

Our new research on how UK businesses collate and use their online customer data adds to this picture of lagging engagement with digital. Our findings suggest that collection of data is patchy, and that four out of five businesses with active online operations are not making full use of their data for decision making.

Even in our Internet-active sample, only 38% of businesses collect comprehensive transactions data. In the majority of businesses, the analysis of online data is only basic and descriptive. For example, only 27% run A/B experiments and other controlled trials and an even lower 13% use statistical techniques such as regression analysis.

Only 41% of businesses in our sample use online data to inform their business strategy, and fewer use it to optimise prices. Even among the sub-sample of firms for whom e-commerce makes up more than half of overall revenues, less than four in ten use their online customer data to set prices.

However, 18% of businesses in our sample—the 'datavores'—are showing the way. They are likely to collect, analyse, and, above all, act on their online customer data. They appear to be investing more aggressively in data capabilities than other firms, suggesting that companies who don’t learn to use data will be left even further behind.

Datavores are four times more likely than intuition-driven companies to report a positive contribution from their online data and are even more likely to be product innovators. The implication here is that there may be an immediate benefit to the UK economy if more businesses made use of online data.

What might all this mean for policy?

Policymakers need to think about how to create a regulatory environment that strikes the right balance for consumers between data privacy and the potential benefits to be gleaned from data use, like more efficient pricing of products and rapid product innovation. Concerns about data privacy and security ranked highly as a barrier to greater use of online data by the datavores in our survey.

Policymakers should also heed the importance of sound analytical and management skills if they wish to encourage data-driven business. They should ask whether the education system attaches enough importance to such skills, and whether the system is prepared to cope with increasing demands as more businesses begin to unlock the value of data.

More data about copyright removals in Transparency Report

Tuesday, December 11, 2012

Fred Von Lohmann is Legal Director at Google.

We believe that data should play an important role in figuring out how to make copyright work better online. Six months ago, to help inform ongoing policy conversations, we launched a feature in our Transparency Report that discloses how many requests we receive from copyright owners to remove Google Search results.

Starting today, anyone interested in studying the data can download all the data shown for copyright removals in the Transparency Report. The data will be updated every day.

We are also providing information about how often we remove search results that link to allegedly infringing material. Specifically, we are disclosing how many URLs we removed for each request and specified website, the overall removal rate for each request and the specific URLs we did not act on. Between December 2011 and November 2012, we removed 97.5% of all URLs specified in copyright removal requests.
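As a simple illustration of the kind of analysis the download supports, the sketch below computes per-request and overall removal rates; the column names are assumptions, not the export's actual schema.

    import pandas as pd

    # Toy stand-in for the downloadable copyright removals data (hypothetical columns).
    requests = pd.DataFrame({
        "request_id": [1, 2, 3],
        "urls_specified": [100, 40, 10],
        "urls_removed": [98, 39, 10],
    })
    requests["removal_rate"] = requests["urls_removed"] / requests["urls_specified"]

    # Overall rate across all specified URLs, analogous to the 97.5% figure above.
    overall = requests["urls_removed"].sum() / requests["urls_specified"].sum()
    print(requests, overall)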

As policymakers evaluate how effective copyright laws are, they need to consider the collateral impact copyright regulation has on the flow of information online. When we launched the copyright removals feature, we were receiving more than 250,000 requests per week. That number has increased tenfold in just six months to more than 2.5 million requests per week today. While we’re now receiving and processing more requests more quickly than ever (on average, within approximately six hours), we still do our best to catch errors or abuse so we don’t mistakenly disable access to non-infringing material.

We’ll continue to fine tune our removals process to fight online piracy while providing information that gives everyone a better picture of how it works. By making our copyright data available in detail, we hope policymakers will be able to see whether or not laws are serving their intended purpose and being enforced in the public interest.

Modeling a Market for White Space

Thursday, November 29, 2012

Kate Harrison is a graduate student at UC Berkeley and Anant Sahai is an Associate Professor at UC Berkeley.

Using TV white spaces means allowing wireless devices (e.g. wireless routers) to transmit on frequencies previously exclusive to over-the-air TV. The goal is not to eliminate over-the-air TV but instead to increase efficiency by maximizing existing resources. A useful analogy is pouring sand into a jar of large rocks, where the rocks in the jar naturally leave gaps for sand to fill in. We can think of the signals for TVs, called primaries, as the rocks, which leave room for signals from new devices, the secondaries, our sand.

The principal concern is preventing harm to primaries. Secondaries must be "quiet" enough that TV sets can still "hear" TV signals (in communications lingo, the signal-to-noise ratio must not drop too much). Consequently, we must enforce a limit on the collective "volume" of secondaries.

The standard approach is to hard-code the per-device limit on transmit power ("volume"). This works where devices have roughly the same requirements regardless of location. However, as the map below shows, white space availability varies greatly.


Large variation in the number of available white space channels.

The natural response to a variable environment is to adapt to it. To be legal, white space devices must contact servers to register and get permission to transmit, which ensures they don’t get too close to protected TV signals (in this sense, the policy is already data-driven). With this setup, it’s easy to simultaneously assign a custom transmit power. We showed with data-driven simulations that there is a power limit function which allows significantly higher mobile data rates without hurting TV coverage:


Single transmit power everywhere vs. transmit power varying by location

These maps were created in Matlab using US 2010 Census data by tract, the ITU propagation model, and a list of the 8,186 US TV towers, assuming white space ISP towers are placed to serve 2,000 people each. Find the code here.

Notice that data rates are much higher and more uniform in the variable-power map than in the single-power map. But an infinite number of functions satisfy the (linear) constraints of the problem (i.e. preserving TV reception). How should we pick in a principled way?

The traditional economics approach is to assign prices (using real or pretend money) to transmit power and allow people to trade freely until everyone is satisfied. Given the quantity of wireless devices, this is practically infeasible: imagine asking people to manually adjust the power of their wireless routers or even determine their valuation for a unit of power.

However, if we make the simple assumption that all devices (users) crave data rate, we can actually simulate their actions in a hypothetical market. This lets us approximate the optimal outcome easily without requiring any human interactions.

In our award-winning paper, "Seeing the bigger picture: context-aware regulations," we created a proof-of-concept “market” under the additional assumption that fair access to white space services is important to society. For example, San Franciscans will need more per-channel power than Montanans because they have fewer available white space channels. This hypothetical market is just a min/max convex optimization problem which can be solved quickly using today’s data centers and scales well even with thousands of constraints.
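To make the min/max convex optimization concrete, here is a minimal sketch of a max-min fair power allocation under a toy rate model and a single linear interference budget. The numbers, the rate model, and the use of cvxpy are illustrative assumptions, not the formulation in our paper.

    import numpy as np
    import cvxpy as cp

    rng = np.random.default_rng(0)
    n_regions = 5
    channels = np.array([2.0, 4.0, 8.0, 16.0, 30.0])  # available white space channels per region
    gain = rng.uniform(0.5, 2.0, n_regions)           # illustrative link gains
    p_budget = 50.0                                   # aggregate "volume" the regulator can hand out

    p = cp.Variable(n_regions, nonneg=True)           # per-channel transmit power in each region
    t = cp.Variable()                                 # worst-case (minimum) regional data rate

    # Toy regional rate: channels * log(1 + SNR). log1p of an affine expression is
    # concave, so maximizing the minimum rate is a convex problem.
    rates = cp.multiply(channels, cp.log1p(cp.multiply(gain, p)))

    problem = cp.Problem(
        cp.Maximize(t),
        [rates >= t,                  # fairness: every region achieves at least the worst-case rate
         channels @ p <= p_budget],   # linear constraint standing in for TV protection
    )
    problem.solve()
    print("per-channel power:", np.round(p.value, 3))
    print("worst-case rate:", round(float(t.value), 3))

A real formulation would replace the single budget with per-location protection constraints derived from the TV tower and census data used for the maps above.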

Since white space access already requires communication with a data center, we can easily apply changes there without deploying new white space devices. This lets us refine algorithms over time—including testing them in small regions before deploying them to the entire nation—in order to improve data rates for users. Through this, the white spaces could open up an exciting new realm of real-time data-driven policy.

Heading Towards Higher Well-Being for all Citizens of the World

Monday, November 19, 2012

Martine Durand is Chief Statistician at OECD.

On October 16, 2012 almost 400,000 babies were born in the world. On that same day, approximately 1000 people from around the world, including economists, statisticians, policy-makers and representatives from business and civil society, met to talk about the future lives of these babies. The 4th OECD World Forum on Measuring Well-Being for Development and Policy Making was held in New Delhi, featuring around 70 presentations, four roundtables, and several keynote lectures. The Forum provided a great opportunity for sharing knowledge and networking on Well-Being and Development.

Issues discussed by participants included: factors shaping trends in poverty and inequalities; business models and practices holding greater promise to improve well-being at work and beyond; links between effective and responsive institutions and people’s well-being; obstacles to gender equality and the type of environment needed for the start-up and success of women-owned businesses; policies helping children and at-risk youth to move into adulthood; preventing environmental degradation; improving the capacity of people, business and policy-makers to manage the consequences of disasters and conflicts; how to strengthen social cohesion.

The OECD World Fora on ‘Statistics, Knowledge and Policies’ have become one of the most important rendez-vous of the global community working on Well-Being. The 4th OECD Forum followed those held in Palermo (2004), Istanbul (2007) and Busan (2009). However, this forum marked a shift in the international well-being agenda. While previous Fora focused mainly on the “why” and the “how” of measuring well-being, the 4th OECD Forum looked at how well-being can be made actionable. The OECD Better Life Initiative, launched in 2011 on the occasion of the Organisation’s 50th Anniversary—under the motto Better Policies for Better Lives—lies at the heart of this attempt to use improved well-being metrics to influence policy making. The OECD Better Life Initiative combines advanced statistical tools for measuring well-being with information on people’s aspirations and needs, as collected through the Better Life Index, an innovative new interactive platform.

But knowing what matters to citizens and where societies want to go is not enough to ensure that we will get there; this is one of the main messages coming out of the discussions held in New Delhi. We need to build our knowledge regarding what works or does not work to achieve better lives. We need new evidence and models to understand how people think and behave, and how policies can raise well-being given our new understanding. Part of the evidence is already there, though, and models are being developed. But the journey is long and will require the involvement of all actors—researchers coming from a range of disciplines, decision-makers, business, ordinary citizens.

Four additional key messages came out of New Delhi, and you can read the summary of conclusions here. The first is that the well-being agenda has made giant steps all over the world and that it is based on a common understanding of the issues. The second is that progress in measuring well-being has been uneven, with great advancements in areas such as subjective well-being but much more modest ones on measuring sustainability, for example. The third is that more research is needed on the determinants of well-being, particularly on the role of policies. The fourth is that the well-being agenda is relevant for both developed and developing countries, although priorities may differ. The next OECD World Forum will take place in 2015 and be aligned with discussions on the outcomes of Rio+20 and the post-2015 agenda. The 5th Forum will thus be an important landmark to judge whether Development Goals will have become, indeed, Well-Being Goals for all.

Data-driven Policy Debate: Censorship and Foreign Aid

Tuesday, October 30, 2012

On October 16, Google teamed up with The Guardian to host a debate in King's Cross, London, on the role of data in international development. Debate participants were Douglas Alexander MP, Salil Tripathi (Institute for Human Rights and Business), Rachel Rank (Publish What You Fund), and Simon Rogers (The Guardian Data Store). Issues covered included the importance of transparency around foreign aid and its impact on government behaviors, especially censorship and surveillance. Video below!

Tripathi and Alexander addressed the complex issue of how sanctions may help chill speech instead of foster it if data is not analyzed. The night ended with a great case for telling the story of political/social/economic development through data and also how critical the Internet has become in telling stories of people who don't usually have a voice.

Mapping the Ecology of Open Data Development

Thursday, October 25, 2012

Viktor Mayer-Schönberger is Professor of Internet Governance and Regulation at the Oxford Internet Institute/Oxford University. He is also a faculty affiliate of the Belfer Center of Science and International Affairs at Harvard University.

Many of us see open data as a potent tool to enhance and improve citizen empowerment and participation. The idea is not just that government data is brought to citizens in a more meaningful way. Many also hope for a rich ecosystem of open data sources and developers yielding amazing apps that provide society with novel insights.

Based on Zarino Zappia’s initial work and data collection, he and I have taken a close look at this emerging network of open data sources and app developers. We collected data on the developers of 175 open data apps, including which data sets they had used. We then mapped the flow of information from initial data sources through the applications that developers had created to end-users.
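A toy version of that mapping, with invented names and edges (the real data set covered 175 apps), might look like the following sketch in Python:

    import networkx as nx

    # Directed edges run from data sources to the apps that consume them.
    g = nx.DiGraph()
    g.add_edges_from([
        ("transit_feed", "bus_times_app"),
        ("census_data", "neighborhood_app"),
        ("census_data", "bus_times_app"),   # an app combining two sources
    ])

    # Data sources feeding many apps act as linking points between sub-communities.
    sources = [n for n in g if g.in_degree(n) == 0]
    print({s: g.out_degree(s) for s in sources})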

Given the high hopes surrounding open data development, the results were somewhat sobering. The open data community that emerged from the data set we analyzed was relatively fragmented and disparate, with (as we noted in our paper), "far less participation and combination of data sources than originally hoped. Instead of a wide open playing field devoid of hierarchies we find developers and datasets alike become crucial linking points—crucial gateways for the flow of information—between sub-communities of open data development based on specific tasks or contexts."

We also found that most open data developers focused on a relatively narrow context for their applications. Thus insights they might have gained in one context (say, local mapping apps) did not transfer easily to different contexts, such as apps on economic data or development.

Most open data apps were created by individuals (71 percent), and as far as we can tell only half of these individuals belong to an easily identifiable, working open data developer community. When and where these communities did form, data sources often provided a natural conduit. Thus perhaps unsurprisingly, apps that combine different data sources were also comparatively rare.


"The open data developer network," taken from Mayer-Schönberger/Zappia, Participation and Power: Intermediaries of Open Data

Moreover, the network of open data developers seems to recreate the hierarchies and limitations on participation that research has shown to saddle the blogosphere, e-rulemaking, or (more recently) Wikipedia.

While these results may disappoint, a few words of caution are in order. First, despite our efforts, our data collection may not have captured the ecosystem comprehensively enough. Second, we may have looked at open data developers too soon (data collection took place in 2011). It is still early days, and perhaps as more data sources are added and apps gain in popularity, not only may the number of developers grow, but they may also become better connected. However, it is important to note that our initial findings were confirmed in in-depth interviews with a number of renowned open data developers.

Perhaps, though, these results also capture an important opportunity: if open data developers are not sufficiently connected with each other and the broader community, it may be because there is not yet an easy way to do so, independent of the large platforms providing data sources. Remedying that may help the open data ecosystem more than the release of further data sources or another application contest.

Data Hangout on Air #3

Wednesday, October 17, 2012

We hosted our third Hangout on Air on data-driven innovation this morning. This Hangout focused on data in sports, a theme that is never more appropriate than in the month of October (at least in the US). Our guests were Andy Brooks (UC Berkeley School of Information) and Robbie Allen (Automated Insights). The conversation ranged from data-driven scouting in baseball to the value of automating narratives around data from events, especially ones like fantasy sports match-ups, where the outcome is relevant only to a few individuals. Video embedded below!

Using "Big Data" to Shape Public Policy

Tuesday, October 2, 2012

Thomas Byrne is a Research Assistant Professor at the School of Social Policy and Practice at University of Pennsylvania.

The term “big data” is often associated with the private sector, but an increasing number of governments at all levels are harnessing their vast stores of administrative data. These data, collected for the day-to-day operational purposes of public programs, provide a valuable basis for creating more effective and efficient public policies and programs. Places ranging from Philadelphia to Washington State have created integrated data systems (IDS) that link administrative records from their health, mental health, education and other human service systems into one data warehouse. Such systems provide policymakers with comprehensive and timely information that is often crucial for understanding—and in turn addressing—complicated problems. In short, as this report notes, IDS allow for “leaps of understanding” that can only occur when an issue is examined from the perspective of multiple public systems or agencies, instead of only one.
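The core mechanic of an IDS is easy to picture: per-agency extracts joined on a shared person identifier. The sketch below is purely illustrative; real systems involve careful record matching, governance, and de-identification.

    import pandas as pd

    # Invented per-agency extracts sharing a person identifier.
    juvenile_justice = pd.DataFrame({"person_id": [1, 2], "exit_year": [2008, 2009]})
    mental_health = pd.DataFrame({"person_id": [1, 3], "service_cost": [4200, 1100]})

    # One row per youth, combining records that normally live in separate agencies.
    linked = juvenile_justice.merge(mental_health, on="person_id", how="left")
    print(linked)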

My colleagues—Dennis Culhane, Stephen Metraux, Manuel Moreno and Halil Toros—and I recently completed a study, funded by the Conrad N. Hilton Foundation, that is an example of how IDS can apply the power of big data to improve public sector services. Our study used an IDS created by Los Angeles County’s Office of the Chief Executive—called the Enterprise Linkages Project (ELP)—to examine young adult outcomes for 23,000 youth who had a history of involvement in the juvenile justice and/or foster care systems. As the graphic below shows, the results were quite striking. Within four years of being discharged (roughly between ages 18 and 24), youth exiting the juvenile justice system and youth exiting foster care used, respectively, $15,986 and $12,532 worth of county-funded health, mental health, drug/alcohol treatment, criminal justice and public welfare services. For those youth who had been involved in both the juvenile justice and foster care systems—called "crossover youth"—the total was more than twice as high.

The study, which provides information at a scope that would have proven impossible in the absence of the ELP, points to the need for increased and more effective forms of assistance to facilitate successful transitions to adulthood by youth who are exiting juvenile systems of care. We anticipate that our findings will inform the ongoing implementation of state legislation that will extend the age of eligibility for child welfare services in California.

Studies such as ours can be linked across sites as well to provide a body of evidence on a particular issue that is of more general value to a broader range of jurisdictions and policymakers. As a report from the Coalition for Evidence Based Policy points out, IDS like the ELP can be a cost-effective means to conduct scientifically rigorous studies of programs and interventions that promote positive outcomes among youth aging out of juvenile systems of care, and other populations.

From a broader perspective, the demand for the unique type of information that comes from IDS is only likely to grow. The current fiscal environment makes it imperative that public resources are used in a cost-effective manner. This memo from the federal Office of Management and Budget makes it clear that evidence-based policy is on the rise, and that IDS and administrative data will have a large role to play in supplying the necessary evidence. That should be viewed as encouraging news by stakeholders from across the board.

GetRaised: Examining a Data-driven Solution

Thursday, September 27, 2012

Matt Wallaert is Lead Scientist at GetRaised.com.

Most of the time, when we talk about “policy by the numbers,” we are talking about using data to identify the existence of a problem. But I want to tell a somewhat different kind of story: that numbers can be not just identifiers, but solutions.

The story still starts with identification. A few years ago, I was the Head of Product at Thrive, a personal financial management site (like Mint.com) that was sold to LendingTree. Our main job was to help people change their financial behaviors: spend less, save more. But when we looked at our data, we found women were falling far behind in savings simply because they made only about 75% of what men did. This meant that even if we created the best budget program imaginable, women would still fall behind unless we could raise their income level to that of men.

Enter data as the solution. We created GetRaised.com, a free, entirely data-driven product to help women ask for, and successfully get, raises. On the surface, the process is relatively simple: figure out if someone is underpaid, help them generate a letter to do something about it, and track the progress and process around that letter, from handing it over to a manager to meeting to discuss it.

But the true engine of change in GetRaised is the data that sits behind it. Figuring out if someone is underpaid means pulling from a mashup of data from the US Bureau of Labor Statistics (BLS) and user-contributed data. Indeed, one of the hardest parts of developing GetRaised was translating the Standard Occupational Classification (SOC) used by the BLS into job titles that match up with the titles that women are likely to enter on the site.
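A toy version of that check might look like the sketch below; the title-to-SOC mapping, wage figures, and threshold are invented for illustration and are not GetRaised's actual pipeline.

    # Hypothetical mapping from free-text job titles to SOC codes, and
    # hypothetical BLS median annual wages keyed by those codes.
    TITLE_TO_SOC = {"software engineer": "15-1252", "marketing manager": "11-2021"}
    BLS_MEDIAN_WAGE = {"15-1252": 93000, "11-2021": 91000}

    def is_underpaid(job_title, salary, threshold=0.95):
        """Flag a salary below a fraction of the BLS median for the matched occupation."""
        soc = TITLE_TO_SOC.get(job_title.lower())
        if soc is None:
            return False  # no occupation match, so we can't say
        return salary < threshold * BLS_MEDIAN_WAGE[soc]

    print(is_underpaid("Software Engineer", 78000))  # True with these toy numbers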

Generating a raise letter means more data. Besides the salary numbers, we scrape job postings and match similar open jobs in a user’s market that pay a higher salary than what the user currently receives. And we used survey data from our interviews with human resources professionals to figure out the right amount to ask for in a raise. (Statistically speaking, 8% is the sweet spot.)
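Putting that 8% figure to work is just as simple to sketch, again with invented postings rather than real scraped data:

    def suggested_ask(current_salary, sweet_spot=0.08):
        """Suggested raise request based on the ~8% sweet spot mentioned above."""
        return round(current_salary * (1 + sweet_spot), -2)  # rounded to the nearest $100

    # Invented local postings; only higher-paying comparables are kept as evidence.
    postings = [{"title": "Marketing Manager", "salary": 98000},
                {"title": "Marketing Manager", "salary": 82000}]
    current = 85000
    evidence = [p for p in postings if p["salary"] > current]
    print(suggested_ask(current), len(evidence))  # 91800.0 1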

The headline value of GetRaised, of course, is not the math that drives it but the math that comes out: to date, approximately 70% of women who turn in a GetRaised Raise Request letter receive their raise, and the average raise is ~$6,500. But for data scientists and policy makers, perhaps more important than GetRaised's success is the fact that it exists at all.

As was well said at the recent DataGotham conference in New York City, analysis is useless without action. While the majority of data analysis in academia is used to drive larger policies, products themselves can be based on data, not just as methods of problem detection but as actual solutions. We need more systems that require ever smaller amounts of human intervention to produce ever larger amounts of change. Indeed, I can imagine a GetRaised that already knows your job title and salary data, and that monitors your work output to quantify it in a raise request. Data can be more than just a way of looking at patterns and problems; data-driven solutions can be developed for problems we already know exist.

Data Scientist: The Sexiest Job of the 21st Century [HBR]

Wednesday, September 19, 2012

Thomas Davenport (Harvard Business School, Deloitte) and DJ Patil (Greylock Partners) co-authored a piece for Harvard Business Review titled "Data Scientist: The Sexiest Job of the 21st Century," about the explosion of data science professionals onto the business scene. As businesses move to capture the value of data to drive innovation and decision-making, the role of the data scientist becomes essential to business strategy. In the piece, Davenport and Patil provide a profile of the data scientist:

Data scientists’ most basic, universal skill is the ability to write code. This may be less true in five years’ time, when many more people will have the title “data scientist” on their business cards. More enduring will be the need for data scientists to communicate in language that all their stakeholders understand—and to demonstrate the special skills involved in storytelling with data, whether verbally, visually, or—ideally—both.

But we would say the dominant trait among data scientists is an intense curiosity—a desire to go beneath the surface of a problem, find the questions at its heart, and distill them into a very clear set of hypotheses that can be tested. This often entails the associative thinking that characterizes the most creative scientists in any field. For example, we know of a data scientist studying a fraud problem who realized that it was analogous to a type of DNA sequencing problem. By bringing together those disparate worlds, he and his team were able to craft a solution that dramatically reduced fraud losses.

Perhaps it’s becoming clear why the word “scientist” fits this emerging role. Experimental physicists, for example, also have to design equipment, gather data, conduct multiple experiments, and communicate their results. Thus, companies looking for people who can work with complex data have had good luck recruiting among those with educational and work backgrounds in the physical or social sciences. Some of the best and brightest data scientists are PhDs in esoteric fields like ecology and systems biology. George Roumeliotis, the head of a data science team at Intuit in Silicon Valley, holds a doctorate in astrophysics. A little less surprisingly, many of the data scientists working in business today were formally trained in computer science, math, or economics. They can emerge from any field that has a strong data and computational focus.

It’s important to keep that image of the scientist in mind—because the word “data” might easily send a search for talent down the wrong path. As Portillo told us, “The traditional backgrounds of people you saw 10 to 15 years ago just don’t cut it these days.” A quantitative analyst can be great at analyzing data but not at subduing a mass of unstructured data and getting it into a form in which it can be analyzed. A data management expert might be great at generating and organizing data in structured form but not at turning unstructured data into structured data—and also not at actually analyzing the data. And while people without strong social skills might thrive in traditional data professions, data scientists must have such skills to be effective.

The article goes on to discuss recruitment strategies, including how to attract and retain data scientists, and closes with a summary of the extraordinarily lucrative future of the evolving role.

Visualization: The Forest of Advocacy

Wednesday, September 12, 2012

Via Information Aesthetics.

The "Forest of Advocacy," a visualization of political contribution data, is a "project of the LazerLAB and the Northeastern Centers for Computational Social Science and Digital Humanities (NECSS/NEDH)" at Northeastern University. The video below explains the visualization, which you can explore at their website.

From now until the November 2012 US elections, LazerLAB and company will post one visualization a week derived from political data. What's really great about the project is the thorough video explanation they provide alongside the visualization.