Wednesday, May 27, 2015

Housing Data Hub - from Open Data to Information

Joy Bonaguro is Chief Data Officer for the City and County of San Francisco. This is a repost from April announcing the launch of their Housing Data Hub.

Housing is a complex issue and it affects everyone in the City. However, there is not a lot of broadly shared knowledge about the existing portfolio of programs. The Hub puts all housing data in one place, visualizes it, and provides the program context. This is also the first of what we hope to be a series of strategic open data releases over time. Read more about that below or check out the Hub, which took a village to create!

Evolution of Open Data: Strategic Releases

The Housing Data Hub is also born out of a belief that simply publishing data is no longer sufficient. Open data programs need to take on the role of adding value to open data versus simply posting it and hoping for its use. Moreover, we are learning how important context is to understanding government datasets. While metadata is an essential part of context, it’s a starting point, not an endpoint.

For us, a strategic release is one or more key datasets + a data product. A data product can be a report, a website, an analysis, a package of visualizations... you get the idea. The key point: you have done something beyond simply publishing the data. You provide context and information that transforms the data into insights or helps inform a conversation. (P.S. That’s also why we are excited about Socrata’s new dataset user experience for our open data platform.)

Will we only do strategic releases?

No! First off - it’s a ton of work and requires amazing partnerships. Strategic (or thematic) releases should be a key part of an open data program but not the only part. We will continue to publish datasets per department plans (coming out formally this summer). And we’ll also continue to take data nominations to inform department plans.

We’ll reserve strategic releases to:

  • Address a pressing information gap or need
  • Inform issues of high public interest or concern
  • Tie together disparate data that may otherwise be used in isolation
  • Unpack complex policy areas through the thoughtful dissemination of open data
  • Pair data with the content and domain expertise that we are uniquely positioned to offer (e.g., answer the questions we receive over and over again in a scalable way)
  • Build data products that are unlikely to be built by the private sector
  • Solve cross-department reporting challenges

And leverage the open data program to expose the key datasets and provide context and visualizations via data products.

We also think this is a key part of broadening the value of open data. Open data portals have focused more on a technical audience (what we call our citizen programmers). Strategic releases can help democratize how governments disseminate their data for a local audience that may be focused on issues in addition to the apps and services built on government data. It can also be a means to increase internal buy-in and support for open data.

Next steps

As part of our rolling release, we will continue to work to automate the datasets feeding the hub. You can read more about our rollout process, inspired by the UK Government Digital Service. We’ll also follow up with a technical post on the platform, which is available on GitHub, including how we are consuming the data via our open data APIs.

Thursday, May 21, 2015

Is the Internet Healthy?

Meredith Whittaker is Open Source Research Lead at Google.

We are big fans of open data. So we're happy to see that the folks over at Battle for the Net launched The Internet Health Test earlier this week, a nifty tool that allows Internet users to test their connection speed across multiple locations.

The test makes use of M-Lab open source code and infrastructure, which means that all of the data gathered from all of the tests will be put into the public domain.

One of the project's goals is to make more public data about Internet performance available to advocates and researchers. Battle for the Net and others will use this data to identify problems with ISP interconnections, and, they claim, to hold ISPs accountable to the FCC's Open Internet Order.

This is certainly a complex issue but we are always thrilled by more data that can be used to inform policy.

You can learn more and run the test over at their site:

Thursday, May 14, 2015

New data, more facts: an update to the Transparency Report

Cross-posted from the Official Google Blog.

We first launched the Transparency Report in 2010 to help the public learn about the scope of government requests for user data. With recent revelations about government surveillance, calls for companies to make encryption keys available to police, and a wide range of proposals, both in and out of the U.S., to expand surveillance powers throughout the world, the issues today are more complicated than ever. Some issues, like ECPA reform, are less complex, and we’re encouraged by the broad support in Congress for legislation that would codify a standard requiring warrants for communications content.

Google's position remains consistent: We respect the important role of the government in investigating and combating security threats, and we comply with valid legal process. At the same time, we'll fight on behalf of our users against unlawful requests for data or mass surveillance. We also work to make sure surveillance laws are transparent, principled, and reasonable.

Today's Transparency Report update
With this in mind, we're adding some new details to our Transparency Report that we're releasing today.

  • Emergency disclosure requests. We’ve expanded our reporting on requests for information we receive in emergency situations. These emergency disclosure requests come from government agencies seeking information to save the life of a person who is in peril (like a kidnapping victim), or to prevent serious physical injury (like a threatened school shooting). We have a process for evaluating and fast-tracking these requests, and in true emergencies we can provide the necessary data without delay. The Transparency Report previously included this number for the United States, but we’re now reporting for every country that submits this sort of request.

  • Preservation requests. We're also now reporting on government requests asking us to set aside information relating to a particular user's account. These requests can be made so that information needed in an investigation is not lost while the government goes through the steps to get the formal legal process asking us to disclose the information. We call these "preservation requests" and because they don't always lead to formal data requests, we keep them separate from the country totals we report. Beginning with this reporting period, we're reporting this number for every country.

In addition to this new data, the report shows that we've received 30,138 requests from around the world seeking information about more than 50,585 users/accounts; we provided information in response to 63 percent of those requests. We saw slight increases in the number of requests from governments in Europe (2 percent) and Asia/Pacific (7 percent), and a 22 percent increase in requests from governments in Latin America.

The fight for increased transparency
Sometimes, laws and gag-orders prohibit us from notifying someone that a request for their data has been made. There are some situations where these restrictions make sense, and others not so much. We will fight—sometimes through lengthy court action—for our users' right to know when data requests have been made. We've recently succeeded in a couple of important cases.

First, after years of persistent litigation in which we fought for the right to inform Wikileaks of government requests for their data, we were successful in unsealing court documents relating to these requests. We’re now making those documents available to the public here and here.

Second, we've fought to be more transparent regarding the U.S. government's use of National Security Letters, or NSLs. An NSL is a special type of subpoena for user information that the FBI issues without prior judicial oversight. NSLs can include provisions prohibiting the recipient from disclosing any information about it. Reporters speculated in 2013 that we challenged the constitutionality of NSLs; after years of litigation with the government in several courts across multiple jurisdictions, we can now confirm that we challenged 19 NSLs and fought for our right to disclose this to the public. We also recently won the right to release additional information about those challenges and the documents should be available on the public court dockets soon.

Finally, just yesterday, the U.S. House of Representatives voted 338-88 to pass the USA Freedom Act of 2015. This represents a significant step toward broader surveillance reform, while preserving important national security authorities. Read more on our U.S. Public Policy blog.

Posted by Richard Salgado, Legal Director, Law Enforcement and Information Security

Thursday, May 7, 2015

Exploring the world of data-driven innovation

Mike Masnick is founder of the Copia Institute.

In the last few years, there’s obviously been a tremendous explosion in the amount of data floating around. But we’ve also seen an explosion in the efforts to understand and make use of that data in valuable and important ways. The advances, both in terms of the type and amount of data available, combined with advances in computing power to analyze the data, are opening up entirely new fields of innovation that simply weren’t possible before.

We recently launched a new think tank, the Copia Institute, focused on looking at the big challenges and opportunities facing the innovation world today. An area we’re deeply interested in is data-driven innovation. To explore this space more thoroughly, the Copia Institute is putting together an ongoing series of case studies on data-driven innovation, with the first few now available in the Copia library.

Our first set of case studies includes a look at how the Polymerase Chain Reaction (PCR) helped jumpstart today’s biotechnology field. PCR is, in short, a machine for copying DNA, something that was extremely difficult to do (outside of living things copying their own DNA). The discovery was something of an accident: A scientist discovered that certain microbes survived in the high temperatures of the hot springs of Yellowstone National Park, previously thought impossible. This resulted in further study that eventually led to the creation of PCR.

PCR was patented but licensed widely and generously. It basically became the key to biotech and genetic research in a variety of different areas. The Human Genome Project, for example, was possible only thanks to the widespread availability of PCR. Those involved in the early efforts around PCR were actively looking to share the information and concept rather than lock it up entirely, although there were debates about doing just that. By making sure that the process was widely available, it helped to accelerate innovation in the biotech and genetics fields. And with the recent expiration of the original PCR patents, the technology is even more widespread today, expanding its contribution to the field.

Another case study explores the value of the HeLa cells in medical research—cancer research in particular. While the initial discovery of HeLa cells may have come under dubious circumstances, their contribution to medical advancement cannot be overstated. The name of the HeLa cells comes from the patient they were originally taken from, a woman named Henrietta Lacks. Unlike previous human cell samples, HeLa cells continued to grow and thrive after being removed from Henrietta. The cells were made widely available and have contributed to a huge number of medical advancements, including work that has resulted in five Nobel prizes to date.

With both PCR and HeLa cells, we saw an important pattern: an early discovery that was shared widely, enabling much greater innovation to flow from proliferation of data. It was the widespread sharing of information and ideas that contributed to many of these key breakthroughs involving biotechnology and health.

At the same time, both cases raise certain questions about how to best handle similar developments in the future. There are questions about intellectual property, privacy, information sharing, trade secrecy and much more. At the Copia Institute, we plan to dive more deeply into many of these issues with our continuing series of case studies, as well as through research and events.

Friday, May 1, 2015

Five ways for states to make the most of open data

Mariko Davidson serves as an Innovation Fellow for the Commonwealth of Massachusetts where she works on all things open data. These opinions are her own. You can follow her @rikohi.

States struggle to define their role in the open data movement. With the exception of some state transportation agencies, states watch their municipalities publish local data, create some neat visualizations and applications, and get credit for being cool and innovative.

States see these successes and want to join the movement. Greater transparency! More efficient government! Innovation! The promise of open data is rich, sexy, and non-partisan. But when a state publishes something like obscure wildlife count data and the community does not engage with it, it can be disappointing.

States should leverage their unique role in government rather than mimic a municipal approach to open data. They must take a different approach to encourage civic engagement, more efficient government, and innovation. Here are a few recommendations based on my time as a fellow:

  1. States are a treasure trove of open data. This is still true. When prioritizing what data to publish, focus on the tangible data that impacts the lives of constituents—think aggregating 311 request data from across the state. Mark Headd, former Chief Data Officer for the City of Philadelphia, calls potholes the “gateway drug to civic engagement.”

  2. States can open up data sharing with their municipalities—which leads to a conversation on data standards. States can use their unique position to federate and facilitate data sharing with municipalities. This has a few immediate benefits: a) it allows citizens a centralized source to find all levels of data within the state; b) it increases communication between the municipalities and the state; and c) it begins to push a collective dialogue on data standards for better data sharing and usability.

  3. States in the US create an open data technology precedent for their towns and municipalities. Intentional or not, the state sets an open data technology standard—so they should leverage this power strategically. When a state selects a technology platform to catalog its data, it incentivizes municipalities and towns within the state to follow its lead. If a state chooses a SaaS solution, it creates a financial barrier to entry for municipalities that want to collaborate. The Federal Government understood this when it moved to the open source solution CKAN. Bonus: open source software is free and embodies the free and transparent ethos of the greater open data movement.

  4. States can support municipalities and towns by offering open data as a service. This can be an opportunity to provide support to municipalities and towns that might not have the resources to stand up their own open data site.

  5. Finally, states can help facilitate an “innovation pipeline” by providing the data infrastructure and regularly connecting key civic technology actors with government leadership. Over the past few years, the civic technology movement experienced a lot of success in cities with groups like Code for America leading the charge with their local Brigade Chapters. After publishing data and providing the open data infrastructure, states must also engage with the super users and data consumers. States should not shy away from these opportunities. More active state engagement is a crucial element still missing in the civic innovation space in order to collectively create sustainable technology solutions for the communities they serve.

Tuesday, April 28, 2015

Visualization: The future of the World Bank

This visualization of the World Bank Borrowers today and in 2019 isn't the most technologically sophisticated visualization we've ever posted but it is a stark illustration of what the future of the World Bank looks like.

As Tom Murphy writes over on Humanosphere:

The World Bank’s influence is waning. Some point to the emerging Asian Infrastructure Investment Bank as evidence of the body’s declining power, but it is the World Bank’s own projections that illustrate the change. Thirty-six countries will graduate from World Bank loans over the next four years (see the above gif).

The images in Murphy's gif come from a policy paper titled "The World Bank at 75" by Scott Morris and Madeleine Gleave at the Center for Global Development. The paper provides a thorough data-driven analysis of current World Bank lending models and systematic trends that will shape its future. From the paper:
The World Bank continues to operate according to the core model some 71 years after the founding of IBRD and 55 years after the founding of IDA: loans to sovereign governments with terms differentiated largely according to one particular measure (GNI per capita) of a country’s ability to pay. Together, concessional and non-concessional loans to countries still account for 67 percent of the institution’s portfolio.

So when the World Bank looks at the world today, it sees a large number of countries organized by IDA and IBRD status.

And what will the World Bank see in 2019, on the occasion of its 75th anniversary? On its current course and with rote application of existing rules, the picture could look very different, with far fewer of those so-called “IDA” and “IBRD” countries.

But does this picture accurately reflect the development needs that will be pressing in the years ahead? Or instead, does it simply reflect an institutional model that is declining in relevance?

It is remarkable how enduring the World Bank’s basic model has been. The two core features (lender to sovereign governments; terms differentiated by countries’ income category) have tremendous power within the institution, which has grown up around them. The differentiation in terms has generated two of the core silos within the institution: the IBRD and IDA. And lending to national governments (what we will call the “loans to countries” model) is so dominant that it has crowded out other types of engagement, even when there has been political will to do other things (notably, climate-related financing).

So while the model has been laudably durable in some respects, it also increasingly seems to be stuck at a time when external dynamics call for change.

This paper examines ways in which seemingly immoveable forces underlying the World Bank’s work might finally be ripe for change in the face of shifting development needs. Specifically, we offer examples of 1) how country eligibility standards might evolve; and 2) how the bank might move further away from the “loans to countries” model that has long defined it.

Friday, April 24, 2015

How do political campaigns use data analysis?

Looking through SSRN this morning, I came across a paper by David Nickerson (Notre Dame) and Todd Rogers (Harvard), "Political Campaigns and Big Data" (February 2014). It's a nice follow-up to yesterday's post about the software supporting new approaches to data analysis in Washington, DC.

In the paper, Nickerson and Rogers get into the math behind the statistical methods and supervised machine learning employed by political campaign analysts. They discuss the various types of predictive scores assigned to voters—responsiveness, behavior, and support—and the variety of data that analysts pull together to model and then target supporters and potential voters.

In the following excerpt, the authors explain how predictive scores are applied to maximize the value and efficiency of phone bank fundraising calls:

Campaigns use predictive scores to increase the efficiency of efforts to communicate with citizens. For example, professional fundraising phone banks typically charge $4 per completed call (often defined as reaching someone and getting through the entire script), regardless of how much is donated in the end. Suppose a campaign does not use predictive scores and finds that upon completion of the call 60 percent give nothing, 20 percent give $10, 10 percent give $20, and 10 percent give $60. This works out to an average of $10 per completed call. Now assume the campaign sampled a diverse pool of citizens for a wave of initial calls. It can then look through the voter database that includes all citizens it solicited for donations and all the donations it actually generated, along with other variables in the database such as past donation behavior, past volunteer activity, candidate support score, predicted household wealth, and Census-based neighborhood characteristics (Tam Cho and Gimpel 2007). It can then develop a fundraising behavior score that predicts the expected return for a call to a particular citizen. These scores are probabilistic, and of course it would be impossible to only call citizens who would donate $60, but large gains can quickly be realized. For instance, if a fundraising score eliminated half of the calls to citizens who would donate nothing, the resulting distribution would be 30 percent donate $0, 35 percent donate $10, 17.5 percent donate $20, and 17.5 percent donate $60. The expected revenue from each call would increase from $10 to $17.50. Fundraising scores that increase the proportion of big donor prospects relative to small donor prospects would further improve on these efficiency gains.
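The arithmetic in the excerpt is easy to check. Here is a minimal Python sketch that recomputes the expected donation per completed call from the two distributions quoted above. The function and variable names are my own, not the paper's, and the "net" figure simply subtracts the $4 per-call fee on the assumption that the fee is the same in both scenarios.

```python
COST_PER_CALL = 4.00  # dollars per completed call, as quoted in the excerpt

# Donation amount in dollars -> share of completed calls, as quoted in the excerpt.
WITHOUT_SCORES = {0: 0.60, 10: 0.20, 20: 0.10, 60: 0.10}
WITH_SCORES = {0: 0.30, 10: 0.35, 20: 0.175, 60: 0.175}


def expected_donation(distribution):
    """Return the average donation per completed call."""
    total_share = sum(distribution.values())
    if abs(total_share - 1.0) > 1e-9:
        raise ValueError("shares must sum to 1")
    return sum(amount * share for amount, share in distribution.items())


def net_per_call(distribution, cost=COST_PER_CALL):
    """Average donation minus the phone bank's fee (illustrative assumption)."""
    return expected_donation(distribution) - cost


if __name__ == "__main__":
    for label, dist in (("without scores", WITHOUT_SCORES),
                        ("with scores", WITH_SCORES)):
        gross = expected_donation(dist)
        net = net_per_call(dist)
        print(f"{label}: gross ${gross:.2f}, net ${net:.2f} per call")
```

Both of the paper's averages check out: $10 per call without scores and $17.50 with them. After the fee, that would be roughly $6 versus $13.50 per call under the assumption above.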

If you've ever wanted to know more about how campaigns use data analysis tools and techniques, this paper is a great primer.