The Big Benefits of Small Data


My talk proposal for PyData Global


Description
"The term "big data" is about machines and "small data" is about people." -Allen Bonde. In this Small Data project, a team from the US Green Party's presidential campaign hand curates data in order to connect voters with politicians. Much like the "Slow Food Movement", this talk lays out the principals of the "Small Data Movement". 
Abstract

"The term "big data" is about machines and "small data" is about people." Allen Bonde, quoted on Wikipedia's Small Data page.

Maps.Howie2020.tech is a website and Discord bot which provide information about the US Green Party. The server is hosted by the Howie Hawkins presidential campaign. By the time of this conference, this Small Data project is expected to be used by tens of millions of voters.

US Green Party Map

Small Data is Beautiful

Maps.Howie2020.tech shows the Green Party politicians and organizations for the 2020 race. It makes it easy to find your local Green Party politicians, parties, and events. And it looks nice.

Small Data is Useful

These maps serve a useful function. They allow people to connect with their nearest Green Party or politician, and get engaged in the fight against the 1%, climate change, and injustice.

Small Data is Hand Curated

Sure, we crawled what websites we could, but really every page was inspected by a human, who had to read and poke around all kinds of different websites to gather the information, and often run custom searches as well. A computer could not have collated this data. It takes a human working with a machine to create high-quality small data. We have a growing team of 17 people responsible for curating the data.

The biggest Small Data sets are the Awesome lists on GitHub: hundreds of hand-curated lists, managed by volunteer teams of experts.

Small Data is Managed

We not only have a team of people entering the data, we also have a quality control team. Each QC person is responsible for making sure that the data in 7 states is correct and complete. Someone at the top checks that they did their job correctly.

Small Data is Loved

People really care about this data. It has been a long long time since I have worked on a website that gets so much user feedback. People really really want the data to be correct.

Small Data is Low-Noise

First of all, bad data is eliminated by hand curation and user feedback. As each state is completed, it is broadcast through social media. Many people see it and volunteer additions and corrections.

But if you only want to join a party close to you, that is still a lot of pins to sort through. If you live in Pennsylvania and are looking for a Pennsylvania Green Party website, most of those map pins are noise. So here is a map of just Pennsylvania. Much higher signal-to-noise ratio.

Pennsylvania Green Party Map

Small Data is often Hierarchical or a Graph

The Awesome lists are organized as a taxonomy, i.e. a hierarchical tree. Organizational charts are usually hierarchies. Social networks are graphs. In this project, the data starts as a hierarchy, but is really a graph.

Let me show you why it is natural to model the Green Party as a hierarchy. The best way to do that is with a video; it lasts 1 minute and 30 seconds.

Small Data has Great Human Factors

Hierarchical data makes it easy for people to understand complexity. A basic principle in Human Factors is that there should be no more than about 7 items in any category. Assembling multiple categories creates taxonomies. That is why library catalogues use the Dewey Decimal System. That is why we have org charts. We follow that rule, so we have a tree of web pages.

The rule of 7 applies to concepts. For images it does not. With two-dimensional images, the mind can process more than 7 items. There are 50 states in the US on a two-dimensional map; 7x7 is 49, so even there we roughly follow the rule. There are more pins than that, but the human visual system does not need to obey the 7-item rule. Only our conceptual thinking does.

The domain model of the Green Party is not just a hierarchy; it is really a graph. Every state party elects members to multiple national committees. The state parties elect them, but the national committees can remove them for failure to perform their duties. The application itself does not yet support committees, but the underlying software does support graphs. The HTML templates are also accessed using a graph model.
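To make that concrete, here is a minimal sketch of how committee membership turns the tree into a graph. The class and attribute names are mine, for illustration only, not the project's actual model.

    # Tree edges come from parent/children; committee delegates add
    # cross-links, which is what makes the structure a graph.
    class Organization:
        def __init__(self, title, parent=None):
            self.title = title
            self.parent = parent              # tree edge: child -> parent
            self.children = []                # tree edges: parent -> children
            if parent is not None:
                parent.children.append(self)

    class Committee(Organization):
        def __init__(self, title, parent=None):
            super().__init__(title, parent)
            self.delegates = []               # graph edges that cut across the tree

    usa = Organization("US Green Party")
    pa = Organization("Green Party of Pennsylvania", parent=usa)
    ny = Organization("Green Party of New York", parent=usa)
    steering = Committee("National Steering Committee", parent=usa)
    steering.delegates.extend([pa, ny])       # two states feed one committee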

Small Data's Achilles Heel

The big human factors problem is getting volunteers. Originally, on the web, people were happy to comment or contribute. Now they do not trust anyone; they only post on the brand-name sites. They love to consume Small Data, but they never want to contribute. For a successful project you need passionate contributors.

The Awesome lists have passionate contributors. The content contributors want publicity for their libraries. The editors care about their industry. The system works.

In the US, the Green Party has passionate contributors. People are unemployed, homeless, without health care, burdened by college loans, and afraid of incarceration. They are really passionate about changing the political system, but even with that motivation, they were still reluctant to enter data.

In the Awesome Lists example, everyone participates "locally". They contribute to the lists of technology that they are using. They are often part of a technical community.

We did find that the best way to engage people was at the local level. In a number of states, Green politicians have a good chance of winning local elections. People are social: they connect with other people in their state. Sure, they all watch the presidential election, but they engage with their local Green parties. They volunteer locally, and they want to build up their state maps. Entering data for a remote state is much less likely: they enter their own state's data, and then they stop. By recruiting volunteers for each state, we built up the national map. A global map is even further from their concerns, which is why it does not yet exist.

And of course the people who care most about their data are the politicians running for office. They care a lot about getting their own data correct and benefitting from the publicity. And we care about getting them elected.

Small Data is Interactive

There is a Discord bot that works with this application. It answers questions about the Green Party, and it crowdsources the collection and curating of news links.

The bot uses the rich data model to answer questions. Rather than getting the computer to perform natural language processing, it gets users to understand a taxonomy of named entities. It reads a JSON feed of the data from the database in order to answer questions about any named entity.
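As a rough illustration, a lookup command can be served straight from that feed. The feed URL, field names, and command name below are hypothetical; only the general approach reflects what the bot does.

    import json
    import urllib.request

    import discord
    from discord.ext import commands

    FEED_URL = "https://maps.howie2020.tech/entities.json"   # hypothetical URL

    # Load the taxonomy of named entities once, keyed by lower-cased name.
    with urllib.request.urlopen(FEED_URL) as response:
        ENTITIES = {e["name"].lower(): e for e in json.load(response)}

    intents = discord.Intents.default()
    intents.message_content = True
    bot = commands.Bot(command_prefix="!", intents=intents)

    @bot.command(name="info")
    async def info(ctx, *, name: str):
        """Reply with the website of the named entity, if we know it."""
        entity = ENTITIES.get(name.lower())
        if entity is None:
            await ctx.send(f"No entry for '{name}'. Try the exact entity name.")
        else:
            await ctx.send(f"{entity['name']}: {entity.get('url', 'no website yet')}")

    bot.run("DISCORD_BOT_TOKEN")   # placeholder token

No natural language processing is needed: users learn the entity names from the taxonomy, and the bot only has to do a dictionary lookup.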

The bot also crowdsources links. It watches the server and collects the links posted there. Editors can review, edit, and approve links. Readers can then vote on the links. When enough votes have accumulated, the links can be uploaded to the server.
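The gatekeeping step is simple enough to show as a toy sketch; the vote threshold and attribute names are made up for illustration.

    from dataclasses import dataclass

    VOTE_THRESHOLD = 5   # hypothetical number of reader votes required

    @dataclass
    class Link:
        url: str
        approved: bool = False   # set by an editor
        votes: int = 0           # incremented by reader votes

        def ready_to_upload(self) -> bool:
            # Only editor-approved links with enough votes go to the server.
            return self.approved and self.votes >= VOTE_THRESHOLD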

Data Science

Even though not all of the data has been entered, the maps are already showing the national party leaders which states to invest in. For them it was like turning on the lights: they could finally see what was going on across the US. By the time of the conference, all of the data will have been entered, and much more valuable insights will have been gained and will be reported.

Personally, I have been running a number of reports. I was quite surprised that about 120 Green Party politicians and organizations do not even have a website. The problem is that they all choose WordPress, and then they need a developer. Not scalable.

I approached the politicians. Many of them are hard at work on it, or they are not running for an important office. But among the organizations, some 10 state parties do not have a website. They just have a Facebook page or a Twitter account.

It is quite easy to take an existing state party website, grab the custom data from the map, and create 10 state websites. And then we can create another hundred websites for local parties. The data gave us a clear indication of what is needed next.
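As a sketch of that idea, assuming a JSON feed of the map data with made-up field names, the starter pages take only a few lines of templating:

    import json
    import pathlib
    import string
    import urllib.request

    TEMPLATE = string.Template(
        "<html><body><h1>$name</h1><p>Contact: $contact</p></body></html>"
    )

    # Hypothetical feed URL and field names.
    with urllib.request.urlopen("https://maps.howie2020.tech/entities.json") as r:
        entities = json.load(r)

    # Generate a starter page for every state party that has no website yet.
    for party in entities:
        if party.get("type") == "state_party" and not party.get("url"):
            page = TEMPLATE.substitute(name=party["name"],
                                       contact=party.get("contact", "unknown"))
            pathlib.Path(party["name"] + ".html").write_text(page)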

Technical Aspects

Small Data Requires a CMS

Or at least some kind of security system. The Awesome lists use GitHub, where every pull request has a different author and has to be approved. In this map project, we do not want everyone changing each other's data, so a hierarchical security model is provided.

The national party curators can edit anything; the state party curators can edit content in their own state; the local parties and politicians can only edit their own content. The national curators can grant security rights to state curators, who can then grant security rights to local parties and politicians. It is a simple security model to implement on an object-graph database, and yet it is very effective.
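Because every object knows its parent, the permission check is just a walk up the graph. The function below is an illustrative sketch, not the project's actual API.

    def can_edit(user, obj):
        """Return True if the user holds an editor grant on obj or any ancestor."""
        node = obj
        while node is not None:
            if user in getattr(node, "editors", ()):
                return True
            node = getattr(node, "parent", None)   # climb toward the root
        return False

A grant on the national node automatically covers every state and local object beneath it, which is the whole model in one loop.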

Initially, all the data entry was done at the national level. Increasingly, the national party just assigns permissions to state curators, who do the data entry for their state. Over time we hope the state curators will assign permissions to their local parties and politicians to maintain their own data.

Small Data Requires a Rich Security Model

People are reluctant to log in to submit their own data. So anonymous users can submit some types of data, but the results have to be approved before publication. Unknown but registered users are also able to submit data for review; once approved, they can later edit their own data. Approved curators are able to submit content which is published immediately.
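In outline, the workflow looks something like this sketch; the role names and the review queue are illustrative, not the real code.

    review_queue = []   # items waiting for a curator

    def submit(item, user):
        if user.role == "curator":
            item.published = True                      # trusted: publish immediately
        else:
            item.published = False                     # anonymous or registered: hold
            item.owner = None if user.role == "anonymous" else user
            review_queue.append(item)

    def approve(item):
        item.published = True                          # the owner, if any, may edit it later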

Small Data Uses a Rich Data Model

The object model includes national, state, and local parties; national and local caucuses; national, state, and local politicians; online and in-person meetups; videos; and links. Python's dynamic binding and multiple inheritance are critical. The world and national parties have maps. States or provinces can optionally have maps.
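Here is an illustrative sketch of why multiple inheritance pays off: behaviors such as "has a map" become mixins, and each content type combines only the mixins it needs. The class names are mine, not the project's.

    class HasMap:
        def map_pins(self):
            # Collect the locations of any children that have one.
            return [child.location for child in self.children
                    if getattr(child, "location", None)]

    class Organization:
        def __init__(self, title):
            self.title = title
            self.children = []

    class NationalParty(HasMap, Organization):
        pass    # always shows a map

    class StateParty(HasMap, Organization):
        pass    # map support is available; showing it is optional

    class LocalParty(Organization):
        pass    # usually just a page, no map of its own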

Small Data Requires a Rich GUI Model

We use the Pyramid "views on objects" approach. End users see the index view; curators can edit with either the WYSIWYG CKEditor view or the more technical Ace Editor view. When they make mistakes, they can restore using the historical views. JSON views are used by the Discord bots.
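To give a flavor of the pattern, here is a sketch using Pyramid's stock view_config decorator. The class, template, and view names are illustrative, and the project's own stack wires this up somewhat differently.

    from pyramid.view import view_config

    class Party:                      # illustrative content class
        pass

    @view_config(context=Party, name='', renderer='templates/index.pt')
    def index_view(context, request):
        return {'party': context}     # what end users see

    @view_config(context=Party, name='edit', renderer='templates/ckeditor.pt',
                 permission='edit')
    def wysiwyg_edit_view(context, request):
        return {'party': context}     # WYSIWYG editing for curators

    @view_config(context=Party, name='json', renderer='json')
    def json_view(context, request):
        return {'title': getattr(context, 'title', '')}   # read by the Discord bot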

There are views to add organizations, add politicians, add links, and add videos.

The ZMI view is designed for managing a tree of objects: renaming and retitling them, and cutting, pasting, and copying individual objects or entire branches of the tree.

Backup, restore, pack, delete, index, and configure views are all supported, but for obvious reasons are not publicly visible.

Small Data is Small

Not counting images, the entire database fits in just 12 MB. Actually, it is smaller than that.

Small Data is Fast

Because Small Data is often small, we can cache it all in RAM and generate pages very quickly. Under heavy traffic the single Python server process can get overloaded, so pages for anonymous users are cached in Apache.
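The caching idea is simple enough to sketch in a few lines; this is a toy version, not the production code.

    page_cache = {}   # rendered HTML keyed by path, held entirely in RAM

    def serve(path, user, render):
        """Return a rendered page, reusing the cache for anonymous visitors."""
        if user is None and path in page_cache:
            return page_cache[path]          # cache hit, no rendering work
        html = render(path)
        if user is None:
            page_cache[path] = html          # only anonymous pages are cached
        return html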

Small Data Needs a Small Server

Currently, the application server is just a single Python process. The Unix top command says it requires 21,632 kilobytes of virtual memory. Not much. Since few people edit the data, and do so infrequently, a single Python application server may be enough, even at the height of the presidential campaign. Remember, this is the Green Party: we respect people's privacy and we do not track everything. We are fine with small servers.

Of course the Apache web server will need to be able to scale up on demand.

Small Data is Energy Efficient

Data centers consume 2% of global energy, projected to rise to 8%. This is a double crime against humanity: first for the impact on climate change, and second for storing all of our data without really giving us an option.

Small Data Can Be Very Big

The important thing about small data is not its size; it is the human factors. The example given earlier is six levels deep. When the global map is added, we will be 7 levels deep in the hierarchy. A seven-level-deep hierarchy will easily support 7**7 = 823,543 items, and yet the user is not lost in a sea of data. The point of small data is not that it is small, but that it is human understandable.

Hierarchical or Graph Data Works Best on an Object or Graph Database

A lot of big data uses a relational database. This small data uses an object-graph database. Trying to squish a graph into a relational database is like trying to put a square peg into a round hole. You can do it, but it takes a lot more work and a lot more code. And a core principle of the Small Data movement is to keep the code small as well.
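For illustration, here is a minimal persistent tree using ZODB, a typical Python object database; the project's own database may be set up differently. The point is that nested objects are stored as they are, with no object-relational mapping layer.

    import ZODB, ZODB.FileStorage
    import transaction
    from persistent import Persistent
    from persistent.mapping import PersistentMapping

    class Node(Persistent):
        def __init__(self, title):
            self.title = title
            self.children = PersistentMapping()

    # Open a file storage, add a small subtree, and commit.
    db = ZODB.DB(ZODB.FileStorage.FileStorage('parties.fs'))
    conn = db.open()
    root = conn.root()
    usa = root['usa'] = Node('US Green Party')
    usa.children['pa'] = Node('Green Party of Pennsylvania')
    transaction.commit()
    conn.close()
    db.close()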

Small Data Prefers Small Code

The efficiency of small data should be matched by small code. Not counting libraries, the entire Python application is only 12,242 lines of code. How was that possible? Well, we used an object-graph database and 55 Python libraries with great abstractions.

Small Data Needs a Small Software Team

The principles of Small Data also apply to small teams. Not counting libraries, this software was written by one developer. Is that strange? Not at all. The primary library, Cromlech, was also written by one developer: Souheil Chelfou spent 5 years cleaning up some previous user interface libraries. And the database, while it had more contributors, was also the invention of a sole developer. How is this possible? While Google's Go language deliberately omits implementation inheritance, this project uses Python's multiple inheritance extensively. Why the difference? Google has tens of thousands of developers, and they must not step on each other's toes. This project has one developer. Not someone who job-hops every two years, but one person who really knows all the details of the code base. Who has the luxury of enough time to get the abstractions right. Who sees opportunities to simplify things. Whose most productive day was throwing out 150 lines of code.

Small Data is the Future

The web is constantly evolving. The denser your information, the more people like it. The mainstream model of providing infinite lists of computer-curated results is not as popular as human-curated results.

 
Notes

Here is the invite for the Discord development server. https://discord.gg/TYjsZ7

Currently there is a demo channel; that may change. You can read the demo there. On the day you review this proposal, the bot may be up or down, and it may be a production server. I am not sure what the future holds for the Discord bot and server. While Discord does not censor chat, I have had no luck finding a large, active, left-leaning server for the bot to call home. Hugely surprising.

 
Speaker Bio

Christopher Lozinski is an MIT graduate, serial entrepreneur, dual US-EU citizen, and polyglot. He has been a Python developer since 1999. Instead of seeking venture capital, he moved from Silicon Valley to Poland. There he built the Forest Wiki, and more recently he has been volunteering with the Green Party's presidential campaign for Howie Hawkins, where he built a data model of the organization.

 

