Small Data


The Big Benefits of Small Data.

"The term "big data" is about machines and "small data" is about people." Allen Bonde, quoted on Wikipedia's Small Data page is a web site and Discord Bot which provide information about the US Green Party. It makes it easy to find your local Green Party politicians, parties and events.  By the time of this conference, this Small Data project will have involved the entire US Green party and tens of millions of their voters.     

Small Data is beautiful.  The map shows the Green Party politicians and organizations for the 2020 race.  It looks nice.

Small Data is useful. These maps serve a useful function.  They allow people to connect to their nearest Green Party or politician, and get engaged in the fight against the 1%, climate change, and injustices. 

Small Data is loved.  People really care about this data.  It has been a long, long time since I have worked on a website that gets so much user feedback.  People really, really want the data to be correct.

Small Data is low-noise.  First of all, the bad data is eliminated by hand curation and user feedback.  But that still leaves a lot of map pins to sort through, if you only want the ones closest to you.  If you live in Pennsylvania, and you are looking for a Pennsylvania Green Party website, most of those map pins are noise.  So here is a map of just Pennsylvania.  That is a much higher signal-to-noise ratio.
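The filtering idea can be sketched in a few lines of Python.  This is only an illustration: the Pin record, its field names, and the sample entries are assumptions, not the project's real data model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pin:
    """A map pin. The field names are illustrative assumptions."""
    name: str
    state: str  # two-letter state code, e.g. "PA"


def pins_for_state(pins, state):
    """Keep only the pins that belong to the requested state."""
    return [pin for pin in pins if pin.state == state]


# Made-up sample data, standing in for the national map.
ALL_PINS = [
    Pin("Example state party", "PA"),
    Pin("Example state party", "CA"),
    Pin("Example local party", "PA"),
]

pennsylvania = pins_for_state(ALL_PINS, "PA")
```

Everything outside the requested state is dropped before the user ever sees it, which is where the better signal-to-noise ratio comes from.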

Small Data is hand curated.  Sure, we crawled what web sites we could, but really every page was inspected by a human, who had to read and poke around all kinds of different web sites to get the information, and often do custom searches as well.  A computer alone could not have collated this data.  It takes a human working with a machine to create high quality small data.  We have a growing team of 17 people responsible for curating the data.

The biggest Small Data sets are the Awesome lists on GitHub.  Hundreds of hand curated lists, managed by volunteer teams of experts.

Small Data is managed.  We not only have a team of people entering the data, we have a quality control team.  Each QC person is responsible for making sure that the data in 7 states is correct and complete.  Someone at the top checks to make sure they did their job correctly.

Small Data is often hierarchical or a graph.  The Awesome lists are organized as a taxonomy, i.e. a hierarchical tree.  Organizational charts are usually a hierarchy.  Social networks are graphs.  In this project, the data starts as a hierarchy, but is really a graph.  Let me show you why it is a hierarchy.  At the top level is the root of the tree.

Eventually the next level down will be a global map. 

Below that you can see the US map.

Below that is a California state map.

Below that is a candidate's page for Jake Tonkel.

Below that is a scheduled meeting.

Below that is one of his videos.  (Sorry if this is not working today.  The software does work, as you can see at PythonLinks.info.  The problem is that the system admin is on vacation and we are waiting for him to return and add a valid YouTube API key.)

With this hierarchical structure, users do not get lost. 
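The walk down the tree above can be sketched as a small Python structure.  The Node class and the page names are illustrative assumptions, not the application's real classes.  The point is that every page knows its parent, so the site can always show users where they are.

```python
class Node:
    """One page in the tree. Names are illustrative, not the project's classes."""

    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent
        self.children = {}
        if parent is not None:
            parent.children[name] = self

    def breadcrumbs(self):
        """Return the page names from the root down to this node."""
        node, trail = self, []
        while node is not None:
            trail.append(node.name)
            node = node.parent
        return list(reversed(trail))


# The six levels walked through above, top to bottom.
root = Node("root")
us = Node("US map", root)
california = Node("California map", us)
candidate = Node("candidate page", california)
meeting = Node("scheduled meeting", candidate)
video = Node("video", meeting)

trail = video.breadcrumbs()
```

The trail for the video page lists all six levels in order, which is the kind of breadcrumb that keeps a visitor oriented.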

Small Data has great Human Factors.  Hierarchical data makes it easy for people to understand complexity.  A basic principle in Human Factors is that there should be no more than about 7 items in any category.  That is for concepts.  We do follow that rule, so we have a tree of web pages.  There is one exception to that rule.  For two-dimensional images, the mind can process more than 7 items.  There are 50 states in the US in a two-dimensional visual map.  7x7 is 49, so even there we very nearly follow that rule.  There are more pins than that, but the human visual system does not need to follow the 7 items rule.  It is just our conceptual thinking that follows that rule.

It is not just a hierarchy, really it is a graph.  Every state party elects members to multiple national committees.  The state parties elect them, but the national committees can fire them for failure to perform their duties.  The application is not doing that yet, but the underlying software does support graphs.  The HTML templates are also accessed using a graph model.
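To make the difference from a tree concrete, here is a minimal sketch of those relationships as a labelled graph.  The Graph class, the edge labels, and the party, delegate and committee names are all made up for illustration; this is not the underlying software's API.

```python
from collections import defaultdict


class Graph:
    """A tiny labelled directed graph. Everything here is illustrative."""

    def __init__(self):
        self._edges = defaultdict(set)  # (source, label) -> set of targets

    def add(self, source, label, target):
        self._edges[(source, label)].add(target)

    def targets(self, source, label):
        """Return the sorted targets reachable from source via label."""
        return sorted(self._edges.get((source, label), set()))


g = Graph()
g.add("State Party A", "elects", "Delegate 1")
g.add("Delegate 1", "serves_on", "Committee X")
g.add("Delegate 1", "serves_on", "Committee Y")
g.add("Committee X", "can_remove", "Delegate 1")

committees = g.targets("Delegate 1", "serves_on")
```

The delegate is pointed at by both a state party and a committee, and points at two committees in turn.  No single parent can express that, which is why a pure hierarchy is not enough.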

Small Data can be very big.  The above example is six levels deep.  When the global map is added, we will be 7 levels deep in the hierarchy.  A seven level deep hierarchy with 7 items per level will easily support 7**7 = 823543 items, and yet the user is not lost in a sea of data.  The point of small data is not that it is small, but that it is human understandable.
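The arithmetic is easy to check.  The sketch below assumes a full tree in which every page has exactly 7 children; the function names are just for illustration.

```python
def leaf_capacity(branching, depth):
    """Number of items at the bottom of a full tree: branching ** depth."""
    return branching ** depth


def total_pages(branching, depth):
    """All pages in that tree, from the root (level 0) down to the leaves."""
    return sum(branching ** level for level in range(depth + 1))


items = leaf_capacity(7, 7)  # 7 choices at each of 7 levels
pages = total_pages(7, 7)    # leaves plus every page above them
```

Here items is 823543, and the user reaches any one of them by making at most 7 choices among 7 options each.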

Small Data is interactive.  We have a Discord bot that answers your questions.  The bot is best seen in an interactive demo.


Technical Aspects

Small Data is small.  Not counting images, the entire database fits in just 12M.

Small Data needs a small server.  The application server is just a single Python process.  The top command reports 21632 kilobytes of virtual memory.  Not much.

Small Data is fast.  Because the data is so small, we can cache it all in RAM, and generate pages very quickly.  Under heavy traffic the single Python server process can get overloaded, so pages for anonymous users are cached in Apache.
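The caching idea can be illustrated in-process with Python's standard functools.lru_cache.  This is only a sketch of the principle: the real site caches anonymous pages in Apache, and the PAGES dictionary and render_page function here are hypothetical.

```python
from functools import lru_cache

# Hypothetical in-memory data set, small enough to live entirely in RAM.
PAGES = {"/us/pa": "Pennsylvania", "/us/ca": "California"}

render_calls = []  # records real renders, only to show the cache working


@lru_cache(maxsize=None)
def render_page(path):
    """Render a page once; later requests for the same path hit the cache."""
    render_calls.append(path)
    return f"<h1>{PAGES[path]}</h1>"


first = render_page("/us/pa")
second = render_page("/us/pa")  # served from the cache, no second render
```

After the two requests, render_calls holds a single entry, so the second visitor cost the Python process almost nothing.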

Small Data is energy efficient.  Data centers account for 2%, rising to 8%, of global energy consumption.  This is a double crime against humanity.  First for the impact on climate change, and second for storing all of our data, without really giving us an option.

Small Data requires a CMS.  Or some kind of security system.  The Awesome lists use GitHub, where every pull request has a different author, and has to be approved.  In this map project, every state and local party, and every politician, can manage their own data.  We do not want all of them changing each other's data, so a hierarchical security model is provided.  We use the Forest Wiki Content Management System.  It is a first cousin to Plone.  Souheil Chelfou spent 5 years cleaning up the user interface libraries.  Christopher Lozinski spent an overlapping 6 years building the developer tools and application.
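One way to picture hierarchical security is that a grant on one node covers everything beneath it.  The sketch below is an assumption made for illustration, not the Forest Wiki's actual permission API; the user names and paths are invented.

```python
# Hypothetical grants: user -> the subtree (as a path tuple) they manage.
GRANTS = {
    "pa_editor": ("US", "Pennsylvania"),
    "national_admin": ("US",),
}


def can_edit(user, path):
    """A user may edit a page if it lies inside the subtree granted to them."""
    granted = GRANTS.get(user)
    if granted is None:
        return False
    # The page is inside the subtree when its path starts with the granted path.
    return tuple(path[:len(granted)]) == granted


own_state = can_edit("pa_editor", ("US", "Pennsylvania", "Philadelphia"))
other_state = can_edit("pa_editor", ("US", "California"))
```

With these grants the Pennsylvania editor can change pages under Pennsylvania but not under California, while the national admin can change both.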

Hierarchical or Graph data works best on an Object or Graph database.  A lot of big data uses a relational database.  This small data uses an object-graph database.  Trying to squish a graph into a relational database is like trying to put a square peg into a round hole.  You can do it, but it takes a lot more work, and a lot more code.  And a core principle of the Small Data movement is to keep the code small as well.

Small Data Prefers Small Code.  The efficiency of small data should be matched by small code.  The entire Python application is only 10K lines of code.  How was that possible?  We used an object-graph database.

Small Data needs a Small Software Team.  The principles of Small Data also apply to small code.  Not counting libraries, this software was written by one developer.  Is that strange?  No.  The primary library, Cromlech, was also written by one developer.  And the database, while it had more contributors, was also the invention of a sole developer.  How is this possible?  While Google's Go language deliberately leaves out inheritance, this project uses Python's multiple inheritance extensively.  Why the difference?  Google has tens of thousands of developers.  They must not step on each other's toes.  This project has one developer.  Not someone who job hops every two years, but one guy who really knows all the details of the code base.
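For readers who have not used it, here is a minimal sketch of Python's multiple inheritance.  The class names and the example party are hypothetical and do not come from the project's code base.

```python
class Locatable:
    """Mixin for anything that can be pinned on a map."""

    def pin(self):
        return (self.latitude, self.longitude)


class Linkable:
    """Mixin for anything that has a public web page."""

    def link(self):
        return f'<a href="{self.url}">{self.name}</a>'


class StateParty(Locatable, Linkable):
    """Combines both behaviours through multiple inheritance."""

    def __init__(self, name, url, latitude, longitude):
        self.name = name
        self.url = url
        self.latitude = latitude
        self.longitude = longitude


party = StateParty("Example State Party", "https://example.org", 40.0, -77.5)
```

The StateParty class gets pin() and link() without writing either one, which is how a lot of behaviour fits into few lines.  The price is that the developer has to know which base class supplies what, which is easier with one developer than with thousands.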

Small Data is the future.  The web is constantly evolving.  The denser your information, the more people like it.   The mainstream model of providing infinite lists of results is not as popular. 

Software Architecture.  The software is written in Python using an object-graph database.  We use OpenStreetMap data and the Leaflet library.  There are a total of 55 Python libraries used.  If time remains, I will speak more about the software architecture.

