Book Notes: AIQ by Nick Polson and James Scott
Machine Learning and Artificial Intelligence are words that appear in public discourse on an almost daily basis as we discuss and debate their impressive, and sometimes not so impressive, achievements. However it's rare to find simple explanations about how such things were achieved so that they can seem "almost indiscernible from magic". Even as an engineer, I was unaware of some of the basic ideas behind the most common machine learning algorithms. AIQ does a fantastic job of demystifying these and many others with some thoughtful discussion about what the future might hold for us.
Here are some of my notes and thoughts on the book and some links to further reading. I was lucky enough to be given an advance copy by a friend, Sophie Christopher, from Penguin Random House - thanks Sophie!
Abraham Wald, Personalization & Netflix
There's a well known legend that the US military were considering armouring planes based on the damage the returning planes had received and Wald convinced them otherwise by explaining survivorship bias. The authors could find no evidence of this and say the real gift Wald gave the US Air Force was a survivability recommender system. He analysed returning aircraft for damage and put together a data set. From this he could work out the conditional probability that an aircraft which had returned safely had fuselage damage. Working out the reverse, the probability that an aircraft returns safely given that it has taken fuselage damage, is harder because the data on the planes that never came back is missing (conditional probabilities are not symmetric).
Through research, Wald estimated how many planes took damage to the fuselage and never made it home, and filled in the missing data using imputation. He then had a model he could use to estimate the probabilities of planes not returning after being hit in different areas.
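To see why conditional probabilities are not symmetric, here is a minimal sketch. All of the counts are made up purely for illustration and are not Wald's data; in particular the number of lost planes is the kind of figure that would have had to be imputed.

```python
# Hypothetical counts, invented for illustration only (not Wald's data).
returned_total = 1000               # all planes that came home
returned_with_fuselage_hit = 300    # observable on the airfield
lost_with_fuselage_hit = 100        # never observed; would have to be imputed

# P(fuselage hit | returned): needs only the returning planes.
p_hit_given_returned = returned_with_fuselage_hit / returned_total

# P(returned | fuselage hit): also needs the planes that never came back.
hit_total = returned_with_fuselage_hit + lost_with_fuselage_hit
p_returned_given_hit = returned_with_fuselage_hit / hit_total

print(p_hit_given_returned)   # 0.3
print(p_returned_given_hit)   # 0.75
```

The two numbers answer different questions, and only the first can be computed from the planes on the airfield.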
Netflix has a similar problem with their recommendation algorithm: how do you know if a user will like a given film based on their previous ratings of other films in the catalogue? They also have thousands of films.
Netflix created a prize for researchers to submit an improved algorithm, and AT&T's team won the $1 million prize. Whilst much of Netflix's tech is proprietary you can read a paper on this submission online. It is essentially described in the book as:
Predicted Rating = Overall Average + Film offset + User offset + User-Film Interaction
Popular films have positive offsets and critical users have negative offsets.
The User-Film interaction part uses a latent feature model based on a user's interaction with similar films. These groupings of films are formed organically and are then labelled into the categories shown in the app. Each grouping is an axis in a multidimensional space and your position is unique to your preferences for each category (I'm going to try and do some more research on this).
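As a rough sketch of the formula above, the interaction term can be modelled as a dot product between a user's position and a film's position on the latent axes. The function name, the two-axis vectors and all of the numbers here are my own assumptions for illustration, not anything taken from the winning submission.

```python
def predict_rating(overall_avg, film_offset, user_offset, user_factors, film_factors):
    """Predicted Rating = Overall Average + Film offset + User offset + Interaction."""
    # Interaction: how well the user's tastes line up with the film on each latent axis.
    interaction = sum(u * f for u, f in zip(user_factors, film_factors))
    return overall_avg + film_offset + user_offset + interaction

# Hypothetical values: a popular film (+0.5), a critical user (-0.25),
# and two made-up latent axes.
rating = predict_rating(
    overall_avg=3.5,
    film_offset=0.5,
    user_offset=-0.25,
    user_factors=[1.0, 0.0],
    film_factors=[0.5, 0.25],
)
print(rating)  # 4.25
```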
As we've seen in recent months these algorithms have a much uglier side too: How YouTube's Algorithm Distorts the Truth
Further reading: How Does Spotify Know You So Well?
Henrietta Leavitt, Standard Candles & Machine Recognition
Whilst at Harvard she discovered 1,777 previously unknown pulsating stars. 25 of these, which were later found to be Cepheid Variables, were clustered together and were assumed to be about the same distance from Earth. Given this, she assumed that if one of these stars seemed brighter than another then it actually was. She collected data on each of the stars: how long it took their pulses to complete and their brightness/luminosity during this time (intrinsic brightness).
The cycle of these stars ranged from 1.25 days to 127 days and when she plotted this data it fell almost exactly on a straight line: http://www.astro.sunysb.edu/metchev/PHY515/ceph_pl.gif
Once period is measured, the brightness of the star can be inferred.
The distance to one of these stars was subsequently calculated, which meant that the distance of any other Cepheid Variable (or any galaxy containing one) could be calculated by: measuring the period of the star, using Leavitt's graph to infer what the actual (intrinsic) brightness of the star would be, and measuring the apparent brightness of the star as seen from Earth. It was then possible to use the difference between actual and apparent brightness to work out the distance, based on the assumption that brightness drops off with the square of distance.
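The last step can be sketched with the inverse square law: the brightness we measure is the star's actual output spread over a sphere whose radius is the distance. This is a minimal sketch that assumes the intrinsic brightness has already been read off Leavitt's graph; the function name and the numbers are mine and the units are arbitrary.

```python
import math

def distance_from_brightness(intrinsic_luminosity, apparent_flux):
    """Invert flux = luminosity / (4 * pi * distance**2) to get the distance."""
    return math.sqrt(intrinsic_luminosity / (4 * math.pi * apparent_flux))

# Hypothetical star: same intrinsic luminosity, seen at two apparent brightnesses.
luminosity = 4 * math.pi * 100.0
near = distance_from_brightness(luminosity, apparent_flux=1.0)
far = distance_from_brightness(luminosity, apparent_flux=0.25)

print(round(near, 6))  # 10.0
print(round(far, 6))   # 20.0
```

A star that looks four times dimmer than an identical one is twice as far away.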
Henrietta's discovery proved revolutionary and led to many, many discoveries mainly by male astronomers like Hubble. As with many women in science, she went largely unrecognised.
Much of AI and Machine Learning at the moment is pattern recognition similar to the above: finding the equation for a line that best fits a given data set based on some input parameters. Each input parameter is assigned a multiplier whose value is continually tweaked using trial and error until the best fit for the data is found. A higher multiplier for a parameter represents it having more importance towards predicting the outcome than the others. The book gives an example of house price prediction:
Price = 10,000 + 125 * (sq ft) + 26,000 * (Bathrooms)
One big danger of this method is overfitting. If your model is not trained on enough data then your algorithm could seem to be perfect based on your tests but terrible when exposed to more data in the real world.
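The house price formula above can be sketched in a few lines. The second function shows one way of finding a multiplier from data; note that it uses the closed-form least-squares solution for a single input rather than the trial-and-error tweaking described above, and the training data is made up so that it lies exactly on a known line.

```python
def predict_price(sq_ft, bathrooms):
    """The book's example: Price = 10,000 + 125 * (sq ft) + 26,000 * (Bathrooms)."""
    return 10_000 + 125 * sq_ft + 26_000 * bathrooms

def fit_line(xs, ys):
    """Least-squares fit of y = intercept + slope * x for a single input."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

print(predict_price(1000, 2))  # 187000

# Hypothetical training data generated from price = 10,000 + 125 * sq_ft.
sizes = [500, 1000, 1500, 2000]
prices = [10_000 + 125 * s for s in sizes]
intercept, slope = fit_line(sizes, prices)
```

With only four perfectly behaved points the fit looks flawless, which is exactly the situation where overfitting can hide.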
The connection between these stories seems a little tenuous as I'm pretty sure Leavitt was not the first to really make use of a line of best fit but it does make a nice segue for a beginner and give her some much deserved attention.
Self Driving Cars, Localization & Bayes' Rule
The third chapter focuses on self driving cars and one of the biggest problems for them to solve, SLAM (simultaneous localization and mapping): their "ability to construct or update a map of an unknown space and the agents positioning in it".
As humans SLAM is something we are able to solve at a young age but it is incredibly difficult for robots. The Moravec Paradox states "it is comparatively easy to make computers exhibit adult level performance on intelligence tests or playing checkers, and difficult or impossible to give them the skills of a one-year-old when it comes to perception and mobility".
USS Scorpion was a nuclear submarine which went missing in 1968 somewhere between the Azores and Norfolk, Virginia - a distance of 2,670 miles. Initial searches by the Navy revealed nothing. John Craven, Chief Scientist of the Navy's special projects office took over the search and used Bayesian Search Theory which had played a part in finding 1 of 4 hydrogen bombs in a crashed B-52 off the Spanish coast in 1966.
Bayesian Search Theory (named after Thomas Bayes) can be summarised as a way to "update our beliefs in light of new information, turning prior probabilities into posterior probabilities" or Prior beliefs + Data = Revised Beliefs
To find the Scorpion, Craven's team were able to narrow down where they thought the sub could be to 140 sq miles, based on underwater sonar recordings from surveillance stations. They interviewed submariners, ran simulations of different scenarios and created underwater explosions to calibrate their acoustical data, then created a probability map with this data. They continued to run tests and update their map until the ship was found in October 1968. The submarine was 260 yards from the grid square where they thought it was most likely to be (!).
The authors state that self driving cars with LIDAR sensors use Bayes' rule to create a probability of their position and of where they think the car, and all other objects on the road, are likely to be the next time the LIDAR's lasers take a snapshot of the surroundings. This gives the car a best guess of what will happen next and allows it to take action if, for example, another car looks like it has a high chance of moving into its path.
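The "update the map" step can be sketched as follows. This is the textbook form of a Bayesian search update, not a reconstruction of what Craven's team actually computed: the three grid squares, their prior probabilities and the 80% chance of spotting the sub when searching the right square are all invented for illustration.

```python
def update_after_failed_search(prior, searched, p_detect):
    """Revise a probability map after searching one square and finding nothing."""
    # The searched square keeps only the chance that the sub was there but missed.
    unnormalised = {
        cell: p * (1 - p_detect) if cell == searched else p
        for cell, p in prior.items()
    }
    total = sum(unnormalised.values())
    # Rescale so the revised beliefs add up to 1 again.
    return {cell: p / total for cell, p in unnormalised.items()}

# Hypothetical prior beliefs over three grid squares.
prior = {"A": 0.5, "B": 0.3, "C": 0.2}
posterior = update_after_failed_search(prior, searched="A", p_detect=0.8)

for cell, p in posterior.items():
    print(cell, round(p, 3))
```

After an empty search of square A, belief shifts towards B and C, so B becomes the next place to look.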
Bayes' Theorem can be particularly interesting when you apply it to medical diagnostics: 1% of women have breast cancer, 80% of mammograms detect the cancer when it is there, 9.6% of mammograms are false positives. If you get a positive result what is the probability you have cancer? The answer, surprisingly, is 7.8% and not, as many doctors thought, 70-80%. It's explained well here. It sounds surprising due to "base rate neglect": we ignore the prior information of the low 1% rate of cancer in the population.
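The 7.8% figure can be checked directly with Bayes' Theorem using the three numbers quoted above. The function and parameter names are my own.

```python
def posterior_probability(prior, sensitivity, false_positive_rate):
    """P(cancer | positive test) via Bayes' Theorem."""
    true_positives = prior * sensitivity                  # has cancer and tests positive
    false_positives = (1 - prior) * false_positive_rate   # no cancer but tests positive
    return true_positives / (true_positives + false_positives)

p = posterior_probability(prior=0.01, sensitivity=0.80, false_positive_rate=0.096)
print(f"{p:.1%}")  # 7.8%
```

The false positives from the 99% of women without cancer swamp the true positives from the 1% with it.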
Word Vectors and Language Understanding
When working with machine learning it is necessary to encode your data in a way that computers understand. This chapter focused on techniques to get code to understand text.
Label encoding: imagine you have a table of countries with population data, each country row can be encoded by giving the country a numerical value. This works, but as explained here, this could lead the model to assume a correlation between the numerical value increasing and the population increasing.
One way to get around this is One Hot Encoding: this involves taking your label-encoded data and then splitting the country column into a column for each label (country name). The numbers we used in label encoding are replaced with a 1 in the matching country column and 0 in every other column. There's another nice explanation here. Each country row is essentially a vector that describes that country.
To use One Hot Encoding to understand sentences we could create a table with a column for each word in our vocabulary and then, given a particular sentence, add a row for each word and mark the column that matches the word with 1, filling all others with 0. This is called a Word Vector and if there were five words in our vocabulary then it is essentially a five dimensional space where each word occupies one dimension.
As explained in more detail here, this is good but it doesn't give us any indication of similarities between words of a similar context: there would be no connection between good and nice despite their use being closely related in many sentences. To capture the context of a word and its similarity to others we need to use Word Embedding.
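A minimal sketch of one-hot word vectors for a five-word vocabulary. The vocabulary and sentence are made up for illustration.

```python
def one_hot(word, vocabulary):
    """Return a vector with a 1 in the column matching the word and 0 elsewhere."""
    return [1 if word == entry else 0 for entry in vocabulary]

# Hypothetical five-word vocabulary: one column (dimension) per word.
vocabulary = ["the", "film", "was", "good", "nice"]
sentence = ["the", "film", "was", "good"]

# One row per word in the sentence.
rows = [one_hot(word, vocabulary) for word in sentence]
for word, row in zip(sentence, rows):
    print(word, row)
```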
Word2Vec is one of the most popular techniques for creating word embeddings. The algorithm, developed by Tomáš Mikolov at Google, uses a two-layer neural network to create a distributed representation of a word.
Although distributed representation is, in a way, explained in the book I only came across the term afterwards when reading this. Basically if you want to encode a "big yellow BMW" then using the technique above we would have to create a column for big, yellow and BMW and then when we wanted to encode a "small green Audi" we would have to add three more. This is called localist representation and can cause the columns/memory units/dimensions needed to grow rapidly. Instead, it is much better to create three dimensions (size, colour and brand) with values on a continuous range from 0-1 (weights) in each column. This way we can represent thousands of cars easily and the representation of the word is spread across all units in the vector.
I need to look more into how Word2Vec actually creates the vectors, but as I understand it when trained with a large dataset of words it outputs a vocabulary with a vector for each word using upwards of 100 dimensions but unlike the examples above the dimensions aren't pre-defined and are decided on by the neural network. Words that are deemed to be similar will be closer in this 100 dimensional space.
The mind-blowing thing about word vectors though is that they can be used to do maths. As the authors explain: you can take the vector for king and subtract man, which leaves you with the gender-neutral concept of royalty. If you then add the vector for woman to this the result is queen. This technique can also work for tenses: captured - capture + go = went.
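Here is a toy sketch of that arithmetic. The two-dimensional vectors are hand-made by me, with one axis standing for "royalty" and one for "gender"; real Word2Vec vectors have many more dimensions and are learned from text rather than written by hand.

```python
import math

# Hand-made toy vectors: [royalty, gender]. Purely illustrative.
toy_vectors = {
    "king": [1.0, 1.0],
    "queen": [1.0, -1.0],
    "prince": [0.9, 0.9],
    "princess": [0.5, -1.0],
    "man": [0.0, 1.0],
    "woman": [0.0, -1.0],
}

def cosine(a, b):
    """Cosine similarity: 1.0 means the vectors point in the same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

def analogy(a, b, c, vectors):
    """Return the word whose vector is closest to vec(a) - vec(b) + vec(c)."""
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(target, vectors[w]))

result = analogy("king", "man", "woman", toy_vectors)
print(result)  # queen
```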
If there's one chapter that made the book worth reading it was this, it really illustrates something simply that I thought was far beyond anything I could conceive. I did have to look up some more information to write about them a bit more thoroughly here though. Here's some further reading:
An introduction to Word2Vec: https://skymind.ai/wiki/word2vec
The Amazing Power of Word Vectors: https://blog.acolyer.org/2016/04/21/the-amazing-power-of-word-vectors/
Variability and Anomalies
The Trial of the Pyx is a system implemented by the Royal Mint that has been operating continuously since 1150. Its purpose is to check coins issued by the Mint and, whilst ceremonial now, it was previously a key part of ensuring silver coins weren't made too light, allowing mint workers to pocket a nice bit of silver.
Out of every 60 pounds of silver struck by the Mint, 1 coin was set aside and when enough had accumulated they were weighed en masse. The bounds for the acceptable weight of one coin were set at ±1% but instead of weighing each coin separately they weighed them collectively. It might seem obvious to assume the acceptable bounds of 2500 coins would also be ±1% (I did!) but this is actually incorrect.
The reason, the authors state, is De Moivre's Equation or the Square Root Rule but I can't seem to find any relation between this and statistics online, just this. Their explanation does make sense though: the variability of a sample average shrinks in proportion to the square root of the sample size, ie: in a small sample a single anomaly can affect the average a lot but in a larger data set it is more likely to be balanced out. In the case of the issue above the bounds for 2500 coins should be something like ±0.02% (for a 100g coin)!
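The ±0.02% figure follows from dividing the single-coin bound by the square root of the number of coins. A minimal sketch, assuming each coin's deviation is independent of the others; the function name is mine.

```python
import math

def bound_on_average(single_item_bound, n):
    """Square root rule: the spread of an average of n items shrinks by sqrt(n)."""
    return single_item_bound / math.sqrt(n)

# A ±1% bound on one coin becomes a much tighter bound on the average of 2500.
bound = bound_on_average(single_item_bound=1.0, n=2500)
print(bound)  # 0.02 (percent)
```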
The rest of the chapter talks about large datasets being used for anomaly detection in sports, industry and finance and some techniques used to do that.
Medical Technology and Statistics
Florence Nightingale ("The Lady with the Lamp") is probably the most famous nurse in the UK if not the world. Up until recently, and reading this book, I was not aware of her pioneering work in statistics to improve outcomes for her patients as well as her use of infographics to illustrate the issues. She was the first woman ever elected to the Royal Statistical Society in 1859.
During the Crimean War, when Florence made her name tending to the wounded, the mortality rate was 60% from disease alone, a higher rate than the Great Plague in 1665. One of the charts she used to illustrate changes in mortality was the coxcomb (shown at the top of this article).
There is an interesting story in this chapter about a patient in the US whose health was failing and who had 126 kidney tests over the course of a few years. The test results when plotted showed clearly diminishing kidney function over time but no doctor had the time to do this and the patient eventually died. It's a good example of how applied statistics could be used to save lives, but which at present many doctors don't have the time to do.
Dr. Zoltan Takats at Imperial College has invented a smart electro-surgical knife that can detect if the cells being cut are cancerous as they are vaporized using a mass spectrometer. While the authors say this is AI, I can't immediately find any evidence of this.
The final chapter tackles the issue of AI bias and how the data sets used to train algorithms can affect their outcome, something I've found particularly of interest for a project I'm working on.
One of the examples the authors mention regards an image recognition algorithm developed by the US Army to detect tanks. It worked well when being trained but in the real world failed miserably as it turns out the images they had of Russian Tanks were all taken in the sunshine. It turns out the algorithm was detecting the sky conditions. This is a great story but I'm really disappointed to say trawling Google hasn't brought up a single bit of evidence that it is actually a true story.
It's a nice example of how AI bias might happen but as this author states it seems to be a very much exaggerated article to make AI seem far more rudimentary than it actually is. There are far better, more nuanced examples illustrating the same thing. As Benedict Evans explains here we often think about sociological biases but it is often the biases we can't so easily detect that might prove to be the real problem with AI.
The next example regards COMPAS, a recidivism detection algorithm used in the US which, when investigated by ProPublica, was found to be more likely to give black defendants a higher risk factor. Throughout the media the algorithm has been branded as racist. This may be the case since the algorithm is, shockingly, not open-source but as the authors and this article point out it is more likely that the data provided to it is biased, probably not deliberately (the training data isn't available either so we can't be sure), but due to the structural inequality and racism in society.
The authors argue that we should oppose absolutely closed source algorithms that hand out jail time without proper scrutiny but that well-studied and tested and open AI shouldn't necessarily be kept out of important decisions, as human bias in one form or another already affects almost everything we do and perhaps there would be a chance of changing things for the better.
I agree with this but with every week comes a new story of an algorithm gone wrong and it does become harder to stay optimistic or see any big real-world application for AI decision-making that wouldn't reflect our current biases.
As you can probably tell from my notes I learnt a lot from the book but found it much meatier and thought provoking towards the beginning. Some of my brief research has called into question some of the anecdotes but they do illustrate the authors' points well. I hoped for more discussion around hybrid human and artificial intelligence, the centaur model, but generally I could recommend this book to anyone with no or very basic knowledge of AI, like myself, who wants to explore what it all means and how it works. I'm glad to be taking away a beginner's understanding of word vectors which I find fascinating, hopefully I'll get a chance to work with them at some point soon!