So what is this?
Just as my title says, this weekend project (actually developed over summer break over the course of a week) is a system that uses machine learning to detect conspiracy theories. In other words, I applied the miracles of modern mathematics and computer science to the prickly problem of detecting people who are totally off their rockers, or, in the words of the verdict that my machine gives, “bonkers!”. If you would like, you can read more about my motivations and the actual development of the project here.
You can also check out the code to the project here, and I have put together a web app that you can access here. Unfortunately, because I am now hosting this site on Heroku (setting it up for their free tier was a whole lot simpler than trying to run a server accessible to the wider world on a university network from my dorm room), that model has a simplified tokenizer to save processing power and is thus less accurate.
And what does it actually do?
What it does…
This state-of-the-art system, trained upon the craziness of the ilk of Alex Jones and David Duke (alongside many others, as well as more rational literature to prevent the machine from gaining too dim a view of humanity if it ever becomes sentient), delivers a verdict of “Bonkers!” or “Not too nuts!” when given a text file as input.
What it does not do…
An important caveat is that this system does not detect bias or even fake news (a much tougher problem, because fake news can look pretty similar to reputable news sources). Instead, it focuses on the previously (at least to the best of my knowledge) untackled problem of statistically determining whether a given piece of text is absolutely nuts (shape-shifting, Jewish, alien lizard nuts).
Another important factor is that this system is entirely constructed from my biases. It believes that the New World Order of Dick Cheney and the Communist, gay Jews is a conspiracy theory (although I would not put it past Cheney) precisely because I think that this is a completely off-its-rocker conspiracy theory constructed by a lonely and chemically imbalanced talk show host, terrorist, or pseudo-intellectual who forgot to take his or her pills. If you do not agree with my assessment of what constitutes a crazy conspiracy theory, then you will not agree with this system’s verdicts, which reflect pretty strongly exactly what I think of the likes of Alex Jones, David Duke, Lyndon LaRouche, William Jennings Bryan, and company. In other words, if you believe Conservapedia, then this bot is probably not for you.
Can it actually do that?
Yes! This system can scientifically determine, based only upon its training and the input text file, whether or not that text is a conspiracy theory. In fact, if you stick around a little longer, I will show you some results, as well as just what exactly a conspiracy theory really is.
And how does it do any of this?
Magic!
No, not quite.
Dark Magic!
Closer, but still not quite accurate.
Support Vector Machines!
Yep, almost indistinguishable from dark magic, this system is actually powered by math! It uses a support vector machine (SVM), a machine learning model that can be used as a binary classifier: think of it as a much fancier spam filter that is capable of assigning new inputs (after training) to one of two categories (bonkers and not bonkers, in this case). My blog post on this project here delves a little deeper into the actual mechanics of this process (hint: thanks to David Duke and his love for Jews and communists, I had to add word stemming, among other things, to the tokenization process to keep this system from jumping every time it heard one of those two words).
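For the curious, here is a minimal sketch of this kind of pipeline using scikit-learn and NLTK. It is an illustration in the spirit of the approach, not the actual code from the repo, and the `texts`/`labels` names are placeholders:

```python
# A minimal sketch (not the repo's code): TF-IDF features built with a
# stemming tokenizer, feeding a linear SVM binary classifier.
# Requires NLTK's "punkt" data: python -c "import nltk; nltk.download('punkt')"
from nltk.stem.snowball import SnowballStemmer
from nltk.tokenize import word_tokenize
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

stemmer = SnowballStemmer("english")

def stemming_tokenizer(text):
    # Lowercase and stem every token so that inflected variants
    # (e.g. "jews" -> "jew") map to a shared feature instead of
    # showing up as separate, over-weighted vocabulary terms.
    return [stemmer.stem(token) for token in word_tokenize(text.lower())]

model = Pipeline([
    ("tfidf", TfidfVectorizer(tokenizer=stemming_tokenizer)),
    ("svm", LinearSVC()),
])

# texts: a list of raw documents; labels: "bonkers" or "fine" for each.
# model.fit(texts, labels)
# print(model.predict(["The lizard people control the weather."]))
```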
So what constitutes a crazy text?
After each training session, the system extracts the top twenty features for each category, which you can see here (a sketch of how these weights can be read off the model follows the table):
Top 20 most important features:

Rank | Weight | Bonkers Feature | Weight | Fine Feature |
---|---|---|---|---|
1 | -1.91936 | o | 0.964309 | change |
2 | -1.21264 | jew | 0.87151 | wildlife |
3 | -1.09732 | government | 0.82531 | reynard |
4 | -1.09311 | ei | 0.805876 | trump |
5 | -1.08023 | louse | 0.805513 | senator |
6 | -1.06831 | n | 0.773819 | specie |
7 | -1.062 | txt | 0.753482 | â |
8 | -1.01539 | ee | 0.749294 | prince |
9 | -0.999346 | obama | 0.719461 | sea |
10 | -0.930854 | written | 0.703385 | energy |
11 | -0.91092 | jewish | 0.678142 | place |
12 | -0.901538 | magazine | 0.672481 | climate |
13 | -0.886269 | medium | 0.670092 | important |
14 | -0.885437 | american | 0.655379 | point |
15 | -0.829171 | right | 0.651944 | section |
16 | -0.807909 | st | 0.648213 | photon |
17 | -0.7689 | d | 0.641958 | orofino |
18 | -0.761795 | litical | 0.641508 | king |
19 | -0.744603 | generator | 0.619919 | conflict |
20 | -0.742536 | freedom | 0.60663 | himself |
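If you are wondering how a table like this falls out of the model: a trained linear SVM keeps one weight per vocabulary term, so sorting those weights surfaces the strongest terms on either side of the hyperplane. A hedged sketch, reusing the `model` pipeline from the earlier snippet (and assuming a recent scikit-learn, where the vectorizer method is `get_feature_names_out`):

```python
# Pull the per-term weights out of a fitted LinearSVC and print the
# twenty strongest terms on each side of the separating hyperplane.
import numpy as np

terms = np.array(model.named_steps["tfidf"].get_feature_names_out())
weights = model.named_steps["svm"].coef_.ravel()  # one weight per term

order = np.argsort(weights)
print("Weight | Bonkers | Weight | Fine")
for neg, pos in zip(order[:20], order[::-1][:20]):
    # With classes sorted as ["bonkers", "fine"], negative weights pull
    # toward "bonkers" and positive weights pull toward "fine".
    print(f"{weights[neg]:.5f} | {terms[neg]} | {weights[pos]:.5f} | {terms[pos]}")
```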
Or, if bar charts are more your speed, you can see the top 20 features for each category here:
Now, some weird things come up, mainly on the crazy side, because the input text is still a little dirty despite several attempts to clean it up. However, the system actually works quite well, as you can see from its confusion matrix and classification report (both computed on the test set):
Classification Report:

Class | Precision | Recall | F1-score | Support |
---|---|---|---|---|
bonkers | 0.99 | 0.99 | 0.99 | 298 |
fine | 0.99 | 0.99 | 0.99 | 303 |
avg / total | 0.99 | 0.99 | 0.99 | 601 |
Confusion Matrix:

 | Predicted bonkers | Predicted fine |
---|---|---|
Actual bonkers | 294 | 4 |
Actual fine | 4 | 299 |
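For reference, numbers in this format come straight out of scikit-learn’s standard evaluation helpers; a minimal sketch (assuming the `model`, `texts`, and `labels` placeholders from the earlier snippets, and a 70/30 train/test split as an illustrative choice) would be:

```python
# Hold out a test set, fit on the rest, and report performance.
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.3, random_state=42)

model.fit(X_train, y_train)
predictions = model.predict(X_test)

print(classification_report(y_test, predictions))
# Rows are the true classes, columns are the predicted ones.
print(confusion_matrix(y_test, predictions))
```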
Overall, the top features seem to indicate that a lot of conspiracy theories tend to focus on Jews, Obama, and the government, as well as on our freedom being taken away, judging by the weights on “American” and “freedom”.
So, can I see it in action?
Of course! I have open sourced the code here, and here is a sample result:
Input classified as:
Verdict | Not too nuts! |
---|---|
Certainty | 0.23487657038082999 |
This sample corresponds to the Communist Manifesto, a text which originally gave me some trouble: while it is not an insane conspiracy theory, the repeated use of the word “communist” was not doing it any favors (the same tokenization modifications I made to deal with David Duke’s obsession with Jews also fixed this). Also, the certainty does not correspond to a probability; according to sklearn’s documentation, it is “the signed distance of that sample to the hyperplane”, or in other words a completely useless metric for most people, and the first thing that I will change when I revisit the code.
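One likely fix is Platt scaling: wrap the SVM in scikit-learn’s CalibratedClassifierCV so it can report an actual probability. Here is a sketch (again reusing the hypothetical `texts`/`labels` names, and not a commitment to what the revised code will look like):

```python
# Calibrate the linear SVM with Platt scaling so the model exposes
# predict_proba instead of only a signed hyperplane distance.
from sklearn.calibration import CalibratedClassifierCV
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

calibrated = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("svm", CalibratedClassifierCV(LinearSVC(), cv=5)),  # sigmoid (Platt) by default
])
calibrated.fit(texts, labels)

# One probability per class -- far easier to read than a raw distance.
print(calibrated.predict_proba(["Workers of the world, unite!"]))
```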
How does it perform in the real world?
In addition to the 99% accuracy that you saw above, my ultimate test was seeing if the system could differentiate between a Washington Post article and a Conservapedia article (a Christian fundamentalist version of Wikipedia that, among other things, holds that Obama is an Islamist terrorist, that liberals are pretty much genocidal maniacs, and that conversion therapy works).
The system passed with flying colors!
Awards
None, I just had a lot of fun working on this and felt like putting it up here because of that.