Artificial Intelligence and Bad Data
Facebook, Google, and Twitter lawyers gave testimony to Congress on how they missed the Russian influence campaign. Even though the ads were bought in Russian currency on platforms chock full of analytics engines, the problematic nature of the influence campaign went undetected. "Rubles + U.S. politics" did not trigger a warning, because the nature of off-the-shelf deep learning is that it only looks for what it knows to watch for, and on a deeper level, it is learning from really messy (unstructured) or corrupted and biased information. Understanding the unstructured nature of public data (mixed with private data) is improving by leaps and bounds every day. That's one of the main things I work on. Let's focus instead on the data quality problem.
Here are a few of the many common data quality problems (a quick check for a couple of them is sketched just after the list):
- Data sparsity: We know a bit of the picture about a lot of things, but no clear picture on most things.
- Data abuse: Convert a PDF to text and print it. Yep. Lots of garbage comes out besides the text.
- Lots of irrelevant data: In a chess game, we can prune whole sections of the tree search, and more generally, in a picture of a cat, most of the pixels don't tell us how cute the cat is. In totally random data, we humans (and AI) can see patterns where there really are none.
- Learning from bad labeling: Bias of the labeling system, possibly due to human bias.
- Missing unexpected patterns: Black swans, regime change, class imbalance, etc.
- Learning wrong patterns: Correlation that is not really causation can be trained into an AI, which then assumes wrongly that the correlation is causative.
- I could go on.
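To make a couple of these concrete, here is a minimal sketch in Python with pandas (my tooling choice; the column names and values are invented, not taken from any real dataset) that checks for sparsity and class imbalance:

```python
import pandas as pd

# Hypothetical dataset: the columns and values are made up for illustration.
df = pd.DataFrame({
    "square_feet": [1200, None, 950, None, 2000, None],
    "year_built":  [1990, 1985, None, 2001, None, None],
    "label":       ["sold", "sold", "sold", "sold", "sold", "not_sold"],
})

# Data sparsity: fraction of missing values per column.
print(df.isna().mean())

# Class imbalance: one label dominates, so a model can look accurate
# by always predicting the majority class.
print(df["label"].value_counts(normalize=True))
```

Checks like these are cheap to run before any training happens, and they catch a surprising fraction of the problems in the list above.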
We know that labelled data is actually hard to come by for basically any problem, and even labelled data can be full of bias. I visited a prospective client on Friday that had a great data team but no ability to collect the data they needed from the real world because of ownership and IP problems. This "Rubles + US politics" example of good data that is missed by AI is not surprising to experts. Why? Well, AI needs to know what to look for, and the social media giants were looking for more aggressive types of attacks, like monitoring soldiers' movements based on their Facebook profiles. Indeed, the reason we miss signals from good data in general is the huge amount of BAD data in real systems like Twitter. This is a signal-to-noise ratio problem. If there are too many alerts, the warning system is ignored. Too few, and the system misses critical alerts. It is not only adversaries like the Russians trying to gain influence. The good guys, companies and brands, do the same thing. Drip campaigns and guerrilla marketing are just as much a tactic for spreading influence in shoe sales as in political meddling in an election. So, the real reason we miss signals from good data is bad data. Using simple predicate logic, we know that false assumptions can imply anything (also this). So learning from data we know is error-riddled carries some real baggage.
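That alert threshold tradeoff is easy to see in code. Here is a minimal sketch with numpy and scikit-learn (my choice of tools; the scores and labels are synthetic, not from any real monitoring system) showing that a low threshold buries the real incidents in false alarms, while a high threshold misses most of them:

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

rng = np.random.default_rng(0)

# Synthetic anomaly scores: 1,000 benign events plus 10 real incidents
# that score somewhat higher on average.
benign = rng.normal(0.2, 0.1, 1000)
incidents = rng.normal(0.6, 0.1, 10)
scores = np.concatenate([benign, incidents])
labels = np.concatenate([np.zeros(1000), np.ones(10)])

for threshold in [0.3, 0.45, 0.6]:
    alerts = scores > threshold
    print(
        f"threshold={threshold}: {alerts.sum()} alerts, "
        f"precision={precision_score(labels, alerts, zero_division=0):.2f}, "
        f"recall={recall_score(labels, alerts, zero_division=0):.2f}"
    )
```

Neither extreme is useful: the analyst either drowns in alerts or never sees the critical one.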
One example of bad data is finding that your AI model was trained on the wrong type of data. The text from a chat conversation is not like text from a newspaper. Both are composed of text, but their content is very different. AI trained on the Wikipedia dataset or Google News articles will not correctly understand (i.e. "model") the free-form text we humans use to communicate in chat applications. Here is a slightly better dataset for that, and maybe the comments from the Hacker News dataset too. Often we need to use the right pre-trained model or off-the-shelf dataset for the right problem, and then do some transfer learning to improve from the baseline. Still, this assumes we can use the data at all. Many public datasets have even bigger bad data issues that cause the model to just fail. Sometimes a field is used and sometimes it is left blank (sparsity). Sometimes non-numeric data creeps into numerical columns ("one" vs 1). I found an outlier in a large private real estate dataset where one entry among a million was a huge number entered by a human as a fat-finger error.
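Both the "one" vs 1 problem and the fat-finger outlier can be caught with a few lines of pandas. This is a minimal sketch with invented values (not the actual real estate data, which I can't share):

```python
import pandas as pd

# Hypothetical price column with one text entry and one fat-finger outlier.
prices = pd.Series([250_000, 310_000, "one", 275_000, 298_000, 2_750_000_000])

# Coerce to numeric: entries like "one" become NaN, so they are easy to list.
numeric = pd.to_numeric(prices, errors="coerce")
print("non-numeric entries:", prices[numeric.isna()].tolist())

# Flag values far outside the interquartile range as possible data-entry errors.
q1, q3 = numeric.quantile([0.25, 0.75])
iqr = q3 - q1
outliers = numeric[(numeric < q1 - 3 * iqr) | (numeric > q3 + 3 * iqr)]
print("possible fat-finger errors:", outliers.dropna().tolist())
```

Neither check needs machine learning at all, which is part of the point: a lot of bad data can be filtered out before the model ever sees it.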
Problems like the game of Go (AlphaGo Zero) have no bad data to analyze. Instead, the AI evaluates more relevant and less relevant data. Games are a nice constrained problem set, but in most real-world data, there is bias. Lots of it. Boosting and other techniques can be helpful too. The truth is that some aspects of machine learning are still open problems, and shocking improvements happen all the time. Example: capsule network beats CNN.
It is important to know when error is caused by bad things in the data rather than caused by improperly fitting to the data. And live systems that learn while they operate, like humans do, are particularly susceptible to learning incorrect information from bad data. This is kind of like Simpson's paradox, in that the data is usually right, and so fitting the data is a good thing, but sometimes fitting to the data produces paradoxes because the method itself (fitting to the data) is based on a bad assumption that all data is ground truth data. See this video for more on Simpson's paradox fun. And here is another link to Autodesk's datasaurus, which I just love. It is totally worth reading in full.
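To see why fitting pooled data can mislead, here is a minimal Simpson's paradox sketch in pandas. The counts are invented and chosen specifically so the paradox appears: treatment B wins inside every group, yet treatment A looks better once the groups are pooled, because the group sizes are unequal.

```python
import pandas as pd

# Invented outcomes: group X mostly received treatment A, group Y mostly treatment B.
rows = (
    [{"group": "X", "treatment": "A", "success": 1}] * 7
    + [{"group": "X", "treatment": "A", "success": 0}] * 1
    + [{"group": "X", "treatment": "B", "success": 1}] * 2
    + [{"group": "Y", "treatment": "A", "success": 1}] * 1
    + [{"group": "Y", "treatment": "A", "success": 0}] * 1
    + [{"group": "Y", "treatment": "B", "success": 1}] * 5
    + [{"group": "Y", "treatment": "B", "success": 0}] * 3
)
df = pd.DataFrame(rows)

# Within each group, B has the higher success rate (X: 1.00 vs 0.875, Y: 0.625 vs 0.50)...
print(df.groupby(["group", "treatment"])["success"].mean())

# ...but pooled across groups, A looks better (0.80 vs 0.70).
print(df.groupby("treatment")["success"].mean())
```

A model fit on the pooled data alone would happily learn the wrong conclusion, which is exactly the "all data is ground truth" assumption going wrong.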
We talked about the fact that most real-world data is full of corruption and bias. That kind of sucks, but not all is lost. There are a variety of techniques for combating bad data quality, not the least of which are collecting more data and cleaning up the data. More advanced techniques like ensembles with NLP, knowledge graphs, and commercial-grade analytics are not easy to get your hands on. More on this in future articles.
If you enjoyed this article on bad data and artificial intelligence, then please try out the clapping tool. Tap that. Follow us on Medium. Share on Facebook and Twitter. Go for it. I'm also happy to hear your feedback in the comments. What do you think?
Happy Coding!
-Daniel
daniel@lemay.ai ← Say hello.
Lemay.ai
1(855)LEMAY-AI
Other articles you may enjoy:
- How to Price an AI Project
- How to Hire an AI Consultant
- Artificial Intelligence: Get your users to label your data
Source: https://towardsdatascience.com/artificial-intelligence-and-bad-data-fbf2564c541a