Description
Right now we assume no feedback between the adversary and the classifier.
What if the adversary has access to the labels? What if the adversary has access to the raw probabilities? What if the adversary has access to some observation that can be linked back to the label or probability?
These questions are very broad, and while some have been addressed in the machine learning literature, there are many possible takes on them as they specifically apply to text classification.
Potential ideas (this list will grow):
- Use LIME to identify words that are important to the classification result and apply targeted attacks to those words
- Simulate a sequence of back-and-forths between classifier and adversary
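As a rough illustration of the first idea, here is a minimal sketch of a LIME-style attack. Instead of the actual `lime` package, it uses a simple leave-one-out approximation of word importance (drop each word, measure the change in the predicted probability), then replaces the highest-impact words. The `toy_predict` classifier, the `replacement` token, and the `budget` parameter are all hypothetical stand-ins, not part of any existing code here.

```python
def word_importances(text, predict_proba):
    """Leave-one-out importance: how much the score drops when a word is removed."""
    words = text.split()
    base = predict_proba(text)
    scores = []
    for i in range(len(words)):
        perturbed = " ".join(words[:i] + words[i + 1:])
        scores.append((words[i], base - predict_proba(perturbed)))
    # Most influential words first
    return sorted(scores, key=lambda pair: pair[1], reverse=True)

def targeted_attack(text, predict_proba, budget=2, replacement="xx"):
    """Replace up to `budget` of the most important words with a filler token."""
    important = {w for w, _ in word_importances(text, predict_proba)[:budget]}
    out, replaced = [], 0
    for w in text.split():
        if w in important and replaced < budget:
            out.append(replacement)
            replaced += 1
        else:
            out.append(w)
    return " ".join(out)

# Toy "spam" classifier: probability rises with the number of trigger words.
def toy_predict(text):
    triggers = {"free", "winner", "cash"}
    hits = sum(1 for w in text.split() if w in triggers)
    return min(1.0, 0.2 + 0.3 * hits)

msg = "claim your free cash prize now"
adv = targeted_attack(msg, toy_predict)  # trigger words get masked, score drops
```

With the real LIME explainer the importance step would be a locally weighted linear fit over many perturbed samples rather than single-word deletion, but the attack loop would look the same. The back-and-forth idea in the second bullet could then be simulated by alternating this attack with classifier retraining.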