Mostly it depends on what your goals are and what your dataset looks like. There are two big divides here, one on each side: structured vs. unstructured data, and modeling vs. understanding as your goal.
Structured Data – here, duplicates very much come with the territory. In this situation you’ve also likely got a lot of implicit ambiguity in your problem. Let’s say that you have 5 structured columns with 3 categories each. In addition you’ve got one target column. You might have a million rows in your dataset and there’s no way that your columns are totally predictive of your target column. In this case you need to understand how reliable a given outcome is given a set of inputs. Duplicate inputs result in some distribution across your output and thus you need to retain that distribution. In this case removing examples is highly destructive and must be avoided.
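To make the structured case concrete, here's a minimal sketch of treating duplicate inputs as an empirical distribution over the target rather than something to delete. The feature and target names are hypothetical, not from the original answer:

```python
# Sketch: with duplicated structured inputs, keep the duplicates and treat
# repeated rows as an empirical distribution over the target column.
# Feature/target names here are hypothetical placeholders.
from collections import Counter, defaultdict

rows = [
    # (feat_a, feat_b, target) -- three identical inputs with conflicting targets
    ("x", "p", 1),
    ("x", "p", 0),
    ("x", "p", 1),
    ("y", "q", 1),
]

# Group identical inputs and estimate P(target | inputs) from the duplicates.
by_input = defaultdict(Counter)
for *features, target in rows:
    by_input[tuple(features)][target] += 1

outcome_dist = {
    features: {t: n / sum(counts.values()) for t, n in counts.items()}
    for features, counts in by_input.items()
}
print(outcome_dist[("x", "p")])  # roughly {1: 0.67, 0: 0.33}
```

Dropping the duplicates here would collapse that distribution to a single point and destroy exactly the reliability information you need.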
Unstructured Data – here duplicates are weird. They’re drastically less common than in the structured space, and importantly much more problematic. They typically represent strange edge cases, ETL issues, or other aberrations in data processing and if you see duplicates with any kind of real frequency here, either your unstructured task is actually a structured task or (much more likely) you’ve got an ETL issue. In these cases as a general rule it’s a good idea to remove duplicate data – or at the very least understand why it’s there and whether or not it is legitimate.
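Before removing anything, it helps to surface the duplicates so you can inspect whether they're legitimate. A minimal sketch of exact-duplicate detection over text; the normalization (lowercasing, whitespace collapse) is an assumption you should tune to your own data:

```python
# Sketch: flag exact duplicates in an unstructured text dataset so they can
# be inspected (ETL issue? legitimate?) before deciding to drop them.
# The normalization choice below is an assumption, not a universal rule.
import hashlib

def text_key(text: str) -> str:
    # Normalize case and whitespace, then hash for a compact dedup key.
    normalized = " ".join(text.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

def find_duplicates(texts):
    seen, dupes = {}, []
    for i, t in enumerate(texts):
        k = text_key(t)
        if k in seen:
            dupes.append((seen[k], i))  # (first occurrence, duplicate)
        else:
            seen[k] = i
    return dupes

docs = ["Great product!", "great  product!", "Terrible support."]
print(find_duplicates(docs))  # [(0, 1)]
```

If this list is more than a sliver of your dataset, that's the ETL red flag described above.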
Modeling – Maybe not the best term, but the idea here is that in a modeling context you have pretty strong certainty and guarantees that the data you’re looking at is fully representative of the data you’ll be testing against. You have a clear sense of what metric you’re looking to optimize and your goal really is to just optimize that metric within your closed system. If you have a very high degree of certainty that this is the case then feel free to leave your duplicates in. These problems are common in academia, and rare in industry. Important: This does not apply to most problems, and is generally assumed to be more common than it actually is.
Understanding – Here, quite simply, the goal is to create some piece of functionality (i.e. sentiment analysis) that operates in an open system. Something with free user input, something where you can say with a reasonable degree of certainty that your training data is not fully representative of the eventual data your model will be applied to. Hopefully the distributions are close, but especially in the unstructured case your ability to make any assumptions about this is very limited. In these cases you absolutely should remove duplicate data, as it will otherwise give you an inflated sense of model efficacy. The high-level logic is simply that for general understanding no single example is worth more than another, and memorization is not understanding.
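In the open-system setting this usually means two mechanical steps: dedupe the training data, and make sure nothing in the test set also appears in training, since that overlap is exactly what inflates apparent efficacy. A minimal sketch with hypothetical data:

```python
# Sketch: dedupe training data and strip train/test overlap in the
# "understanding" (open-system) setting. Texts here are hypothetical.
def _key(text: str) -> str:
    return " ".join(text.lower().split())

def dedupe(texts):
    seen, out = set(), []
    for t in texts:
        k = _key(t)
        if k not in seen:
            seen.add(k)
            out.append(t)
    return out

def remove_contamination(train, test):
    # Drop test examples whose normalized text also appears in training data.
    train_keys = {_key(t) for t in train}
    return [t for t in test if _key(t) not in train_keys]

train = dedupe(["I love it", "i love it", "awful"])
test = remove_contamination(train, ["I LOVE IT", "meh"])
print(train, test)  # ['I love it', 'awful'] ['meh']
```

The normalization is the same assumption as before: exact matching after case/whitespace folding, which you'd adapt to your own notion of "duplicate".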
Now, this is where things get tricky. If you're reading carefully you'll recall the first point I made on the structured side: that duplicate inputs can be important to inform the output distribution. This is much rarer on the unstructured side, and importantly it means something very different there.
To use the example of sentiment analysis – this means that you have two identical pieces of text, one tagged as positive and the other tagged as negative. That signals an issue with your problem framing. There are three main possibilities:
- You’re asking the wrong question – this is the most likely. In the case of sentiment analysis the issue is that positive/negative actually makes little sense in most cases. Adding an additional neutral class, or further specifying what you mean by “sentiment” (i.e. “I love your customer support, but I don’t want to renew my subscription”) can help dramatically. Ideally you discover issues like this early on and determine resolutions that heavily mitigate any disagreements on duplicates.
- The problem is intrinsically ambiguous – We like to believe that humans generally agree on common sense, but study after study shows that this just isn’t the case. Even for something as basic as sentiment there may not be a true sense of “positive” and “negative”; the concept itself is fuzzy. In this case a small number of duplicates is acceptable, but if your duplicate rate climbs above a couple percent and disagreements are frequent, then you’re back up at point 1.
- The problem is impossible – We like to joke at indico about the use case that every customer starts with: “Is there any chance that you could just scrape the internet and find us trading signal?” No joke, we get that ask every day. We once worked with a customer asking us to, again, “scrape the internet” and then tell them which companies were about to go through M&A activity. After digging a bit deeper we found that they had an existing team of ~60 people already doing this. We looked at a small sample of data and found large distributional errors (and a high duplicate rate). So we asked, “Does this process work?” The answer was a resounding “no.” As a general rule, if humans can’t do something with unstructured data then computers sure as hell can’t.
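Whichever of the three cases you're in, the first step is the same: measure how often identical texts carry different labels. A minimal sketch of contradiction detection, with hypothetical texts and labels:

```python
# Sketch: find "contradictions" -- identical texts tagged with different
# labels -- which signal one of the framing problems described above.
# Texts and labels are hypothetical placeholders.
from collections import defaultdict

def find_contradictions(examples):
    labels_by_text = defaultdict(set)
    for text, label in examples:
        labels_by_text[" ".join(text.lower().split())].add(label)
    # Keep only texts that were tagged with more than one distinct label.
    return {t: sorted(ls) for t, ls in labels_by_text.items() if len(ls) > 1}

data = [
    ("I love your support, but I won't renew.", "positive"),
    ("I love your support, but I won't renew.", "negative"),
    ("Great service.", "positive"),
]
print(find_contradictions(data))
```

A handful of hits suggests intrinsic ambiguity; a few percent or more suggests you're asking the wrong question.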
The last point I’ll bring up is that, above all, academic integrity is what matters. In some cases including duplicate data makes sense; in some it does not. In some cases the mere presence of duplicate data might sour the entire use case. Duplicate data has very unintuitive effects on metrics of model efficacy: interpreting even something as simple as an accuracy number is impossible without a good understanding of the rates of duplication and contradiction in your dataset. Correspondingly, you must be certain to disclose these rates. Ideally you should report efficacy metrics both with and without duplicates to shed a little more light on the generalizability of the particular model.
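The with/without-duplicates comparison above is cheap to produce. A minimal sketch of the bookkeeping; the predictions and labels here are fabricated placeholders to show the mechanics, not real results:

```python
# Sketch: report accuracy both with and without duplicates, since duplicates
# can quietly inflate the headline number. Data below is a fabricated
# placeholder to illustrate the bookkeeping.
def accuracy(triples):
    return sum(pred == gold for _, pred, gold in triples) / len(triples)

# (text, prediction, gold_label)
results = [
    ("good stuff", "pos", "pos"),
    ("good stuff", "pos", "pos"),   # duplicate: inflates accuracy
    ("bad stuff", "pos", "neg"),
]

seen, deduped = set(), []
for text, pred, gold in results:
    if text not in seen:
        seen.add(text)
        deduped.append((text, pred, gold))

print(f"with duplicates:    {accuracy(results):.2f}")   # 0.67
print(f"without duplicates: {accuracy(deduped):.2f}")   # 0.50
```

The gap between the two numbers is itself a useful disclosure: it bounds how much of your measured efficacy is riding on repeated examples.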
Maintain academic integrity, disclose curiosities of your data, and if you wittingly allow test/train contamination then you must disclose that. Everything else is secondary.
View original question on Quora >