Déjà views? A brief look at methods for cutting down on duplicates.
“We have seen so many copies of the same document! Can’t you do more to deduplicate?”
Yes, but! There are many excellent ways to reduce the number of duplicate or similar documents that your team has to lay eyes on. But they need to be applied with caution.
First of all: “global deduplication” usually refers to the process, during initial processing of documents, whereby the software identifies exact duplicates by their “hash value.” For example, imagine 12 of your custodians were addressees of a blast email announcing there were donuts in the break room, with an attachment detailing the types of pastries available.
A good deduplication process will remove 11 copies of the email from your review set, and add the names of all 12 custodians to the “custodian” (or “duplicate custodian”) field of the remaining copy.
It should not remove or eliminate:
the copy of the attachment that the sender had saved to her file share;
the email “FWD: Donuts in the Break Room!” that one of your custodians sent to a friend;
the reply from one of your custodians thanking the sender; or
the copy of the email that was printed out by an overzealous donut fan, then mistakenly filed in a folder containing documents relevant to your case, such that it was scanned to PDF during collection.
Needless to say, if anyone accidentally hit reply-all, it will not remove all of the replies saying “take me off this list” and “stop using reply-all”. So even with proper deduplication up front, your reviewers are going to know a lot more about the office’s confectionery splurge than they ever wanted to.
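To make the mechanics concrete, here is a minimal sketch of hash-based global deduplication. It assumes each document is a simple record with raw bytes and a custodian name; the field names and the use of SHA-256 are illustrative choices, not a description of any particular review platform.

```python
import hashlib

def deduplicate(documents):
    """Collapse exact duplicates by content hash, keeping one copy
    and recording every other custodian who held the same bytes."""
    seen = {}  # hash value -> the single copy we keep
    for doc in documents:
        digest = hashlib.sha256(doc["content"]).hexdigest()
        if digest in seen:
            # Exact duplicate: note the custodian, drop the copy.
            seen[digest]["duplicate_custodians"].append(doc["custodian"])
        else:
            doc["duplicate_custodians"] = []
            seen[digest] = doc
    return list(seen.values())

docs = [
    {"content": b"Donuts in the break room!", "custodian": "Alice"},
    {"content": b"Donuts in the break room!", "custodian": "Bob"},
    {"content": b"FWD: Donuts in the break room!", "custodian": "Carol"},
]
kept = deduplicate(docs)
# The forward survives on its own: its bytes differ, so its hash differs.
```

Note that the forward is not touched, which is exactly the point of the list above: hashing only catches byte-for-byte copies.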
How do we “deduplicate” things that are not, according to the computer’s strict definition, duplicates? And what can go wrong when we do?
First: You can use technology-assisted review tools to prioritize review of responsive and issue-relevant documents, putting the ones you want at the top of the pile. Unless something went really wrong with those free donuts, it’s unlikely any of the above emails will be prioritized for review. Your reviewers may only see discussion of glazed vs. plain when “training” the system using a random set, or when doing a confirmatory review of a set of low-probability documents at the end of the review process.
Second: You can set up your review so it only shows reviewers the most “inclusive” or “complete” version of an email “thread.” Instead of seeing every Re: and Fwd:, they will see:
the last email in the reply-all fiasco (from the CEO: “The next person who replies to this thread or does anything but delete it is getting fired.”), with all of the other ill-advised contributions to the discussion represented below it, back to the original donut email;
the original email, but only once, since it had an important attachment;
forwarded versions or ones that changed the recipient list, since each change in recipient created a unique discussion thread.
Threading is a great way to reduce the number of times your review team has to look at the same thing — and the number of documents you ultimately have to produce. But your reviewers need to be careful to look at the whole document, lest they miss the email from counsel (“threatening to fire an employee based on a poorly-thought-through internal email may violate the Americans Who Can’t Use Email Protection Act”) halfway through.
Moreover, threading needs to be done up front. If you try to eliminate incomplete versions of email threads as an afterthought later in the review, you can create unintended chaos.
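The idea of keeping only the “inclusive” emails can be sketched in a few lines. This is a deliberate simplification: each email is modeled as the ordered chain of messages it contains (its own body plus everything quoted below it), which ignores real-world wrinkles like attachments and edited quotes that threading engines must handle.

```python
def inclusive_emails(emails):
    """Keep only emails whose content is not wholly contained in some
    later reply; those are the 'inclusive' ones a reviewer must read."""
    def contained_in(a, b):
        # a is redundant if b's chain starts with a's entire chain.
        return a != b and b[: len(a)] == a
    return [e for e in emails
            if not any(contained_in(e, other) for other in emails)]

thread = [
    ("donuts",),                           # original announcement
    ("donuts", "thanks!"),                 # a reply
    ("donuts", "thanks!", "reply-all?!"),  # the fiasco begins
    ("donuts", "fwd to a friend"),         # forward: a new branch, so it stays
]
kept = inclusive_emails(thread)
# Only the endpoint of each branch survives for review.
```

Each change in recipients (the forward) starts a new branch, so it is kept even though it begins with the same original message.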
Third, you can “propagate” coding across your database to every duplicate document. This avoids inconsistent coding and redaction on the same document, and lets your reviewers work more quickly since some of the items in their sets are already tagged. But beware. Propagation looks for duplicates at the document level, regardless of whether a document is an attachment or a standalone. Your reviewer may casually tag that list of donut flavors as responsive when it is attached to a larger email relevant to the plaintiff’s claim; as a result, the attachment to the original donut email, the saved and scanned versions, and the forwarded emails are all going to be pulled into your production set, along with (depending on your production setup) their parent emails and sibling attachments. In sum: like mass-tagging, propagation can introduce errors and sneak things into your production set that you did not expect.
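Here is a hypothetical sketch of how that side effect plays out: a tag propagates to every hash duplicate, and producing by family then drags each duplicate’s whole family along. The field names and family logic are illustrative assumptions.

```python
def propagate_tag(documents, tagged_doc_id, tag):
    """Copy a tag to every document sharing the tagged document's hash,
    whether it is an attachment or a standalone file."""
    docs_by_id = {d["id"]: d for d in documents}
    target_hash = docs_by_id[tagged_doc_id]["hash"]
    for d in documents:
        if d["hash"] == target_hash:
            d["tags"].add(tag)

def production_set(documents, tag):
    """Produce by family: any tagged document pulls in its whole family."""
    families = {d["family"] for d in documents if tag in d["tags"]}
    return [d["id"] for d in documents if d["family"] in families]

docs = [
    {"id": "E1", "hash": "aaa", "family": "F1", "tags": set()},  # donut email
    {"id": "A1", "hash": "bbb", "family": "F1", "tags": set()},  # flavor list, attached
    {"id": "E2", "hash": "ccc", "family": "F2", "tags": set()},  # relevant email
    {"id": "A2", "hash": "bbb", "family": "F2", "tags": set()},  # same flavor list, attached
]
propagate_tag(docs, "A2", "responsive")
produced = production_set(docs, "responsive")
# Tagging A2 also tags A1 (same hash), so family F1 -- donut email
# included -- lands in the production set alongside F2.
```

One reviewer’s tag on one attachment is enough to sweep an unrelated parent email into the production, which is the surprise the paragraph above warns about.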
Fourth, you can tell your reviewers to soldier on through the junk (food) emails. But inaction has its own consequences. You will pay for more review hours to put sharp eyes on those dull documents. You will also grind down your reviewers’ attention, increasing the chance they will make a mistake while their eyes are … glazed over.